Most modern stochastic optimization methods assume that the data samples are independent and identically distributed. However, this assumption is often violated in the reinforcement learning setting, which deals with temporally dependent data coming from a Markov decision process (MDP). Furthermore, to learn reinforcement learning policies, most algorithms require some knowledge of the MDP's mixing time or its asymptotic behaviour. For MDPs with high-dimensional state spaces or sparse rewards, the mixing time cannot be estimated exactly and may even be unknown, rendering most methods inapplicable. Fortunately, the multi-level Monte Carlo (MLMC) approach, which accounts for the nature of Markov chains and allows the variance of the updates to be controlled, has recently been popularised in the field. This technique enables the design of reinforcement learning algorithms that do not rely on oracle knowledge of the mixing time or on any assumptions regarding its rate of decay. In light of these considerations, we propose MAdam, an algorithm that extends classical Adam to average-reward reinforcement learning. The method builds on non-convex optimization and does not require knowledge of the mixing time. We also provide a theoretical analysis of the optimization procedure and conduct experiments on challenging environments, demonstrating the strong performance of our approach.
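To make the MLMC idea mentioned in the abstract concrete, below is a minimal Python sketch of one common form of such an estimator for Markovian data: a random level is drawn, the chain is rolled forward for a geometrically distributed number of steps, and a telescoping correction recovers a long-trajectory gradient in expectation without knowing the mixing time. The function names (`sample_grad`, `advance_chain`) and parameters are illustrative assumptions, not the paper's actual MAdam implementation.

```python
import numpy as np


def mlmc_gradient(sample_grad, advance_chain, state, j_max=10, rng=None):
    """Illustrative multi-level Monte Carlo (MLMC) gradient estimator for
    Markovian data. All names and signatures are assumptions for this sketch."""
    rng = np.random.default_rng() if rng is None else rng

    # Random level J with P(J = j) = 2^{-j}; long rollouts are exponentially
    # rare, which keeps the variance of the estimator under control.
    j = int(rng.geometric(0.5))
    n = 2 ** j if j <= j_max else 1   # truncate overly long rollouts

    # Roll the Markov chain forward, collecting one gradient per visited state.
    grads = []
    for _ in range(n):
        grads.append(np.asarray(sample_grad(state)))
        state = advance_chain(state)

    if n == 1:
        return grads[0], state

    g0 = grads[0]                          # level-0 (single-sample) estimate
    g_half = np.mean(grads[: n // 2], 0)   # mean over the first 2^{J-1} samples
    g_full = np.mean(grads, 0)             # mean over all 2^J samples

    # Telescoping correction: in expectation this matches an average over a
    # long trajectory, without requiring the mixing time to be known.
    return g0 + n * (g_full - g_half), state
```

In an MAdam-style training loop, an estimate of this kind would presumably replace the i.i.d. minibatch gradient that is normally fed into Adam's moment updates; the details above are only a sketch of the general technique.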
Download the paper from OpenReview (PDF, in English): https://openreview.net/pdf?id=0HG6EQ0qBP
Alexander Chernyavskiy, Andrey Veprikov, Vladimir Solodkin, Aleksandr Beznosikov, Aleksandr Panov. Rethinking Exploration and Experience Exploitation in Value-Based Multi-Agent Reinforcement Learning // First International Conference on Computational Optimization (ICOMP 2024), Innopolis, Russia, October 10–12, 2024.