Cooperative Multi-Agent Reinforcement Learning (MARL) focuses on developing strategies for effectively training multiple agents to learn and adapt policies collaboratively. Although it is a relatively young area of study, most MARL methods build on well-established approaches from single-agent deep reinforcement learning because of their proven effectiveness. In this paper, we focus on the exploration problem inherent to many MARL algorithms. These algorithms frequently introduce new hyperparameters and auxiliary components, such as additional models, which complicates adapting the underlying RL algorithm to multi-agent settings. We aim to improve a deep MARL algorithm, the well-known QMIX, with minimal modifications. Our investigation of the exploration-exploitation dilemma shows that the performance of state-of-the-art MARL algorithms can be matched simply by adjusting the $\epsilon$-greedy policy, with the adjustment depending on the ratio of available joint actions to the number of agents. In addition, we modify the replay buffer so that training samples are recurrent rollouts rather than whole episodes, which decorrelates the experiences used for learning. The resulting algorithm is easy to implement and matches state-of-the-art methods without adding significant complexity.
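For illustration, below is a minimal Python sketch of an $\epsilon$-greedy schedule whose exploration level is tied to the ratio of the joint action space to the number of agents. The exact formula from the paper is not reproduced here; the function `joint_action_ratio`, the logarithmic scaling of the final $\epsilon$, and the hyperparameter names are assumptions made purely for this example.

```python
import numpy as np


def joint_action_ratio(n_actions: int, n_agents: int) -> float:
    # Hypothetical scaling quantity: size of the joint action space
    # (n_actions ** n_agents) relative to the number of agents.
    # The dependence actually used in the paper may differ.
    return (n_actions ** n_agents) / n_agents


class EpsilonGreedySchedule:
    """Linearly annealed epsilon-greedy schedule whose final exploration
    rate is adjusted by the joint-action-to-agents ratio (illustrative only)."""

    def __init__(self, eps_start=1.0, eps_final=0.05, anneal_steps=50_000,
                 n_actions=5, n_agents=3):
        ratio = joint_action_ratio(n_actions, n_agents)
        # Example heuristic (not the paper's formula): keep a higher exploration
        # floor when the joint action space is large relative to the agent count.
        self.eps_start = eps_start
        self.eps_final = min(eps_start, eps_final * np.log1p(ratio))
        self.anneal_steps = anneal_steps

    def value(self, step: int) -> float:
        # Linear anneal from eps_start to the (scaled) eps_final.
        frac = min(step / self.anneal_steps, 1.0)
        return self.eps_start + frac * (self.eps_final - self.eps_start)

    def select_action(self, q_values: np.ndarray, step: int) -> int:
        # Standard epsilon-greedy selection over one agent's Q-values.
        if np.random.rand() < self.value(step):
            return np.random.randint(len(q_values))
        return int(np.argmax(q_values))
```

In a QMIX-style training loop, each agent would use such a schedule for its decentralized action selection; the replay-buffer change mentioned in the abstract would, roughly speaking, store and sample fixed-length recurrent rollouts instead of complete episodes.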
Download the paper on OpenReview (PDF, in English): https://openreview.net/pdf?id=RzoxFLA966
Anatoly Borzilov, Alexey Skrynnik, Aleksandr Panov. Rethinking Exploration and Experience Exploitation in Value-Based Multi-Agent Reinforcement Learning // The First International Conference on Computational Optimization, ICOMP 2024 (Innopolis, Russia, October 10–12, 2024).