Learning multi-agent cooperation via considering actions of teammates
IEEE Transactions on Neural Networks and Learning Systems, 2023
Recently, value-based centralized training with decentralized execution (CTDE) multi-agent reinforcement learning (MARL) methods have achieved excellent performance in cooperative tasks. However, the most representative of these methods, Q-network MIXing (QMIX), restricts the joint action value to be a monotonic mixing of each agent's utility. Furthermore, current methods cannot generalize to unseen environments or different agent configurations, a setting known as ad hoc team play. In this work, we propose a novel value decomposition that considers both the return of an agent acting on its own and the return of cooperating with other observable agents, in order to address the nonmonotonicity problem. Based on this decomposition, we propose a greedy action searching method that improves exploration and is unaffected by changes in the set of observable agents or in the order of agents' actions; in this way, our method can adapt to the ad hoc team play setting. Furthermore, we utilize an auxiliary loss related to environmental cognition consistency and a modified prioritized experience replay (PER) buffer to assist training. Extensive experimental results show that our method achieves significant performance improvements in both challenging monotonic and nonmonotonic domains, and can handle the ad hoc team play setting perfectly.
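For reference, the restriction the abstract attributes to QMIX is the standard monotonicity constraint on the mixing of per-agent utilities. The sketch below uses the conventional QMIX notation (per-agent utilities Q_i over action-observation histories τ_i, joint value Q_tot, global state s, and a mixing function f_mix), not this paper's own notation:

\[
Q_{\mathrm{tot}}(\boldsymbol{\tau}, \mathbf{u})
  = f_{\mathrm{mix}}\!\big(Q_1(\tau_1, u_1), \ldots, Q_n(\tau_n, u_n);\, s\big),
\qquad
\frac{\partial Q_{\mathrm{tot}}}{\partial Q_i} \ge 0 \;\; \text{for all } i .
\]

The constraint makes the joint greedy action recoverable from each agent's individual argmax, but it cannot represent payoff structures in which an agent's best action reverses depending on a teammate's choice; this is the nonmonotonicity problem that the proposed decomposition is intended to address.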