Lyapunov-based safe policy optimization for continuous control

Y Chow, O Nachum, A Faust… - arXiv preprint arXiv:1901.10031, 2019 - arxiv.org
We study continuous action reinforcement learning problems in which it is crucial that the agent interacts with the environment only through safe policies, i.e., policies that do not take the agent to undesirable situations. We formulate these problems as constrained Markov decision processes (CMDPs) and present safe policy optimization algorithms that are based on a Lyapunov approach to solve them. Our algorithms can use any standard policy gradient (PG) method, such as deep deterministic policy gradient (DDPG) or proximal policy optimization (PPO), to train a neural network policy, while guaranteeing near-constraint satisfaction for every policy update by projecting either the policy parameter or the action onto the set of feasible solutions induced by the state-dependent linearized Lyapunov constraints. Compared to the existing constrained PG algorithms, ours are more data efficient as they are able to utilize both on-policy and off-policy data. Moreover, our action-projection algorithm often leads to less conservative policy updates and allows for natural integration into an end-to-end PG training pipeline. We evaluate our algorithms and compare them with the state-of-the-art baselines on several simulated (MuJoCo) tasks, as well as a real-world indoor robot navigation problem, demonstrating their effectiveness in terms of balancing performance and constraint satisfaction. Videos of the experiments can be found in the following link: https://drive.google.com/file/d/1pzuzFqWIE710bE2U6DmS59AfRzqK2Kek/view?usp=sharing.
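To illustrate the action-projection idea the abstract mentions, here is a minimal sketch, not the authors' implementation: with a single linearized Lyapunov constraint at a given state, the feasible action set is a half-space, so the Euclidean projection of the policy's proposed action has a closed form and can run as a safety layer after the policy network. The function name `project_action` and the symbols `g` (constraint gradient) and `eps` (the state-dependent slack) are illustrative assumptions.

```python
import numpy as np

def project_action(a, g, eps):
    """Project action `a` onto the half-space {a' : g . a' <= eps}
    induced by one linearized Lyapunov constraint (illustrative sketch).

    a:   proposed action from the unconstrained policy, shape (d,)
    g:   gradient of the linearized constraint at the current state, shape (d,)
    eps: state-dependent constraint budget (scalar)
    """
    violation = g @ a - eps
    if violation <= 0.0:
        # Proposed action already satisfies the constraint; pass it through.
        return a
    # Closed-form L2 projection onto the half-space boundary.
    return a - (violation / (g @ g)) * g
```

Because the projection is a simple affine correction, it is cheap per step and (away from the boundary switch) differentiable, which is what allows the end-to-end integration into a PG training pipeline that the abstract describes.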