Policy optimization with demonstrations

B Kang, Z Jie, J Feng - International conference on machine …, 2018 - proceedings.mlr.press
Policy Optimization from Demonstration (POfD) method, which can acquire knowledge from
demonstration … 1) We reformulate the policy optimization objective by adding a demonstration

Hybrid policy optimization from imperfect demonstrations

H Yang, C Yu, S Chen - Advances in Neural Information …, 2024 - proceedings.neurips.cc
… HYbrid Policy Optimization (HYPO), which uses a small number of imperfect demonstrations
… The key idea is to train an offline guider policy using imitation learning in order to instruct an …

Guarded policy optimization with imperfect online demonstrations

Z Xue, Z Peng, Q Li, Z Liu, B Zhou - arXiv preprint arXiv:2303.01728, 2023 - arxiv.org
… In this work we develop a new guarded policy optimization … In contrast to the offline learning
from demonstrations, in this work we focus on the online deployment of teacher policies with …

Guided exploration with proximal policy optimization using a single demonstration

G Libardi, G De Fabritiis… - … Conference on Machine …, 2021 - proceedings.mlr.press
… the demonstrations … the policy only specifies a distribution over the action space. We force
the actions of the policy to equal the demonstration actions instead of sampling from the policy

Policy Optimization with Smooth Guidance Rewards Learned from Sparse-Reward Demonstrations

G Wang, F Wu, X Zhang, T Chen - arXiv preprint arXiv:2401.00162, 2023 - arxiv.org
Policy Optimization with Smooth Guidance (POSG) that leverages a small set of sparse-reward
demonstrations to … be indirectly estimated using offline demonstrations rather than directly …

Iterative Regularized Policy Optimization with Imperfect Demonstrations

X Gong, D Feng, K Xu, Y Zhai, C Yao… - … on Machine Learning, 2024 - openreview.net
… introduce the Iterative Regularized Policy Optimization (IRPO) method … demonstration
boosting to enhance demonstration quality. Specifically, iterative training capitalizes on the policy

Learning from demonstration: Provably efficient adversarial policy imitation with linear function approximation

Z Liu, Y Zhang, Z Fu, Z Yang… - … conference on machine …, 2022 - proceedings.mlr.press
… 2021) analyze the constrained pessimistic policy optimization with general function
approximation and with the partial coverage assumption of the dataset, then they specialize the case …

Learning from limited demonstrations

B Kim, A Farahmand, J Pineau… - Advances in Neural …, 2013 - proceedings.neurips.cc
… constraints which guide the optimization performed by Approximate Policy Iteration. We …
expert demonstrations. In all cases, we compare our algorithm with Least-Squares Policy Iteration (…

Generalize robot learning from demonstration to variant scenarios with evolutionary policy gradient

J Cao, W Liu, Y Liu, J Yang - Frontiers in Neurorobotics, 2020 - frontiersin.org
… Our EPG algorithm is derived and approximated from the optimization of perturbed policies
with policy gradient methods, by adding some heuristic of EAs. As EAs differ in how to …

Safe driving via expert guided policy optimization

Z Peng, Q Li, C Liu, B Zhou - Conference on Robot Learning, 2022 - proceedings.mlr.press
demonstration as two training sources. Following such a setting, we develop a novel Expert
Guided Policy Optimization (… of an expert policy to generate demonstration and a switch …