Policy optimization with demonstrations

B Kang, Z Jie, J Feng - International conference on machine …, 2018 - proceedings.mlr.press
Policy Optimization from Demonstration (POfD) method, which can acquire knowledge from
demonstration … 1) We reformulate the policy optimization objective by adding a demonstration

Hybrid policy optimization from imperfect demonstrations

H Yang, C Yu, S Chen - Advances in Neural Information …, 2024 - proceedings.neurips.cc
… HYbrid Policy Optimization (HYPO), which uses a small number of imperfect demonstrations
… The key idea is to train an offline guider policy using imitation learning in order to instruct an …

Guided exploration with proximal policy optimization using a single demonstration

G Libardi, G De Fabritiis… - … Conference on Machine …, 2021 - proceedings.mlr.press
… the demonstrations … the policy only specifies a distribution over the action space. We force
the actions of the policy to equal the demonstration actions instead of sampling from the policy
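The snippet above describes forcing the executed action to equal the demonstration action rather than sampling from the policy's distribution. A minimal sketch of that idea (all names are hypothetical, not the authors' code): when forcing is active, the demonstration action overrides the sample drawn from the policy's categorical distribution.

```python
import random

def select_action(policy_probs, demo_action, force_demo):
    """Pick an action: either force the demonstration action or sample from the policy.

    policy_probs: per-action probabilities (the distribution the policy specifies).
    demo_action:  action index recorded in the demonstration.
    force_demo:   when True, execute the demonstration action instead of sampling.
    """
    if force_demo:
        return demo_action
    # Otherwise sample from the categorical distribution defined by the policy.
    return random.choices(range(len(policy_probs)), weights=policy_probs)[0]

# Forced: the demonstration action is returned regardless of the distribution.
print(select_action([0.1, 0.7, 0.2], demo_action=0, force_demo=True))   # 0
# Unforced: the action is sampled from the policy's distribution.
print(select_action([0.0, 1.0, 0.0], demo_action=0, force_demo=False))  # 1
```

In practice such forcing is typically applied only for the demonstrated prefix of an episode, after which control returns to sampling from the policy.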

Guarded policy optimization with imperfect online demonstrations

Z Xue, Z Peng, Q Li, Z Liu, B Zhou - arXiv preprint arXiv:2303.01728, 2023 - arxiv.org
… In this work we develop a new guarded policy optimization … In contrast to the offline learning
from demonstrations, in this work we focus on the online deployment of teacher policies with …

Policy Optimization with Smooth Guidance Rewards Learned from Sparse-Reward Demonstrations

G Wang, F Wu, X Zhang, T Chen - arXiv preprint arXiv:2401.00162, 2023 - arxiv.org
Policy Optimization with Smooth Guidance (POSG) that leverages a small set of sparse-reward
demonstrations to … be indirectly estimated using offline demonstrations rather than directly …

Iterative Regularized Policy Optimization with Imperfect Demonstrations

X Gong, D Feng, K Xu, Y Zhai, C Yao… - … on Machine Learning, 2024 - openreview.net
… introduce the Iterative Regularized Policy Optimization (IRPO) method … demonstration
boosting to enhance demonstration quality. Specifically, iterative training capitalizes on the policy

Learning from demonstration: Provably efficient adversarial policy imitation with linear function approximation

Z Liu, Y Zhang, Z Fu, Z Yang… - … conference on machine …, 2022 - proceedings.mlr.press
… 2021) analyze the constrained pessimistic policy optimization with general function
approximation and with the partial coverage assumption of the dataset, then they specialize the case …

Learning from limited demonstrations

B Kim, A Farahmand, J Pineau… - Advances in Neural …, 2013 - proceedings.neurips.cc
… constraints which guide the optimization performed by Approximate Policy Iteration. We …
expert demonstrations. In all cases, we compare our algorithm with Least-Squares Policy Iteration (…

Diffusion policy policy optimization

AZ Ren, J Lidard, LL Ankile, A Simeonov… - arXiv preprint arXiv …, 2024 - arxiv.org
… On the contrary, we show that for a Diffusion Policy pre-trained from expert
demonstrations, our methodology for fine-tuning via PG updates yields robust, high-performing …

Generalize robot learning from demonstration to variant scenarios with evolutionary policy gradient

J Cao, W Liu, Y Liu, J Yang - Frontiers in Neurorobotics, 2020 - frontiersin.org
… Our EPG algorithm is derived and approximated from the optimization of perturbed policies
with policy gradient methods, by adding some heuristic of EAs. As EAs differ in how to …