How to fine-tune the model: Unified model shift and model bias policy optimization

H Zhang, H Yu, J Zhao, D Zhang… - Advances in …, 2024 - proceedings.neurips.cc
Designing and deriving effective model-based reinforcement learning (MBRL) algorithms
with a performance improvement guarantee is challenging, largely due to the high …

Query-policy misalignment in preference-based reinforcement learning

X Hu, J Li, X Zhan, QS Jia, YQ Zhang - arXiv preprint arXiv:2305.17400, 2023 - arxiv.org
Preference-based reinforcement learning (PbRL) provides a natural way to align RL agents'
behavior with human desired outcomes, but is often restrained by costly human feedback …
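
Concretely, PbRL typically replaces a hand-specified reward with a reward model fit to pairwise human preferences under a Bradley-Terry likelihood. A minimal PyTorch sketch of that standard preference loss (names and shapes are illustrative assumptions, not this paper's code):

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a state-action pair to a scalar reward estimate."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def preference_loss(reward_model, seg_a, seg_b, prefs):
    """Bradley-Terry cross-entropy on pairs of trajectory segments.

    seg_a, seg_b: (obs, act) tuples of shape (batch, T, obs_dim) / (batch, T, act_dim).
    prefs: float tensor of shape (batch,) in {0., 1.}; 1 means segment A preferred.
    """
    obs_a, act_a = seg_a
    obs_b, act_b = seg_b
    # Sum predicted per-step rewards over each segment.
    ret_a = reward_model(obs_a, act_a).sum(dim=-1)
    ret_b = reward_model(obs_b, act_b).sum(dim=-1)
    # P(A preferred over B) under the Bradley-Terry model is sigmoid(ret_a - ret_b).
    return nn.functional.binary_cross_entropy_with_logits(ret_a - ret_b, prefs)
```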

COPlanner: Plan to Roll Out Conservatively but to Explore Optimistically for Model-Based RL

X Wang, R Zheng, Y Sun, R Jia, W Wongkamjan… - arXiv preprint arXiv …, 2023 - arxiv.org
Dyna-style model-based reinforcement learning contains two phases: model rollouts to
generate samples for policy learning and real-environment exploration using the current policy …
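
The two phases named here reduce to a compact loop: collect real transitions with the current policy, refit the dynamics model, then roll the model out for short horizons to generate samples for the policy update. A minimal Python sketch, assuming generic env/model/policy/buffer interfaces (none of it COPlanner's actual code):

```python
def dyna_training_loop(env, model, policy, real_buffer, model_buffer,
                       n_iters=1000, rollout_horizon=5, rollouts_per_iter=400):
    """Schematic Dyna-style MBRL loop; all component interfaces are assumptions."""
    obs, _ = env.reset()
    for _ in range(n_iters):
        # Phase 1: real-environment exploration with the current policy.
        action = policy.act(obs)
        next_obs, reward, terminated, truncated, _ = env.step(action)
        real_buffer.add(obs, action, reward, next_obs, terminated)
        obs = env.reset()[0] if (terminated or truncated) else next_obs

        # Refit the learned dynamics model on real transitions.
        model.fit(real_buffer)

        # Phase 2: short model rollouts generate imagined samples for the policy.
        for s in real_buffer.sample_states(rollouts_per_iter):
            for _ in range(rollout_horizon):
                a = policy.act(s)
                s_next, r, done = model.predict(s, a)  # imagined transition
                model_buffer.add(s, a, r, s_next, done)
                if done:
                    break
                s = s_next

        # Policy update consumes the imagined transitions.
        policy.update(model_buffer)
```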

Understanding world models through multi-step pruning policy via reinforcement learning

Z He, W Qiu, W Zhao, X Shao, Z Liu - Information Sciences, 2025 - Elsevier
In model-based reinforcement learning, the conventional approach to addressing world
model bias is to use gradient optimization methods. However, using a singular policy from …

Learning policy-aware models for model-based reinforcement learning via transition occupancy matching

YJ Ma, K Sivakumar, J Yan, O Bastani… - … for Dynamics and …, 2023 - proceedings.mlr.press
Standard model-based reinforcement learning (MBRL) approaches fit a transition model of
the environment to all past experience, but this wastes model capacity on data that is …

A Unified View on Solving Objective Mismatch in Model-Based Reinforcement Learning

R Wei, N Lambert, A McDonald, A Garcia… - arXiv preprint arXiv …, 2023 - arxiv.org
Model-based Reinforcement Learning (MBRL) aims to make agents more sample-efficient,
adaptive, and explainable by learning an explicit model of the environment. While the …

The primacy bias in Model-based RL

Z Qiao, J Lyu, X Li - arXiv preprint arXiv:2310.15017, 2023 - arxiv.org
The primacy bias in deep reinforcement learning (DRL), which refers to the agent's tendency
to overfit early data and lose the ability to learn from new data, can significantly decrease the …
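
A common countermeasure in the primacy-bias literature (e.g., Nikishin et al., 2022, in model-free DRL) is to periodically re-initialize the agent's networks while keeping the replay buffer intact. A minimal sketch of that generic remedy, not necessarily the method proposed here:

```python
import torch.nn as nn

def maybe_reset(agent_net: nn.Module, step: int, reset_every: int = 200_000):
    """Periodically re-initialize network parameters while the replay buffer
    is kept, so the agent re-learns from all stored data instead of staying
    anchored to its earliest experience."""
    if step > 0 and step % reset_every == 0:
        for module in agent_net.modules():
            if hasattr(module, "reset_parameters"):
                module.reset_parameters()  # fresh init, same architecture
```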

Theoretically Guaranteed Policy Improvement Distilled from Model-Based Planning

C Li, R Jia, J Liu, Y Zhang, Y Niu, Y Yang, Y Liu… - ECAI 2023, 2023 - ebooks.iospress.nl
Model-based reinforcement learning (RL) has demonstrated remarkable successes
on a range of continuous control tasks due to its high sample efficiency. To save the …

Multi-Stage Balanced Distillation: Addressing Long-Tail Challenges in Sequence-Level Knowledge Distillation

Y Zhou, J Zhu, P Xu, X Liu, X Wang, D Koutra… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) have significantly advanced various natural language
processing tasks, but deploying them remains computationally expensive. Knowledge …

Towards Self-Improvement of LLMs via MCTS: Leveraging Stepwise Knowledge with Curriculum Preference Learning

X Wang, L Song, Y Tian, D Yu, B Peng, H Mi… - arXiv preprint arXiv …, 2024 - arxiv.org
Monte Carlo Tree Search (MCTS) has recently emerged as a powerful technique for
enhancing the reasoning capabilities of LLMs. Techniques such as SFT or DPO have …
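
For context on the snippet's acronyms: DPO (Direct Preference Optimization) trains a policy on preference pairs by maximizing its log-probability margin over a frozen reference model. A minimal sketch of the standard DPO loss (tensor names are illustrative, not this paper's code):

```python
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss on a batch of preference pairs.

    logp_*: summed log-probs of the chosen (w) / rejected (l) responses
    under the policy being trained; ref_logp_*: the same under the frozen
    reference model. All tensors have shape (batch,).
    """
    # Implicit reward margin: how much more the policy (relative to the
    # reference) prefers the chosen response over the rejected one.
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()
```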