Q-SFT: Q-Learning for Language Models via Supervised Fine-Tuning

J Hong, A Dragan, S Levine - arXiv preprint arXiv:2411.05193, 2024 - arxiv.org
Value-based reinforcement learning (RL) can in principle learn effective policies for a wide
range of multi-turn problems, from games to dialogue to robotic control, including via offline …

TrajDeleter: Enabling Trajectory Forgetting in Offline Reinforcement Learning Agents

C Gong, K Li, J Yao, T Wang - arXiv preprint arXiv:2404.12530, 2024 - arxiv.org
Reinforcement learning (RL) trains an agent from experiences interacting with the
environment. In scenarios where online interactions are impractical, offline RL, which trains …

Provably Adaptive Average Reward Reinforcement Learning for Metric Spaces

A Kar, R Singh - arXiv preprint arXiv:2410.19919, 2024 - arxiv.org
We study infinite-horizon average-reward reinforcement learning (RL) for Lipschitz MDPs
and develop an algorithm, ZoRL, that discretizes the state-action space adaptively and zooms …

Mildly Constrained Evaluation Policy for Offline Reinforcement Learning

L Xu, Z Jiang, J Wang, L Song, J Bian - arXiv preprint arXiv:2306.03680, 2023 - arxiv.org
Offline reinforcement learning (RL) methodologies enforce constraints on the policy to
adhere closely to the behavior policy, thereby stabilizing value learning and mitigating the …