Q-SFT: Q-Learning for Language Models via Supervised Fine-Tuning
Value-based reinforcement learning (RL) can in principle learn effective policies for a wide
range of multi-turn problems, from games to dialogue to robotic control, including via offline …
TrajDeleter: Enabling Trajectory Forgetting in Offline Reinforcement Learning Agents
Reinforcement learning (RL) trains an agent from experiences interacting with the
environment. In scenarios where online interactions are impractical, offline RL, which trains …
Provably Adaptive Average Reward Reinforcement Learning for Metric Spaces
A Kar, R Singh - arXiv preprint arXiv:2410.19919, 2024 - arxiv.org
We study infinite-horizon average-reward reinforcement learning (RL) for Lipschitz MDPs
and develop an algorithm ZoRL that discretizes the state-action space adaptively and zooms …
Mildly Constrained Evaluation Policy for Offline Reinforcement Learning
Offline reinforcement learning (RL) methodologies enforce constraints on the policy to
adhere closely to the behavior policy, thereby stabilizing value learning and mitigating the …