Reinforcement learning in healthcare: A survey
As a subfield of machine learning, reinforcement learning (RL) aims at optimizing decision
making by using interaction samples of an agent with its environment and the potentially …
Direct preference optimization: Your language model is secretly a reward model
While large-scale unsupervised language models (LMs) learn broad world knowledge and
some reasoning skills, achieving precise control of their behavior is difficult due to the …
A general theoretical paradigm to understand learning from human preferences
The prevalent deployment of learning from human preferences through reinforcement
learning from human feedback (RLHF) relies on two important approximations: the first assumes that pairwise …
Principled reinforcement learning with human feedback from pairwise or k-wise comparisons
We provide a theoretical framework for Reinforcement Learning with Human Feedback
(RLHF). We show that when the underlying true reward is linear, under both Bradley-Terry …
Nash learning from human feedback
Reinforcement learning from human feedback (RLHF) has emerged as the main paradigm
for aligning large language models (LLMs) with human preferences. Typically, RLHF …
Is RLHF more difficult than standard RL? A theoretical perspective
Reinforcement learning from human feedback (RLHF) learns from preference
signals, while standard Reinforcement Learning (RL) directly learns from reward signals …
A survey of preference-based reinforcement learning methods
Reinforcement learning (RL) techniques optimize the accumulated long-term reward of a
suitably chosen reward function. However, designing such a reward function often requires …
Human-in-the-loop: Provably efficient preference-based reinforcement learning with general function approximation
We study human-in-the-loop reinforcement learning (RL) with trajectory preferences, where
instead of receiving a numeric reward at each step, the RL agent only receives preferences …
Using human feedback to fine-tune diffusion models without any reward model
Using reinforcement learning with human feedback (RLHF) has shown significant promise in
fine-tuning diffusion models. Previous methods start by training a reward model that aligns …
A survey of reinforcement learning from human feedback
Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning
(RL) that learns from human feedback instead of relying on an engineered reward function …