Dueling RL: Reinforcement Learning with Trajectory Preferences
We consider the problem of preference-based reinforcement learning (PbRL), where, unlike
traditional reinforcement learning, an agent receives feedback only in terms of a 1-bit (0/1)
preference over a trajectory pair instead of absolute rewards for the trajectories. The success of the
traditional RL framework crucially relies on the underlying agent-reward model, which,
however, depends on how accurately a system designer can express an appropriate reward
function, which is often a non-trivial task. The main novelty of our framework is the ability to learn …
Dueling RL: Reinforcement Learning with Trajectory Preferences
We consider the problem of preference-based reinforcement learning (PbRL), where, unlike
traditional reinforcement learning (RL), an agent receives feedback only in terms of a 1-bit
(0/1) preference over a trajectory pair instead of absolute rewards for it. The success of the
traditional reward-based RL framework crucially depends on how accurately a system
designer can express an appropriate reward function, which is often a non-trivial task. The
main novelty of our framework is the ability to learn from preference-based trajectory …