Inverse preference learning: Preference-based RL without a reward function
Reward functions are difficult to design and often hard to align with human intent. Preference-
based Reinforcement Learning (RL) algorithms address these problems by learning reward …
ChiPFormer: Transferable chip placement via offline decision transformer
Placement is a critical step in modern chip design, aiming to determine the positions of
circuit modules on the chip canvas. Recent works have shown that reinforcement learning …
Contrastive preference learning: Learning from human feedback without RL
Reinforcement Learning from Human Feedback (RLHF) has emerged as a popular
paradigm for aligning models with human intent. Typically RLHF algorithms operate in two …
A survey of reinforcement learning from human feedback
Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning
(RL) that learns from human feedback instead of relying on an engineered reward function …
CEIL: Generalized contextual imitation learning
In this paper, we present ContExtual Imitation Learning (CEIL), a general and broadly
applicable algorithm for imitation learning (IL). Inspired by the formulation of hindsight …
Design from policies: Conservative test-time adaptation for offline policy optimization
In this work, we decouple the iterative bi-level offline RL (value estimation and policy
extraction) from the offline training phase, forming a non-iterative bi-level paradigm and …
Direct preference-based policy optimization without reward modeling
Preference-based reinforcement learning (PbRL) is an approach that enables RL agents to
learn from preferences, which is particularly useful when formulating a reward function is …
Beyond OOD state actions: Supported cross-domain offline reinforcement learning
Offline reinforcement learning (RL) aims to learn a policy using only pre-collected and fixed
data. Although it avoids the time-consuming online interactions of RL, it poses challenges …
CLUE: Calibrated latent guidance for offline reinforcement learning
Offline reinforcement learning (RL) aims to learn an optimal policy from pre-collected and
labeled datasets, which eliminates the time-consuming data collection in online RL …
Flow to better: Offline preference-based reinforcement learning via preferred trajectory generation
Offline preference-based reinforcement learning (PbRL) offers an effective solution to
overcome the challenges associated with designing rewards and the high costs of online …