Inverse preference learning: Preference-based RL without a reward function
Reward functions are difficult to design and often hard to align with human intent. Preference-
based Reinforcement Learning (RL) algorithms address these problems by learning reward …
ChiPFormer: Transferable chip placement via offline decision transformer
Placement is a critical step in modern chip design, aiming to determine the positions of
circuit modules on the chip canvas. Recent works have shown that reinforcement learning …
Contrastive preference learning: Learning from human feedback without RL
Reinforcement Learning from Human Feedback (RLHF) has emerged as a popular
paradigm for aligning models with human intent. Typically RLHF algorithms operate in two …
A survey of reinforcement learning from human feedback
Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning
(RL) that learns from human feedback instead of relying on an engineered reward function …
CEIL: Generalized contextual imitation learning
In this paper, we present ContExtual Imitation Learning (CEIL), a general and broadly
applicable algorithm for imitation learning (IL). Inspired by the formulation of hindsight …
Design from policies: Conservative test-time adaptation for offline policy optimization
In this work, we decouple the iterative bi-level offline RL (value estimation and policy
extraction) from the offline training phase, forming a non-iterative bi-level paradigm and …
Direct preference-based policy optimization without reward modeling
Preference-based reinforcement learning (PbRL) is an approach that enables RL agents to
learn from preferences, which is particularly useful when formulating a reward function is …
Beyond OOD state actions: Supported cross-domain offline reinforcement learning
Offline reinforcement learning (RL) aims to learn a policy using only pre-collected and fixed
data. Although it avoids the time-consuming online interactions of RL, it poses challenges …
CLUE: Calibrated latent guidance for offline reinforcement learning
Offline reinforcement learning (RL) aims to learn an optimal policy from pre-collected and
labeled datasets, which eliminates the time-consuming data collection in online RL …
Flow to better: Offline preference-based reinforcement learning via preferred trajectory generation
Offline preference-based reinforcement learning (PbRL) offers an effective solution to
overcome the challenges associated with designing rewards and the high costs of online …