A survey of reinforcement learning from human feedback

T Kaufmann, P Weng, V Bengs… - arXiv preprint arXiv …, 2023 - arxiv.org
Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning
(RL) that learns from human feedback instead of relying on an engineered reward function …

Self-exploring language models: Active preference elicitation for online alignment

S Zhang, D Yu, H Sharma, H Zhong, Z Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Preference optimization, particularly through Reinforcement Learning from Human
Feedback (RLHF), has achieved significant success in aligning Large Language Models …

A Tutorial on Multi-Armed Bandit Applications for Large Language Models

D Bouneffouf, R Féraud - Proceedings of the 30th ACM SIGKDD …, 2024 - dl.acm.org
This tutorial offers a comprehensive guide on using multi-armed bandit (MAB) algorithms to
improve Large Language Models (LLMs). As Natural Language Processing (NLP) tasks …

Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF

T Xie, DJ Foster, A Krishnamurthy, C Rosset… - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement learning from human feedback (RLHF) has emerged as a central tool for
language model alignment. We consider online exploration in RLHF, which exploits …

Prompt Optimization with Human Feedback

X Lin, Z Dai, A Verma, SK Ng, P Jaillet… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) have demonstrated remarkable performance in various
tasks. However, the performance of LLMs heavily depends on the input prompt, which has …

Can foundation models actively gather information in interactive environments to test hypotheses?

NR Ke, DP Sawyer, H Soyer, M Engelcke… - arXiv preprint arXiv …, 2024 - arxiv.org
While problem solving is a standard evaluation task for foundation models, a crucial
component of problem solving, actively and strategically gathering information to test …

Deep Bayesian Active Learning for Preference Modeling in Large Language Models

LC Melo, P Tigas, A Abate, Y Gal - arXiv preprint arXiv:2406.10023, 2024 - arxiv.org
Leveraging human preferences for steering the behavior of Large Language Models (LLMs)
has demonstrated notable success in recent years. Nonetheless, data selection and labeling …

Sample-Efficient Alignment for LLMs

Z Liu, C Chen, C Du, WS Lee, M Lin - arXiv preprint arXiv:2411.01493, 2024 - arxiv.org
We study methods for efficiently aligning large language models (LLMs) with human
preferences given budgeted online feedback. We first formulate the LLM alignment problem …

Temporal-Difference Variational Continual Learning

LC Melo, A Abate, Y Gal - arXiv preprint arXiv:2410.07812, 2024 - arxiv.org
A crucial capability of Machine Learning models in real-world applications is the ability to
continuously learn new tasks. This adaptability allows them to respond to potentially …

Exploration Unbound

D Arumugam, W Xu, B Van Roy - arXiv preprint arXiv:2407.12178, 2024 - arxiv.org
A sequential decision-making agent balances between exploring to gain new knowledge
about an environment and exploiting current knowledge to maximize immediate reward. For …