A survey of reinforcement learning from human feedback
Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning
(RL) that learns from human feedback instead of relying on an engineered reward function …
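The survey's framing, a reward learned from human preferences rather than hand-engineered, is easy to make concrete. Below is a minimal, self-contained sketch (not from the survey) of fitting a linear reward model to pairwise preferences with the Bradley-Terry likelihood P(a preferred over b) = sigmoid(r(a) - r(b)); the toy features and data are illustrative only.

```python
# Minimal sketch: fit a scalar reward model to pairwise human preferences
# using the Bradley-Terry likelihood P(a > b) = sigmoid(r(a) - r(b)).
# Feature vectors and data are toy stand-ins, not any paper's setup.
import numpy as np

rng = np.random.default_rng(0)

def reward(w, x):
    return x @ w  # linear reward over response features

def fit_reward(pairs, dim, lr=0.1, steps=500):
    """pairs: list of (x_preferred, x_rejected) feature vectors."""
    w = np.zeros(dim)
    for _ in range(steps):
        grad = np.zeros(dim)
        for x_win, x_lose in pairs:
            p = 1.0 / (1.0 + np.exp(-(reward(w, x_win) - reward(w, x_lose))))
            grad += (1.0 - p) * (x_win - x_lose)  # gradient of log-likelihood
        w += lr * grad / len(pairs)
    return w

# Toy data: the "true" preference favors a large first feature.
xs = rng.normal(size=(100, 3))
true_w = np.array([2.0, 0.0, 0.0])
pairs = []
for i in range(0, 100, 2):
    a, b = xs[i], xs[i + 1]
    win, lose = (a, b) if reward(true_w, a) > reward(true_w, b) else (b, a)
    pairs.append((win, lose))

print("learned reward weights:", np.round(fit_reward(pairs, dim=3), 2))
```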
Self-exploring language models: Active preference elicitation for online alignment
Preference optimization, particularly through Reinforcement Learning from Human
Feedback (RLHF), has achieved significant success in aligning Large Language Models …
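A minimal sketch of the active-elicitation idea in this abstract: under the current reward model, query the human on the candidate pair whose predicted preference probability is closest to 0.5. This uncertainty heuristic is illustrative, not the paper's exact acquisition rule.

```python
# Sketch of active preference elicitation: among candidate response pairs,
# query the human on the pair whose predicted preference is most uncertain
# (probability closest to 0.5 under the current reward model).
import numpy as np

def preference_prob(w, x_a, x_b):
    return 1.0 / (1.0 + np.exp(-(x_a @ w - x_b @ w)))

def most_uncertain_pair(w, candidate_pairs):
    probs = [preference_prob(w, a, b) for a, b in candidate_pairs]
    return int(np.argmin([abs(p - 0.5) for p in probs]))

rng = np.random.default_rng(1)
w = rng.normal(size=4)                      # current reward estimate
pairs = [(rng.normal(size=4), rng.normal(size=4)) for _ in range(20)]
print("query human about pair", most_uncertain_pair(w, pairs))
```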
A Tutorial on Multi-Armed Bandit Applications for Large Language Models
D Bouneffouf, R Féraud - Proceedings of the 30th ACM SIGKDD …, 2024 - dl.acm.org
This tutorial offers a comprehensive guide on using multi-armed bandit (MAB) algorithms to
improve Large Language Models (LLMs). As Natural Language Processing (NLP) tasks …
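For context, a compact sketch of UCB1, the kind of classic MAB algorithm such a tutorial covers, applied to picking among candidate prompts treated as arms. The simulated Bernoulli rewards stand in for real task scores.

```python
# Sketch of UCB1 over K candidate prompts (arms): play each arm once,
# then pick the arm with the highest mean reward plus confidence bonus.
import math, random

def ucb1(success_rates, rounds=2000):
    k = len(success_rates)
    counts, sums = [0] * k, [0.0] * k
    for t in range(1, rounds + 1):
        if t <= k:
            arm = t - 1                      # initialize: play each arm once
        else:
            arm = max(range(k), key=lambda i:
                      sums[i] / counts[i] + math.sqrt(2 * math.log(t) / counts[i]))
        sums[arm] += float(random.random() < success_rates[arm])
        counts[arm] += 1
    return counts

pulls = ucb1([0.3, 0.5, 0.7])               # arm 2 is the best "prompt"
print("pull counts per prompt:", pulls)      # pulls concentrate on arm 2
```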
Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF
Reinforcement learning from human feedback (RLHF) has emerged as a central tool for
language model alignment. We consider online exploration in RLHF, which exploits …
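A rough sketch of what an "exploratory" preference objective can look like: a standard DPO-style loss on labeled pairs plus an optimism bonus on freshly sampled responses. The bonus form and coefficient here are placeholders, not the paper's exact objective.

```python
# Sketch of the general shape of an exploratory preference objective:
# DPO loss on a labeled pair, minus an illustrative optimism term that
# lowers the loss when the policy keeps log-probability on fresh samples.
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))  # -log sigmoid(margin)

def exploratory_loss(logp_w, logp_l, ref_w, ref_l, logp_fresh, alpha=0.01):
    # Placeholder bonus: higher logp_fresh (novel response) -> lower loss.
    return dpo_loss(logp_w, logp_l, ref_w, ref_l) - alpha * logp_fresh

print(exploratory_loss(-5.0, -7.0, -6.0, -6.5, logp_fresh=-8.0))
```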
Prompt Optimization with Human Feedback
Large language models (LLMs) have demonstrated remarkable performance in various
tasks. However, the performance of LLMs heavily depends on the input prompt, which has …
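A toy sketch of the loop the abstract implies: mutate the current best prompt and keep whichever version the human prefers. The mutate and ask_human helpers are hypothetical stand-ins, not the paper's method.

```python
# Sketch of prompt optimization from pairwise human feedback: propose a
# variant of the current best prompt, ask which output is preferred, and
# keep the winner. Helpers below are hypothetical placeholders.
import random

def mutate(prompt):
    tweaks = [" Be concise.", " Think step by step.", " Cite sources."]
    return prompt + random.choice(tweaks)

def ask_human(prompt_a, prompt_b):
    # Stand-in for a real preference query; here the longer prompt "wins".
    return prompt_a if len(prompt_a) >= len(prompt_b) else prompt_b

best = "Summarize the article."
for _ in range(5):                            # budgeted feedback rounds
    best = ask_human(best, mutate(best))
print("selected prompt:", best)
```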
Can foundation models actively gather information in interactive environments to test hypotheses?
While problem solving is a standard evaluation task for foundation models, a crucial
component of problem solving--actively and strategically gathering information to test …
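One way to make "strategically gathering information to test hypotheses" concrete: pick the question that minimizes expected posterior entropy over a hypothesis set. The hypotheses, questions, and answer model below are toy stand-ins.

```python
# Sketch of strategic information gathering: choose the query whose answer
# maximally reduces entropy over a hypothesis set (greedy info gain).
import math

hypotheses = ["A", "B", "C", "D"]
prior = {h: 0.25 for h in hypotheses}
# answer_model[query][hypothesis] = probability the answer is "yes"
answer_model = {
    "is it A or B?": {"A": 1.0, "B": 1.0, "C": 0.0, "D": 0.0},
    "is it A?":      {"A": 1.0, "B": 0.0, "C": 0.0, "D": 0.0},
}

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def expected_entropy_after(query, prior):
    p_yes = sum(prior[h] * answer_model[query][h] for h in prior)
    out = 0.0
    for ans, p_ans in (("yes", p_yes), ("no", 1 - p_yes)):
        if p_ans == 0:
            continue
        post = {h: prior[h] * (answer_model[query][h] if ans == "yes"
                               else 1 - answer_model[query][h]) / p_ans
                for h in prior}
        out += p_ans * entropy(post)
    return out

best = min(answer_model, key=lambda q: expected_entropy_after(q, prior))
print("most informative question:", best)    # the balanced split wins
```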
Deep Bayesian Active Learning for Preference Modeling in Large Language Models
Leveraging human preferences for steering the behavior of Large Language Models (LLMs)
has demonstrated notable success in recent years. Nonetheless, data selection and labeling …
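A sketch of a BALD-style acquisition score, a standard deep Bayesian active-learning criterion: mutual information between the preference label and model parameters, approximated from an ensemble (for example, MC-dropout draws) of predicted preference probabilities. Toy numbers, not the paper's implementation.

```python
# BALD-style score: entropy of the mean prediction minus mean entropy of
# the per-draw predictions; high when posterior draws disagree.
import numpy as np

def binary_entropy(p, eps=1e-9):
    p = np.clip(p, eps, 1 - eps)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def bald_score(ensemble_probs):
    """ensemble_probs: array of P(response a preferred) per posterior draw."""
    return binary_entropy(ensemble_probs.mean()) - binary_entropy(ensemble_probs).mean()

confident = np.array([0.90, 0.88, 0.92])     # draws agree -> low score
disagreeing = np.array([0.10, 0.50, 0.90])   # draws disagree -> high score
print(bald_score(confident), bald_score(disagreeing))
```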
Sample-Efficient Alignment for LLMs
We study methods for efficiently aligning large language models (LLMs) with human
preferences given budgeted online feedback. We first formulate the LLM alignment problem …
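For a budgeted-feedback setting like the one this abstract describes, Thompson sampling is a classic sample-efficient baseline; the sketch below is generic and not necessarily the paper's algorithm.

```python
# Sketch of Thompson sampling under a fixed feedback budget: keep a Beta
# posterior per candidate (e.g., policy variant), sample from each, and
# spend one human query per round on the sampled best.
import random

def thompson(true_rates, budget=300):
    k = len(true_rates)
    alpha, beta = [1] * k, [1] * k            # Beta(1, 1) priors
    for _ in range(budget):                    # one human label per round
        samples = [random.betavariate(alpha[i], beta[i]) for i in range(k)]
        arm = samples.index(max(samples))
        if random.random() < true_rates[arm]:
            alpha[arm] += 1                    # positive feedback
        else:
            beta[arm] += 1                     # negative feedback
    return [a / (a + b) for a, b in zip(alpha, beta)]

print([round(p, 2) for p in thompson([0.4, 0.6, 0.8])])
```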
Temporal-Difference Variational Continual Learning
A crucial capability of Machine Learning models in real-world applications is the ability to
continuously learn new tasks. This adaptability allows them to respond to potentially …
Exploration Unbound
A sequential decision-making agent balances between exploring to gain new knowledge
about an environment and exploiting current knowledge to maximize immediate reward. For …
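The trade-off stated in this abstract, in its simplest runnable form: an epsilon-greedy agent that explores a random arm with probability eps and otherwise exploits the empirically best one. Toy Bernoulli arms only.

```python
# Epsilon-greedy: with probability eps pick a random arm (explore),
# otherwise pick the arm with the best empirical mean (exploit).
import random

def epsilon_greedy(true_rates, eps=0.1, rounds=2000):
    k = len(true_rates)
    counts, sums = [0] * k, [0.0] * k
    for _ in range(rounds):
        if random.random() < eps or 0 in counts:
            arm = random.randrange(k)                               # explore
        else:
            arm = max(range(k), key=lambda i: sums[i] / counts[i])  # exploit
        sums[arm] += float(random.random() < true_rates[arm])
        counts[arm] += 1
    return counts

print(epsilon_greedy([0.2, 0.5, 0.8]))        # most pulls on the best arm
```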