A survey of reinforcement learning from human feedback

T Kaufmann, P Weng, V Bengs… - arXiv preprint arXiv …, 2023 - arxiv.org
Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning
(RL) that learns from human feedback instead of relying on an engineered reward function …

Self-exploring language models: Active preference elicitation for online alignment

S Zhang, D Yu, H Sharma, H Zhong, Z Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Preference optimization, particularly through Reinforcement Learning from Human
Feedback (RLHF), has achieved significant success in aligning Large Language Models …

A Tutorial on Multi-Armed Bandit Applications for Large Language Models

D Bouneffouf, R Féraud - Proceedings of the 30th ACM SIGKDD …, 2024 - dl.acm.org
This tutorial offers a comprehensive guide on using multi-armed bandit (MAB) algorithms to
improve Large Language Models (LLMs). As Natural Language Processing (NLP) tasks …

Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF

T Xie, DJ Foster, A Krishnamurthy, C Rosset… - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement learning from human feedback (RLHF) has emerged as a central tool for
language model alignment. We consider online exploration in RLHF, which exploits …

Prompt Optimization with Human Feedback

X Lin, Z Dai, A Verma, SK Ng, P Jaillet… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) have demonstrated remarkable performance in various
tasks. However, the performance of LLMs heavily depends on the input prompt, which has …

Can foundation models actively gather information in interactive environments to test hypotheses?

NR Ke, DP Sawyer, H Soyer, M Engelcke… - arXiv preprint arXiv …, 2024 - arxiv.org
While problem solving is a standard evaluation task for foundation models, a crucial
component of problem solving, actively and strategically gathering information to test …

Deep Bayesian Active Learning for Preference Modeling in Large Language Models

LC Melo, P Tigas, A Abate, Y Gal - arXiv preprint arXiv:2406.10023, 2024 - arxiv.org
Leveraging human preferences for steering the behavior of Large Language Models (LLMs)
has demonstrated notable success in recent years. Nonetheless, data selection and labeling …

Sample-Efficient Alignment for LLMs

Z Liu, C Chen, C Du, WS Lee, M Lin - arXiv preprint arXiv:2411.01493, 2024 - arxiv.org
We study methods for efficiently aligning large language models (LLMs) with human
preferences given budgeted online feedback. We first formulate the LLM alignment problem …

Temporal-Difference Variational Continual Learning

LC Melo, A Abate, Y Gal - arXiv preprint arXiv:2410.07812, 2024 - arxiv.org
A crucial capability of Machine Learning models in real-world applications is the ability to
continuously learn new tasks. This adaptability allows them to respond to potentially …

Exploration Unbound

D Arumugam, W Xu, B Van Roy - arXiv preprint arXiv:2407.12178, 2024 - arxiv.org
A sequential decision-making agent balances between exploring to gain new knowledge
about an environment and exploiting current knowledge to maximize immediate reward. For …