Dense reward for free in reinforcement learning from human feedback

AJ Chan, H Sun, S Holt, M van der Schaar - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement Learning from Human Feedback (RLHF) has been credited as the key
advance that has allowed Large Language Models (LLMs) to effectively follow instructions …

Asynchronous proportional response dynamics: convergence in markets with adversarial scheduling

Y Kolumbus, M Levy, N Nisan - Advances in Neural …, 2023 - proceedings.neurips.cc
We study Proportional Response Dynamics (PRD) in linear Fisher markets, where
participants act asynchronously. We model this scenario as a sequential process in which at …

Doubly optimal no-regret online learning in strongly monotone games with bandit feedback

W Ba, T Lin, J Zhang, Z Zhou - arXiv preprint arXiv:2112.02856, 2021 - arxiv.org
We consider online no-regret learning in unknown games with bandit feedback, where each
player can only observe its reward at each time, determined by all players' current joint …

Off-policy reinforcement learning with delayed rewards

B Han, Z Ren, Z Wu, Y Zhou… - … Conference on Machine …, 2022 - proceedings.mlr.press
We study deep reinforcement learning (RL) algorithms with delayed rewards. In many real-
world tasks, instant rewards are often not readily accessible or even defined immediately …

Learning long-term reward redistribution via randomized return decomposition

Z Ren, R Guo, Y Zhou, J Peng - arXiv preprint arXiv:2111.13485, 2021 - arxiv.org
Many practical applications of reinforcement learning require agents to learn from sparse
and delayed rewards. It challenges the ability of agents to attribute their actions to future …

Asymptotic convergence and performance of multi-agent Q-learning dynamics

AA Hussain, F Belardinelli, G Piliouras - arXiv preprint arXiv:2301.09619, 2023 - arxiv.org
Achieving convergence of multiple learning agents in general $N$-player games is
imperative for the development of safe and reliable machine learning (ML) algorithms and …

A unified stochastic approximation framework for learning in games

P Mertikopoulos, YP Hsieh, V Cevher - Mathematical Programming, 2024 - Springer
We develop a flexible stochastic approximation framework for analyzing the long-run
behavior of learning in games (both continuous and finite). The proposed analysis template …

Multi-agent online optimization with delays: Asynchronicity, adaptivity, and optimism

YG Hsieh, F Iutzeler, J Malick… - Journal of Machine …, 2022 - jmlr.org
In this paper, we provide a general framework for studying multi-agent online learning
problems in the presence of delays and asynchronicities. Specifically, we propose and …

Payoff-based learning with matrix multiplicative weights in quantum games

K Lotidis, P Mertikopoulos… - Advances in Neural …, 2024 - proceedings.neurips.cc
In this paper, we study the problem of learning in quantum games (and other classes of
semidefinite games) with scalar, payoff-based feedback. For concreteness, we focus on the …

Asymptotically unbiased estimation for delayed feedback modeling via label correction

Y Chen, J Jin, H Zhao, P Wang, G Liu, J Xu… - Proceedings of the ACM …, 2022 - dl.acm.org
Alleviating the delayed feedback problem is of crucial importance for the conversion rate
(CVR) prediction in online advertising. Previous delayed feedback modeling methods using …