Dense reward for free in reinforcement learning from human feedback
Reinforcement Learning from Human Feedback (RLHF) has been credited as the key
advance that has allowed Large Language Models (LLMs) to effectively follow instructions …
Asynchronous proportional response dynamics: convergence in markets with adversarial scheduling
Y Kolumbus, M Levy, N Nisan - Advances in Neural …, 2023 - proceedings.neurips.cc
We study Proportional Response Dynamics (PRD) in linear Fisher markets, where
participants act asynchronously. We model this scenario as a sequential process in which at …
Doubly optimal no-regret online learning in strongly monotone games with bandit feedback
We consider online no-regret learning in unknown games with bandit feedback, where each
player can only observe its reward at each time--determined by all players' current joint …
Off-policy reinforcement learning with delayed rewards
We study deep reinforcement learning (RL) algorithms with delayed rewards. In many real-
world tasks, instant rewards are often not readily accessible or even defined immediately …
Learning long-term reward redistribution via randomized return decomposition
Many practical applications of reinforcement learning require agents to learn from sparse
and delayed rewards. It challenges the ability of agents to attribute their actions to future …
Asymptotic convergence and performance of multi-agent q-learning dynamics
Achieving convergence of multiple learning agents in general $N$-player games is
imperative for the development of safe and reliable machine learning (ML) algorithms and …
A unified stochastic approximation framework for learning in games
We develop a flexible stochastic approximation framework for analyzing the long-run
behavior of learning in games (both continuous and finite). The proposed analysis template …
Multi-agent online optimization with delays: Asynchronicity, adaptivity, and optimism
In this paper, we provide a general framework for studying multi-agent online learning
problems in the presence of delays and asynchronicities. Specifically, we propose and …
Payoff-based learning with matrix multiplicative weights in quantum games
K Lotidis, P Mertikopoulos… - Advances in Neural …, 2024 - proceedings.neurips.cc
In this paper, we study the problem of learning in quantum games-and other classes of
semidefinite games-with scalar, payoff-based feedback. For concreteness, we focus on the …
Asymptotically unbiased estimation for delayed feedback modeling via label correction
Y Chen, J Jin, H Zhao, P Wang, G Liu, J Xu… - Proceedings of the ACM …, 2022 - dl.acm.org
Alleviating the delayed feedback problem is of crucial importance for the conversion rate
(CVR) prediction in online advertising. Previous delayed feedback modeling methods using …