A survey of reinforcement learning from human feedback

T Kaufmann, P Weng, V Bengs… - arXiv preprint arXiv …, 2023 - arxiv.org
Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning
(RL) that learns from human feedback instead of relying on an engineered reward function …

[BOOK][B] Distributional reinforcement learning

MG Bellemare, W Dabney, M Rowland - 2023 - books.google.com
The first comprehensive guide to distributional reinforcement learning, providing a new
mathematical formalism for thinking about decisions from a probabilistic perspective …
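
For orientation only (generic notation, not necessarily the book's own): distributional RL replaces the expected return with the full return distribution Z(s, a), which satisfies a distributional Bellman equation of the form
\[ Z(s,a) \overset{D}{=} R(s,a) + \gamma\, Z(S', A'), \qquad S' \sim P(\cdot \mid s,a),\ A' \sim \pi(\cdot \mid S'), \]
where \(\overset{D}{=}\) denotes equality in distribution; taking expectations on both sides recovers the ordinary Bellman equation.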

Dynamic programming models for maximizing customer lifetime value: an overview

E AboElHamd, HM Shamma, M Saleh - Intelligent Systems and …, 2020 - Springer
Customer lifetime value (CLV) is the most reliable indicator in direct marketing for measuring the profitability of customers. This has motivated researchers to compete in building …

Quantile Markov decision processes

X Li, H Zhong, ML Brandeau - Operations research, 2022 - pubsonline.informs.org
The goal of a traditional Markov decision process (MDP) is to maximize expected cumulative
reward over a defined horizon (possibly infinite). In many applications, however, a decision …
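
As a hedged illustration of the contrast the abstract draws (generic notation, not the paper's exact formulation): a standard MDP seeks a policy \(\pi\) maximizing the expected discounted return, whereas a quantile objective replaces the expectation with a \(\tau\)-quantile of the return,
\[ \max_{\pi}\; \mathbb{E}_{\pi}\Big[\sum_{t \ge 0} \gamma^{t} r_t\Big] \quad \text{vs.} \quad \max_{\pi}\; Q_{\tau}\Big(\sum_{t \ge 0} \gamma^{t} r_t\Big), \qquad Q_{\tau}(X) = \inf\{x : \Pr(X \le x) \ge \tau\}. \]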

Computational approaches for stochastic shortest path on succinct MDPs

K Chatterjee, H Fu, AK Goharshady, N Okati - arXiv preprint arXiv …, 2018 - arxiv.org
We consider the stochastic shortest path (SSP) problem for succinct Markov decision
processes (MDPs), where the MDP consists of a set of variables, and a set of …

Conditional value-at-risk for reachability and mean payoff in Markov decision processes

J Křetínský, T Meggendorfer - Proceedings of the 33rd Annual ACM …, 2018 - dl.acm.org
We present the conditional value-at-risk (CVaR) in the context of Markov chains and Markov
decision processes with reachability and mean-payoff objectives. CVaR quantifies risk by …
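
For reference, a standard definition of CVaR, stated here under a lower-tail convention for payoffs (the paper's reachability and mean-payoff setting may use a different convention): for a payoff random variable X and level \(\alpha \in (0,1)\),
\[ \mathrm{VaR}_{\alpha}(X) = \inf\{x : \Pr(X \le x) \ge \alpha\}, \qquad \mathrm{CVaR}_{\alpha}(X) = \mathbb{E}\big[X \mid X \le \mathrm{VaR}_{\alpha}(X)\big], \]
i.e., the expected payoff over the worst \(\alpha\)-fraction of outcomes. The conditional-expectation form holds for continuous distributions; the general case uses the Rockafellar–Uryasev formulation.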

Risk-averse MDPs under reward ambiguity

H Ruan, Z Chen, CP Ho - arXiv preprint arXiv:2301.01045, 2023 - arxiv.org
We propose a distributionally robust return-risk model for Markov decision processes
(MDPs) under risk and reward ambiguity. The proposed model optimizes the weighted …

[PDF] Maximizing customer lifetime value using dynamic programming: Theoretical and practical implications

E AboElHamd, HM Shamma, M Saleh - Academy of Marketing …, 2020 - academia.edu
Dynamic programming models play a significant role in maximizing customer lifetime value (CLV) across different market types, including B2B, B2C, C2B, C2C, and B2B2C. This paper …

Learning Risk Preferences in Markov Decision Processes: an Application to the Fourth Down Decision in Football

N Sandholtz, L Wu, M Puterman, TCY Chan - arXiv preprint arXiv …, 2023 - arxiv.org
For decades, National Football League (NFL) coaches' observed fourth down decisions
have been largely inconsistent with prescriptions based on statistical models. In this paper …

Verification of Discrete-Time Markov Decision Processes

T Meggendorfer - 2021 - mediatum.ub.tum.de
In this thesis, we discuss the verification of discrete-time Markov decision processes (MDPs). First, we present two novel algorithms to efficiently compute mean-payoff queries on MDPs …