A survey of reinforcement learning from human feedback
Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning
(RL) that learns from human feedback instead of relying on an engineered reward function …
[BOOK][B] Distributional reinforcement learning
The first comprehensive guide to distributional reinforcement learning, providing a new
mathematical formalism for thinking about decisions from a probabilistic perspective …
Dynamic programming models for maximizing customer lifetime value: an overview
Customer lifetime value (CLV) is the most reliable indicator in direct marketing for measuring
the profitability of customers. This has motivated researchers to compete in building …
Quantile Markov decision processes
The goal of a traditional Markov decision process (MDP) is to maximize expected cumulative
reward over a defined horizon (possibly infinite). In many applications, however, a decision …
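As context for the entry above: the standard MDP objective it generalizes, maximizing expected cumulative reward, is commonly solved by value iteration. The sketch below is a toy illustration of that baseline objective, not code from the paper; the states, transitions, rewards, and discount factor are all made up for the example.

```python
# Toy value iteration for the expected-cumulative-reward MDP objective.
# All model parameters here are illustrative assumptions.
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9

rng = np.random.default_rng(0)
# P[a, s, s'] = transition probability; R[s, a] = expected immediate reward.
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
R = rng.uniform(0, 1, size=(n_states, n_actions))

V = np.zeros(n_states)
for _ in range(1000):
    # Bellman optimality update: V(s) = max_a [R(s,a) + gamma * sum_s' P(s'|s,a) V(s')]
    Q = R + gamma * np.einsum("asn,n->sa", P, V)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:
        V = V_new
        break
    V = V_new
```

A quantile (or otherwise risk-sensitive) MDP replaces the expectation in this update with a statistic of the return distribution, which is what breaks the plain Bellman recursion above.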
Computational approaches for stochastic shortest path on succinct MDPs
We consider the stochastic shortest path (SSP) problem for succinct Markov decision
processes (MDPs), where the MDP consists of a set of variables, and a set of …
Conditional value-at-risk for reachability and mean payoff in Markov decision processes
J Křetínský, T Meggendorfer - Proceedings of the 33rd Annual ACM …, 2018 - dl.acm.org
We present the conditional value-at-risk (CVaR) in the context of Markov chains and Markov
decision processes with reachability and mean-payoff objectives. CVaR quantifies risk by …
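For readers unfamiliar with the risk measure in the entry above: on a finite sample of rewards, CVaR at level α can be estimated as the average of the worst α-fraction of outcomes. This is a generic empirical sketch, not the paper's algorithm; the sample data and α are illustrative.

```python
# Empirical CVaR sketch: average the worst ceil(alpha * n) reward samples.
# Data and alpha below are made-up illustrations.
import math

def empirical_cvar(samples, alpha):
    """Mean of the worst ceil(alpha * n) samples (lower tail of rewards)."""
    worst = sorted(samples)[: math.ceil(alpha * len(samples))]
    return sum(worst) / len(worst)

rewards = [5.0, -2.0, 3.0, 0.0, -1.0, 4.0, 2.0, 1.0]
cvar_25 = empirical_cvar(rewards, 0.25)  # mean of the two worst outcomes -> -1.5
```

Unlike the expectation, CVaR focuses on the unfavorable tail, which is why optimizing it over MDP policies requires machinery beyond standard value iteration.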
Risk-averse MDPs under reward ambiguity
We propose a distributionally robust return-risk model for Markov decision processes
(MDPs) under risk and reward ambiguity. The proposed model optimizes the weighted …
[PDF] Maximizing customer lifetime value using dynamic programming: Theoretical and practical implications
Dynamic programming models play a significant role in maximizing customer lifetime value
(CLV), in different market types including B2B, B2C, C2B, C2C and B2B2C. This paper …
Learning Risk Preferences in Markov Decision Processes: an Application to the Fourth Down Decision in Football
For decades, National Football League (NFL) coaches' observed fourth down decisions
have been largely inconsistent with prescriptions based on statistical models. In this paper …
Verification of Discrete-Time Markov Decision Processes
T Meggendorfer - 2021 - mediatum.ub.tum.de
In this thesis, we discuss the verification of discrete-time Markov decision processes (MDP).
First, we present two novel algorithms to efficiently compute mean-payoff queries on MDP …