Variance-penalized Markov decision processes: dynamic programming and reinforcement learning techniques

Authors: Abhijit Gosavi

Affiliation: Engineering Management and Systems Engineering, Missouri University of Science and Technology, Rolla, MO, USA.
Abstract: In control systems theory, the Markov decision process (MDP) is a widely used optimization model in which an agent selects the optimal action in each state visited by a discrete-event system driven by Markov chains. The classical MDP model suits an agent/decision-maker interested in maximizing expected revenues, but it does not account for minimizing variability in those revenues. This paper proposes an MDP model in which the agent can maximize revenues while simultaneously controlling their variance. The work is rooted in machine learning/neural network concepts, where updating is based on system feedback and step sizes. First, a Bellman equation for the problem is proposed. Thereafter, convergent dynamic programming and reinforcement learning techniques for solving the MDP are provided, along with encouraging numerical results on a small MDP and a preventive maintenance problem.
| |
Keywords: variance-penalized MDPs; dynamic programming; risk penalties; reinforcement learning; Bellman equation
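To make the abstract's objective concrete, the sketch below scores policies by a generic variance-penalized criterion of the form J(pi) = E[R] - lambda * Var[R], i.e., mean return minus a penalty on return variance. This is an illustrative assumption about the form of the criterion, not the paper's algorithm; the sample returns and the penalty weight `lam` are made up for the example.

```python
# Hedged sketch of a variance-penalized objective: J = E[R] - lam * Var[R].
# The returns and penalty weight below are illustrative, not from the paper.
import statistics

def variance_penalized_score(returns, lam):
    """Mean return minus lam times the (population) variance of returns."""
    mean = statistics.fmean(returns)
    var = statistics.pvariance(returns)
    return mean - lam * var

# Policy A: slightly higher mean but volatile returns.
policy_a = [10.0, -2.0, 12.0, -4.0, 14.0]   # mean 6.0, variance 56.0
# Policy B: slightly lower mean but steady returns.
policy_b = [5.0, 6.0, 5.5, 6.5, 6.0]        # mean 5.8, variance 0.26

lam = 0.1
score_a = variance_penalized_score(policy_a, lam)  # 6.0 - 0.1*56.0 = 0.4
score_b = variance_penalized_score(policy_b, lam)  # 5.8 - 0.1*0.26 = 5.774
# Under this penalty weight, the steadier policy B is preferred.
```

A risk-neutral agent (lam = 0) would pick policy A on mean alone; the penalty term flips the ranking, which is the kind of trade-off the variance-penalized MDP formalizes.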