Variance-penalized Markov decision processes: dynamic programming and reinforcement learning techniques
Authors: Abhijit Gosavi
Affiliation: Engineering Management and Systems Engineering, Missouri University of Science and Technology, Rolla, MO, USA.
Abstract: In control systems theory, the Markov decision process (MDP) is a widely used optimization model for selecting the optimal action in each state visited by a discrete-event system driven by Markov chains. The classical MDP model suits an agent/decision-maker interested in maximizing expected revenues, but it does not account for minimizing variability in those revenues. This paper proposes an MDP model in which the agent maximizes revenues while simultaneously controlling their variance. The work is rooted in machine learning/neural network concepts, where updating is based on system feedback and step sizes. First, a Bellman equation for the problem is proposed. Thereafter, convergent dynamic programming and reinforcement learning techniques for solving the MDP are provided, along with encouraging numerical results on a small MDP and a preventive maintenance problem.
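The abstract only outlines the approach, so the following is a minimal sketch of the dynamic-programming side: value iteration on a hypothetical two-state, two-action MDP whose immediate reward is penalized by a weight theta times its one-step variance. The penalty form r(s,a) - theta*Var[r(s,a)], the use of discounting, and all numerical values are illustrative assumptions and are not taken from the paper's actual Bellman equation or experiments.

```python
import numpy as np

# Hypothetical two-state, two-action MDP (numbers are illustrative only).
# P[a][s][s'] = transition probability, R[a][s][s'] = immediate reward.
P = np.array([[[0.7, 0.3], [0.4, 0.6]],
              [[0.9, 0.1], [0.2, 0.8]]])
R = np.array([[[6.0, -5.0], [7.0, 12.0]],
              [[10.0, 17.0], [-14.0, 13.0]]])
gamma = 0.95   # discount factor (assumption; the paper treats other criteria)
theta = 0.1    # variance-penalty weight, i.e., degree of risk aversion

def penalized_reward(a, s):
    """Expected immediate reward minus theta times the variance of the
    immediate reward for taking action a in state s (assumed penalty form)."""
    mean = P[a, s] @ R[a, s]
    var = P[a, s] @ (R[a, s] - mean) ** 2
    return mean - theta * var

def value_iteration(tol=1e-8, max_iter=10_000):
    """Standard value iteration applied to the variance-penalized reward."""
    n_actions, n_states = P.shape[0], P.shape[1]
    V = np.zeros(n_states)
    for _ in range(max_iter):
        # Q[s, a] = penalized one-step reward + discounted expected value.
        Q = np.array([[penalized_reward(a, s) + gamma * P[a, s] @ V
                       for a in range(n_actions)]
                      for s in range(n_states)])
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return V, Q.argmax(axis=1)

V, policy = value_iteration()
print("Penalized values:", V)
print("Greedy policy:", policy)
```

A reinforcement learning variant in the spirit described by the abstract would replace the exact expectations above with sample-based updates driven by system feedback and step sizes; the specific convergent algorithms are given in the paper itself.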
Keywords: variance-penalized MDPs; dynamic programming; risk penalties; reinforcement learning; Bellman equation