首页 | 本学科首页   官方微博 | 高级检索  
     


Reinforcement Learning Based Algorithms for Average Cost Markov Decision Processes
Authors:Mohammed Shahid Abdulla  Shalabh Bhatnagar
Affiliation:(1) Department of Computer Science and Automation, Indian Institute of Science, Bangalore, 560 012, India
Abstract:This article proposes several two-timescale simulation-based actor-critic algorithms for solution of infinite horizon Markov Decision Processes with finite state-space under the average cost criterion. Two of the algorithms are for the compact (non-discrete) action setting while the rest are for finite-action spaces. On the slower timescale, all the algorithms perform a gradient search over corresponding policy spaces using two different Simultaneous Perturbation Stochastic Approximation (SPSA) gradient estimates. On the faster timescale, the differential cost function corresponding to a given stationary policy is updated and an additional averaging is performed for enhanced performance. A proof of convergence to a locally optimal policy is presented. Next, we discuss a memory efficient implementation that uses a feature-based representation of the state-space and performs TD(0) learning along the faster timescale. The TD(0) algorithm does not follow an on-line sampling of states but is observed to do well on our setting. Numerical experiments on a problem of rate based flow control are presented using the proposed algorithms. We consider here the model of a single bottleneck node in the continuous time queueing framework. We show performance comparisons of our algorithms with the two-timescale actor-critic algorithms of Konda and Borkar (1999) and Bhatnagar and Kumar (2004). Our algorithms exhibit more than an order of magnitude better performance over those of Konda and Borkar (1999).
Contact Information Shalabh Bhatnagar (Corresponding author)Email:
Keywords:Actor-critic algorithms  Two timescale stochastic approximation  Markov decision processes  Policy iteration  Simultaneous perturbation stochastic approximation  Normalized Hadamard matrices  Reinforcement learning  TD-learning
本文献已被 SpringerLink 等数据库收录!
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号