
Deep reinforcement learning via a good-choice resampling experience replay memory mechanism
Cite this article: CHEN Xi-liang, CAO Lei, LI Chen-xi, XU Zhi-xiong, HE Ming. Deep reinforcement learning via good choice resampling experience replay memory[J]. Control and Decision, 2018, 33(4): 600-606.
Authors: CHEN Xi-liang, CAO Lei, LI Chen-xi, XU Zhi-xiong, HE Ming
Affiliation: Institute of Command Information Systems, PLA University of Science and Technology, Nanjing 210007, China
Funding: National Natural Science Foundation of China (61301159, 61303267); National Key R&D Program of China (2016YFC0800606); Natural Science Foundation of Jiangsu Province (BK20150721, BK20161469).
Abstract: To address the construction of the experience replay memory in deep reinforcement learning algorithms, a good-choice resampling memory mechanism based on TD error is proposed. To counter the training-set collapse that this mechanism can cause, a rank-based stratified sampling algorithm is introduced as an improvement, and the combined mechanism is used to improve several typical DQN-based deep reinforcement learning algorithms. Comparative simulation experiments on the CartPole learning-control problem on the OpenAI Gym platform show that the good-choice mechanism raises the quality of training samples and effectively approximates the value function, yielding good learning efficiency and generalization performance, with clearly improved convergence speed and training performance.

Keywords: deep reinforcement learning  experience replay  resampling

Deep reinforcement learning via good choice resampling experience replay memory
CHEN Xi-liang, CAO Lei, LI Chen-xi, XU Zhi-xiong, HE Ming. Deep reinforcement learning via good choice resampling experience replay memory[J]. Control and Decision, 2018, 33(4): 600-606.
Authors: CHEN Xi-liang, CAO Lei, LI Chen-xi, XU Zhi-xiong, HE Ming
Affiliation: Institute of Command Information Systems, PLA University of Science and Technology, Nanjing 210007, China
Abstract: In order to build a good experience replay memory mechanism for deep reinforcement learning, a resampling good-choice memory construction method based on TD error is proposed. A rank-based stratified sampling algorithm is also developed to avoid the collapse of the training data set. Combined with this mechanism, several typical deep reinforcement learning algorithms based on DQN (deep Q-networks) are improved. Simulation results on the CartPole control problem on the OpenAI Gym platform show that the optimization mechanism improves the quality of training samples, effectively approximates the value function, and achieves good learning efficiency and generalization performance. Convergence speed and training performance are both improved significantly.
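The abstract's two ideas, prioritizing transitions by |TD error| and drawing a rank-based stratified sample so the training set does not collapse onto a few high-error transitions, can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the class name `RankStratifiedReplayBuffer` and its structure are assumptions, and a real DQN agent would recompute TD errors as the network trains.

```python
import random

class RankStratifiedReplayBuffer:
    """Sketch of a TD-error-based replay memory with rank-based
    stratified sampling (illustrative, not the paper's implementation)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = []  # list of (transition, abs_td_error)

    def add(self, transition, td_error):
        # Drop the oldest transition once the memory is full.
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)
        self.buffer.append((transition, abs(td_error)))

    def sample(self, batch_size):
        # Rank transitions by descending |TD error|.
        ranked = sorted(self.buffer, key=lambda x: x[1], reverse=True)
        n = len(ranked)
        batch = []
        # Split the ranking into batch_size segments and draw one
        # transition uniformly from each segment, so high-error
        # transitions are favored while the batch stays diverse.
        for k in range(batch_size):
            lo = k * n // batch_size
            hi = max(lo + 1, (k + 1) * n // batch_size)
            batch.append(random.choice(ranked[lo:hi])[0])
        return batch
```

Sampling one transition per rank segment is what guards against the "training set collapse" the abstract mentions: pure greedy selection by TD error would replay the same few transitions repeatedly, while stratification guarantees coverage of the whole ranking.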
Keywords: deep reinforcement learning  experience replay  resampling