Deep Deterministic Policy Gradient With Classified Experience Replay
Cite this article: SHI Sheng-Miao, LIU Quan. Deep Deterministic Policy Gradient With Classified Experience Replay [J]. Acta Automatica Sinica, 2022, 48(7): 1816-1823.
Authors: SHI Sheng-Miao (时圣苗), LIU Quan (刘全)
Affiliation: 1. School of Computer Science and Technology, Soochow University, Suzhou 215006
Funding: National Natural Science Foundation of China (61772355, 61702055, 61876217, 62176175)
Abstract: The deep deterministic policy gradient (DDPG) method achieves good performance on continuous control tasks. To further improve the efficiency of the experience replay mechanism in DDPG, a classified experience replay method is proposed, with experience samples classified in two ways: DDPG with temporal difference-error classification (TDC-DDPG) and DDPG with reward classification (RC-DDPG). Each of the two methods uses two replay buffers and stores the generated experience samples separately according to their degree of importance; during network training, learning is accelerated by drawing a larger number of high-importance samples. The classified experience replay methods are tested on continuous control tasks, and the experimental results show that TDC-DDPG and RC-DDPG perform better than the DDPG method that selects experience samples uniformly at random.

Keywords: continuous control task  deep deterministic policy gradient  experience replay  classified experience replay
Received: 2019-05-24

Deep Deterministic Policy Gradient With Classified Experience Replay
Affiliation: 1. School of Computer Science and Technology, Soochow University, Suzhou 215006; 2. Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou 215006; 3. Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012; 4. Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000
Abstract: The deep deterministic policy gradient (DDPG) algorithm achieves good performance in continuous control tasks. To further improve the efficiency of the experience replay mechanism in DDPG, a classified experience replay method is proposed, with two variants that classify transitions in different ways: deep deterministic policy gradient with temporal difference-error classification (TDC-DDPG) and deep deterministic policy gradient with reward classification (RC-DDPG). Each variant introduces two replay buffers and stores transitions separately according to their degree of importance. Learning is sped up during network training by sampling a larger number of high-importance transitions. The classified experience replay method is tested on a series of continuous control tasks, and the experimental results show that the TDC-DDPG and RC-DDPG methods outperform the DDPG method that selects transitions uniformly at random.
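As a rough illustration of the two-buffer scheme described above, the sketch below classifies each transition by the magnitude of its TD error and draws a fixed, larger share of every minibatch from the high-importance buffer. The class name, threshold, and sampling ratio are illustrative assumptions, not the paper's implementation; RC-DDPG would use the immediate reward instead of the TD error as the classification criterion.

import random
from collections import deque


class ClassifiedReplayBuffer:
    """Two replay buffers: transitions judged important (here, large TD error)
    are stored separately and given a larger share of every minibatch."""

    def __init__(self, capacity=100_000, td_threshold=1.0, high_ratio=0.7):
        self.high = deque(maxlen=capacity)  # high-importance transitions
        self.low = deque(maxlen=capacity)   # remaining transitions
        self.td_threshold = td_threshold    # classification criterion (assumed value)
        self.high_ratio = high_ratio        # share of each batch from the high buffer

    def add(self, state, action, reward, next_state, done, td_error):
        # Classify the transition by the magnitude of its TD error.
        transition = (state, action, reward, next_state, done)
        if abs(td_error) >= self.td_threshold:
            self.high.append(transition)
        else:
            self.low.append(transition)

    def sample(self, batch_size):
        # Draw more transitions from the high-importance buffer; the batch may
        # be smaller than batch_size while the buffers are still filling up.
        n_high = min(int(batch_size * self.high_ratio), len(self.high))
        n_low = min(batch_size - n_high, len(self.low))
        batch = (random.sample(list(self.high), n_high)
                 + random.sample(list(self.low), n_low))
        random.shuffle(batch)
        return batch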
Keywords: continuous control task  deep deterministic policy gradient  experience replay  classified experience replay