Dual-Network DQN Algorithm Based on Second-Order Temporal Difference Error
Cite this article: CHEN Jianping, ZHOU Xin, FU Qiming, GAO Zhen, FU Baochuan, WU Hongjie. Dual-Network DQN Algorithm Based on Second-Order Temporal Difference Error[J]. Computer Engineering, 2020, 46(5): 78-85, 93.
Authors: CHEN Jianping, ZHOU Xin, FU Qiming, GAO Zhen, FU Baochuan, WU Hongjie
Affiliation: School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, Jiangsu 215009, China; Jiangsu Province Key Laboratory of Intelligent Building Energy Efficiency, Suzhou University of Science and Technology, Suzhou, Jiangsu 215009, China
Fund projects: Jiangsu Provincial Key Research and Development Program; Postgraduate Research and Practice Innovation Program of Jiangsu Province; National Natural Science Foundation of China
Abstract: To address the poor convergence stability of the Deep Q-Network (DQN) algorithm caused by overestimation, the concept of the N-order Temporal Difference (TD) error is proposed on the basis of the traditional TD error, and a dual-network DQN algorithm based on the second-order TD error is designed. A value function update formula based on the second-order TD error is constructed, and a dual-network model is built on top of the DQN algorithm, yielding two isomorphic value function networks that represent the value functions of two successive rounds and cooperatively update the network parameters, so as to improve the stability of value function estimation in the DQN algorithm. Experimental results on the OpenAI Gym platform show that the proposed algorithm achieves better convergence stability than the classical DQN algorithm on the Mountain Car and Cart Pole problems.

Keywords: Deep Reinforcement Learning (DRL); Markov Decision Process (MDP); Deep Q-Network (DQN); second-order Temporal Difference (TD) error; gradient descent
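For readers who want a concrete picture of the update the abstract describes, below is a minimal Python (PyTorch) sketch of one gradient-descent step driven by a second-order TD error computed from two isomorphic Q-networks. Since this record does not give the paper's formulas, the specific definition used here (the second-order error as the difference between the first-order TD errors computed with the current-round and previous-round networks) and all identifiers (q_curr, q_prev, second_order_td_loss, the layer sizes) are illustrative assumptions, not the authors' exact formulation.

import torch
import torch.nn as nn

# Hypothetical sketch: two isomorphic Q-networks, one holding the
# current-round estimate (q_curr) and one the previous-round estimate
# (q_prev), as the abstract describes.  The second-order TD error is
# ASSUMED here to be the difference of the two first-order TD errors;
# the paper's exact formula may differ.

def make_q_net(n_obs: int, n_act: int) -> nn.Module:
    return nn.Sequential(nn.Linear(n_obs, 64), nn.ReLU(), nn.Linear(64, n_act))

n_obs, n_act, gamma = 4, 2, 0.99            # Cart Pole-like dimensions
q_curr = make_q_net(n_obs, n_act)           # value function of round k
q_prev = make_q_net(n_obs, n_act)           # value function of round k-1
q_prev.load_state_dict(q_curr.state_dict())
optim = torch.optim.Adam(q_curr.parameters(), lr=1e-3)

def second_order_td_loss(s, a, r, s2, done):
    """First-order TD errors under both networks, then their difference."""
    q_sa = q_curr(s).gather(1, a)                          # Q_k(s, a)
    with torch.no_grad():
        target_curr = r + gamma * (1 - done) * q_curr(s2).max(1, keepdim=True).values
        target_prev = r + gamma * (1 - done) * q_prev(s2).max(1, keepdim=True).values
        q_sa_prev = q_prev(s).gather(1, a)                 # Q_{k-1}(s, a)
    delta_curr = target_curr - q_sa                        # first-order TD error, round k
    delta_prev = target_prev - q_sa_prev                   # first-order TD error, round k-1
    delta2 = delta_curr - delta_prev                       # assumed second-order TD error
    return (delta2 ** 2).mean()

# One gradient-descent step on a dummy transition batch.
s  = torch.randn(8, n_obs); s2 = torch.randn(8, n_obs)
a  = torch.randint(n_act, (8, 1)); r = torch.randn(8, 1)
done = torch.zeros(8, 1)
loss = second_order_td_loss(s, a, r, s2, done)
optim.zero_grad(); loss.backward(); optim.step()
q_prev.load_state_dict(q_curr.state_dict())  # roll round k into round k-1

Copying q_curr into q_prev after each step stands in for the abstract's "two successive rounds"; the paper's cooperative parameter update between the two networks may be more elaborate than this hard copy.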
