Deep Q Net Based on Advantage Learning

XIA Zongtao, QIN Jin. Deep Q Net Based on Advantage Learning [J]. Computer Engineering and Applications, 2019, 55(20): 101-106.
Authors: XIA Zongtao, QIN Jin
Affiliation: College of Computer Science and Technology, Guizhou University, Guiyang 550025, China

Abstract: In reinforcement learning, the gap between the state-action values of different actions in the same state is often very small. Because the Q-Learning algorithm selects actions with the MAX operator, it suffers from an overestimation problem, and the Deep Q Net (DQN), which builds on Q-Learning, inherits the same problem. To alleviate overestimation in DQN, a deep Q net based on advantage learning is proposed: a correction term is constructed by the advantage-learning method and modeled with the target value network, and its sum with DQN's evaluation function is taken as the new evaluation function. When the selected action is the optimal action, the correction term is zero and the evaluation value is unchanged; when the selected action is not optimal, the correction term is negative, which lowers the evaluation values of non-optimal actions. Compared with the traditional DQN, the advantage-learning-based DQN achieves higher average rewards on the Atari 2600 control problems breakout, seaquest, phoenix, and amidar, and more stable policies on krull and seaquest.

Keywords: reinforcement learning; advantage learning; Deep Q Net (DQN); overestimation
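As a rough illustration (not the authors' published code), the target computation described in the abstract can be sketched in PyTorch. The correction term alpha * (Q_target(s, a) - max_a Q_target(s, a)) is modeled entirely with the target network: it is zero when the taken action is greedy under Q_target and negative otherwise, so non-optimal actions are pushed down. The function name, the scaling coefficient alpha, and all tensor shapes are assumptions made for this example.

```python
import torch

def al_dqn_target(q_target, states, actions, rewards, next_states, dones,
                  gamma=0.99, alpha=0.9):
    """Advantage-learning-corrected regression target for Q(s, a) (sketch).

    base       = r + gamma * max_a' Q_target(s', a')              # standard DQN target
    correction = alpha * (Q_target(s, a) - max_a Q_target(s, a))  # <= 0
    """
    with torch.no_grad():
        # Standard DQN bootstrap term.
        next_q = q_target(next_states).max(dim=1).values
        base = rewards + gamma * (1.0 - dones) * next_q

        # Correction term, computed from the target network only.
        q_all = q_target(states)                                  # Q_target(s, .)
        q_sa = q_all.gather(1, actions.unsqueeze(1)).squeeze(1)   # Q_target(s, a)
        correction = alpha * (q_sa - q_all.max(dim=1).values)     # 0 for greedy a

    return base + correction

# Toy usage with a linear Q-network over 4-dim states and 3 actions:
q_target = torch.nn.Linear(4, 3)
states = torch.randn(32, 4)
actions = torch.randint(0, 3, (32,))
rewards = torch.randn(32)
next_states = torch.randn(32, 4)
dones = torch.zeros(32)
y = al_dqn_target(q_target, states, actions, rewards, next_states, dones)
```

The returned value y would serve as the regression target for the online network's Q(s, a); with alpha = 0 the sketch reduces to the standard DQN target.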