Survey on Deep Reinforcement Learning Methods Based on Sample Efficiency Optimization
Cite this article: 张峻伟, 吕帅, 张正昊, 于佳玉, 龚晓宇. Survey on deep reinforcement learning methods based on sample efficiency optimization [J]. Journal of Software, 2022, 33(11): 4217-4238.
Authors: 张峻伟  吕帅  张正昊  于佳玉  龚晓宇
Affiliation: Key Laboratory of Symbolic Computation and Knowledge Engineering (Jilin University), Ministry of Education, Changchun 130012, China; College of Computer Science and Technology, Jilin University, Changchun 130012, China; College of Software, Jilin University, Changchun 130012, China
Funding: National Key R&D Program of China (2017YFB1003103); National Natural Science Foundation of China (61300049); Natural Science Foundation of Jilin Province (20180101053JC)

Received: 2020-11-11
Revised: 2021-01-18

Abstract: Deep reinforcement learning combines the representation ability of deep learning with the decision-making ability of reinforcement learning, and has attracted intense research interest owing to its remarkable performance on complex control tasks. This survey classifies model-free deep reinforcement learning methods into Q-value function methods and policy gradient methods, according to whether the Bellman equation is used, and reviews each class in terms of model construction, optimization history, and method evaluation. Addressing the low sample efficiency of deep reinforcement learning, it argues from the model characteristics of the two classes that the overestimation problem in Q-value function methods and the unbiased-sampling constraint in policy gradient methods are the main factors limiting their sample efficiency. From the two perspectives of enhancing exploration efficiency and improving sample exploitation, it then summarizes the feasible optimization methods reflected in recent research hotspots and trends, analyzes their advantages and remaining problems, and compares their scopes of application and optimization effects. Finally, it identifies three future research directions: enhancing the generality of the optimization methods, investigating the transfer of optimization mechanisms between the two classes of methods, and improving theoretical completeness.
Keywords: deep reinforcement learning  Q-value function method  policy gradient method  sample efficiency  exploration and exploitation
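
The abstract above traces limited sample efficiency to two model-specific causes: the overestimation bias introduced by the max operator in Q-value bootstrap targets, and the on-policy (unbiased sampling) requirement of policy gradient estimators. As a minimal illustration, the two standard update rules are sketched below in textbook notation (a DQN-style target with target-network parameters \theta^{-}, and the REINFORCE gradient); this is generic background in LaTeX, not a formulation quoted from the surveyed paper.

% DQN-style bootstrap target: the same max operator both selects and
% evaluates the next action a', so estimation noise in Q is propagated
% upward into the target (the overestimation problem).
y = r + \gamma \max_{a'} Q(s', a'; \theta^{-})

% REINFORCE policy gradient: the expectation must be taken over
% trajectories \tau drawn from the current policy \pi_\theta, so samples
% become stale after every parameter update (the unbiased-sampling
% constraint that limits sample reuse).
\nabla_\theta J(\theta) =
  \mathbb{E}_{\tau \sim \pi_\theta}\!\left[
    \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R_t
  \right]

Typical remedies in the literature include decoupling action selection from action evaluation (e.g., double Q-learning) on the Q-value side, and importance-weighted reuse of off-policy samples on the policy gradient side.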