Deep Attention Reinforcement Learning Method Based on an Autoregressive Prediction Model
Cite this article: LIANG Xing-Xing, FENG Yang-He, HUANG Jin-Cai, WANG Qi, MA Yang, LIU Zhong. Deep attention reinforcement learning method based on an autoregressive prediction model[J]. Journal of Software, 2020, 31(4): 948-966.
Authors: LIANG Xing-Xing  FENG Yang-He  HUANG Jin-Cai  WANG Qi  MA Yang  LIU Zhong
Affiliation: College of Systems Engineering, National University of Defense Technology, Changsha 410072, China
Funding: National Natural Science Foundation of China (71701205)
Abstract: In recent years, deep reinforcement learning has demonstrated strong intelligence and broad applicability in a variety of decision-making and planning problems, with success stories such as AlphaGo, OpenAI Five, and AlphaStar. However, conventional deep reinforcement learning's heavy dependence on computing resources and its inefficient use of data severely limit its application to complex real-world tasks. Traditional model-based reinforcement learning algorithms learn the latent dynamics of the environment and can therefore exploit sample information fully, raising data efficiency and speeding up training; how to build an accurate environment model quickly, however, remains the central difficulty of the model-based approach. Combining the advantages of model-based and model-free reinforcement learning, this paper proposes a deep attention reinforcement learning method based on a temporal autoregressive prediction model. An autoencoder compresses observations into a latent state space, an autoregressive model built on those latents serves as the environment prediction model, an attention mechanism combines the model's predictions to estimate the value function of each decision state, and all modules are trained jointly end to end for efficient training. Experiments on classic control tasks such as CartPole-V0 show that the model builds an environment prediction model efficiently and combines model-based and model-free reinforcement learning effectively, achieving high sample efficiency. Finally, an empirical study on the intelligent missile-penetration planning problem shows that the proposed learning model outperforms traditional penetration planning in specific scenarios.
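The world-model half of the method pairs a variational autoencoder with a mixture-density-network RNN. Below is a minimal PyTorch sketch of those two components under assumed sizes (a CartPole-like 4-dimensional observation, an 8-dimensional latent, a 5-component mixture); the module names and dimensions are illustrative assumptions, not the authors' released code.

```python
# Sketch of the two world-model components named in the abstract: a VAE that
# compresses observations into a latent state, and an MDN-RNN that acts as the
# autoregressive environment (prediction) model over those latents.
import torch
import torch.nn as nn

OBS_DIM, LATENT_DIM, ACTION_DIM, HIDDEN, N_MIX = 4, 8, 1, 64, 5  # assumed sizes

class VAE(nn.Module):
    """Compresses an observation into a Gaussian latent state z."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(OBS_DIM, 2 * LATENT_DIM)   # outputs (mu, log_var)
        self.dec = nn.Linear(LATENT_DIM, OBS_DIM)

    def forward(self, obs):
        mu, log_var = self.enc(obs).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()  # reparameterization
        return self.dec(z), z, mu, log_var

class MDNRNN(nn.Module):
    """Autoregressive model: predicts a mixture density over the next latent."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.LSTM(LATENT_DIM + ACTION_DIM, HIDDEN, batch_first=True)
        self.mdn = nn.Linear(HIDDEN, N_MIX * (2 * LATENT_DIM + 1))

    def forward(self, z, action, hidden=None):
        out, hidden = self.rnn(torch.cat([z, action], dim=-1), hidden)
        logits, mu_next, log_sigma = self.mdn(out).split(
            [N_MIX, N_MIX * LATENT_DIM, N_MIX * LATENT_DIM], dim=-1)
        return logits, mu_next, log_sigma, hidden

obs = torch.randn(1, 1, OBS_DIM)                       # one observation step
recon, z, mu, log_var = VAE()(obs)                     # compress to latent z
logits, mu_next, log_sigma, h = MDNRNN()(z, torch.zeros(1, 1, ACTION_DIM))
```

In training, the VAE would be fit with a reconstruction-plus-KL loss and the MDN-RNN with the negative log-likelihood of the next latent under the predicted mixture; per the abstract, the paper trains all modules jointly end to end.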

Keywords: attention mechanism; deep reinforcement learning; actor-critic algorithm; variational auto-encoder; mixture density network-recurrent neural network
Received: 2019-05-31
Revised: 2019-07-29

Novel Deep Reinforcement Learning Algorithm Based on Attention-based Value Function and Autoregressive Environment Model
LIANG Xing-Xing, FENG Yang-He, HUANG Jin-Cai, WANG Qi, MA Yang, LIU Zhong. Novel Deep Reinforcement Learning Algorithm Based on Attention-based Value Function and Autoregressive Environment Model[J]. Journal of Software, 2020, 31(4): 948-966.
Authors:LIANG Xing-Xing  FENG Yang-He  HUANG Jin-Cai  WANG Qi  MA Yang  LIU Zhong
Affiliation: College of Systems Engineering, National University of Defense Technology, Changsha 410072, China
Abstract: Deep reinforcement learning (DRL) has recently shown promise in sequential decision-making and intelligent scheduling problems, and examples such as AlphaGo, OpenAI Five, and AlphaStar demonstrate the strong generalization capability of the paradigm. However, DRL's inefficient use of collected experience restricts its extension to more practical scenarios and complicated tasks. As a complement, model-based reinforcement learning can capture the dynamics of the environment well and reduce the amount of experience that must be sampled. This paper aggregates model-based and model-free reinforcement learning algorithms into an end-to-end framework in which an autoregressive environment model is constructed and an attention layer is incorporated to forecast the state value function. Experiments on classic control tasks such as CartPole-V0 confirm the effectiveness of the proposed framework in simulating the environment and improving the utility of the collected dataset. Finally, a missile-penetration planning mission, as a practical instantiation, is completed successfully with the framework.
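The attention layer mentioned here can be read as the current latent state attending over latents rolled out by the environment model to produce a value estimate. The module below is a hedged PyTorch illustration of that idea; the class name, head count, and shapes are assumptions, not the paper's exact architecture.

```python
# Sketch of an attention-based value head: the critic attends from the current
# latent state over model-predicted future latents to estimate V(s).
import torch
import torch.nn as nn

class AttentionCritic(nn.Module):
    def __init__(self, latent_dim=8, hidden=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(latent_dim, num_heads=1, batch_first=True)
        self.v = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, 1))

    def forward(self, z_now, z_pred):
        # z_now: (B, 1, D) current latent; z_pred: (B, T, D) model rollouts.
        ctx, _ = self.attn(z_now, z_pred, z_pred)  # query the predicted futures
        return self.v(ctx.squeeze(1))              # scalar state value

critic = AttentionCritic()
value = critic(torch.randn(2, 1, 8), torch.randn(2, 5, 8))  # (B, 1) values
```

In an actor-critic setup of the kind the keywords indicate, this value head would replace the usual feed-forward critic, letting the model-based rollouts inform the model-free update.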
Keywords: attention mechanism; deep reinforcement learning; actor-critic algorithm; variational auto-encoder (VAE); mixture density network-recurrent neural network (MDN-RNN)