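Decision-making method for mineral processing operational indices based on multi-action parallel asynchronous deep deterministic policy gradient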

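Cite this article:	LI Qiao-ran, DING Jin-liang. Multi-action parallel asynchronous depth deterministic strategy gradient based decision-making approach of operational indices for mineral processing[J]. Control and Decision, 2022, 37(8): 1989-1996.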

Authors:	LI Qiao-ran, DING Jin-liang

Affiliation:	State Key Laboratory of Synthetical Automation for Process Industries, Northeastern University, Shenyang 110004, China

Funding:	National Key Research and Development Program of China (2018YFB1701104); Science and Technology Project of Liaoning Province (2020JH1/10100008).

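Abstract:	To address the insufficient exploration ability of the deep deterministic policy gradient (DDPG) algorithm, a multi-action parallel asynchronous deep deterministic policy gradient (MPADDPG) algorithm is proposed and applied to reinforcement-learning-based decision-making of operational indices for mineral processing. The algorithm uses multiple actor networks, each initialized and trained differently, which improves exploration to varying degrees; by extending the critic architecture of the deterministic policy gradient structure, it also reveals the relationship between exploration and exploitation. Using multiple DDPGs in place of a single DDPG mitigates the effect of any one poorly performing DDPG and improves learning stability. The parallel asynchronous structure improves data utilization efficiency and speeds up network convergence. Finally, the actors obtain better policy gradients by influencing the critic's update. Experimental results on decision-making of operational indices in a mineral processing plant verify the effectiveness of the proposed algorithm.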
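Keywords:	mineral processing; operational indices decision-making; multi-action; parallel asynchronous; deep deterministic policy gradient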
Multi-action parallel asynchronous depth deterministic strategy gradient based decision-making approach of operational indices for mineral processing

LI Qiao-ran,DING Jin-liang.Multi-action parallel asynchronous depth deterministic strategy gradient based decision-making approach of operational indices for mineral processing[J].Control and Decision,2022,37(8):1989-1996.

Authors:	LI Qiao-ran, DING Jin-liang

Affiliation:	State Key Laboratory of Synthetical Automation for Process Industries, Northeastern University, Shenyang 110004, China

Abstract:	To address the insufficient exploration ability of the deep deterministic policy gradient (DDPG) algorithm, a multi-action parallel asynchronous deep deterministic policy gradient (MPADDPG) algorithm is proposed for reinforcement-learning-based decision-making of operational indices in mineral processing. The algorithm uses multiple actor networks with different initialization and training, which improves exploration to varying degrees. Extending the critic architecture of the deterministic policy gradient structure reveals the relationship between exploration and exploitation. Using multiple DDPGs instead of a single DDPG mitigates the effect of one poorly performing DDPG and improves learning stability. The parallel asynchronous structure also improves data utilization efficiency and speeds up network convergence. Finally, the actors obtain better policy gradients by influencing the critic's update. Experimental results on decision-making of operational indices in mineral processing verify the effectiveness of the proposed approach.
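
The abstract does not say how the proposals of the multiple actors are combined, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that each actor is a differently seeded linear policy, that a single shared critic scores every proposed action, and that the action with the highest critic value is executed. The names `make_actor`, `critic`, and `select_action` are hypothetical, and the critic here is a fixed stand-in function rather than a trained network.

```python
import math
import random


def make_actor(seed, state_dim):
    """Hypothetical linear actor a = tanh(w . s); the seed gives each actor
    a different initialization (an assumption, not the paper's network)."""
    rng = random.Random(seed)
    weights = [rng.uniform(-1.0, 1.0) for _ in range(state_dim)]

    def actor(state):
        z = sum(w * s for w, s in zip(weights, state))
        return math.tanh(z)  # scalar action bounded in [-1, 1]

    return actor


def critic(state, action):
    """Stand-in for a learned Q(s, a): highest when the action equals the
    mean of the state. Chosen only to make the sketch runnable."""
    target = sum(state) / len(state)
    return -(action - target) ** 2


def select_action(actors, state):
    """Let every actor propose an action, score each proposal with the
    shared critic, and return the index and action with the largest Q."""
    proposals = [actor(state) for actor in actors]
    q_values = [critic(state, a) for a in proposals]
    best = max(range(len(actors)), key=lambda i: q_values[i])
    return best, proposals[best], q_values


# Four actors that differ only in their random initialization.
actors = [make_actor(seed, state_dim=3) for seed in range(4)]
state = [0.2, -0.1, 0.5]
best, action, q_values = select_action(actors, state)
```

Because the actors start from different weights, they propose different actions for the same state, which is one simple way that multiple actors can widen exploration. Training of the actors and the critic, as well as the parallel asynchronous data collection, is omitted here.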
