Proximal Policy Optimization with Double Clipping Boundaries


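Cite this article:	ZHANG Jun, WANG Hong-Cheng. Proximal Policy Optimization with Double Clipping Boundaries[J]. Computer Systems & Applications, 2023, 32(4): 177-186.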

Authors:	ZHANG Jun, WANG Hong-Cheng

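Affiliation:	School of Electrical Engineering and Intelligentization, Dongguan University of Technology, Dongguan 523808, China; School of Computer Science and Technology, Dongguan University of Technology, Dongguan 523808, China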

Funding:	Key Scientific Research Platforms and Projects of Guangdong Provincial Universities (2020ZDZX3075)

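Abstract:	Proximal policy optimization (PPO) is a stable deep reinforcement learning algorithm. One of its key elements is a clipped surrogate objective that limits the update step size. Experiments show that, with the empirically best clipping coefficient, the Kullback-Leibler (KL) divergence cannot be bounded from above, which runs contrary to trust region optimization theory. This paper proposes an improved algorithm, proximal policy optimization with double clipping boundaries (PPO-DC). The algorithm adjusts the KL divergence through two probability-based clipping boundaries and confines the parameters to the trust region, so that the sample data are fully utilized. On several continuous control tasks, PPO-DC outperforms the other algorithms compared.
Keywords:	reinforcement learning; policy gradient; proximal policy optimization; clipping mechanism
Received:	2022/8/23
Revised:	2022/9/27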
Proximal Policy Optimization with Double Clipping Boundaries

ZHANG Jun, WANG Hong-Cheng. Proximal Policy Optimization with Double Clipping Boundaries[J]. Computer Systems & Applications, 2023, 32(4): 177-186.

Authors:	ZHANG Jun, WANG Hong-Cheng

Affiliation:	School of Electrical Engineering and Intelligentization, Dongguan University of Technology, Dongguan 523808, China; School of Computer Science and Technology, Dongguan University of Technology, Dongguan 523808, China

Abstract:	Proximal policy optimization (PPO) is a stable deep reinforcement learning algorithm. A key element of the algorithm is a clipped surrogate objective that limits the size of each policy update. Experiments show that when the empirically optimal clipping coefficient is used, the Kullback-Leibler (KL) divergence cannot be bounded from above, which contradicts trust region optimization theory. In this study, an improved algorithm, PPO with double clipping boundaries (PPO-DC), is proposed. The algorithm adjusts the KL divergence through two probability-based clipping boundaries and confines the parameters to the trust region, so that the sample data are fully utilized. In several continuous control tasks, the PPO-DC algorithm achieves better performance than the other algorithms compared.

Keywords:	reinforcement learning; policy gradient (PG); proximal policy optimization (PPO); clipping mechanism
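
For readers unfamiliar with the clipping mechanism the abstract refers to, the following is a minimal sketch of the standard PPO clipped surrogate term for a single sample, together with a KL divergence computation for a discrete policy. It is not the PPO-DC method: the abstract does not specify the two probability-based boundaries, so they are not reproduced here. The function names, the three-action policy, and all numbers are hypothetical and chosen only for illustration. The example shows one way a per-sample probability ratio can sit inside the clip range while the KL divergence between the old and new policies is still clearly nonzero.

```python
import math


def clipped_surrogate(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate term for one sample:
    min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical 3-action policy before and after an update (illustrative only).
old_policy = [0.98, 0.01, 0.01]
new_policy = [0.80, 0.19, 0.01]

# Suppose the sampled action is the third one: its probability is unchanged,
# so the ratio is 1.0 and lies inside the clip range [1 - eps, 1 + eps].
sampled_action = 2
ratio = new_policy[sampled_action] / old_policy[sampled_action]
objective = clipped_surrogate(ratio, advantage=1.0)

# The KL divergence over the whole action distribution is not controlled
# by that single ratio.
kl = kl_divergence(old_policy, new_policy)

print("ratio:", ratio)
print("clipped surrogate term:", objective)
print("KL(old || new):", kl)
```

The clip only looks at the ratio of the sampled action, which is why a separate check on the KL divergence is informative in a sketch like this.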
