Reinforcement Exploration Strategy Based on Best Sub-Strategy Memory
Citation: ZHOU Ruipeng, QIN Jin. Reinforcement Exploration Strategy Based on Best Sub-Strategy Memory[J]. Computer Engineering, 2022, 48(2): 106-112.
Authors: ZHOU Ruipeng, QIN Jin
Affiliation: College of Computer Science and Technology, Guizhou University, Guiyang 550025, China
Funding: National Natural Science Foundation of China (61562009); Science and Technology Foundation of Guizhou Province (Qiankehe Support [2020]3Y004)
Abstract: Existing reinforcement learning exploration strategies suffer from over-exploration, which slows the convergence of the agent. To address this problem, a reward-sorted storage table (M table) is designed and the ε-greedy algorithm is improved, yielding a reinforcement exploration strategy based on best sub-strategy memory. Samples with reward values greater than zero are stored in the M table in the form of sub-strategies and sorted in descending order of reward. Throughout training, sub-strategies in the table are replaced by similar samples with higher rewards, so the table gradually accumulates a set of actions that effectively produce the current best rewards, making exploration targeted rather than random. On top of the ε-greedy algorithm, the exploration probability is split so that the agent also explores by consulting the M table, which yields the M-Epsilon-Greedy (MEG) exploration strategy. Under this strategy, with a certain probability the agent matches the current state against the sub-strategies in the M table; if a similar sub-strategy is found, its associated action is returned to the agent and executed. Experimental results show that the proposed strategy effectively alleviates over-exploration and, compared with DQN-series algorithms and the non-DQN A2C algorithm, achieves higher average rewards on Atari 2600 game control problems.
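The following is a minimal Python sketch of the MEG mechanism described in the abstract, assuming cosine similarity as the state-similarity measure and a fixed probability split between M-table lookup and uniform random exploration; the class name MEGExploration, its method names, and all default parameter values are illustrative assumptions rather than the paper's actual implementation.

import random

import numpy as np


class MEGExploration:
    # Sketch of the M-table plus epsilon-greedy (MEG) idea: positive-reward
    # sub-strategies are kept sorted by reward, and a share of the exploration
    # probability is spent looking up a similar stored state instead of acting
    # uniformly at random.

    def __init__(self, capacity=100, epsilon=0.1, meg_prob=0.5, sim_threshold=0.9):
        self.capacity = capacity            # maximum number of stored sub-strategies
        self.epsilon = epsilon              # total exploration probability
        self.meg_prob = meg_prob            # share of exploration routed through the M table
        self.sim_threshold = sim_threshold  # minimum cosine similarity for a match
        self.m_table = []                   # entries: (reward, state, action), sorted by reward desc

    def store(self, state, action, reward):
        # Only samples with reward greater than zero are kept as sub-strategies.
        if reward <= 0:
            return
        self.m_table.append((float(reward), np.asarray(state, dtype=np.float32), int(action)))
        self.m_table.sort(key=lambda entry: entry[0], reverse=True)
        del self.m_table[self.capacity:]    # drop the lowest-reward entries beyond capacity

    def _similar_action(self, state):
        # Return the action of the highest-reward stored state that is similar enough.
        s = np.asarray(state, dtype=np.float32).ravel()
        for _, stored, action in self.m_table:      # already in descending reward order
            t = stored.ravel()
            denom = np.linalg.norm(s) * np.linalg.norm(t)
            if denom > 0 and float(np.dot(s, t)) / denom >= self.sim_threshold:
                return action
        return None

    def select_action(self, state, greedy_action, n_actions):
        # Exploit with probability 1 - epsilon; otherwise explore, preferring the M table.
        if random.random() >= self.epsilon:
            return greedy_action
        if random.random() < self.meg_prob:
            action = self._similar_action(state)
            if action is not None:
                return action
        return random.randrange(n_actions)

In a DQN-style agent, select_action would stand in for the usual ε-greedy choice, with greedy_action taken from the argmax of the Q-network and store called whenever a transition yields a positive reward.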

Keywords: reinforcement learning; over-exploration; M-Epsilon-Greedy (MEG) exploration; similarity; best sub-strategy
Received: 2020-12-04
Revised: 2021-01-28
