
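Intrinsic reward-based skill acquisition and combination approach
Cite this article: Zhao Ying, Qin Jin. Intrinsic reward-based skill acquisition and combination approach[J]. Application Research of Computers, 2022, 39(12).
Authors: Zhao Ying, Qin Jin
Affiliation: School of Computer Science and Technology, Guizhou University (both authors)
Funding: Guizhou Provincial Science and Technology Foundation project (黔科合基础[2020]1Y275); Guizhou Provincial Science and Technology Plan project (黔科合基础[2019]1130号)
Abstract: Existing intrinsic rewards gradually vanish as the agent keeps exploring the environment, so the agent can no longer use the intrinsic reward signal to guide its search for the optimal policy. To address this problem, this paper proposes an intrinsic reward-based skill acquisition and combination method. First, the method looks for positive states during the agent's interaction with the environment and selects subgoals from among those positive states. Second, it discovers skills from a trajectory that runs from the initial state to a subgoal and from the subgoal to the terminal state, and combines the one, or two or more, subgoals that appear in a skill. Finally, it evaluates each skill using the distance from the initial state to the subgoal and the cumulative reward from the initial state to the subgoal. The method achieves a relatively high average reward in MuJoCo environments, and it still achieves a good average reward when the extrinsic reward is delayed. This indicates that the proposed subgoals and skills effectively address the problem that, once the intrinsic reward has vanished, the agent cannot use the intrinsic reward signal to learn the optimal policy.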

Keywords: positive state; subgoal; skill; skill evaluation
Received: 2022/4/4
Revised: 2022/11/18

Intrinsic reward-based skill acquisition and combination approach
Zhao Ying and Qin Jin. Intrinsic reward-based skill acquisition and combination approach[J]. Application Research of Computers, 2022, 39(12).
Authors: Zhao Ying and Qin Jin
Affiliation: School of Computer Science and Technology, Guizhou University
Abstract: Existing intrinsic rewards gradually disappear as the agent continues to explore the environment, which leaves the agent unable to use the intrinsic reward signal to guide its search for the optimal policy. This paper proposes an intrinsic reward-based skill acquisition and combination approach to address this issue. First, the method searches for positive states during the interaction between the agent and the environment and selects subgoals from those positive states. Second, it discovers skills from a trajectory that runs from the initial state to a subgoal and from the subgoal to the terminal state, and combines the one, or two or more, subgoals that appear in a skill. Finally, it evaluates each skill using the distance from the initial state to the subgoal and the cumulative reward from the initial state to the subgoal. The method achieves a high average reward in MuJoCo environments and still performs well when the extrinsic reward is delayed. This shows that the proposed subgoals and skills can effectively solve the problem that the agent cannot use the intrinsic reward signal to learn the optimal policy after the intrinsic reward disappears.
Keywords:positive state  subgoal  skill  skill assessment
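
The abstract says each skill is evaluated using the distance from the initial state to the subgoal and the cumulative reward collected over the same stretch, but it does not say how the two quantities are measured or combined. The Python sketch below is only an illustration under stated assumptions: distance is taken to be the number of environment steps, the cumulative reward is undiscounted, and the combined score (reward minus a weighted distance) together with the names `SkillEvaluation`, `evaluate_skills`, and `distance_weight` are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class SkillEvaluation:
    subgoal_index: int        # index of the subgoal state within the trajectory
    distance: int             # steps from the initial state to the subgoal (assumed metric)
    cumulative_reward: float  # undiscounted reward collected before reaching the subgoal
    score: float              # hypothetical combined score, not defined in the paper


def evaluate_skills(
    rewards: Sequence[float],
    subgoal_indices: Sequence[int],
    distance_weight: float = 0.1,
) -> List[SkillEvaluation]:
    """Rank candidate skills along one trajectory.

    rewards[t] is the reward received on the transition from state t to
    state t + 1, so reaching the state at index i takes i steps and
    collects rewards[0:i].
    """
    evaluations = []
    for idx in subgoal_indices:
        if not 0 < idx <= len(rewards):
            raise ValueError(f"subgoal index {idx} lies outside the trajectory")
        distance = idx
        cumulative_reward = float(sum(rewards[:idx]))
        # Assumed trade-off: prefer subgoals that are rewarding and close.
        score = cumulative_reward - distance_weight * distance
        evaluations.append(
            SkillEvaluation(idx, distance, cumulative_reward, score)
        )
    # Highest score first.
    return sorted(evaluations, key=lambda e: e.score, reverse=True)


if __name__ == "__main__":
    ranked = evaluate_skills([0.0, 1.0, 0.0, 2.0], [2, 4], distance_weight=0.5)
    for item in ranked:
        print(item)
```

Any other way of combining the two quantities, for example a ratio of reward to distance, would fit the abstract equally well; the linear form is used here only because it is easy to inspect.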