Hierarchical Reinforcement Learning in Dynamic Environment

Citation: SHEN Jing, CHENG Xiao-bei, LIU Hai-bo, GU Guo-chang, ZHANG Guo-yin. Hierarchical reinforcement learning in dynamic environment[J]. Control Theory & Applications, 2008, 25(1): 71-74
Authors: SHEN Jing  CHENG Xiao-bei  LIU Hai-bo  GU Guo-chang  ZHANG Guo-yin
Affiliation: School of Computer Science and Technology, Harbin Engineering University, Harbin, Heilongjiang 150001, China (all authors)
Funding: China Postdoctoral Science Foundation; scientific research and education-reform funds of Harbin Engineering University
Abstract: Existing reinforcement learning methods cannot satisfactorily handle learning in dynamic environments: whenever the environment changes, the optimal policy must be relearned, and if the interval between changes is shorter than the time the policy needs to converge, the learning algorithm cannot converge at all. Building on the Option framework for hierarchical reinforcement learning, this paper presents a hierarchical reinforcement learning method that adapts to dynamic environments. Exploiting the hierarchical structure of the task, the method attends only to changes in the subgoal states of the task hierarchy and in the environment states inside the currently executing Option, so policy updates are confined to a small local space or to a low-dimensional high-level space, which accelerates learning. Simulation experiments on shortest-path planning between two points in a two-dimensional dynamic grid world show that the method updates its policy markedly faster than earlier methods, and that the convergence of the learning algorithm depends less on the frequency of environment changes.

Keywords: hierarchical reinforcement learning  dynamic environment  Option  policy update
Article ID: 1000-8152(2007)05-0071-04
Received: 2005-12-16
Revised: 2007-01-05
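To make the mechanism in the abstract concrete, here is a minimal sketch (not the authors' code; the two-room layout, all names, and the learning parameters are hypothetical) of Option-based hierarchical Q-learning in a dynamic grid world. The doorway between two rooms is the subgoal; each room is one Option's local state space. When an obstacle appears in one room, only that Option's internal policy is relearned, while the other Option and the high-level subgoal structure are reused unchanged, which is the local-update idea the paper describes.

```python
import random

# Two-room grid world; the doorway DOOR is the subgoal that splits the task
# into two Options. Layout and parameters are illustrative only.
W, H = 9, 5
DOOR, GOAL = (4, 2), (8, 2)
ACTIONS = [(1, 0), (-1, 0), (0, 1), (0, -1)]

def free(cell, obstacles):
    x, y = cell
    if not (0 <= x < W and 0 <= y < H):
        return False
    if x == 4 and cell != DOOR:      # dividing wall with a single doorway
        return False
    return cell not in obstacles

def step(cell, a, obstacles, region):
    nxt = (cell[0] + a[0], cell[1] + a[1])
    return nxt if free(nxt, obstacles) and nxt in region else cell

def learn_option(goal, region, obstacles, episodes=400, alpha=0.5, gamma=0.95, eps=0.2):
    """Q-learning restricted to one Option's local region -- the small-scale
    local space to which the paper confines policy updates."""
    q = {(s, k): 0.0 for s in region for k in range(4)}
    starts = [c for c in region if c not in obstacles]
    for _ in range(episodes):
        s = random.choice(starts)
        for _ in range(200):
            if s == goal:
                break
            k = random.randrange(4) if random.random() < eps else \
                max(range(4), key=lambda a: q[(s, a)])
            s2 = step(s, ACTIONS[k], obstacles, region)
            r = 1.0 if s2 == goal else -0.01
            nxt = 0.0 if s2 == goal else max(q[(s2, a)] for a in range(4))
            q[(s, k)] += alpha * (r + gamma * nxt - q[(s, k)])
            s = s2
    return q

def greedy_path(s, q, region, goal, obstacles, limit=100):
    path = [s]
    while s != goal and len(path) < limit:
        k = max(range(4), key=lambda a: q[(s, a)])
        s = step(s, ACTIONS[k], obstacles, region)
        path.append(s)
    return path

left = [(x, y) for x in range(5) for y in range(H) if free((x, y), set())]
right = [(x, y) for x in range(4, W) for y in range(H) if free((x, y), set())]

obstacles = set()
q_left = learn_option(DOOR, left, obstacles)    # Option 1: start -> doorway
q_right = learn_option(GOAL, right, obstacles)  # Option 2: doorway -> goal

# Environment change confined to the right room: relearn ONLY the affected
# Option; q_left and the subgoal sequence (door, then goal) are reused as-is.
obstacles = {(6, 1), (6, 2), (6, 3)}
q_right = learn_option(GOAL, right, obstacles)

path = greedy_path((0, 2), q_left, left, DOOR, obstacles)
path += greedy_path(DOOR, q_right, right, GOAL, obstacles)[1:]
print(path)
```

In this sketch the speedup comes from the same source the abstract identifies: the obstacle change touches only the right-room Option, so the relearning cost scales with that Option's local region rather than with the whole grid.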

This article is indexed in CNKI, VIP (Weipu), Wanfang Data, and other databases.