Research on Q-Learning Algorithm Based on Metropolis Criterion
Cite this article: GUO Mao-Zu, WANG Ya-Dong, SUN Hua-Mei, LIU Yang. Research on Q-Learning Algorithm Based on Metropolis Criterion[J]. Journal of Computer Research and Development, 2002, 39(6): 684-688.
Authors: GUO Mao-Zu, WANG Ya-Dong, SUN Hua-Mei, LIU Yang
作者单位:1. 哈尔滨工业大学计算机科学与技术学院,哈尔滨,150001;哈尔滨工业大学管理学院,哈尔滨,150001
2. 哈尔滨工业大学计算机科学与技术学院,哈尔滨,150001
3. 哈尔滨工业大学管理学院,哈尔滨,150001
Funding: Supported by the National High Technology Research and Development Program of China (863 Program) (2001AA115550), the National Natural Science Foundation of China (70071008), and the China Postdoctoral Science Foundation.
Abstract: The trade-off between exploration and exploitation is a key problem in action selection for Q-learning. Pure exploitation quickly traps the agent in a local optimum; exploration can escape local optima and accelerate learning, but excessive exploration degrades the algorithm's performance. By formulating the search for an optimal policy in Q-learning as the search for an optimal solution to a combinatorial optimization problem, the Metropolis criterion of the simulated annealing algorithm is applied to the compromise between exploration and exploitation in Q-learning, and SA-Q-learning, a Q-learning algorithm based on the Metropolis criterion, is proposed. Comparison shows that it converges faster than standard Q-learning and avoids the performance degradation caused by excessive exploration.
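
For reference, the Metropolis criterion invoked here is the standard acceptance rule of simulated annealing. How the paper maps Q-value differences onto the energy gap ΔE is not spelled out in the abstract, so the following is only the generic form:

```latex
P(\text{accept candidate}) =
\begin{cases}
1, & \Delta E \le 0, \\
\exp\!\left(-\Delta E / T\right), & \Delta E > 0,
\end{cases}
```

where ΔE is the increase in cost of the candidate over the current solution and T is a temperature parameter annealed toward zero, so worse moves are accepted freely at high temperature and the rule becomes greedy as T → 0.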

Keywords: machine learning, Metropolis criterion, Q-learning algorithm

RESEARCH ON Q-LEARNING ALGORITHM BASED ON METROPOLIS CRITERION
GUO Mao-Zu, WANG Ya-Dong, SUN Hua-Mei, and LIU Yang. RESEARCH ON Q-LEARNING ALGORITHM BASED ON METROPOLIS CRITERION[J]. Journal of Computer Research and Development, 2002, 39(6): 684-688.
Authors: GUO Mao-Zu, WANG Ya-Dong, SUN Hua-Mei, LIU Yang
Abstract: The balance between exploration and exploitation is one of the key problems in action selection in Q-learning. Pure exploitation causes the agent to fall into a local optimum quickly, whereas excessive exploration degrades the performance of the Q-learning algorithm even though it can accelerate the learning process and help escape local optima. In this paper, finding the optimal policy in Q-learning is cast as searching for the optimal solution to a combinatorial optimization problem. The Metropolis criterion of the simulated annealing algorithm is then introduced to balance exploration and exploitation in Q-learning, and the corresponding Q-learning algorithm based on the Metropolis criterion, SA-Q-learning, is presented. Finally, experiments show that SA-Q-learning converges more quickly than standard Q-learning and avoids the performance degradation caused by excessive exploration.
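
The abstract does not reproduce the algorithm itself; the sketch below illustrates one plausible reading of the described action-selection rule. It is a minimal illustration, not the authors' implementation: the environment interface (`reset`/`step`), the geometric cooling schedule, and the tabular `defaultdict` Q-table are all assumptions made here for the sake of a self-contained example.

```python
import math
import random
from collections import defaultdict

def metropolis_action(Q, state, actions, T):
    """Metropolis-style action selection (sketch of the SA-Q-learning idea).

    The greedy action is the exploitation choice; a uniformly random
    candidate is accepted with probability exp(-(Q_greedy - Q_rand) / T),
    so high T favors exploration and T -> 0 recovers greedy selection.
    """
    a_greedy = max(actions, key=lambda a: Q[(state, a)])
    a_rand = random.choice(actions)
    delta = Q[(state, a_greedy)] - Q[(state, a_rand)]  # >= 0 by construction
    if random.random() < math.exp(-delta / T):
        return a_rand    # accept the exploratory candidate
    return a_greedy      # otherwise exploit

def sa_q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9,
                  T0=1.0, cooling=0.99, T_min=1e-3):
    """Tabular Q-learning loop with Metropolis action selection (sketch).

    `env` is assumed to expose reset() -> state and
    step(action) -> (next_state, reward, done).
    """
    Q = defaultdict(float)
    T = T0
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            action = metropolis_action(Q, state, actions, T)
            next_state, reward, done = env.step(action)
            # Standard one-step Q-learning update (Watkins, 1989).
            target = reward + gamma * max(Q[(next_state, b)] for b in actions)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
        T = max(T * cooling, T_min)  # geometric annealing (an assumption)
    return Q
```

Under this reading, exploration is not cut off abruptly: non-greedy actions remain possible at any temperature but become exponentially less likely as T is annealed, which matches the abstract's claim of avoiding the performance loss caused by excessive exploration.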
Keywords: reinforcement learning, Q-learning, Metropolis criterion, exploration, exploitation