Design and Convergence Analysis of a Heuristic Reward Function for Reinforcement Learning Algorithms
Cite this article: WEI Ying-Zi, ZHAO Ming-Yang. Design and Convergence Analysis of a Heuristic Reward Function for Reinforcement Learning Algorithms[J]. Computer Science, 2005, 32(3): 190-193.
Authors: WEI Ying-Zi, ZHAO Ming-Yang
Affiliations: Robotics Laboratory, Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang 110016; Shenyang University of Technology, Shenyang 110168; Graduate School of the Chinese Academy of Sciences, Beijing 100039
Foundation items: Innovation Fund of the Advanced Manufacturing Base, Chinese Academy of Sciences (F010120); National 973 Program (2002CB312200)


Abstract: The reward function is a critical component of reinforcement learning (RL): it evaluates actions and guides the learning process. According to the distribution of rewards over the state space, reward functions take two basic forms, dense and sparse, which affect the performance of RL algorithms differently; learning a value function under a sparse reward function is harder than under a dense one. This paper proposes the design of a heuristic reward function. In practice, a heuristic reward function supplies the learning system with additional rewards beyond those of the underlying Markov Decision Process (MDP): for a transition between states, we can add a reward expressible as the difference in value of an arbitrary potential function applied to those states. The additional reward function F, defined on state transitions, is thus a difference of conservative potentials. This additional training reward provides heuristic information that guides the learning system to progress rapidly, and the gradient inherent in heuristic reward functions gives more leverage when learning the value function. A proof of convergence of Q-value iteration is also presented under a more general MDP model. The heuristic reward function helps to implement an efficient reinforcement learning system for real-time control or scheduling.
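The scheme described in the abstract matches potential-based reward shaping (Ng, Harada and Russell, 1999): the extra reward for a transition from state s to state s' is F(s, s') = γΦ(s') − Φ(s) for an arbitrary potential function Φ. The sketch below is a minimal illustration of this idea in tabular Q-learning, not the authors' implementation; the chain world, the distance-to-goal potential, and all hyperparameters are assumptions chosen for the example.

```python
import random
from collections import defaultdict

GAMMA = 0.95    # discount factor (assumed)
ALPHA = 0.1     # learning rate (assumed)
EPSILON = 0.1   # exploration rate (assumed)
GOAL = 10       # goal state of a simple 1-D chain world (assumed)

def potential(state):
    """Heuristic potential Phi(s): negative distance to the goal."""
    return -abs(GOAL - state)

def shaped_reward(reward, state, next_state):
    """Base MDP reward plus F(s, s') = gamma * Phi(s') - Phi(s)."""
    return reward + GAMMA * potential(next_state) - potential(state)

def step(state, action):
    """Chain-world dynamics: move -1 or +1; sparse base reward at the goal."""
    next_state = max(0, min(GOAL, state + action))
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

def q_learning(episodes=500):
    q = defaultdict(float)  # Q-values keyed by (state, action)
    actions = (-1, 1)
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # epsilon-greedy action selection
            if random.random() < EPSILON:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: q[(state, a)])
            next_state, reward, done = step(state, action)
            # learn from the shaped reward, not the sparse base reward
            target = shaped_reward(reward, state, next_state)
            if not done:
                target += GAMMA * max(q[(next_state, a)] for a in actions)
            q[(state, action)] += ALPHA * (target - q[(state, action)])
            state = next_state
    return q

if __name__ == "__main__":
    q = q_learning()
    greedy = [max((-1, 1), key=lambda a: q[(s, a)]) for s in range(GOAL)]
    print("greedy actions along the chain:", greedy)
```

Because F is a difference of potentials, it telescopes along any trajectory, so the shaped MDP keeps the same optimal policy as the original one; the shaping only densifies the learning signal, which is why sparse-reward tasks benefit most.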
Keywords: Reinforcement learning; Reward function; Markov decision process; Policy; Convergence
This article is indexed in CNKI, VIP (Weipu), Wanfang Data, and other databases.