
Recursive least-squares temporal-difference reinforcement learning algorithm based on improved ELM and its application
Cite this article: XU Yuan, HUANG Bingming, HE Yanlin. Recursive least-squares temporal-difference reinforcement learning algorithm based on improved ELM and its application[J]. Journal of Chemical Industry and Engineering (China), 2017, 68(3): 916-924.
Authors: XU Yuan  HUANG Bingming  HE Yanlin
Affiliation: School of Information Science & Technology, Beijing University of Chemical Technology, Beijing 100029, China
Funding: National Natural Science Foundation of China (61573051, 61472021); Open Fund of the State Key Laboratory of Software Development Environment (SKLSDE-2015KF-01); Fundamental Research Funds for the Central Universities (PT1613-05).
Abstract: To meet the accuracy and computation-time requirements of value-function approximation algorithms, a recursive least-squares temporal-difference reinforcement learning algorithm based on an improved extreme learning machine is proposed. First, a recursive scheme is introduced into the least-squares temporal-difference (LSTD) reinforcement learning algorithm to eliminate the matrix inversion of the least-squares step, forming a recursive least-squares temporal-difference algorithm with reduced complexity and computational cost. Second, since LSTD(0) converges slowly, eligibility traces are added to raise sample utilization and accelerate convergence, forming the LSTD(λ) algorithm, which guarantees convergence to the true values after experiencing the same number of trajectories. Meanwhile, since the value function in most reinforcement learning problems is monotonic while the conventional ELM typically uses the Sigmoid activation function, whose double-sided suppression increases the computational cost, the Softplus activation function with single-sided suppression is adopted in place of the Sigmoid function to cut computation and raise speed, so the algorithm improves computing speed while also improving accuracy. Comparative experiments on the generalized Hop-world problem against the conventional least-squares reinforcement learning algorithm based on radial basis functions and the least-squares TD algorithm based on the extreme learning machine show that the proposed algorithm effectively improves computing speed while meeting the accuracy requirement, and under some conditions even achieves higher accuracy than the other two algorithms.
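To make the recursion in the abstract concrete, below is a minimal Python sketch of recursive least-squares TD(λ): the inverse of the LSTD matrix is maintained directly via the Sherman-Morrison identity, so no matrix is inverted during learning, and an eligibility trace accumulates features across the trajectory. The class name, parameter names, and the δ-scaled identity initialization of P are illustrative assumptions, not the paper's exact notation.

```python
import numpy as np

class RLSTDLambda:
    """Minimal sketch of recursive least-squares TD(lambda).

    Maintains P ≈ A⁻¹ with rank-1 Sherman-Morrison updates, so each
    step avoids explicit matrix inversion. Illustrative, not the
    paper's exact formulation.
    """

    def __init__(self, n_features, gamma=0.95, lam=0.8, delta=1.0):
        self.gamma, self.lam = gamma, lam
        self.theta = np.zeros(n_features)      # value-function weights
        self.P = np.eye(n_features) / delta    # running estimate of A⁻¹
        self.z = np.zeros(n_features)          # eligibility trace

    def update(self, phi, reward, phi_next):
        # Accumulate the eligibility trace along the trajectory.
        self.z = self.gamma * self.lam * self.z + phi
        dphi = phi - self.gamma * phi_next     # temporal-difference feature
        Pz = self.P @ self.z
        k = Pz / (1.0 + dphi @ Pz)             # RLS gain vector
        # TD error under current weights, then rank-1 corrections to
        # both the weights and the inverse matrix (Sherman-Morrison).
        self.theta += k * (reward - dphi @ self.theta)
        self.P -= np.outer(k, dphi @ self.P)

    def value(self, phi):
        return phi @ self.theta
```

Each `update` call costs O(n²) in the number of features, versus the O(n³) of re-solving the LSTD system by direct inversion, which is the computational saving the abstract attributes to the recursive scheme.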

Key words: reinforcement learning  activation function  recursive least-squares algorithm  function approximation  generalized Hop-world problem
Received: 2016-11-03
Revised: 2016-11-08

Recursive least-squares TD (λ) learning algorithm based on improved extreme learning machine
XU Yuan, HUANG Bingming, HE Yanlin. Recursive least-squares TD (λ) learning algorithm based on improved extreme learning machine[J]. Journal of Chemical Industry and Engineering (China), 2017, 68(3): 916-924.
Authors: XU Yuan  HUANG Bingming  HE Yanlin
Affiliation: School of Information Science & Technology, Beijing University of Chemical Technology, Beijing 100029, China
Abstract: To meet the accuracy and computation-time requirements of value-function approximation algorithms, a recursive least-squares temporal-difference reinforcement learning algorithm with eligibility traces based on an improved extreme learning machine (RLSTD(λ)-IELM) was proposed. First, a recursive scheme was introduced into the least-squares temporal-difference algorithm (LSTD) to form the recursive least-squares temporal-difference algorithm (RLSTD), eliminating the matrix inversion of the least-squares step and reducing the complexity and computational load of the proposed algorithm. Then, eligibility traces were introduced into the RLSTD algorithm to form the recursive least-squares temporal-difference algorithm with eligibility traces (RLSTD(λ)), addressing the slow convergence of LSTD(0) and its inefficient exploitation of experience. Furthermore, since the value function in most reinforcement learning problems is monotonic, the Softplus activation function, which has single-sided suppression, was used to replace the Sigmoid activation function in the extreme learning machine network, reducing the computational load and improving computing speed. Experimental results on the generalized Hop-world problem demonstrate that the proposed RLSTD(λ)-IELM algorithm computes faster than the least-squares temporal-difference learning algorithm based on the extreme learning machine (LSTD-ELM) and achieves better accuracy than the least-squares temporal-difference learning algorithm based on radial basis functions (LSTD-RBF).
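To illustrate the activation-function substitution described above, here is a sketch of an ELM-style random feature map using a numerically stable Softplus in place of the Sigmoid. The hidden-layer size, weight ranges, and function names are assumptions for illustration; the resulting features phi would feed the RLSTD(λ) recursion sketched earlier.

```python
import numpy as np

def softplus(x):
    # softplus(x) = log(1 + e^x), written in the numerically stable
    # form max(x, 0) + log1p(e^{-|x|}). Single-sided suppression:
    # near-zero for large negative inputs, near-linear for positive.
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

def elm_features(states, W, b):
    """Random-hidden-layer ELM feature map with Softplus activation.

    W (input weights) and b (biases) are drawn once at random and
    never trained; only the output weights are fit, e.g. by the
    RLSTD(lambda) recursion. A sketch, not the paper's exact setup.
    """
    return softplus(states @ W + b)

# Hypothetical configuration for demonstration only.
rng = np.random.default_rng(0)
n_inputs, n_hidden = 4, 50
W = rng.uniform(-1.0, 1.0, size=(n_inputs, n_hidden))
b = rng.uniform(-1.0, 1.0, size=n_hidden)
phi = elm_features(rng.normal(size=(1, n_inputs)), W, b)  # shape (1, 50)
```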
Keywords: reinforcement learning  activation function  recursive least-squares methods  function approximation  generalized Hop-world problem