Expectation-maximization Policy Search with Parameter-based Exploration

Citation: CHENG Yu-Hu, FENG Huan-Ting, WANG Xue-Song. Expectation-maximization policy search with parameter-based exploration[J]. Acta Automatica Sinica, 2012, 38(1): 38-45.
Authors: CHENG Yu-Hu, FENG Huan-Ting, WANG Xue-Song
Affiliation: School of Information and Electrical Engineering, China University of Mining and Technology, Xuzhou 221116
Funding: National Natural Science Foundation of China (60804022, 60974050, 61072094); Program for New Century Excellent Talents in University, Ministry of Education (NCET-08-0836, NCET-10-0765); Fok Ying-Tung Education Foundation Fund for Young Teachers (121066)
Received: 2011-05-24
Revised: 2011-08-30

Abstract: To reduce the large variance of gradient estimation caused by stochastic exploration, an expectation-maximization (EM) policy search reinforcement learning method with parameter-based exploration is proposed. First, a policy is defined as a probability distribution over the parameters of a controller. Then, samples are collected by repeatedly sampling controller parameters directly from this distribution. Because the actions selected within each episode are deterministic, sampling from the defined policy yields low-variance samples, which in turn reduces the variance of the gradient estimate. Finally, based on the collected samples, the policy parameters are updated iteratively by maximizing a lower bound on the expected return. To shorten sampling time and lower sampling cost, an importance sampling technique is used to reuse samples collected during previous policy updates. Simulation results on two continuous-space control problems show that, compared with several policy search reinforcement learning methods based on stochastic action exploration, the proposed method not only learns an optimal policy but also converges faster, and thus achieves better learning performance.
Keywords: policy search; reinforcement learning; parameter space; exploration; expectation-maximization (EM); importance sampling
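The update loop described in the abstract can be sketched on a toy one-parameter control task. Everything below is illustrative and not the authors' implementation: the quadratic return peaked at 2.0, the weighting constant `beta`, the variance floor, and all function names are assumptions made for the sketch.

```python
import math
import random

def episode_return(theta):
    """Deterministic rollout: the controller with parameter theta
    earns a higher return the closer theta is to the optimum 2.0
    (a hypothetical stand-in for a real control task)."""
    return -(theta - 2.0) ** 2

def em_policy_search(mu=0.0, sigma=1.0, n_samples=200, n_iters=30,
                     beta=1.0, sigma_floor=0.1, seed=0):
    rng = random.Random(seed)
    for _ in range(n_iters):
        # Parameter-based exploration: sample controller parameters
        # from the policy N(mu, sigma^2); each rollout is deterministic,
        # so no per-step action noise enters the samples.
        thetas = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        returns = [episode_return(t) for t in thetas]
        # M-step: maximize the lower bound of the expected return via
        # reward-weighted averaging (weights shifted by the max return
        # for numerical stability).
        r_max = max(returns)
        weights = [math.exp(beta * (r - r_max)) for r in returns]
        total = sum(weights)
        mu = sum(w * t for w, t in zip(weights, thetas)) / total
        var = sum(w * (t - mu) ** 2 for w, t in zip(weights, thetas)) / total
        sigma = max(sigma_floor, math.sqrt(var))  # keep some exploration
    return mu

learned_mu = em_policy_search()
```

The paper additionally reuses samples from earlier policy updates by importance weighting (the density ratio of the new to the old parameter distribution); that reuse step is omitted here for brevity.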
This article is indexed in CNKI and other databases.
