Probably approximately correct reinforcement learning solving continuous-state control problem

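Cite this article:	ZHU Yuan-heng, ZHAO Dong-bin. Probably approximately correct reinforcement learning solving continuous-state control problem[J]. Control Theory & Applications, 2016, 33(12): 1603-1613.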

Authors:	ZHU Yuan-heng, ZHAO Dong-bin

Affiliations:	Institute of Automation, Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences

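Funding:	Supported by the National Natural Science Foundation of China (61273136, 61573353, 61533017, 61603382) and the Excellent Talent Fund of the State Key Laboratory of Management and Control for Complex Systems.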

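Abstract:	Online learning time is an important metric for reinforcement learning (RL) algorithms. Conventional online RL algorithms such as Q-learning and state-action-reward-state-action (SARSA) cannot provide, through theoretical analysis, a quantitative upper bound on the online learning time. This paper introduces the probably approximately correct (PAC) principle and designs data-based online RL algorithms for continuous-time deterministic systems. These algorithms record online data efficiently while accounting for the exploration of the state space that RL requires, and they can output near-optimal control within a finite online learning time. We propose two implementations of the algorithm, which use state discretization and the kd-tree (k-dimensional tree) technique, respectively, to store data and compute the online policy. Finally, we apply the two proposed algorithms to motion control of a two-link manipulator, observe their performance, and compare them.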
Keywords:	reinforcement learning; probably approximately correct; kd-tree; two-link manipulator
Received:	2016/7/14
Revised:	2016/10/10
Probably approximately correct reinforcement learning solving continuous-state control problem

ZHU Yuan-heng and ZHAO Dong-bin. Probably approximately correct reinforcement learning solving continuous-state control problem[J]. Control Theory & Applications, 2016, 33(12): 1603-1613.

Authors:	ZHU Yuan-heng and ZHAO Dong-bin

Affiliation:	Institute of Automation, Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences

Abstract:	One important factor of reinforcement learning (RL) algorithms is the online learning time. Conventional algorithms such as Q-learning and state-action-reward-state-action (SARSA) cannot provide a quantitative upper bound on the online learning time. In this paper, we employ the idea of probably approximately correct (PAC) learning and design data-driven online RL algorithms for continuous-time deterministic systems. This class of algorithms efficiently records online observations while accounting for the exploration required by online RL. They are capable of learning a near-optimal policy within a finite time length. Two algorithms are developed, based on state discretization and the kd-tree technique respectively, which are used to store data and compute online policies. Both algorithms are applied to a two-link manipulator to observe their performance.

Keywords:	reinforcement learning; probably approximately correct; kd-tree; two-link manipulator
This article is indexed in CNKI and other databases.