Similar Documents
 A total of 18 similar documents were found (search time: 343 ms).
1.
Adaptive RBF Network Q-Learning Control   Cited by: 1 (self-citations: 0, citations by others: 1)
An RBF network is used to approximate the Q-value function over a continuous space, enabling Q-learning in continuous spaces. The input to the RBF network is a state-action pair and the output is the Q-value of that pair. The state is determined by the state-transition characteristics of the system, while the action is formed by superimposing the greedy action obtained by optimizing the network output and an exploratory action drawn from Gaussian noise. The RNA algorithm and gradient descent are used to adaptively adjust the structure and parameters of the network. Experimental results on inverted-pendulum balance control verify the effectiveness of the method.
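Below is a minimal Python sketch, not the paper's exact algorithm, of Q-learning with an RBF approximator over state-action pairs. The fixed Gaussian centers, the candidate-action grid used to approximate the greedy action, and all hyperparameters are illustrative assumptions, and the paper's RNA-based structure adaptation is omitted.

import numpy as np

class RBFQ:
    def __init__(self, centers, sigma=0.5, lr=0.05):
        self.centers = centers          # (n_basis, state_dim + action_dim), assumed fixed
        self.sigma = sigma
        self.w = np.zeros(len(centers)) # linear output weights
        self.lr = lr

    def _phi(self, s, a):
        # Gaussian RBF features of the concatenated state-action vector
        x = np.concatenate([s, np.atleast_1d(a)])
        d2 = np.sum((self.centers - x) ** 2, axis=1)
        return np.exp(-d2 / (2.0 * self.sigma ** 2))

    def q(self, s, a):
        return self.w @ self._phi(s, a)

    def greedy_action(self, s, candidates):
        # approximate argmax_a Q(s, a) by evaluating a grid of candidate actions
        return max(candidates, key=lambda a: self.q(s, a))

    def update(self, s, a, r, s_next, candidates, gamma=0.99):
        # one-step Q-learning target; gradient step on the squared TD error
        target = r + gamma * max(self.q(s_next, a2) for a2 in candidates)
        phi = self._phi(s, a)
        self.w += self.lr * (target - self.w @ phi) * phi

In the scheme the abstract describes, the executed action would be this greedy action plus Gaussian exploration noise, e.g. a = agent.greedy_action(s, candidates) + np.random.normal(0.0, 0.1).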

2.
Q-Learning Based on Cooperative Least Squares Support Vector Machines   Cited by: 5 (self-citations: 0, citations by others: 5)
To address the slow convergence of reinforcement learning systems, a Q-learning method based on cooperative least squares support vector machines is proposed for continuous state spaces and discrete action spaces. The Q-learning system consists of a least squares support vector regression machine (LS-SVRM) and a least squares support vector classification machine (LS-SVCM). The LS-SVRM approximates the mapping from state-action pairs to the value function, while the LS-SVCM approximates the mapping from the continuous state space to the discrete action space and provides the LS-SVRM with real-time, dynamic knowledge or advice (suggested action values) to promote learning of the value function. Simulation results on minimum-time mountain-car control show that, compared with a Q-learning system based on a single LS-SVRM, the method speeds up learning convergence and achieves better learning performance.
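As a rough illustration of the regression half only, the following numpy sketch fits a least-squares SVR (kernel ridge style, bias term omitted) to sampled (state, action) to Q-target pairs. The cooperative LS-SVCM advice mechanism described in the abstract is not reproduced, and the kernel width and regularization constant are assumptions.

import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    # Gaussian kernel matrix between row-vectors of X and Y
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-d2 / (2 * sigma**2))

def fit_ls_svr(X, y, gamma=10.0, sigma=1.0):
    # solve (K + I/gamma) alpha = y, the standard LS-SVM dual system without bias
    K = rbf_kernel(X, X, sigma)
    return np.linalg.solve(K + np.eye(len(X)) / gamma, y)

def predict(X_train, alpha, X_new, sigma=1.0):
    # Q estimates at new (state, action) rows
    return rbf_kernel(X_new, X_train, sigma) @ alpha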

3.
唐昊, 杨羊, 戴飞, 谭琦. 《控制与决策》 2019, 34(7): 1456-1462
This paper studies the look-ahead distance optimization control problem of a conveyor-serviced production station (CSPS) system with multiple workpiece types arriving, in order to improve the system's working efficiency. As the number of workpiece types increases, the size of the system's state space grows exponentially. Considering that traditional Q-learning suffers from the curse of dimensionality on large discrete state spaces and cannot directly handle the look-ahead distance as a continuous variable, an RBF network is introduced to approximate the Q-value function; the network's input is a state-action pair and its output is the Q-value of that pair. An RBF-Q learning algorithm is presented and applied to the optimal control of the multi-type CSPS system, realizing Q-learning over a continuous action space. Simulations for different numbers of workpiece types show that the RBF-Q learning algorithm can effectively optimize the performance of the multi-type CSPS system and improve the learning speed.

4.
Multi-Agent Robot Reinforcement Learning Based on an Adaptive Fuzzy RBF Neural Network   Cited by: 3 (self-citations: 0, citations by others: 3)
Learning in multi-robot environments is difficult: the robots operate in continuous state and action spaces and the system contains multiple robots, so the learning space is huge and directly applying the Q-learning algorithm rarely yields satisfactory results. Addressing the learning problem of multi-agent robot systems, this paper proposes an adaptive fuzzy RBF neural network reinforcement learning algorithm. Because the network itself possesses fuzzy inference ability, strong function approximation ability and good generalization, it combines human expert knowledge with machine learning methods, reduces the complexity of the learning problem, and realizes policy learning over continuous state and action spaces.

5.
In reinforcement learning problems, the state-action values of different actions in the same state are often very close to each other, so the Q-Learning algorithm suffers from overestimation when it selects actions with the MAX operator, and the Deep Q Network built on Q-Learning has the same overestimation problem. To alleviate the overestimation in deep Q networks, a deep Q network based on advantage learning is proposed: a correction term is constructed by advantage learning, modeled with the target network, and added to the evaluation function of the deep Q network to form a new evaluation function. When the selected action is the optimal one, the correction term is zero and the evaluation value is unchanged; when it is not, the correction term is negative, lowering the evaluation value of non-optimal actions. Compared with the conventional deep Q network, the advantage-learning deep Q network obtains higher average rewards on the Atari 2600 control tasks breakout, seaquest, phoenix and amidar, and more stable policies on krull and seaquest.
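A hedged sketch of the kind of correction the abstract describes: the target network supplies an advantage-style term that is zero for the greedy action and negative otherwise, and it is added to the online network's evaluation. The exact functional form and the weight alpha are assumptions; the paper's definition may differ.

import numpy as np

def corrected_q(q_eval, q_target, actions, alpha=0.9):
    """q_eval, q_target: (batch, n_actions) network outputs; actions: (batch,) chosen indices."""
    batch = np.arange(len(actions))
    # advantage-style correction from the target network:
    # zero when the chosen action is target-greedy, negative otherwise
    correction = alpha * (q_target[batch, actions] - q_target.max(axis=1))
    # new evaluation = online evaluation + correction
    return q_eval[batch, actions] + correction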

6.
周勇, 刘锋. 《微机发展》 2008, 18(4): 63-66
Simulated robot soccer (Robot World Cup, RoboCup), an ideal experimental platform for multi-agent systems, has become a research hotspot in artificial intelligence. Traditional Q-learning has been applied effectively to the passing-strategy problem in RoboCup, but it can only crudely discretize the continuous state and action spaces. This paper proposes applying a neural network to Q-learning: the system needs to learn the Q-values of only a subset of state-action pairs to obtain approximately continuous Q-values, which effectively improves generalization. The improved Q-learning is then applied to optimize the passing strategy, and the algorithm is implemented and tested in RoboCup. Experimental results show that applying the improved Q-learning to the RoboCup passing strategy effectively increases the success rate of passes.

7.
To address the problem that traditional reinforcement learning methods, which discretize the state space, cannot guarantee trajectory accuracy for UAVs in complex application scenarios, the Least-Squares Policy Iteration (LSPI) algorithm is used to study trajectory planning with continuous states. The algorithm uses a parameterized linear function approximator to represent the action-value function, requires no spatial discretization, and thus improves trajectory accuracy; the policy is computed offline from sample data, and policy evaluation and improvement act on the policy directly. Comparative simulations against the Q-learning algorithm show that the 3D trajectories planned by LSPI are smoother, which benefits actual flight.
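For reference, a minimal sketch of the LSTD-Q step that LSPI repeats on a batch of samples; the feature map phi, the policy function, the discount factor and the small ridge term are illustrative assumptions.

import numpy as np

def lstdq(samples, phi, policy, k, gamma=0.95):
    """samples: list of (s, a, r, s_next); phi(s, a) returns a length-k feature vector."""
    A = np.zeros((k, k))
    b = np.zeros(k)
    for s, a, r, s_next in samples:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))   # next action from the current greedy policy
        A += np.outer(f, f - gamma * f_next)
        b += f * r
    # weights w of the linear approximation Q(s, a) ~ w . phi(s, a); small ridge for stability
    return np.linalg.solve(A + 1e-6 * np.eye(k), b)

# LSPI alternates: w = lstdq(...), then policy(s) = argmax_a w . phi(s, a), until w converges.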

8.
For the motion balance control problem of a two-wheeled robot, a biologically inspired learning algorithm based on a growing cell structures (GCS) network with Q-learning is proposed. In addition to the competitive mechanism of SOM networks, a GCS network can evolve in a self-organizing way by continually growing new neurons. Q-learning is a model-free reinforcement learning algorithm that can improve learning ability, but it is applicable only to control systems with discretized states. By applying the growth property of the GCS network to the Q-learning algorithm and using the information of the winning neuron in the network output to optimize the Q-values, model-free control of a continuous-state system is achieved, and simulation experiments are performed on the two-wheeled robot. The results show that with 12 neurons the robot only just becomes controllable, but the oscillation of the body tilt angle is too large and the displacement is uncontrolled; when the number of neurons grows to 25, the body tilt angle fluctuates within a very small range (about 0.2°), the displacement reaches equilibrium at about 0.05 m, and the motion balance of the robot is controlled very well.

9.
A Fuzzy Reinforcement Learning Algorithm and Its Application in RoboCup   Cited by: 1 (self-citations: 0, citations by others: 1)
Traditional reinforcement learning algorithms can only solve learning problems with discrete state and action spaces. This paper proposes a fuzzy reinforcement learning algorithm that maps the continuous state space to a continuous action space through a fuzzy inference system and then learns a complete rule base. The rule base provides prior knowledge for the Agent's action selection, and dynamic programming can be realized through it. The algorithm is validated in the RoboCup environment, where it optimizes the kicking strategy.

10.
Neural-network-based Q-learning over continuous state spaces has already been applied to robot navigation. Since neural networks easily become trapped in local minima, a mobile robot navigation method combining support vector machines with Q-learning is proposed. First, the reward function for Q-learning is determined using the developed CASIA-I mobile robot and its working environment as the experimental platform; then a support vector machine is used to estimate the Q-values of state-action pairs online, and a rolling time window mechanism is introduced to speed up the estimation. Finally, experiments on the proposed method show that it enables the robot to reach its destination without collisions.

11.
This paper presents a new adaptive segmentation of continuous state space based on a vector quantization algorithm, such as Linde–Buzo–Gray, for high-dimensional continuous state spaces. The objective of adaptive state space partitioning is to improve the efficiency of learning reward values by accumulating state transition vectors in a single-agent environment. We constructed our single-agent model in continuous state and discrete action spaces using the Q-learning function. Moreover, the study of the resulting state space partition reveals a Voronoi tessellation. In addition, the experimental results show that the proposed method not only partitions the continuous state space appropriately into Voronoi regions according to the number of actions, but also achieves good performance on reward-based learning tasks compared with other approaches such as a square partition lattice on a discrete state space.
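A small Python sketch of an LBG-style codebook whose codewords induce the Voronoi partition mentioned above; the splitting perturbation, the number of Lloyd iterations, and the doubling schedule (codebook sizes are powers of two) are standard but illustrative choices, not necessarily those of the paper.

import numpy as np

def lbg_codebook(states, n_codewords, eps=0.01, iters=20):
    """states: (N, dim) array of visited states; returns codewords defining Voronoi regions."""
    codebook = np.mean(states, axis=0, keepdims=True)
    while len(codebook) < n_codewords:
        # split every codeword into a perturbed pair, then refine with Lloyd iterations
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        for _ in range(iters):
            d = np.linalg.norm(states[:, None, :] - codebook[None, :, :], axis=2)
            nearest = d.argmin(axis=1)
            for k in range(len(codebook)):
                members = states[nearest == k]
                if len(members):
                    codebook[k] = members.mean(axis=0)
    return codebook   # a state s maps to the discrete state argmin_k ||s - codebook[k]||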

12.
An Adaptive Fuzzy Actor-Critic Learning Method   Cited by: 1 (self-citations: 0, citations by others: 1)
An adaptive fuzzy Actor-Critic learning method based on a fuzzy RBF network is proposed. A single fuzzy RBF neural network simultaneously approximates the Actor's action function and the Critic's value function, addressing the "curse of dimensionality" that easily arises when generalizing over the state space. The fuzzy RBF network can adaptively learn its structure and parameters as the environment state and the characteristics of the controlled plant change, which makes the network structure more compact; the resulting fuzzy Actor-Critic learning offers good generalization, a simple control structure and high learning efficiency. Simulation results on MountainCar verify the effectiveness of the proposed method.

13.
Continuous-Action Q-Learning   Cited by: 1 (self-citations: 0, citations by others: 1)
This paper presents a Q-learning method that works in continuous domains. Other characteristics of our approach are the use of an incremental topology preserving map (ITPM) to partition the input space, and the incorporation of bias to initialize the learning process. A unit of the ITPM represents a limited region of the input space and maps it onto the Q-values of M possible discrete actions. The resulting continuous action is an average of the discrete actions of the winning unit weighted by their Q-values. Then, temporal-difference (TD) learning updates the Q-values of the discrete actions according to their contribution. Units are created incrementally and their associated Q-values are initialized by means of domain knowledge. Experimental results in robotics domains show the superiority of the proposed continuous-action Q-learning over the standard discrete-action version in terms of both asymptotic performance and speed of learning. The paper also reports a comparison of discounted-reward against average-reward Q-learning in an infinite-horizon robotics task.
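The action-synthesis step described above can be sketched in a few lines; shifting the Q-values by their minimum to keep the weights non-negative, and the restriction to scalar actions, are assumptions of this sketch.

import numpy as np

def continuous_action(discrete_actions, q_values):
    """discrete_actions: (M,) candidate actions of the winning unit; q_values: (M,) their Q-values."""
    # Q-weighted average of the winning unit's discrete actions
    w = q_values - q_values.min() + 1e-8
    return float(np.dot(w, discrete_actions) / w.sum())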

14.
Quad-Q-learning     
Develops the theory of quad-Q-learning, a learning algorithm that evolved from Q-learning. Quad-Q-learning is applicable to problems that can be solved by "divide and conquer" techniques. Quad-Q-learning concerns an autonomous agent that learns without supervision to act optimally to achieve specified goals. The learning agent acts in an environment that can be characterized by a state. In the Q-learning environment, when an action is taken, a reward is received and a single new state results. The objective of Q-learning is to learn a policy function that maps states to actions so as to maximize a function of the rewards, such as the sum of rewards. In quad-Q-learning, by contrast, when an action is taken from a state, either an immediate reward is received and no new state results, or no reward is received and four new states result from taking that action. The environment in which quad-Q-learning operates can thus be viewed as a hierarchy of states where lower-level states are the children of higher-level states. The hierarchical aspect of quad-Q-learning leads to a bottom-up view of learning that improves the efficiency of learning at higher levels in the hierarchy. The objective of quad-Q-learning is to maximize the sum of rewards obtained from each of the environments that result as actions are taken. Two versions of quad-Q-learning are discussed: discrete-state quad-Q-learning and mixed discrete and continuous state quad-Q-learning. The discrete-state version is only applicable to problems with small numbers of states; scaling up to problems with practical numbers of states requires a continuous-state learning method, which can be accomplished using function approximation. Application of quad-Q-learning to image compression is briefly described.

15.
One of the difficulties encountered in applying reinforcement learning methods to real-world problems is their limited ability to cope with large-scale or continuous spaces. In order to solve the curse-of-dimensionality problem that results from discretizing continuous state or action spaces, a new fuzzy Actor-Critic reinforcement learning network (FACRLN) based on a fuzzy radial basis function (FRBF) neural network is proposed. The architecture of FACRLN is realized by a four-layer FRBF neural network that approximates both the action function of the Actor and the state value function of the Critic simultaneously. The Actor and Critic networks share the input, rule and normalized layers of the FRBF network, which reduces the learning system's demand for storage space and avoids repeated computation of the rule-unit outputs. Moreover, the FRBF network can adjust its structure and parameters adaptively with a novel self-organizing approach according to the complexity of the task and the progress of learning, which keeps the network size economical. Experimental studies on a cart-pole balancing control task illustrate the performance and applicability of the proposed FACRLN.
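A minimal sketch of an actor-critic that shares one set of normalized RBF features between the actor and the critic, in the spirit of the shared layers described above; the fixed network structure, the exploration noise, and the update rule driven by the TD error times the exploration offset are simplifications, and the self-organizing structure learning is omitted.

import numpy as np

class RBFActorCritic:
    def __init__(self, centers, sigma=0.5, lr_a=0.01, lr_c=0.05):
        self.centers, self.sigma = centers, sigma     # (R, state_dim) fixed RBF centers
        self.w_actor = np.zeros(len(centers))         # actor weights (continuous action)
        self.w_critic = np.zeros(len(centers))        # critic weights (state value)
        self.lr_a, self.lr_c = lr_a, lr_c

    def features(self, s):
        # shared Gaussian features followed by a normalization layer
        d2 = np.sum((self.centers - s) ** 2, axis=1)
        phi = np.exp(-d2 / (2 * self.sigma ** 2))
        return phi / (phi.sum() + 1e-8)

    def act(self, s, noise=0.1):
        # mean action from the actor plus Gaussian exploration
        return self.w_actor @ self.features(s) + np.random.normal(0, noise)

    def update(self, s, r, s_next, explored_minus_mean, gamma=0.99):
        phi, phi_next = self.features(s), self.features(s_next)
        td_error = r + gamma * self.w_critic @ phi_next - self.w_critic @ phi
        self.w_critic += self.lr_c * td_error * phi
        # push the actor toward the explored action when the TD error is positive
        self.w_actor += self.lr_a * td_error * explored_minus_mean * phi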

16.
Application of Reinforcement Learning to Learning Basic Actions of Soccer Robots   Cited by: 1 (self-citations: 0, citations by others: 1)
This work studies reinforcement learning algorithms and their application to learning technical actions in robot soccer. When the state and action spaces of reinforcement learning are too large or the variables are continuous, learning is often too slow or even fails to converge. To address this problem, a reinforcement learning method based on a T-S model fuzzy neural network is proposed, which effectively realizes the mapping from the state space to the action space. In addition, the proposed method is used to design the technical actions of soccer robots, and robot behavior learning without expert knowledge or an environment model is studied. Finally, experiments demonstrate the effectiveness of the method, which can meet the needs of robot soccer competition.

17.
A Reinforcement Learning Method Based on a Node-Growing k-Means Clustering Algorithm   Cited by: 3 (self-citations: 0, citations by others: 3)
There are two main classes of methods for continuous-state reinforcement learning problems: parameterized function approximation and adaptive discretization. After analyzing the advantages and disadvantages of existing methods for adaptively partitioning continuous state spaces, a partitioning method based on a node-growing k-means clustering algorithm is proposed, and the algorithmic steps of the reinforcement learning method are given for both the discrete-action and continuous-action cases. Simulation experiments are conducted on the discrete-action MountainCar problem and the continuous-action double-integrator problem. The results show that the method automatically adjusts the partition resolution according to the distribution of states in the continuous space, achieves adaptive partitioning of the continuous state space, and learns the optimal policy.
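The node-growing idea can be sketched as an online rule: create a new cluster node when a visited state is farther than a threshold from every existing node, otherwise nudge the nearest node toward the state. The threshold and learning rate below are illustrative assumptions.

import numpy as np

def assign_or_grow(nodes, state, grow_threshold=0.5, lr=0.05):
    """nodes: list of cluster centers (updated in place); returns the discrete state index."""
    state = np.asarray(state, dtype=float)
    if not nodes:
        nodes.append(state.copy())
        return 0
    dists = [np.linalg.norm(state - c) for c in nodes]
    k = int(np.argmin(dists))
    if dists[k] > grow_threshold:
        nodes.append(state.copy())            # grow a new node: a new discrete state
        return len(nodes) - 1
    nodes[k] += lr * (state - nodes[k])       # k-means style online update of the winner
    return k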

18.
Online tuning of fuzzy inference systems using dynamic fuzzy Q-learning   Cited by: 1 (self-citations: 0, citations by others: 1)
This paper presents a dynamic fuzzy Q-learning (DFQL) method that is capable of tuning fuzzy inference systems (FIS) online. A novel online self-organizing learning algorithm is developed so that structure and parameter identification are accomplished automatically and simultaneously, based only on Q-learning. Self-organizing fuzzy inference is introduced to calculate actions and Q-functions so as to deal with continuous-valued states and actions. Fuzzy rules provide a natural means of incorporating bias components for rapid reinforcement learning. Experimental results and comparative studies with fuzzy Q-learning (FQL) and continuous-action Q-learning on the wall-following task of mobile robots demonstrate that the proposed DFQL method is superior.
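A rough sketch of how a fuzzy Q-learning step can compute a continuous action and its Q-value from normalized rule firing strengths; the Gaussian memberships, the per-rule epsilon-greedy choice, and all parameter shapes are assumptions of this sketch rather than the DFQL formulation itself.

import numpy as np

def fuzzy_q_action(state, centers, sigmas, rule_actions, rule_q, epsilon=0.1):
    """centers, sigmas: (R, state_dim); rule_actions, rule_q: (R, M) per-rule candidates and q-values."""
    # normalized firing strength of each rule (product of Gaussian memberships)
    phi = np.exp(-np.sum(((state - centers) / sigmas) ** 2, axis=1))
    phi = phi / (phi.sum() + 1e-8)
    R, M = rule_q.shape
    # epsilon-greedy selection of one candidate action per rule
    chosen = np.where(np.random.rand(R) < epsilon,
                      np.random.randint(M, size=R),
                      rule_q.argmax(axis=1))
    # global continuous action and Q-value as firing-strength-weighted combinations
    action = np.dot(phi, rule_actions[np.arange(R), chosen])
    q_value = np.dot(phi, rule_q[np.arange(R), chosen])
    return action, q_value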
