Similar Documents
20 similar documents found.
1.
Q-learning over continuous state spaces based on neural networks has been applied to robot navigation. Since neural networks are prone to getting trapped in local minima, this paper proposes a mobile robot navigation method that combines support vector machines (SVMs) with Q-learning. First, the reward function for Q-learning is determined using the in-house CASIA-I mobile robot and its working environment as the experimental platform. Then an SVM is used to estimate online the Q-values of the state-action pairs, and a rolling time-window mechanism is introduced to speed up the estimation. Finally, experiments on the proposed method show that it enables the robot to reach its destination without collisions.
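A minimal sketch of the SVM-based online Q-value estimation with a rolling time window described above, assuming a scikit-learn SVR as the regressor; the state encoding, hyperparameters, and window size are placeholders rather than the CASIA-I setup.

import numpy as np
from sklearn.svm import SVR

class RollingWindowSVRQ:
    """Estimate Q(s, a) with an SVR refitted on a rolling window of TD targets."""
    def __init__(self, n_actions, window=200, gamma=0.95):
        self.n_actions = n_actions
        self.window = window                      # rolling time-window size (placeholder)
        self.gamma = gamma
        self.samples = []                         # (feature vector, TD target) pairs
        self.model = SVR(kernel="rbf", C=10.0)
        self.ready = False

    def _features(self, state, action):
        return np.concatenate([np.asarray(state, dtype=float), [float(action)]])

    def q(self, state, action):
        if not self.ready:
            return 0.0
        return float(self.model.predict(self._features(state, action).reshape(1, -1))[0])

    def update(self, state, action, reward, next_state):
        # Q-learning target: r + gamma * max_a' Q(s', a')
        target = reward + self.gamma * max(self.q(next_state, a) for a in range(self.n_actions))
        self.samples.append((self._features(state, action), target))
        self.samples = self.samples[-self.window:]        # keep only the most recent samples
        X = np.vstack([f for f, _ in self.samples])
        y = np.array([t for _, t in self.samples])
        self.model.fit(X, y)
        self.ready = True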

2.
Reinforcement learning (RL) has been applied to many fields and applications, but the dilemma between exploration and exploitation in the action selection policy remains. The best-known reinforcement learning algorithms are Q-learning and Sarsa, but they possess different characteristics. Generally speaking, the Sarsa algorithm has faster convergence, while the Q-learning algorithm has better final performance. However, the Sarsa algorithm is easily stuck in local minima, and Q-learning needs a longer time to learn. Most of the literature investigates the action selection policy. Instead of studying an action selection strategy, this paper focuses on how to combine Q-learning with the Sarsa algorithm, and presents a new method, called backward Q-learning, which can be implemented in both the Sarsa algorithm and Q-learning. The backward Q-learning algorithm directly tunes the Q-values, and the Q-values then indirectly affect the action selection policy. Therefore, the proposed RL algorithms can enhance learning speed and improve final performance. Finally, three experiments, including cliff walk, mountain car, and a cart-pole balancing control system, are used to verify the feasibility and effectiveness of the proposed scheme. All the simulations illustrate that the backward Q-learning based RL algorithm outperforms the well-known Q-learning and Sarsa algorithms.
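A minimal sketch of the backward Q-learning idea described above: act and update with Sarsa during the episode, then replay the stored transitions in reverse with the Q-learning target. The tabular environment interface (reset()/step() returning discrete states and a done flag) is an assumption made for the sketch.

import numpy as np

def epsilon_greedy(Q, s, eps, rng):
    return int(rng.integers(Q.shape[1])) if rng.random() < eps else int(np.argmax(Q[s]))

def backward_q_episode(env, Q, alpha=0.1, gamma=0.99, eps=0.1, rng=np.random.default_rng()):
    """One Sarsa episode followed by a backward sweep of Q-learning updates."""
    trajectory = []
    s = env.reset()
    a = epsilon_greedy(Q, s, eps, rng)
    done = False
    while not done:
        s2, r, done = env.step(a)
        a2 = epsilon_greedy(Q, s2, eps, rng)
        # on-policy Sarsa update during the episode
        Q[s, a] += alpha * (r + gamma * Q[s2, a2] * (not done) - Q[s, a])
        trajectory.append((s, a, r, s2, done))
        s, a = s2, a2
    # backward sweep: apply the off-policy Q-learning target in reverse order,
    # so late rewards propagate quickly toward the start of the episode
    for s, a, r, s2, d in reversed(trajectory):
        Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) * (not d) - Q[s, a])
    return Q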

3.
唐昊  裴荣  周雷  谭琦 《自动化学报》2014,40(5):901-908
In a single-station conveyor-serviced production station (CSPS) system, reinforcement learning can be used to explore the state-action space effectively and to search for a near-optimal look-ahead distance control policy. In the cooperative control problem of a multi-station CSPS system, however, the size of the system's state space grows exponentially (geometrically) with the number of stations and the capacity of the buffers, leading to the curse of dimensionality and degrading the convergence speed and optimization quality of the learning algorithm. To address this, this paper introduces state clustering on top of a local information-exchange mechanism among stations, so as to reduce the size and complexity of each station's learning space. First, the stations are treated as relatively independent learning agents, each of which considers only the buffer state of its adjacent downstream station in learning its performance values. Second, the original state space is partitioned into several disjoint subsets, each represented by an abstract state, and a state-clustering-based multi-station feedback Q-learning algorithm is then constructed. With this method, the look-ahead distance policy of each station can be optimized over the abstract state space so as to maximize the production rate of the whole system. Simulation results show that, compared with the ordinary multi-station feedback Q-learning method, the state-clustering-based method not only converges faster but also improves the system production rate to some extent.
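A minimal sketch of Q-learning over clustered (abstract) states, assuming each station is given a cluster_of() mapping from its raw observation (its own state plus the neighboring downstream buffer level) to an abstract-state id; the mapping and environment interface are placeholders, not the CSPS model itself.

import numpy as np
from collections import defaultdict

def clustered_q_learning(env, cluster_of, n_actions, episodes=500,
                         alpha=0.1, gamma=0.95, eps=0.1):
    """Tabular Q-learning where learning happens on the abstract states returned by cluster_of()."""
    rng = np.random.default_rng()
    Q = defaultdict(lambda: np.zeros(n_actions))          # one row per abstract state
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            z = cluster_of(s)                             # raw state -> abstract state
            a = int(np.argmax(Q[z])) if rng.random() > eps else int(rng.integers(n_actions))
            s2, r, done = env.step(a)
            Q[z][a] += alpha * (r + gamma * np.max(Q[cluster_of(s2)]) * (not done) - Q[z][a])
            s = s2
    return Q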

4.
We propose two algorithms for Q-learning that use the two-timescale stochastic approximation methodology. The first updates the Q-values of all feasible state-action pairs at each instant, while the second updates the Q-values of states with actions chosen according to the 'current' randomized policy. A proof of convergence of the algorithms is given. Finally, numerical experiments using the proposed algorithms on an application of routing in communication networks are presented for a few different settings.

5.
Devin Schwab  Soumya Ray 《Machine Learning》2017,106(9-10):1569-1598
In this work, we build upon the observation that offline reinforcement learning (RL) is synergistic with task hierarchies that decompose large Markov decision processes (MDPs). Task hierarchies can allow more efficient sample collection from large MDPs, while offline algorithms can learn better policies than the so-called “recursively optimal” or even hierarchically optimal policies learned by standard hierarchical RL algorithms. To enable this synergy, we study sample collection strategies for offline RL that are consistent with a provided task hierarchy while still providing good exploration of the state-action space. We show that naïve extensions of uniform random sampling do not work well in this case and design a strategy that has provably good convergence properties. We also augment the initial set of samples using additional information from the task hierarchy, such as state abstraction. We use the augmented set of samples to learn a policy offline. Given a capable offline RL algorithm, this policy is then guaranteed to have a value greater than or equal to the value of the hierarchically optimal policy. We evaluate our approach on several domains and show that samples generated using a task hierarchy with a suitable strategy allow significantly more sample-efficient convergence than standard offline RL. Further, our approach also shows more sample-efficient convergence to policies with value greater than or equal to hierarchically optimal policies found through an online hierarchical RL approach.

6.
周勇  刘锋 《微机发展》2008,18(4):63-66
Simulated robot soccer (Robot World Cup, RoboCup), as an ideal experimental platform for multi-agent systems, has become a hot topic in artificial intelligence research. Traditional Q-learning has been applied effectively to the passing-strategy problem in RoboCup, but it can only crudely discretize the continuous state and action spaces. This paper proposes applying a neural network to Q-learning: the system only needs to learn the Q-values of some state-action pairs to obtain approximately continuous Q-values, which effectively improves generalization. The improved Q-learning is then applied to optimize the passing strategy, and the algorithm is finally implemented and tested in RoboCup. The experimental results show that applying the improved Q-learning to the RoboCup passing strategy can effectively increase the success rate of passes.
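A minimal sketch of approximating Q-values with a small neural network so that nearby state-action pairs generalize, in the spirit of the approach above; the network size, features, and update rule are illustrative assumptions rather than the paper's RoboCup setup. During learning, the target passed to td_update would be the usual r + gamma * max over candidate passes of q(features(next_state, a)), computed from the same network.

import numpy as np

class TinyQNet:
    """One-hidden-layer network mapping a state-action feature vector to a scalar Q-value."""
    def __init__(self, n_inputs, n_hidden=32, lr=0.01, rng=np.random.default_rng(0)):
        self.W1 = rng.normal(0.0, 0.1, (n_inputs, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, n_hidden)
        self.b2 = 0.0
        self.lr = lr

    def q(self, x):
        h = np.tanh(x @ self.W1 + self.b1)
        return float(h @ self.W2 + self.b2)

    def td_update(self, x, target):
        # one gradient step on the squared TD error (target - Q(x))^2
        h = np.tanh(x @ self.W1 + self.b1)
        err = target - (h @ self.W2 + self.b2)
        grad_h = err * self.W2 * (1.0 - h ** 2)           # back-propagate through tanh
        self.W2 += self.lr * err * h
        self.b2 += self.lr * err
        self.W1 += self.lr * np.outer(x, grad_h)
        self.b1 += self.lr * grad_h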

7.
Reinforcement learning (RL) is a popular method for solving the path planning problem of autonomous mobile robots in unknown environments. However, the primary difficulty faced by learning robots using the RL method is that they learn too slowly in obstacle-dense environments. To solve the path planning problem of autonomous mobile robots in such environments more efficiently, this paper presents a novel approach in which the robot's learning process is divided into two phases. The first phase accelerates the learning of an optimal policy by extending the well-known Dyna-Q algorithm, training the robot to learn obstacle-avoiding actions while following the vector direction. In this phase, the robot's position is represented on a uniform grid; at each time step, the robot performs an action to move to one of its eight adjacent cells, so the path obtained from the optimal policy may be longer than the true shortest path. The second phase trains the robot to learn a collision-free smooth path that decreases the number of heading changes. The simulation results show that the proposed approach is efficient for the path planning problem of autonomous mobile robots in unknown environments with dense obstacles.
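A minimal sketch of the Dyna-Q planning loop used in the first phase above (discrete grid states, eight neighbor moves); the deterministic model table and the number of planning steps per real step are illustrative assumptions.

import numpy as np

def dyna_q(env, n_states, n_actions, episodes=200, planning_steps=20,
           alpha=0.1, gamma=0.95, eps=0.1):
    """Q-learning plus planning: replay transitions from a learned deterministic model."""
    rng = np.random.default_rng()
    Q = np.zeros((n_states, n_actions))
    model = {}                                            # (s, a) -> (r, s') for visited pairs
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = int(np.argmax(Q[s])) if rng.random() > eps else int(rng.integers(n_actions))
            s2, r, done = env.step(a)
            Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) * (not done) - Q[s, a])
            model[(s, a)] = (r, s2)
            # planning: extra updates from randomly chosen remembered transitions
            visited = list(model)
            for _ in range(planning_steps):
                ps, pa = visited[rng.integers(len(visited))]
                pr, ps2 = model[(ps, pa)]
                Q[ps, pa] += alpha * (pr + gamma * np.max(Q[ps2]) - Q[ps, pa])
            s = s2
    return Q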

8.
We propose an integrated technique of genetic programming (GP) and reinforcement learning (RL) to enable a real robot to adapt its actions to a real environment. Our technique does not require a precise simulator because learning is achieved through the real robot. In addition, our technique makes it possible for real robots to learn effective actions. Based on this proposed technique, we acquire common programs, using GP, which are applicable to various types of robots. Through this acquired program, we execute RL in a real robot. With our method, the robot can adapt to its own operational characteristics and learn effective actions. In this paper, we show experimental results from two different robots: a four-legged robot "AIBO" and a humanoid robot "HOAP-1." We present results showing that both effectively solved the box-moving task; the end result demonstrates that our proposed technique performs better than the traditional Q-learning method.

9.
This paper develops an adaptive fuzzy controller for robot manipulators using a Markov game formulation. The Markov game framework offers a promising platform for robust control of robot manipulators in the presence of bounded external disturbances and unknown parameter variations. We propose fuzzy Markov games as an adaptation of fuzzy Q-learning (FQL) to a continuous-action variation of Markov games, wherein the reinforcement signal is used to tune online the conclusion part of a fuzzy Markov game controller. The proposed Markov game-adaptive fuzzy controller uses a simple fuzzy inference system (FIS), is computationally efficient, generates a swift control, and requires no exact dynamics of the robot system. To illustrate the superiority of Markov game-adaptive fuzzy control, we compare the performance of the controller against a) the Markov game-based robust neural controller, b) the reinforcement learning (RL)-adaptive fuzzy controller, c) the FQL controller, d) the H∞ theory-based robust neural game controller, and e) a standard RL-based robust neural controller, on two highly nonlinear robot arm control problems of i) a standard two-link rigid robot arm and ii) a 2-DOF SCARA robot manipulator. The proposed Markov game-adaptive fuzzy controller outperformed other controllers in terms of tracking errors and control torque requirements, over different desired trajectories. The results also demonstrate the viability of FISs for accelerating learning in Markov games and extending Markov game-based control to continuous state-action space problems.

10.
A problem related to the use of reinforcement learning (RL) algorithms in real robot applications is the difficulty of measuring the learning level reached after some experience. Among the different RL algorithms, Q-learning is the most widely used for accomplishing robotic tasks. The aim of this work is to evaluate a priori the optimal Q-values for problems where it is possible to compute the distance between the current state and the goal state of the system. Starting from the Q-learning update formula, the equations for the maximum Q-weights, for optimal and non-optimal actions, have been derived for both delayed and immediate rewards. Deterministic and nondeterministic grid-world environments have also been considered to test the obtained equations in simulation. In addition, the convergence rates of the Q-learning algorithm have been compared using different learning rate parameters.
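A sketch of the kind of closed form that falls out of the Q-learning update in the delayed-reward case, assuming a deterministic grid world whose only reward R is received on reaching the goal; d counts the steps needed to reach the goal when the action is taken and then the optimal policy is followed. This illustrates the idea rather than reproducing the paper's exact equations.

def q_limit_delayed(d, R, gamma=0.95):
    """Fixed point of the Q-update for an action d steps from the goal:
    the single delayed reward R is discounted once per remaining step."""
    return gamma ** (d - 1) * R

# The optimal action from a cell 5 steps away converges to gamma**4 * R, while an
# action that first moves one cell farther from the goal (d = 7) converges lower:
print(q_limit_delayed(5, 100.0), q_limit_delayed(7, 100.0))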

11.
This paper investigates reinforcement learning problems where a stochastic time delay is present in the reinforcement signal, but the delay is unknown to the learning agent. This work posits that the agent may receive individual reinforcements out of order, which is a relaxation of an important assumption in previous works from the literature. To that end, a stochastic time delay is introduced into a mobile robot line-following application. The main contribution of this work is to provide a novel stochastic approximation algorithm, which is an extension of Q-learning, for the time-delayed reinforcement problem. The paper includes a proof of convergence as well as grid world simulation results from MATLAB, results of line-following simulations within the Cyberbotics Webots mobile robot simulator, and finally, experimental results using an e-Puck mobile robot to follow a real track despite the presence of large, stochastic time delays in its reinforcement signal.
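For concreteness, a simplified sketch of Q-learning with delayed, possibly out-of-order reinforcements: each transition is stored under its timestep and the update is applied only when the matching reward arrives. This is an illustration of the problem setting, not the stochastic approximation algorithm whose convergence the paper proves.

import numpy as np

class DelayedRewardQ:
    """Defer each Q-update until the reward tagged with that transition's timestep arrives."""
    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.95):
        self.Q = np.zeros((n_states, n_actions))
        self.alpha, self.gamma = alpha, gamma
        self.pending = {}                                 # timestep -> (s, a, s')

    def record(self, t, s, a, s_next):
        self.pending[t] = (s, a, s_next)                  # reward for step t not yet known

    def receive_reward(self, t, r):
        # the reinforcement for timestep t finally arrived (possibly after later ones)
        s, a, s_next = self.pending.pop(t)
        target = r + self.gamma * np.max(self.Q[s_next])
        self.Q[s, a] += self.alpha * (target - self.Q[s, a])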

12.
Owing to their strong capability for autonomous learning, reinforcement learning methods have gradually become a research focus for robot navigation, but complex unknown environments challenge the running efficiency and convergence speed of these algorithms. This paper proposes a new Q-learning algorithm for robot navigation. The environment state space is first defined with three discrete variables, and then a two-part reward function is designed that incorporates knowledge helpful for reaching the navigation goal to heuristically guide the robot's learning process. Experiments on the Simbad simulation platform show that the proposed algorithm accomplishes the navigation task in unknown environments well and also has superior convergence performance.
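A sketch of a two-part reward of the kind described above, with one term rewarding progress toward the goal and one term penalizing proximity to obstacles; the thresholds and weights below are invented for illustration and are not the values used in the paper.

import numpy as np

def navigation_reward(pos, prev_pos, goal, nearest_obstacle_dist,
                      safe_dist=0.5, w_progress=1.0, w_obstacle=2.0):
    """Two-part reward: progress toward the goal plus a penalty when too close to obstacles."""
    if np.linalg.norm(np.subtract(pos, goal)) < 0.2:      # reached the goal region
        return 10.0
    progress = np.linalg.norm(np.subtract(prev_pos, goal)) - np.linalg.norm(np.subtract(pos, goal))
    r_progress = w_progress * progress                    # positive when the robot gets closer
    r_obstacle = -w_obstacle if nearest_obstacle_dist < safe_dist else 0.0
    return r_progress + r_obstacle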

13.
Kernel-based least squares policy iteration for reinforcement learning.
In this paper, we present a kernel-based least squares policy iteration (KLSPI) algorithm for reinforcement learning (RL) in large or continuous state spaces, which can be used to realize adaptive feedback control of uncertain dynamic systems. By using KLSPI, near-optimal control policies can be obtained without much a priori knowledge of the dynamic models of the control plants. In KLSPI, Mercer kernels are used in the policy evaluation of a policy iteration process, where a new kernel-based least squares temporal-difference algorithm called KLSTD-Q is proposed for efficient policy evaluation. To keep the sparsity and improve the generalization ability of the KLSTD-Q solutions, a kernel sparsification procedure based on approximate linear dependency (ALD) is performed. Compared to previous work on approximate RL methods, KLSPI makes two advances that eliminate the main difficulties of existing results. One is the better convergence and (near) optimality guarantee obtained by using the KLSTD-Q algorithm for high-precision policy evaluation. The other is the automatic feature selection using the ALD-based kernel sparsification. Therefore, the KLSPI algorithm provides a general RL method with generalization performance and a convergence guarantee for large-scale Markov decision problems (MDPs). Experimental results on a typical RL task for a stochastic chain problem demonstrate that KLSPI can consistently achieve better learning efficiency and policy quality than the previous least squares policy iteration (LSPI) algorithm. Furthermore, the KLSPI method was also evaluated on two nonlinear feedback control problems, including a ship heading control problem and the swing-up control of a double-link underactuated pendulum called the acrobot. Simulation results illustrate that the proposed method can optimize controller performance using little a priori information about uncertain dynamic systems. It is also demonstrated that KLSPI can be applied to online learning control by incorporating an initial controller to ensure online performance.
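A minimal sketch of the ALD-based kernel sparsification step mentioned above: a sample enters the dictionary only if its feature-space image cannot be approximated, within a tolerance nu, by the samples already kept. The Gaussian kernel and tolerance are placeholder choices.

import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((np.asarray(x, dtype=float) - np.asarray(y, dtype=float)) ** 2)
                  / (2.0 * sigma ** 2))

def ald_sparsify(samples, kernel=gaussian_kernel, nu=1e-3):
    """Return the dictionary kept by the approximate-linear-dependency (ALD) test."""
    dictionary = []
    for x in samples:
        if not dictionary:
            dictionary.append(x)
            continue
        K = np.array([[kernel(u, v) for v in dictionary] for u in dictionary])
        k = np.array([kernel(u, x) for u in dictionary])
        coeffs = np.linalg.solve(K + 1e-9 * np.eye(len(K)), k)   # best linear combination
        delta = kernel(x, x) - k @ coeffs                        # approximation residual
        if delta > nu:                                           # not well approximated: keep it
            dictionary.append(x)
    return dictionary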

14.
To address the slow convergence of existing reinforcement learning algorithms for robot path planning, this paper proposes an initialization method for mobile-robot reinforcement learning based on an artificial potential field. The robot's workspace is virtualized as an artificial potential field, and prior knowledge is used to determine the potential of every point in the field, which represents the maximum cumulative reward obtainable under the optimal policy; for example, the potential in obstacle regions is zero, while the potential at the goal is the global maximum. The initial Q-value is then defined as the immediate reward at the current point plus the maximum discounted cumulative reward of the successor point. With this Q-value initialization, the improved algorithm converges faster and more stably. Finally, the improved algorithm is validated using robot paths in a grid map; the results show that the method improves the learning efficiency of the initial stage and the overall performance of the algorithm.
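A sketch of the Q-value initialization described above on a grid map: the potential of each free cell is taken to be the maximum discounted return reachable from it (zero inside obstacles, maximal at the goal), and the initial Q-value of each action is the step reward plus the discounted potential of the successor cell. The grid encoding and reward values are illustrative assumptions.

import numpy as np
from collections import deque

def potential_field_q_init(grid, goal, moves, gamma=0.95, goal_reward=100.0, step_reward=0.0):
    """grid: 2-D array with 1 = obstacle, 0 = free.  Returns initial Q0[(cell, move_index)]."""
    rows, cols = grid.shape
    dist = np.full((rows, cols), np.inf)
    dist[goal] = 0
    queue = deque([goal])
    while queue:                                          # BFS shortest step-distances to the goal
        r, c = queue.popleft()
        for dr, dc in moves:
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr, nc] == 0 and dist[nr, nc] == np.inf:
                dist[nr, nc] = dist[r, c] + 1
                queue.append((nr, nc))
    # potential = maximum discounted return obtainable from each free cell; zero on obstacles
    V = np.where(grid == 0, gamma ** dist * goal_reward, 0.0)
    Q0 = {}
    for r in range(rows):
        for c in range(cols):
            if grid[r, c] == 1:
                continue
            for i, (dr, dc) in enumerate(moves):
                nr, nc = r + dr, c + dc
                inside = 0 <= nr < rows and 0 <= nc < cols and grid[nr, nc] == 0
                succ_V = V[nr, nc] if inside else 0.0     # blocked moves keep a zero potential
                Q0[((r, c), i)] = step_reward + gamma * succ_V
    return Q0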

15.
Policy iteration reinforcement learning based on geodesic Gaussian bases over the state-action graph
In policy iteration reinforcement learning, the construction of basis functions is an important factor affecting the approximation accuracy of the action-value function. To provide suitable basis functions for action-value approximation, a policy iteration reinforcement learning method based on geodesic Gaussian bases over the state-action graph is proposed. First, a graph-theoretic state-action description of the Markov decision process is built according to an off-policy method; then, geodesic Gaussian kernel functions are defined on the state-action graph, and a kernel sparsification method based on approximate linear dependency is used to automatically select the geodesic Gaussian...

16.
Mahadevan  Sridhar 《Machine Learning》1996,22(1-3):159-195
This paper presents a detailed study of average reward reinforcement learning, an undiscounted optimality framework that is more appropriate for cyclical tasks than the much better studied discounted framework. A wide spectrum of average reward algorithms are described, ranging from synchronous dynamic programming methods to several (provably convergent) asynchronous algorithms from optimal control and learning automata. A general sensitive discount optimality metric called n-discount-optimality is introduced, and used to compare the various algorithms. The overview identifies a key similarity across several asynchronous algorithms that is crucial to their convergence, namely independent estimation of the average reward and the relative values. The overview also uncovers a surprising limitation shared by the different algorithms: while several algorithms can provably generate gain-optimal policies that maximize average reward, none of them can reliably filter these to produce bias-optimal (or T-optimal) policies that also maximize the finite reward to absorbing goal states. This paper also presents a detailed empirical study of R-learning, an average reward reinforcement learning method, using two empirical testbeds: a stochastic grid world domain and a simulated robot environment. A detailed sensitivity analysis of R-learning is carried out to test its dependence on learning rates and exploration levels. The results suggest that R-learning is quite sensitive to exploration strategies and can fall into sub-optimal limit cycles. The performance of R-learning is also compared with that of Q-learning, the best studied discounted RL method. Here, the results suggest that R-learning can be fine-tuned to give better performance than Q-learning in both domains.
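A minimal sketch of the R-learning update studied above: the table stores relative action values and a separate running estimate rho of the average reward, which is adjusted only on greedy (non-exploratory) steps. Step sizes and the environment interface are placeholder assumptions.

import numpy as np

def r_learning(env, n_states, n_actions, steps=100000, alpha=0.1, beta=0.01, eps=0.1):
    """Average-reward (undiscounted) R-learning: relative values R[s, a] and average reward rho."""
    rng = np.random.default_rng()
    R = np.zeros((n_states, n_actions))
    rho = 0.0
    s = env.reset()
    for _ in range(steps):
        greedy = rng.random() > eps
        a = int(np.argmax(R[s])) if greedy else int(rng.integers(n_actions))
        s2, r, done = env.step(a)
        R[s, a] += alpha * (r - rho + np.max(R[s2]) - R[s, a])
        if greedy:
            # the average-reward estimate is updated only when a greedy action was taken
            rho += beta * (r + np.max(R[s2]) - np.max(R[s]) - rho)
        s = env.reset() if done else s2
    return R, rho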

17.
The traditional Q algorithm defines the robot's reward function rather loosely, which leads to low learning efficiency. To solve this problem, a reward-detailed-classification Q (RDC-Q) learning algorithm is presented. Combining the readings of all of the robot's sensors, the robot's states are divided into 20 reward states and 15 punishment states according to the robot's distance from obstacles; the reward received at each time step is classified by the safety level of the corresponding state, so that the robot tends toward states with higher safety levels, helping it learn faster and better. Simulation experiments in an obstacle-dense environment show that the algorithm converges markedly faster than the traditional reward-based Q algorithm.

18.
This article demonstrates that Q-learning can be accelerated by appropriately specifying initial Q-values using a dynamic wave expansion neural network. In our method, the neural network has the same topology as the robot's workspace, and each neuron corresponds to a certain discrete state. Every neuron of the network reaches an equilibrium state according to the initial environment information. When the network is stable, the activity of each neuron denotes the maximum cumulative reward obtainable by following the optimal policy from the corresponding state. The initial Q-values are then defined as the immediate reward plus the maximum cumulative reward obtained by following the optimal policy from the succeeding state. In this way, we create a neural network-based mapping between the known environment information and the initial values of the Q-table. The prior knowledge can thus be incorporated into the learning system, giving robots a better learning foundation. Results of experiments on a grid world problem show that neural network-based Q-learning enables a robot to acquire an optimal policy with better learning performance than conventional Q-learning and potential field-based Q-learning.

19.
Robot path planning in unknown environments based on rolling Q-learning with prior knowledge
胡俊  朱庆保 《控制与决策》2010,25(9):1364-1368
A rolling Q-learning path planning algorithm with prior knowledge is proposed for robots in unknown environments. When initializing the Q-values, the algorithm incorporates prior knowledge of the environment as heuristic search information, which avoids blind exploration in the early stage of learning and speeds up convergence. Meanwhile, a rolling learning scheme is used to deal with the robot's limited field of view in large-scale environments and with the curse of dimensionality caused by the growth of the Q-learning state space. Simulation results show that, with this algorithm, a robot can quickly plan an optimized collision-free path from start to goal in complex unknown environments, with satisfactory results.

20.
The robot soccer game has been proposed as a benchmark problem for artificial intelligence and robotics research. The decision-making system is the most important part of the robot soccer system. As the environment is dynamic and complex, a reinforcement learning (RL) method named FNN-RL is employed to learn the decision-making strategy. The FNN-RL system consists of a fuzzy neural network (FNN) and RL: RL is used for structure identification and parameter tuning of the FNN, while the curse-of-dimensionality problem of RL can be solved by the function approximation characteristics of the FNN. Furthermore, the residual algorithm is used to calculate the gradient of the FNN-RL method in order to guarantee the convergence and rapidity of learning. The complex decision-making task is divided into multiple learning subtasks, including dynamic role assignment, action selection, and action implementation, which constitute a hierarchical learning system. We apply the proposed FNN-RL method to soccer agents that attempt to learn each subtask at the various layers. The effectiveness of the proposed method is demonstrated by simulations and real experiments.
