Similar Literature
 20 similar documents found (search time: 125 ms)
1.
Reinforcement learning (RL) has been applied to many fields and applications, but the dilemma between exploration and exploitation in the action selection policy remains. Two well-known RL algorithms are Q-learning and Sarsa, and they possess different characteristics. Generally speaking, the Sarsa algorithm converges faster, while the Q-learning algorithm achieves better final performance. However, the Sarsa algorithm is easily stuck in local minima, and Q-learning needs a longer time to learn. Most of the literature investigates the action selection policy. Instead of studying an action selection strategy, this paper focuses on how to combine Q-learning with the Sarsa algorithm, and presents a new method, called backward Q-learning, which can be implemented in both the Sarsa algorithm and Q-learning. The backward Q-learning algorithm directly tunes the Q-values, and the Q-values then indirectly affect the action selection policy. The proposed RL algorithms can therefore enhance learning speed and improve final performance. Finally, three experiments, cliff walk, mountain car, and the cart-pole balancing control system, are used to verify the feasibility and effectiveness of the proposed scheme. All the simulations illustrate that the backward Q-learning based RL algorithm outperforms the well-known Q-learning and Sarsa algorithms.
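As a rough illustration (not the paper's exact formulation), the contrast between the two update rules, and the backward-replay idea, can be sketched in Python; the function names and the episode-replay form of the backward pass are assumptions for illustration:

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.95):
    # Sarsa (on-policy): the target uses the action actually taken next.
    Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])

def q_update(Q, s, a, r, s2, actions, alpha=0.1, gamma=0.95):
    # Q-learning (off-policy): the target uses the greedy next action.
    best = max(Q[(s2, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])

def backward_replay(Q, episode, actions, alpha=0.1, gamma=0.95):
    # Backward pass (assumed form): replay the stored episode in reverse so
    # that reward information reaches early states in a single sweep.
    for (s, a, r, s2) in reversed(episode):
        q_update(Q, s, a, r, s2, actions, alpha, gamma)
```

Replaying in reverse lets the reward at the end of an episode propagate toward the start in one sweep, which is one plausible reading of how tuning the Q-values directly can speed up learning.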

2.
Online learning time is an important metric for reinforcement learning algorithms. Traditional online RL algorithms such as Q-learning and SARSA (state-action-reward-state-action) cannot provide a quantitative theoretical upper bound on online learning time. This paper introduces the probably approximately correct (PAC) principle to design data-based online reinforcement learning algorithms for continuous-time deterministic systems. These algorithms record online data effectively while accounting for the exploration needs of reinforcement learning, and can output a near-optimal controller within a finite online learning time. We present two implementations of the proposed algorithm, using state discretization and kd-trees (k-dimensional trees) respectively, to store data and compute the online policy. Finally, both algorithms are applied to the motion control of a two-link manipulator, and their performance is observed and compared.
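A minimal sketch of the state-discretization variant (the grid form, bounds, and function name are illustrative assumptions, not the paper's implementation; the kd-tree variant would replace this grid lookup with nearest-neighbor queries):

```python
def discretize(state, lo, hi, bins):
    """Map a continuous state vector to a grid-cell index tuple, so that
    online data can be stored and looked up per cell."""
    cell = []
    for x, l, h, n in zip(state, lo, hi, bins):
        i = int((x - l) / (h - l) * n)
        cell.append(min(max(i, 0), n - 1))  # clamp to the valid cell range
    return tuple(cell)
```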

3.
Mahadevan, Sridhar. Machine Learning, 1996, 22(1-3): 159-195
This paper presents a detailed study of average reward reinforcement learning, an undiscounted optimality framework that is more appropriate for cyclical tasks than the much better studied discounted framework. A wide spectrum of average reward algorithms are described, ranging from synchronous dynamic programming methods to several (provably convergent) asynchronous algorithms from optimal control and learning automata. A general sensitive discount optimality metric called n-discount-optimality is introduced, and used to compare the various algorithms. The overview identifies a key similarity across several asynchronous algorithms that is crucial to their convergence, namely independent estimation of the average reward and the relative values. The overview also uncovers a surprising limitation shared by the different algorithms: while several algorithms can provably generate gain-optimal policies that maximize average reward, none of them can reliably filter these to produce bias-optimal (or T-optimal) policies that also maximize the finite reward to absorbing goal states. This paper also presents a detailed empirical study of R-learning, an average reward reinforcement learning method, using two empirical testbeds: a stochastic grid world domain and a simulated robot environment. A detailed sensitivity analysis of R-learning is carried out to test its dependence on learning rates and exploration levels. The results suggest that R-learning is quite sensitive to exploration strategies and can fall into sub-optimal limit cycles. The performance of R-learning is also compared with that of Q-learning, the best studied discounted RL method. Here, the results suggest that R-learning can be fine-tuned to give better performance than Q-learning in both domains.
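The independent estimation of the average reward and the relative values noted above can be sketched as a Schwartz-style R-learning step; the greedy-step condition on the average-reward update and all names here are illustrative assumptions:

```python
from collections import defaultdict

def r_learning_step(R, rho, s, a, r, s2, actions, alpha=0.1, beta=0.05):
    """One R-learning update: relative action values R and the average-reward
    estimate rho are maintained as two separate estimates."""
    best_next = max(R[(s2, b)] for b in actions)
    best_here = max(R[(s, b)] for b in actions)  # before the update
    R[(s, a)] += alpha * (r - rho + best_next - R[(s, a)])
    # Update the average-reward estimate only when the step looks greedy.
    if R[(s, a)] >= best_here:
        rho += beta * (r + best_next - best_here - rho)
    return rho
```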

4.
The field of reinforcement learning (RL) has been energized in the past few decades by elegant theoretical results indicating under what conditions, and how quickly, certain algorithms are guaranteed to converge to optimal policies. However, in practical problems, these conditions are seldom met. When we cannot achieve optimality, the performance of RL algorithms must be measured empirically. Consequently, in order to meaningfully differentiate learning methods, it becomes necessary to characterize their performance on different problems, taking into account factors such as state estimation, exploration, function approximation, and constraints on computation and memory. To this end, we propose parameterized learning problems, in which such factors can be controlled systematically and their effects on learning methods characterized through targeted studies. Apart from providing very precise control of the parameters that affect learning, our parameterized learning problems enable benchmarking against optimal behavior; their relatively small sizes facilitate extensive experimentation. Based on a survey of existing RL applications, in this article, we focus our attention on two predominant, "first order" factors: partial observability and function approximation. We design an appropriate parameterized learning problem, through which we compare two qualitatively distinct classes of algorithms: on-line value function-based methods and policy search methods. Empirical comparisons among various methods within each of these classes project Sarsa(λ) and Q-learning(λ) as winners among the former, and CMA-ES as the winner in the latter. Comparing Sarsa(λ) and CMA-ES further on relevant problem instances, our study highlights regions of the problem space favoring their contrasting approaches. Short run-times for our experiments allow for an extensive search procedure that provides additional insights on relationships between method-specific parameters (such as eligibility traces, initial weights, and population sizes) and problem instances.

5.
In this work we investigate the use of a reinforcement learning (RL) framework for the autonomous navigation of a group of mini-robots in a multi-agent collaborative environment. Each mini-robot is driven by inertial forces provided by two vibration motors that are controlled by a simple and efficient low-level speed controller. The action of the RL agent is the direction of each mini-robot, and it is based on the position of each mini-robot, the distance between them, and the sign of the distance gradient between each mini-robot and the nearest one. Each mini-robot is considered a moving obstacle that must be avoided by the others. We propose a suitable state space and reward function that result in an efficient collaborative RL framework. The classical and double Q-learning algorithms are employed, the latter offering a more stable and reliable learning process for the mini-robots' policies. A simulation environment that includes a group of four mini-robots is created using the ROS framework; the dynamic models of the mini-robots and of the vibration motors are also included. Several application scenarios are simulated and the results are presented to demonstrate the performance of the proposed approach.
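The double Q-learning update maintains two value tables; a minimal tabular sketch (the paper's robots use a richer state encoding, and the function name is an illustrative assumption):

```python
import random
from collections import defaultdict

def double_q_update(QA, QB, s, a, r, s2, actions, alpha=0.1, gamma=0.9):
    # Double Q-learning: one table selects the greedy next action, the other
    # evaluates it, reducing the maximization bias of classical Q-learning.
    if random.random() < 0.5:
        QA, QB = QB, QA  # update the other table on half the steps
    a_star = max(actions, key=lambda b: QA[(s2, b)])
    QA[(s, a)] += alpha * (r + gamma * QB[(s2, a_star)] - QA[(s, a)])
```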

6.
Reinforcement learning is the problem of generating optimal behavior in a sequential decision-making environment given the opportunity of interacting with it. Many algorithms for solving reinforcement-learning problems work by computing improved estimates of the optimal value function. We extend prior analyses of reinforcement-learning algorithms and present a powerful new theorem that can provide a unified analysis of such value-function-based reinforcement-learning algorithms. The usefulness of the theorem lies in how it allows the convergence of a complex asynchronous reinforcement-learning algorithm to be proved by verifying that a simpler synchronous algorithm converges. We illustrate the application of the theorem by analyzing the convergence of Q-learning, model-based reinforcement learning, Q-learning with multistate updates, Q-learning for Markov games, and risk-sensitive reinforcement learning.

7.
Ensemble Algorithms in Reinforcement Learning (Cited by: 1; self-citations: 0; other citations: 1)
This paper describes several ensemble methods that combine multiple different reinforcement learning (RL) algorithms in a single agent. The aim is to enhance learning speed and final performance by combining the chosen actions or action probabilities of different RL algorithms. We designed and implemented four different ensemble methods combining the following five different RL algorithms: Q-learning, Sarsa, actor-critic (AC), QV-learning, and AC learning automaton. The intuitively designed ensemble methods, namely, majority voting (MV), rank voting, Boltzmann multiplication (BM), and Boltzmann addition, combine the policies derived from the value functions of the different RL algorithms, in contrast to previous work where ensemble methods have been used in RL for representing and learning a single value function. We show experiments on five maze problems of varying complexity; the first problem is simple, but the other four maze tasks are of a dynamic or partially observable nature. The results indicate that the BM and MV ensembles significantly outperform the single RL algorithms.
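Two of the combination rules can be sketched directly from the action probabilities each algorithm derives; the dict-based interface and function names here are illustrative assumptions:

```python
import math
from collections import Counter

def majority_vote(policies):
    # MV: each algorithm casts one vote for its most preferred action.
    votes = Counter(max(p, key=p.get) for p in policies)
    return votes.most_common(1)[0][0]

def boltzmann_multiplication(policies, actions):
    # BM: multiply the algorithms' action probabilities, then renormalize.
    prefs = {a: math.prod(p[a] for p in policies) for a in actions}
    z = sum(prefs.values())
    return {a: v / z for a, v in prefs.items()}
```

Multiplication sharpens the combined distribution: an action only keeps high probability if every algorithm assigns it high probability.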

8.
This paper provides an overview of the reinforcement learning and optimal adaptive control literature and its application to robotics. Reinforcement learning bridges the gap between traditional optimal control, adaptive control, and bio-inspired learning techniques borrowed from animals. This work highlights some of the key techniques presented by well-known researchers from the combined areas of reinforcement learning and optimal control theory. At the end, an example of an implementation of a novel model-free Q-learning based discrete optimal adaptive controller for a humanoid robot arm is presented. The controller uses a novel adaptive dynamic programming (ADP) reinforcement learning (RL) approach to develop an optimal policy on-line. The RL joint space tracking controller was implemented for two links (shoulder flexion and elbow flexion joints) of the arm of the humanoid Bristol-Elumotion-Robotic-Torso II (BERT II) torso. The constrained case (joint limits) of the RL scheme was tested for a single link (elbow flexion) of the BERT II arm by modifying the cost function to deal with the extra nonlinearity due to the joint constraints.

9.
Reinforcement learning (RL) is an effective method for the design of robust controllers for unknown nonlinear systems. Standard RL methods for robust control, such as actor-critic (AC) algorithms, depend on the estimation accuracy. Worst-case uncertainty requires a large state-action space, which causes overestimation and computational problems. In this article, the RL method is modified with the k-nearest neighbor and the double Q-learning algorithms. The modified RL does not need a neural estimator, as AC does, and can stabilize the unknown nonlinear system under worst-case uncertainty. The convergence property of the proposed RL method is analyzed. The simulations and the experimental results show that our modified RLs are much more robust than classic controllers, such as the proportional-integral-derivative, sliding mode, and optimal linear quadratic regulator controllers.

10.
Tang Hao, Pei Rong, Zhou Lei, Tan Qi. Acta Automatica Sinica, 2014, 40(5): 901-908
In a single-station conveyor-serviced production station (CSPS) system, reinforcement learning can be used to explore the state-action space effectively in search of a near-optimal look-ahead distance control policy. In the cooperative control problem of a multi-station CSPS system, however, the size of the state space grows exponentially (geometrically) with the number of stations and the buffer capacity, leading to the curse of dimensionality and degrading the convergence speed and optimization quality of the learning algorithm. To address this, this paper introduces state clustering on top of a local information-exchange mechanism between stations, reducing the size and complexity of each station's learning space. First, the stations are treated as relatively independent learning agents, each of which considers only the buffer state of its neighboring downstream station in its performance-value learning. Second, the original state space is partitioned into disjoint subsets, each represented by an abstract state, and a state-clustering-based multi-station feedback Q-learning algorithm is established. With this method, each station's look-ahead distance policy can be optimized over the abstract state space so as to maximize the productivity of the whole system. Simulation results show that, compared with ordinary multi-station feedback Q-learning, the state-clustering-based method not only converges faster but also improves system productivity to some extent.
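The abstract-state idea amounts to Q-learning over cluster labels; a minimal sketch, where `cluster_of` is a hypothetical mapping from a concrete buffer state to its abstract state:

```python
from collections import defaultdict

def make_aggregated_q(cluster_of):
    """Q-learning over abstract states: every concrete state is mapped to a
    cluster label first, so all states in a cluster share one row of Q."""
    Q = defaultdict(float)

    def update(s, a, r, s2, actions, alpha=0.1, gamma=0.9):
        cs, cs2 = cluster_of(s), cluster_of(s2)
        best = max(Q[(cs2, b)] for b in actions)
        Q[(cs, a)] += alpha * (r + gamma * best - Q[(cs, a)])

    def value(s, a):
        return Q[(cluster_of(s), a)]

    return update, value
```

Because states in the same cluster share a Q-row, experience gathered in one state immediately informs all states of its cluster, which is one way the reduced learning space can speed up convergence.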

11.
This article proposes a reinforcement learning procedure for mobile robot navigation using a latent-like learning schema. Latent learning refers to learning that occurs in the absence of reinforcement signals and is not apparent until reinforcement is introduced. This concept considers that part of a task can be learned before the agent receives any indication of how to perform such a task. In the proposed topological reinforcement learning agent (TRLA), a topological map is used to perform the latent learning. The propagation of the reinforcement signal throughout the topological neighborhoods of the map permits the estimation of a value function that requires on average fewer trials, and fewer updates per trial, than six of the main temporal difference reinforcement learning algorithms: Q-learning, SARSA, Q(λ)-learning, SARSA(λ), Dyna-Q, and fast Q(λ)-learning. The RL agents were tested in four different environments designed to consider a growing level of complexity in accomplishing navigation tasks. The tests suggested that the TRLA chooses shorter trajectories (in the number of steps) and/or requires fewer value-function updates in each trial than the other six reinforcement learning (RL) algorithms.

12.
A Recursive Q-Learning Algorithm with Associated Values for Limited Samples and Its Convergence Proof (Cited by: 5; self-citations: 0; other citations: 5)
A reinforcement learning agent solves problems by learning an optimal policy that maps states to actions. There are generally two routes to the optimal decision: maximizing reward, or minimizing cost. Using the optimal-cost-function approach, this paper presents a new Q-learning algorithm. Q-learning is an effective reinforcement learning method for solving Markov decision problems with incomplete information. Watkins proposed the basic Q-learning algorithm and proved the convergence of its iterative Q-value update under certain conditions; however, his algorithm does not account for the influence of the choice of initial state and initial action on subsequent learning. The recursive associated-value Q-learning algorithm proposed here therefore improves on the original Q-learning algorithm and has good convergence properties. Starting from the optimal-cost-function approach, a recursive associated-value formulation of Q-learning is given; this formulation allows many results from dynamic programming (DP) to be applied directly to the study of Q-learning.

13.
Zhang Chi, Han Guangsheng. Computer Simulation, 2005, 22(5): 189-192
To realize competition and cooperation between agents in a multi-agent system, this paper proposes a new online learning method: an improved fuzzy Q-learning method, in which an agent tunes a fuzzy inference system through reinforcement learning to obtain optimal fuzzy rules. To shorten the learning time, the reward in the Q-learning method is not fixed but varies with the state. The improved fuzzy Q-learning method is applied in the RoboCup simulation environment, enabling agents to learn positioning skills online; experiments demonstrate the effectiveness of the method.

14.
We analyze learning classifier systems in the light of tabular reinforcement learning. We note that although genetic algorithms are the most distinctive feature of learning classifier systems, it is not clear whether genetic algorithms are important to learning classifier systems. In fact, there are models which are strongly based on evolutionary computation (e.g., Wilson's XCS) and others which do not exploit evolutionary computation at all (e.g., Stolzmann's ACS). To find some clarification, we try to develop learning classifier systems "from scratch", i.e., starting from one of the best known reinforcement learning techniques, Q-learning. We first consider the basics of reinforcement learning: a problem modeled as a Markov decision process and tabular Q-learning. We introduce a formal framework to define a general purpose rule-based representation which we use to implement tabular Q-learning. We formally define generalization within rules and discuss the possible approaches to extend our rule-based Q-learning with generalization capabilities. We suggest that genetic algorithms are probably the most general approach for adding generalization, although they might not be the only solution.

15.
Reinforcement learning (RL) attracts much attention as a technique for realizing computational intelligence such as adaptive and autonomous decentralized systems. In general, however, it is not easy to put RL to practical use. One difficulty is the problem of designing a suitable action space for an agent, i.e., satisfying two requirements that are in trade-off: (i) to keep the characteristics (or structure) of the original search space as much as possible in order to seek strategies that lie close to the optimal, and (ii) to reduce the search space as much as possible in order to expedite the learning process. In order to design a suitable action space adaptively, in this article, we propose an RL model with switching controllers based on Q-learning and an actor-critic, to mimic the process of an infant’s motor development in which gross motor skills develop before fine motor skills. A method for switching controllers is then constructed by introducing and referring to the “entropy.” Further, through computational experiments using a path-planning problem with a continuous action space, the validity and potential of the proposed method have been confirmed.

16.
As an important approach to solving complex sequential decision problems, reinforcement learning (RL) has been widely studied in the community of artificial intelligence and machine learning. However, the generalization ability of RL is still an open problem and it is difficult for existing RL algorithms to solve Markov decision problems (MDPs) with both continuous state and action spaces. In this paper, a novel RL approach with fast policy search and adaptive basis function selection, which is called Continuous-action Approximate Policy Iteration (CAPI), is proposed for RL in MDPs with both continuous state and action spaces. In CAPI, based on the value functions estimated by temporal-difference learning, a fast policy search technique is suggested to search for optimal actions in continuous spaces, which is computationally efficient and easy to implement. To improve the generalization ability and learning efficiency of CAPI, two adaptive basis function selection methods are developed so that sparse approximation of value functions can be obtained efficiently both for linear function approximators and kernel machines. Simulation results on benchmark learning control tasks with continuous state and action spaces show that the proposed approach not only can converge to a near-optimal policy in a few iterations but also can obtain comparable or even better performance than Sarsa-learning, and previous approximate policy iteration methods such as LSPI and KLSPI.

17.
Reinforcement learning (RL) has been widely used as a mechanism for autonomous robots to learn state-action pairs by interacting with their environment. However, most RL methods usually suffer from slow convergence when deriving an optimum policy in practical applications. To solve this problem, a stochastic shortest path-based Q-learning (SSPQL) is proposed, combining a stochastic shortest path-finding method with Q-learning, a well-known model-free RL method. The rationale is, if a robot has an internal state-transition model which is incrementally learnt, then the robot can infer the local optimum policy by using a stochastic shortest path-finding method. By increasing state-action pair values comprising of these local optimum policies, a robot can then reach a goal quickly and as a result, this process can enhance convergence speed. To demonstrate the validity of this proposed learning approach, several experimental results are presented in this paper.

18.
An important application of reinforcement learning (RL) is to finite-state control problems and one of the most difficult problems in learning for control is balancing the exploration/exploitation tradeoff. Existing theoretical results for RL give very little guidance on reasonable ways to perform exploration. In this paper, we examine the convergence of single-step on-policy RL algorithms for control. On-policy algorithms cannot separate exploration from learning and therefore must confront the exploration problem directly. We prove convergence results for several related on-policy algorithms with both decaying exploration and persistent exploration. We also provide examples of exploration strategies that can be followed during learning that result in convergence to both optimal values and optimal policies.

19.
A Reinforcement Learning Algorithm Based on Per-Stage Average-Cost Optimality (Cited by: 4; self-citations: 0; other citations: 4)
Using the optimal-cost-function approach, this paper presents a new reinforcement learning algorithm: a reinforcement learning algorithm based on per-stage average-cost optimality. It is an effective reinforcement learning method for solving Markov decision problems with incomplete information. Starting from the method of solving the stage-wise optimal average-cost function, the paper analyzes the existence of an optimal solution, the relationship between the stage-wise optimal average-cost function and the initial state, and the associated Bellman equation. This formulation allows many results from dynamic programming (DP) to be applied directly to reinforcement learning research.

20.
Zhou Lei, Kong Feng, Tang Hao, Zhang Jianjun. Control Theory & Applications, 2011, 28(11): 1665-1670
This paper studies the optimal look-ahead distance control problem of a single-station conveyor-serviced production station (CSPS) system, with the aim of improving the system's working efficiency. The CSPS optimal control problem is modeled as a semi-Markov decision process. Since traditional Q-learning cannot directly handle the optimal control problem with the continuous look-ahead distance variable of the CSPS system, Q-value function approximation by a cerebellar model articulation controller (CMAC) network is combined with online learning techniques, yielding an online Q-learning algorithm and a model-free online policy iteration algorithm. Simulation results show that the proposed algorithms improve the learning speed and optimization accuracy.
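The CMAC-style Q-value approximation can be sketched as tile coding over the scalar look-ahead distance; the tiling counts, bounds, and function names here are illustrative assumptions, not the paper's implementation:

```python
def tile_features(x, n_tilings=4, n_tiles=8, lo=0.0, hi=1.0):
    """Map a scalar input to its active CMAC tile indices. Each tiling is
    offset slightly, so nearby inputs share most of their active tiles."""
    feats = []
    width = (hi - lo) / n_tiles
    for t in range(n_tilings):
        offset = t * width / n_tilings
        idx = int((x - lo + offset) / width)
        idx = min(idx, n_tiles)  # offsets can push past the last tile
        feats.append(t * (n_tiles + 1) + idx)
    return feats

def q_value(w, feats):
    # The approximate Q-value is the sum of the active tiles' weights.
    return sum(w[i] for i in feats)

def cmac_q_update(w, feats, target, alpha=0.1):
    # Gradient-style update: spread the TD error evenly over active tiles.
    err = target - q_value(w, feats)
    for i in feats:
        w[i] += alpha * err / len(feats)
```

Because overlapping tilings share weights between nearby inputs, an update at one look-ahead distance generalizes to its neighbors, which is the property that lets such a scheme handle the continuous variable.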


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号