Similar Documents
 20 similar documents found (search time: 31 ms)
1.
A common approach to learning from delayed rewards is to use temporal difference (TD) methods for predicting future reinforcement values. They are parameterized by a recency factor λ which determines whether and how the outcomes from several consecutive time steps contribute to a single prediction update. TD(λ > 0) has been found to usually yield noticeably faster learning than TD(0), but its standard eligibility traces implementation is associated with some well known deficiencies, in particular significantly increased computation expense. This article investigates theoretically two possible ways of implementing TD(λ) without eligibility traces, both proposed by prior work. One is the TTD procedure, which efficiently approximates the effects of eligibility traces by the use of truncated TD(λ) returns. The other is experience replay, which relies on replaying TD prediction updates backwards in time. We provide novel theoretical results related to the former and present an original analysis of the effects of two variations of the latter. The ultimate effect of these investigations is a unified view of the apparently different computational techniques. This contributes to the TD(λ) research in general, by highlighting interesting relationships between several TD-based algorithms and facilitating their further analysis.
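To make the truncated-return idea concrete, here is a small sketch (not the TTD procedure itself) that computes a truncated TD(λ) return from a recorded trajectory; the data layout, the truncation depth m, and all constants are illustrative assumptions.

```python
def truncated_td_lambda_return(rewards, values, t, m, gamma, lam):
    """Truncated TD(lambda) return for time step t, using at most m future steps.

    rewards[k] is the reward received on the transition from step k to k+1;
    values[k] is the current value prediction for the state at step k.
    """
    def n_step_return(n):
        # n-step return: discounted rewards for n steps plus a bootstrapped value.
        end = min(t + n, len(rewards))
        g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
        if t + n < len(values):
            g += gamma ** n * values[t + n]
        return g

    # Weighted mixture of the 1..m-step returns, as in the lambda-return,
    # truncated after m steps (the remaining weight goes to the m-step return).
    g_lam = (1 - lam) * sum(lam ** (n - 1) * n_step_return(n) for n in range(1, m))
    g_lam += lam ** (m - 1) * n_step_return(m)
    return g_lam

# Example: a short trajectory with constant rewards and flat value predictions.
rewards = [1.0, 1.0, 1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5, 0.5, 0.5, 0.0]
print(truncated_td_lambda_return(rewards, values, t=0, m=3, gamma=0.9, lam=0.8))
```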

2.
Technical Update: Least-Squares Temporal Difference Learning
Boyan, Justin A. 《Machine Learning》, 2002, 49(2-3): 233-246.
TD(λ) is a popular family of algorithms for approximate policy evaluation in large MDPs. TD(λ) works by incrementally updating the value function after each observed transition. It has two major drawbacks: it may make inefficient use of data, and it requires the user to manually tune a stepsize schedule for good performance. For the case of linear value function approximations and λ = 0, the Least-Squares TD (LSTD) algorithm of Bradtke and Barto (1996, Machine Learning, 22(1-3): 33-57) eliminates all stepsize parameters and improves data efficiency. This paper updates Bradtke and Barto's work in three significant ways. First, it presents a simpler derivation of the LSTD algorithm. Second, it generalizes from λ = 0 to arbitrary values of λ; at the extreme of λ = 1, the resulting new algorithm is shown to be a practical, incremental formulation of supervised linear regression. Third, it presents a novel and intuitive interpretation of LSTD as a model-based reinforcement learning technique.
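As a point of reference, a minimal batch LSTD(λ) sketch with linear features is given below; the transition format, the ridge regularization, and the variable names are assumptions made for illustration, not the paper's formulation.

```python
import numpy as np

def lstd_lambda(transitions, n_features, gamma=0.95, lam=0.0, reg=1e-6):
    """Batch LSTD(lambda) for linear value-function approximation.

    transitions: list of (phi, reward, phi_next, done) with feature vectors
    of length n_features; phi_next is ignored when done is True.
    Solves A w = b with A = sum z (phi - gamma*phi')^T and b = sum z * r.
    """
    A = reg * np.eye(n_features)          # small ridge term for invertibility
    b = np.zeros(n_features)
    z = np.zeros(n_features)              # eligibility trace
    for phi, r, phi_next, done in transitions:
        z = gamma * lam * z + phi
        target_phi = np.zeros(n_features) if done else phi_next
        A += np.outer(z, phi - gamma * target_phi)
        b += z * r
        if done:
            z = np.zeros(n_features)      # reset the trace at episode boundaries
    return np.linalg.solve(A, b)          # weight vector of the value function

# Tiny two-state chain example with one-hot features.
phi0, phi1 = np.eye(2)
data = [(phi0, 0.0, phi1, False), (phi1, 1.0, phi1, True)]
print(lstd_lambda(data, n_features=2, gamma=0.9, lam=0.5))
```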

3.
Practical Issues in Temporal Difference Learning
This paper examines whether temporal difference methods for training connectionist networks, such as Sutton's TD(λ) algorithm, can be successfully applied to complex real-world problems. A number of important practical issues are identified and discussed from a general theoretical perspective. These practical issues are then examined in the context of a case study in which TD(λ) is applied to learning the game of backgammon from the outcome of self-play. This is apparently the first application of this algorithm to a complex non-trivial task. It is found that, with zero knowledge built in, the network is able to learn from scratch to play the entire game at a fairly strong intermediate level of performance, which is clearly better than conventional commercial programs, and which in fact surpasses comparable networks trained on a massive human expert data set. This indicates that TD learning may work better in practice than one would expect based on current theory, and it suggests that further analysis of TD methods, as well as applications in other complex domains, may be worth investigating.

4.
Q(λ)-learning uses TD(λ) methods to accelerate Q-learning. The update complexity of previous online Q(λ) implementations based on lookup tables is bounded by the size of the state/action space. Our faster algorithm's update complexity is bounded by the number of actions. The method is based on the observation that Q-value updates may be postponed until they are needed.
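A much-simplified illustration of the "postpone updates until needed" idea, applied here only to eligibility-trace decay rather than to the paper's full Q(λ) update; the dictionary-based storage and naming are assumptions.

```python
class LazyTraces:
    """Lazy eligibility traces: instead of decaying every trace at every step,
    record when a trace was last touched and apply the accumulated decay only
    when the trace is actually read."""

    def __init__(self, gamma, lam):
        self.decay = gamma * lam
        self.t = 0
        self.store = {}            # (state, action) -> (value, last_step)

    def tick(self):
        self.t += 1                # one time step: no sweep over all traces here

    def get(self, key):
        value, last = self.store.get(key, (0.0, self.t))
        return value * self.decay ** (self.t - last)   # decay applied on demand

    def add(self, key, amount=1.0):
        self.store[key] = (self.get(key) + amount, self.t)

# Usage: the trace is only decayed when it is read, not at every tick.
traces = LazyTraces(gamma=0.99, lam=0.9)
traces.add((0, 1))
for _ in range(5):
    traces.tick()
print(traces.get((0, 1)))          # decayed by (gamma*lam)**5, computed lazily
```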

5.
In this paper, we propose a novel approach to identify states with similar subpolicies and show how they can be integrated into the reinforcement learning framework to improve learning performance. The method utilizes a specialized tree structure to identify common action sequences of states, which are derived from possible optimal policies, and defines a similarity function between two states based on the number of such sequences. Using this similarity function, updates on the action-value function of a state are reflected onto all similar states. This allows experience that is acquired during learning to be applied to a broader context. The effectiveness of the method is demonstrated empirically.
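A toy sketch of the general idea, assuming a flat set of action sequences per state instead of the paper's tree structure: the similarity counts shared sequences and scales how strongly an update at one state is reflected onto the others. The spreading rate beta is an invented parameter.

```python
import numpy as np

def similarity(seqs_a, seqs_b):
    """Number of common action sequences between two states (a stand-in for
    the tree-derived sequence sets described in the abstract)."""
    return len({tuple(s) for s in seqs_a} & {tuple(s) for s in seqs_b})

def shared_update(Q, state_seqs, s, a, delta, alpha=0.1, beta=0.05):
    """Apply a Q-update at (s, a) and reflect it onto similar states,
    scaled by the similarity count."""
    Q[s, a] += alpha * delta
    for s2 in range(Q.shape[0]):
        if s2 != s:
            Q[s2, a] += beta * similarity(state_seqs[s], state_seqs[s2]) * delta
    return Q

# Three states; states 0 and 2 share one action sequence, so both are updated.
state_seqs = {0: [(1, 0), (0, 1)], 1: [(1, 1)], 2: [(1, 0)]}
Q = np.zeros((3, 2))
print(shared_update(Q, state_seqs, s=0, a=1, delta=1.0))
```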

6.
Incremental Multi-Step Q-Learning
Peng, Jing; Williams, Ronald J. 《Machine Learning》, 1996, 22(1-3): 283-290.
This paper presents a novel incremental algorithm that combines Q-learning, a well-known dynamic-programming based reinforcement learning method, with the TD(λ) return estimation process, which is typically used in actor-critic learning, another well-known dynamic-programming based reinforcement learning method. The parameter λ is used to distribute credit throughout sequences of actions, leading to faster learning and also helping to alleviate the non-Markovian effect of coarse state-space quantization. The resulting algorithm, Q(λ)-learning, thus combines some of the best features of the Q-learning and actor-critic learning paradigms. The behavior of this algorithm has been demonstrated through computer simulations.
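For illustration, a simplified tabular Q(λ)-style update with accumulating eligibility traces is sketched below; it follows the common "naive" variant (traces are not cut after exploratory actions) rather than the exact algorithm analyzed in the paper, and all constants are placeholders.

```python
import numpy as np

def q_lambda_update(Q, E, s, a, r, s_next, alpha=0.1, gamma=0.99, lam=0.9):
    """One Q(lambda)-style update over tabular Q-values.

    Q: (n_states, n_actions) array of action values.
    E: eligibility-trace array of the same shape (modified in place).
    """
    delta = r + gamma * np.max(Q[s_next]) - Q[s, a]   # one-step TD error
    E[s, a] += 1.0                                     # accumulating trace
    Q += alpha * delta * E                             # credit earlier state-action pairs
    E *= gamma * lam                                   # decay all traces
    return Q, E

# Toy usage on a 3-state, 2-action problem.
Q = np.zeros((3, 2))
E = np.zeros_like(Q)
Q, E = q_lambda_update(Q, E, s=0, a=1, r=1.0, s_next=2)
print(Q)
```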

7.
Continuous-Action Q-Learning
This paper presents a Q-learning method that works in continuous domains. Other characteristics of our approach are the use of an incremental topology preserving map (ITPM) to partition the input space, and the incorporation of bias to initialize the learning process. A unit of the ITPM represents a limited region of the input space and maps it onto the Q-values of M possible discrete actions. The resulting continuous action is an average of the discrete actions of the winning unit weighted by their Q-values. Then, TD(λ) updates the Q-values of the discrete actions according to their contribution. Units are created incrementally and their associated Q-values are initialized by means of domain knowledge. Experimental results in robotics domains show the superiority of the proposed continuous-action Q-learning over the standard discrete-action version in terms of both asymptotic performance and speed of learning. The paper also reports a comparison of discounted-reward against average-reward Q-learning in an infinite horizon robotics task.
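The blending of a winning unit's discrete actions into one continuous action can be sketched as follows; shifting the Q-values to be non-negative before weighting is an assumption added here so the average stays well defined even when some Q-values are negative.

```python
import numpy as np

def continuous_action(discrete_actions, q_values):
    """Blend a unit's discrete actions into one continuous action.

    discrete_actions: (M, action_dim) array of candidate actions.
    q_values: (M,) array of their Q-values in the winning unit.
    """
    w = q_values - q_values.min() + 1e-8      # non-negative weights (assumed shift)
    w = w / w.sum()
    return w @ discrete_actions               # Q-weighted average action

# Example: three discrete steering commands blended by their Q-values.
actions = np.array([[-1.0], [0.0], [1.0]])
q = np.array([0.2, 0.9, 0.4])
print(continuous_action(actions, q))
```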

8.
Linear Least-Squares Algorithms for Temporal Difference Learning
We introduce two new temporal difference (TD) algorithms based on the theory of linear least-squares function approximation. We define an algorithm we call Least-Squares TD (LS TD) for which we prove probability-one convergence when it is used with a function approximator linear in the adjustable parameters. We then define a recursive version of this algorithm, Recursive Least-Squares TD (RLS TD). Although these new TD algorithms require more computation per time-step than do Sutton's TD(λ) algorithms, they are more efficient in a statistical sense because they extract more information from training experiences. We describe a simulation experiment showing the substantial improvement in learning rate achieved by RLS TD in an example Markov prediction problem. To quantify this improvement, we introduce the TD error variance of a Markov chain, σTD, and experimentally conclude that the convergence rate of a TD algorithm depends linearly on σTD. In addition to converging more rapidly, LS TD and RLS TD do not have control parameters, such as a learning rate parameter, thus eliminating the possibility of achieving poor performance by an unlucky choice of parameters.
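A compact recursive update in the spirit of RLS TD is sketched below, using a Sherman-Morrison rank-one update so no matrix is inverted at each step; the initialization scale and the class layout are assumptions, not the paper's specification.

```python
import numpy as np

class RLSTD:
    """Recursive least-squares TD(0) sketch for linear value functions.

    Maintains P, an approximate inverse of the LSTD matrix, via the
    Sherman-Morrison identity. The initial scale 'delta' is an assumed
    regularization choice.
    """

    def __init__(self, n_features, gamma=0.95, delta=100.0):
        self.gamma = gamma
        self.P = delta * np.eye(n_features)
        self.theta = np.zeros(n_features)

    def update(self, phi, reward, phi_next):
        d = phi - self.gamma * phi_next          # TD feature difference
        Pphi = self.P @ phi
        denom = 1.0 + d @ Pphi
        # Parameter update driven by the TD error of the current estimate.
        self.theta += (Pphi / denom) * (reward - d @ self.theta)
        # Rank-one update of the inverse matrix (Sherman-Morrison).
        self.P -= np.outer(Pphi, d @ self.P) / denom
        return self.theta

# Two-state chain with one-hot features.
phi0, phi1 = np.eye(2)
agent = RLSTD(n_features=2, gamma=0.9)
agent.update(phi0, 0.0, phi1)
agent.update(phi1, 1.0, phi1)
print(agent.theta)
```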

9.
We describe a program for the display and exploration of complex, domain-specific information: ytracc, an interactive grammar debugging tool for compiler writers. The ytracc system provides the designer of a yacc grammar with a method of tracing a parser as it uses the grammar: ytracc captures the states of the parse as it is carried out. The captured parse can then be replayed forwards or backwards, step-by-step, or subtree-by-subtree, as defined by the non-terminals of the grammar. The tool has been successfully used by students as an assistant in an advanced undergraduate compiler construction class, and we use the tool in our everyday work.

10.
Reinforcement learning in continuous time and space
Doya, K. 《Neural Computation》, 2000, 12(1): 219-245.
This article presents a reinforcement learning framework for continuous-time dynamical systems without a priori discretization of time, state, and action. Based on the Hamilton-Jacobi-Bellman (HJB) equation for infinite-horizon, discounted reward problems, we derive algorithms for estimating value functions and improving policies with the use of function approximators. The process of value function estimation is formulated as the minimization of a continuous-time form of the temporal difference (TD) error. Update methods based on backward Euler approximation and exponential eligibility traces are derived, and their correspondences with the conventional residual gradient, TD(0), and TD(λ) algorithms are shown. For policy improvement, two methods are formulated: a continuous actor-critic method and a value-gradient-based greedy policy. As a special case of the latter, a nonlinear feedback control law using the value gradient and the model of the input gain is derived. Advantage updating, a model-free algorithm derived previously, is also formulated in the HJB-based framework. The performance of the proposed algorithms is first tested in a nonlinear control task of swinging a pendulum up with limited torque. It is shown in the simulations that (1) the task is accomplished by the continuous actor-critic method in a number of trials several times fewer than by the conventional discrete actor-critic method; (2) among the continuous policy update methods, the value-gradient-based policy with a known or learned dynamic model performs several times better than the actor-critic method; and (3) a value function update using exponential eligibility traces is more efficient and stable than that based on Euler approximation. The algorithms are then tested in a higher-dimensional task: cart-pole swing-up. This task is accomplished in several hundred trials using the value-gradient-based policy with a learned dynamic model.
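As a rough illustration of the continuous-time TD error described above, the snippet below discretizes delta(t) = r(t) + dV/dt - V(t)/tau with a backward-Euler derivative estimate; the step size and time constant are placeholder values, not settings from the paper.

```python
def continuous_td_error(r_t, v_t, v_prev, dt=0.01, tau=1.0):
    """Discretized continuous-time TD error:
    delta(t) = r(t) + dV/dt - V(t)/tau, with dV/dt estimated by backward Euler.
    dt and tau are placeholder constants."""
    v_dot = (v_t - v_prev) / dt        # backward-Euler time derivative of V
    return r_t + v_dot - v_t / tau

# Example: the value prediction rises slightly while a small reward arrives.
print(continuous_td_error(r_t=0.1, v_t=0.52, v_prev=0.50, dt=0.02, tau=0.5))
```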

11.
To develop a nonverbal communication channel between an operator and a system, we built a tracking system called the Adaptive Visual Attentive Tracker (AVAT) to track and zoom in on the operator's behavioral sequence, which represents his or her intention. In our system, hidden Markov models (HMMs) first roughly model the gesture pattern. The state transition probabilities of the HMMs are then used as the rewards in temporal difference (TD) learning, and TD learning is used to adjust the tracker's action model for its situated behaviors in the tracking task. Identification of the hand-sign gesture context through wavelet analysis autonomously provides a reward value for optimizing AVAT's action patterns. Experimental results of tracking the operator's hand-sign action sequences during natural walking motion with higher accuracy are shown, demonstrating the effectiveness of the proposed HMM-based TD learning algorithm in AVAT. During the TD learning experiments, randomly chosen exploratory actions sometimes exceed the predefined state area and thus involuntarily enlarge the domain of states. We describe a method using HMMs with continuous observation distributions to detect whether a state should be split to create a new state; the generation of new states makes it possible to enlarge the predefined area of states.

12.
Incremental Nearest-Neighbor Temporal Difference Learning in Continuous Spaces
For reinforcement learning problems in continuous spaces, an incremental nearest-neighbor temporal difference (TD) learning framework based on locally weighted learning is proposed. An instance dictionary is built by incrementally selecting a subset of the observed states online; the value function and policy of a newly observed state are approximated by its range-nearest-neighbor instances, and the value functions and eligibility traces of the instances in the dictionary are updated iteratively with a TD algorithm. Several design options are given for each major component of the framework, and its convergence is analyzed theoretically. Simulation experiments over 24 combinations of these options show that the SNDN combination offers good learning performance and computational efficiency.
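A minimal sketch of the instance-dictionary idea, assuming Euclidean distances, k-nearest-neighbor averaging, and no eligibility traces (which the framework described above does include); all thresholds and rates are invented placeholders.

```python
import numpy as np

class NearestNeighborTD:
    """Sketch of instance-dictionary TD learning: states observed online are
    added to a dictionary when no stored instance is nearby, and the value of
    a new state is approximated from its nearest stored instances."""

    def __init__(self, radius=0.5, k=3, alpha=0.2, gamma=0.95):
        self.radius, self.k = radius, k
        self.alpha, self.gamma = alpha, gamma
        self.states, self.values = [], []

    def _neighbors(self, s):
        if not self.states:
            return []
        d = np.linalg.norm(np.array(self.states) - s, axis=1)
        return list(np.argsort(d)[: self.k])

    def value(self, s):
        idx = self._neighbors(s)
        return float(np.mean([self.values[i] for i in idx])) if idx else 0.0

    def observe(self, s, r, s_next):
        # Grow the dictionary incrementally if s is far from all stored instances.
        if not self.states or min(np.linalg.norm(np.array(self.states) - s, axis=1)) > self.radius:
            self.states.append(np.array(s))
            self.values.append(0.0)
        delta = r + self.gamma * self.value(s_next) - self.value(s)
        for i in self._neighbors(s):          # TD update spread over the neighbors
            self.values[i] += self.alpha * delta

# Usage on a toy 2-D state space.
agent = NearestNeighborTD()
agent.observe(np.array([0.0, 0.0]), 1.0, np.array([1.0, 0.0]))
agent.observe(np.array([1.0, 0.0]), 0.0, np.array([1.0, 1.0]))
print(agent.value(np.array([0.1, 0.0])))
```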

13.
This article clarifies the relation between the learning curve and the algebraic geometrical structure of an unidentifiable learning machine such as a multilayer neural network whose true parameter set is an analytic set with singular points. By using a concept in algebraic analysis, we rigorously prove that the Bayesian stochastic complexity or the free energy is asymptotically equal to λ1 log n − (m1 − 1) log log n + constant, where n is the number of training samples and λ1 and m1 are the rational number and the natural number, which are determined as the birational invariant values of the singularities in the parameter space. Also we show an algorithm to calculate λ1 and m1 based on the resolution of singularities in algebraic geometry. In regular statistical models, 2λ1 is equal to the number of parameters and m1 = 1, whereas in nonregular models, such as multilayer networks, 2λ1 is not larger than the number of parameters and m1 ≥ 1. Since the increase of the stochastic complexity is equal to the learning curve or the generalization error, the nonidentifiable learning machines are better models than the regular ones if Bayesian ensemble learning is applied.

14.
To overcome the inability of conventional PID controllers to self-tune their parameters online, an adaptive PID controller structure and learning algorithm based on Actor-Critic (AC) learning is proposed. The controller uses AC learning to adaptively tune the PID parameters, with a single radial basis function network approximating both the Actor's policy function and the Critic's value function. The inputs to the radial basis function network are the system error and its first and second differences; the Actor maps the system state to the PID parameters, while the Critic evaluates the Actor's output and generates a temporal difference (TD) error signal. Based on the AC learning architecture and the TD-error performance index, a step-by-step design flowchart for the controller is given. Two simulation experiments show that the AC-learning-based PID controller outperforms a conventional PID controller in response speed and adaptive ability.
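A schematic sketch of the structure described above, with one RBF layer shared by an Actor head that outputs (Kp, Ki, Kd) and a Critic head trained from the TD error; the centers, gain bounds, learning rates, and the simplified Actor update are all invented placeholders, not the paper's algorithm.

```python
import numpy as np

class ACPIDSketch:
    """Schematic Actor-Critic PID tuner: one RBF layer shared by an Actor head
    (outputs Kp, Ki, Kd) and a Critic head (state value)."""

    def __init__(self, n_rbf=10, seed=0):
        rng = np.random.default_rng(seed)
        self.centers = rng.uniform(-1, 1, (n_rbf, 3))   # over (e, de, dde)
        self.width = 0.5
        self.w_actor = np.zeros((n_rbf, 3))             # -> (Kp, Ki, Kd)
        self.w_critic = np.zeros(n_rbf)

    def _features(self, x):
        d2 = ((self.centers - x) ** 2).sum(axis=1)
        return np.exp(-d2 / (2 * self.width ** 2))

    def gains(self, x):
        # Squash the Actor output into a positive gain range (assumed bounds).
        return 2.0 / (1.0 + np.exp(-self._features(x) @ self.w_actor))

    def learn(self, x, x_next, reward, gamma=0.95, lr_a=0.01, lr_c=0.05):
        phi, phi_next = self._features(x), self._features(x_next)
        td_error = reward + gamma * (phi_next @ self.w_critic) - phi @ self.w_critic
        self.w_critic += lr_c * td_error * phi            # Critic: TD(0) update
        self.w_actor += lr_a * td_error * phi[:, None]    # Actor: scaled by the TD error
        return td_error

# One schematic step: error vector (e, de, dde) before and after a control move.
tuner = ACPIDSketch()
print(tuner.gains(np.array([0.4, -0.1, 0.0])))
tuner.learn(np.array([0.4, -0.1, 0.0]), np.array([0.3, -0.1, 0.0]), reward=-0.16)
```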

15.

In this paper, we propose a temporal difference (TD) learning method, called integral TD learning, that efficiently finds solutions to continuous-time (CT) linear quadratic regulation (LQR) problems in an online fashion where the system matrix A is unknown. The idea originates from a computational reinforcement learning method known as TD(0), which is the simplest TD method in a finite Markov decision process. For the proposed integral TD method, we mathematically analyze the positive definiteness of the updated value functions, monotone convergence conditions, and stability properties concerning the locations of the closed-loop poles in terms of the learning rate and the discount factor. The proposed method includes the existing value iteration method for CT LQR problems as a special case. Finally, numerical simulations are carried out to verify the effectiveness of the proposed method and further investigate the aforementioned mathematical properties.
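To give a feel for the integral-TD idea, here is a toy policy-evaluation sketch for a scalar continuous-time system under a fixed linear policy; the closed-form state trajectory, the quadratic value parameterization, and all constants are assumptions made for illustration and do not reproduce the paper's algorithm or analysis.

```python
import numpy as np

def integral_td_scalar_lqr(a=-1.0, b=1.0, q=1.0, r=1.0, k=0.5,
                           alpha=0.1, T=0.05, eta=0.05, episodes=2000, seed=0):
    """Integral-TD-style policy evaluation sketch for the scalar CT system
    dx/dt = a*x + b*u with fixed policy u = -k*x and discounted quadratic cost.
    The value is parameterized as V(x) = p*x^2 and p is updated with a
    TD(0)-like rule driven by the cost integrated over short intervals."""
    rng = np.random.default_rng(seed)
    acl = a - b * k                      # closed-loop dynamics coefficient
    p = 0.0
    for _ in range(episodes):
        x = rng.uniform(-2.0, 2.0)       # sample a starting state
        # Integrated discounted running cost over [0, T]; closed form, since
        # x(tau) = x * exp(acl * tau) along the closed loop.
        expo = 2 * acl - alpha
        cost = (q + r * k * k) * x * x * (np.exp(expo * T) - 1.0) / expo
        x_next = x * np.exp(acl * T)
        delta = cost + np.exp(-alpha * T) * p * x_next ** 2 - p * x * x
        p += eta * delta * x * x         # TD(0)-style update of the quadratic weight
    return p

print(integral_td_scalar_lqr())          # approaches (q + r*k^2)/(alpha - 2*acl)
```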


16.
Reinforcement learning is an important machine learning method. To speed up the convergence of the learning process and to reduce the error of value-function estimation during learning, a multi-step temporal difference learning algorithm based on recursive least squares, RLS-TD(λ), is proposed. It is proved that, under certain conditions, the algorithm's weights converge with probability 1 to a unique solution, and the relation that the error of the value-function estimate must satisfy is derived and proved. Maze experiments show that, compared with the RLS-TD(0) algorithm, the proposed algorithm accelerates the convergence of the learning process, and compared with the conventional TD(λ) algorithm, it reduces the value-function estimation error and thus improves accuracy.

17.
Werfel, J.; Xie, X.; Seung, H. S. 《Neural Computation》, 2005, 17(12): 2699-2718.
Gradient-following learning methods can encounter problems of implementation in many applications, and stochastic variants are sometimes used to overcome these difficulties. We analyze three online training methods used with a linear perceptron: direct gradient descent, node perturbation, and weight perturbation. Learning speed is defined as the rate of exponential decay in the learning curves. When the scalar parameter that controls the size of weight updates is chosen to maximize learning speed, node perturbation is slower than direct gradient descent by a factor equal to the number of output units; weight perturbation is slower still by an additional factor equal to the number of input units. Parallel perturbation allows faster learning than sequential perturbation, by a factor that does not depend on network size. We also characterize how uncertainty in quantities used in the stochastic updates affects the learning curves. This study suggests that in practice, weight perturbation may be slow for large networks, and node perturbation can have performance comparable to that of direct gradient descent when there are few output units. However, these statements depend on the specifics of the learning problem, such as the input distribution and the target function, and are not universally applicable.
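A minimal weight-perturbation step for a linear perceptron, included only to make the compared update rule concrete; the Gaussian perturbation scale and the learning rate are arbitrary choices, not values from the analysis.

```python
import numpy as np

def weight_perturbation_step(W, x, target, lr=0.01, sigma=0.01, rng=None):
    """One weight-perturbation update for a linear perceptron y = W x.

    Perturb all weights with Gaussian noise, measure the change in squared
    error, and move each weight against the noise in proportion to that change.
    """
    rng = rng or np.random.default_rng(0)

    def err(Wm):
        e = Wm @ x - target
        return 0.5 * float(e @ e)

    noise = rng.normal(0.0, sigma, W.shape)
    delta_e = err(W + noise) - err(W)          # error change caused by the noise
    return W - lr * (delta_e / sigma ** 2) * noise

# Tiny example: 2 outputs, 3 inputs.
rng = np.random.default_rng(1)
W = np.zeros((2, 3))
x = np.array([1.0, 0.5, -0.5])
t = np.array([1.0, -1.0])
for _ in range(200):
    W = weight_perturbation_step(W, x, t, rng=rng)
print(W @ x)   # should move toward the target
```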

18.
The goals of perturbation analysis (PA), Markov decision processes (MDPs), and reinforcement learning (RL) are common: to make decisions to improve the system performance based on the information obtained by analyzing the current system behavior. In this paper, we study the relations among these closely related fields. We show that MDP solutions can be derived naturally from performance sensitivity analysis provided by PA. Performance potential plays an important role in both PA and MDPs; it also offers a clear intuitive interpretation for many results. Reinforcement learning, TD(λ), neuro-dynamic programming, etc., are efficient ways of estimating the performance potentials and related quantities based on sample paths. The sensitivity point of view of PA, MDP, and RL brings some new insight to the area of learning and optimization. In particular, gradient-based optimization can be applied to parameterized systems with large state spaces, and gradient-based policy iteration can be applied to some nonstandard MDPs such as systems with correlated actions, etc. Potential-based on-line approaches and their advantages are also discussed.

19.
Hidden Markov model (HMM) classifier design is considered for the analysis of sequential data, incorporating both labeled and unlabeled data for training; the balance between the use of labeled and unlabeled data is controlled by an allocation parameter λ ∈ (0, 1), where λ = 0 corresponds to purely supervised HMM learning (based only on the labeled data) and λ = 1 corresponds to unsupervised HMM-based clustering (based only on the unlabeled data). The associated estimation problem can typically be reduced to solving a set of fixed-point equations in the form of a "natural-parameter homotopy." This paper applies a homotopy method to track a continuous path of solutions, starting from a local supervised solution (λ = 0) to a local unsupervised solution (λ = 1). The homotopy method is guaranteed to track with probability one from λ = 0 to λ = 1 if the λ = 0 solution is unique; this condition is not satisfied for the HMM since the maximum likelihood supervised solution (λ = 0) is characterized by many local optima. A modified form of the homotopy map for HMMs assures a track from λ = 0 to λ = 1. Following this track leads to a formulation for selecting λ ∈ (0, 1) for a semisupervised solution, and it also provides a tool for selection from among multiple local-optimal supervised solutions. The results of applying the proposed method to measured and synthetic sequential data verify its robustness and feasibility compared to the conventional EM approach for semisupervised HMM training.

20.
陈子璇; 章宗长; 潘致远; 张琳婧. 《软件学报》, 2021, 32(11): 3496-3511.
In recent years, how to generate policies that generalize has become one of the hot topics in deep reinforcement learning, and many related results have emerged; a representative work is the Generalized Value Iteration Network (GVIN). GVIN is a planning network model that can operate on irregular graphs. It uses a special graph convolution operator to approximate the state transition matrix, so that after learning the structural information of an irregular graph it can plan through a value iteration process and thereby produce generalizable policies on tasks with irregular graph structures. However, because planning time is not allocated according to the importance of states, every iteration in GVIN is executed synchronously over all states of the entire state space. When the state space is large, such synchronous updates degrade the network's planning performance. This work studies GVIN further with the idea of asynchronous updates. By defining state priorities in the value iteration process and performing asynchronous value updates, a new asynchronous planning network model, the Generalized Asynchronous Value Iteration Network, is proposed. On unknown tasks with irregular structures, the Generalized Asynchronous Value Iteration Network has a more efficient and more effective planning process than GVIN. Furthermore, the reinforcement learning algorithm and the graph convolution operator of GVIN are improved, and the effectiveness of these improvements is verified by path-planning experiments on irregular graphs and real maps.
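To make the asynchronous idea concrete, here is a tiny prioritized value-iteration loop on a tabular MDP (not the network model of the paper): the state with the largest Bellman residual is updated first instead of sweeping the whole state space synchronously. The MDP layout and all constants are illustrative assumptions.

```python
import heapq
import numpy as np

def async_value_iteration(P, R, gamma=0.95, eps=1e-6, max_updates=10000):
    """Prioritized asynchronous value iteration on a tabular MDP.

    P: (n_actions, n_states, n_states) transition probabilities.
    R: (n_states, n_actions) expected rewards.
    """
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)

    def backup(s):
        return max(R[s, a] + gamma * P[a, s] @ V for a in range(n_actions))

    # Priority queue keyed by negative residual (heapq is a min-heap).
    heap = [(-abs(backup(s) - V[s]), s) for s in range(n_states)]
    heapq.heapify(heap)
    for _ in range(max_updates):
        neg_res, s = heapq.heappop(heap)
        if -neg_res < eps:
            break                          # largest known residual is small enough
        V[s] = backup(s)
        # Re-insert fresh residuals for all states of this small example;
        # a real implementation would track predecessors instead.
        for sp in range(n_states):
            heapq.heappush(heap, (-abs(backup(sp) - V[sp]), sp))
    return V

# Two-state, two-action toy MDP.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],     # action 0: stay
              [[0.0, 1.0], [1.0, 0.0]]])    # action 1: switch
R = np.array([[0.0, 1.0], [1.0, 0.0]])
print(async_value_iteration(P, R))
```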
