Similar Documents
Found 20 similar documents (search time: 46 ms)
1.
Technical Update: Least-Squares Temporal Difference Learning   (cited 2 times: 0 self-citations, 2 by others)
Boyan, Justin A. Machine Learning, 2002, 49(2-3): 233-246
TD(λ) is a popular family of algorithms for approximate policy evaluation in large MDPs. TD(λ) works by incrementally updating the value function after each observed transition. It has two major drawbacks: it may make inefficient use of data, and it requires the user to manually tune a stepsize schedule for good performance. For the case of linear value function approximations and λ = 0, the Least-Squares TD (LSTD) algorithm of Bradtke and Barto (1996, Machine Learning, 22:1–3, 33–57) eliminates all stepsize parameters and improves data efficiency. This paper updates Bradtke and Barto's work in three significant ways. First, it presents a simpler derivation of the LSTD algorithm. Second, it generalizes from λ = 0 to arbitrary values of λ; at the extreme of λ = 1, the resulting new algorithm is shown to be a practical, incremental formulation of supervised linear regression. Third, it presents a novel and intuitive interpretation of LSTD as a model-based reinforcement learning technique.
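For concreteness, here is a minimal numpy sketch of LSTD(λ) in its commonly cited form; the episode layout, the feature map phi, and the small ridge term eps are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def lstd_lambda(episodes, phi, n_features, gamma=1.0, lam=0.0, eps=1e-6):
    """Sketch of LSTD(lambda): accumulate A and b over observed
    transitions, then solve A theta = b once for the value weights."""
    A = eps * np.eye(n_features)      # small ridge term for invertibility
    b = np.zeros(n_features)
    for episode in episodes:          # episode: list of (s, r, s_next); s_next is None at the end
        z = np.zeros(n_features)      # eligibility vector
        for s, r, s_next in episode:
            f = phi(s)
            f_next = phi(s_next) if s_next is not None else np.zeros(n_features)
            z = gamma * lam * z + f
            A += np.outer(z, f - gamma * f_next)
            b += z * r
    return np.linalg.solve(A, b)      # theta, with V(s) ~ phi(s) @ theta
```

With lam=0 this reduces to Bradtke and Barto's LSTD; with lam=1 the accumulated system behaves like incremental linear regression on observed returns, matching the abstract's remark.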

2.
Q(λ)-learning uses TD(λ)-methods to accelerate Q-learning. The update complexity of previous online Q(λ) implementations based on lookup tables is bounded by the size of the state/action space. Our faster algorithm's update complexity is bounded by the number of actions. The method is based on the observation that Q-value updates may be postponed until they are needed.
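A heavily simplified sketch of the postponement idea follows. The class name, the bookkeeping, and the replacing-trace convention are assumptions; the real algorithm also renormalizes the running sums to avoid under/overflow, which this sketch omits:

```python
from collections import defaultdict

class LazyQLambda:
    """Sketch of 'postponed' Q(lambda) updates: instead of touching every
    traced entry each step, keep one global running sum of decayed TD
    errors and fold it into a Q-value only when that value is read.
    Illustrative only -- (gamma*lam)**t under/overflows on long runs."""

    def __init__(self, alpha=0.1, gamma=0.99, lam=0.9):
        self.alpha, self.gamma, self.lam = alpha, gamma, lam
        self.q = defaultdict(float)    # stored (possibly stale) Q-values
        self.trace_start = {}          # (s,a) -> step its replacing trace began
        self.synced_at = {}            # (s,a) -> global step at last sync
        self.sums = [0.0]              # sums[t] = sum_{k<t} alpha*delta_k*(gamma*lam)**k
        self.t = 0                     # global step counter

    def read(self, sa):
        """Return an up-to-date Q-value, applying postponed updates."""
        if sa in self.trace_start:
            decay0 = (self.gamma * self.lam) ** self.trace_start[sa]
            pending = self.sums[self.t] - self.sums[self.synced_at[sa]]
            self.q[sa] += pending / decay0
            self.synced_at[sa] = self.t
        return self.q[sa]

    def step(self, sa, delta):
        """Record one transition: (re)start the visited pair's trace and
        append this step's decayed TD error to the global sum."""
        self.read(sa)                  # flush old pending updates first
        self.trace_start[sa] = self.t  # replacing trace: reset to 1
        self.synced_at[sa] = self.t
        self.sums.append(self.sums[-1]
                         + self.alpha * delta * (self.gamma * self.lam) ** self.t)
        self.t += 1
```

Each step costs O(1) plus one read per action compared, rather than a sweep over every traced state/action pair.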

3.
We consider policy evaluation algorithms within the context of infinite-horizon dynamic programming problems with discounted cost. We focus on discrete-time dynamic systems with a large number of states, and we discuss two methods, which use simulation, temporal differences, and linear cost function approximation. The first method is a new gradient-like algorithm involving least-squares subproblems and a diminishing stepsize, which is based on the λ-policy iteration method of Bertsekas and Ioffe. The second method is the LSTD(λ) algorithm recently proposed by Boyan, which for λ = 0 coincides with the linear least-squares temporal-difference algorithm of Bradtke and Barto. At present, there is only a convergence result by Bradtke and Barto for the LSTD(0) algorithm. Here, we strengthen this result by showing the convergence of LSTD(λ), with probability 1, for every λ ∈ [0, 1].
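In the standard notation (assumed here; the paper's own symbols may differ), the LSTD(λ) estimate whose almost-sure convergence is established is the solution of a linear system accumulated from simulated transitions:

```latex
% Eligibility vector z_t, features \phi, reward r_t, discount \gamma.
\begin{aligned}
z_t &= \gamma\lambda\, z_{t-1} + \phi(s_t), \\
A_T &= \sum_{t=0}^{T-1} z_t \bigl(\phi(s_t) - \gamma\,\phi(s_{t+1})\bigr)^{\!\top},
\qquad
b_T = \sum_{t=0}^{T-1} z_t\, r_t, \\
\theta_T &= A_T^{-1} b_T,
\qquad
V(s) \approx \phi(s)^{\top}\theta_T .
\end{aligned}
```

The convergence result says that θ_T tends, with probability 1, to the TD(λ) fixed point for every λ ∈ [0, 1].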

4.
Learning to Play Chess Using Temporal Differences   (cited 4 times: 0 self-citations, 0 by others)
Baxter, Jonathan; Tridgell, Andrew; Weaver, Lex. Machine Learning, 2000, 40(3): 243-263
In this paper we present TDLEAF(λ), a variation on the TD(λ) algorithm that enables it to be used in conjunction with game-tree search. We present some experiments in which our chess program KnightCap used TDLEAF(λ) to learn its evaluation function while playing on Internet chess servers. The main success we report is that KnightCap improved from a 1650 rating to a 2150 rating in just 308 games and 3 days of play. As a reference, a rating of 1650 corresponds to about level B human play (on a scale from E (1000) to A (1800)), while 2150 is human master level. We discuss some of the reasons for this success, principal among them being the use of on-line play, rather than self-play. We also investigate whether TDLEAF(λ) can yield better results in the domain of backgammon, where TD(λ) has previously yielded striking success.
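A sketch of the TDLeaf(λ) weight update implied by this description, assuming a differentiable evaluation function whose minimax leaf values and gradients have already been collected along one game (names and array conventions are assumptions):

```python
import numpy as np

def tdleaf_update(w, leaf_values, leaf_grads, alpha=1e-3, lam=0.7):
    """Sketch of TDLeaf(lambda).
    leaf_values[t] -- evaluation J(x_t, w) of the principal-variation leaf
                      reached from position t by minimax search; the final
                      entry should be the game outcome, so the last temporal
                      difference pulls evaluations toward the true result.
    leaf_grads[t]  -- gradient of leaf_values[t] w.r.t. the weights w,
                      for t = 0 .. len(leaf_values)-2."""
    d = np.diff(np.asarray(leaf_values))   # d[t] = J(x_{t+1}) - J(x_t)
    N = len(d)
    future = np.zeros(N)                   # future[t] = sum_j lam^{j-t} d[j]
    acc = 0.0
    for t in range(N - 1, -1, -1):         # accumulate backwards
        acc = d[t] + lam * acc
        future[t] = acc
    for t in range(N):                     # w += alpha * grad_t * lambda-sum
        w = w + alpha * leaf_grads[t] * future[t]
    return w
```

The only difference from plain TD(λ) is that values and gradients are taken at the leaf of the principal variation rather than at the root position itself.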

5.
Continuous-Action Q-Learning   (cited 1 time: 0 self-citations, 1 by others)
This paper presents a Q-learning method that works in continuous domains. Other characteristics of our approach are the use of an incremental topology preserving map (ITPM) to partition the input space, and the incorporation of bias to initialize the learning process. A unit of the ITPM represents a limited region of the input space and maps it onto the Q-values of M possible discrete actions. The resulting continuous action is an average of the discrete actions of the winning unit weighted by their Q-values. Then, TD(λ) updates the Q-values of the discrete actions according to their contribution. Units are created incrementally and their associated Q-values are initialized by means of domain knowledge. Experimental results in robotics domains show the superiority of the proposed continuous-action Q-learning over the standard discrete-action version in terms of both asymptotic performance and speed of learning. The paper also reports a comparison of discounted-reward against average-reward Q-learning in an infinite horizon robotics task.
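The action-merging and credit-assignment rules described above can be sketched as follows; the non-negativity shift is an assumption made so the Q-weighted average is well defined, and the paper's exact weighting may differ:

```python
import numpy as np

def continuous_action(unit_q, unit_actions):
    """Continuous action = Q-value-weighted average of the winning
    unit's M discrete actions (a sketch, not the paper's exact rule)."""
    w = unit_q - unit_q.min() + 1e-8       # shift so weights are non-negative
    return (w * unit_actions).sum() / w.sum()

def update_unit(unit_q, td_error, alpha=0.1):
    """Distribute the TD error over the discrete actions in proportion
    to their contribution to the executed continuous action."""
    w = unit_q - unit_q.min() + 1e-8
    contrib = w / w.sum()
    return unit_q + alpha * td_error * contrib
```

The winning unit is simply the ITPM node whose region contains the current input; only that unit's Q-values participate in the merge and the update.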

6.
We present a new definition of optimality intervals for the parametric right-hand side linear programming (parametric RHS LP) problem z(λ) = min{cᵀx | Ax = b + λb̄, x ≥ 0}. We then show that an optimality interval consists either of a breakpoint or of the open interval between two consecutive breakpoints of the continuous piecewise linear convex function z(λ). As a consequence, the optimality intervals form a partition of the closed interval {λ : |z(λ)| < ∞}. Based on these optimality intervals, we also introduce an algorithm for solving the parametric RHS LP problem which requires an LP solver as a subroutine. If a polynomial-time LP solver is used to implement this subroutine, we obtain a substantial improvement on the complexity of those parametric RHS LP instances which exhibit degeneracy. When the number of breakpoints of z(λ) is polynomial in terms of the size of the parametric problem, we show that the latter can be solved in polynomial time. This research was partially funded by the United States Navy-Office of Naval Research under Contract N00014-87-K-0202. Its financial support is gratefully acknowledged.
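One way to see the piecewise linear convex shape of z(λ) and its breakpoints is simply to evaluate it on a grid with an off-the-shelf LP solver; this is a visualization sketch, not the breakpoint-enumeration algorithm of the paper:

```python
import numpy as np
from scipy.optimize import linprog

def z_of_lambda(c, A, b, b_bar, lambdas):
    """Evaluate z(lam) = min{c^T x | Ax = b + lam*b_bar, x >= 0} on a
    grid of lambda values; infeasible instances map to +inf."""
    values = []
    for lam in lambdas:
        res = linprog(c, A_eq=A, b_eq=b + lam * b_bar, bounds=(0, None))
        values.append(res.fun if res.success else np.inf)
    return np.array(values)
```

Plotting the returned values against lambdas shows the convex piecewise linear graph; the kinks are the breakpoints whose intervals the paper characterizes.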

7.
This paper presents aut, a modern Automath checker. It is a straightforward re-implementation of the Zandleven Automath checker from the seventies. It was implemented about five years ago, in the programming language C. It accepts both the AUT-68 and AUT-QE dialects of Automath. This program was written to restore a damaged version of Jutting's translation of Landau's Grundlagen. Some notable features: It is fast. On a 1 GHz machine it will check the full Jutting formalization (736 K of non-whitespace Automath source) in 0.6 seconds. Its implementation of λ-terms does not use named variables or de Bruijn indices (the two common approaches) but instead uses a graph representation. In this representation variables are represented by pointers to a binder. The program can compile an Automath text into one big Automath single line-style λ-term. It outputs such a term using de Bruijn indices. (These λ-terms cannot be checked by modern systems like Coq or Agda, because the λ-typed λ-calculi of de Bruijn are different from the Π-typed λ-calculi of modern type theory.) The source of aut is freely available on the Web.
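A toy version of the binder-pointer representation, in Python rather than aut's C: each variable occurrence holds a pointer to its binder, so no names or de Bruijn indices are needed and α-equivalence questions reduce to pointer comparisons.

```python
class Lam:
    """Binder node; the body is filled in after the binder exists so
    variables inside it can point back at this object."""
    def __init__(self):
        self.body = None

class Var:
    """Variable occurrence: just a pointer to the Lam that binds it."""
    def __init__(self, binder):
        self.binder = binder

class App:
    def __init__(self, fun, arg):
        self.fun, self.arg = fun, arg

def identity():
    """Build \x. x without ever naming x."""
    lam = Lam()
    lam.body = Var(lam)
    return lam

# (\x. x) (\y. y) -- two structurally equal but distinct subgraphs
term = App(identity(), identity())
```

Substitution in this representation is graph rewiring: an occurrence is replaced by (a copy of) the argument graph, with no capture-avoiding renaming to perform.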

8.
Certain tasks, such as formal program development and theorem proving, fundamentally rely upon the manipulation of higher-order objects such as functions and predicates. Computing tools intended to assist in performing these tasks are at present inadequate in both the amount of knowledge they contain (i.e., the level of support they provide) and in their ability to learn (i.e., their capacity to enhance that support over time). The application of a relevant machine learning technique, explanation-based generalization (EBG), has thus far been limited to first-order problem representations. We extend EBG to generalize higher-order values, thereby enabling its application to higher-order problem encodings. Logic programming provides a uniform framework in which all aspects of explanation-based generalization and learning may be defined and carried out. First-order Horn logics (e.g., Prolog) are not, however, well suited to higher-order applications. Instead, we employ λProlog, a higher-order logic programming language, as our basic framework for realizing higher-order EBG. In order to capture the distinction between domain theory and training instance upon which EBG relies, we extend λProlog with the necessity operator of modal logic. We develop a meta-interpreter realizing EBG for this extended language, and provide examples of higher-order EBG.

9.
Continuation passing style (CPS) translations of typed λ-calculi have numerous applications. However, the range of these applications has been confined by the fact that CPS translations are known for non-dependent type systems only, thus excluding well-known systems like the calculus of constructions (CC) and the logical frameworks (LF). This paper presents techniques for CPS translating systems with dependent types, with an emphasis on pure type-theoretical applications. In the first part of the paper we review several lines of work in which the need for CPS translations of dependent type systems has arisen, and discuss the difficulties involved with CPS translating such systems. One way of overcoming these difficulties is to work with so-called domain-free type systems. Thus, instead of Barendregt's λ-cube we shall consider the domain-free λ-cube, and instead of traditional pure type systems, we shall consider domain-free pure type systems. We therefore begin the second part by reviewing the domain-free λ-cube, which includes domain-free versions of CC and LF, and then present CPS translations for all the systems of the domain-free λ-cube. We also introduce Direct Style (DS) (i.e., inverse CPS) translations for all the systems of the domain-free λ-cube; such DS translations, which have been used in a number of applications, were previously formulated for untyped and simply-typed languages only. In the third part we review domain-free pure type systems and generalize the CPS translations of the domain-free λ-cube to a large class of domain-free pure type systems which includes most of the systems that appear in the literature, including those of the domain-free λ-cube. Many translations that appear in the literature arise as special cases of ours. In the fourth part of the paper we present two approaches to CPS translations of traditional pure type systems. The first, indirect, technique lifts the CPS translation of domain-free pure type systems to the analogous class of traditional pure type systems by using results that relate derivations in domain-free and traditional pure type systems. The second, direct, approach translates derivations, requiring a certain order on derivations to be well-founded. Both techniques yield translations for most of the systems that appear in the literature, including those of Barendregt's λ-cube.
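As a concrete reference point, here is a sketch of the untyped call-by-name CPS translation that these typed translations refine (Plotkin-style equations in the comments); the dependent-type bookkeeping that is the paper's actual subject is omitted:

```python
# [x]     = \k. x k
# [\x. M] = \k. k (\x. [M])
# [M N]   = \k. [M] (\m. m [N] k)
from dataclasses import dataclass

@dataclass
class Var:
    name: str

@dataclass
class Lam:
    param: str
    body: object

@dataclass
class App:
    fun: object
    arg: object

_counter = 0
def fresh(prefix):
    """Generate a continuation/value variable not used elsewhere."""
    global _counter
    _counter += 1
    return f"{prefix}{_counter}"

def cps(term):
    k = fresh("k")
    if isinstance(term, Var):
        return Lam(k, App(Var(term.name), Var(k)))
    if isinstance(term, Lam):
        return Lam(k, App(Var(k), Lam(term.param, cps(term.body))))
    if isinstance(term, App):
        m = fresh("m")
        return Lam(k, App(cps(term.fun),
                          Lam(m, App(App(Var(m), cps(term.arg)), Var(k)))))
    raise TypeError(term)
```

The difficulty the paper addresses is precisely that, once types may depend on terms, translations like this one must also act coherently on the types, which is where the domain-free formulation helps.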

10.
We introduce a calculus which is a direct extension of both the λ and the π calculi. We give a simple type system for it, that encompasses both Curry's type inference for the λ-calculus, and Milner's sorting for the π-calculus as particular cases of typing. We observe that the various continuation passing style transformations for λ-terms, written in our calculus, actually correspond to encodings already given by Milner and others for evaluation strategies of λ-terms into the π-calculus. Furthermore, the associated sortings correspond to well-known double negation translations on types. Finally we provide an adequate CPS transform from our calculus to the π-calculus. This shows that the latter may be regarded as an assembly language, while our calculus seems to provide a better programming notation for higher-order concurrency. We conclude by discussing some alternative design decisions.

11.
When interpolating incomplete data, one can choose a parametric model, or opt for a more general approach and use a non-parametric model which allows a very large class of interpolants. A popular non-parametric model for interpolating various types of data is based on regularization, which looks for an interpolant that is both close to the data and also smooth in some sense. Formally, this interpolant is obtained by minimizing an error functional which is the weighted sum of a fidelity term and a smoothness term. The classical approach to regularization is: select optimal weights λ (also called hyperparameters) that should be assigned to these two terms, and minimize the resulting error functional. However, using only the optimal weights does not guarantee that the chosen function will be optimal in some sense, such as the maximum likelihood criterion, or the minimal square error criterion. For that, we have to consider all possible weights. The approach suggested here is to use the full probability distribution on the space of admissible functions, as opposed to the probability induced by using a single combination of weights. The reason is as follows: the weight actually determines the probability space in which we are working. For a given weight λ, the probability of a function f is proportional to exp(−λ ∫ f²ᵤᵤ du), for the case of a function of one variable, where fᵤᵤ denotes the second derivative. For each different λ, there is a different solution to the restoration problem; denote it by f_λ. Now, if we had known λ, it would not be necessary to use all the weights; however, all we are given are some noisy measurements of f, and we do not know the correct λ. Therefore, the mathematically correct solution is to calculate, for every λ, the probability that f was sampled from a space whose probability is determined by λ, and average the different f_λ's weighted by these probabilities. The same argument holds for the noise variance, which is also unknown. Three basic problems are addressed in this work: Computing the MAP estimate, that is, the function f maximizing Pr(f|D) when the data D is given; this problem is reduced to a one-dimensional optimization problem. Computing the MSE estimate, defined at each point x as ∫ f(x) Pr(f|D) df; this problem is reduced to computing a one-dimensional integral. In the general setting, the MAP estimate is not equal to the MSE estimate. Computing the pointwise uncertainty associated with the MSE solution; this problem is reduced to computing three one-dimensional integrals.
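The averaging step described above can be sketched as follows, with a discrete second-difference matrix standing in for the smoothness penalty; computing the per-λ posterior weights is the substance of the paper and is taken here as a given input:

```python
import numpy as np

def mse_estimate(y, S, D2, lambdas, weights):
    """Sketch of averaging over hyperparameters: for each candidate
    weight lam, compute the regularized solution
        f_lam = argmin ||S f - y||^2 + lam * ||D2 f||^2
              = (S^T S + lam * D2^T D2)^{-1} S^T y,
    then return the average of the f_lam weighted by the (given)
    posterior probabilities of each lam."""
    solutions = []
    for lam in lambdas:
        M = S.T @ S + lam * (D2.T @ D2)
        solutions.append(np.linalg.solve(M, S.T @ y))
    solutions = np.stack(solutions)            # shape: (n_lambdas, n_points)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                            # normalize the weights
    return (w[:, None] * solutions).sum(axis=0)
```

Here S is the sampling matrix picking out the measured points and D2 the second-difference operator on the reconstruction grid; a single well-chosen lam gives the classical regularized solution, while the weighted sum gives the MSE-style estimate argued for in the abstract.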

12.
A text is a triple τ = (λ, ρ₁, ρ₂) such that λ is a labeling function, and ρ₁ and ρ₂ are linear orders on the domain of λ; hence τ may be seen as a word (λ, ρ₁) together with an additional linear order ρ₂ on the domain of λ. The order ρ₂ is used to give to the word (λ, ρ₁) its individual hierarchical representation (syntactic structure), which may be a tree but may also be more general than a tree. In this paper we introduce context-free grammars for texts and investigate their basic properties. Since each text has its own individual structure, the role of such a grammar should be that of a definition of a pattern common to all individual texts. This leads to the notion of a shapely context-free text grammar, also investigated in this paper.

13.
Coordinating Multiple Agents via Reinforcement Learning   (cited 2 times: 0 self-citations, 2 by others)
In this paper, we attempt to use reinforcement learning techniques to solve agent coordination problems in task-oriented environments. The Fuzzy Subjective Task Structure model (FSTS) is presented to model general agent coordination. We show that an agent coordination problem modeled in FSTS is a Decision-Theoretic Planning (DTP) problem, to which reinforcement learning can be applied. Two learning algorithms, coarse-grained and fine-grained, are proposed to address agents' coordination behavior at two different levels. The coarse-grained algorithm operates at one level and tackles hard system constraints, while the fine-grained algorithm operates at another level and handles soft constraints. We argue that it is important to explicitly model and explore coordination-specific (particularly system-constraint) information, which underpins the two algorithms and contributes to their effectiveness. The algorithms are formally proved to converge and experimentally shown to be effective.

14.
For compact Euclidean bodies P, Q, we define λ(P, Q) to be the smallest ratio r/s where r > 0, s > 0 satisfy sQ′ ⊆ P ⊆ rQ″. Here sQ denotes a scaling of Q by the factor s, and Q′, Q″ are some translates of Q. This function gives us a new distance function between bodies which, unlike previously studied measures, is invariant under affine transformations. If homothetic bodies are identified, the logarithm of this function is a metric. (Two bodies are homothetic if one can be obtained from the other by scaling and translation.) For integer k ≥ 3, define λ(k) to be the minimum value such that for each convex polygon P there exists a convex k-gon Q with λ(P, Q) ≤ λ(k). Among other results, we prove that 2.118… ≤ λ(3) ≤ 2.25 and λ(k) = 1 + Θ(k⁻²). We give an O(n² log² n)-time algorithm which, for any input convex n-gon P, finds a triangle T that minimizes λ(T, P) among triangles. However, in linear time we can find a triangle t with λ(t, P) ≤ 2.25. Our study is motivated by the attempt to reduce the complexity of the polygon containment problem, and also the motion-planning problem. In each case we describe algorithms which run faster when certain implicit slackness parameters of the input are bounded away from 1. These algorithms illustrate a new algorithmic paradigm in computational geometry for coping with complexity. Work of all authors was partially supported by the ESPRIT II Basic Research Actions Program of the EC under Contract No. 3075 (project ALCOM). Rudolf Fleischer and Kurt Mehlhorn acknowledge also DFG (Grant SPP Me 620/6). Chee Yap acknowledges also DFG (Grant Be 142/46-1) and NSF (Grants DCR-84-01898 and CCR-87-03458). This research was performed when Günter Rote and Chee Yap were at the Freie Universität Berlin.

15.
This paper examines the transformation of call-by-need terms into continuation-passing style (CPS). It begins by presenting a simple transformation of call-by-need terms into program graphs and a reducer for such graphs. From this, an informal derivation is carried out, resulting in a translation from terms into self-reducing program graphs, where the graphs are represented as CPS terms involving storage operations. Though informal, the derivation proceeds in simple steps, and the resulting translation is taken to be our canonical CPS transformation for call-by-need terms. In order to define the CPS transformation more formally, two alternative presentations are given. The first takes the form of a continuation semantics for the call-by-need language. The second presentation follows Danvy and Hatcliff's two-stage decomposition of the call-by-name CPS transformation, resulting in a similar two-stage CPS transformation for call-by-need. Finally, a number of practical matters are considered, including an improvement to eliminate the so-called administrative redexes, as well as to avoid unnecessary memoization and take advantage of strictness information. These improvements make it feasible to consider potential applications in compilers for call-by-need programming languages. Supported in part by the National Science Foundation under PYI grant #CCR-9057567, with matching funds from Bell Northern Research. Supported by an AT&T Ph.D. scholarship.
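The storage operations that the CPS terms make explicit correspond, operationally, to memoized thunks; a minimal sketch of that behavior (not the paper's graph reducer):

```python
class Thunk:
    """Call-by-need via memoization: the first force computes and
    stores the value, later forces reuse it."""
    def __init__(self, compute):
        self.compute = compute
        self.forced = False
        self.value = None

    def force(self):
        if not self.forced:
            self.value = self.compute()   # evaluate at most once
            self.compute = None           # drop the closure (allows GC)
            self.forced = True
        return self.value

# The argument is evaluated once even though it is used twice:
calls = []
arg = Thunk(lambda: (calls.append("evaluated"), 21)[1])
result = arg.force() + arg.force()        # 42; calls == ["evaluated"]
```

The strictness-based improvement mentioned in the abstract amounts to skipping the thunk entirely where the argument is known to be demanded.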

16.
Linear Least-Squares Algorithms for Temporal Difference Learning   (cited 8 times: 2 self-citations, 6 by others)
We introduce two new temporal difference (TD) algorithms based on the theory of linear least-squares function approximation. We define an algorithm we call Least-Squares TD (LS TD) for which we prove probability-one convergence when it is used with a function approximator linear in the adjustable parameters. We then define a recursive version of this algorithm, Recursive Least-Squares TD (RLS TD). Although these new TD algorithms require more computation per time-step than do Sutton's TD(λ) algorithms, they are more efficient in a statistical sense because they extract more information from training experiences. We describe a simulation experiment showing the substantial improvement in learning rate achieved by RLS TD in an example Markov prediction problem. To quantify this improvement, we introduce the TD error variance of a Markov chain, σ_TD, and experimentally conclude that the convergence rate of a TD algorithm depends linearly on σ_TD. In addition to converging more rapidly, LS TD and RLS TD do not have control parameters, such as a learning rate parameter, thus eliminating the possibility of achieving poor performance by an unlucky choice of parameters.
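A sketch of the RLS TD recursion for the linear case, maintaining the inverse matrix incrementally via the Sherman-Morrison identity rather than re-solving the least-squares system each step (the initialization constant p0 is an illustrative choice):

```python
import numpy as np

class RLSTD:
    """Recursive Least-Squares TD sketch for linear value functions."""
    def __init__(self, n_features, gamma=1.0, p0=100.0):
        self.theta = np.zeros(n_features)
        self.P = p0 * np.eye(n_features)   # running estimate of A^{-1}
        self.gamma = gamma

    def update(self, phi, reward, phi_next):
        d = phi - self.gamma * phi_next    # feature temporal difference
        Pphi = self.P @ phi
        denom = 1.0 + d @ Pphi             # Sherman-Morrison denominator
        td_err = reward - d @ self.theta
        self.theta += Pphi * (td_err / denom)
        self.P -= np.outer(Pphi, d @ self.P) / denom
```

Each update costs O(n²) in the number of features, against the O(1)-per-feature cost of TD(λ), which is the computation/statistical-efficiency trade-off the abstract describes.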

17.
The method described in Ref. 1 does not always correctly establish the bonds between the variables. In fact, during the reduction to normal form of the λ-formula ((λty)(λyy)), and of all those in which the same variable that is free in the left subformula of an application occurs bound in the right subformula, this variable is wrongly considered as bound. To prevent this, it is necessary to modify the levels assigned to the formulas in the β-generation. We therefore give the correct β-generation statements and the correct algorithm of the β-generation. See Ref. 1.

18.
In many distributed-memory parallel computers and high-speed communication networks, the exact structure of the underlying communication network may be ignored. These systems assume that the network creates a complete communication graph between the processors, in which passing messages is associated with communication latencies. In this paper we explore the impact of communication latencies on the design of broadcasting algorithms for fully connected message-passing systems. For this purpose, we introduce the postal model, which incorporates a communication latency parameter λ ≥ 1. This parameter measures the inverse of the ratio between the time it takes an originator of a message to send the message and the time that passes until the recipient of the message receives it. We present an optimal algorithm for broadcasting one message in systems with n processors and communication latency λ, the running time of which is Θ((λ log n)/log(λ + 1)). For broadcasting m ≥ 1 messages, we first examine several generalizations of the algorithm for broadcasting one message and then analyze a family of broadcasting algorithms based on degree-d trees. All the algorithms described in this paper are practical event-driven algorithms that preserve the order of messages.
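Under the postal model, the number of processors that can be informed by time t obeys a simple recurrence, from which the stated Θ((λ log n)/log(λ + 1)) broadcast time follows; a sketch assuming integral λ:

```python
def informed(t, lam):
    """Maximum number of processors that can know the message t time
    units after the broadcast starts, with latency lam: a sender is free
    again after 1 unit, and its copy arrives lam units after sending,
    giving f(t) = f(t-1) + f(t-lam)."""
    f = [1] * lam                     # only the originator knows it for t < lam
    for i in range(lam, t + 1):
        f.append(f[i - 1] + f[i - lam])
    return f[t]

def broadcast_time(n, lam):
    """Smallest t with informed(t) >= n; grows as lam*log(n)/log(lam+1)."""
    t = 0
    while informed(t, lam) < n:
        t += 1
    return t

# lam = 1 reduces to doubling: broadcast_time(2**k, 1) == k
```

For λ = 1 the recurrence is f(t) = 2 f(t−1), the familiar binomial-tree broadcast; larger λ flattens the optimal tree, exactly as the degree-d analysis in the paper suggests.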

19.
Tracking Context Changes through Meta-Learning   (cited 6 times: 0 self-citations, 0 by others)
Widmer, Gerhard. Machine Learning, 1997, 27(3): 259-286
The article deals with the problem of learning incrementally (on-line) in domains where the target concepts are context-dependent, so that changes in context can produce more or less radical changes in the associated concepts. In particular, we concentrate on a class of learning tasks where the domain provides explicit clues as to the current context (e.g., attributes with characteristic values). A general two-level learning model is presented that effectively adjusts to changing contexts by trying to detect (via meta-learning) contextual clues and using this information to focus the learning process. Context learning and detection occur during regular on-line learning, without separate training phases for context recognition. Two operational systems based on this model are presented that differ in the underlying learning algorithm and in the way they use contextual information: METAL(B) combines meta-learning with a Bayesian classifier, while METAL(IB) is based on an instance-based learning algorithm. Experiments with synthetic domains as well as a number of real-world problems show that the algorithms are robust in a variety of dimensions, and that meta-learning can produce substantial increases in accuracy over simple object-level learning in situations with changing contexts.

20.
Practical Issues in Temporal Difference Learning   (cited 18 times: 10 self-citations, 8 by others)
This paper examines whether temporal difference methods for training connectionist networks, such as Sutton's TD(λ) algorithm, can be successfully applied to complex real-world problems. A number of important practical issues are identified and discussed from a general theoretical perspective. These practical issues are then examined in the context of a case study in which TD(λ) is applied to learning the game of backgammon from the outcome of self-play. This is apparently the first application of this algorithm to a complex non-trivial task. It is found that, with zero knowledge built in, the network is able to learn from scratch to play the entire game at a fairly strong intermediate level of performance, which is clearly better than conventional commercial programs, and which in fact surpasses comparable networks trained on a massive human expert data set. This indicates that TD learning may work better in practice than one would expect based on current theory, and it suggests that further analysis of TD methods, as well as applications in other complex domains, may be worth investigating.
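For reference, the classic TD(λ) update with accumulating eligibility traces, sketched for a linear approximator (the backgammon network of the paper would use gradient traces in place of raw features):

```python
import numpy as np

def td_lambda_episode(phi, episode, theta, alpha=0.01, gamma=1.0, lam=0.7):
    """One episode of TD(lambda) with accumulating traces."""
    z = np.zeros_like(theta)
    for s, r, s_next in episode:          # s_next is None at the terminal step
        f = phi(s)
        v = f @ theta
        v_next = phi(s_next) @ theta if s_next is not None else 0.0
        delta = r + gamma * v_next - v    # TD error
        z = gamma * lam * z + f           # decay and accumulate the trace
        theta = theta + alpha * delta * z
    return theta
```

In the self-play setting of the paper, r is zero except at the end of the game, so learning is driven entirely by the final outcome propagating back through the traces.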
