20 similar documents found.
1.
The aim of this paper is to present (jointly) a series of robust, high-performance (award-winning) implementations of reinforcement learning algorithms based on temporal-difference learning and weighted k-nearest neighbors for linear function approximation. These algorithms, named kNN-TD(λ) methods, were rigorously tested at the Second and Third Annual Reinforcement Learning Competitions (RLC2008 and RLC2009), held in Helsinki and Montreal respectively, where the kNN-TD(λ) method (JAMH team) won the PolyAthlon 2008 domain, took second place in 2009, and also took second place in the Mountain-Car 2008 domain, showing that it is one of the state-of-the-art general-purpose reinforcement learning implementations. These algorithms are able to learn quickly, to generalize properly over continuous state spaces, and to be robust to a high degree of environmental noise. Furthermore, we describe a derivation of the kNN-TD(λ) algorithm for problems where the use of continuous actions has clear advantages over the use of fine-grained discrete actions: the Ex〈a〉 reinforcement learning algorithm.
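For orientation, here is a minimal sketch of the core idea the abstract describes: TD(λ) over a weighted k-nearest-neighbour feature mapping. The prototype grid, inverse-distance kernel, and parameter names are assumptions for illustration, not the authors' competition code; the continuous-action Ex〈a〉 variant is not sketched.

```python
import numpy as np

class KNNTDLambda:
    """Minimal sketch of TD(lambda) over weighted k-NN features (assumed design)."""

    def __init__(self, prototypes, k=4, gamma=0.99, lam=0.9, alpha=0.1):
        self.prototypes = np.asarray(prototypes)      # fixed points covering the state space
        self.k, self.gamma, self.lam, self.alpha = k, gamma, lam, alpha
        self.theta = np.zeros(len(self.prototypes))   # one value weight per prototype
        self.trace = np.zeros_like(self.theta)        # eligibility traces

    def features(self, state):
        """Sparse activation: inverse-distance weights on the k nearest prototypes."""
        d = np.linalg.norm(self.prototypes - state, axis=1)
        idx = np.argsort(d)[: self.k]
        w = 1.0 / (d[idx] + 1e-8)
        phi = np.zeros_like(self.theta)
        phi[idx] = w / w.sum()                        # weights sum to one
        return phi

    def value(self, state):
        return self.theta @ self.features(state)

    def update(self, s, reward, s_next, done):
        phi = self.features(s)
        target = reward if done else reward + self.gamma * self.value(s_next)
        delta = target - self.theta @ phi             # TD error
        self.trace = self.gamma * self.lam * self.trace + phi
        self.theta += self.alpha * delta * self.trace
        if done:
            self.trace[:] = 0.0
```

For control, one such approximator per discrete action can be kept and actions chosen ε-greedily over their value estimates.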
2.
We propose two algorithms for Q-learning that use the two-timescale stochastic approximation methodology. The first updates the Q-values of all feasible state-action pairs at each instant, while the second updates the Q-values of states with actions chosen according to the 'current' randomized policy updates. A proof of convergence of the algorithms is given. Finally, numerical experiments using the proposed algorithms on an application of routing in communication networks are presented for a few different settings.
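The two-timescale machinery behind these algorithms can be illustrated with a generic coupled recursion in which the fast iterate (think Q-values) uses a larger step size than the slow iterate (think randomized-policy parameters). The drifts f and g below are toy stand-ins, not the paper's Q-learning updates:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, y):
    """Toy drift of the fast loop: for fixed y, x is pulled to the equilibrium x*(y) = y."""
    return y - x

def g(x, y):
    """Toy drift of the slow loop: y is pulled toward a fixed target."""
    return np.array([1.0, 2.0, 3.0]) - y

def fast_step(n):   # a(n) = n^{-0.6}
    return 1.0 / (n + 1) ** 0.6

def slow_step(n):   # b(n) = n^{-1}, so b(n)/a(n) -> 0: y moves on the slower timescale
    return 1.0 / (n + 1)

x = np.zeros(3)     # fast iterate
y = np.zeros(3)     # slow iterate
for n in range(1, 20000):
    x = x + fast_step(n) * (f(x, y) + rng.normal(scale=0.1, size=3))
    y = y + slow_step(n) * (g(x, y) + rng.normal(scale=0.1, size=3))
print(x, y)         # both iterates approach the target [1, 2, 3]
```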
3.
This paper introduces the reinforcement learning model and presents its four main algorithms: dynamic programming, Monte Carlo methods, temporal-difference learning, and Q-learning, and points out the differences and connections among them. Finally, two applications of reinforcement learning and directions for future research are given.
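Of the four algorithms named, Q-learning is the quickest to show concretely. A standard tabular sketch, assuming a minimal environment interface with reset() and step(a):

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
    """Standard tabular Q-learning; `env` is assumed to expose reset() and step(a)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy exploration
            a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # off-policy TD target: greedy value of the next state
            target = r if done else r + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```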
4.
This work is devoted to the study of a class of recursive algorithms for blind channel identification. Using weak convergence methods, the convergence of the algorithm is obtained and the rate of convergence is ascertained. The technique discussed can also be used in the analysis of rates of convergence for decreasing step-size algorithms.
5.
We study a linear stochastic approximation algorithm that arises in the context of reinforcement learning. The algorithm employs a decreasing step-size, and is driven by Markov noise with time-varying statistics. We show that under suitable conditions, the algorithm can track the changes in the statistics of the Markov noise, as long as these changes are slower than the rate at which the step-size of the algorithm goes to zero.
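The algorithm class studied here has the generic linear form θ_{k+1} = θ_k + α_k (b(X_k) − A(X_k) θ_k) with decreasing step size α_k and Markov noise X_k. The toy chain below, whose transition probabilities drift slowly, is an assumed construction meant only to illustrate that form, not the paper's model:

```python
import numpy as np

rng = np.random.default_rng(1)

def transition_matrix(k):
    """Two-state Markov noise whose statistics drift slowly with k (toy construction)."""
    p = 0.7 + 0.2 * np.sin(2 * np.pi * k / 50000.0)
    return np.array([[p, 1 - p], [1 - p, p]])

A = [np.array([[2.0, 0.0], [0.0, 1.0]]), np.array([[1.0, 0.5], [0.5, 2.0]])]
b = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]

theta = np.zeros(2)
x = 0
for k in range(1, 200000):
    alpha_k = 1.0 / k ** 0.8                        # decreasing step size
    theta = theta + alpha_k * (b[x] - A[x] @ theta) # linear stochastic approximation step
    x = rng.choice(2, p=transition_matrix(k)[x])    # Markov noise with time-varying statistics
print(theta)   # tracks the solution of the averaged equation A_bar theta = b_bar
```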
6.
We discuss the temporal-difference learning algorithm, as applied to approximating the cost-to-go function of an infinite-horizon discounted Markov chain. The algorithm we analyze updates parameters of a linear function approximator online during a single endless trajectory of an irreducible aperiodic Markov chain with a finite or infinite state space. We present a proof of convergence (with probability one), a characterization of the limit of convergence, and a bound on the resulting approximation error. Furthermore, our analysis is based on a new line of reasoning that provides new intuition about the dynamics of temporal-difference learning. In addition to proving new and stronger positive results than those previously available, we identify the significance of online updating and potential hazards associated with the use of nonlinear function approximators. First, we prove that divergence may occur when updates are not based on trajectories of the Markov chain. This fact reconciles positive and negative results that have been discussed in the literature regarding the soundness of temporal-difference learning. Second, we present an example illustrating the possibility of divergence when temporal-difference learning is used in the presence of a nonlinear function approximator.
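A minimal sketch of the analyzed update: online TD(λ) with a linear approximator V_θ(s) = θᵀφ(s) applied along a single trajectory. The feature map, step size, and trace style are illustrative assumptions:

```python
import numpy as np

def td_lambda_linear(trajectory_features, rewards, gamma=0.95, lam=0.8, alpha=0.05):
    """Online TD(lambda) with a linear approximator along one trajectory.

    trajectory_features: feature vectors phi(s_0), phi(s_1), ... (one more than rewards)
    rewards:             rewards[t] is received on the transition s_t -> s_{t+1}
    """
    theta = np.zeros_like(trajectory_features[0], dtype=float)
    z = np.zeros_like(theta)                                 # eligibility trace
    for t in range(len(rewards)):
        phi, phi_next = trajectory_features[t], trajectory_features[t + 1]
        delta = rewards[t] + gamma * theta @ phi_next - theta @ phi   # TD error
        z = gamma * lam * z + phi                            # accumulate trace
        theta += alpha * delta * z                           # online parameter update
    return theta
```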
7.
Asymptotic properties of consensus-type algorithms for networked systems with regime-switching topologies
This paper is concerned with asymptotic properties of consensus-type algorithms for networked systems whose topologies switch randomly. The regime-switching process is modeled as a discrete-time Markov chain with a finite state space. The consensus control is achieved by using stochastic approximation methods. In the setup, the regime-switching process (the Markov chain) contains a rate parameter ε > 0 in the transition probability matrix that characterizes how frequently the topology switches. On the other hand, the consensus control algorithm uses a stepsize μ that defines how fast the network states are updated. Depending on their relative values, three distinct scenarios emerge. Under suitable conditions, we show that when 0 < ε = O(μ), a continuous-time interpolation of the iterates converges weakly to a system of randomly switching ordinary differential equations modulated by a continuous-time Markov chain. In this case a scaled sequence of tracking errors converges to a system of switching diffusions. When 0 < ε ≪ μ, the network topology is almost non-switching during consensus control transient intervals, and hence the limit dynamic system is simply an autonomous differential equation. When μ ≪ ε, the Markov chain acts as a fast-varying noise, and only its averaged network matrices are relevant, resulting in a limit differential equation that is an average with respect to the stationary measure of the Markov chain. Simulation results are presented to demonstrate these findings.
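A toy version of the recursion in question, x_{k+1} = x_k − μ L(θ_k) x_k + μ·noise, with a two-state switching topology; the particular graphs and parameter values are assumptions chosen so that ε = O(μ):

```python
import numpy as np

rng = np.random.default_rng(2)

# Two candidate topologies (graph Laplacians) for a 3-node network.
L0 = np.array([[1., -1., 0.], [-1., 2., -1.], [0., -1., 1.]])    # path graph
L1 = np.array([[2., -1., -1.], [-1., 2., -1.], [-1., -1., 2.]])  # complete graph
laplacians = [L0, L1]

eps = 0.01           # switching rate parameter in the transition matrix
mu = 0.01            # consensus stepsize (here eps = O(mu))
P = np.array([[1 - eps, eps], [eps, 1 - eps]])   # regime-switching Markov chain

x = np.array([1.0, 5.0, -2.0])                   # initial node states
regime = 0
for k in range(5000):
    noise = 0.01 * rng.normal(size=3)
    x = x - mu * laplacians[regime] @ x + mu * noise   # stochastic-approximation consensus step
    regime = rng.choice(2, p=P[regime])                # topology switches per the Markov chain
print(x)   # node states approach a common value
```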
8.
An actor-critic type reinforcement learning algorithm is proposed and analyzed for constrained controlled Markov decision processes. The analysis uses multiscale stochastic approximation theory and the 'envelope theorem' of mathematical economics.
9.
Xi-Ren. Annual Reviews in Control, 2009, 33(1): 11-24
We introduce a sensitivity-based view to the area of learning and optimization of stochastic dynamic systems. We show that this sensitivity-based view provides a unified framework for many different disciplines in this area, including perturbation analysis, Markov decision processes, reinforcement learning, identification and adaptive control, and singular stochastic control; and that this unified framework applies to both discrete event dynamic systems and continuous-time continuous-state systems. Many results in these disciplines can be simply derived and intuitively explained by using two performance sensitivity formulas. In addition, we show that this sensitivity-based view leads to new results and opens up new directions for future research. For example, the nth bias optimality of Markov processes has been established and the event-based optimization may be developed; this approach has computational and other advantages over the state-based approaches.
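For reference, the two performance sensitivity formulas alluded to are usually written as follows (standard notation from the sensitivity-based optimization literature, quoted from memory rather than from this article): P and f are the transition matrix and reward function, η the long-run average reward, π the stationary distribution, and g the performance potential.

```latex
% Performance difference formula, comparing policies with (P', f') and (P, f):
\eta' - \eta \;=\; \pi' \bigl[ (P' - P)\, g + (f' - f) \bigr]

% Performance derivative formula along a parameterized family (P_\delta, f_\delta):
\frac{d\eta_\delta}{d\delta} \;=\; \pi \Bigl[ \frac{dP_\delta}{d\delta}\, g
  + \frac{df_\delta}{d\delta} \Bigr]
```

The difference formula compares two policies directly; the derivative formula gives the gradient along a parameterized family and underlies perturbation-analysis and policy-gradient estimates.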
10.
A Generalized Kalman Filter for Fixed Point Approximation and Efficient Temporal-Difference Learning
The traditional Kalman filter can be viewed as a recursive stochastic algorithm that approximates an unknown function via a linear combination of prespecified basis functions given a sequence of noisy samples. In this paper, we generalize the algorithm to one that approximates the fixed point of an operator that is known to be a Euclidean-norm contraction. Instead of noisy samples of the desired fixed point, the algorithm updates parameters based on noisy samples of functions generated by application of the operator, in the spirit of Robbins–Monro stochastic approximation. The algorithm is motivated by temporal-difference learning, and our developments lead to a possibly more efficient variant of temporal-difference learning. We establish convergence of the algorithm and explore efficiency gains through computational experiments involving optimal stopping and queueing problems. This research was supported in part by NSF CAREER Grant ECS-9985229, and by the ONR under Grant MURI N00014-00-1-0637.
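Stripped to its essentials, the setting is a Robbins–Monro iteration driven by noisy evaluations of a Euclidean-norm contraction F rather than noisy samples of its fixed point. The sketch below uses a plain scalar step size in place of the generalized Kalman gain, so it illustrates the fixed-point setting only, not the paper's filter:

```python
import numpy as np

rng = np.random.default_rng(3)

A = np.array([[0.5, 0.2], [0.1, 0.4]])   # Euclidean (spectral) norm < 1  =>  contraction
b = np.array([1.0, -1.0])

def F(theta):
    """A Euclidean-norm contraction with fixed point theta* = (I - A)^{-1} b."""
    return A @ theta + b

theta = np.zeros(2)
for k in range(1, 50000):
    step = 1.0 / k                                        # Robbins-Monro step size
    noisy_F = F(theta) + rng.normal(scale=0.5, size=2)    # noisy sample of F(theta)
    theta = theta + step * (noisy_F - theta)

print(theta, np.linalg.solve(np.eye(2) - A, b))           # iterate vs. exact fixed point
```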
11.
This paper analyzes the problems with discounted reinforcement learning, presents comparative discounting experiments with the SARSA(λ) algorithm for MDPs, and discusses the effect of the average-reward constant on the undiscounted SARSA(λ) algorithm.
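For reference, a standard tabular SARSA(λ) sketch in its discounted form (replacing traces; the environment interface is assumed):

```python
import numpy as np

def sarsa_lambda(env, n_states, n_actions, episodes=500,
                 alpha=0.1, gamma=0.95, lam=0.9, epsilon=0.1, seed=0):
    """Tabular SARSA(lambda) with replacing traces; env exposes reset() and step(a)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))

    def policy(s):
        return rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))

    for _ in range(episodes):
        z = np.zeros_like(Q)                       # eligibility traces
        s = env.reset()
        a = policy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = policy(s_next)
            target = r if done else r + gamma * Q[s_next, a_next]   # on-policy target
            delta = target - Q[s, a]
            z[s, a] = 1.0                          # replacing trace
            Q += alpha * delta * z
            z *= gamma * lam
            s, a = s_next, a_next
    return Q
```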
12.
Convergence of learning algorithms with constant learning rates
The behavior of neural network learning algorithms with a small, constant learning rate, epsilon, in stationary, random input environments is investigated. It is rigorously established that the sequence of weight estimates can be approximated by a certain ordinary differential equation, in the sense of weak convergence of random processes as epsilon tends to zero. As applications, backpropagation in feedforward architectures and some feature extraction algorithms are studied in more detail.
13.
Shaping multi-agent systems with gradient reinforcement learning
Olivier Buffet, Alain Dutech, François Charpillet. Autonomous Agents and Multi-Agent Systems, 2007, 15(2): 197-220
An original reinforcement learning (RL) methodology is proposed for the design of multi-agent systems. In the realistic setting of situated agents with local perception, the task of automatically building a coordinated system is of crucial importance. To that end, we design simple reactive agents in a decentralized way as independent learners. But to cope with the difficulties inherent to RL used in that framework, we have developed an incremental learning algorithm where agents face a sequence of progressively more complex tasks. We illustrate this general framework by computer experiments where agents have to coordinate to reach a global goal. This work has been conducted in part in NICTA's Canberra laboratory.
14.
Asynchronous Stochastic Approximation and Q-Learning
We provide some general results on the convergence of a class of stochastic approximation algorithms and their parallel and asynchronous variants. We then use these results to study the Q-learning algorithm, a reinforcement learning method for solving Markov decision problems, and establish its convergence under conditions more general than previously available.
15.
We consider the problem of control of hierarchical Markov decision processes and develop a simulation-based two-timescale actor-critic algorithm in a general framework. We also develop certain approximation algorithms that require less computation and satisfy a performance bound. One of the approximation algorithms is a three-timescale actor-critic algorithm, while the other is a two-timescale algorithm that operates in two separate stages. All our algorithms recursively update randomized policies using the simultaneous perturbation stochastic approximation (SPSA) methodology. We briefly present the convergence analysis of our algorithms. We then present numerical experiments on a problem of production planning in semiconductor fabs, on which we compare the performance of all algorithms together with policy iteration. Algorithms based on certain Hadamard-matrix-based deterministic perturbations are found to show the best results.
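The ingredient shared by all of these algorithms is the SPSA gradient estimate: perturb every parameter simultaneously with a random ±1 vector and use just two performance evaluations, whatever the dimension. A minimal sketch of that estimator on a toy objective (the objective J, constants, and the plain gradient-descent wrapper are placeholders, not the paper's production-planning setup):

```python
import numpy as np

rng = np.random.default_rng(4)

def spsa_gradient(J, theta, c=0.1):
    """Simultaneous perturbation gradient estimate of J at theta.

    Uses one Bernoulli +/-1 perturbation and two noisy evaluations of J,
    regardless of the dimension of theta.
    """
    delta = rng.choice([-1.0, 1.0], size=theta.shape)               # Bernoulli perturbation
    return (J(theta + c * delta) - J(theta - c * delta)) / (2.0 * c * delta)

# Toy use: minimize a noisy quadratic with a decreasing step size.
def J(theta):
    return float(np.sum((theta - 3.0) ** 2) + rng.normal(scale=0.1))

theta = np.zeros(5)
for k in range(1, 2000):
    theta = theta - (1.0 / k ** 0.7) * spsa_gradient(J, theta)
print(theta)   # approaches the minimizer [3, 3, 3, 3, 3]
```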
16.
Many reinforcement learning approaches can be formulated using the theory of Markov decision processes and the associated method of dynamic programming (DP). The value of this theoretical understanding, however, is tempered by many practical concerns. One important question is whether DP-based approaches that use function approximation rather than lookup tables can avoid catastrophic effects on performance. This note presents a result of Bertsekas (1987) which guarantees that small errors in the approximation of a task's optimal value function cannot produce arbitrarily bad performance when actions are selected by a greedy policy. We derive an upper bound on performance loss that is slightly tighter than that in Bertsekas (1987), and we show the extension of the bound to Q-learning (Watkins, 1989). These results provide a partial theoretical rationale for the approximation of value functions, an issue of great practical importance in reinforcement learning.
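The bound in question is usually quoted in the following form (reproduced from the standard literature, not verbatim from this note): for discount factor γ, if an approximation \tilde V satisfies \|\tilde V - V^*\|_\infty \le \varepsilon and π is greedy with respect to \tilde V (respectively with respect to \tilde Q satisfying \|\tilde Q - Q^*\|_\infty \le \varepsilon), then

```latex
\| V^{*} - V^{\pi_{\tilde V}} \|_{\infty} \;\le\; \frac{2\gamma\varepsilon}{1-\gamma},
\qquad
\| V^{*} - V^{\pi_{\tilde Q}} \|_{\infty} \;\le\; \frac{2\varepsilon}{1-\gamma}.
```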
17.
Natural actor-critic algorithms
We present four new reinforcement learning algorithms based on actor-critic, natural-gradient and function-approximation ideas, and we provide their convergence proofs. Actor-critic reinforcement learning methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods based on policy gradients in this way are of special interest because of their compatibility with function-approximation methods, which are needed to handle large or infinite state spaces. The use of temporal difference learning in this way is of special interest because in many applications it dramatically reduces the variance of the gradient estimates. The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further reduce variance in some cases. Our results extend prior two-timescale convergence results for actor-critic methods by Konda and Tsitsiklis by using temporal difference learning in the actor and by incorporating natural gradients. Our results extend prior empirical studies of natural actor-critic methods by Peters, Vijayakumar and Schaal by providing the first convergence proofs and the first fully incremental algorithms.
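For orientation, a minimal one-step actor-critic sketch with a softmax policy and a linear TD(0) critic. The actor below follows the ordinary (vanilla) policy gradient, so it illustrates the general actor-critic template the abstract describes rather than the natural-gradient algorithms the paper actually proposes; the environment interface and feature map are assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def actor_critic(env, state_features, n_actions, episodes=500,
                 alpha_v=0.05, alpha_pi=0.01, gamma=0.95, seed=0):
    """One-step actor-critic: TD(0) critic, stochastic-gradient actor (vanilla gradient)."""
    rng = np.random.default_rng(seed)
    d = state_features(env.reset()).shape[0]   # feature dimension (assumed interface)
    v = np.zeros(d)                            # critic weights: V(s) ~ v . phi(s)
    theta = np.zeros((n_actions, d))           # actor weights: pi(.|s) = softmax(theta phi(s))

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            phi = state_features(s)
            probs = softmax(theta @ phi)
            a = rng.choice(n_actions, p=probs)
            s_next, r, done = env.step(a)
            phi_next = state_features(s_next)
            # TD(0) error from the critic
            delta = r + (0.0 if done else gamma * v @ phi_next) - v @ phi
            v += alpha_v * delta * phi                  # critic update
            # score function of the softmax policy: d log pi(a|s) / d theta
            grad_log = -np.outer(probs, phi)
            grad_log[a] += phi
            theta += alpha_pi * delta * grad_log        # actor update
            s = s_next
    return theta, v
```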
18.
We address the problem of reinforcement learning in which observations may exhibit an arbitrary form of stochastic dependence on past observations and actions, i.e. environments more general than (PO)MDPs. The task for an agent is to attain the best possible asymptotic reward where the true generating environment is unknown, but belongs to a known countable family of environments. We find some sufficient conditions on the class of environments under which an agent exists which attains the best asymptotic reward for any environment in the class. We analyze how tight these conditions are, and how they relate to different probabilistic assumptions known in reinforcement learning and related fields, such as Markov Decision Processes and mixing conditions.
19.
Alexander L. Strehl. Journal of Computer and System Sciences, 2008, 74(8): 1309-1331
Several algorithms for learning near-optimal policies in Markov Decision Processes have been analyzed and proven efficient. Empirical results have suggested that Model-based Interval Estimation (MBIE) learns efficiently in practice, effectively balancing exploration and exploitation. This paper presents a theoretical analysis of MBIE and a new variation called MBIE-EB, proving their efficiency even under worst-case conditions. The paper also introduces a new performance metric, average loss, and relates it to its less "online" cousins from the literature.
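MBIE-EB is commonly summarized as a single change to the model-based Bellman backup: an exploration bonus of β/√n(s,a) is added to the empirical reward. A hedged sketch of that backup (the value-iteration wrapper, β, and the count handling are assumptions, not the paper's exact constants):

```python
import numpy as np

def mbie_eb_backup(R_hat, T_hat, counts, gamma=0.95, beta=0.5, sweeps=200):
    """Value iteration on the empirical model plus an exploration bonus.

    R_hat[s, a]      empirical mean reward
    T_hat[s, a, s']  empirical transition probabilities
    counts[s, a]     visit counts n(s, a); max(1, n) avoids division by zero
    """
    n_states, n_actions = R_hat.shape
    Q = np.zeros((n_states, n_actions))
    bonus = beta / np.sqrt(np.maximum(counts, 1))
    for _ in range(sweeps):
        V = Q.max(axis=1)
        # MBIE-EB backup: empirical Bellman operator plus optimism bonus
        Q = R_hat + bonus + gamma * T_hat @ V
    return Q
```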
20.
Marlies Vanhulsel, Davy Janssens, Geert Wets, Koen Vanhoof. Expert Systems with Applications, 2009, 36(4): 8032-8039
The present study aims at contributing to the current state of the art of activity-based travel demand modelling by presenting a framework to simulate sequential data. To this end, the suitability of a reinforcement learning approach to reproduce sequential data is explored. Additionally, because traditional reinforcement learning techniques are not capable of learning efficiently in large state and action spaces with respect to memory and computational time requirements, nor of generalizing based on infrequent visits of all state-action pairs, the reinforcement learning technique as used in most applications is enhanced by means of regression tree function approximation. Three reinforcement learning algorithms are implemented to validate their applicability: traditional Q-learning and Q-learning with bucket-brigade updating are tested against the improved reinforcement learning approach with a CART function approximator. These methods are applied to data of 26 diary days. The results are promising and show that the proposed techniques offer great opportunities for simulating sequential data. Moreover, the reinforcement learning approach improved by introducing a regression tree function approximator learns a more optimal solution much faster than the two traditional Q-learning approaches.
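To make "regression tree function approximation" concrete, here is a generic sketch using scikit-learn's DecisionTreeRegressor as a stand-in for the CART approximator. This is a batch fitted-Q-iteration reconstruction for illustration, not the authors' online Q-learning implementation or their activity-diary feature encoding:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fitted_q_tree(transitions, n_actions, gamma=0.95, iterations=30, max_depth=6):
    """Batch fitted Q-iteration with a regression-tree function approximator.

    transitions: list of (state_vector, action_index, reward, next_state_vector, done)
    Returns a tree regressor for q(s, a) trained on inputs [state, one_hot(action)].
    """
    def xa(states, actions):
        onehot = np.eye(n_actions)[actions]
        return np.hstack([states, onehot])

    S = np.array([t[0] for t in transitions], dtype=float)
    A = np.array([t[1] for t in transitions])
    R = np.array([t[2] for t in transitions], dtype=float)
    S_next = np.array([t[3] for t in transitions], dtype=float)
    done = np.array([t[4] for t in transitions], dtype=bool)

    q = None
    targets = R.copy()
    for _ in range(iterations):
        if q is not None:
            # Greedy value of the next state under the current tree
            next_q = np.column_stack([
                q.predict(xa(S_next, np.full(len(S_next), a))) for a in range(n_actions)
            ])
            targets = R + gamma * np.where(done, 0.0, next_q.max(axis=1))
        q = DecisionTreeRegressor(max_depth=max_depth).fit(xa(S, A), targets)
    return q
```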