Similar Documents
20 similar documents found (search time: 31 ms)
1.
The mean-square asymptotic behavior of temporal-difference learning algorithms with constant step-sizes and linear function approximation is analyzed in this paper. The analysis is carried out for the case of discounted cost function associated with a Markov chain with a finite dimensional state-space. Under mild conditions, an upper bound for the asymptotic mean-square error of these algorithms is determined as a function of the step-size. Moreover, under the same assumptions, it is also shown that this bound is linear in the step size. The main results of the paper are illustrated with examples related to M/G/1 queues and nonlinear AR models with Markov switching. Editor: Robert Schapire  相似文献   
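To make the analyzed setting concrete, the following is a minimal Python sketch of TD(0) with linear function approximation and a constant step size on a finite Markov chain; the function name, arguments, and random initial state are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def td0_constant_step(P, costs, phi, gamma, alpha, n_steps, seed=0):
    """TD(0) with linear function approximation and a constant step size.

    P      : (S, S) transition matrix of the finite Markov chain
    costs  : (S,) per-state cost g(s)
    phi    : (S, d) feature matrix; row s is the feature vector of state s
    gamma  : discount factor
    alpha  : constant step size (the asymptotic mean-square error bound
             discussed above grows linearly in this parameter)
    """
    rng = np.random.default_rng(seed)
    S, d = phi.shape
    theta = np.zeros(d)
    s = rng.integers(S)
    for _ in range(n_steps):
        s_next = rng.choice(S, p=P[s])
        # temporal-difference error for the discounted cost
        delta = costs[s] + gamma * phi[s_next] @ theta - phi[s] @ theta
        theta = theta + alpha * delta * phi[s]   # constant-step update
        s = s_next
    return theta
```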

2.
The stochastic approximation problem is to find roots or minimizers of a nonlinear function whose expression is unknown and whose evaluations are contaminated with noise. To accelerate the classical Robbins–Monro (RM) algorithm, this paper proposes a new three-term combined-direction stochastic approximation algorithm, together with a general framework, in which the iterative direction is a weighted combination of the current noisy gradient and several previous noisy gradients. Both the almost sure convergence and the asymptotic rate of convergence of the new algorithms are established. Numerical experiments show that the new algorithm outperforms the RM algorithm and another existing combined-direction algorithm.  相似文献
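As a rough illustration of the idea (not the paper's exact algorithm), the sketch below forms the search direction from a weighted combination of the current noisy gradient and the two previous ones; the weights, the padding of the gradient history, and the 1/k step-size schedule are assumptions.

```python
import numpy as np

def combined_direction_sa(noisy_grad, x0, n_iters, weights=(0.6, 0.3, 0.1)):
    """Robbins-Monro-style iteration whose search direction is a weighted
    combination of the current noisy gradient and the two previous ones.
    `noisy_grad(x)` returns a noise-corrupted gradient estimate at x; the
    weights and the 1/k step-size schedule are illustrative choices."""
    x = np.asarray(x0, dtype=float)
    history = [noisy_grad(x)] * 3          # pad so the first steps are defined
    for k in range(1, n_iters + 1):
        g = noisy_grad(x)
        history = [g] + history[:2]        # keep the three most recent gradients
        direction = sum(w * h for w, h in zip(weights, history))
        x = x - (1.0 / k) * direction      # classical RM step size a_k = 1/k
    return x
```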

3.
We discuss the temporal-difference learning algorithm, as applied to approximating the cost-to-go function of an infinite-horizon discounted Markov chain. The algorithm we analyze updates parameters of a linear function approximator online during a single endless trajectory of an irreducible aperiodic Markov chain with a finite or infinite state space. We present a proof of convergence (with probability one), a characterization of the limit of convergence, and a bound on the resulting approximation error. Furthermore, our analysis is based on a new line of reasoning that provides new intuition about the dynamics of temporal-difference learning. In addition to proving new and stronger positive results than those previously available, we identify the significance of online updating and potential hazards associated with the use of nonlinear function approximators. First, we prove that divergence may occur when updates are not based on trajectories of the Markov chain. This fact reconciles positive and negative results that have been discussed in the literature, regarding the soundness of temporal-difference learning. Second, we present an example illustrating the possibility of divergence when temporal difference learning is used in the presence of a nonlinear function approximator  相似文献   
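A minimal sketch of the analyzed setting, online TD(λ) with a linear approximator updated along a single trajectory, is given below; `sample_next`, `cost`, and `features` are hypothetical callbacks standing in for the Markov chain and the basis functions.

```python
import numpy as np

def td_lambda_online(sample_next, cost, features, theta0, gamma, lam, alpha,
                     s0, n_steps):
    """Online TD(lambda) with linear function approximation along a single
    trajectory.  `sample_next(s)` draws the next state of the Markov chain,
    `cost(s)` is the per-stage cost, and `features(s)` returns the feature
    vector of state s."""
    theta = np.array(theta0, dtype=float)
    z = np.zeros_like(theta)               # eligibility trace
    s = s0
    for _ in range(n_steps):
        s_next = sample_next(s)
        phi_s = features(s)
        delta = cost(s) + gamma * features(s_next) @ theta - phi_s @ theta
        z = gamma * lam * z + phi_s        # decay and accumulate the trace
        theta = theta + alpha * delta * z
        s = s_next
    return theta
```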

4.
The traditional Kalman filter can be viewed as a recursive stochastic algorithm that approximates an unknown function via a linear combination of prespecified basis functions given a sequence of noisy samples. In this paper, we generalize the algorithm to one that approximates the fixed point of an operator that is known to be a Euclidean norm contraction. Instead of noisy samples of the desired fixed point, the algorithm updates parameters based on noisy samples of functions generated by application of the operator, in the spirit of Robbins–Monro stochastic approximation. The algorithm is motivated by temporal-difference learning, and our developments lead to a possibly more efficient variant of temporal-difference learning. We establish convergence of the algorithm and explore efficiency gains through computational experiments involving optimal stopping and queueing problems. This research was supported in part by NSF CAREER Grant ECS-9985229, and by the ONR under Grant MURI N00014-00-1-0637.  相似文献   
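The sketch below illustrates the general idea, stochastic approximation of the fixed point of a contraction observed only through noisy evaluations, using a simple Robbins–Monro gradient step; it omits the Kalman-style gain matrix and is not the paper's exact recursion.

```python
import numpy as np

def contraction_fixed_point_sa(noisy_F, Phi, r0, n_iters):
    """Approximate the fixed point of a Euclidean-norm contraction F within
    the span of the basis functions (columns of Phi), using only noisy
    evaluations of F, via simple decreasing-step Robbins-Monro updates."""
    r = np.array(r0, dtype=float)
    for k in range(1, n_iters + 1):
        target = noisy_F(Phi @ r)                # noisy sample of F(Phi r)
        grad = Phi.T @ (Phi @ r - target)        # least-squares residual direction
        r = r - (1.0 / k) * grad                 # decreasing-step RM update
    return r
```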

5.
In this paper, iterative learning control is introduced to solve the consensus tracking problem for a multi-agent system with random noises and a limited measurement range. A distributed learning control algorithm is proposed in which each agent uses the measured information of its nearest neighbours from previous iterations. With the help of the stochastic approximation technique, we first establish almost sure consensus convergence of the input sequences for a fixed topology as the iteration number increases. We then extend the results to the case of switching topologies that change dynamically along the time axis. Illustrative simulations verify the effectiveness of the proposed algorithms.  相似文献
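A hedged sketch of one iteration of such a distributed P-type learning update is shown below; the adjacency-matrix form of the neighbour error and the decreasing gain `a_k` are illustrative assumptions, not the paper's exact law.

```python
import numpy as np

def distributed_ilc_step(u, y, A, a_k):
    """One iteration of a distributed P-type learning update for consensus
    tracking: each agent corrects its input using measured output differences
    of its neighbours from the previous iteration.

    u   : (N, T) inputs of the N agents over the horizon at iteration k
    y   : (N, T) corresponding measured outputs (possibly noisy / range-limited)
    A   : (N, N) adjacency matrix of the communication graph
    a_k : decreasing learning gain, e.g. a_k = a0 / k
    """
    N = u.shape[0]
    u_next = u.copy()
    for i in range(N):
        # neighbour-relative tracking information available to agent i
        e_i = sum(A[i, j] * (y[j] - y[i]) for j in range(N))
        u_next[i] = u[i] + a_k * e_i
    return u_next
```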

6.
We provide an analytical comparison between discounted and average reward temporal-difference (TD) learning with linearly parameterized approximations. We first consider the asymptotic behavior of the two algorithms. We show that as the discount factor approaches 1, the value function produced by discounted TD approaches the differential value function generated by average reward TD. We further argue that if the constant function—which is typically used as one of the basis functions in discounted TD—is appropriately scaled, the transient behaviors of the two algorithms are also similar. Our analysis suggests that the computational advantages of average reward TD that have been observed in some prior empirical work may have been caused by inappropriate basis function scaling rather than fundamental differences in problem formulations or algorithms.  相似文献   

7.
An iterative learning control algorithm with an iteration-decreasing gain is proposed for stochastic point-to-point tracking systems. The almost sure convergence and asymptotic properties of the proposed recursive algorithm are rigorously proved, and a method for selecting the learning gain matrix is given. An illustrative example shows the effectiveness and the asymptotic trajectory properties of the proposed approach.  相似文献
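For illustration, a minimal sketch of a P-type update with an iteration-decreasing gain is given below; the gain schedule a_k = a0/k and the matrix `L` mapping point errors to input corrections are assumptions for the sketch.

```python
import numpy as np

def point_to_point_ilc(u_k, e_points, L, k, a0=1.0):
    """P-type update with an iteration-decreasing gain a_k = a0 / k for
    stochastic point-to-point tracking: only the errors measured at the
    specified tracking instants (e_points) enter the update, and the
    decreasing gain averages out the measurement noise over iterations.

    u_k      : current input trajectory (flattened)
    e_points : tracking errors at the point-to-point instants of trial k
    L        : learning gain matrix mapping point errors to input corrections
    """
    a_k = a0 / k                        # iteration-decreasing gain
    return u_k + a_k * (L @ e_points)

# Example call with illustrative dimensions:
# u_next = point_to_point_ilc(np.zeros(50), np.array([0.2, -0.1]), np.zeros((50, 2)), k=3)
```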

8.
As an important approach to solving complex sequential decision problems, reinforcement learning (RL) has been widely studied in the community of artificial intelligence and machine learning. However, the generalization ability of RL is still an open problem and it is difficult for existing RL algorithms to solve Markov decision problems (MDPs) with both continuous state and action spaces. In this paper, a novel RL approach with fast policy search and adaptive basis function selection, which is called Continuous-action Approximate Policy Iteration (CAPI), is proposed for RL in MDPs with both continuous state and action spaces. In CAPI, based on the value functions estimated by temporal-difference learning, a fast policy search technique is suggested to search for optimal actions in continuous spaces, which is computationally efficient and easy to implement. To improve the generalization ability and learning efficiency of CAPI, two adaptive basis function selection methods are developed so that sparse approximation of value functions can be obtained efficiently both for linear function approximators and kernel machines. Simulation results on benchmark learning control tasks with continuous state and action spaces show that the proposed approach not only can converge to a near-optimal policy in a few iterations but also can obtain comparable or even better performance than Sarsa-learning, and previous approximate policy iteration methods such as LSPI and KLSPI.  相似文献   
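The following sketch gives the flavour of a fast continuous-action policy search: a coarse grid over the action interval followed by repeated local refinement around the best grid point. It is only a stand-in for the CAPI search technique, and `q_value` is a hypothetical estimated action-value function built from the TD-learned value function.

```python
import numpy as np

def fast_policy_search(q_value, s, a_low, a_high, n_grid=21, n_refine=5):
    """Continuous-action policy search by coarse gridding of the action
    interval followed by repeated local refinement around the best grid
    point.  `q_value(s, a)` is an estimated action value."""
    lo, hi = a_low, a_high
    best = 0.5 * (a_low + a_high)
    for _ in range(n_refine):
        grid = np.linspace(lo, hi, n_grid)
        best = grid[np.argmax([q_value(s, a) for a in grid])]
        width = (hi - lo) / (n_grid - 1)
        lo, hi = best - width, best + width   # zoom in around the maximiser
    return best
```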

9.
Research on learning control algorithms for a class of systems with output saturation   Total citations: 1 (self-citations: 0, by others: 1)
Sensor saturation is one of the more common physical constraints in control systems. This paper proposes two learning control algorithms for a class of constrained systems with saturated outputs. Specifically, for plants that operate repetitively, an open-loop P-type iterative learning controller is first designed to achieve perfect tracking of the desired trajectory over a finite time interval; the convergence of the algorithm is analyzed in the λ-norm sense, and a convergence condition is given for iterative learning control systems with saturated outputs. Then, for plants whose desired trajectory is a periodic signal, a closed-loop P-type repetitive learning control algorithm is proposed and its convergence condition is analyzed. Finally, a simulation example verifies the effectiveness of the proposed algorithms.  相似文献
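A minimal sketch of an open-loop P-type law driven by a saturated output measurement might look as follows; the scalar gain `gamma` and the symmetric saturation level are illustrative assumptions.

```python
import numpy as np

def sat(y, y_max):
    """Symmetric sensor saturation."""
    return np.clip(y, -y_max, y_max)

def saturated_output_ilc(u_k, y_desired, y_measured, gamma):
    """Open-loop P-type learning law u_{k+1} = u_k + gamma * e_k in which the
    tracking error e_k is computed from the saturated output measurement
    y_measured = sat(y_k); gamma is a scalar learning gain here."""
    e_k = y_desired - y_measured
    return u_k + gamma * e_k
```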

10.
徐昕  沈栋  高岩青  王凯 《自动化学报》2012,38(5):673-687
Learning control of dynamic systems based on Markov decision processes (MDPs) has in recent years become an interdisciplinary research direction spanning machine learning, control theory, and operations research; its main goal is data-driven multi-stage optimal control of systems whose models are complex or uncertain. This paper surveys the research frontier in the theory, algorithms, and applications of MDP-based learning control for dynamic systems, with emphasis on advances in reinforcement learning (RL) and approximate dynamic programming (ADP), including temporal-difference learning theory, value-function approximation methods for MDPs with continuous state and action spaces, direct policy search and approximate policy iteration, and adaptive critic designs. Finally, applications and development trends in the related research areas are analyzed and discussed.  相似文献

11.
Iterative learning control for constrained linear systems   Total citations: 1 (self-citations: 0, by others: 1)
This article considers iterative learning control (ILC) for linear systems with convex control input constraints. First, the constrained ILC problem is formulated in a novel successive projection framework. Then, based on this projection method, two algorithms are proposed to solve this constrained ILC problem. The results show that, when perfect tracking is possible, both algorithms can achieve perfect tracking. The two algorithms differ, however, in that one algorithm needs much less computation than the other. When perfect tracking is not possible, both algorithms can exhibit a form of practical convergence to a ‘best approximation’. The effect of weighting matrices on the performance of the algorithms is also discussed and finally, numerical simulations are given to demonstrate the effectiveness of the proposed methods.  相似文献   
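For a box input constraint, one projection-based ILC step of the kind described could be sketched as below; the paper treats general convex constraint sets, so the `np.clip` projection and the gain matrix `L` are simplifying assumptions.

```python
import numpy as np

def constrained_ilc_update(u_k, e_k, L, u_min, u_max):
    """Projection-style constrained ILC step: take a nominal learning update,
    then project the result back onto the input-constraint set.  A box
    constraint is used here, so the projection is a simple clip."""
    u_unconstrained = u_k + L @ e_k                 # nominal ILC update
    return np.clip(u_unconstrained, u_min, u_max)   # Euclidean projection onto the box
```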

12.
Guaranteed-cost iterative learning algorithm design based on the LMI approach   Total citations: 4 (self-citations: 0, by others: 4)
This paper studies the performance-based design and optimization of iterative learning algorithms. A quadratic performance function in the iteration domain is first defined; an iteration-domain optimal iterative learning algorithm is then given for linear discrete-time systems, and, based on the linear matrix inequality (LMI) approach, a guaranteed-cost iterative learning algorithm and its optimization method are given for uncertain linear discrete-time systems. For both classes of algorithms, the convergence rate of the iterative learning can be tuned simply by adjusting the weighting matrices in the performance function. Moreover, the design and optimization of the guaranteed-cost algorithm can be carried out conveniently with the MATLAB toolboxes.  相似文献
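The abstract does not spell out the performance index; a representative iteration-domain quadratic cost of the kind it describes (the symbols and exact form are assumed) is

```latex
J_{k+1} = e_{k+1}^{\top} Q \, e_{k+1} + (u_{k+1} - u_{k})^{\top} R \, (u_{k+1} - u_{k}),
\qquad Q \succeq 0, \; R \succ 0,
```

where e_{k+1} is the lifted tracking-error vector at iteration k+1 and u_{k+1} - u_k is the input change between trials; adjusting the weighting matrices Q and R in such a cost is what tunes the convergence rate of the learning.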

13.
Asymptotic properties of sign algorithms for adaptive filtering   Total citations: 2 (self-citations: 0, by others: 2)
This paper develops asymptotic properties of a class of sign-error algorithms with expanding truncation bounds for adaptive filtering. Under merely stationary ergodicity and finite second moments of the reference and output signals, and using the trajectory-subsequence (TS) method, it is proved that the algorithm converges almost surely. Then, a mean-squares estimate is derived for the estimation error, and a suitably scaled sequence of the estimation error is shown to converge to a diffusion process. The scaling factor together with the stationary covariance gives the rate-of-convergence result. Moreover, an algorithm under a mean-squares criterion with expanding truncation bounds is also examined. Compared with existing results in the literature, the sufficient conditions for almost sure convergence are much relaxed. A simple example is provided for demonstration purposes.  相似文献
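A plain sign-error adaptive filter, the core of the algorithm class analyzed here, can be sketched as follows; the expanding-truncation device of the paper is omitted and all names are illustrative.

```python
import numpy as np

def sign_error_filter(x, d, mu, n_taps):
    """Sign-error adaptive filter: the weight update uses only the sign of
    the estimation error, which keeps the per-sample cost low and the
    algorithm robust.  x is the reference signal, d the desired (output)
    signal, mu the step size."""
    w = np.zeros(n_taps)
    for n in range(n_taps, len(x)):
        u = x[n - n_taps:n][::-1]          # most recent n_taps reference samples
        e = d[n] - w @ u                   # estimation error
        w = w + mu * np.sign(e) * u        # sign-error update
    return w
```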

14.
The aim of this paper is to present (jointly) a series of robust, high-performance (award-winning) implementations of reinforcement learning algorithms based on temporal-difference learning and weighted k-nearest neighbors for linear function approximation. These algorithms, named kNN‐TD(λ) methods, were rigorously tested at the Second and Third Annual Reinforcement Learning Competitions (RLC2008 and RLC2009), held in Helsinki and Montreal respectively, where the kNN‐TD(λ) method (JAMH team) won the PolyAthlon 2008 domain, obtained second place in 2009, and also obtained second place in the Mountain-Car 2008 domain, showing that it is one of the state-of-the-art general-purpose reinforcement learning implementations. These algorithms are able to learn quickly, to generalize properly over continuous state spaces, and to be robust to a high degree of environmental noise. Furthermore, we describe a derivation of the kNN‐TD(λ) algorithm for problems where the use of continuous actions has clear advantages over the use of fine-grained discrete actions: the Exa〉 reinforcement learning algorithm.  相似文献
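A sketch of the weighted k-nearest-neighbour value estimate that underlies such methods is given below; the inverse-squared-distance weighting and the prototype-state representation are assumptions for illustration, and the returned sparse activation vector would play the role of the feature vector in a TD(λ) update.

```python
import numpy as np

def knn_value(theta, prototypes, s, k=4):
    """Weighted k-nearest-neighbour value estimate: the value at state s is a
    distance-weighted average over the parameters of its k nearest prototype
    states.  Returns the estimate and the sparse activation vector."""
    dist = np.linalg.norm(prototypes - s, axis=1)
    idx = np.argsort(dist)[:k]
    w = 1.0 / (1.0 + dist[idx] ** 2)       # inverse-squared-distance weighting
    w = w / w.sum()
    phi = np.zeros(len(prototypes))
    phi[idx] = w
    return phi @ theta, phi
```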

15.
This paper studies the asymptotic behavior of the solutions of a class of neutral stochastic systems with variable delay. Using Lyapunov functions, the Itô formula, and the supermartingale convergence theorem, sufficient criteria are obtained for almost sure asymptotic stability and pth-moment asymptotic stability, almost sure polynomial asymptotic stability and pth-moment polynomial asymptotic stability, and almost sure exponential stability and pth-moment exponential stability of the solutions. Compared with classical stochastic stability results, the criteria established here make full use of the stochastic perturbation term and do not require the diffusion operator LV to be negative definite.  相似文献

16.
Research on gradient algorithms for reinforcement learning with neural networks   Total citations: 11 (self-citations: 1, by others: 11)
徐昕  贺汉根 《计算机学报》2003,26(2):227-233
For Markov decision problems with continuous state and discrete action spaces, a new gradient-descent reinforcement learning algorithm is proposed that uses a multilayer feed-forward neural network for value-function approximation. The algorithm adopts a nearly greedy, continuously differentiable Boltzmann-distribution action-selection policy and approximates the optimal value function of the Markov decision process by minimizing a sum-of-squared-Bellman-residuals performance index under a non-stationary behavior policy. The convergence of the algorithm and the performance of the resulting near-optimal policy are analyzed theoretically, and simulation studies on the Mountain-Car learning control problem further verify the learning efficiency and generalization performance of the algorithm.  相似文献
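Two small building blocks of this kind of algorithm are sketched below: a differentiable Boltzmann action-selection rule and the single-transition squared Bellman residual. The multilayer-network value approximator itself is omitted and all names are illustrative.

```python
import numpy as np

def boltzmann_policy(q_values, temperature):
    """Nearly greedy, continuously differentiable Boltzmann (softmax) action
    probabilities over a discrete action set."""
    prefs = np.asarray(q_values, dtype=float) / temperature
    prefs -= prefs.max()                    # numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def squared_bellman_residual(q_sa, q_next_max, reward, gamma):
    """Squared Bellman residual for a single transition; the algorithm
    descends the gradient of the sum of such residuals through the
    parameters of the feed-forward network."""
    return (reward + gamma * q_next_max - q_sa) ** 2
```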

17.
For a class of periodically time-varying systems whose input contains a dead-zone nonlinearity, a robust repetitive controller is designed for the case where the periodic time-varying parameters cannot be parameterized. A differential adaptive law is used to estimate the unknown dead-zone parameters, and the remaining bounded terms are eliminated by a robust method; to avoid chattering, a saturation function is used in place of the sign function. When the system output tracks a periodic trajectory, the non-parameterized uncertain terms are converted into a form containing periodic time-varying parameters so that they can be estimated with a periodic learning law. Theoretical analysis and simulation results show that both the partially saturated and the fully saturated learning algorithms achieve bounded convergence of the output error and guarantee the boundedness of all signals in the closed-loop system.  相似文献

18.
Learning to Predict by the Methods of Temporal Differences   Total citations: 25 (self-citations: 2, by others: 23)
This article introduces a class of incremental learning procedures specialized for prediction – that is, for using past experience with an incompletely known system to predict its future behavior. Whereas conventional prediction-learning methods assign credit by means of the difference between predicted and actual outcomes, the new methods assign credit by means of the difference between temporally successive predictions. Although such temporal-difference methods have been used in Samuel's checker player, Holland's bucket brigade, and the author's Adaptive Heuristic Critic, they have remained poorly understood. Here we prove their convergence and optimality for special cases and relate them to supervised-learning methods. For most real-world prediction problems, temporal-difference methods require less memory and less peak computation than conventional methods and they produce more accurate predictions. We argue that most problems to which supervised learning is currently applied are really prediction problems of the sort to which temporal-difference methods can be applied to advantage.  相似文献   

19.
Temporal-difference learning is one of the most successful and broadly applied solutions to the reinforcement learning problem; it has been used to achieve master-level play in chess, checkers and backgammon. The key idea is to update a value function from episodes of real experience, by bootstrapping from future value estimates, and using value function approximation to generalise between related states. Monte-Carlo tree search is a recent algorithm for high-performance search, which has been used to achieve master-level play in Go. The key idea is to use the mean outcome of simulated episodes of experience to evaluate each state in a search tree. We introduce a new approach to high-performance search in Markov decision processes and two-player games. Our method, temporal-difference search, combines temporal-difference learning with simulation-based search. Like Monte-Carlo tree search, the value function is updated from simulated experience; but like temporal-difference learning, it uses value function approximation and bootstrapping to efficiently generalise between related states. We apply temporal-difference search to the game of 9×9 Go, using a million binary features matching simple patterns of stones. Without any explicit search tree, our approach outperformed an unenhanced Monte-Carlo tree search with the same number of simulations. When combined with a simple alpha-beta search, our program also outperformed all traditional (pre-Monte-Carlo) search and machine learning programs on the 9×9 Computer Go Server.  相似文献   
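A rough sketch of temporal-difference search is given below: simulated episodes are run from the current root state and a linear value function is updated with TD(λ) bootstrapping. `simulate_step` is a hypothetical self-play simulator and all parameters are illustrative, not the paper's implementation.

```python
import numpy as np

def td_search(root, simulate_step, features, theta0, gamma, lam, alpha,
              n_episodes, max_depth):
    """Temporal-difference search: run simulated episodes from the current
    root state and update a linear value function with TD(lambda),
    bootstrapping from successive value estimates rather than waiting for
    the simulated outcome as Monte-Carlo tree search does.
    `simulate_step(s, theta)` returns (next_state, reward, done)."""
    theta = np.array(theta0, dtype=float)
    for _ in range(n_episodes):
        s = root
        z = np.zeros_like(theta)             # eligibility trace, reset per episode
        for _ in range(max_depth):
            s_next, reward, done = simulate_step(s, theta)
            phi = features(s)
            v_next = 0.0 if done else features(s_next) @ theta
            delta = reward + gamma * v_next - phi @ theta
            z = gamma * lam * z + phi
            theta = theta + alpha * delta * z
            if done:
                break
            s = s_next
    return theta
```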

20.
The iterative learning control (ILC) problem for a networked linear system with random link failures is addressed in this paper. Link failures can occur both on the channel from the controller to the plant and on the channel from the plant to the controller, and each random link failure is modeled by an independent Bernoulli random variable. Two P-type update laws are designed and critically analyzed, and the almost sure convergence property is established for this setting for the first time. Illustrative simulations show the effectiveness of the proposed algorithms.  相似文献
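A minimal sketch of a P-type update over an unreliable plant-to-controller link could look as follows; modelling only that one channel and using a scalar success probability are simplifying assumptions for illustration.

```python
import numpy as np

def networked_p_type_ilc(u_k, e_k, L, p_success, rng):
    """P-type ILC update over an unreliable plant-to-controller link: the
    measured error only reaches the learning controller when an independent
    Bernoulli variable indicates a successful transmission; otherwise the
    previous input trajectory is retained."""
    if rng.random() >= p_success:        # Bernoulli link failure
        return u_k                       # link failed: keep the old input
    return u_k + L @ e_k                 # link succeeded: standard P-type update

# Example usage with illustrative dimensions:
# rng = np.random.default_rng(0)
# u_next = networked_p_type_ilc(np.zeros(50), np.zeros(50), np.eye(50), 0.9, rng)
```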
