Found 20 similar documents (search time: 15 ms)
1.
Incremental Nearest-Neighbor Temporal-Difference Learning in Continuous Spaces (total citations: 1, self-citations: 1, others: 0)
For reinforcement learning in continuous spaces, an incremental nearest-neighbor temporal-difference (TD) learning framework based on locally weighted learning is proposed. An instance dictionary is built online by incrementally selecting a subset of the observed states; the value function and policy at a newly observed state are approximated from its range-nearest-neighbor instances, and the TD algorithm is used to iteratively update the value function and eligibility trace of each instance in the dictionary. Several design options are given for each main component of the framework, and their convergence is analyzed theoretically. Simulation experiments over 24 combinations of these options show that the SNDN combination offers good learning performance and computational efficiency.
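The abstract only names the ingredients; a minimal Python sketch of the central mechanism, a growing instance dictionary whose nearest neighbors both answer value queries and receive TD(λ) credit, might look as follows. The class name, distance threshold, and all hyperparameters are illustrative, not taken from the paper.

```python
import numpy as np

class NearestNeighborTD:
    """Approximate V(s) in a continuous space with a growing instance dictionary.

    Each stored instance keeps its own value estimate and eligibility trace;
    a query state is evaluated by distance-weighted averaging over its
    k nearest stored instances (all hyperparameters are illustrative).
    """

    def __init__(self, k=4, add_threshold=0.5, alpha=0.1, gamma=0.95, lam=0.8):
        self.k, self.add_threshold = k, add_threshold
        self.alpha, self.gamma, self.lam = alpha, gamma, lam
        self.states, self.values, self.traces = [], [], []

    def _neighbors(self, s):
        d = np.array([np.linalg.norm(s - x) for x in self.states])
        idx = np.argsort(d)[: self.k]
        w = 1.0 / (d[idx] + 1e-6)          # inverse-distance weights
        return idx, w / w.sum()

    def value(self, s):
        if not self.states:
            return 0.0
        idx, w = self._neighbors(s)
        return float(np.dot(w, np.array(self.values)[idx]))

    def observe(self, s, r, s_next, done):
        # Grow the dictionary when the new state is far from every stored instance.
        if not self.states or min(np.linalg.norm(s - x) for x in self.states) > self.add_threshold:
            self.states.append(np.asarray(s, dtype=float))
            self.values.append(0.0)
            self.traces.append(0.0)
        target = r + (0.0 if done else self.gamma * self.value(s_next))
        delta = target - self.value(s)
        idx, w = self._neighbors(s)
        for j, wj in zip(idx, w):
            self.traces[j] += wj            # activate traces on the current neighbors
        for j in range(len(self.values)):   # TD(lambda) update over the whole dictionary
            self.values[j] += self.alpha * delta * self.traces[j]
            self.traces[j] *= self.gamma * self.lam
```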
2.
Incremental Multi-Step Q-Learning (total citations: 23, self-citations: 0, others: 23)
This paper presents a novel incremental algorithm that combines Q-learning, a well-known dynamic-programming based reinforcement learning method, with the TD(λ) return estimation process, which is typically used in actor-critic learning, another well-known dynamic-programming based reinforcement learning method. The parameter λ is used to distribute credit throughout sequences of actions, leading to faster learning and also helping to alleviate the non-Markovian effect of coarse state-space quantization. The resulting algorithm, Q(λ)-learning, thus combines some of the best features of the Q-learning and actor-critic learning paradigms. The behavior of this algorithm has been demonstrated through computer simulations.
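As a rough illustration of the idea of combining Q-learning with λ-return credit assignment, here is a tabular sketch. It uses the Watkins-style trace cutoff, which is only one common formulation and not necessarily the variant proposed in this paper; the environment is assumed to follow the Gymnasium step/reset API with discrete states and actions, and all hyperparameters are illustrative.

```python
import numpy as np

def q_lambda_episode(env, Q, alpha=0.1, gamma=0.99, lam=0.9, epsilon=0.1):
    """One episode of tabular Q(lambda); Q is an (n_states, n_actions) array."""
    E = np.zeros_like(Q)                          # eligibility traces e(s, a)
    s, _ = env.reset()
    done = False
    while not done:
        if np.random.rand() < epsilon:            # epsilon-greedy behavior policy
            a = env.action_space.sample()
        else:
            a = int(np.argmax(Q[s]))
        greedy = (Q[s, a] == Q[s].max())          # was this action greedy?
        s2, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        delta = r + (0.0 if terminated else gamma * Q[s2].max()) - Q[s, a]
        E[s, a] += 1.0                            # accumulate the trace for (s, a)
        Q += alpha * delta * E                    # distribute credit along the trace
        E *= gamma * lam if greedy else 0.0       # Watkins-style cutoff after exploration
        s = s2
    return Q
```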
3.
The results obtained by Pollack and Blair substantially underperform my 1992 TD Learning results. This is shown by directly benchmarking the 1992 TD nets against Pubeval. A plausible hypothesis for this underperformance is that, unlike TD learning, the hillclimbing algorithm fails to capture nonlinear structure inherent in the problem, and despite the presence of hidden units, only obtains a linear approximation to the optimal policy for backgammon. Two lines of evidence supporting this hypothesis are discussed, the first coming from the structure of the Pubeval benchmark program, and the second coming from experiments replicating the Pollack and Blair results.
4.
CHEN Zhan 《艺术与设计.数码设计》2008,(12)
Basic sketching and structural sketching are both important parts of foundational art teaching. This paper studies the learning and expressive methods of basic sketching and structural sketching, and analyzes the differences between them from four aspects: observation method and composition arrangement, working method and steps, forms and means of expression, and spatial concept and treatment of detail.
5.
6.
Technical Update: Least-Squares Temporal Difference Learning (total citations: 2, self-citations: 0, others: 2)
TD(λ) is a popular family of algorithms for approximate policy evaluation in large MDPs. TD(λ) works by incrementally updating the value function after each observed transition. It has two major drawbacks: it may make inefficient use of data, and it requires the user to manually tune a stepsize schedule for good performance. For the case of linear value function approximations and λ = 0, the Least-Squares TD (LSTD) algorithm of Bradtke and Barto (1996, Machine Learning, 22:1–3, 33–57) eliminates all stepsize parameters and improves data efficiency. This paper updates Bradtke and Barto's work in three significant ways. First, it presents a simpler derivation of the LSTD algorithm. Second, it generalizes from λ = 0 to arbitrary values of λ; at the extreme of λ = 1, the resulting new algorithm is shown to be a practical, incremental formulation of supervised linear regression. Third, it presents a novel and intuitive interpretation of LSTD as a model-based reinforcement learning technique.
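A minimal batch sketch of LSTD(λ) for linear value functions, in the spirit of the algorithm summarized above; the ridge term and the shape of the feature map are illustrative choices, not details from the paper.

```python
import numpy as np

def lstd_lambda(transitions, phi, n_features, gamma=0.99, lam=0.0, reg=1e-6):
    """Batch LSTD(lambda) for linear value functions.

    transitions: list of (s, r, s_next, done) tuples collected under a fixed policy.
    phi: feature map s -> R^n_features.
    Accumulates A = sum z (phi(s) - gamma phi(s'))^T and b = sum z r, then solves A w = b.
    """
    A = reg * np.eye(n_features)          # small ridge term keeps A invertible
    b = np.zeros(n_features)
    z = np.zeros(n_features)              # eligibility vector
    for s, r, s_next, done in transitions:
        f = phi(s)
        f_next = np.zeros(n_features) if done else phi(s_next)
        z = gamma * lam * z + f
        A += np.outer(z, f - gamma * f_next)
        b += z * r
        if done:
            z = np.zeros(n_features)      # reset the trace at episode boundaries
    w = np.linalg.solve(A, b)             # value estimate is V(s) ~= w . phi(s)
    return w
```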
7.
Co-Evolution in the Successful Learning of Backgammon Strategy (total citations: 4, self-citations: 0, others: 4)
Following Tesauro's work on TD-Gammon, we used a 4,000 parameter feedforward neural network to develop a competitive backgammon evaluation function. Play proceeds by a roll of the dice, application of the network to all legal moves, and selection of the position with the highest evaluation. However, no backpropagation, reinforcement or temporal difference learning methods were employed. Instead we apply simple hillclimbing in a relative fitness environment. We start with an initial champion of all zero weights and proceed simply by playing the current champion network against a slightly mutated challenger and changing weights if the challenger wins. Surprisingly, this worked rather well. We investigate how the peculiar dynamics of this domain enabled a previously discarded weak method to succeed, by preventing suboptimal equilibria in a meta-game of self-learning.
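Stripped of the backgammon machinery, the champion-versus-mutated-challenger loop reads roughly as below. The Gaussian mutation and the small blend step toward a winning challenger are assumptions made for this sketch, and play_match is a placeholder for however games are actually played and scored.

```python
import numpy as np

def hillclimb(n_weights, play_match, n_generations=10_000,
              noise_std=0.05, blend=0.05, seed=0):
    """Relative-fitness hillclimbing over evaluation-function weights (sketch).

    play_match(champion, challenger) is assumed to play one or more games and
    return True if the challenger wins. Starting from all-zero weights follows
    the abstract; the mutation scale and blend step are illustrative.
    """
    rng = np.random.default_rng(seed)
    champion = np.zeros(n_weights)                 # initial champion: all zeros
    for _ in range(n_generations):
        challenger = champion + rng.normal(0.0, noise_std, n_weights)
        if play_match(champion, challenger):
            # nudge the champion toward the winning challenger
            champion = (1.0 - blend) * champion + blend * challenger
    return champion
```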
8.
Reinforcement learning is an important machine learning method. To speed up convergence of the reinforcement learning process and reduce the error in value-function estimation, a multi-step temporal-difference learning algorithm based on recursive least squares, RLS-TD(λ), is proposed. It is proved that, under certain conditions, the weights of the algorithm converge with probability 1 to the unique solution, and a relation that the value-function estimation error must satisfy is derived and proved. Maze experiments show that, compared with RLS-TD(0), the algorithm accelerates convergence of the learning process, and compared with conventional TD(λ), it reduces the value-function estimation error and thus improves accuracy.
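A sketch of a recursive least-squares TD(λ) update for linear value functions, in the spirit of the algorithm described above; the initialization constant p0 and the feature representation are illustrative, not taken from the paper.

```python
import numpy as np

class RLSTDLambda:
    """Recursive least-squares TD(lambda) with linear features (sketch)."""

    def __init__(self, n_features, gamma=0.99, lam=0.8, p0=100.0):
        self.gamma, self.lam = gamma, lam
        self.theta = np.zeros(n_features)        # value-function weights
        self.P = p0 * np.eye(n_features)         # inverse correlation matrix
        self.z = np.zeros(n_features)            # eligibility vector

    def update(self, f, r, f_next, done):
        """f, f_next: feature vectors phi(s), phi(s'); r: observed reward."""
        if done:
            f_next = np.zeros_like(f)
        self.z = self.gamma * self.lam * self.z + f
        d = f - self.gamma * f_next              # feature difference
        Pz = self.P @ self.z
        k = Pz / (1.0 + d @ Pz)                  # gain via Sherman-Morrison update
        self.theta += k * (r - d @ self.theta)   # TD-error-driven weight correction
        self.P -= np.outer(k, d @ self.P)
        if done:
            self.z = np.zeros_like(self.z)       # reset the trace between episodes

    def value(self, f):
        return float(f @ self.theta)
```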
9.
Learning to Play Chess Using Temporal Differences (total citations: 4, self-citations: 0, others: 4)
In this paper we present TDLEAF(λ), a variation on the TD(λ) algorithm that enables it to be used in conjunction with game-tree search. We present some experiments in which our chess program KnightCap used TDLEAF(λ) to learn its evaluation function while playing on Internet chess servers. The main success we report is that KnightCap improved from a 1650 rating to a 2150 rating in just 308 games and 3 days of play. As a reference, a rating of 1650 corresponds to about level B human play (on a scale from E (1000) to A (1800)), while 2150 is human master level. We discuss some of the reasons for this success, principal among them being the use of on-line play rather than self-play. We also investigate whether TDLEAF(λ) can yield better results in the domain of backgammon, where TD(λ) has previously yielded striking success.
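The essential step of TDLEAF(λ) is to apply the λ-weighted TD update to the evaluation of the principal-variation leaf returned by the search at each position, rather than to the root evaluation itself. A schematic end-of-game update is sketched below; how the leaf values and their gradients are produced by the search and evaluation function is left abstract, and the step size and λ are illustrative.

```python
import numpy as np

def tdleaf_update(w, leaf_values, leaf_grads, alpha=1e-3, lam=0.7):
    """One TDLEAF(lambda) weight update at the end of a game (sketch).

    leaf_values[t]: evaluation (under the current weights w) of the
        principal-variation leaf found by the search at position t,
        mapped to a common scale (e.g. by tanh).
    leaf_grads[t]: gradient of that leaf evaluation with respect to w.
    The final entry is taken to be the game outcome, so the last
    temporal difference pushes the evaluation toward the true result.
    """
    N = len(leaf_values) - 1
    d = [leaf_values[t + 1] - leaf_values[t] for t in range(N)]   # temporal differences
    for t in range(N):
        # lambda-weighted sum of the future temporal differences
        credit = sum(lam ** (j - t) * d[j] for j in range(t, N))
        w = w + alpha * credit * leaf_grads[t]
    return w
```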
10.
Machine game playing is regarded as one of the most challenging research directions in artificial intelligence. Computer Chinese chess is by no means less difficult than Western chess, yet few researchers work on it, and even fewer systems have self-learning ability. This paper introduces the principles of human-computer Chinese chess play, presents several typical evaluation-function learning methods of recent years together with their principles, and identifies through comparison the learning method best suited to Chinese chess. It analyzes the remaining problems of these methods and proposes future research directions.
11.
This paper derives a recursive least-squares temporal-difference method which, compared with the ordinary temporal-difference method, offers higher sample efficiency, faster convergence, and lower computational cost. Reinforcement learning based on recursive least squares is then applied to ship heading control, overcoming the drawback that typical intelligent algorithms need a certain amount of sample data before they can learn: the controller parameters are learned and adjusted online, which to some extent handles the uncertainty in ship motion. Simulation results show that satisfactory heading control is still achieved under various wind, wave, and current disturbances, indicating that the algorithm is effective and feasible.
12.
Machine game playing is regarded as one of the most challenging research directions in artificial intelligence. Computer Chinese chess is by no means less difficult than Western chess, yet few researchers work on it, and even fewer systems have self-learning ability. This paper introduces the principles of human-computer Chinese chess play, presents several typical evaluation-function self-learning methods of recent years together with their principles, and identifies through comparison the learning method best suited to Chinese chess. It analyzes the remaining problems of these methods and proposes future research directions.
13.
Ao Xi, Thushal Wijekoon Mudiyanselage, Dacheng Tao, Chao Chen 《IEEE/CAA Journal of Automatica Sinica》2019,6(4):938-951
In this work, we combined model-based reinforcement learning (MBRL) and model-free reinforcement learning (MFRL) to stabilize a biped robot (NAO robot) on a rotating platform, where the angular velocity of the platform is unknown to the proposed learning algorithm and treated as an external disturbance. Nonparametric Gaussian processes normally require a large number of training data points to deal with the discontinuity of the estimated model. Although improved methods such as probabilistic inference for learning control (PILCO) do not require an explicit global model, as the actions are obtained by directly searching the policy space, overfitting and lack of model complexity may still result in a large deviation between the prediction and the real system. Besides, none of these approaches considers the data error and measurement noise during the training and test processes, respectively. We propose a hierarchical Gaussian process (GP) model, containing two layers of independent GPs, from which a physically continuous probability transition model of the robot is obtained. Due to the physically continuous estimation, the algorithm overcomes the overfitting problem with a guaranteed model complexity, and the number of training data points is also reduced. The policy for any given initial state is generated automatically by minimizing the expected cost according to the predefined cost function and the obtained probability distribution of the state. Furthermore, a novel Q(λ)-based MFRL scheme is employed to improve the policy. Simulation results show that the proposed RL algorithm is able to balance the NAO robot on a rotating platform, and it is capable of adapting to a platform with varying angular velocity.
14.
Reinforcement Learning with Replacing Eligibility Traces (total citations: 26, self-citations: 0, others: 26)
The eligibility trace is one of the basic mechanisms used in reinforcement learning to handle delayed reward. In this paper we introduce a new kind of eligibility trace, the replacing trace, analyze it theoretically, and show that it results in faster, more reliable learning than the conventional trace. Both kinds of trace assign credit to prior events according to how recently they occurred, but only the conventional trace gives greater credit to repeated events. Our analysis is for conventional and replace-trace versions of the offline TD(1) algorithm applied to undiscounted absorbing Markov chains. First, we show that these methods converge under repeated presentations of the training set to the same predictions as two well known Monte Carlo methods. We then analyze the relative efficiency of the two Monte Carlo methods. We show that the method corresponding to conventional TD is biased, whereas the method corresponding to replace-trace TD is unbiased. In addition, we show that the method corresponding to replacing traces is closely related to the maximum likelihood solution for these tasks, and that its mean squared error is always lower in the long run. Computational results confirm these analyses and show that they are applicable more generally. In particular, we show that replacing traces significantly improve performance and reduce parameter sensitivity on the "Mountain-Car" task, a full reinforcement-learning problem with a continuous state space, when using a feature-based function approximator.
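Mechanically, the two kinds of trace differ only in how a revisited state's trace is updated. A tabular TD(λ) prediction sketch showing both options (the step size, discount, and λ are illustrative):

```python
import numpy as np

def td_lambda_episode(episode, V, alpha=0.1, gamma=1.0, lam=0.9, replacing=True):
    """Tabular TD(lambda) over one episode of (state, reward, next_state, done)
    tuples, with either accumulating or replacing eligibility traces."""
    e = np.zeros_like(V)
    for s, r, s_next, done in episode:
        delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
        if replacing:
            e[s] = 1.0            # replacing trace: a revisit resets the trace to 1
        else:
            e[s] += 1.0           # accumulating (conventional) trace: revisits add up
        V += alpha * delta * e    # every recently visited state shares the TD error
        e *= gamma * lam
    return V
```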
15.
16.
Simon M. Lucas 《International Journal of Automation and Computing》2008,5(1):45-57
The last few decades have seen a phenomenal increase in the quality, diversity and pervasiveness of computer games. The worldwide computer games market is estimated to be worth around USD 21bn annually, and is predicted to continue to grow rapidly. This paper reviews some of the recent developments in applying computational intelligence (CI) methods to games, points out some of the potential pitfalls, and suggests some fruitful directions for future research.
17.
18.
With the rapid development of micro-electro-mechanical systems (MEMS), research on autonomous micro-helicopters has become one of the hot topics in this field. Because of the size limits of a micro-helicopter, powerful sensors and processors cannot be installed and complete environmental information is hard to obtain, so traditional model-based control methods are not suitable for controlling an autonomous micro-helicopter in a dynamic environment. Behavior-based control uses successive approximation and does not require an accurate model of the environment, so the system is more stable. This paper adopts reinforcement learning with replacing eligibility traces, combined with the temporal-difference method, to improve learning efficiency; simulation experiments verify the effectiveness of the learning algorithm. Finally, the paper discusses some remaining problems in micro-helicopter control and our directions for future improvement.
19.
Reinforcement learning (RL) algorithms attempt to learn optimal control actions by iteratively estimating a long-term measure of system performance, the so-called value function. For example, RL algorithms have been applied to walking robots to examine the connection between robot motion and the brain, which is known as embodied cognition. In this paper, RL algorithms are analysed using an exemplar test problem. A closed form solution for the value function is calculated and this is represented in terms of a set of basis functions and parameters, which is used to investigate parameter convergence. The value function expression is shown to have a polynomial form where the polynomial terms depend on the plant's parameters and the value function's discount factor. It is shown that the temporal difference error introduces a null space for the differenced higher order basis associated with the effects of controller switching (saturated to linear control or terminating an experiment) apart from the time of the switch. This leads to slow convergence in the relevant subspace. It is also shown that badly conditioned learning problems can occur, and this is a function of the value function discount factor and the controller switching points. Finally, a comparison is performed between the residual gradient and TD(0) learning algorithms, and it is shown that the former has a faster rate of convergence for this test problem.
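For a linear value function, the two update rules compared at the end of the abstract differ only in which direction multiplies the TD error; a minimal sketch, with the feature vectors and step size as illustrative placeholders:

```python
import numpy as np

def td0_step(w, phi_s, r, phi_next, gamma=0.99, alpha=0.01):
    """Semi-gradient TD(0): the bootstrap target is treated as a constant."""
    delta = r + gamma * (w @ phi_next) - (w @ phi_s)
    return w + alpha * delta * phi_s

def residual_gradient_step(w, phi_s, r, phi_next, gamma=0.99, alpha=0.01):
    """Residual-gradient update: true gradient of the squared TD error,
    so the next-state features also enter the update direction."""
    delta = r + gamma * (w @ phi_next) - (w @ phi_s)
    return w - alpha * delta * (gamma * phi_next - phi_s)
```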
20.
For the problem of determining the importance weights of different elements (alternatives), and to avoid the difficulty of directly specifying importance ratios, a new approach is proposed: first judge the differences qualitatively, then quantify those differences. The paper gives the steps and method of this quantification.