Similar Documents
17 similar documents found (search time: 46 ms)
1.
Discounted-reward reinforcement learning is currently the mainstream of reinforcement learning research, but the choice of discount factor makes near-term expected rewards weigh more heavily than long-term ones, whereas a policy with a larger long-term expected reward is sometimes the optimal one. In such cases a more reasonable approach is average-reward reinforcement learning. This paper introduces the two main algorithms of average-reward reinforcement learning and their principal applications.
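For reference, the two optimality criteria contrasted in this abstract can be written as follows; this is the standard textbook formulation rather than anything specific to the cited paper.

```latex
% Discounted criterion: near-term rewards dominate because \gamma < 1
V_{\gamma}^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_0 = s\right], \qquad 0 \le \gamma < 1

% Average-reward criterion: all time steps weighted equally
\rho^{\pi} = \lim_{N \to \infty} \frac{1}{N}\, \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{N-1} r_{t}\right]
```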

2.
Average-Reward Reinforcement Learning Scheduling for Re-entrant Production Systems   Total citations: 4 (self: 0, others: 4)
In re-entrant production systems, an important problem is optimizing the scheduling policy so as to raise the average system throughput. This paper applies an average-reward reinforcement learning algorithm to the problem: starting directly from the system performance measure of interest, it automatically acquires an adaptive dynamic scheduling policy. Simulation results show that its performance surpasses two well-known priority scheduling policies.

3.
A Reinforcement Learning Model and Algorithm for Multi-Agent Cooperation   Total citations: 2 (self: 0, others: 2)
Drawing on reinforcement learning techniques, this paper discusses the process of cooperative multi-agent learning and constructs a new cooperative multi-agent learning model. On the basis of this model, a cooperative multi-agent learning algorithm is proposed. The algorithm fully accounts for the characteristics of agents learning jointly, letting each agent predict its action policy from estimates of the long-term payoff of actions and make decisions accordingly, thereby converging on an optimal joint action policy. Finally, simulation experiments on the hunter-prey pursuit problem verify the algorithm's convergence and show that it is an efficient, fast learning method.

4.
Reinforcement Learning: Theory, Algorithms and Applications   Total citations: 38 (self: 3, others: 38)
The term reinforcement learning comes from behavioral psychology; the theory treats behavioral learning as a trial-and-error process that maps environment states to corresponding actions. This paper first gives a comprehensive introduction to the main algorithms of reinforcement learning theory, namely temporal-difference (TD) methods, the Q-learning algorithm, and the adaptive heuristic critic (AHC) algorithm; it then surveys applications of reinforcement learning; finally, it discusses the open problems that reinforcement learning research still needs to address.
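As a concrete illustration of the tabular Q-learning algorithm mentioned above, here is a minimal one-step update sketch; the step size and discount factor are illustrative values, not taken from the paper.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One-step tabular Q-learning:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```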

5.
Research on Reinforcement Learning Algorithms Based on Neural Networks   Total citations: 11 (self: 0, others: 11)
BP (backpropagation) neural networks are widely used in nonlinear control systems, but as a supervised learning method they require batches of input-output pairs for training, and in systems whose optimal policy is unknown such pairs cannot be obtained in advance. Reinforcement learning, by contrast, adjusts its policy from experience gathered on the actual system, approaches the optimal policy over time, and needs no supervision during learning. This paper proposes RBP, a learning algorithm that combines reinforcement learning with a BP neural network. Its basic idea is to learn a control policy by reinforcement learning and, after a certain number of learning cycles, use the acquired knowledge to train the neural network so that the network gradually converges to the optimal state. Experiments verify the effectiveness and convergence of the method.
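The RBP model itself is not specified in detail in this abstract, so the following is only a hedged sketch of the general idea it describes: learn values by reinforcement learning first, then use them as supervised targets for a small BP (backpropagation) network. The one-hot state encoding, network size, and learning rates are assumptions made purely for illustration.

```python
import numpy as np

def train_bp_on_q(Q, hidden=16, epochs=500, lr=0.05, seed=0):
    """Fit a one-hidden-layer BP network to a Q-table learned by RL.
    States are one-hot encoded; targets are the learned Q-values."""
    rng = np.random.default_rng(seed)
    n_states, n_actions = Q.shape
    X, Y = np.eye(n_states), Q
    W1 = rng.normal(scale=0.1, size=(n_states, hidden))
    W2 = rng.normal(scale=0.1, size=(hidden, n_actions))
    for _ in range(epochs):
        H = np.tanh(X @ W1)                      # forward pass, hidden layer
        err = H @ W2 - Y                         # prediction error on Q targets
        W2 -= lr * H.T @ err / n_states          # backprop: output layer
        W1 -= lr * X.T @ ((err @ W2.T) * (1 - H ** 2)) / n_states  # hidden layer
    return W1, W2
```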

6.
Research on Average-Reward Reinforcement Learning Algorithms   Total citations: 7 (self: 0, others: 7)
高阳, 周如益, 王皓, 曹志新. 《计算机学报》 (Chinese Journal of Computers), 2007, 30(8): 1372-1378
Sequential decision problems are commonly modeled as Markov decision processes (MDPs). When decision actions extend from discrete time points to continuous time, the classical MDP model is extended to the semi-Markov decision process (SMDP) model. When the system parameters are unknown, reinforcement learning techniques are used to learn the optimal policy. Based on performance-potential theory, this paper proves an approximation theorem for average-reward reinforcement learning. By approximating the performance-potential value function relative to a reference state, it studies a new average-reward reinforcement learning algorithm, G-learning. G-learning applies to both MDPs and SMDPs. Unlike the classical R-learning algorithm, G-learning replaces the relative value function (values relative to the average reward) with the performance-potential value function relative to a reference state. In simulation experiments on customer admission control and production-inventory control, G-learning outperforms both the R-learning and the SMART algorithms.
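For context, the performance potential (bias) g referred to in this abstract satisfies the standard average-reward optimality equation below, with the potential of a chosen reference state pinned to zero; this is the generic unichain form, not the paper's specific derivation.

```latex
% Average-reward optimality (Poisson) equation with performance potential g
\rho^{*} + g(s) = \max_{a}\Big[ r(s,a) + \sum_{s'} P(s' \mid s, a)\, g(s') \Big],
\qquad g(s_{\mathrm{ref}}) = 0
```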

7.
Drawing on reinforcement learning, this paper discusses the learning process of a single mobile agent, then extends it to the multiple-mobile-agent setting and proposes a multi-mobile-agent learning algorithm, MMAL (Multi Mobile Agent Learning). The algorithm fully accounts for the characteristics of mobile-agent learning, enabling mobile agents to make decisions in contexts with uncertainty and conflicting goals, resolving each agent's choice of when to migrate during learning, and greatly reducing computational cost. The aim is to let agents learn autonomously and cooperatively in stochastic, dynamic environments. Finally, simulation experiments show that the algorithm is an efficient, fast learning method.

8.
A Multi-Step Q Reinforcement Learning Method   Total citations: 1 (self: 0, others: 1)
Q-learning is an important reinforcement learning algorithm. Addressing the shortcomings of Q-learning and Q(λ), this paper proposes MQ, a Q-learning method with multi-step lookahead capability. It first presents the MDP model, then derives the MQ algorithm on the basis of an analysis of Q-learning and Q(λ), and analyzes the algorithm's update strategy and the principle for choosing the value of k. A cliff-walking simulation experiment verifies the algorithm's effectiveness. Both theoretical analysis and numerical experiments show that the algorithm has strong lookahead capability while reducing computational complexity, making it a reinforcement learning method that effectively balances update speed and complexity.
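The exact MQ update is not given in this abstract; as a hedged illustration of what multi-step lookahead means here, the sketch below computes the generic k-step Q-learning target that such methods build on (the paper's own update and its rule for choosing k may differ).

```python
import numpy as np

def k_step_q_target(rewards, s_k, Q, gamma=0.9):
    """Generic k-step Q-learning target:
    G = r_t + gamma*r_{t+1} + ... + gamma^(k-1)*r_{t+k-1} + gamma^k * max_a Q(s_{t+k}, a)."""
    k = len(rewards)
    G = sum((gamma ** i) * r for i, r in enumerate(rewards))
    return G + (gamma ** k) * np.max(Q[s_k])
```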

9.
10.
Research on a Control Algorithm Based on Reinforcement Learning   Total citations: 2 (self: 0, others: 2)
After explaining the basic mechanism of reinforcement learning, this paper proposes a control algorithm that combines case-based learning with reinforcement learning, tailored to the nonlinear, multivariable, long-delay, strongly coupled nature of complex industrial processes. Simulation experiments on a heavy-oil fractionation column show that the algorithm satisfies the control task well.

11.
Mahadevan, Sridhar. Machine Learning, 1996, 22(1-3): 159-195
This paper presents a detailed study of average reward reinforcement learning, an undiscounted optimality framework that is more appropriate for cyclical tasks than the much better studied discounted framework. A wide spectrum of average reward algorithms are described, ranging from synchronous dynamic programming methods to several (provably convergent) asynchronous algorithms from optimal control and learning automata. A general sensitive discount optimality metric called n-discount-optimality is introduced and used to compare the various algorithms. The overview identifies a key similarity across several asynchronous algorithms that is crucial to their convergence, namely independent estimation of the average reward and the relative values. The overview also uncovers a surprising limitation shared by the different algorithms: while several algorithms can provably generate gain-optimal policies that maximize average reward, none of them can reliably filter these to produce bias-optimal (or T-optimal) policies that also maximize the finite reward to absorbing goal states. This paper also presents a detailed empirical study of R-learning, an average reward reinforcement learning method, using two empirical testbeds: a stochastic grid world domain and a simulated robot environment. A detailed sensitivity analysis of R-learning is carried out to test its dependence on learning rates and exploration levels. The results suggest that R-learning is quite sensitive to exploration strategies and can fall into sub-optimal limit cycles. The performance of R-learning is also compared with that of Q-learning, the best studied discounted RL method. Here, the results suggest that R-learning can be fine-tuned to give better performance than Q-learning in both domains.
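For readers unfamiliar with R-learning, the update studied in this paper has the following commonly cited tabular form; the step sizes are illustrative, and conventions for when to update the average-reward estimate vary across presentations.

```python
import numpy as np

def r_learning_update(R, rho, s, a, r, s_next, greedy, alpha=0.1, beta=0.01):
    """R-learning: learn action values relative to the average reward rho.
    `greedy` flags whether action a was chosen greedily (a common condition
    for updating the average-reward estimate)."""
    R[s, a] += alpha * (r - rho + np.max(R[s_next]) - R[s, a])
    if greedy:
        rho += beta * (r - rho + np.max(R[s_next]) - np.max(R[s]))
    return R, rho
```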

12.
Recent Advances in Hierarchical Reinforcement Learning   Total citations: 22 (self: 0, others: 22)
Reinforcement learning is bedeviled by the curse of dimensionality: the number of parameters to be learned grows exponentially with the size of any compact encoding of a state. Recent attempts to combat the curse of dimensionality have turned to principled ways of exploiting temporal abstraction, where decisions are not required at each step, but rather invoke the execution of temporally-extended activities which follow their own policies until termination. This leads naturally to hierarchical control architectures and associated learning algorithms. We review several approaches to temporal abstraction and hierarchical organization that machine learning researchers have recently developed. Common to these approaches is a reliance on the theory of semi-Markov decision processes, which we emphasize in our review. We then discuss extensions of these ideas to concurrent activities, multiagent coordination, and hierarchical memory for addressing partial observability. Concluding remarks address open challenges facing the further development of reinforcement learning in a hierarchical setting.

13.
Near-Optimal Reinforcement Learning in Polynomial Time   Total citations: 1 (self: 0, others: 1)
Kearns, Michael; Singh, Satinder. Machine Learning, 2002, 49(2-3): 209-232
We present new algorithms for reinforcement learning and prove that they have polynomial bounds on the resources required to achieve near-optimal return in general Markov decision processes. After observing that the number of actions required to approach the optimal return is lower bounded by the mixing time T of the optimal policy (in the undiscounted case) or by the horizon time T (in the discounted case), we then give algorithms requiring a number of actions and total computation time that are only polynomial in T and the number of states and actions, for both the undiscounted and discounted cases. An interesting aspect of our algorithms is their explicit handling of the Exploration-Exploitation trade-off.

14.
The Optimal Reward Baseline in Policy Gradient Reinforcement Learning   Total citations: 2 (self: 0, others: 2)
王学宁, 徐昕, 吴涛, 贺汉根. 《计算机学报》 (Chinese Journal of Computers), 2005, 28(6): 1021-1026
Although policy gradient reinforcement learning algorithms have good convergence properties, the large variance of their gradient estimates is a major weakness of the approach in both theory and application. To reduce this variance, the paper proposes a new algorithm, Istate-Grbp, which adds a reward baseline to the policy gradient algorithm Istate-GPOMDP in order to improve its learning performance. The paper proves that introducing a reward baseline into Istate-GPOMDP does not change the expected value of the gradient estimate, and it derives the optimal baseline that minimizes the variance. Experimental results show that, compared with existing algorithms, the proposed method improves learning efficiency and speeds up convergence by reducing the variance of the gradient estimates.
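The textbook form of the result this abstract summarizes (a baseline leaves the gradient estimate unbiased, and there is a variance-minimizing choice) is shown below; the paper's Istate-GPOMDP-specific baseline may differ in detail.

```latex
% A constant baseline b does not bias the policy gradient estimate
\nabla_\theta J(\theta) = \mathbb{E}\big[\nabla_\theta \log \pi_\theta(a \mid s)\,(R - b)\big]

% Variance-minimizing (optimal) constant baseline in its standard form
b^{*} = \frac{\mathbb{E}\big[\lVert \nabla_\theta \log \pi_\theta(a \mid s) \rVert^{2} R\big]}
             {\mathbb{E}\big[\lVert \nabla_\theta \log \pi_\theta(a \mid s) \rVert^{2}\big]}
```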

15.
Reinforcement Learning with Replacing Eligibility Traces   Total citations: 26 (self: 0, others: 26)
The eligibility trace is one of the basic mechanisms used in reinforcement learning to handle delayed reward. In this paper we introduce a new kind of eligibility trace, the replacing trace, analyze it theoretically, and show that it results in faster, more reliable learning than the conventional trace. Both kinds of trace assign credit to prior events according to how recently they occurred, but only the conventional trace gives greater credit to repeated events. Our analysis is for conventional and replace-trace versions of the offline TD(1) algorithm applied to undiscounted absorbing Markov chains. First, we show that these methods converge under repeated presentations of the training set to the same predictions as two well known Monte Carlo methods. We then analyze the relative efficiency of the two Monte Carlo methods. We show that the method corresponding to conventional TD is biased, whereas the method corresponding to replace-trace TD is unbiased. In addition, we show that the method corresponding to replacing traces is closely related to the maximum likelihood solution for these tasks, and that its mean squared error is always lower in the long run. Computational results confirm these analyses and show that they are applicable more generally. In particular, we show that replacing traces significantly improve performance and reduce parameter sensitivity on the "Mountain-Car" task, a full reinforcement-learning problem with a continuous state space, when using a feature-based function approximator.
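A minimal sketch of the distinction the paper analyzes, written for a tabular TD(λ)-style trace vector; the decay parameters are illustrative.

```python
import numpy as np

def update_traces(e, s, kind="replacing", gamma=0.9, lam=0.9):
    """Decay all eligibility traces, then credit the just-visited state s.
    Accumulating traces add on revisits; replacing traces reset to 1."""
    e *= gamma * lam
    if kind == "accumulating":
        e[s] += 1.0
    else:                      # replacing trace
        e[s] = 1.0
    return e
```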

16.
With the wide application of reinforcement learning to autonomous robot control and complex decision problems, it has gradually become a major research focus within machine learning. Traditional reinforcement learning obtains a policy by continually interacting autonomously with the environment. However, for most multi-step decision problems it is difficult to provide the feedback (reward) signal that traditional reinforcement learning requires, and this has become a bottleneck for applying reinforcement learning to more complex problems. Inverse reinforcement learning is a class of algorithms that, under the assumption that the expert's decision trajectories are optimal, inversely solves for the reward function in a Markov decision process. A family of learning-from-demonstration algorithms designed by combining inverse reinforcement learning with traditional (forward) reinforcement learning has already produced a series of results in robot control and other fields. This paper surveys reinforcement learning, inverse reinforcement learning, and learning from demonstration, and it also discusses the problems that must be solved when applying inverse reinforcement learning as well as learning-from-demonstration methods based on it.

17.
Research on Optimality Criteria for Reinforcement Learning   Total citations: 8 (self: 0, others: 8)
A reinforcement learning agent solves sequential decision problems by learning and planning with respect to an optimal policy, so how to define the optimality criterion for policies is one of the core questions of reinforcement learning research. This paper discusses a series of optimality criteria drawn from dynamic programming, examines through examples the applicability, strengths, and weaknesses of each criterion for reinforcement learning, and analyzes the necessity of designing reinforcement learning algorithms for the various criteria.
