Similar Documents
20 similar documents found
1.
Incremental Multi-Step Q-Learning   (cited 23 times: 0 self-citations, 23 by others)
Peng, Jing; Williams, Ronald J. Machine Learning, 1996, 22(1-3): 283-290
This paper presents a novel incremental algorithm that combines Q-learning, a well-known dynamic-programming based reinforcement learning method, with the TD(λ) return estimation process, which is typically used in actor-critic learning, another well-known dynamic-programming based reinforcement learning method. The parameter λ is used to distribute credit throughout sequences of actions, leading to faster learning and also helping to alleviate the non-Markovian effect of coarse state-space quantization. The resulting algorithm, Q(λ)-learning, thus combines some of the best features of the Q-learning and actor-critic learning paradigms. The behavior of this algorithm has been demonstrated through computer simulations.
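The abstract does not reproduce the update rule, so the following is only a rough illustration of the general idea it describes: Q-learning augmented with TD(λ)-style eligibility traces. It is a minimal tabular "naive Q(λ)" sketch; names such as `QLambdaAgent`, `alpha`, and `lam` are illustrative, and Peng and Williams' actual variant treats exploratory actions more carefully than this sketch does.

```python
import numpy as np

class QLambdaAgent:
    """Minimal tabular Q(lambda) sketch: Q-learning plus eligibility traces."""

    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.99,
                 lam=0.8, epsilon=0.1, seed=0):
        self.Q = np.zeros((n_states, n_actions))   # action-value table
        self.E = np.zeros((n_states, n_actions))   # eligibility traces
        self.alpha, self.gamma, self.lam, self.epsilon = alpha, gamma, lam, epsilon
        self.rng = np.random.default_rng(seed)

    def act(self, s):
        # epsilon-greedy action selection
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(self.Q.shape[1]))
        return int(np.argmax(self.Q[s]))

    def update(self, s, a, r, s_next, done):
        # one-step Q-learning error toward the greedy successor value
        target = r if done else r + self.gamma * np.max(self.Q[s_next])
        delta = target - self.Q[s, a]
        # mark the visited pair, then credit every traced pair at once
        self.E[s, a] += 1.0
        self.Q += self.alpha * delta * self.E
        # decay traces; lambda spreads credit back along the action sequence
        self.E *= self.gamma * self.lam
        if done:
            self.E[:] = 0.0
```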

2.
A Parallel Q-Learning Algorithm Based on Multiple Agents   (cited 1 time: 0 self-citations, 1 by others)
A multi-agent parallel Q-learning algorithm is proposed. The learning system contains multiple agents whose learning environments, learning tasks, and capabilities are all identical. Within each learning cycle every agent learns in its own independent environment; when the cycle ends, the agents' learning results are fused, the fused result is shared by all agents, and the next cycle of learning starts from it. Experimental results show the feasibility and effectiveness of the method.
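The abstract does not give the fusion rule, so the sketch below assumes the simplest choice, averaging the agents' Q-tables at the end of each cycle, purely as an illustration. `env_step`, `run_cycle`, and the other names are placeholders, not from the paper.

```python
import numpy as np

def run_cycle(Q, env_step, n_steps, alpha=0.1, gamma=0.95, epsilon=0.1, rng=None):
    """One agent learns independently for one cycle, starting from the shared Q-table."""
    rng = rng or np.random.default_rng()
    n_states, n_actions = Q.shape
    s = int(rng.integers(n_states))
    for _ in range(n_steps):
        a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next, r = env_step(s, a)                 # identical environment for every agent
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next
    return Q

def parallel_q_learning(n_agents, n_cycles, n_steps, n_states, n_actions, env_step):
    shared_Q = np.zeros((n_states, n_actions))
    for _ in range(n_cycles):
        # each agent starts the cycle from the shared table and learns on its own copy
        local_Qs = [run_cycle(shared_Q.copy(), env_step, n_steps,
                              rng=np.random.default_rng(i)) for i in range(n_agents)]
        # fusion step (assumed here to be averaging); the result is shared by all agents
        shared_Q = np.mean(local_Qs, axis=0)
    return shared_Q
```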

3.
A Q-Learning Algorithm Based on Experiential Knowledge   (cited 1 time: 0 self-citations, 1 by others)
To improve the learning and convergence speed of Q-learning, a typical reinforcement learning method in agent systems, and to let the learning process make full use of environmental information, this paper proposes a Q-learning algorithm based on experiential knowledge. The algorithm uses a function carrying experiential knowledge so that the agent learns a system model while performing model-free learning, avoiding repeated learning of the environment model and thereby speeding up learning. Simulation results show that the algorithm gives the learning process a better starting basis, approaches the optimal state faster, and clearly outperforms standard Q-learning in learning efficiency and convergence speed.
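The abstract does not specify the "function carrying experiential knowledge". The general idea of learning a model alongside model-free Q-learning is, however, close to the well-known Dyna-Q scheme, which the sketch below implements as an analogy only, not as the paper's algorithm; all names are hypothetical.

```python
import numpy as np
from collections import defaultdict

class DynaQ:
    """Sketch of Q-learning that also learns a model and replans from it (Dyna-Q style)."""

    def __init__(self, n_actions, alpha=0.1, gamma=0.95, epsilon=0.1,
                 planning_steps=10, seed=0):
        self.Q = defaultdict(lambda: np.zeros(n_actions))
        self.model = {}                      # (s, a) -> (r, s_next), the learned model
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.planning_steps = planning_steps
        self.rng = np.random.default_rng(seed)

    def act(self, s):
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(len(self.Q[s])))
        return int(np.argmax(self.Q[s]))

    def learn(self, s, a, r, s_next):
        # model-free update from the real transition
        self._update(s, a, r, s_next)
        # remember the transition so the environment need not be relearned
        self.model[(s, a)] = (r, s_next)
        # extra updates replayed from the remembered model ("experiential knowledge")
        keys = list(self.model)
        for _ in range(min(self.planning_steps, len(keys))):
            ms, ma = keys[int(self.rng.integers(len(keys)))]
            mr, ms_next = self.model[(ms, ma)]
            self._update(ms, ma, mr, ms_next)

    def _update(self, s, a, r, s_next):
        self.Q[s][a] += self.alpha * (r + self.gamma * np.max(self.Q[s_next]) - self.Q[s][a])
```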

4.
Application of the Q-Learning Algorithm to Dribbling in RoboCup   (cited 1 time: 0 self-citations, 1 by others)
The Robot World Cup (RoboCup) is one of the most influential robot soccer competitions in the world, and the simulation league is an important part of it. Given the importance of dribbling in the simulation league, we apply the Q-learning algorithm to dribbling training so that the agent itself can learn and adapt, acquiring knowledge from the environment on its own. This paper describes the method and experimental procedure of using Q-learning to train 1 vs. 1 dribbling in a specific scenario, and the training method is validated in the training of an actual team.

5.
Technical Note: Q-Learning   (cited 6 times: 0 self-citations, 6 by others)
Q-learning (Watkins, 1989) is a simple way for agents to learn how to act optimally in controlled Markovian domains. It amounts to an incremental method for dynamic programming which imposes limited computational demands. It works by successively improving its evaluations of the quality of particular actions at particular states. This paper presents and proves in detail a convergence theorem for Q-learning based on that outlined in Watkins (1989). We show that Q-learning converges to the optimum action-values with probability 1 so long as all actions are repeatedly sampled in all states and the action-values are represented discretely. We also sketch extensions to the cases of non-discounted, but absorbing, Markov environments, and where many Q values can be changed each iteration, rather than just one.
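For reference, the update the note analyzes is the standard Watkins rule, Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') − Q(s,a)]. The sketch below is a minimal tabular implementation; the per-pair decaying step size is one common way to respect the repeated-sampling and step-size conditions, and `env_reset`/`env_step` are placeholder callables for the environment, not part of the paper.

```python
import numpy as np

def q_learning(env_step, env_reset, n_states, n_actions,
               episodes=500, gamma=0.95, epsilon=0.1, seed=0):
    """Tabular Q-learning with the standard Watkins update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env_reset(), False
        while not done:
            a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmax(Q[s]))
            s_next, r, done = env_step(s, a)
            visits[s, a] += 1
            alpha = 1.0 / visits[s, a]                 # decaying per-pair learning rate
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```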

6.
Asynchronous Stochastic Approximation and Q-Learning   (cited 15 times: 6 self-citations, 15 by others)
We provide some general results on the convergence of a class of stochastic approximation algorithms and their parallel and asynchronous variants. We then use these results to study the Q-learning algorithm, a reinforcement learning method for solving Markov decision problems, and establish its convergence under conditions more general than previously available.

7.
Existing reinforcement learning algorithms suffer from low sample efficiency, which weakens the agent's ability to find the optimal policy. To address this, a sample evaluation method based on incremental similarity is proposed. A state-novelty measure and a sample-value function are designed. The similarity between a new state and a set of reference states is computed, the state's novelty is derived from this similarity, and the reference states are updated incrementally until training ends. When computing a sample's value, the state's novelty is taken into account, and the computation is carried out separately according to whether the sample's reward is greater than zero. Finally, samples are drawn according to their value using a combination of rank-based selection and random selection. The method achieves higher rewards on Atari 2600 control problems, showing that it alleviates the low sample-efficiency problem, and computing similarity incrementally reduces the computational cost.
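The abstract does not define the similarity measure, the value function, or the exact update of the reference states. The sketch below is an assumed stand-in that keeps a small set of reference states, scores novelty by cosine similarity to the nearest one, and mixes rank-based and random sampling as the abstract describes; every name (`NoveltySampler`, `merge_threshold`, ...) is hypothetical.

```python
import numpy as np

class NoveltySampler:
    """Sketch of incremental-similarity sample scoring for a replay buffer."""

    def __init__(self, state_dim, max_refs=64, merge_threshold=0.9, rng=None):
        self.refs = np.empty((0, state_dim))   # incrementally maintained reference states
        self.max_refs = max_refs
        self.merge_threshold = merge_threshold
        self.rng = rng or np.random.default_rng()

    def _similarity(self, state):
        if len(self.refs) == 0:
            return 0.0
        # cosine similarity to the closest reference state (assumed metric)
        dots = self.refs @ state
        norms = np.linalg.norm(self.refs, axis=1) * (np.linalg.norm(state) + 1e-8)
        return float(np.max(dots / (norms + 1e-8)))

    def novelty(self, state):
        sim = self._similarity(state)
        # incremental update of the reference set: keep states that look new
        if sim < self.merge_threshold and len(self.refs) < self.max_refs:
            self.refs = np.vstack([self.refs, state])
        return 1.0 - sim

    def sample_value(self, state, reward):
        # novelty enters the value; positive and negative rewards are scored separately
        nov = self.novelty(state)
        return nov * (1.0 + reward) if reward > 0 else nov

    def sample_indices(self, values, batch_size):
        # half rank-based selection, half uniform random, as the abstract describes
        values = np.asarray(values, dtype=float)
        top = np.argsort(values)[::-1][:batch_size // 2]
        rest = self.rng.choice(len(values), size=batch_size - len(top), replace=False)
        return np.concatenate([top, rest])
```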

8.
Reinforcement learning has been widely applied to solve a diverse set of learning tasks, from board games to robot behaviours. In some of them, results have been very successful, but some tasks present several characteristics that make the application of reinforcement learning harder to define. One of these areas is multi-robot learning, which has two important problems. The first is credit assignment, or how to define the reinforcement signal to each robot belonging to a cooperative team depending on the results achieved by the whole team. The second one is working with large domains, where the amount of data can be large and different in each moment of a learning step. This paper studies both issues in a multi-robot environment, showing that domain knowledge and machine learning algorithms can be combined to achieve successful cooperative behaviours.

9.
Reinforcement learning (RL) is a powerful solution to adaptive control when no explicit model exists for the system being controlled. To handle uncertainty along with the lack of explicit model for the Cloud's resource management systems, this paper utilizes continuous RL in order to provide an intelligent control scheme for dynamic resource provisioning in the spot market of the Cloud's computational resources. On the other hand, the spot market of computational resources inside Cloud is a real-time environment in which, from the RL point of view, the control task of dynamic resource provisioning requires defining continuous domains for (state, action) pairs. Commonly, function approximation is used in RL controllers to overcome continuous requirements of (state, action) pair remembrance and to provide estimates for unseen statuses. However, due to the computational complexities of approximation techniques like neural networks, RL is almost impractical for real-time applications. Thus, in this paper, Ink Drop Spread (IDS) modeling method, which is a solution to system modeling without dealing with heavy computational complexities, is used as the basis to develop an adaptive controller for dynamic resource provisioning in Cloud's virtualized environment. The performance of the proposed control mechanism is evaluated through measurement of job rejection rate and capacity waste. The results show that at the end of the training episodes, in 90 days, the controller learns to reduce job rejection rate down to 0% while capacity waste is optimized down to 11.9%.

10.
刘晓, 毛宁. 《数据采集与处理》, 2015, 30(6): 1310-1317
A learning automaton (LA) is an adaptive decision-maker that learns to choose the optimal action from an allowed action set through continual interaction with a random environment. In most traditional LA models the action set is taken to be finite, so for continuous-parameter learning problems the action space must be discretized, and the learning accuracy depends on the granularity of the discretization. This paper proposes a new continuous action-set learning automaton (CALA) whose action set is a variable interval and which selects output actions according to a uniform distribution over that interval. The learning algorithm uses binary feedback signals from the environment to adaptively update the endpoints of the action interval. A simulation on a multi-modal learning problem demonstrates the advantage of the new algorithm over three existing CALA algorithms.
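The abstract does not give the endpoint-update rule, so the sketch below assumes a simple scheme (contract the interval toward an action that receives a favourable binary signal, otherwise widen it slightly) purely to illustrate the interval-plus-uniform-sampling idea; class and parameter names are hypothetical.

```python
import numpy as np

class IntervalCALA:
    """Sketch of a continuous-action-set learning automaton whose action set is an
    interval [lo, hi] sampled uniformly; the update rule is an assumed illustration."""

    def __init__(self, lo=0.0, hi=1.0, shrink=0.1, relax=0.01, seed=0):
        self.lo, self.hi = lo, hi
        self.shrink, self.relax = shrink, relax
        self.rng = np.random.default_rng(seed)

    def act(self):
        # actions are drawn uniformly from the current interval
        return self.rng.uniform(self.lo, self.hi)

    def update(self, action, reward_bit):
        if reward_bit:      # favourable binary signal: contract the interval toward the action
            self.lo += self.shrink * (action - self.lo)
            self.hi -= self.shrink * (self.hi - action)
        else:               # unfavourable: widen the interval slightly to keep exploring
            width = self.hi - self.lo
            self.lo -= self.relax * width
            self.hi += self.relax * width
```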

11.
Deep reinforcement learning is well suited to optimization problems in control. For continuous-action control, precision requirements mean that the number of actions grows exponentially with the action dimension, so the actions are hard to represent discretely. The Deep Deterministic Policy Gradient (DDPG) algorithm, built on the Actor-Critic framework, solves the continuous-action control problem, but it still suffers from a sampling scheme that lacks theoretical guidance and, when the action dimension is high, from ignoring the gap between optimal and non-optimal actions. To address these problems, an improved DDPG algorithm with optimized sampling and precise evaluation is proposed and successfully applied in a simulation environment of a Selective Compliance Assembly Robot Arm (SCARA). Compared with the original DDPG algorithm it achieves good results and realizes fast automatic positioning of the SCARA robot.

12.
A Reinforcement Learning Method Based on a Node-Growing k-Means Clustering Algorithm   (cited 3 times: 0 self-citations, 3 by others)
There are two main approaches to reinforcement learning with continuous states: parametric function approximation and adaptive discretization. After analyzing the strengths and weaknesses of existing adaptive partitioning methods for continuous state spaces, this paper proposes a partitioning method based on a node-growing k-means clustering algorithm and gives the algorithmic steps of the resulting reinforcement learning method for both discrete and continuous actions. Simulation experiments on the Mountain Car problem (discrete actions) and the double-integrator problem (continuous actions) show that the method automatically adjusts the partition resolution according to the distribution of states in the continuous space, achieves adaptive partitioning of the continuous state space, and learns the optimal policy.
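A node-growing k-means partitioner of the kind the abstract describes can be sketched as follows: a new cluster node is created when a state is farther than a growth threshold from every existing node, otherwise the nearest node is moved toward the state by an online k-means step. The threshold and learning rate below are illustrative choices, not the paper's criteria; the returned index can then serve as the discrete state of a tabular learner.

```python
import numpy as np

class GrowingKMeans:
    """Sketch of node-growing k-means for adaptively partitioning a continuous state space."""

    def __init__(self, state_dim, grow_dist=0.5, lr=0.05):
        self.centers = np.empty((0, state_dim))
        self.grow_dist = grow_dist
        self.lr = lr

    def assign(self, state):
        state = np.asarray(state, dtype=float)
        if len(self.centers) == 0:
            self.centers = state[None, :]
            return 0
        dists = np.linalg.norm(self.centers - state, axis=1)
        nearest = int(np.argmin(dists))
        if dists[nearest] > self.grow_dist:
            # grow a new node where the current partition is too coarse
            self.centers = np.vstack([self.centers, state])
            return len(self.centers) - 1
        # otherwise refine the existing node with an online k-means step
        self.centers[nearest] += self.lr * (state - self.centers[nearest])
        return nearest
```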

13.
Improvements to Q-Learning and Simulation Experiments   (cited 1 time: 0 self-citations, 1 by others)
张云, 刘建平. 《计算机仿真》, 2007, 24(10): 111-114
Q-learning is an important reinforcement learning method. To address its shortcomings, several improvements are studied. First, a roulette-wheel method is introduced so that actions are selected probabilistically, which avoids being locked onto high Q-values during early training, adds randomness, and better fits the requirements of Q-learning. Second, to counter the exponential growth of computation in complex environments or with sparse reward functions, additional positive and negative reinforcement signals are introduced; extensive simulation experiments show that negative reinforcement signals are more effective. Both theory and experiments demonstrate that the method is feasible, effectively speeds up the convergence of the Q-function, and improves learning efficiency.
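The abstract does not state how Q-values are mapped to selection probabilities; the sketch below uses a softmax so the probabilities stay positive, which is one common way to realize roulette-wheel selection over Q-values, with `temperature` an illustrative parameter.

```python
import numpy as np

def roulette_action(q_values, temperature=1.0, rng=None):
    """Roulette-wheel (probability-proportional) action selection over Q-values."""
    rng = rng or np.random.default_rng()
    prefs = np.asarray(q_values, dtype=float) / temperature
    prefs -= prefs.max()                          # for numerical stability
    probs = np.exp(prefs) / np.sum(np.exp(prefs))
    return int(rng.choice(len(probs), p=probs))   # spin the wheel
```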

14.
Real-time bidding (RTB) is a widely used ad-delivery model in online display advertising. Because the RTB auction environment is highly dynamic, the optimal bidding strategy is hard to obtain. A reinforcement learning (RL) based bid-optimization method is therefore proposed that uses the Policy Optimization with Penalized Point Probability Distance (POP3D) algorithm to learn the optimal bidding strategy. In the POP3D-based bidding framework, the bidding process is modeled as an episodic Markov decision process; each episode is divided into a fixed number of time steps, and the bid for each ad impression is determined jointly by its predicted click-through rate and a bidding factor. At every time step the bidding agent adjusts the bidding factor according to the auction results of the previous step so that the bidding strategy can adapt to the highly dynamic auction environment; the agent's goal is to learn the optimal factor-adjustment policy. Experiments on the iPinYou dataset show that, compared with the DRLB algorithm, the proposed bidding method improves clicks by 0.2% at budget ratios of 1/16 and 1/32, improves the win rate by 1.8%, 1.0%, and 1.7% at budget ratios of 1/8, 1/16, and 1/32 respectively, and also has an advantage in stability, demonstrating the superiority of the method.

15.
宋江帆, 李金龙. 《计算机应用研究》, 2023, 40(10): 2928-2932+2944
In reinforcement learning, policy gradient methods often need to model continuous-time problems as discrete-time problems through sampling. More accurate modeling requires a higher sampling frequency, but an overly high sampling frequency may make the action change too frequently and thus reduce training efficiency. To address this, an action-stable update algorithm is proposed. The method uses the change in the policy function's output to compute the probability of repeating an action, and randomly repeats or changes the action according to that probability. The algorithm's performance is analyzed theoretically and then evaluated in nine different environments and compared with existing methods; it outperforms them in six of the nine environments. The experimental results show that the action-stable update algorithm can effectively improve the training efficiency of policy gradient methods on continuous-time problems.
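The abstract does not give the formula mapping the change in the policy output to a repetition probability; the sketch below assumes an exponential decay in the norm of that change (here represented by the newly proposed action versus the previous one), purely for illustration, with `sensitivity` a hypothetical parameter.

```python
import numpy as np

def repeat_probability(prev_action, new_action, sensitivity=5.0):
    """The smaller the change in the policy's output, the more likely the
    previous action is simply repeated (assumed exponential mapping)."""
    change = np.linalg.norm(np.asarray(new_action) - np.asarray(prev_action))
    return float(np.exp(-sensitivity * change))

def stabilized_action(prev_action, new_action, rng=None):
    # repeat the previous action with the computed probability, otherwise switch
    rng = rng or np.random.default_rng()
    if rng.random() < repeat_probability(prev_action, new_action):
        return prev_action
    return new_action
```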

16.
Robot position/force control provides an interaction scheme between the robot and the environment. When the environment is unknown, learning algorithms are needed, but the learning space and learning time are large. To balance learning accuracy against learning time, we propose a hybrid reinforcement learning method that operates in both discrete and continuous domains. The discrete-time learning has poor accuracy but takes less time; the continuous-time learning is slow but more precise. The hybrid reinforcement learning learns the optimal contact force while minimizing the position error in the unknown environment. Convergence of the proposed learning algorithm is proven. Real-time experiments are carried out using a pan-and-tilt robot and a force/torque sensor.

17.
赵逸凡, 郝丹. 《软件学报》, 2023, 34(6): 2708-2726
As software delivery increasingly emphasizes speed and reliability, continuous integration (CI) has attracted much attention. Developers continually integrate working copies into the mainline to evolve the software, and each integration is verified by an automated build and test run that checks whether the code update introduces faults. As software grows, however, the test suite contains more and more test cases, and characteristics such as their coverage and fault-detection ability change as the integration cycles go on, so traditional test-case prioritization techniques are hard to apply. Reinforcement learning based test prioritization can adjust the ranking policy dynamically according to test feedback, but existing techniques cannot rank by jointly considering the information of the whole test suite, which limits their performance. This paper proposes a new reinforcement learning based test-case prioritization method for continuous integration environments, the pointer-ranking method. It takes features such as the historical information of the test cases as input; in each integration cycle the agent uses a pointer attention mechanism to obtain attention weights over all candidate test cases, which yield the ranking, and the feedback from test execution gives the direction for updating the policy. Through the repeated "rank - run tests - feedback" loop, the ranking policy is continually adjusted until it reaches good performance. The method is evaluated on five large datasets, examining the effect of the length of the historical information used, the ranking performance on datasets containing only regression test cases, and the execution efficiency. The conclusions are: (1) compared with existing methods, the pointer-ranking method can adjust its ranking policy as software versions evolve and effectively improves the fault-detection capability of the test sequence in a CI environment; (2) it is robust to the length of the input historical information, and a small amount of history is enough to reach its best performance; (3) it handles regression test cases and newly added test cases well; (4) its time overhead is small, and combined with its better and more stable ranking performance, the pointer-ranking method is a very competitive approach.

18.
Learning Team Strategies: Soccer Case Studies   (cited 1 time: 0 self-citations, 1 by others)
We use simulated soccer to study multiagent learning. Each team's players (agents) share action set and policy, but may behave differently due to position-dependent inputs. All agents making up a team are rewarded or punished collectively in case of goals. We conduct simulations with varying team sizes, and compare several learning algorithms: TD-Q learning with linear neural networks (TD-Q), Probabilistic Incremental Program Evolution (PIPE), and a PIPE version that learns by coevolution (CO-PIPE). TD-Q is based on learning evaluation functions (EFs) mapping input/action pairs to expected reward. PIPE and CO-PIPE search policy space directly. They use adaptive probability distributions to synthesize programs that calculate action probabilities from current inputs. Our results show that linear TD-Q encounters several difficulties in learning appropriate shared EFs. PIPE and CO-PIPE, however, do not depend on EFs and find good policies faster and more reliably. This suggests that in some multiagent learning scenarios direct search in policy space can offer advantages over EF-based approaches.

19.
Forgetting is the biggest problem for artificial neural networks in incremental learning and is known as "catastrophic forgetting". Humans, by contrast, can continually acquire new knowledge while retaining most of the old knowledge they use frequently; this ability to keep learning incrementally with little forgetting is related to the brain's partitioned learning structure and memory-replay capability. To simulate this structure and capability, this paper proposes a self-learning-mask partitioned incremental learning method that avoids recency bias, ASPIL for short. It contains two stages, "region isolation" and "region integration", which alternate iteratively to achieve continual incremental learning. First, a BN-sparse region isolation method is proposed to isolate the new learning process from existing knowledge and avoid interfering with it. For region integration, a self-learning mask (SLM) and a dual-branch fusion (GBF) method are proposed: the SLM extracts new knowledge accurately and improves the network's adaptability to it, while GBF fuses old and new knowledge to build a unified, high-accuracy representation. During training, to further take old knowledge into account and avoid a preference for new knowledge, a margin-loss regularization term is introduced against the "recency bias" problem. To evaluate the proposed methods, systematic ablation experiments are conducted on the standard incremental learning datasets CIFAR-100 and miniImageNet, and comparisons are made with a series of recent well-known methods. The results show that the proposed method improves the memory capability of artificial neural networks, raising the recognition rate by an average of more than 5.27% over the latest well-known methods.

20.
Q-Learning is currently a mainstream reinforcement learning algorithm, but it converges slowly in stochastic environments. Previous research improved on the overestimation problem of Speedy Q-Learning and proposed the Double Speedy Q-Learning algorithm. However, Double Speedy Q-Learning does not consider the self-loop structure present in stochastic environments, i.e., when the agent executes an action...
