Similar Documents
20 similar documents retrieved.
1.
Research on Multi-Agent Cooperation Based on Reinforcement Learning   (cited 2 times: 0 self-citations, 2 by others)
Reinforcement learning provides a robust learning method for cooperation among multiple Agents. This paper first introduces the principles and components of reinforcement learning, then describes the multi-Agent Markov decision process (MMDP) and presents a reinforcement learning model for Agents. On this basis, the two reinforcement learning schemes used in multi-Agent cooperation, IL (independent learning) and JAL (joint-action learning), are compared. Finally, several coordination mechanisms commonly used in cooperative multi-Agent systems when multiple optimal policies exist are analyzed.
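The IL/JAL distinction in this abstract can be illustrated with a tabular Q-learning sketch (Python, hypothetical two-agent task): an independent learner keeps a Q-table over its own actions only, while a joint-action learner indexes Q by the joint action of both agents. This is a generic illustration of the two schemes, not the paper's own code.

```python
import numpy as np
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.9  # learning rate and discount factor

class IndependentLearner:
    """IL: the agent's Q-table ranges over its OWN actions only."""
    def __init__(self, n_actions):
        self.Q = defaultdict(lambda: np.zeros(n_actions))

    def update(self, s, a, r, s_next):
        target = r + GAMMA * self.Q[s_next].max()
        self.Q[s][a] += ALPHA * (target - self.Q[s][a])

class JointActionLearner:
    """JAL: Q is indexed by the JOINT action (own action, teammate action),
    so the agent can evaluate how its choices combine with the teammate's."""
    def __init__(self, n_actions):
        self.Q = defaultdict(lambda: np.zeros((n_actions, n_actions)))

    def update(self, s, a_self, a_other, r, s_next):
        target = r + GAMMA * self.Q[s_next].max()
        self.Q[s][a_self, a_other] += ALPHA * (target - self.Q[s][a_self, a_other])
```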

2.
The learning process of a single mobile Agent is discussed in combination with reinforcement learning techniques and then extended to the multi-mobile-Agent setting, and a multi-mobile-Agent learning algorithm, MMAL (Multi Mobile Agent Learning), is proposed. The algorithm fully accounts for the characteristics of mobile-Agent learning, enabling mobile Agents to make decisions in contexts with uncertainty and conflicting goals, handling the choice of when to migrate during learning, and greatly reducing computational cost. The goal is to let Agents learn autonomously and cooperatively in stochastic, dynamic environments. Finally, simulation experiments show that this learning algorithm is an efficient and fast learning method.

3.
In a multi-Agent system, learning allows an Agent to continually extend and reinforce its existing knowledge and abilities and to choose reasonable actions that maximize its own payoff. However, most existing work on Agent learning is limited to the single-Agent setting, or considers only adversarial interaction between individual Agents; it neither considers group-versus-group confrontation nor the roles Agents play within a team, and it relies entirely on perceived utility to infer the opponent's policy, which slows convergence. This work therefore extends single-Agent learning to group Agent learning in non-communicating group-adversarial environments. Taking the particularities of different learning problems into account, a role attribute is added to the learning model, and a role-tracking reinforcement learning algorithm for group Agents is proposed and analyzed experimentally. During learning, the opponents' roles are tracked dynamically, the learning rate is adjusted according to how well an opponent's behavior matches its role, and the minimax-Q algorithm is used to correct the utility value of each state, which accelerates convergence and improves on the work of Bowling and Littman.
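For reference, the minimax-Q correction mentioned above follows Littman's minimax-Q scheme, where the state value is the solution of a small linear program over the agent's mixed policy. The sketch below (Python with NumPy/SciPy) is a generic minimax-Q step; the role tracking and role-matched learning rate of the paper are only represented by the fixed alpha parameter and are not implemented here.

```python
import numpy as np
from scipy.optimize import linprog

def minimax_value(Q_s):
    """Solve max_pi min_o sum_a pi[a] * Q_s[a, o] as a linear program.
    Q_s: payoff matrix for one state, shape (n_actions, n_opponent_actions)."""
    n_a, n_o = Q_s.shape
    # Variables: [v, pi_0, ..., pi_{n_a-1}]; maximize v  <=>  minimize -v.
    c = np.zeros(n_a + 1); c[0] = -1.0
    # For every opponent action o:  v - sum_a pi[a] * Q_s[a, o] <= 0.
    A_ub = np.hstack([np.ones((n_o, 1)), -Q_s.T])
    b_ub = np.zeros(n_o)
    # Policy probabilities sum to 1.
    A_eq = np.hstack([[0.0], np.ones(n_a)]).reshape(1, -1)
    b_eq = np.array([1.0])
    bounds = [(None, None)] + [(0.0, 1.0)] * n_a
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[0], res.x[1:]  # value and mixed policy

def minimax_q_update(Q, V, s, a, o, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular minimax-Q step; the paper's role-based learning-rate
    scheduling would replace the fixed alpha used here."""
    Q[s][a, o] += alpha * (r + gamma * V[s_next] - Q[s][a, o])
    V[s], _ = minimax_value(Q[s])
```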

4.
Self-organization of multi-Agent behavior in complex and changing environments is a current hot topic in artificial intelligence research. Drawing on theories of biological system control, this paper designs a multi-Agent architecture that contains genetic, neural, and endocrine control subsystems. The paper focuses on the role of the endocrine control subsystem and proposes a new method of behavior learning driven by emotion. In this method, the neural system is responsible for memory and action selection, and the effect of its decisions is fed back through the endocrine system, which avoids self-training of the neural network and gives the algorithm good solution performance. To verify the effectiveness of the algorithm, simulation experiments on inverted-pendulum control were carried out; the results show that the algorithm has strong adaptive problem-solving ability.

5.
An Intelligent Network Collaboration Model Based on Roles and CSCL   (cited 2 times: 0 self-citations, 2 by others)
To further study the application of intelligent Agents in open, dynamic network environments, a role mechanism is applied to the networked learning environment and a new CSCL-based intelligent network collaboration model is proposed. The structural representation and functions of the intelligent Agent are given, and the Agents in the model are classified from a multi-role perspective. Finally, taking a collaborative learning activity as an example, the role-based collaboration process among Agents is described formally.

6.
A Survey of Multi-Agent Deep Reinforcement Learning   (cited 10 times: 4 self-citations, 6 by others)
In recent years, deep reinforcement learning (DRL) has achieved major breakthroughs on many complex sequential decision problems. By combining the powerful representation ability of deep learning with the effective policy-search ability of reinforcement learning, DRL has become a promising learning paradigm for artificial intelligence. However, many difficulties and challenges remain in the research and application of DRL to multi-Agent systems; multi-Agent learning under partial observability, exemplified by StarCraft II, is still far from ideal. This paper briefly introduces representative DRL algorithms and related techniques, such as deep Q-networks and deep policy-gradient methods. Existing multi-Agent DRL algorithms are then categorized from the perspective of the communication process into three mainstream forms: full communication with centralized decision-making, full communication with decentralized decision-making, and limited communication with decentralized decision-making. Key issues in multi-Agent DRL, including training architectures, sample augmentation, robustness, and opponent modeling, are discussed, and research hotspots and development prospects are analyzed.
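As a concrete reference for the deep Q-network component mentioned in the survey, the following PyTorch sketch computes the standard one-step DQN temporal-difference loss with a target network. An "independent DQN per agent" is the simplest fully decentralized baseline; the network sizes here are arbitrary illustrative choices, not anything specified by the paper.

```python
import torch
import torch.nn as nn

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Standard DQN TD loss for a batch of transitions (s, a, r, s_next, done);
    a is an int64 tensor of action indices, done is a 0/1 float tensor."""
    s, a, r, s_next, done = batch
    # Q(s, a) for the actions actually taken.
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bootstrapped target from a slowly updated target network.
        max_q_next = target_net(s_next).max(dim=1).values
        target = r + gamma * (1.0 - done) * max_q_next
    return nn.functional.mse_loss(q_sa, target)

def make_q_net(obs_dim, n_actions):
    """One small Q-network per agent: the simplest decentralized baseline."""
    return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
```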

7.
With the rapid progress of deep learning and reinforcement learning research, multi-Agent reinforcement learning has become a general approach to large-scale, complex sequential decision problems. To advance the field, this paper collects and summarizes recent results from the perspectives of competition and cooperation. It introduces single-Agent reinforcement learning; presents the basic theoretical frameworks of multi-Agent reinforcement learning, namely Markov games and extensive-form games; and reviews classical algorithms and recent progress in competitive, cooperative, and mixed settings. It then discusses the core challenge facing multi-Agent reinforcement learning, the non-stationarity of the environment, and uses an example to summarize solution ideas and offer an outlook.

8.
Research on the Application of Machine Learning to Multi-Agent Automated Negotiation   (cited 2 times: 0 self-citations, 2 by others)
Applying machine learning theory to multi-Agent automated negotiation systems has become a recent research topic in e-commerce. This paper uses Bayes' rule to update environmental information (i.e., beliefs) during negotiation and uses the Q-learning algorithm from reinforcement learning to generate proposals, building a multi-Agent automated negotiation model with a learning mechanism. The traditional Q-learning algorithm is extended into a dynamic Q-learning algorithm based on the Agent's current beliefs and recent exploration surplus. Experiments verify the convergence of the algorithm.
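A minimal sketch of the two ingredients named in the abstract, Bayesian belief updating and Q-learning over proposals, is given below (Python). The opponent-type model, offer space, and update weights are hypothetical; the paper's dynamic Q-learning rule based on "recent exploration surplus" is not reproduced here.

```python
import numpy as np

def bayes_update(prior, likelihood, observation):
    """Update a discrete belief over opponent types with Bayes' rule.
    prior: shape (n_types,); likelihood[t, o] = P(observation o | type t)."""
    posterior = prior * likelihood[:, observation]
    return posterior / posterior.sum()

def choose_offer(Q, belief, epsilon=0.1):
    """Pick the offer with the highest expected Q-value under the current
    belief (epsilon-greedy); Q has shape (n_types, n_offers)."""
    if np.random.rand() < epsilon:
        return np.random.randint(Q.shape[1])
    return int(np.argmax(belief @ Q))

def q_update(Q, belief, offer, reward, alpha=0.1):
    """Credit the chosen offer under each opponent type, weighted by belief."""
    Q[:, offer] += alpha * belief * (reward - Q[:, offer])
```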

9.
A Multi-Agent Intelligent Recommendation Model for Web-Based Learning   (cited 1 time: 0 self-citations, 1 by others)
To address the difficulty online learners face in selecting from massive amounts of information, a multi-Agent intelligent recommendation model for web-based learning is proposed. An interface Agent handles interaction with the learner, a knowledge-based recommendation Agent provides recommendations related to the learner's interests, and an Agent based on similar learners recommends new knowledge to a particular learner; the similarity algorithm used for recommendation in the model is also described. Through the use of multi-Agent technology, the model addresses the intelligence, personalization, and flexibility of web-based learning recommendation, giving learners more personalized recommendation services in an interactive learning environment.
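The similar-learner recommendation Agent described above can be approximated by standard user-based collaborative filtering. The sketch below (Python) uses cosine similarity between learners' rating vectors; this specific similarity measure is an assumption, since the abstract does not state which one the model uses.

```python
import numpy as np

def cosine_similarity(u, v):
    """Similarity between two learners' interest/rating vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def recommend_from_similar_learners(ratings, learner, k=3, top_n=5):
    """Recommend items that the k most similar learners liked but this learner
    has not seen. ratings: (n_learners, n_items) matrix, 0 = unseen."""
    sims = np.array([cosine_similarity(ratings[learner], r) for r in ratings])
    sims[learner] = -np.inf                      # exclude the learner themself
    neighbors = np.argsort(sims)[-k:]            # k most similar learners
    scores = ratings[neighbors].mean(axis=0)
    scores[ratings[learner] > 0] = -np.inf       # drop items already seen
    return np.argsort(scores)[-top_n:][::-1]
```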

10.
A Reinforcement Learning Model and Algorithm for Multi-Agent Cooperation   (cited 2 times: 0 self-citations, 2 by others)
The process of multi-Agent cooperative learning is discussed in combination with reinforcement learning techniques, and a new multi-Agent cooperative learning model is constructed. On the basis of this model, a multi-Agent cooperative learning algorithm is proposed. The algorithm fully accounts for the characteristics of joint learning among multiple Agents, letting each Agent predict its action policy from estimates of the long-term benefit of actions and make decisions accordingly, so that an optimal joint-action policy is reached. Finally, simulation experiments on the hunter-prey pursuit problem verify the convergence of the algorithm and show that it is an efficient and fast learning method.

11.
Research on a Parallel Reinforcement Learning Algorithm and Its Application   (cited 2 times: 0 self-citations, 2 by others)
Reinforcement learning is an important machine learning method, but slow convergence is one of its main shortcomings in practical applications. To improve the efficiency of reinforcement learning, a parallel reinforcement learning algorithm is proposed. Multiple learners learn in parallel; after each has learned for a fixed period, their results are fused using Dempster-Shafer (D-S) evidence theory, and each then carries out the next learning period on the basis of the fused result, improving the learning efficiency of the whole system. Experimental results show the feasibility and effectiveness of the method.
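The fusion step relies on Dempster-Shafer evidence theory. The following Python sketch implements Dempster's combination rule for two learners' basic probability assignments over candidate actions; the toy frame of discernment ("left"/"right") is illustrative only and is not taken from the paper.

```python
def dempster_combine(m1, m2):
    """Combine two basic probability assignments over the same frame of
    discernment with Dempster's rule. m1, m2: dicts mapping frozenset -> mass."""
    combined, conflict = {}, 0.0
    for A, w1 in m1.items():
        for B, w2 in m2.items():
            C = A & B
            if C:
                combined[C] = combined.get(C, 0.0) + w1 * w2
            else:
                conflict += w1 * w2
    # Normalize by the non-conflicting mass.
    return {C: w / (1.0 - conflict) for C, w in combined.items()}

# Toy usage: two learners' beliefs about which action is best in some state.
m_a = {frozenset({"left"}): 0.6, frozenset({"left", "right"}): 0.4}
m_b = {frozenset({"left"}): 0.5, frozenset({"right"}): 0.3, frozenset({"left", "right"}): 0.2}
print(dempster_combine(m_a, m_b))
```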

12.
This paper studies iterative learning control (ILC) in a multi-agent framework, wherein a group of agents simultaneously and repeatedly perform the same task. Assuming similarity between the agents, we investigate whether exchanging information between the agents improves an individual's learning performance. That is, does an individual agent benefit from the experience of the other agents? We consider the multi-agent iterative learning problem as a two-step process of: first, estimating the repetitive disturbance of each agent; and second, correcting for it. We present a comparison of an agent's disturbance estimate in the case of (I) independent estimation, where each agent has access only to its own measurement, and (II) joint estimation, where information of all agents is globally accessible. When the agents are identical and noise comes from measurement only, joint estimation yields a noticeable improvement in performance. However, when process noise is encountered or when the agents have an individual disturbance component, the benefit of joint estimation is negligible. Copyright © 2011 John Wiley and Sons Asia Pte Ltd and Chinese Automatic Control Society
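A much-simplified numerical sketch of the two-step process described in the abstract (estimate the repetitive disturbance, then correct for it) is given below in Python. The update gains and the simple averaging used for the joint-estimation case are illustrative assumptions, not the estimator analyzed in the paper.

```python
import numpy as np

def ilc_disturbance_update(d_hat, errors, joint=True, gain=0.5):
    """One iteration of disturbance estimation for N agents running the same task.
    errors: array of shape (N, T) with each agent's tracking error this trial.
    joint=True pools all agents' measurements (assuming identical agents);
    joint=False lets each agent use only its own measurement."""
    if joint:
        # Shared estimate: average the agents' error signals before updating.
        innovation = np.tile(errors.mean(axis=0), (errors.shape[0], 1))
    else:
        innovation = errors
    return d_hat + gain * innovation

def ilc_input_update(u, d_hat, learn_gain=1.0):
    """Correct each agent's feedforward input with its disturbance estimate."""
    return u - learn_gain * d_hat
```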

13.
Pricing in B2B electronic markets is a continuous decision process that is part learning and part reasoning: instead of directly adopting the equilibrium strategy prescribed by a multi-agent learning algorithm, each pricing agent reasons over the game history and continually learns its opponents' strategies. An efficient online learning method for agents in multi-agent environments based on introspective reasoning is proposed, combining objective observation of behavior via opponent models with subjective intention inference via perspective-taking. Simulation results confirm the effectiveness of the algorithm for pricing in electronic markets.

14.
曹伟  孙明 《控制与决策》2018,33(9):1619-1624
For a class of partially non-regular multi-agent systems with arbitrary initial states, an iterative learning control algorithm is proposed. The algorithm converts the formation control problem of multi-agent systems with a fixed topology into a generalized tracking problem: the leader tracks a given desired trajectory, while each follower maintains the prescribed formation by tracking some agent, which it treats as its own leader. To allow every agent to form the desired formation from an arbitrary initial state, an iterative learning law is also designed for each agent's initial state; the convergence of the algorithm is rigorously proven and sufficient conditions for convergence are given. The proposed algorithm achieves stable formation of the system over a finite time interval for agents starting from arbitrary initial positions. Finally, a simulation example further verifies the effectiveness of the proposed algorithm.

15.
N identical agents with bounded inputs aim to reach a common target state (consensus) in the minimum possible time. Algorithms for computing this time-optimal consensus point, the control law to be used by each agent and the time taken for the consensus to occur, are proposed. Two types of multi-agent systems are considered, namely (1) coupled single-integrator agents on a plane, and (2) double-integrator agents on a line. At the initial time instant, each agent is assumed to have access to the state information of all the other agents. An algorithm, using convexity of attainable sets and Helly's theorem, is proposed to compute the final consensus target state and the minimum time to achieve this consensus. Further, parts of the computation are parallelised amongst the agents such that each agent has to perform computations of O(N^2) run-time complexity. Finally, local feedback time-optimal control laws are synthesised to drive each agent to the target point in minimum time. During this part of the operation, the controller for each agent uses measurements of only its own states and does not need to communicate with any neighbouring agents.

16.
This paper studies distributed policy evaluation in multi-agent reinforcement learning. Under cooperative settings, each agent obtains only a local reward, while all agents share a common environmental state. To optimize the global return, defined as the sum of the local returns, the agents exchange information with their neighbors through a communication network. The mean squared projected Bellman error minimization problem is reformulated as a constrained convex optimization problem with a consensus constraint; then, a distributed alternating direction method of multipliers (ADMM) algorithm is proposed to solve it. Furthermore, an inexact step for ADMM is used to achieve efficient computation at each iteration. The convergence of the proposed algorithm is established.
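The consensus-constrained formulation in this abstract can be illustrated with a generic global-consensus ADMM template. The Python sketch below solves a distributed least-squares surrogate rather than the mean squared projected Bellman error, and uses exact closed-form local updates instead of the paper's inexact step, so it shows the structure of the method rather than the paper's algorithm.

```python
import numpy as np

def consensus_admm(A_list, b_list, rho=1.0, iters=100):
    """Global-consensus ADMM for min_x sum_i ||A_i x - b_i||^2 subject to all
    agents agreeing on x. Each agent i only touches its own (A_i, b_i)."""
    n, N = A_list[0].shape[1], len(A_list)
    x = [np.zeros(n) for _ in range(N)]
    u = [np.zeros(n) for _ in range(N)]   # scaled dual variables
    z = np.zeros(n)                        # consensus variable
    for _ in range(iters):
        for i in range(N):
            # Local x-update: quadratic local cost plus proximal term (closed form).
            H = 2.0 * A_list[i].T @ A_list[i] + rho * np.eye(n)
            g = 2.0 * A_list[i].T @ b_list[i] + rho * (z - u[i])
            x[i] = np.linalg.solve(H, g)
        # Consensus update: average of local estimates plus duals.
        z = np.mean([x[i] + u[i] for i in range(N)], axis=0)
        for i in range(N):
            # Dual update accumulates each agent's disagreement with the consensus.
            u[i] = u[i] + x[i] - z
    return z
```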

17.
《Knowledge》2006,19(1):43-49
Interface agents are computer programs that provide personalized assistance to users with their computer-based tasks. The interface agents developed so far have focused their attention on learning a user's preferences in a given application domain and on assisting him according to them. However, in order to personalize the interaction with users, interface agents should also learn how to best interact with each user and how to provide them assistance of the right sort at the right time. To fulfil this goal, an interface agent has to discover when the user wants a suggestion to solve a problem or deal with a given situation, when he requires only a warning about it and when he does not need any assistance at all. In this work, we propose a learning algorithm, named WoS, to tackle this problem. Our algorithm is based on the observation of a user's actions and on a user's reactions to the agent's assistance actions. The WoS algorithm enables an interface agent to adapt its behavior and its interaction with a user to the user's assistance requirements in each particular context.

18.
This paper formulates a self-organization algorithm to address the problem of global behavior supervision in engineered swarms of arbitrarily large population sizes. The swarms considered in this paper are assumed to be homogeneous collections of independent identical finite-state agents, each of which is modeled by an irreducible finite Markov chain. The proposed algorithm computes the necessary perturbations in the local agents' behavior, which guarantees convergence to the desired observed state of the swarm. The ergodicity property of the swarm, which is induced as a result of the irreducibility of the agent models, implies that while the local behavior of the agents converges to the desired behavior only in the time average, the overall swarm behavior converges to the specification and stays there at all times. A simulation example illustrates the underlying concept.
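The ergodicity argument above can be checked numerically: the sketch below (Python) computes the stationary distribution of an agent's irreducible Markov chain and the swarm-level state fractions that should converge to it. The perturbation-design part of the paper's algorithm is not reproduced; the transition matrix is a made-up example.

```python
import numpy as np

def stationary_distribution(P):
    """Stationary distribution of an irreducible finite Markov chain with
    row-stochastic transition matrix P (left eigenvector for eigenvalue 1)."""
    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return pi / pi.sum()

def swarm_state_fractions(P, agent_states):
    """Observed swarm state: fraction of agents in each local state. For a large
    ergodic swarm this concentrates near the stationary distribution of P."""
    counts = np.bincount(agent_states, minlength=P.shape[0])
    return counts / counts.sum()

# Toy check: with many identical agents, both the time-averaged local behavior
# and the swarm-level distribution approach the chain's stationary distribution.
P = np.array([[0.9, 0.1], [0.3, 0.7]])
print(stationary_distribution(P))   # [0.75, 0.25] for this example chain
```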

19.
乔林  罗杰 《计算机科学》2012,39(5):213-216
Aiming to improve the learning efficiency of the Q-learning algorithm in multi-agent systems, and using the pursuit problem as the experimental platform, this paper proposes a Q-learning algorithm based on shared experience. The algorithm mimics human team learning: all agents share a common final goal, capturing the prey, while each agent obtains its own stage goal through negotiation. Learning is divided into stages; at the end of each stage the agents hold a review and share their good learning experience for use in the next stage, so that the fast and strong learners pull along the slow and weak ones, improving overall learning performance. Simulation experiments show that the experience-sharing Q-learning algorithm improves the performance of the learning system and converges efficiently to the optimal policy.
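A simplified stand-in for the stage-wise experience sharing described above is sketched below in Python: ordinary tabular Q-learning within a stage, followed by an end-of-stage step in which the other agents merge in the best-performing agent's Q-table. The merge-by-maximum rule is an assumption for illustration, not the paper's exact sharing mechanism.

```python
import numpy as np
from collections import defaultdict

N_ACTIONS = 4  # e.g. four grid moves in the pursuit domain (assumed)

def make_q():
    return defaultdict(lambda: np.zeros(N_ACTIONS))

def q_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Ordinary tabular Q-learning update used within a stage."""
    Q[s][a] += alpha * (r + gamma * Q[s_next].max() - Q[s][a])

def share_experience(Qs, stage_returns):
    """End-of-stage review: every other agent merges in the best performer's
    table, keeping its own entry wherever it is already larger (assumed rule)."""
    best = int(np.argmax(stage_returns))
    for i, Q in enumerate(Qs):
        if i == best:
            continue
        for s, q_best in Qs[best].items():
            Q[s] = np.maximum(Q[s], q_best)
```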

20.
In this paper, we investigate the use of hierarchical reinforcement learning (HRL) to speed up the acquisition of cooperative multi-agent tasks. We introduce a hierarchical multi-agent reinforcement learning (RL) framework, and propose a hierarchical multi-agent RL algorithm called Cooperative HRL. In this framework, agents are cooperative and homogeneous (use the same task decomposition). Learning is decentralized, with each agent learning three interrelated skills: how to perform each individual subtask, the order in which to carry them out, and how to coordinate with other agents. We define cooperative subtasks to be those subtasks in which coordination among agents significantly improves the performance of the overall task. Those levels of the hierarchy which include cooperative subtasks are called cooperation levels. A fundamental property of the proposed approach is that it allows agents to learn coordination faster by sharing information at the level of cooperative subtasks, rather than attempting to learn coordination at the level of primitive actions. We study the empirical performance of the Cooperative HRL algorithm using two testbeds: a simulated two-robot trash collection task, and a larger four-agent automated guided vehicle (AGV) scheduling problem. We compare the performance and speed of Cooperative HRL with other learning algorithms, as well as several well-known industrial AGV heuristics. We also address the issue of rational communication behavior among autonomous agents in this paper. The goal is for agents to learn both action and communication policies that together optimize the task given a communication cost. We extend the multi-agent HRL framework to include communication decisions and propose a cooperative multi-agent HRL algorithm called COM-Cooperative HRL. In this algorithm, we add a communication level to the hierarchical decomposition of the problem below each cooperation level. Before an agent makes a decision at a cooperative subtask, it decides if it is worthwhile to perform a communication action. A communication action has a certain cost and provides the agent with the actions selected by the other agents at a cooperation level. We demonstrate the efficiency of the COM-Cooperative HRL algorithm as well as the relation between the communication cost and the learned communication policy using a multi-agent taxi problem.
