In recent years, great strides have been made towards creating autonomous agents that can learn via interaction with their environment. When considering just an individual agent, it is often appropriate to model the world as being stationary, meaning that the same action from the same state will always yield the same (possibly stochastic) effects. However, in the presence of other independent agents, the environment is not stationary: an action’s effects may depend on the actions of the other agents. This non-stationarity poses the primary challenge of multiagent learning and is the main reason that it is best considered distinctly from single-agent learning. The multiagent learning problem is often studied in the stylized settings provided by repeated matrix games. The goal of this article is to introduce a novel multiagent learning algorithm for such a setting, called Convergence with Model Learning and Safety (or CMLeS), that achieves a new set of objectives that has not previously been achieved. Specifically, CMLeS is the first multiagent learning algorithm to achieve the following three objectives: (1) it converges to following a Nash equilibrium joint policy in self-play; (2) it achieves close to the best response when interacting with a set of memory-bounded agents whose memory size is upper bounded by a known value; and (3) it ensures an individual return that is very close to its security value when interacting with any other set of agents. Our presentation of CMLeS is backed by a rigorous theoretical analysis, including an analysis of sample complexity wherever applicable.
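
The security value mentioned in the abstract above is the payoff an agent can guarantee regardless of what the other agents do. The following is only a minimal sketch of that idea, not part of CMLeS: it restricts the agent to pure strategies, so it yields a lower bound on the true security value (which in general is defined over mixed strategies and is typically found with a linear program). The function name and the payoff matrix are hypothetical.

```python
def pure_security_value(payoffs):
    """Return (row, value): the row whose worst-case payoff is largest.

    `payoffs[i][j]` is the row player's payoff when it plays row i and the
    opponent plays column j. Only pure strategies are considered, so the
    result is a lower bound on the mixed-strategy security value.
    """
    worst_per_row = [min(row) for row in payoffs]
    best_row = max(range(len(payoffs)), key=lambda i: worst_per_row[i])
    return best_row, worst_per_row[best_row]


# Hypothetical 2x3 payoff matrix for the row player.
example_payoffs = [
    [3, 0, 2],
    [1, 1, 4],
]
security_row, security_value = pure_security_value(example_payoffs)
print(security_row, security_value)
```

In this made-up matrix the first row risks a payoff of 0 while the second row never falls below 1, so the second row is the safer pure choice.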

Multiagent learning provides a promising paradigm to study how autonomous agents learn to achieve coordinated behavior in multiagent systems. In multiagent learning, the concurrency of multiple distributed learning processes makes the environment nonstationary for each individual learner. Developing an efficient learning approach to coordinate agents’ behavior in this dynamic environment is a difficult problem especially when agents do not know the domain structure and at the same time have only local observability of the environment. In this paper, a coordinated learning approach is proposed to enable agents to learn where and how to coordinate their behavior in loosely coupled multiagent systems where the sparse interactions of agents constrain coordination to some specific parts of the environment. In the proposed approach, an agent first collects statistical information to detect those states where coordination is most necessary by considering not only the potential contributions from all the domain states but also the direct causes of the miscoordination in a conflicting state. The agent then learns to coordinate its behavior with others through its local observability of the environment according to different scenarios of state transitions. To handle the uncertainties caused by agents’ local observability, an optimistic estimation mechanism is introduced to guide the learning process of the agents. Empirical studies show that the proposed approach can achieve a better performance by improving the average agent reward compared with an uncoordinated learning approach and by reducing the computational complexity significantly compared with a centralized learning approach. Copyright © 2012 John Wiley & Sons, Ltd.

By using other agents' experiences and knowledge, a learning agent may learn faster, make fewer mistakes, and create some rules for unseen situations. These benefits would be gained if the learning agent can extract proper rules from the other agents' knowledge for its own requirements. One possible way to do this is to have the learner assign some expertness values (intelligence level values) to the other agents and use their knowledge accordingly. Some criteria to measure the expertness of the reinforcement learning agents are introduced. Also, a new cooperative learning method, called weighted strategy sharing (WSS), is presented. In this method, each agent measures the expertness of its teammates, assigns a weight to their knowledge, and learns from them accordingly. The presented methods are tested on two Hunter-Prey systems. We consider the case in which the agents all learn from each other and compare it with the case in which agents cooperate only with the more expert ones. Also, the effect of communication noise, as a source of uncertainty, on the cooperative learning method is studied. Moreover, the Q-table of one of the cooperative agents is changed randomly and its effects on the presented methods are examined.
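
The weighted combination of teammates' knowledge described above can be sketched as a weighted average of Q-tables. This is an illustrative sketch under stated assumptions, not the authors' exact procedure: the abstract does not give the weighting formula, so weights here are simply taken proportional to each agent's expertness value, and the function name and example tables are hypothetical.

```python
def weighted_strategy_sharing(q_tables, expertness):
    """Combine Q-tables as a weighted average.

    `q_tables` is a list of tables indexed as q[state][action];
    `expertness` holds one non-negative score per table.
    Assumption: weights are proportional to expertness.
    """
    total = sum(expertness)
    if total <= 0:
        raise ValueError("expertness values must sum to a positive number")
    weights = [e / total for e in expertness]
    n_states = len(q_tables[0])
    n_actions = len(q_tables[0][0])
    return [
        [
            sum(w * q[s][a] for w, q in zip(weights, q_tables))
            for a in range(n_actions)
        ]
        for s in range(n_states)
    ]


# Two hypothetical agents, two states, two actions.
q_a = [[1.0, 0.0], [0.0, 2.0]]
q_b = [[3.0, 4.0], [0.0, 0.0]]
shared = weighted_strategy_sharing([q_a, q_b], expertness=[3.0, 1.0])
print(shared)
```

Restricting the lists to agents whose expertness exceeds the learner's own would give a rough analogue of the "cooperate only with the more expert ones" variant mentioned in the abstract.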

To date, many researchers have proposed various methods to improve the learning ability in multiagent systems. However, most of these studies are not appropriate for more complex multiagent learning problems because the state space of each learning agent grows exponentially in terms of the number of partners present in the environment. Modeling other learning agents present in the domain as part of the state of the environment is not a realistic approach. In this paper, we combine advantages of the modular approach, fuzzy logic and the internal model in a single novel multiagent system architecture. The architecture is based on a fuzzy modular approach whose rule base is partitioned into several different modules. Each module deals with a particular agent in the environment and maps the input fuzzy sets to the action Q-values; these represent the state space of each learning module and the action space, respectively. Each module also uses an internal model table to estimate actions of the other agents. Finally, we investigate the integration of a parallel update method with the proposed architecture. Experimental results obtained on two different environments of a well-known pursuit domain show the effectiveness and robustness of the proposed multiagent architecture and learning approach.

Individual learning in an environment where more than one agent exists is a challenging task. In this paper, a single learning agent situated in an environment where multiple agents exist is modeled based on reinforcement learning. The environment is non-stationary and partially accessible from an agent's point of view. Therefore, the learning activities of an agent are influenced by the actions of other cooperative or competitive agents in the environment. A prey-hunter capture game that has the above characteristics is defined and used in experiments to simulate the learning process of individual agents. Experimental results show that there are no strict rules for reinforcement learning. We suggest two new methods to improve the performance of agents. These methods decrease the number of states while keeping as much state information as necessary.


An agent-society of the future is envisioned to be as complex as a human society. Just like human societies, such multiagent systems (MAS) deserve an in-depth study of the dynamics, relationships, and interactions of the constituent agents. An agent in a MAS may have only approximate a priori estimates of the trustworthiness of another agent. But it can learn from interactions with other agents, resulting in more accurate models of these agents and their dependencies together with the influences of other environmental factors. Such models are proposed to be represented as Bayesian or belief networks. An objective mechanism is presented to enable an agent to elicit crucial information from the environment regarding the true nature of the other agents. This mechanism allows the modeling agent to choose actions that will produce guaranteed minimal improvement of the model accuracy. The working of the proposed maximin entropy procedure is demonstrated in a multiagent scenario.

Congestion games offer a perfect environment in which to study the impact of local decisions on global utilities in multiagent systems. What is particularly interesting in such problems is that no individual action is intrinsically “good” or “bad” but that combinations of actions lead to desirable or undesirable outcomes. As a consequence, agents need to learn how to coordinate their actions with those of other agents, rather than learn a particular set of “good” actions. A congestion game can be studied from two different perspectives: (i) from the top down, where a global utility (e.g., a system-centric view of congestion) specifies the task to be achieved; or (ii) from the bottom up, where each agent has its own intrinsic utility it wants to maximize. In many cases, these two approaches are at odds with one another, where agents aiming to maximize their intrinsic utilities lead to poor values of a system level utility. In this paper we extend results on difference utilities, a form of shaped utility that enables multiagent learning in congested, noisy conditions, to study the global behavior that arises from the agents’ choices in two types of congestion games. Our key result is that agents that aim to maximize a modified version of their own intrinsic utilities not only perform well in terms of the global utility, but also, on average perform better with respect to their own original utilities. In addition, we show that difference utilities are robust to agents “defecting” and using their own intrinsic utilities, and that performance degrades gracefully with the number of defectors.
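
A difference utility, as referred to above, is commonly defined as the global utility minus the global utility that would result if the agent in question were removed. The sketch below illustrates that definition only. The global utility used here (negative sum of squared resource loads) is a hypothetical stand-in chosen for simplicity, not the utility studied in the paper, and all names are illustrative.

```python
from collections import Counter


def global_utility(choices):
    """Hypothetical system utility: negative sum of squared loads per resource."""
    counts = Counter(choices)
    return -sum(n * n for n in counts.values())


def difference_utility(choices, agent):
    """Global utility with the agent present minus global utility without it."""
    without_agent = choices[:agent] + choices[agent + 1:]
    return global_utility(choices) - global_utility(without_agent)


# Three agents: two pick resource "A", one picks resource "B".
choices = ["A", "A", "B"]
for i in range(len(choices)):
    print(i, difference_utility(choices, i))
```

With this stand-in utility, the agents sharing the crowded resource receive a lower difference utility than the agent alone on its resource, which is the kind of signal that steers individual learners away from congestion.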

In order to behave intelligently, artificial agents must be able to deliberatively plan their future actions. Unfortunately, realistic agent environments are usually highly dynamic and only partially observable, which makes planning computationally hard. For most practical purposes this rules out planning techniques that account for all possible contingencies in the planning process. However, many agent environments permit an alternative approach, namely continual planning, i.e. the interleaving of planning with acting and sensing. This paper presents a new principled approach to continual planning that describes why and when an agent should switch between planning and acting. The resulting continual planning algorithm enables agents to deliberately postpone parts of their planning process and instead actively gather missing information that is relevant for the later refinement of the plan. To this end, the algorithm explicitly reasons about the knowledge (or lack thereof) of an agent and its sensory capabilities. These concepts are modelled in the planning language MAPL. Since in many environments the major reason for dynamism is the behaviour of other agents, MAPL can also model multiagent environments, common knowledge among agents, and communicative actions between them. For continual planning, MAPL introduces the concept of assertions, abstract actions that substitute for subplans that have not yet been formed. To evaluate our continual planning approach empirically we have developed MAPSIM, a simulation environment that automatically builds multiagent simulations from formal MAPL domains. Thus, agents can not only plan, but also execute their plans, perceive their environment, and interact with each other. Our experiments show that, using continual planning techniques, deliberate action planning can be used efficiently even in complex multiagent environments.

Agent Communication Languages (ACLs) have recently acquired a primary role in open multiagent systems, which need a standard communication framework shared by all interacting heterogeneous agents. According to the most important ACL standard proposals so far, agents are supposed to carry out the communication process by performing actions of a specific type, namely, communicative acts, whose semantics is defined in terms of the agents' mental states. Although it follows the mainstream guidelines inspired by Speech Act Theory, our work illustrates an alternative model of agent communication by shifting the focus from agents' mental states to their social state. Taking an existing communicative act library, we provide a semantics for a significant set of acts based on the concept of commitment, and prove that our approach tackles some issues that have not yet been dealt with effectively and that may have hindered the rise of a universally accepted ACL standard.

In a dynamic, multiagent environment, an automated intelligent agent is often faced with the possibility that other agents may instigate events that hinder or help the achievement of its own goals. To act intelligently in such an environment, an automated agent needs an event tracking capability to continually monitor the occurrence of such events and the temporal relationships among them. This capability enables an agent to infer the occurrence of important unobserved events as well as to obtain a better understanding of the interaction among events. This article focuses on event tracking in one complex and dynamic multiagent environment: the air-combat simulation environment. It analyzes the challenges that an automated pilot agent must face when tracking events in this environment. This analysis reveals three new issues that have not been addressed in previous work in this area: (i) tracking events generated by agents' flexible and reactive behaviors, (ii) tracking events in the context of continuous agent interactions, and (iii) tracking events in real time. This article proposes one solution to address these issues. One key idea in this solution is that the (architectural) mechanisms that an agent employs in generating its own flexible and reactive behaviors can be used to track other agents' flexible and reactive behaviors in real time. A second key idea is the use of a world-centered representation for modeling agent interactions. The solution is demonstrated using an implementation of an automated pilot agent.

We consider an autonomous agent facing a stochastic, partially observable, multiagent environment. In order to compute an optimal plan, the agent must accurately predict the actions of the other agents, since they influence the state of the environment and ultimately the agent’s utility. To do so, we propose a special case of interactive partially observable Markov decision process, in which the agent does not explicitly model the other agents’ beliefs and preferences, and instead represents them as stochastic processes implemented by probabilistic deterministic finite state controllers (PDFCs). The agent maintains a probability distribution over the PDFC models of the other agents, and updates this belief using Bayesian inference. Since the number of nodes of these PDFCs is unknown and unbounded, the agent places a Bayesian nonparametric prior distribution over the infinite-dimensional set of PDFCs. This allows the size of the learned models to adapt to the complexity of the observed behavior. Deriving the posterior distribution is in this case too complex to be amenable to analytical computation; therefore, we provide a Markov chain Monte Carlo algorithm that approximates the posterior beliefs over the other agents’ PDFCs, given a sequence of (possibly imperfect) observations about their behavior. Experimental results show that the learned models converge behaviorally to the true ones. We consider two settings, one in which the agent first learns, then interacts with other agents, and one in which learning and planning are interleaved. We show that the agent’s performance increases as a result of learning in both situations. Moreover, we analyze the dynamics that ensue when two agents are simultaneously learning about each other while interacting, showing in an example environment that coordination emerges naturally from our approach. Furthermore, we demonstrate how an agent can exploit the learned models to perform indirect inference over the state of the environment via the modeled agent’s actions.
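
The core step of maintaining a belief over models of another agent and updating it with Bayesian inference can be illustrated on a drastically simplified case. The sketch below is not the method of the paper: it assumes a small, fixed, finite set of candidate models, each a memoryless distribution over actions, so the posterior can be computed exactly and no nonparametric prior or Markov chain Monte Carlo sampling is needed. All names and probabilities are hypothetical.

```python
def update_belief(belief, models, observed_action):
    """One Bayesian update: posterior is proportional to prior times likelihood.

    `belief` maps model name -> prior probability.
    `models` maps model name -> {action: probability of that action}.
    """
    unnormalized = {
        name: belief[name] * models[name].get(observed_action, 0.0)
        for name in belief
    }
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under every model")
    return {name: p / total for name, p in unnormalized.items()}


# Two hypothetical candidate models of the other agent.
models = {
    "mostly_left": {"left": 0.9, "right": 0.1},
    "uniform": {"left": 0.5, "right": 0.5},
}
belief = {"mostly_left": 0.5, "uniform": 0.5}
for action in ["left", "left", "right"]:
    belief = update_belief(belief, models, action)
print(belief)
```

The belief shifts toward whichever candidate better explains the observed action sequence; the paper's contribution lies in doing this when the model space is unbounded, which this finite sketch deliberately sidesteps.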

Computer science in general, and artificial intelligence and multiagent systems in particular, are part of an effort to build intelligent transportation systems. An efficient use of the existing infrastructure relates closely to multiagent systems as many problems in traffic management and control are inherently distributed. In particular, traffic signal controllers located at intersections can be seen as autonomous agents. However, challenging issues are involved in this kind of modeling: the number of agents is high; in general agents must be highly adaptive; they must react to changes in the environment at individual level while also causing an unpredictable collective pattern, as they act in a highly coupled environment. Therefore, traffic signal control poses many challenges for standard techniques from multiagent systems such as learning. Despite the progress in multiagent reinforcement learning via formalisms based on stochastic games, these cannot cope with a high number of agents due to the combinatorial explosion in the number of joint actions. One possible way to reduce the complexity of the problem is to have agents organized in groups of limited size so that the number of joint actions is reduced. These groups are then coordinated by another agent, a tutor or supervisor. Thus, this paper investigates the task of multiagent reinforcement learning for control of traffic signals in two situations: agents act individually (individual learners) and agents can be “tutored”, meaning that another agent with a broader sight will recommend a joint action.

Multi-agent learning (MAL) studies how agents learn to behave optimally and adaptively from their experience when interacting with other agents in dynamic environments. The outcome of a MAL process is jointly determined by all agents’ decision-making. Hence, each agent needs to think strategically about others’ sequential moves when planning future actions. The strategic interactions among agents make MAL go beyond the direct extension of single-agent learning to multiple agents. With this strategic thinking, each agent aims to build a subjective model of others’ decision-making using its observations. Such modeling is directly influenced by agents’ perception during the learning process, which is called the information structure of the agent’s learning. As it determines the input to MAL processes, information structures play a significant role in the learning mechanisms of the agents. This review creates a taxonomy of MAL and establishes a unified and systematic way to understand MAL from the perspective of information structures. We define three fundamental components of MAL: the information structure (i.e., what the agent can observe), the belief generation (i.e., how the agent forms a belief about others based on the observations), as well as the policy generation (i.e., how the agent generates its policy based on its belief). In addition, this taxonomy enables the classification of a wide range of state-of-the-art algorithms into four categories based on the belief-generation mechanisms of the opponents, including stationary, conjectured, calibrated, and sophisticated opponents. We introduce Value of Information (VoI) as a metric to quantify the impact of different information structures on MAL. Finally, we discuss the strengths and limitations of algorithms from different categories and point to promising avenues of future research.

Bacteria, bees, and birds often work together in groups to find food. A group of mobile wheeled robots can be designed to coordinate their activities to achieve a goal. Networked cooperative uninhabited air vehicles (UAVs) are being developed for commercial and military applications. In order for such multiagent systems to succeed it is often critical that they can both maintain cohesive behaviors and appropriately respond to environmental stimuli. In this paper, we characterize cohesiveness of discrete-time multiagent systems as a boundedness or stability property of the agents' position trajectories and use a Lyapunov approach to develop conditions under which local agent actions will lead to cohesive group behaviors even in the presence of i) an interagent "sensing topology" that constrains information flow, where by "information flow," we mean the sensing of positions and velocities of agents, ii) a random but bounded delay and "noise" in sensing other agents' positions and velocities, and iii) noise in sensing a resource profile that represents an environmental stimulus and quantifies the goal of the multiagent system. Simulations are used to illustrate the ideas for multivehicle systems and to make connections to synchronization of coupled oscillators.

Cooperation in learning (CL) can be realized in a multiagent system if agents are capable of learning from both their own experiments and other agents' knowledge and expertise. Compared with individual learning (IL), CL exploits the extra resources to achieve higher efficiency and faster learning. In the real world, however, implementation of CL is not a straightforward task, in part due to possible differences in area of expertise (AOE). In this paper, homogeneous reinforcement-learning agents are considered in an environment with multiple goals or tasks. As a result, they become expert in different domains with different amounts of expertness. Each agent uses a one-step Q-learning algorithm and is capable of exchanging its Q-table with those of its teammates. Two crucial questions are addressed in this paper: "How can the AOE of an agent be extracted?" and "How can agents improve their performance in CL by knowing their AOEs?" An algorithm is developed to extract the AOE based on state transitions as a gold standard from a behavioral point of view. Moreover, it is argued that the AOE can be implicitly obtained through agents' expertness at the state level. Three new methods for CL through the combination of Q-tables are developed and examined for overall performance after CL. The performances of the developed methods are compared with those of IL, strategy sharing (SS), and weighted SS (WSS). The results show the superior performance of AOE-based methods compared with existing CL methods, which do not use the notion of AOE. These results are very encouraging in support of the idea that "cooperation based on the AOE" performs better than the general CL methods.

There are numerous applications where a variety of human and software participants interactively pursue a given task (play a game, engage in a simulation, etc.). In this paper, we define a basic architecture for a distributed, interactive system (DIS for short). We then formally define a mathematical construct called a DIS abstraction that provides a theoretical basis for a software platform for building distributed interactive systems. Our framework provides a language for building multiagent applications where each agent has its own behaviors and where the behavior of the multiagent application as a whole is governed by one or more “master” agents. Agents in such a multiagent application may compete for resources, may attempt to take actions based on incorrect beliefs, may attempt to take actions that conflict with actions being concurrently attempted by other agents, and so on. Master agents mediate such conflicts. Our language for building agents (ordinary and master) depends critically on a notion called a “generalized constraint” that we define. All agents attempt to optimize an objective function while satisfying such generalized constraints that the agent is bound to preserve. We develop several algorithms to determine how an agent satisfies its generalized constraints in response to events in the multiagent application. We experimentally evaluate these algorithms in an attempt to understand their advantages and disadvantages. This revised version was published online in June 2006 with corrections to the Cover Date.

This paper analyzes the emergent behaviors of pedestrian groups that learn through the multiagent reinforcement learning model developed in our group. Five scenarios studied in the pedestrian model literature, with different levels of complexity, were simulated in order to analyze the robustness and the scalability of the model. Firstly, a reduced group of agents must learn by interaction with the environment in each scenario. In this phase, each agent learns its own kinematic controller, which will drive it at simulation time. Secondly, the number of simulated agents is increased in each scenario where agents have previously learnt, to test the appearance of emergent macroscopic behaviors without additional learning. This strategy allows us to evaluate the robustness, consistency, and quality of the learned behaviors. For this purpose several tools from pedestrian dynamics, such as fundamental diagrams and density maps, are used. The results reveal that the developed model is capable of simulating human-like micro and macro pedestrian behaviors for the simulation scenarios studied, including those where the number of pedestrians has been scaled by one order of magnitude with respect to the situation learned.

Existing reinforcement learning methods suffer seriously from the curse of dimensionality, especially when they are applied to multiagent dynamic environments. One typical example is RoboCup competitions, since other agents and their behavior easily cause state and action space variations. This paper presents a method of modular learning in a multiagent environment by which the learning agent can acquire cooperative behavior with its teammates and competitive behavior against its opponents. The key ideas to resolve the issue are as follows. First, a two-layer hierarchical system with multiple learning modules is adopted to reduce the size of the sensor and action spaces. The state space of the top layer consists of the state values from the lower level, and macro actions are used to reduce the size of the physical action space. Second, the state of each other agent, that is, how close it is to its own goal, is estimated by observation and used as a state variable in the top-layer state space to realize the cooperative/competitive behavior. The method is applied to a four (defense team)-on-five (offense team) game task, and the learning agent (a passer of the offense team) successfully acquired the teamwork plays (pass and shoot) within a much shorter learning time.

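Research on Multi-Agent Cooperation Based on Reinforcement Learning   Total citations: 2, self-citations: 0, citations by others: 2
Reinforcement learning provides a robust learning method for cooperation among multiple agents. This paper first introduces the principles and constituent elements of reinforcement learning, then describes the multiagent Markov decision process (MMDP) and presents a reinforcement learning model for agents. On this basis, it compares the two modes of reinforcement learning that arise in multi-agent cooperation: IL (independent learning) and JAL (joint action learning). Finally, it analyzes several coordination mechanisms commonly used in cooperative multi-agent systems when multiple optimal policies exist.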
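
The difference between independent learning (IL) and joint action learning (JAL) comes down to what the value table is indexed by: an independent learner keys its estimates on its own action only, while a joint action learner keys them on the actions of all agents. The following minimal sketch assumes a single-state repeated cooperative game with a shared reward; the learning rate, action names, and rewards are hypothetical and not taken from the paper.

```python
from collections import defaultdict

ALPHA = 0.5  # learning rate (hypothetical value)


def q_update(q, key, reward):
    """Stateless Q-learning update: move the estimate toward the reward."""
    q[key] += ALPHA * (reward - q[key])


il_q = defaultdict(float)   # keyed by the agent's own action
jal_q = defaultdict(float)  # keyed by the joint action (own, other)

# Two hypothetical rounds: (own action, other's action, shared reward).
experience = [("a", "a", 10.0), ("a", "b", 0.0)]
for own, other, reward in experience:
    q_update(il_q, own, reward)
    q_update(jal_q, (own, other), reward)

print(dict(il_q))
print(dict(jal_q))
```

The independent learner blends both rounds into one estimate for its action "a", whereas the joint action learner keeps the two outcomes apart, which is why JAL can distinguish coordinated from miscoordinated play at the cost of a table that grows with the number of joint actions.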

