Similar Documents
20 similar documents retrieved.
1.
Communication is an important resource for multiagent coordination. Interactive Dynamic Influence Diagrams (I-DIDs) have been used extensively in multiagent planning under uncertainty, and they are a recognized graphical counterpart of Interactive Partially Observable Markov Decision Processes (I-POMDPs). We establish a communication model among multiple agents based on the I-DID framework. We use the AND-communication method, assuming a separate communication and action phase in each step rather than replacing domain actions, so that communication facilitates better domain-action selection. We use a synchronized communication type: when an agent initiates communication, all of the agent's teammates synchronize to share their recent observations. We give a general algorithm that computes the communication decision from a single-agent perspective by comparing expected rewards with and without communication. Finally, we use the multiagent "tiger" and "concert" problems to validate the model's effectiveness.
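The decision rule described above, communicate only when the expected reward after a synchronizing broadcast, net of the communication cost, beats the expected reward of staying silent, can be sketched compactly. A toy single-agent version in Python; the tiger-style payoffs and the `refinements` interface are illustrative stand-ins, not taken from the paper:

```python
# Toy version of the single-agent communication test:
# communicate iff E[reward | communicate] - cost > E[reward | stay silent].
# All numbers below are illustrative, not from the paper.

def best_expected_reward(belief, reward):
    """Best myopic expected reward over actions, given a belief over states."""
    return max(
        sum(belief[s] * reward[(s, a)] for s in belief)
        for a in {a for (_, a) in reward}
    )

def should_communicate(belief, refinements, reward, comm_cost):
    """belief: P(state); refinements: list of (posterior_belief, prob) pairs
    describing what synchronized observation-sharing would reveal."""
    v_silent = best_expected_reward(belief, reward)
    v_comm = sum(p * best_expected_reward(b, reward) for b, p in refinements)
    return v_comm - comm_cost > v_silent

# Example: tiger behind left/right door; teammates' observations resolve it.
reward = {('L', 'open_L'): -100, ('L', 'open_R'): 10,
          ('R', 'open_L'): 10, ('R', 'open_R'): -100}
prior = {'L': 0.5, 'R': 0.5}
refined = [({'L': 0.85, 'R': 0.15}, 0.5), ({'L': 0.15, 'R': 0.85}, 0.5)]
print(should_communicate(prior, refined, reward, comm_cost=2.0))  # True
```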

2.
Colonies of learning automata (cited 10 times: 0 self-citations, 10 by others)
Originally, learning automata (LAs) were introduced to describe human behavior from both a biological and psychological point of view. In this paper, we show that a set of interconnected LAs is also able to describe the behavior of an ant colony, capable of finding the shortest path from its nest to food sources and back. The field of ant colony optimization (ACO) models ant colony behavior using artificial ant algorithms. These algorithms find applications in a whole range of optimization problems and have been experimentally shown to work very well. It turns out that a known model of interconnected LAs, used to control Markovian decision problems (MDPs) in a decentralized fashion, matches perfectly with these ant algorithms. The field of LAs can thus both contribute to the understanding of why ant algorithms work so well and become an important theoretical tool for learning in multiagent systems (MAS) in general. To illustrate this, we give an example of how LAs can be used directly in common Markov game problems.
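The match between ant-path reinforcement and learning automata is easy to demonstrate with the classic linear reward-inaction scheme on a two-path nest-to-food problem. A toy sketch, with invented path lengths and learning rate; shorter paths return reward more often, so their selection probability grows:

```python
import random

# Linear reward-inaction (L_R-I) automaton choosing between two paths.
# Shorter paths yield reinforcement more often, so their action
# probability grows -- the same positive feedback ant algorithms exploit.

def l_ri_two_paths(lengths=(2.0, 5.0), lr=0.05, steps=3000, seed=0):
    rng = random.Random(seed)
    p = [0.5, 0.5]                      # action probabilities
    for _ in range(steps):
        a = 0 if rng.random() < p[0] else 1
        # Reward with probability inversely proportional to path length.
        if rng.random() < min(lengths) / lengths[a]:
            p[a] += lr * (1.0 - p[a])   # reinforce the chosen action
            p[1 - a] *= (1.0 - lr)      # keeps p summing to 1
    return p

print(l_ri_two_paths())  # p[0] -> ~1.0: the shorter path wins
```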

3.
Fujita H, Ishii S. Neural Computation, 2007, 19(11): 3051-3087.
Games constitute a challenging domain of reinforcement learning (RL) for acquiring strategies because many of them include multiple players and many unobservable variables in a large state space. The difficulty of solving such realistic multiagent problems with partial observability arises mainly from the fact that the computational cost of estimation and prediction over the whole state space, including unobservable variables, is too heavy. To overcome this intractability and enable an agent to learn in an unknown environment, an effective approximation method is required, with explicit learning of the environmental model. We present a model-based RL scheme for large-scale multiagent problems with partial observability and apply it to the card game Hearts. This game is a well-defined example of an imperfect-information game and can be approximately formulated as a partially observable Markov decision process (POMDP) for a single learning agent. To reduce the computational cost, we use a sampling technique in which the heavy integration required for estimation and prediction can be approximated by a plausible number of samples. Computer simulation results show that our method is effective in solving such a difficult, partially observable multiagent problem.
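The sampling technique the abstract refers to, replacing the intractable belief-state integration with a plausible number of samples, is in the spirit of a bootstrap particle filter. A generic sketch under that assumption (not the authors' exact estimator); the transition and observation models are passed in as stand-ins:

```python
import random

def particle_belief_update(particles, action, observation,
                           transition_sample, obs_likelihood, rng=random):
    """One POMDP belief update with sequential importance resampling.

    particles: list of hidden-state samples approximating the belief.
    transition_sample(s, a) -> s'      (generative model)
    obs_likelihood(o, s', a) -> P(o | s', a)
    """
    # Propagate each particle through the generative transition model.
    proposed = [transition_sample(s, action) for s in particles]
    # Weight by how well each proposed state explains the observation.
    weights = [obs_likelihood(observation, s, action) for s in proposed]
    total = sum(weights)
    if total == 0:                       # observation ruled out all samples
        return proposed                  # degenerate case: keep the prior
    # Resample in proportion to the weights (bootstrap filter).
    return rng.choices(proposed, weights=weights, k=len(particles))
```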

4.
We exhibit an important property called the asymptotic equipartition property (AEP) on empirical sequences in an ergodic multiagent Markov decision process (MDP). Using the AEP, which facilitates the analysis of multiagent learning, we give a statistical property of multiagent learning, such as reinforcement learning (RL), near the end of the learning process. We examine the effect of the conditions among the agents on the achievement of a cooperative policy in three different cases: blind, visible, and communicable. We also derive a bound on the speed with which the empirical sequence converges to the best sequence in probability, so that multiagent learning yields the best cooperative result.
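For reference, the AEP invoked here is, in its textbook form for an ergodic source (the paper extends it to empirical sequences of an ergodic multiagent MDP):

```latex
% Asymptotic equipartition property (Shannon-McMillan-Breiman theorem):
% for an ergodic process (X_t) with entropy rate H,
-\tfrac{1}{n}\log p(X_1,\dots,X_n) \xrightarrow{\;n\to\infty\;} H
\quad\text{almost surely,}
% so each typical length-n sequence has probability close to 2^{-nH},
% and the typical set contains roughly 2^{nH} sequences.
```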

5.
A reinforcement learning agent solves decision problems by learning an optimal policy that maps states to actions; reinforcement learning methods improve the agent's behavior through trial-and-error interaction with its environment. The Markov decision process (MDP) model is the general framework for reinforcement learning problems, and dynamic programming supplies policy-dependent value-function learning algorithms for agents in Markovian environments. However, the agent must store the entire value function during learning, and this memory requirement grows enormously with the size of the state space. This paper proposes a forgetting algorithm for dynamic-programming-based reinforcement learning: by importing basic principles of forgetting from memory psychology into value-function learning, it derives an effective dynamic-programming approach to reinforcement learning, the Forget-DP algorithm.
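The abstract states the motivation (the value table grows with the state space) but not the update rule itself. One plausible reading, offered purely as an illustration and not as the paper's Forget-DP: standard asynchronous backups plus an Ebbinghaus-style decay that evicts entries whose memory strength has faded. All parameters below are invented:

```python
# Hypothetical "value table with forgetting": asynchronous DP backups,
# but entries decay unless refreshed, and stale entries are evicted to
# bound memory. Illustrative only; not the paper's Forget-DP.

class ForgettingValues:
    def __init__(self, decay=0.999, evict_below=1e-3, default=0.0):
        self.v, self.strength = {}, {}
        self.decay, self.evict_below, self.default = decay, evict_below, default

    def get(self, s):
        return self.v.get(s, self.default)

    def backup(self, s, actions, trans, reward, gamma=0.95):
        """One Bellman backup at state s; refreshes s's memory strength."""
        self.v[s] = max(
            sum(p * (reward(s, a, s2) + gamma * self.get(s2))
                for s2, p in trans(s, a))
            for a in actions(s)
        )
        self.strength[s] = 1.0

    def forget_step(self):
        """Decay all memory strengths; evict entries that faded out."""
        for s in list(self.v):
            self.strength[s] *= self.decay
            if self.strength[s] < self.evict_below:
                del self.v[s], self.strength[s]
```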

6.
The decentralized partially observable Markov decision process (DEC-POMDP) is an approach to modeling multi-robot decision-making problems under uncertainty. Since the problem is NEXP-complete, there is no efficient exact algorithm, and despite the attention it has received recently, so far only a few approximate methods that can solve small problems have been proposed. In this study, we offer a novel approximate solution algorithm for DEC-POMDP problems using evolution strategies, together with a novel approach to approximately computing the fitness of the chromosomes, which corresponds to the expected reward. We also propose and solve a new problem, a more complex, modified version of the grid meeting problem. Our results show that our algorithm is scalable and that we can solve problems with more states than the problems attempted in previous studies.
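The evolution-strategy layer can be stated independently of the DEC-POMDP details: perturb policy parameters, approximate each candidate's fitness by averaging a few simulated episode returns (echoing the abstract's approximate fitness computation), and keep the best. A generic (mu, lambda)-ES sketch; the encoding and hyperparameters are stand-ins, not the paper's:

```python
import random

def evolve_policy(init_params, rollout_return, generations=100,
                  mu=5, lam=20, sigma=0.1, episodes=8, seed=0):
    """(mu, lambda) evolution strategy over real-valued policy parameters.

    rollout_return(params) -> one sampled episode return; a candidate's
    fitness is approximated by averaging a few sampled returns.
    """
    rng = random.Random(seed)
    parents = [list(init_params) for _ in range(mu)]
    for _ in range(generations):
        offspring = []
        for _ in range(lam):
            base = rng.choice(parents)
            child = [x + rng.gauss(0.0, sigma) for x in base]   # mutate
            fitness = sum(rollout_return(child)
                          for _ in range(episodes)) / episodes  # MC fitness
            offspring.append((fitness, child))
        offspring.sort(key=lambda fc: fc[0], reverse=True)
        parents = [c for _, c in offspring[:mu]]                # select
    return parents[0]
```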

7.
Two abstraction modes for Markov decision processes (cited 2 times: 1 self-citation, 1 by others)
The introduction of Markov decision processes at abstraction levels lets complex MDPs be expressed concisely and declaratively, addressing the large state-space representation problem that conventional MDPs encounter in practice. This paper introduces the basic concepts of the two types of abstract MDPs, structural and aggregate, together with exact and approximate optimal-policy algorithms for typical abstract MDPs, including one algorithm fundamentally different from conventional MDP methods: extending the Bellman equation to the abstract state space. It also summarizes the research history of these models and offers some prospects for their development, giving the reader a thorough, comprehensive, and focused understanding of them.

8.
Probabilistic systems of interacting nondeterministic intelligent agents are considered. The states of agents in such systems are probabilistic databases (of facts), and their actions are controlled by probabilistic logic programs. In addition, the communication channels between agents are also probabilistic. It is shown how such systems can be transformed in polynomial time into equivalent finite Markov decision processes. This makes it possible to transfer the known results on the verification of dynamic properties of finite Markov processes to probabilistic multiagent systems of the considered type.

9.
Multiagent Q-learning based on role-specific contexts (cited 1 time: 0 self-citations, 1 by others)
One of the main problems in cooperative multiagent learning is that the joint action space grows exponentially with the number of agents. In this paper, we exploit a sparse representation of the coordination dependencies between agents, employing roles and context-specific coordination graphs to reduce the joint action space. In our framework, the global joint Q-function is decomposed into a number of local Q-functions. Each local Q-function is shared among a small group of agents and is composed of a set of value rules. We propose a novel multiagent Q-learning algorithm that learns the weights in each value rule automatically. We give empirical evidence that our learning algorithm converges to the same optimal policy significantly faster than traditional multiagent learning techniques.
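The decomposition described here writes the global Q-function as a sum of local Q-functions, each a weighted collection of context-dependent value rules. A minimal sketch of that representation and of a Q-learning-style update on rule weights; the rule format and the equal split of the TD error are illustrative assumptions:

```python
# A value rule fires when its context <state predicate, partial joint action>
# holds; the global Q(s, a) is the sum of weights of all firing rules.
# Learning adjusts only the weights of rules active at the visited pair.

def q_value(rules, state, joint_action):
    return sum(w for (cond, w) in rules if cond(state, joint_action))

def rule_q_update(rules, s, a, reward, s_next, best_next_q,
                  alpha=0.1, gamma=0.95):
    """Distribute the TD error equally over the rules that fired at (s, a)."""
    active = [i for i, (cond, _) in enumerate(rules) if cond(s, a)]
    if not active:
        return
    td = reward + gamma * best_next_q(s_next) - q_value(rules, s, a)
    for i in active:
        cond, w = rules[i]
        rules[i] = (cond, w + alpha * td / len(active))
```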

10.
A generalization of the Hypercube queueing model for exponential queueing systems is presented which allows for distinguishable servers and multiple types of customers. Given costs associated with each server-customer pair, the determination of the assignment policy which minimizes time-averaged costs is formulated as a Markov decision problem. A characterization of optimal policies is obtained and used in an efficient algorithm for determining the optimum. The algorithm combines the method of successive approximations and “Howard's method” in a manner which is particularly applicable to Markov decision problems having large, sparse transition matrices.

11.
This paper deals with preference representation on combinatorial domains and preference-based recommendation in the context of multicriteria or multiagent decision making. The alternatives of the decision problem are seen as elements of a product set of attributes and preferences over solutions are represented by generalized additive decomposable (GAI) utility functions modeling individual preferences or criteria. Thanks to decomposability, utility vectors attached to solutions can be compiled into a graphical structure closely related to junction trees, the so-called GAI network. Using this structure, we present preference-based search algorithms for multicriteria or multiagent decision making. Although such models are often non-decomposable over attributes, we actually show that GAI networks are still useful to determine the most preferred alternatives provided preferences are compatible with Pareto dominance. We first present two algorithms for the determination of Pareto-optimal elements. Then the second of these algorithms is adapted so as to directly focus on the preferred solutions. We also provide results of numerical tests showing the practical efficiency of our procedures in various contexts such as compromise search and fair optimization in multicriteria or multiagent problems.
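The first stage, keeping only Pareto-nondominated utility vectors, is a small self-contained routine. A straightforward quadratic-time sketch (the paper's algorithms exploit the GAI structure rather than brute-force enumeration):

```python
def dominates(u, v):
    """u Pareto-dominates v: at least as good everywhere, better somewhere."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))

def pareto_front(vectors):
    """Keep the utility vectors that no other vector dominates."""
    return [u for u in vectors
            if not any(dominates(v, u) for v in vectors if v is not u)]

# Example: three criteria, each to be maximized.
print(pareto_front([(3, 1, 2), (2, 2, 2), (3, 1, 1), (1, 3, 3)]))
# -> [(3, 1, 2), (2, 2, 2), (1, 3, 3)]
```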

12.
This paper studies reinforcement learning with the Markov game model as the learning framework, and proposes a learning model and concrete algorithms for the class of complex problems exemplified by decision making in RoboCup simulation soccer. In experiments, goalkeeper decision making was implemented successfully and performed well, demonstrating the feasibility and effectiveness of the algorithm.

13.
A recursive associative-value Q-learning algorithm for optimal cost based on finite samples (cited 4 times: 2 self-citations, 4 by others)
A reinforcement learning agent solves decision problems by learning an optimal policy that maps states to actions. Optimal decisions can generally be obtained in two ways: by maximizing reward or by minimizing cost. This paper derives a new Q-learning algorithm from the optimal cost function. Q-learning is an effective reinforcement learning method for Markov decision problems with incomplete information; starting from the optimal cost function, the paper gives an associative-value recursion algorithm for Q-learning, a construction that allows many conclusions from dynamic programming (DP) to be applied directly to the study of Q-learning.
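The cost-minimizing formulation mentioned above replaces the usual max of reward-based Q-learning with a min over successor action values. A standard sketch of that basic variant (the paper's associative-value recursion refines it); the environment interface is an assumption:

```python
import random
from collections import defaultdict

def cost_q_learning(env, episodes=500, alpha=0.1, gamma=0.95,
                    epsilon=0.1, seed=0):
    """Tabular Q-learning on costs: Q(s, a) estimates expected discounted
    cost, and the greedy policy takes argmin rather than argmax.

    env must expose: reset() -> s, actions(s), step(s, a) -> (cost, s', done).
    """
    rng = random.Random(seed)
    Q = defaultdict(float)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            acts = env.actions(s)
            a = (rng.choice(acts) if rng.random() < epsilon
                 else min(acts, key=lambda a: Q[(s, a)]))  # greedy = argmin
            cost, s2, done = env.step(s, a)
            target = cost if done else cost + gamma * min(
                Q[(s2, b)] for b in env.actions(s2))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```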

14.
We discuss the solution of complex multistage decision problems using methods based on the idea of policy iteration (PI), i.e., start from some base policy and generate an improved policy. Rollout is the simplest method of this type, where just one improved policy is generated. We can view PI as repeated application of rollout, where the rollout policy at each iteration serves as the base policy for the next iteration. In contrast with PI, rollout has a robustness property: it can be applied on-line and is suitable for on-line replanning. Moreover, rollout can use as base policy one of the policies produced by PI, thereby improving on that policy. This is the type of scheme underlying the prominently successful Alpha Zero chess program. In this paper we focus on rollout and PI-like methods for problems where the control consists of multiple components, each selected (conceptually) by a separate agent. This is the class of multiagent problems where the agents have a shared objective function and shared, perfect state information. Based on a problem reformulation that trades off control space complexity with state space complexity, we develop an approach whereby at every stage the agents sequentially (one at a time) execute a local rollout algorithm that uses a base policy together with some coordinating information from the other agents. The amount of total computation required at every stage grows linearly with the number of agents. By contrast, in the standard rollout algorithm the amount of total computation grows exponentially with the number of agents. Despite the dramatic reduction in required computation, we show that our multiagent rollout algorithm has the fundamental cost improvement property of standard rollout: it guarantees improved performance relative to the base policy. We also discuss autonomous multiagent rollout schemes that allow the agents to make decisions autonomously through the use of precomputed signaling information, which is sufficient to maintain the cost improvement property without any on-line coordination of control selection between the agents. For discounted and other infinite horizon problems, we also consider exact and approximate PI algorithms involving a new type of one-agent-at-a-time policy improvement operation. For one of our PI algorithms, we prove convergence to an agent-by-agent optimal policy, thus establishing a connection with the theory of teams. For another PI algorithm, executed over a more complex state space, we prove convergence to an optimal policy. Approximate forms of these algorithms are also given, based on the use of policy and value neural networks. These PI algorithms, in both their exact and approximate forms, are strictly off-line methods, but they can be used to provide a base policy for use in an on-line multiagent rollout scheme.
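The computational contrast at the heart of the abstract, per-stage work linear rather than exponential in the number of agents, is visible directly in the control loop. A schematic sketch; `q_value` and `q_joint` stand for simulations of the base policy from the resulting state and are assumptions, not the paper's code:

```python
import itertools

def multiagent_rollout_step(state, agent_action_sets, q_value):
    """One-agent-at-a-time rollout: agent i optimizes its own component
    while agents < i use their already-chosen actions and agents > i are
    fixed to the base policy inside q_value. Cost: sum of |A_i| evaluations.
    """
    chosen = []
    for i, actions in enumerate(agent_action_sets):
        best = min(actions,
                   key=lambda a: q_value(state, tuple(chosen) + (a,), i))
        chosen.append(best)
    return tuple(chosen)

def standard_rollout_step(state, agent_action_sets, q_joint):
    """Standard rollout for comparison: evaluates every joint action,
    i.e. prod(|A_i|) evaluations -- exponential in the number of agents."""
    return min(itertools.product(*agent_action_sets),
               key=lambda joint: q_joint(state, joint))
```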

15.
郭晓东, 郝思达, 王丽芳. 计算机应用研究 (Application Research of Computers), 2023, 40(9): 2803-2807, 2814.
Vehicular edge computing allows vehicles to offload computation tasks to edge servers, meeting vehicles' explosively growing demand for computing resources. How to make offloading decisions and allocate computing resources, however, remains a key open problem; moreover, task offloading by moving vehicles over continuous time is rarely addressed, and in particular the randomness of vehicle task arrivals is insufficiently considered. To address these issues, a dynamic vehicular edge computing model is built and described as a Markov decision process with a seven-dimensional state space and a two-dimensional action space, and a distributed deep reinforcement learning model is constructed to solve it. In addition, to counter the poor performance caused by the mixed discrete-continuous decision problem, the input layer is nested with a first-stage decision network, yielding a staged-decision deep reinforcement learning algorithm. Simulation results show that, compared with the baseline algorithms, the proposed algorithm keeps energy consumption low and has clear advantages in task completion rate, latency, and reward, providing an effective solution to the offloading-decision and computing-resource-allocation problem in vehicular edge computing.

16.
A unified NDP method based on TD(0) learning for MDPs under average and discounted criteria (cited 3 times: 0 self-citations, 3 by others)
To meet the needs of practical large-scale Markov systems, this paper discusses simulation-based learning optimization for Markov decision processes (MDPs). Starting from the definition, a unified temporal-difference formula for the performance potential under the average and discounted criteria is established, and a neural network is used to represent the estimated potential, from which a parametric TD(0) learning formula and algorithm are derived for approximate policy evaluation. Then, based on the approximated potentials, approximate policy iteration realizes a unified neuro-dynamic programming (NDP) optimization method for the two criteria. The results also apply to semi-Markov decision processes. A numerical example shows that the neural policy iteration algorithm works under both criteria, and verifies that the average-criterion problem is the limiting case of the discounted problem as the discount factor tends to zero.
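The unified TD(0) recursion can be sketched with a linear approximator standing in for the paper's neural network: under the discounted criterion the temporal difference discounts the successor potential, while under the average criterion it subtracts a learned average-reward estimate. A schematic with invented feature and step-size choices:

```python
def td0_potential_step(w, features, s, r, s2, eta, criterion,
                       gamma=0.95, rho=None, rho_lr=0.01):
    """One TD(0) update of the potential estimate h(s) = w . phi(s).

    criterion='discounted': delta = r + gamma*h(s2) - h(s)
    criterion='average'   : delta = r - rho + h(s2) - h(s), with rho a
    learned estimate of the average reward (returned updated).
    """
    phi, phi2 = features(s), features(s2)
    h = sum(wi * x for wi, x in zip(w, phi))
    h2 = sum(wi * x for wi, x in zip(w, phi2))
    if criterion == 'discounted':
        delta = r + gamma * h2 - h
    else:
        delta = r - rho + h2 - h
        rho += rho_lr * delta                 # track the average reward
    for i, x in enumerate(phi):               # semi-gradient step
        w[i] += eta * delta * x
    return w, rho
```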

17.
A reinforcement learning agent solves decision problems by learning an optimal policy that maps states to actions; reinforcement learning methods improve the agent's behavior through trial-and-error interaction with its environment, and the Markov decision process (MDP) model is the general framework for such problems. This paper proposes a new algorithm that trades optimality for robustness, with emphasis on a family of approximation algorithms and their convergence results. By replacing the optimality operator max (or min) with a generalized mean operator, the two most important classes of reinforcement learning algorithms, dynamic programming and Q-learning, are studied and their convergence is discussed. The goal is to improve the robustness of reinforcement learning algorithms.
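The substitution described, a generalized mean in place of the hard max, can be shown inside a value-iteration backup. The sketch below uses the power (Holder) mean, which recovers max as p grows without bound; the specific choice of mean is mine for illustration, since the paper studies a class of such operators (note the positivity assumption):

```python
def power_mean(xs, p):
    """Holder (power) mean; approaches max(xs) as p -> infinity.
    Assumes positive values, e.g. rewards shifted to be positive."""
    return (sum(x ** p for x in xs) / len(xs)) ** (1.0 / p)

def smoothed_value_iteration(states, actions, trans, reward,
                             gamma=0.9, p=10.0, iters=200):
    """Value iteration with max replaced by a generalized mean.
    trans(s, a) -> list of (s2, prob); rewards assumed positive."""
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {s: power_mean(
                [sum(pr * (reward(s, a, s2) + gamma * V[s2])
                     for s2, pr in trans(s, a))
                 for a in actions(s)], p)
             for s in states}
    return V
```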

18.
The value iteration algorithm is a well-known technique for generating solutions to discounted Markov decision process (MDP) models. Although simple to implement, the approach is nevertheless limited in situations where many Markov decision processes must be solved, such as in real-time state-based control problems or in simulation/optimization problems, because of the potentially large number of iterations required for the value function to converge to an ε-optimal solution. Experimental results suggest, however, that the sequence of solution policies associated with each iteration of the algorithm converges much more rapidly than does the value function. This behavior has significant implications for designing solution approaches for MDPs, yet it has not been explicitly characterized in the literature, nor has it generated significant discussion. This paper seeks to generate such discussion by providing comparative empirical convergence results and exploring several predictors that allow estimation of policy convergence speed based on existing MDP parameters.
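The observation suggests a practical stopping rule: run value iteration but terminate once the greedy policy has stopped changing, typically well before the value residual reaches epsilon. A sketch of that heuristic test; policy stability over a few sweeps is evidence of convergence, not a certificate:

```python
def vi_until_policy_stable(states, actions, trans, reward,
                           gamma=0.95, stable_sweeps=3, max_iters=10_000):
    """Value iteration that stops when the greedy policy has been
    unchanged for `stable_sweeps` consecutive sweeps."""
    V = {s: 0.0 for s in states}
    prev_policy, streak = None, 0
    for it in range(max_iters):
        q = {s: {a: sum(p * (reward(s, a, s2) + gamma * V[s2])
                        for s2, p in trans(s, a))
                 for a in actions(s)} for s in states}
        V = {s: max(q[s].values()) for s in states}
        policy = {s: max(q[s], key=q[s].get) for s in states}
        streak = streak + 1 if policy == prev_policy else 0
        if streak >= stable_sweeps:
            return policy, V, it + 1
        prev_policy = policy
    return prev_policy, V, max_iters
```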

19.
Planning methods based on Markov decision processes (MDPs) can handle many kinds of planning problems under uncertainty. Value iteration (VI) is the classic algorithm for solving MDPs, but VI must compute and update the value of every state, so the solution process is quite slow. Based on an analysis of the causal dependencies in the MDP's own state graph, this paper proposes an improved value iteration algorithm called sequential value iteration (SVI). It first decomposes an MDP into topologically ordered strongly connected components and then applies value iteration to each component in sequence; this avoids computation over large numbers of useless states and arranges the usable states into a topological sequence. Comparative experimental results demonstrate the algorithm's effectiveness and excellent performance.
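The decomposition SVI relies on is standard graph machinery: compute the strongly connected components of the state graph, order them topologically, and run value iteration on each component with downstream values already final. A compact sketch using Kosaraju's two-pass SCC algorithm; the MDP interface is an assumed stand-in:

```python
def solve_mdp_svi(states, actions, trans, reward, gamma=0.95, tol=1e-6):
    """Sequential value iteration: split the state graph into SCCs and
    solve them in reverse topological order, so each component's backups
    only read values that have already converged."""
    succ = {s: {s2 for a in actions(s) for s2, _ in trans(s, a)} for s in states}
    pred = {s: set() for s in states}
    for s, outs in succ.items():
        for s2 in outs:
            pred[s2].add(s)

    def dfs(start, edges, seen, out):
        """Iterative DFS appending nodes to `out` in postorder."""
        stack = [(start, iter(edges[start]))]
        seen.add(start)
        while stack:
            node, it = stack[-1]
            nxt = next((n for n in it if n not in seen), None)
            if nxt is None:
                stack.pop()
                out.append(node)
            else:
                seen.add(nxt)
                stack.append((nxt, iter(edges[nxt])))

    order, seen = [], set()                 # Kosaraju pass 1: finish order
    for s in states:
        if s not in seen:
            dfs(s, succ, seen, order)
    comps, seen = [], set()                 # pass 2 on reversed edges
    for s in reversed(order):
        if s not in seen:
            comp = []
            dfs(s, pred, seen, comp)
            comps.append(comp)              # emitted sources-first

    V = {s: 0.0 for s in states}
    for comp in reversed(comps):            # solve sink components first
        delta = tol + 1.0
        while delta > tol:                  # VI restricted to this component
            delta = 0.0
            for s in comp:
                v = max(sum(p * (reward(s, a, s2) + gamma * V[s2])
                            for s2, p in trans(s, a)) for a in actions(s))
                delta = max(delta, abs(v - V[s]))
                V[s] = v
    return V
```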

20.
Pan Yinghui, Tang Jing, Ma Biyang, Zeng Yifeng, Ming Zhong. Knowledge and Information Systems, 2021, 63(9): 2431-2453.
With the availability of a significant amount of data, data-driven decision making becomes an alternative way of solving complex multiagent decision problems....
