Similar Documents
19 similar documents retrieved.
1.
Combining game theory with multi-agent reinforcement learning into game-theoretic reinforcement learning has attracted growing attention, but existing algorithms suffer from high computational complexity and cannot guarantee a pure-strategy Nash equilibrium. The Meta Equilibrium Q-learning algorithm converts the original game into a meta game via reaction functions, and the meta equilibrium derived from the meta game is a pure-strategy Nash equilibrium. While guaranteeing a pure-strategy Nash equilibrium, the algorithm also ensures that each agent's return is no lower than a given threshold. In addition, a fractal-based equilibrium-degree evaluation model judges whether any state is a steady state by computing its fractal dimension and estimates the distance between any state and the equilibrium state; this model can be used to check the soundness and reasonableness of the meta equilibrium. The conclusions about the algorithm and the model are verified concretely in a welfare game and in a control-point capture combat scenario.

2.
To address the dimensionality explosion and sparse rewards that most multi-agent reinforcement learning algorithms face as the number of agents grows and the environment becomes dynamically unstable, a multi-agent hierarchical reinforcement learning skill-discovery algorithm based on weighted value function decomposition is proposed. First, the algorithm combines the centralized-training, decentralized-execution architecture with hierarchical reinforcement learning and uses weighted value function decomposition at the upper level to keep agents from ignoring the optimal policy in favor of suboptimal ones during training. Second, the lower level uses independent Q-learning so that agents can handle high-dimensional, complex tasks in a decentralized way in a multi-agent environment. Finally, a skill-discovery strategy is introduced on top of the lower-level independent Q-learning so that agents learn complementary skills from one another. The algorithm is compared with multi-agent reinforcement learning and hierarchical reinforcement learning baselines on two simulation platforms, a simple team-sport game and StarCraft II. The experiments show gains in reward and head-to-head win rate, improved decision-making for the whole multi-agent system, and faster convergence, verifying the feasibility of the algorithm.
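For illustration, here is a minimal sketch of the weighted value function decomposition idea described above, in which per-agent Q-values are combined into a joint value with non-negative, state-dependent weights. The network names, sizes, and toy forward pass are assumptions for exposition, not the paper's implementation.

```python
# Minimal sketch (not the paper's implementation) of weighted value function
# decomposition: each agent has its own Q-network, and a mixing step combines
# the chosen per-agent Q-values into a joint Q_tot with non-negative weights.
# Names (AgentQNet, WeightedMixer) and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class AgentQNet(nn.Module):
    """Independent per-agent Q-network over local observations."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions))

    def forward(self, obs):                 # obs: (batch, obs_dim)
        return self.net(obs)                # -> (batch, n_actions)

class WeightedMixer(nn.Module):
    """Combines per-agent Q-values with state-dependent non-negative weights."""
    def __init__(self, n_agents, state_dim, hidden=32):
        super().__init__()
        self.weight_net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_agents))

    def forward(self, agent_qs, state):     # agent_qs: (batch, n_agents)
        w = torch.abs(self.weight_net(state))           # non-negative weights keep Q_tot monotonic in each agent's Q
        return (w * agent_qs).sum(dim=1, keepdim=True)  # Q_tot: (batch, 1)

# Toy forward pass: 3 agents, chosen-action Q-values mixed into Q_tot.
n_agents, obs_dim, state_dim, n_actions = 3, 8, 12, 5
agents = [AgentQNet(obs_dim, n_actions) for _ in range(n_agents)]
mixer = WeightedMixer(n_agents, state_dim)
obs = torch.randn(4, n_agents, obs_dim)
state = torch.randn(4, state_dim)
chosen = torch.randint(0, n_actions, (4, n_agents))
qs = torch.stack([agents[i](obs[:, i]).gather(1, chosen[:, i:i + 1]).squeeze(1)
                  for i in range(n_agents)], dim=1)
q_tot = mixer(qs, state)
print(q_tot.shape)   # torch.Size([4, 1]); in training, q_tot would be fit to a standard TD target
```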

3.
In intelligent game-adversarial scenarios, multi-agent reinforcement learning suffers from non-stationarity: an agent's policy depends not only on the environment but also on the opponents (the other agents) in it. Predicting an opponent's policy and intent from its interactions with the environment and adjusting the agent's own policy accordingly is an effective way to mitigate this problem. An adversarial game algorithm based on opponent action prediction is proposed, which models the opponents in the environment implicitly. The algorithm learns the opponent's policy features via supervised learning and fuses them into the agent's reinforcement learning model, reducing the opponent's impact on learning stability. Simulation experiments in a 1v1 football environment show that the algorithm effectively predicts the opponent's actions, speeds up convergence, and raises the agent's level of play.

4.
Application of a multi-agent particle swarm algorithm to distribution network reconfiguration
Combining the learning and coordination strategies of multi-agent systems with particle swarm optimization, a distribution network reconfiguration method based on multi-agent particle swarm optimization is proposed. The method uses the topology of the particle swarm to build the multi-agent architecture: each particle acts as an agent, and by competing and cooperating with the agents in its neighborhood it converges to the global optimum faster and more accurately. The particle update rule reduces the generation of infeasible solutions and improves the efficiency of the algorithm. Experimental results show that the method has high search efficiency and good optimization performance.
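As a rough illustration of the lattice-neighborhood competition-and-cooperation idea (not the paper's exact update rule or its network-reconfiguration encoding), here is a minimal Python sketch on a generic continuous objective:

```python
# Minimal sketch of lattice-structured multi-agent PSO for a generic
# minimization problem; the neighborhood rule below is an illustrative
# assumption, not the paper's exact update for distribution networks.
import numpy as np

def sphere(x):                       # placeholder objective; a reconfiguration
    return float(np.sum(x ** 2))     # loss would replace this in practice

rng = np.random.default_rng(0)
L, dim, iters = 5, 10, 200           # L x L lattice of agents
w, c1, c2 = 0.7, 1.5, 1.5            # standard PSO coefficients
pos = rng.uniform(-5, 5, (L, L, dim))
vel = np.zeros((L, L, dim))
pbest = pos.copy()
pbest_val = np.array([[sphere(pos[i, j]) for j in range(L)] for i in range(L)])

def best_neighbor(i, j):
    """Best personal-best among the four lattice neighbors (wrap-around)."""
    nbrs = [((i - 1) % L, j), ((i + 1) % L, j), (i, (j - 1) % L), (i, (j + 1) % L)]
    return min(nbrs, key=lambda t: pbest_val[t])

for _ in range(iters):
    for i in range(L):
        for j in range(L):
            ni, nj = best_neighbor(i, j)          # cooperate with the local best
            r1, r2 = rng.random(dim), rng.random(dim)
            vel[i, j] = (w * vel[i, j]
                         + c1 * r1 * (pbest[i, j] - pos[i, j])
                         + c2 * r2 * (pbest[ni, nj] - pos[i, j]))
            pos[i, j] += vel[i, j]
            val = sphere(pos[i, j])
            if val < pbest_val[i, j]:             # competition: keep the better position
                pbest[i, j], pbest_val[i, j] = pos[i, j].copy(), val

print("best value found:", pbest_val.min())
```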

5.
To address the sparse team rewards in multi-agent cooperative training, which cause low sample efficiency, ineffective exploration, and sensitivity to parameters, this work introduces a staged design on top of the MAPPO algorithm and proposes MSMAC, a multi-agent cooperative algorithm based on multi-stage reinforcement learning. Training is divided into two stages: first, single-agent policy networks are built and optimized with an evolution strategy; second, the multi-agent policy networks are trained cooperatively. Experiments in the multi-agent particle environment show that the multi-stage reinforcement learning algorithm improves not only cooperative performance but also sample efficiency and model convergence speed.

6.
In recent years deep reinforcement learning has achieved great success on a range of sequential decision-making problems, making it possible to provide effective optimized decision policies for complex, high-dimensional multi-agent systems. In complex multi-agent scenarios, however, existing multi-agent deep reinforcement learning algorithms converge slowly and their stability cannot be guaranteed. This work proposes a value-distribution-based multi-agent distributed deep deterministic policy gradient algorithm (multi-agent distribut...

7.
In multi-agent reinforcement learning research, because training and test environments differ, how to make agents cope effectively with changes in the policies of other agents in the environment has drawn wide attention. To address this generalization problem, a multi-agent role-based policy ensemble algorithm guided by human preference is proposed, which considers both long-term return and immediate reward. With this change, an agent selects, from candidate actions with good long-term cumulative return, the action with the largest immediate reward; this fixes the direction of policy updates, avoids excessive exploration and ineffective training, and finds the optimal policy quickly. In addition, agents are dynamically partitioned into roles, and agents with the same role share parameters, which not only improves efficiency but also makes the multi-agent algorithm scalable. Comparisons with existing algorithms in the multi-agent particle environment show that the agents generalize better to unseen environments, converge faster, and learn the optimal policy more efficiently.

8.
陈浩  李嘉祥  黄健  王菖  刘权  张中杰 《控制与决策》2023,38(11):3209-3218
When facing complex tasks such as high-dimensional continuous state spaces or sparse rewards, it is very difficult for a deep reinforcement learning algorithm alone to learn an optimal policy from scratch, and it remains an open problem how to represent existing knowledge in a form mutually understandable by humans and learning agents and use it to effectively accelerate policy convergence. To this end, a deep reinforcement learning framework that incorporates a cognitive behavior model is proposed: domain prior knowledge is modeled as a belief-desire-intention (BDI) cognitive behavior model that guides the agent's policy learning. On this framework, a deep Q-learning algorithm and a proximal policy optimization algorithm fused with the cognitive behavior model are proposed, and the way the cognitive behavior model guides the agent's policy updates is designed quantitatively. Finally, experiments in typical gym environments and an air-combat maneuver decision adversarial environment verify that the proposed algorithms can exploit the cognitive behavior model efficiently to accelerate policy learning and effectively mitigate the effects of a huge state space and sparse environmental rewards.

9.
Research on a Q-learning algorithm based on the Metropolis criterion
Exploration versus exploitation is the key issue in action selection for Q-learning: pure exploitation quickly traps the agent in a local optimum, and while exploration can escape local optima and speed up learning, too much exploration degrades the algorithm's performance. By viewing the search for an optimal policy in Q-learning as the search for an optimal solution of a combinatorial optimization problem, the Metropolis criterion of simulated annealing is used to balance exploration and exploitation in Q-learning, yielding SA-Q-learning, a Q-learning algorithm based on the Metropolis criterion. Comparisons show that it converges faster and avoids the performance degradation caused by excessive exploration.
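Here is a minimal sketch of Metropolis-criterion action selection plugged into tabular Q-learning on a toy chain MDP; the temperature schedule, environment, and hyperparameters are illustrative assumptions rather than the paper's setup.

```python
# Minimal sketch of the SA-Q-learning idea: propose a random action and accept
# it over the greedy one with probability exp((Q(s,a_rand) - Q(s,a_greedy)) / T).
import math
import random

def metropolis_action(q_row, temperature):
    """q_row: list of Q-values for the current state."""
    greedy = max(range(len(q_row)), key=lambda a: q_row[a])
    proposal = random.randrange(len(q_row))
    if proposal == greedy:
        return greedy
    delta = q_row[proposal] - q_row[greedy]
    if delta >= 0 or random.random() < math.exp(delta / max(temperature, 1e-8)):
        return proposal           # accept the exploratory action
    return greedy                 # otherwise exploit

# Tabular Q-learning loop on a toy 1-D chain MDP (hypothetical environment).
n_states, n_actions, goal = 8, 2, 7
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, T = 0.1, 0.95, 1.0
for episode in range(500):
    s = 0
    while s != goal:
        a = metropolis_action(Q[s], T)
        s_next = min(s + 1, goal) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == goal else 0.0
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next
    T *= 0.99                     # annealing: cool the temperature each episode

print("greedy policy:", [max(range(n_actions), key=lambda a: Q[s][a]) for s in range(n_states)])
```

As the temperature cools, exploratory proposals are accepted less often, which is the folding of simulated annealing's exploration schedule into Q-learning that the abstract describes.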

10.
As a machine learning method that needs no training data collected in advance, reinforcement learning (RL) searches for an optimal policy through continual interaction between agent and environment and is an important approach to sequential decision-making. Combined with deep learning (DL), deep reinforcement learning (DRL) gains both strong perception and strong decision-making ability and has been widely applied in many fields to solve complex decision problems. Off-policy reinforcement learning stores and replays interaction experience, separating exploration from exploitation, which makes it easier to find the global optimum; how to use this experience reasonably and efficiently is the key to improving the efficiency of off-policy methods. This survey first introduces the basic theory of reinforcement learning, then briefly reviews on-policy and off-policy algorithms, then presents the two main lines of work on experience replay (ER), namely experience exploitation and experience augmentation, and finally summarizes related work and gives an outlook.
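For context, here is a minimal sketch of the uniform experience replay buffer that off-policy methods build on; the capacity, batch size, and transition layout are illustrative choices, not taken from the survey.

```python
# Minimal sketch of a uniform experience replay buffer for off-policy RL.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)   # old experience is evicted FIFO

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)

# Usage: store transitions during interaction, then sample mini-batches
# for off-policy updates once the buffer is warm.
buf = ReplayBuffer()
for t in range(1000):
    buf.push(state=t, action=t % 4, reward=0.0, next_state=t + 1, done=False)
if len(buf) >= 32:
    states, actions, rewards, next_states, dones = buf.sample(32)
    print(len(states), "transitions sampled")
```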

11.
When attempting to solve multiobjective optimization problems (MOPs) using evolutionary algorithms, the Pareto genetic algorithm (GA) has now become a standard of sorts. After its introduction, this approach was further developed and led to many applications. All of these approaches are based on Pareto ranking and use the fitness sharing function to keep diversity. On the other hand, the scheme for solving MOPs presented by Nash introduced the notion of Nash equilibrium and aimed at solving MOPs that originated from evolutionary game theory and economics. Since the concept of Nash equilibrium was introduced, game theorists have attempted to formalize aspects of the evolutionary equilibrium. The Nash genetic algorithm (Nash GA) is the idea of bringing together genetic algorithms and Nash strategy; its aim is to find the Nash equilibrium through the genetic process. Another central achievement of evolutionary game theory is a method by which agents can play optimal strategies in the absence of rationality: through Darwinian selection, a population of agents can evolve to an evolutionarily stable strategy (ESS). In this article, we find the ESS as a solution of MOPs using a coevolutionary algorithm based on evolutionary game theory. By applying newly designed coevolutionary algorithms to several MOPs, we confirm that evolutionary game theory can be embodied by the coevolutionary algorithm and that this coevolutionary algorithm can find optimal equilibrium points as solutions for an MOP. We also show the optimization performance of the coevolutionary algorithm based on evolutionary game theory by applying this model to several MOPs and comparing the solutions with those of previous evolutionary optimization models. This work was presented, in part, at the 8th International Symposium on Artificial Life and Robotics, Oita, Japan, January 24-26, 2003.
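As a worked toy example of the evolutionarily stable strategy concept this abstract builds on (not the article's coevolutionary algorithm), replicator dynamics in the classic Hawk-Dove game converge to the mixed ESS; the payoff values below are an illustrative choice.

```python
# Minimal sketch of replicator dynamics converging to an evolutionarily
# stable strategy (ESS) in the Hawk-Dove game; V=2, C=3 are illustrative.
import numpy as np

V, C = 2.0, 3.0
# Row player's payoff matrix: strategies are [Hawk, Dove].
A = np.array([[(V - C) / 2, V],
              [0.0,         V / 2]])

x = np.array([0.9, 0.1])          # initial population shares of Hawk/Dove
dt = 0.01
for _ in range(20_000):
    fitness = A @ x               # expected payoff of each pure strategy
    avg = x @ fitness             # mean fitness in the population
    x = x + dt * x * (fitness - avg)   # replicator update
    x = np.clip(x, 0, None)
    x /= x.sum()

print("population state:", x)     # approaches the ESS with x_Hawk = V / C ≈ 0.667
```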

12.
To model supply-demand interaction in power systems under an electricity-market environment more precisely, and to better match the changing relationships and complex communication topologies among demand-side load aggregators in future electricity markets, this paper combines the Stackelberg game of power system supply-demand interaction with an evolutionary game on complex networks that captures the interaction among demand-side load aggregators, building a hybrid game model of supply-demand interaction that accounts for market factors. A hybrid-game reinforcement learning algorithm is proposed to solve the resulting non-convex, discontinuous optimization problem. Built on Q-learning and drawing on ideas from game theory and graph theory, the algorithm combines block-wise coordination with evolutionary games and makes full use of the knowledge-matrix information formed by the interactive game relations among players, yielding high-quality solutions to non-convex optimization problems for multi-agent systems on complex networks. Simulation results on four 3-machine, 6-load systems built from complex-network theory and on the power grid of a first-tier city in southern China show that the hybrid-game reinforcement learning algorithm outperforms most centralized intelligent algorithms in optimization and maintains good results across different networks, demonstrating strong adaptability and stability.

13.
The multi-depot vehicle routing problem (MDVRP) is a problem model widely used in today's supply chains. Existing algorithms are mostly heuristic, which are slow and cannot guarantee solution quality, so fast and effective solution methods are of both academic and practical value. With the objective of minimizing total vehicle routing distance, a solution model based on multi-agent deep reinforcement learning is proposed. First, the multi-agent reinforcement learning formulation of the MDVRP is defined, including states, actions, rewards, and the state-transition function, so that the model can be trained with multi-agent reinforcement learning. Then, by defining node neighborhoods and a masking mechanism for the MDVRP, a policy network composed of multiple agent networks is designed based on the attention mechanism and trained with a policy gradient algorithm to obtain a model capable of fast solving. Next, a 2-opt local search strategy and a sampling search strategy are used to improve solution quality. Finally, simulations on problems of different sizes and comparisons with other algorithms verify that the proposed multi-agent deep reinforcement learning model, combined with the search strategies, can quickly obtain high-quality solutions.
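Here is a minimal sketch of the 2-opt local search step used to refine routes produced by a learned policy; the random instance and single-route setting below are simplifying assumptions, not the paper's MDVRP encoding.

```python
# Minimal sketch of 2-opt local search on a single depot-to-depot route.
import numpy as np

def route_length(route, dist):
    return sum(dist[route[i], route[i + 1]] for i in range(len(route) - 1))

def two_opt(route, dist):
    """Repeatedly reverse segments while doing so shortens the route."""
    best = list(route)
    improved = True
    while improved:
        improved = False
        for i in range(1, len(best) - 2):
            for j in range(i + 1, len(best) - 1):
                cand = best[:i] + best[i:j + 1][::-1] + best[j + 1:]
                if route_length(cand, dist) < route_length(best, dist):
                    best, improved = cand, True
    return best

rng = np.random.default_rng(1)
coords = rng.random((10, 2))                      # depot = node 0, 9 customers
dist = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
route = [0] + list(rng.permutation(range(1, 10))) + [0]   # e.g. sampled by the policy
improved_route = two_opt(route, dist)
print("before:", round(route_length(route, dist), 3))
print("after :", round(route_length(improved_route, dist), 3))
```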

14.
In recent years, deep reinforcement learning has been used as a model-free resource-allocation method to address co-channel interference in wireless networks. However, networks based on conventional experience replay struggle to learn valuable experience, leading to slow convergence, and manually fixed exploration step sizes ignore how the algorithm is learning in each training epoch, making exploration of the environment blind and limiting gains in spectral efficiency. To address this, a distributed reinforcement learning power-control method for FDMA systems is proposed. It adopts prioritized experience replay to encourage agents to learn from more important data and so accelerate learning, and it designs an exploration strategy suited to distributed reinforcement learning that adjusts the step size dynamically, so that each agent explores its local environment according to its own learning progress and the blindness of manually set step sizes is reduced. Experimental results show that, compared with existing algorithms, the method converges faster, suppresses co-channel interference better in mobile scenarios, and achieves higher performance in large networks.
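Here is a minimal sketch of proportional prioritized experience replay, the mechanism referred to above: transitions are sampled with probability proportional to |TD error|^alpha and corrected with importance-sampling weights. The hyperparameters and buffer layout are illustrative assumptions.

```python
# Minimal sketch of proportional prioritized experience replay.
import numpy as np

class PrioritizedReplay:
    def __init__(self, capacity=10_000, alpha=0.6, beta=0.4, eps=1e-6):
        self.capacity, self.alpha, self.beta, self.eps = capacity, alpha, beta, eps
        self.data, self.priorities, self.pos = [], np.zeros(capacity), 0

    def push(self, transition):
        max_p = self.priorities.max() if self.data else 1.0   # new data gets max priority
        if len(self.data) < self.capacity:
            self.data.append(transition)
        else:
            self.data[self.pos] = transition
        self.priorities[self.pos] = max_p
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size=32):
        p = self.priorities[:len(self.data)] ** self.alpha
        p /= p.sum()
        idx = np.random.choice(len(self.data), batch_size, p=p)
        weights = (len(self.data) * p[idx]) ** (-self.beta)   # importance-sampling correction
        weights /= weights.max()
        return idx, [self.data[i] for i in idx], weights

    def update_priorities(self, idx, td_errors):
        self.priorities[idx] = np.abs(td_errors) + self.eps

# Usage: after a learning step, feed the batch's TD errors back as new priorities.
buf = PrioritizedReplay()
for t in range(200):
    buf.push((t, t % 3, 0.0, t + 1, False))
idx, batch, w = buf.sample(16)
buf.update_priorities(idx, td_errors=np.random.randn(16))
```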

15.
Operations research and management science are often confronted with sequential decision-making problems with large state spaces. Standard methods for solving such complex problems face well-known difficulties: as we discuss in this article, they are plagued by the so-called curse of dimensionality and the curse of modelling. We discuss reinforcement learning, a machine learning technique for solving sequential decision-making problems with large state spaces, and describe how it can be combined with a function approximation method to avoid both curses. To illustrate the usefulness of this approach, we apply it to a problem with a huge state space: learning to play the game of Othello. We describe experiments in which reinforcement learning agents learn to play Othello without any knowledge provided by human experts. It turns out that the reinforcement learning agents learn to play Othello better than players that use basic strategies.

16.
Reinforcement learning learns a mapping from environment states to actions so as to maximize the reward signal. Three families of methods achieve return maximization in reinforcement learning: value iteration, policy iteration, and policy search. This paper introduces the principles and algorithms of reinforcement learning, studies discrete-space value iteration algorithms with and without an environment model, and applies them to grid-world problems with fixed and random start states. Experimental results show that, compared with policy iteration, the value iteration algorithm converges faster and achieves better accuracy in the experiments.
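Here is a minimal sketch of discrete-space value iteration on a small grid world with a fixed goal; the grid size, step cost, and discount factor are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of value iteration on a deterministic 4x4 grid world.
import numpy as np

N, gamma, theta = 4, 0.9, 1e-6
goal = (N - 1, N - 1)
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]     # up, down, left, right

def step(s, a):
    if s == goal:
        return s, 0.0                            # goal is absorbing
    ns = (min(max(s[0] + a[0], 0), N - 1), min(max(s[1] + a[1], 0), N - 1))
    return ns, (1.0 if ns == goal else -0.04)    # small step cost elsewhere

V = np.zeros((N, N))
while True:                                       # Bellman optimality backups
    delta = 0.0
    for i in range(N):
        for j in range(N):
            s = (i, j)
            best = max(r + gamma * V[ns] for ns, r in (step(s, a) for a in actions))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
    if delta < theta:
        break

def greedy_action(s):
    return int(np.argmax([step(s, a)[1] + gamma * V[step(s, a)[0]] for a in actions]))

policy = np.array([[greedy_action((i, j)) for j in range(N)] for i in range(N)])
print(np.round(V, 2))
print(policy)    # indices into `actions`: greedy policy w.r.t. the converged V
```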

17.
In this paper we introduce a new multi-agent reinforcement learning algorithm, called exploring selfish reinforcement learning (ESRL). ESRL allows agents to reach optimal solutions in repeated non-zero-sum games with stochastic rewards by using coordinated exploration. First, two ESRL algorithms are presented, for common interest and conflicting interest games respectively. Both are based on the same idea: an agent explores by temporarily excluding some of the local actions from its private action space, giving the team of agents the opportunity to look for better solutions in a reduced joint action space. In a later stage these two algorithms are combined into one generic algorithm that does not assume the type of the game is known in advance. ESRL finds the Pareto optimal solution in common interest games without communication; in conflicting interest games it needs only limited communication to learn a fair periodical policy, resulting in a good overall policy. Importantly, ESRL agents are independent in the sense that they base their decisions only on their own action choices and rewards; they are flexible in learning different solution concepts; and they can handle stochastic, possibly delayed rewards and asynchronous action selection. A real-life experiment, adaptive load-balancing of parallel applications, is included.

18.
A survey of multi-agent reinforcement learning methods under the stochastic-game framework
宋梅萍  顾国昌  张国印 《控制与决策》2005,20(10):1081-1090
Multi-agent learning studies, within the framework of stochastic games, how multiple agents master interaction skills through self-learning. The success of single-agent reinforcement learning research, the solid mathematical foundation of game theory, and the broad application prospects in complex task environments have made multi-agent reinforcement learning an important topic in machine learning. This survey first gives formal definitions of the basic concepts of stochastic games for multi-agent systems, then reviews learning algorithms for stochastic games and repeated games together with other related work, and finally, in light of recent developments, surveys applications of multi-agent learning in e-commerce, robotics, and military domains and discusses remaining problems and future research directions.

19.
This paper examines the performance of simple reinforcement learning algorithms in a stationary environment and in a repeated game where the environment evolves endogenously based on the actions of other agents. Some types of reinforcement learning rules can be extremely sensitive to small changes in the initial conditions; consequently, events early in a simulation can affect the performance of the rule over a relatively long time horizon. However, when multiple adaptive agents interact, algorithms that performed poorly in a stationary environment often converge rapidly to stable aggregate behaviors despite the slow and erratic behavior of individual learners. Algorithms that are robust in stationary environments can exhibit slow convergence in an evolving environment.
