Similar Literature
 18 similar documents found (search time: 203 ms)
1.
Distributed convex optimization aims to minimize the sum of the agents' local cost functions in a distributed manner, yet the step sizes of existing distributed algorithms depend on global information such as the number of agents or network-wide matrices, which runs counter to the distributed premise. To address this, a fully distributed convex optimization algorithm (FDCOA) over unbalanced directed networks is proposed. Building on multi-agent consensus theory and gradient-tracking techniques, a non-negative surplus iteration strategy is designed so that the admissible range of the FDCOA step size depends only on each agent's local information, allowing the step size to be set in a fully distributed way. Convergence of FDCOA is further analyzed for both fixed and time-varying strongly connected networks. Simulation results show that the proposed distributed step-size selection method is effective for distributed convex optimization over unbalanced directed networks.
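A minimal sketch of the gradient-tracking idea the abstract builds on, shown here on a balanced network with a doubly stochastic mixing matrix rather than the unbalanced directed networks FDCOA targets; the quadratic costs, mixing weights, and step size are all hypothetical:

```python
import numpy as np

# Toy setup: 3 agents, each with a private quadratic cost
# f_i(x) = (x - a_i)^2; the network minimizes the sum, whose optimum is mean(a).
a = np.array([1.0, 2.0, 6.0])
grad = lambda x: 2.0 * (x - a)          # element i holds agent i's local gradient

# Doubly stochastic mixing matrix for a fully connected 3-agent network
W = np.array([[0.6, 0.2, 0.2],
              [0.2, 0.6, 0.2],
              [0.2, 0.2, 0.6]])

alpha = 0.05                             # step size, chosen small for stability
x = a.copy()                             # each agent starts at its own local minimizer
y = grad(x)                              # gradient trackers, initialized to local gradients

for _ in range(500):
    x_next = W @ x - alpha * y           # consensus step plus descent along tracked gradient
    y = W @ y + grad(x_next) - grad(x)   # tracker update: follows the average gradient
    x = x_next

print(x)  # all agents approach mean(a) = 3.0
```

FDCOA's contribution, per the abstract, is precisely that the admissible `alpha` can be found from local information only; here it is simply hard-coded.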

2.
To address the low sample efficiency, ineffective exploration, and parameter sensitivity caused by sparse team rewards in cooperative multi-agent training, this study extends the MAPPO algorithm with a staged design and proposes MSMAC, a multi-agent cooperation algorithm based on multi-stage reinforcement learning. Training is divided into two stages: first, a single-agent policy network is built and optimized with evolution strategies; second, the multi-agent policy networks are trained cooperatively. Experiments in the multi-agent particle environment show that the multi-stage reinforcement learning algorithm improves not only cooperative performance but also sample efficiency and model convergence speed.
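The evolution-strategies optimization named for stage one can be sketched in its simplest form: perturb the parameter, score each perturbation, and move along the reward-weighted average of the noise. The 1-D "policy" parameter, reward function, and hyperparameters below are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
reward = lambda theta: -(theta - 3.0) ** 2   # toy objective, maximized at theta = 3

theta, sigma, lr, pop = 0.0, 0.5, 0.1, 20
for _ in range(300):
    eps = rng.standard_normal(pop)           # a population of Gaussian perturbations
    scores = np.array([reward(theta + sigma * e) for e in eps])
    scores = (scores - scores.mean()) / (scores.std() + 1e-8)  # normalize as a baseline
    theta += lr / (pop * sigma) * (scores * eps).sum()         # ES gradient estimate

print(round(theta, 2))                       # hovers near the optimum at 3
```

No backpropagation is needed, which is why ES is attractive for warming up a policy network before the cooperative stage.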

3.
A Q-learning algorithm based on empirical knowledge  (cited 1 time: 0 self, 1 other)
To speed up the learning and convergence of Q-learning, the canonical reinforcement learning method in agent systems, and to make fuller use of environmental information during learning, this paper proposes a Q-learning algorithm based on empirical knowledge. The algorithm uses a function carrying empirical knowledge so that the agent learns a system model while performing model-free learning, avoiding repeated re-learning of the environment model and thereby accelerating learning. Simulation results show that the algorithm builds the learning process on a better foundation and approaches the optimal state faster, with learning efficiency and convergence speed clearly superior to standard Q-learning.
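For reference, the standard tabular Q-learning the paper improves on can be stated in a few lines; the 5-state chain environment below is a hypothetical stand-in (reach state 4 for reward 1), not the paper's benchmark:

```python
import random

random.seed(0)
N_STATES, ACTIONS = 5, (-1, +1)            # actions: step left / step right
alpha, gamma, eps = 0.5, 0.9, 0.2
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for _ in range(500):                        # episodes, from random start states
    s = random.randrange(N_STATES - 1)
    while s != 4:
        a = random.choice(ACTIONS) if random.random() < eps else \
            max(ACTIONS, key=lambda a: Q[(s, a)])       # epsilon-greedy action
        s2 = min(max(s + a, 0), N_STATES - 1)
        r = 1.0 if s2 == 4 else 0.0
        target = r + gamma * max(Q[(s2, a2)] for a2 in ACTIONS) * (s2 != 4)
        Q[(s, a)] += alpha * (target - Q[(s, a)])       # Q-learning update rule
        s = s2

# The learned greedy policy should step right from every non-terminal state
print([max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(4)])
```

The paper's variant would additionally bias updates with an experience-knowledge function; that function is not specified in the abstract, so it is omitted here.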

4.
夏琳  罗威  王俊霞  黄一学 《软件》2023,(2):17-22+41
[Objective] To address low sample utilization, sparse rewards, and slow convergence in multi-agent reinforcement learning, a hindsight experience replay variant of the MAAC (Actor-Attention-Critic for Multi-Agent Reinforcement Learning) algorithm, HER-MAAC, is proposed. [Methods] Failed exploration experience is reused: for goals selected by the hindsight experience replay scheme, rewards are recomputed and the transitions are stored in the replay buffer, increasing the proportion of successful experience in the buffer and thus the efficiency of sample extraction. [Results] Experiments show that, compared with the original MAAC, HER-MAAC raises agent success rates and markedly improves reward values. In the benchmark environment, the win rate improved by 7.3% when training 3 agents, by 8.1% with 4 agents, and by 5.7% with 5 agents. [Conclusion] The results show that the improved algorithm effectively raises multi-agent training efficiency.
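The hindsight relabeling step described in [Methods] can be sketched independently of MAAC: a failed goal-reaching transition is stored a second time with the goal replaced by the outcome actually achieved, turning it into a success. The sparse reward form and the transition fields are assumptions, not the paper's exact design:

```python
def sparse_reward(achieved, goal):
    return 0.0 if achieved == goal else -1.0   # typical sparse goal-conditioned reward

buffer = []

def store_with_her(state, action, achieved, goal):
    buffer.append((state, action, sparse_reward(achieved, goal), goal))
    # hindsight copy: pretend the achieved outcome was the goal all along
    buffer.append((state, action, sparse_reward(achieved, achieved), achieved))

store_with_her(state=(0, 0), action="right", achieved=(1, 0), goal=(5, 5))
print(buffer[0][2], buffer[1][2])  # original reward -1.0, relabeled reward 0.0
```

Every failed episode thus contributes at least one success to the buffer, which is how the proportion of successful experience is increased.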

5.
Contesting control of the electromagnetic spectrum has become the primary task of cognitive electronic warfare, with cognitive jamming at its core. Traditional jamming schemes are inflexible: against communication systems with some anti-jamming capability, their effective jamming rate is low and resources are easily wasted. To improve jamming effectiveness, drawing on deep reinforcement learning, a communication-jamming strategy generation algorithm based on Double Deep Q-Networks (DDQN) is proposed and a jamming decision network is built. To address the difficulty of balancing exploration and exploitation in conventional reinforcement learning, the exploration probability is controlled by a factor derived from the historical average reward, improving the exploration strategy. Simulations show that the improved algorithm achieves a higher effective jamming rate and faster convergence than both the unimproved and traditional algorithms, and that over repeated interaction with the environment the jammer gradually learns the optimal strategy.
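The two ingredients named in the abstract, the double-DQN target and reward-driven exploration, can be sketched as follows. Random Q-tables stand in for networks, and the exact form of the average-reward exploration schedule is an assumption (the abstract only says the average reward controls the exploration probability):

```python
import numpy as np

rng = np.random.default_rng(1)
q_online = rng.normal(size=(4, 3))    # hypothetical Q-values; rows: next states, cols: actions
q_target = rng.normal(size=(4, 3))
gamma = 0.99

def ddqn_target(r, s_next, done):
    # online net selects the action, target net evaluates it: less overestimation
    a_star = int(np.argmax(q_online[s_next]))
    return r + gamma * q_target[s_next, a_star] * (not done)

def epsilon(avg_reward, eps_min=0.05, eps_max=1.0, scale=1.0):
    # assumed schedule: explore less as the running average reward grows
    return max(eps_min, eps_max / (1.0 + scale * max(avg_reward, 0.0)))

# a vanilla DQN target would use q_target[s_next].max(), which skims noise upward
print(ddqn_target(r=1.0, s_next=2, done=False) <= 1.0 + gamma * q_target[2].max())
```

The decoupling of selection from evaluation is what distinguishes DDQN from DQN; the epsilon schedule then ties exploration pressure to observed performance.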

6.
In recent years, deep reinforcement learning (DRL) has become a research hotspot in artificial intelligence. To accelerate DRL training, distributed reinforcement learning methods have been proposed. Current distributed RL falls into on-policy methods, off-policy methods, and the more recent near-on-policy methods. Near-on-policy methods mitigate the problems of both, but their shared-memory parallel model makes them hard to scale to network-interconnected compute clusters; this low scalability limits the resources they can exploit, increases the load on compute nodes, and ultimately lengthens training. To improve the scalability and convergence speed of near-on-policy methods, this paper proposes PALA (Parallel Actor-Learner Architecture), a message-passing training framework that combines a Gossip algorithm with model fusion, accelerating convergence through greater training parallelism and scalability. First, using the Gossip algorithm as the communication substrate, a global data broker, and a message-passing model, the framework builds a scalable set of parallel single-agent training procedures. Second, to keep exploration and exploitation on-policy and training stable, a process lock enabling implicit synchronization across machines is devised. Finally, this paper addresses …
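The gossip primitive underlying such frameworks is easy to sketch: nodes repeatedly average their state with a random peer, so all copies drift toward the global mean without any central server. Scalar values stand in for model parameters here; PALA's actual fusion rule and broker are not modeled:

```python
import random

random.seed(0)
values = [1.0, 5.0, 9.0, 13.0]          # one "model" per node; the global mean is 7.0

for _ in range(200):                    # asynchronous pairwise gossip rounds
    i, j = random.sample(range(len(values)), 2)
    avg = (values[i] + values[j]) / 2   # the only communication: one peer-to-peer exchange
    values[i] = values[j] = avg

print([round(v, 3) for v in values])    # every node ends up close to 7.0
```

Because each exchange preserves the sum, the mean is exact while disagreement decays geometrically, which is what makes gossip attractive for cluster-scale training without shared memory.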

7.
As a machine learning method that requires no training data in advance, reinforcement learning (RL) searches for an optimal policy through continual interaction between agent and environment, and is an important approach to sequential decision problems. Combined with deep learning (DL), deep reinforcement learning (DRL) gains both strong perception and strong decision-making capability, and is widely applied across many fields to solve complex decision problems. Off-policy reinforcement learning stores and replays interaction experience, separating exploration from exploitation and making the global optimum easier to find. Using experience sensibly and efficiently is key to improving the efficiency of off-policy methods. This survey first introduces the fundamentals of reinforcement learning; then briefly reviews on-policy and off-policy algorithms; next presents the two mainstream lines of work on experience replay (ER), namely experience exploitation and experience augmentation; and finally summarizes related work and offers an outlook.
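The uniform replay buffer that this line of work builds on is worth stating concretely: transitions are stored once and sampled independently, decoupling data collection from the update distribution. A minimal sketch (the transition format is arbitrary):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)   # oldest transitions evicted first

    def push(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size):
        return random.sample(self.storage, batch_size)  # uniform, without replacement

buf = ReplayBuffer(capacity=3)
for t in range(5):                              # push 5 items into capacity 3
    buf.push((t, "state", "action"))
print(sorted(x[0] for x in buf.storage))        # only the newest survive: [2, 3, 4]
```

The experience-exploitation literature the survey covers (prioritization, reuse schedules) and the experience-augmentation literature (relabeling, synthetic transitions) both modify pieces of this baseline.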

8.
Non-stationarity is one of the main challenges for deep learning in multi-agent environments: it breaks the Markov assumption that most single-agent reinforcement learning algorithms rely on, so each agent risks falling, during learning, into an endless loop induced by the environment the other agents create. To address this, the centralized-training decentralized-execution (CTDE) architecture for reinforcement learning is studied, and the QMIX algorithm is improved from two angles, inter-agent communication and agent exploration, by adopting variance-based control (VBC) and introducing a curiosity mechanism. The proposed algorithm is validated on micromanagement scenarios in the StarCraft II Learning Environment (SC2LE). Experiments show that, compared with QMIX, the proposed algorithm performs better and yields training models that converge faster.

9.
Reinforcement learning is increasingly applied to multi-agent systems. Reward signals guide agent learning, but multi-agent tasks are complex and environmental feedback may arrive only when a task ends, making rewards sparse and sharply reducing convergence speed and efficiency. To tackle sparse rewards, a multi-agent reinforcement learning method based on rational curiosity is proposed. Inspired by intrinsic-motivation theory, curiosity is extended to the multi-agent setting with a rational-curiosity reward mechanism: a decomposition-and-sum network structure encodes differently permuted joint states into a single feature representation, shrinking the joint-state exploration space, and the network's prediction error serves as the intrinsic reward, steering agents toward novel and useful states. On this basis, two value networks are introduced to evaluate the Q-value, with a minimization operator computing the target value to temper Q-value overestimation bias and variance, and a mean-based optimization strategy to improve sample utilization. Experimental evaluation on predator-prey and cooperative-navigation tasks shows that, on the hardest predator-prey task, the method improves the win rate by about 15% over baseline algorithms while reducing the required time steps by about 20%, and it also converges quickly on cooperative navigation.
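The double-value-network trick with a minimization operator can be sketched in one expression: keep two critics and take the elementwise minimum when forming the target, which counteracts overestimation (as in clipped double Q-learning; whether the paper applies min before or after the action maximization is not stated, so this is one plausible reading). The numbers are illustrative:

```python
import numpy as np

q1_next = np.array([1.2, 0.7, 2.1])   # critic 1's estimates for next-state actions
q2_next = np.array([0.9, 1.1, 1.6])   # critic 2's estimates for the same actions
r, gamma = 0.5, 0.99

# pessimistic target: elementwise min across critics, then max over actions
target = r + gamma * np.minimum(q1_next, q2_next).max()
print(round(float(target), 3))        # -> 2.084
```

Taking the minimum biases the target downward wherever the two critics disagree, which is exactly where single-critic bootstrapping tends to overestimate.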

10.
Deep reinforcement learning algorithms handle discretized decision behaviors well, but are hard to apply to the highly complex, continuous-action modern battlefield, and in multi-agent settings they struggle to converge. To address these problems, an improved deep deterministic policy gradient (DDPG) algorithm is proposed. It introduces prioritized experience replay and a single-training mode to speed up convergence, and designs a mixed dual-noise exploration strategy to realize complex, continuous military decision and control behaviors. An intelligent military decision simulation platform based on the improved DDPG was developed in Unity, with a scenario in which blue-force infantry assault a red-force base, simulating multi-agent combat training. Experiments show that the algorithm can drive multiple combat agents through tactical maneuvers, such as bypassing obstacles to reach advantageous positions and open fire, with faster convergence, better stability, and higher episode rewards, achieving the goal of more efficient intelligent military decision-making.

11.
To address the slow convergence and low replay-buffer utilization of conventional deep reinforcement learning (DRL), a task-offloading strategy for disaster-response scenarios based on multi-agent deep reinforcement learning (MADRL) is proposed. First, for an MEC network environment that varies per time slot and requires multi-hop sensor data when a disaster occurs, a MADRL-based task-offloading model for disaster scenarios is built. Then, to counter the slow convergence caused by high-dimensional action spaces in conventional DRL, the mutation and crossover operators of an adaptive differential evolution (ADE) algorithm are used to explore the action space, and an adaptive parameter-tuning strategy adjusts the number of ADE iterations, sparing DRL a large amount of useless exploration early in training. Finally, to further raise data utilization in the conventional DRL replay buffer, prioritized experience replay is added to accelerate network training. Simulations show that the ADE-DDPG algorithm saves 35% of the overall cost compared with an improved deep deterministic policy gradient (DDPG) network, confirming the performance of ADE-DDPG.

12.
In reinforcement learning, sparse-reward environments deny the agent effective experience, sharply reducing convergence speed and efficiency. For such sparse rewards, this paper proposes an emotion-based heterogeneous multi-agent reinforcement learning method. First, a personality-based agent emotion model is built, providing an incentive mechanism for heterogeneous agents as an effective complement to the external reward. Then, building on this incentive mechanism and incorporating deep deterministic policies, a sparse-reward method based on intrinsic emotional inc…

13.
In deep reinforcement learning exploration, decisions must be made from the external reward the environment provides; in sparse-reward environments, no information is available early in training, and late in training it is hard to adjust the exploration strategy dynamically in light of the information already gathered. To mitigate this, a priority state estimation method is proposed: states are assigned a priority value when visited, which is stored in the experience pool alongside the external reward to guide the direction of the exploration policy. Combining DDQN (Double Deep Q-Network) with prioritized experience replay, comparative experiments on the classic MountainCar control problem from OpenAI Gym and the Atari 2600 game FreeWay show that the method learns better in sparse-reward environments and achieves higher average scores.
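The prioritized sampling that the method builds on draws transitions with probability proportional to priority raised to an exponent; a minimal sketch of proportional prioritization with the usual importance-sampling correction (priorities, `alpha`, and `beta` are illustrative, and the paper's separate per-state priority values are not modeled):

```python
import numpy as np

rng = np.random.default_rng(0)
priorities = np.array([0.1, 0.1, 5.0, 0.1])   # e.g. TD errors or stored priority values
alpha = 0.6                                    # how strongly priority skews sampling

probs = priorities ** alpha
probs /= probs.sum()
batch = rng.choice(len(priorities), size=1000, p=probs)

# importance-sampling weights correct the bias introduced by non-uniform draws
beta = 0.4
weights = (len(priorities) * probs) ** (-beta)
weights /= weights.max()

print((batch == 2).mean() > 0.5)   # the high-priority transition dominates the batch
```

With `alpha = 0` this degrades gracefully to uniform replay; raising `alpha` concentrates updates on surprising transitions at the cost of more bias for the weights to undo.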

14.
Policy iteration, which evaluates and improves the control policy iteratively, is a reinforcement learning method. Policy evaluation with the least-squares method can draw more useful information from the empirical data and therefore improve data validity. However, most existing online least-squares policy iteration methods use each sample only once, resulting in a low utilization rate. With the goal of improving utilization efficiency, we propose experience replay for least-squares policy iteration (ERLSPI) and prove its convergence. ERLSPI combines online least-squares policy iteration with experience replay: it stores the samples generated online and reuses them with the least-squares method to update the control policy. We apply ERLSPI to the inverted pendulum system, a typical benchmark. The experimental results show that the method can effectively exploit previous experience and knowledge, improve the utilization efficiency of empirical data, and accelerate convergence.
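The least-squares policy evaluation that ERLSPI replays samples through can be sketched as LSTD: accumulate A = Σ φ(s)(φ(s) - γφ(s'))ᵀ and b = Σ φ(s)·r over stored samples, then solve A w = b. The 2-state chain with one-hot features below is a hypothetical check, not the inverted pendulum benchmark:

```python
import numpy as np

gamma = 0.9
phi = np.eye(2)                       # one-hot features for states 0 and 1

# replayed samples (s, r, s'): state 0 steps to 1 with reward 0,
# state 1 self-loops with reward 1, so V(1) = 1/(1 - gamma) = 10 and V(0) = 9.
samples = [(0, 0.0, 1), (1, 1.0, 1)]

A = np.zeros((2, 2))
b = np.zeros(2)
for s, r, s2 in samples:              # each stored sample feeds the LS estimates
    A += np.outer(phi[s], phi[s] - gamma * phi[s2])
    b += phi[s] * r

w = np.linalg.solve(A, b)
print(np.round(w, 3))                 # value estimates [9., 10.]
```

Because A and b are sums over samples, replaying a stored transition is just adding its term again, which is exactly the reuse ERLSPI exploits.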

15.
Path planning and obstacle avoidance are two challenging problems in the study of intelligent robots. In this paper, we develop a new method to alleviate these problems based on deep Q-learning with experience replay and heuristic knowledge. In this method, a neural network is used to resolve the "curse of dimensionality" issue of the Q-table in reinforcement learning. As a robot walks in an unknown environment, it collects experience data that is used to train the neural network; this process is called experience replay. Heuristic knowledge helps the robot avoid blind exploration and provides more effective data for training the network. The simulation results show that, in comparison with existing methods, our method converges to an optimal action strategy in less time and can explore a path in an unknown environment with fewer steps and a larger average reward.

16.
Reinforcement learning (RL) is a biologically supported learning paradigm that allows an agent to learn through experience acquired by interaction with its environment. Its potential to learn complex action sequences has been proven for a variety of problems, such as navigation tasks. However, the interactive randomized exploration of the state space, common in reinforcement learning, makes it difficult to use in real-world scenarios. In this work we describe a novel real-world reinforcement learning method that combines a supervised reinforcement learning approach with Gaussian-distributed state activation. We successfully tested this method in two real scenarios of humanoid robot navigation: backward movements for docking at a charging station, and forward movements to prepare grasping. Our approach reduces the required learning steps by more than an order of magnitude, is robust, and is easy to integrate into conventional RL techniques.

17.
Deep learning techniques have shown success in learning from raw high-dimensional data in various applications. While deep reinforcement learning has recently gained popularity as a method to train intelligent agents, the use of deep learning in imitation learning has scarcely been explored. Imitation learning can be an efficient method to teach intelligent agents by providing a set of demonstrations to learn from. However, generalizing to situations that are not represented in the demonstrations can be challenging, especially in 3D environments. In this paper, we propose a deep imitation learning method to learn navigation tasks from demonstrations in a 3D environment. The supervised policy is refined using active learning in order to generalize to unseen situations. This approach is compared with two popular deep reinforcement learning techniques: deep Q-networks (DQN) and asynchronous advantage actor-critic (A3C). The proposed method and the reinforcement learning methods all employ deep convolutional neural networks and learn directly from raw visual input. Methods for combining learning from demonstrations with learning from experience are also investigated; this combination aims to join the generalization ability of learning by experience with the efficiency of learning by imitation. The proposed methods are evaluated on four navigation tasks in a 3D simulated environment. Navigation tasks are a typical problem relevant to many real applications: they require demonstrations of long trajectories to reach the target and provide only delayed (usually terminal) rewards to the agent. The experiments show that the proposed method can successfully learn navigation tasks from raw visual input, while the learning-from-experience methods fail to learn an effective policy. Moreover, active learning can significantly improve the performance of the initially learned policy using a small number of active samples.

18.
Many interesting problems in reinforcement learning (RL) are continuous and/or high dimensional, and in such cases RL techniques require function approximators for learning value functions and policies. Local linear models have often been preferred over distributed nonlinear models for function approximation in RL. We suggest that one reason for the difficulties encountered when using distributed architectures in RL is the problem of negative interference, whereby learning new data disrupts previously learned mappings. The continuous temporal difference (TD) learning algorithm TD(λ) was used to learn a value function in a limited-torque pendulum swing-up task using a multilayer perceptron (MLP) network. Three approaches to learning in the MLP networks were examined: 1) simple gradient descent; 2) vario-eta; and 3) a pseudopattern rehearsal strategy that attempts to reduce the effects of interference. Our results show that MLP networks can be used for value function approximation in this task but require long training times. We also found that vario-eta destabilized learning and caused the learning process to fail to converge. Finally, we showed that the pseudopattern rehearsal strategy drastically improved the speed of learning. The results indicate that interference is a greater problem than ill-conditioning for this task.
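The TD(λ) update at the heart of the study can be shown in tabular form (the MLP approximator is replaced by a table for clarity, and the 3-state chain is a hypothetical stand-in for the pendulum task): an eligibility trace spreads each TD error back over recently visited states.

```python
import numpy as np

gamma, lam, alpha = 0.9, 0.8, 0.1
V = np.zeros(3)                       # 3-state chain: 0 -> 1 -> 2 (terminal, reward 1)

for _ in range(200):                  # episodes
    e = np.zeros(3)                   # eligibility traces reset each episode
    for s, r, s2, done in [(0, 0.0, 1, False), (1, 1.0, 2, True)]:
        delta = r + gamma * V[s2] * (not done) - V[s]   # TD error
        e[s] += 1.0                                     # accumulating trace
        V += alpha * delta * e                          # update all traced states at once
        e *= gamma * lam                                # decay the traces

print(np.round(V[:2], 2))             # approaches the true values [0.9, 1.0]
```

With λ = 0 this reduces to one-step TD; λ near 1 approaches Monte Carlo. In the paper the same `V += alpha * delta * e` step becomes a gradient update on MLP weights, which is where negative interference enters.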

