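闫超, 相晓嘉, 徐昕, 王菖, 周晗, 沈林成. 《控制与决策》 (Control and Decision), 2022, 37(12): 3083-3102
Thanks to the powerful feature representation of deep learning and the effective policy learning of reinforcement learning, deep reinforcement learning has achieved remarkable results in a range of complex sequential decision-making problems. Following its successful application to many single-agent tasks, research on deep reinforcement learning in multi-agent systems is growing rapidly. In recent years, multi-agent deep reinforcement learning has attracted wide attention in artificial intelligence, and scalability and transferability have become one of its core research topics. In view of this, the paper first explains the development and typical algorithms of deep reinforcement learning, introduces three learning paradigms of multi-agent deep reinforcement learning, and analyzes two typical classes of multi-agent reinforcement learning algorithms, namely decomposed value function methods and centralized value function methods. It then summarizes six classes of scalable multi-agent deep reinforcement learning models, including attention mechanisms and graph neural networks, and reviews research progress on transfer learning and curriculum learning for transferability in multi-agent deep reinforcement learning. Finally, it discusses application prospects and research directions, providing a reference for the further development of multi-agent deep reinforcement learning.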

Pedestrian simulation is complex because behavior must be modeled at several levels. At the lowest level, local interactions between agents occur; at the middle level, strategic and tactical behaviors such as overtaking or route choice appear; and at the highest level, path planning is necessary. Agent-based pedestrian simulators either focus on a specific level (mainly the lowest one) or define strategies such as layered architectures to manage the different behavioral levels independently. In our Multi-Agent Reinforcement-Learning-based Pedestrian simulation framework (MARL-Ped), the situation is addressed as a whole. Each embodied agent uses a model-free Reinforcement Learning (RL) algorithm to learn autonomously to navigate in the virtual environment. The main goal of this work is to demonstrate empirically that MARL-Ped generates learned behaviors adapted to the level required by the pedestrian scenario. Three different experiments, described in the pedestrian modeling literature, are presented to test our approach: (i) choosing between the shortest path and the quickest path; (ii) a crossing between two groups of pedestrians walking in opposite directions inside a narrow corridor; (iii) two agents that move in opposite directions inside a maze. The results show that MARL-Ped solves the different problems, learning individual behaviors with characteristics of pedestrians (local control that produces adequate fundamental diagrams, route-choice capability, emergence of collective behaviors and path planning). In addition, we compared our model with Helbing's social force model, a well-known pedestrian model, and found similarities between the pedestrian dynamics generated by both approaches. These results demonstrate empirically that MARL-Ped generates varied, plausible behaviors, producing human-like macroscopic pedestrian flow.

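This paper considers the distributed tracking control problem for conformable multi-agent systems. $P$-type and $PD^{\alpha}$-type iterative learning control strategies with an initial learning mechanism are designed to solve the tracking problem. The conformable derivative has good properties and can describe practical data sampling with different step sizes. The initial learning mechanism relaxes the initial value condition and improves the algorithm's consensus tracking performance. Under the assumptions of a repeatable operating environment and a directed communication topology, a distributed iterative learning scheme is proposed that achieves finite-time consensus by repeating control attempts on the same trajectory and correcting unsatisfactory control signals with the tracking error. It is rigorously proved that, as the number of iterations increases, the proposed $P$-type and $PD^{\alpha}$-type iterative learning control strategies enable all agents to asymptotically track the reference trajectory. Two representative numerical simulations verify the effectiveness of the algorithm.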
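
The idea of correcting the control signal with the tracking error over repeated trials can be illustrated with a minimal sketch. It is not the paper's scheme: it assumes a single agent, an ordinary discrete-time scalar plant instead of conformable dynamics, and no initial learning mechanism. The plant parameters `a`, `b`, `c`, the learning gain and the sinusoidal reference are all hypothetical choices made for illustration.

```python
import math

def run_trial(u, a=0.5, b=1.0, c=1.0, x0=0.0):
    """One trial of the assumed plant x[t+1] = a*x[t] + b*u[t], y = c*x.

    Returns the outputs y[1..T] produced by the inputs u[0..T-1].
    """
    x, ys = x0, []
    for u_t in u:
        x = a * x + b * u_t
        ys.append(c * x)
    return ys

def p_type_ilc(reference, gain=0.8, iterations=30):
    """Repeat the same trial, correcting the input with the tracking error."""
    u = [0.0] * len(reference)
    max_errors = []
    for _ in range(iterations):
        y = run_trial(u)
        error = [r - y_t for r, y_t in zip(reference, y)]
        max_errors.append(max(abs(e) for e in error))
        # P-type law (assumed form): u_{k+1}(t) = u_k(t) + gain * e_k(t+1)
        u = [u_t + gain * e for u_t, e in zip(u, error)]
    return u, max_errors

# Hypothetical reference trajectory of 20 samples.
reference = [math.sin(0.3 * t) for t in range(1, 21)]
```

Calling `p_type_ilc(reference)` returns the learned input and the per-iteration maximum tracking error. For this assumed plant, the gain satisfies |1 - gain*c*b| < 1, the usual convergence condition for a P-type law of this form.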

Previous deep-learning-based approaches to illuminant estimation either resized the raw image to a lower resolution or randomly cropped image patches for the deep learning model. However, such practices inevitably lead to information loss or to the selection of noisy patches, which affects estimation accuracy. In this paper, we regard patch selection in neural-network-based illuminant estimation as a control problem: selecting image patches so as to remove noisy patches and improve estimation accuracy. To achieve this, we construct a selection network (SeNet) to learn a patch selection policy. Based on data statistics and the learning progression state of the deep illuminant estimation network (DeNet), the SeNet decides which training patches should be input to the DeNet, which in turn gives feedback to the SeNet so that it can update its selection policy. To achieve such interactive and intelligent learning, we use a reinforcement learning approach, policy gradient, to optimize the SeNet. We show that the proposed learning strategy can enhance illuminant estimation accuracy, speed up convergence and improve the stability of the DeNet training process. We evaluate our method on two public datasets and demonstrate that it outperforms state-of-the-art approaches.

Delay Tolerant Reinforcement-Based (DTRB) is a delay-tolerant routing solution for IEEE 802.11 wireless networks that enables device-to-device data exchange without the support of any pre-existing network infrastructure. The solution uses Multi-Agent Reinforcement Learning techniques to learn about routes in the network and to forward or replicate the messages that produce the best reward. The rewarding process is executed by a learning algorithm based on the distances between the nodes, which are calculated as a function of the time since their last meetings. DTRB is a flooding-based delay-tolerant routing solution. The simulation results show that, in densely populated areas, DTRB can deliver more messages than a traditional delay-tolerant routing solution, with similar end-to-end delay and lower network overhead.

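徐鹏, 谢广明, 文家燕, 高远. 《智能系统学报》 (CAAI Transactions on Intelligent Systems), 2019, 14(1): 93-98
To address the high consumption of communication and computing resources in multi-agent formation based on classical reinforcement learning, this paper introduces an event-triggered control mechanism: agents no longer make action decisions at a fixed period, but update their actions according to an event-triggering condition. In designing the event-triggering condition, not only the agent's cumulative reward is considered, but also the deviation between the agent's reward and those of its neighbors; agents interact to seek the optimal joint policy that achieves the formation. Numerical simulation results show that the event-triggered reinforcement learning algorithm for multi-agent formation control effectively reduces the agents' action decision frequency and resource consumption while maintaining system performance.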
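
A rough sketch of event-triggered action selection follows. The abstract does not state the triggering condition, so the rule used here (re-decide when the agent's reward deviates from the mean neighbor reward by more than a threshold) is a hypothetical stand-in, as are the names `should_trigger`, `EventTriggeredAgent` and `threshold`.

```python
def should_trigger(own_reward, neighbor_rewards, threshold):
    """Hypothetical trigger: fire when the agent's reward deviates from
    the mean reward of its neighbors by more than `threshold`."""
    if not neighbor_rewards:
        return True  # no neighbors to compare with: always re-decide
    mean_neighbor = sum(neighbor_rewards) / len(neighbor_rewards)
    return abs(own_reward - mean_neighbor) > threshold

class EventTriggeredAgent:
    """Holds the last action until the trigger condition fires."""

    def __init__(self, policy, threshold):
        self.policy = policy          # callable: state -> action
        self.threshold = threshold
        self.last_action = None
        self.decisions = 0            # number of policy evaluations so far

    def act(self, state, own_reward, neighbor_rewards):
        fire = self.last_action is None or should_trigger(
            own_reward, neighbor_rewards, self.threshold
        )
        if fire:
            self.last_action = self.policy(state)
            self.decisions += 1
        return self.last_action
```

The `decisions` counter makes the saving visible: between triggers the agent reuses its previous action and the policy is not evaluated.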

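To address the slow convergence and low utilization of the experience replay buffer in traditional deep reinforcement learning (DRL), a task offloading strategy based on multi-agent deep reinforcement learning (MADRL) is proposed for disaster emergency scenarios. First, considering that the MEC network environment changes across time slots and that sensor data travels over multiple hops when a disaster occurs, an MADRL-based task offloading model for disaster emergency scenarios is established. Then, to address the slow convergence of traditional DRL caused by a high-dimensional action space, the mutation and crossover operations of the adaptive differential evolution algorithm (ADE) are used to explore the action space, and an adaptive parameter adjustment strategy is proposed to tune the number of ADE iterations, avoiding extensive useless exploration of the action space in the early stage of DRL training. Finally, to further improve data utilization in the experience replay buffer of traditional DRL, prioritized experience replay is added to accelerate network training. Simulation results show that the ADE-DDPG algorithm reduces overall cost by 35% compared with the improved deep deterministic policy gradient (DDPG) network, verifying the effectiveness of ADE-DDPG.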
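
As a rough illustration of prioritized experience replay, the sketch below uses the common proportional scheme, in which transition i is sampled with probability p_i^alpha / sum_j p_j^alpha and its priority is set from the absolute TD error. Whether the paper uses this exact variant is not stated in the abstract. The class name, `alpha` and `eps` are illustrative assumptions, and importance-sampling weights are omitted for brevity.

```python
import random

class PrioritizedReplayBuffer:
    """Proportional prioritized replay on plain lists (O(n) sampling)."""

    def __init__(self, capacity, alpha=0.6, eps=1e-6):
        self.capacity = capacity
        self.alpha = alpha      # 0 = uniform sampling, 1 = fully prioritized
        self.eps = eps          # keeps every priority strictly positive
        self.data = []
        self.priorities = []
        self.pos = 0            # next slot to overwrite once full

    def add(self, transition):
        # New transitions get the current maximum priority so that
        # each one is likely to be replayed at least once.
        max_p = max(self.priorities, default=1.0)
        if len(self.data) < self.capacity:
            self.data.append(transition)
            self.priorities.append(max_p)
        else:
            self.data[self.pos] = transition
            self.priorities[self.pos] = max_p
        self.pos = (self.pos + 1) % self.capacity

    def probabilities(self):
        scaled = [p ** self.alpha for p in self.priorities]
        total = sum(scaled)
        return [s / total for s in scaled]

    def sample(self, batch_size, rng=random):
        indices = rng.choices(
            range(len(self.data)), weights=self.probabilities(), k=batch_size
        )
        return indices, [self.data[i] for i in indices]

    def update(self, indices, td_errors):
        # Larger absolute TD error -> higher chance of being replayed.
        for i, err in zip(indices, td_errors):
            self.priorities[i] = abs(err) + self.eps
```

A production implementation would typically replace the list scan with a sum tree so that sampling costs O(log n); the list version is kept here only for readability.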

This article presents two new algorithms for finding the optimal solution of a Multi-agent Multi-objective Reinforcement Learning problem. Both algorithms apply the concepts of modularization and acceleration by a heuristic function to standard Reinforcement Learning algorithms, in order to simplify and speed up the learning process of an agent that learns in a multi-agent multi-objective environment. To verify the performance of the proposed algorithms, we considered a predator-prey environment in which the learning agent plays the role of the prey, which must escape the pursuing predator while reaching for food at a fixed location. The results show that combining modularization with acceleration by a heuristic function did simplify and speed up learning in a complex problem, compared with algorithms that use neither acceleration nor modularization, such as Q-Learning and Minimax-Q.
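
One common way to accelerate Q-learning with a heuristic is to bias only the action choice by a heuristic term while leaving the Q-update unchanged. The sketch below shows that pattern. It is an assumption that the article's algorithms follow it, and the names `select_action`, `q_update` and the weight `xi` are illustrative. Exploration (for example epsilon-greedy) and the modularization step are omitted.

```python
def select_action(q, h, state, actions, xi=1.0):
    """Greedy choice over Q(s, a) + xi * H(s, a).

    `q` and `h` are dicts keyed by (state, action); missing entries count as 0.
    The heuristic H influences which action is tried, not the learned values.
    """
    return max(
        actions,
        key=lambda a: q.get((state, a), 0.0) + xi * h.get((state, a), 0.0),
    )

def q_update(q, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.9):
    """Standard one-step Q-learning update, unaffected by the heuristic."""
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
```

Because the heuristic enters only through action selection, a poor heuristic can slow learning down but does not change the values that the Q-update converges to.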

Pedestrian models need to be validated before being applied to real-life planning, so the validation of these models deserves special investigation. In this work, we perform two validation exercises with the pedestrian models FDS+Evac and JuPedSim, based on a well-controlled pedestrian experiment. A comprehensive combination of multiple characteristics is used to enhance the reliability of the validation results, including model stability, pedestrian flow, time series of density and velocity, spatiotemporal profiles and pedestrian trajectories. The results show that both FDS+Evac and JuPedSim have weaknesses in reproducing the full set of pedestrian characteristics realistically. Our validation exercises illustrate that a single characteristic is not enough to guarantee a reliable validation result and that a comprehensive combination of multiple characteristics is necessary. This work exposes the defects in most existing validations of pedestrian models and presents a general validation procedure for pedestrian models in future research.

Multi-agent reinforcement learning technologies are mainly investigated from two perspectives: concurrence and game theory. The former chiefly applies to cooperative multi-agent systems, while the latter usually applies to coordinated multi-agent systems. However, agents using these approaches face problems such as credit assignment and multiple Nash equilibria. In this paper, we propose a new multi-agent reinforcement learning model and algorithm, LMRL, from a layered perspective. The LMRL model is composed of an off-line training layer, which employs single-agent reinforcement learning to acquire stationary strategy knowledge, and an online interaction layer, which employs multi-agent reinforcement learning together with the dynamically revisable strategy knowledge to interact with the environment. An agent with LMRL can improve its generalization capability, adaptability and coordination ability. Experiments show that LMRL can outperform both single-agent reinforcement learning and Nash-Q.

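Non-stationarity is one of the main challenges for deep learning in multi-agent environments. It breaks the Markov assumption followed by most single-agent reinforcement learning algorithms, so that each agent may, during learning, become trapped in an endless loop within the environment created by the other agents. To solve this problem, the implementation of the centralized training with decentralized execution (CTDE) architecture in reinforcement learning is studied, and, from the two perspectives of inter-agent communication and agent exploration, the QMIX algorithm is improved by adopting the variance-based control (VBC) reinforcement learning algorithm and introducing a curiosity mechanism. The proposed algorithm is verified on micromanagement scenarios in the StarCraft II Learning Environment (SC2LE). Experimental results show that, compared with QMIX, the proposed algorithm performs better and yields a training model that converges faster.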

Applied Intelligence - A multi-agent system (MAS) is expected to be applied to various real-world problems where a single agent cannot accomplish given tasks. Due to the inherent complexity in the...

In human-robot collaboration in manufacturing, operator safety is the primary concern during manufacturing operations. This paper presents a deep reinforcement learning approach to real-time collision-free motion planning of an industrial robot for human-robot collaboration. First, the safe human-robot collaboration manufacturing problem is formulated as a Markov decision process, and the mathematical expression of the reward function design problem is given. The goal is for the robot to autonomously learn a policy that reduces the accumulated risk while assuring the task completion time during human-robot collaboration. To transform the optimization objective into a reward function that guides the robot to learn the expected behaviour, a reward function optimizing approach based on the deterministic policy gradient is proposed to learn a parameterized intrinsic reward function. The reward function from which the agent learns its policy is the sum of the intrinsic reward function and the extrinsic reward function. Then, a deep reinforcement learning algorithm, intrinsic reward-deep deterministic policy gradient (IRDDPG), which combines the DDPG algorithm with the reward function optimizing approach, is proposed to learn the expected collision avoidance policy. Finally, the proposed algorithm is tested in a simulation environment, and the results show that the industrial robot can learn the expected policy and assure safety in industrial human-robot collaboration without missing the original target. Moreover, the reward function optimizing approach can compensate for shortcomings of the designed reward function and improve policy performance.

In existing Active Access Control (AAC) models, the scalability and flexibility of security policy specification should be well balanced; in particular: (1) authorizations to large numbers of tasks should be simplified; (2) team workflows should be enabled; (3) fine-grained constraints should be enforced. To address this issue, a family of Association-Based Active Access Control (ABAAC) models is proposed. In the minimal model, ABAAC0, users are assigned to roles while permissions are assigned to task-role associations. In a workflow case, some users assigned to an association's component role are allocated to execute that association. They can exercise the association's assigned permissions while the task is running in the case. In ABAAC1, a generalized association is employed to extract common authorizations from multiple associations. In ABAAC2, a fine-grained separation of duty (SoD) is enforced among associations. In the maximal model, ABAAC3, all these features are integrated, and similar constraints can be specified more concisely. Case validation is performed using a software workflow. Comparison with a representative association-based AAC model and with the most scalable AAC model so far indicates that: (1) sufficient scalability is achieved; (2) different permissions can be authorized to multiple roles in a task without decomposing it; (3) separation of duties more fine-grained than roles and tasks can be enforced.

In an environment where robots coexist with humans, mobile robots should be human-aware and comply with humans' behavioural norms so as not to disturb humans' personal space and activities. In this work, we propose an inverse-reinforcement-learning-based time-dependent A* planner for human-aware robot navigation with local vision. In this method, the planning process of time-dependent A* is regarded as a Markov decision process, and the cost function of the time-dependent A* is learned by inverse reinforcement learning from captured human demonstration trajectories. With this method, a robot can plan a path that complies with humans' behaviour patterns and the robot's kinematics. When constructing the feature vectors of the cost function, we take the characteristics of local vision into account and propose a visual coverage feature that enables robots to learn from how humans move in a limited visual field. The effectiveness of the proposed method has been validated by experiments in real-world scenarios: using this approach, robots can effectively mimic human motion patterns when avoiding pedestrians; furthermore, in a limited visual field, robots can learn to choose a path that gives them larger visual coverage, which results in better navigation performance.

Reinforcement learning (RL) has appealed to many researchers in recent years because of its generality. It is an approach to machine intelligence that learns to achieve a given goal through trial-and-error interaction with its environment. This paper proposes a case-based reinforcement learning algorithm (CRL) for dynamic inventory control in a multi-agent supply-chain system. Traditional time-triggered and event-triggered ordering policies remain popular because they are easy to implement. But in a dynamic environment, their results may become inaccurate, causing excessive inventory (cost) or shortage. Under nonstationary customer demand, the S value of the (T, S) and (Q, S) inventory review methods is learned using the proposed algorithm to satisfy a target service level. A multi-agent simulation of a simplified two-echelon supply chain, in which the proposed algorithm is implemented, is run several times. The results show the effectiveness of CRL in both review methods. We also consider a framework for a general learning method based on the proposed one, which may be helpful in all aspects of supply-chain management (SCM). Hence, it is suggested that well-designed "connections" need to be built between CRL, multi-agent system (MAS) and SCM.
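
For readers unfamiliar with the (T, S) review method, a minimal sketch of the policy itself may help: every T periods the stock is raised to the order-up-to level S. The sketch does not include the CRL algorithm that learns S. It assumes zero lead time and lost sales, and it uses the fill rate as a stand-in for the service level; all of these, along with the function name, are simplifying assumptions not taken from the paper.

```python
def simulate_ts_policy(demands, review_period, order_up_to, initial_stock=0):
    """Simulate a periodic-review (T, S) policy over a demand sequence.

    Assumes orders arrive immediately and unmet demand is lost.
    Returns (fill_rate, orders), where fill_rate is the fraction of
    total demand that was served from stock.
    """
    stock = initial_stock
    served = 0
    orders = []
    for t, demand in enumerate(demands):
        if t % review_period == 0:
            # Review point: order enough to reach the level S.
            quantity = max(order_up_to - stock, 0)
            stock += quantity
            orders.append(quantity)
        sold = min(stock, demand)
        stock -= sold
        served += sold
    total = sum(demands)
    fill_rate = served / total if total else 1.0
    return fill_rate, orders
```

Running the simulation for several candidate values of S against a demand trace shows the trade-off that a learning method has to resolve: a larger S raises the fill rate but also the average inventory held.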

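To locate data in domain texts, the text is treated as an environment, and a data location method based on multi-agent hierarchical reinforcement learning is proposed to handle the dynamics and uncertainty of the text environment. The method exploits the hierarchical structure to decompose the system task into several subtasks, and each individual agent learns its corresponding subtask, which confines policy updates to a small local space. It also exploits the fact that, in a multi-agent system, each agent shares the system's long-term goal: a policy coordination mechanism is introduced in which agents exchange information to discover trend information, and shaping is used to give each agent trend-based guidance from dynamic knowledge acquired online, speeding up agent convergence. The method is applied to court judgment documents in the judicial domain. Experimental results show that it locates target data efficiently and accurately in large-scale, complex, unknown text environments, with average accuracy and F-value reaching 96.6% and 98.2%, and with good convergence speed. The method can therefore locate data well in domain texts and has considerable theoretical and practical significance.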

Unmanned aerial vehicles (UAVs) are recognized as an effective means of delivering emergency communication services when terrestrial infrastructure is unavailable. This paper investigates a multi-UAV-assisted communication system, where we jointly optimize the UAVs' trajectories, user association, and ground users' (GUs') transmit power to maximize a defined fairness-weighted throughput metric. Owing to the dynamic nature of UAVs, this problem has to be solved in real time. However, the problem's non-co...

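Matching workers with tasks is one of the core problems in spatial crowdsourcing research, but existing methods usually ignore the influence of workers' routes on task assignment results. Traditional task assignment methods suffer from slow computation, narrow applicability, and weak cooperation. To address this, the road-network-oriented spatial crowdsourcing task assignment problem is studied from the perspective of the spatial crowdsourcing platform. With the objective of minimizing task completion time, a QMIX-A* algorithm based on multi-agent reinforcement learning that accounts for worker path planning is proposed, which shortens the average task completion time and thereby improves user satisfaction. Extensive numerical simulations verify the effectiveness and stability of QMIX-A*, providing decision support for spatial crowdsourcing service platforms in choosing task assignment and path optimization strategies.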
