Similar Documents
20 similar documents found (search time: 31 ms)
1.
张峻伟  吕帅  张正昊  于佳玉  龚晓宇 《软件学报》2022,33(11):4217-4238
Deep reinforcement learning combines the representational power of deep learning with the decision-making ability of reinforcement learning, and its striking performance on complex control tasks has triggered a wave of research. Taking whether the Bellman equation is used as the criterion, model-free deep reinforcement learning methods are divided into Q-value function methods and policy gradient methods, and the two families are reviewed in terms of model construction, optimization history, and method evaluation. The low sample efficiency of deep reinforcement learning methods is then discussed: based on the modeling characteristics of the two families, the overestimation problem of Q-value function methods and the unbiased-sampling constraint of policy gradient methods are identified as the main causes of their limited sample efficiency. From the two perspectives of improving exploration efficiency and improving sample utilization, feasible optimization methods are summarized according to recent research hotspots and trends, their advantages and remaining problems are analyzed, and their applicable scope and optimization effects are compared. Finally, improving the generality of sample-efficiency optimization methods, exploring the transfer of optimization mechanisms between the two families of methods, and strengthening theoretical completeness are proposed as future research directions.
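As a concrete illustration of the overestimation problem noted above, the sketch below contrasts the standard max-based Q-learning target with a Double-Q style target that decouples action selection from action evaluation. It is a minimal tabular sketch; names such as q_online and q_target are illustrative assumptions, not notation from the survey.

```python
import numpy as np

def td_targets(q_online, q_target, rewards, next_states, gamma=0.99):
    """Compare the standard max-based target with a Double-Q style target.

    q_online, q_target: arrays of shape [n_states, n_actions] (illustrative
    tabular stand-ins for the online and target value estimates).
    rewards, next_states: 1-D arrays for a batch of transitions.
    """
    # Standard target: the same estimate both selects and evaluates the action,
    # so positive estimation noise is propagated (overestimation bias).
    standard = rewards + gamma * q_target[next_states].max(axis=1)

    # Double-Q style target: the online estimate selects the action,
    # the target estimate evaluates it, which damps the bias.
    greedy = q_online[next_states].argmax(axis=1)
    double = rewards + gamma * q_target[next_states, greedy]
    return standard, double
```

In the deep setting the same two targets correspond to DQN and Double DQN, computed from the online and target networks.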

2.
On one hand, multiple object detection approaches of Hough transform (HT) type and randomized HT (RHT) type have been extended into a general, evidence-accumulation-featured framework for problem solving, with five key mechanisms elaborated and several extensions of HT and RHT presented. On the other hand, another framework is proposed to integrate typical multi-learner based approaches for problem solving, particularly Gaussian mixture based data clustering and local subspace learning, multi-sets mixture based object detection and motion estimation, and multi-agent coordinated problem solving. Typical learning algorithms, especially those based on rival penalized competitive learning (RPCL) and Bayesian Ying-Yang (BYY) learning, are summarized from a unified perspective with new extensions. Furthermore, the two different frameworks are not only examined each from the perspective of the other, yielding new insights and extensions, but also further unified into a general problem solving paradigm that consists of five basic mechanisms in terms of acquisition, allocation, amalgamation, admission, and affirmation, or, for short, the A5 paradigm.

3.
A Survey of Multi-Agent Deep Reinforcement Learning   (Total citations: 10; self-citations: 4; citations by others: 6)
In recent years, deep reinforcement learning (DRL) has achieved major breakthroughs on many complex sequential decision-making problems. By fusing the powerful representation ability of deep learning with the effective policy-search ability of reinforcement learning, DRL has become a promising learning paradigm for realizing artificial intelligence. However, many difficulties and challenges remain in the research and application of DRL in multi-agent systems; multi-agent learning under partial observability, as exemplified by StarCraft II, is still far from satisfactory. This paper briefly introduces representative deep reinforcement learning algorithms and related techniques, such as deep Q-networks and deep policy gradient algorithms. Existing multi-agent DRL algorithms are then organized from the perspective of the communication process into three mainstream forms: full communication with centralized decision-making, full communication with autonomous decision-making, and limited communication with autonomous decision-making. Key issues in multi-agent DRL, including training architectures, sample augmentation, robustness, and opponent modeling, are discussed, and the research hotspots and prospects of multi-agent DRL are analyzed.

4.

Deep learning techniques have shown success in learning from raw high-dimensional data in various applications. While deep reinforcement learning is recently gaining popularity as a method to train intelligent agents, utilizing deep learning in imitation learning has been scarcely explored. Imitation learning can be an efficient method to teach intelligent agents by providing a set of demonstrations to learn from. However, generalizing to situations that are not represented in the demonstrations can be challenging, especially in 3D environments. In this paper, we propose a deep imitation learning method to learn navigation tasks from demonstrations in a 3D environment. The supervised policy is refined using active learning in order to generalize to unseen situations. This approach is compared to two popular deep reinforcement learning techniques: deep Q-networks (DQN) and asynchronous advantage actor-critic (A3C). The proposed method as well as the reinforcement learning methods employ deep convolutional neural networks and learn directly from raw visual input. Methods for combining learning from demonstrations and experience are also investigated. This combination aims to join the generalization ability of learning by experience with the efficiency of learning by imitation. The proposed methods are evaluated on four navigation tasks in a 3D simulated environment. Navigation tasks are a typical problem that is relevant to many real applications. They pose the challenge of requiring demonstrations of long trajectories to reach the target and only providing delayed rewards (usually terminal) to the agent. The experiments show that the proposed method can successfully learn navigation tasks from raw visual input, while the learning-from-experience methods fail to learn an effective policy. Moreover, it is shown that active learning can significantly improve the performance of the initially learned policy using a small number of active samples.
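One simple way to combine learning from demonstrations with learning from experience, as investigated above, is to mix a supervised imitation term with a policy-gradient term in a single loss. The PyTorch sketch below is a minimal illustration under that assumption; the weighting scheme and names (bc_weight, combined_loss) are hypothetical, not the paper's exact method.

```python
import torch
import torch.nn.functional as F

def combined_loss(policy_logits, demo_actions, log_probs, advantages, bc_weight=0.5):
    """Mix a behavioural-cloning term with a policy-gradient term.

    policy_logits: logits produced for states drawn from the demonstrations.
    demo_actions:  expert actions (long tensor) for those states.
    log_probs:     log pi(a|s) for actions taken while interacting (experience).
    advantages:    advantage estimates for those interactions.
    """
    # Supervised term: push the policy toward the demonstrated actions.
    bc_loss = F.cross_entropy(policy_logits, demo_actions)
    # Reinforcement term: standard policy-gradient objective on collected experience.
    pg_loss = -(log_probs * advantages.detach()).mean()
    return bc_weight * bc_loss + (1.0 - bc_weight) * pg_loss
```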


5.
Mahadevan  Sridhar 《Machine Learning》1996,22(1-3):159-195
This paper presents a detailed study of average reward reinforcement learning, an undiscounted optimality framework that is more appropriate for cyclical tasks than the much better studied discounted framework. A wide spectrum of average reward algorithms are described, ranging from synchronous dynamic programming methods to several (provably convergent) asynchronous algorithms from optimal control and learning automata. A general sensitive discount optimality metric called n-discount-optimality is introduced, and used to compare the various algorithms. The overview identifies a key similarity across several asynchronous algorithms that is crucial to their convergence, namely independent estimation of the average reward and the relative values. The overview also uncovers a surprising limitation shared by the different algorithms: while several algorithms can provably generate gain-optimal policies that maximize average reward, none of them can reliably filter these to produce bias-optimal (or T-optimal) policies that also maximize the finite reward to absorbing goal states. This paper also presents a detailed empirical study of R-learning, an average reward reinforcement learning method, using two empirical testbeds: a stochastic grid world domain and a simulated robot environment. A detailed sensitivity analysis of R-learning is carried out to test its dependence on learning rates and exploration levels. The results suggest that R-learning is quite sensitive to exploration strategies and can fall into sub-optimal limit cycles. The performance of R-learning is also compared with that of Q-learning, the best studied discounted RL method. Here, the results suggest that R-learning can be fine-tuned to give better performance than Q-learning in both domains.
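For reference, the core of R-learning discussed above fits in a few lines. The following is a minimal tabular sketch assuming discrete states and actions; the learning rates beta (relative values) and alpha (average-reward estimate) and the exact ordering of the two updates are illustrative choices rather than the paper's precise formulation.

```python
import numpy as np

def r_learning_step(R, rho, s, a, r, s_next, alpha=0.01, beta=0.1):
    """One tabular R-learning update (average-reward RL, no discounting).

    R:   [n_states, n_actions] table of relative action values.
    rho: current estimate of the average reward per step.
    """
    target = r - rho + R[s_next].max()
    R[s, a] += beta * (target - R[s, a])
    # Update the average-reward estimate only on greedy (exploiting) steps,
    # keeping rho estimated independently of the relative values.
    if a == np.argmax(R[s]):
        rho += alpha * (r + R[s_next].max() - R[s].max() - rho)
    return R, rho
```

The point to note, echoing the convergence observation above, is that the average reward rho is estimated independently of the relative values R.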

6.
An adaptive control scheme based on a fuzzy neural network is proposed. For complex learning tasks in continuous spaces, a competitive Takagi-Sugeno fuzzy reinforcement learning network is proposed; its structure integrates a Takagi-Sugeno fuzzy inference system with reinforcement learning based on an action-dependent evaluation value function. Accordingly, an optimized learning algorithm is proposed that trains the competitive Takagi-Sugeno fuzzy reinforcement learning network into a so-called Takagi-Sugeno fuzzy variable-structure controller. Taking a single inverted pendulum control system as an example, simulation studies show that the proposed learning algorithm outperforms other reinforcement learning algorithms.

7.
Competitive Takagi-Sugeno Fuzzy Reinforcement Learning   (Total citations: 4; self-citations: 0; citations by others: 4)
For complex learning tasks in continuous spaces, a competitive Takagi-Sugeno fuzzy reinforcement learning network (CTSFRLN) is proposed; its structure integrates a Takagi-Sugeno fuzzy inference system with reinforcement learning based on an action-dependent evaluation value function. Two learning algorithms are proposed accordingly, a competitive Takagi-Sugeno fuzzy Q-learning algorithm and a competitive Takagi-Sugeno fuzzy winner-learning algorithm, which train the CTSFRLN into a so-called Takagi-Sugeno fuzzy variable-structure controller. Taking a double inverted pendulum control system as an example, simulation studies show that the proposed learning algorithms outperform other reinforcement learning algorithms.
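To make the combination of Takagi-Sugeno inference with action-value learning concrete, the sketch below shows a minimal fuzzy Q-learning loop in the same spirit: each rule keeps Q-values over a small set of discrete candidate actions, rule firing strengths blend the per-rule choices into one continuous action, and the TD error is credited back to the rules in proportion to how strongly they fired. This is an illustrative sketch, not the CTSFRLN algorithms themselves; class and variable names (FuzzyQ, phi, picks) are assumptions.

```python
import numpy as np

class FuzzyQ:
    """Minimal Takagi-Sugeno-style fuzzy Q-learning sketch (illustrative)."""

    def __init__(self, n_rules, candidate_actions, alpha=0.05, gamma=0.95):
        self.q = np.zeros((n_rules, len(candidate_actions)))  # per-rule action values
        self.actions = np.asarray(candidate_actions, dtype=float)
        self.alpha, self.gamma = alpha, gamma

    def act(self, phi, eps=0.1):
        """phi: normalized rule firing strengths (numpy array) for the current state."""
        # Each rule picks a candidate action (epsilon-greedy); the global
        # continuous action is the firing-strength-weighted blend.
        picks = np.where(np.random.rand(len(phi)) < eps,
                         np.random.randint(self.q.shape[1], size=len(phi)),
                         self.q.argmax(axis=1))
        action = float(phi @ self.actions[picks])
        value = float(phi @ self.q[np.arange(len(phi)), picks])
        return action, picks, value

    def update(self, phi, picks, value, reward, phi_next):
        # TD target uses the greedy blended value of the next state.
        next_value = float(phi_next @ self.q.max(axis=1))
        td_error = reward + self.gamma * next_value - value
        # Credit each rule in proportion to how strongly it fired.
        self.q[np.arange(len(phi)), picks] += self.alpha * td_error * phi
```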

8.
Task scheduling is important for the proper functioning of parallel processor systems. The static scheduling of tasks onto networks of parallel processors is well-defined and documented in the literature. However, in many practical situations a priori information about the tasks that need to be scheduled is not available. In such situations, tasks usually arrive dynamically and the scheduling should be performed on-line or “on the fly”. In this paper, we present a framework based on stochastic reinforcement learning, which is usually used to solve optimization problems in a simple and efficient way. The use of reinforcement learning reduces the dynamic scheduling problem to that of learning a stochastic approximation of an unknown average error surface. The main advantage of the proposed approach is that no prior information is required about the parallel processor system under consideration. The learning system develops an association between the best action (schedule) and the current state of the environment (parallel system). The performance of reinforcement learning is demonstrated by solving several dynamic scheduling problems. The conditions under which reinforcement learning can be used to efficiently solve the dynamic scheduling problem are highlighted.
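As a toy illustration of associating a schedule with the state of the parallel system, the sketch below learns, with tabular Q-learning, which processor an arriving task should be sent to; the state is a coarsely discretized vector of queue lengths and the reward penalizes the observed completion time. The discretization, reward, and names (QScheduler) are illustrative assumptions, not the framework proposed in the paper.

```python
import numpy as np

class QScheduler:
    """Toy on-line scheduler: assign each arriving task to one of n processors."""

    def __init__(self, n_procs, load_levels=4, alpha=0.1, gamma=0.9, eps=0.1):
        self.n_procs, self.levels = n_procs, load_levels
        self.q = {}  # state (tuple of discretized loads) -> array of action values
        self.alpha, self.gamma, self.eps = alpha, gamma, eps

    def _state(self, loads):
        # Discretize each processor's queue length into a few load levels.
        return tuple(min(int(l), self.levels - 1) for l in loads)

    def choose(self, loads):
        s = self._state(loads)
        q = self.q.setdefault(s, np.zeros(self.n_procs))
        if np.random.rand() < self.eps:
            return int(np.random.randint(self.n_procs))
        return int(q.argmax())

    def learn(self, loads, action, completion_time, next_loads):
        s, s2 = self._state(loads), self._state(next_loads)
        q = self.q.setdefault(s, np.zeros(self.n_procs))
        q2 = self.q.setdefault(s2, np.zeros(self.n_procs))
        reward = -completion_time  # shorter completion time => higher reward
        q[action] += self.alpha * (reward + self.gamma * q2.max() - q[action])
```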

9.
An adaptive control scheme based on a fuzzy neural network is proposed. For complex learning tasks in continuous spaces, a competitive Takagi-Sugeno fuzzy reinforcement learning network is proposed; its structure integrates a Takagi-Sugeno fuzzy inference system with reinforcement learning based on an action-dependent evaluation value function. Accordingly, an optimized learning algorithm is proposed that trains the competitive Takagi-Sugeno fuzzy reinforcement learning network into a so-called Takagi-Sugeno fuzzy variable-structure controller. Taking a single inverted pendulum control system as an example, simulation studies show that the proposed learning algorithm outperforms other reinforcement learning algorithms.

10.
This paper studies the problem of transfer learning in the context of reinforcement learning. We propose a novel transfer learning method that can speed up reinforcement learning with the aid of previously learnt tasks. Before performing extensive learning episodes, our method attempts to analyze the learning task via some exploration in the environment, and then attempts to reuse previous learning experience whenever it is possible and appropriate. In particular, our proposed method consists of four stages: 1) subgoal discovery, 2) option construction, 3) similarity searching, and 4) option reusing. Especially, in order to fulfill the task of identifying similar options, we propose a novel similarity measure between options, which is built upon the intuition that similar options have similar state-action probabilities. We examine our algorithm using extensive experiments, comparing it with existing methods. The results show that our method outperforms conventional non-transfer reinforcement learning algorithms, as well as existing transfer learning methods, by a wide margin.
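The intuition that "similar options have similar state-action probabilities" can be instantiated in several ways; the sketch below scores two options by a Jensen-Shannon-style comparison of their empirical state-action visitation distributions. This is one simple, assumed instantiation for illustration, not necessarily the paper's exact measure.

```python
import numpy as np

def option_similarity(p, q, eps=1e-12):
    """Similarity between two options given their empirical state-action
    visitation distributions p and q (arrays of shape [n_states, n_actions]).

    Uses a symmetric Jensen-Shannon style measure: 1 = identical behaviour,
    0 = maximally different behaviour.
    """
    p = p.flatten() / (p.sum() + eps)
    q = q.flatten() / (q.sum() + eps)
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    js = 0.5 * kl(p, m) + 0.5 * kl(q, m)   # bounded by log(2)
    return 1.0 - js / np.log(2)
```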

11.
Imitation learning combines reinforcement learning with supervised learning: its goal is to learn the expert's policy by observing expert demonstrations and thereby accelerate reinforcement learning. By introducing additional task-related information, imitation learning can reach an optimized policy faster than plain reinforcement learning, offering a remedy for low sample efficiency. Imitation learning has become a popular framework for solving reinforcement learning problems, and a variety of algorithms and techniques for improving learning performance have emerged. Combined with recent advances in graphics and image research, imitation learning has played an important role in fields such as game artificial intelligence (AI), robot control, and autonomous driving. Focusing on the year's developments in imitation learning, this paper provides an in-depth discussion from the angles of behavioral cloning, inverse reinforcement learning, adversarial imitation learning, imitation learning from observation, and cross-domain imitation learning, introduces the latest practical applications of imitation learning, compares the state of research at home and abroad, and looks ahead to future directions of the field. The aim is to give researchers and practitioners an up-to-date account of progress in imitation learning as a reference for their own work.
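Behavioral cloning, the simplest of the families listed above, reduces imitation to supervised learning on (state, expert action) pairs. The PyTorch sketch below illustrates this under the assumption of discrete actions; the network size and function names are illustrative.

```python
import torch
import torch.nn as nn

def behavioural_cloning(states, expert_actions, n_actions, epochs=50, lr=1e-3):
    """Fit a policy to expert (state, action) pairs by supervised learning.

    states:         float tensor [N, state_dim]
    expert_actions: long tensor  [N] of discrete expert action indices
    """
    policy = nn.Sequential(
        nn.Linear(states.shape[1], 64), nn.ReLU(),
        nn.Linear(64, n_actions),
    )
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(policy(states), expert_actions)  # imitate the expert labels
        loss.backward()
        opt.step()
    return policy
```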

12.
As a hallmark of the era of automation and intelligence, the development of robotics has become a focus of research in intelligent control, and a variety of robot-oriented intelligent control techniques have emerged; robots are increasingly used to perform complex multi-contact interaction tasks with their environment. Taking complex multi-contact robot-environment interaction tasks as the core problem, and building on research into training robot agents with reinforcement learning, this paper surveys reinforcement learning methods for realizing robot multi-contact interaction tasks. It outlines representative studies of reinforcement learning for multi-contact robot tasks, the problems in current research, and the optimization methods used to improve experimental results on multi-contact interaction tasks, and, combining current results with the characteristics of each optimization method, discusses the outlook for intelligent control methods for future robot multi-contact interaction tasks.

13.
Research on Methods for Speeding up Reinforcement Learning   (Total citations: 4; self-citations: 0; citations by others: 4)
The term reinforcement learning comes from behavioral psychology, which regards learning as a trial-and-error process for mapping environment states to actions. This characteristic inevitably makes intelligent systems harder to build and lengthens learning time. Reinforcement learning is slow mainly because there is no explicit supervision signal: when interacting with the environment, a reinforcement learning system has to rely on trial and error and on external evaluative signals to adjust its behavior, so the intelligent system inevitably goes through a long learning process. How to speed up reinforcement learning is therefore one of the most important research questions. This paper discusses methods for speeding up reinforcement learning from several perspectives.

14.
A Survey of the Sparse Reward Problem in Deep Reinforcement Learning   (Total citations: 1; self-citations: 0; citations by others: 1)
Reinforcement learning, an important branch of machine learning, is a family of methods that search for an optimal policy through interaction with the environment. In recent years reinforcement learning has been widely combined with deep learning, forming the research field of deep reinforcement learning. As a new machine learning method, deep reinforcement learning can both perceive complex inputs and solve for optimal policies, and can be applied to complex decision problems such as robot control. The sparse reward problem is a core difficulty that deep reinforcement learning faces when solving tasks, and it is widespread in practical applications. Solving the sparse reward problem helps improve sample utilization and the quality of the learned policy, and promotes the application of deep reinforcement learning to real tasks. This paper first describes the core algorithms of deep reinforcement learning; it then introduces five families of solutions to the sparse reward problem, including reward design and learning, experience replay mechanisms, exploration and exploitation, multi-goal learning, and auxiliary tasks; finally, it summarizes the related work and discusses future prospects.
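Among the five families above, the experience replay and multi-goal lines are easy to illustrate together: the sketch below performs hindsight-style goal relabelling, storing a failed trajectory a second time with the state actually reached treated as the goal, so that sparse failures still yield learning signal. It is a minimal sketch assuming goal-conditioned transitions; the dictionary layout and reward_fn are illustrative.

```python
def relabel_with_hindsight(trajectory, reward_fn):
    """Hindsight-style relabelling of a goal-conditioned trajectory.

    trajectory: list of dicts with keys 'state', 'action', 'next_state', 'goal'.
    reward_fn(next_state, goal) -> float, e.g. 0 if the goal is reached else -1.
    Returns the original transitions plus copies relabelled with the goal the
    agent actually reached at the end of the episode.
    """
    achieved_goal = trajectory[-1]['next_state']  # what the agent actually got to
    relabelled = [dict(t, goal=achieved_goal,
                       reward=reward_fn(t['next_state'], achieved_goal))
                  for t in trajectory]
    original = [dict(t, reward=reward_fn(t['next_state'], t['goal']))
                for t in trajectory]
    return original + relabelled
```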

15.
A Survey of Imitation Learning Based on Generative Adversarial Networks   (Total citations: 1; self-citations: 0; citations by others: 1)
Imitation learning studies how to learn from an expert's decision data so as to obtain a decision model close to the expert's level. Reinforcement learning, which also learns how to make decisions, usually learns only from the environment's evaluative feedback; by comparison, imitation learning obtains more direct feedback from the decision data. It can be divided into two families: behavioral cloning and imitation learning based on inverse reinforcement learning. The latter decomposes imitation learning into two sub-processes, inverse reinforcement learning and reinforcement learning, and iterates between them: inverse reinforcement learning derives a reward function consistent with the expert's decision data, while reinforcement learning learns a policy under that reward function. Imitation learning based on generative adversarial networks developed out of this line of work; its earliest and most representative method is generative adversarial imitation learning (GAIL). A generative adversarial network consists of two adversarial neural networks, a discriminator and a generator. The hallmark of GAIL is that it solves the imitation learning problem within the generative adversarial framework: training the discriminator is analogous to learning a reward function, and training the generator is analogous to learning a policy. Compared with traditional imitation learning methods, GAIL offers better robustness, representation ability, and computational efficiency, so it can handle complex, large-scale problems and be extended to practical applications. However, GAIL suffers from problems such as mode collapse and low sample efficiency in environment interaction. Recent work has addressed these problems using generative adversarial network techniques and reinforcement learning techniques, and has extended GAIL with respect to observation mechanisms, multi-agent systems, and other aspects. This paper first introduces the main ideas of GAIL and its strengths and weaknesses, then categorizes, analyzes, and compares improved GAIL algorithms, and finally concludes and discusses possible future trends.
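To make the discriminator-as-reward analogy concrete, the sketch below shows the two alternating pieces of a GAIL-style step in PyTorch: the discriminator is trained to separate expert from policy state-action pairs, and its output is converted into a surrogate reward for the policy's reinforcement learning update. The network shapes, optimizer handling, and the -log(1-D) reward form are illustrative assumptions, one common choice among several.

```python
import torch
import torch.nn.functional as F

def gail_discriminator_step(disc, disc_opt, expert_sa, policy_sa):
    """Train D to output ~1 on expert (s, a) pairs and ~0 on policy pairs."""
    logits_exp = disc(expert_sa)
    logits_pol = disc(policy_sa)
    loss = F.binary_cross_entropy_with_logits(logits_exp, torch.ones_like(logits_exp)) + \
           F.binary_cross_entropy_with_logits(logits_pol, torch.zeros_like(logits_pol))
    disc_opt.zero_grad()
    loss.backward()
    disc_opt.step()

def gail_reward(disc, policy_sa):
    """Surrogate reward for the policy update: high where D mistakes the
    policy's behaviour for the expert's (one common choice: -log(1 - D))."""
    with torch.no_grad():
        d = torch.sigmoid(disc(policy_sa))
    return -torch.log(1.0 - d + 1e-8)
```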

16.
The complexity in planning and control of robot compliance tasks mainly results from the simultaneous control of position and force and from the inevitable contact with the environment. It is quite difficult to model the interaction between the robot and the environment accurately during contact. In addition, the interaction with the environment varies even across compliance tasks of the same kind. To deal with these phenomena, in this paper we propose a reinforcement learning and robust control scheme for robot compliance tasks. A reinforcement learning mechanism is used to tackle variations among compliance tasks of the same kind. A robust compliance controller that guarantees system stability in the presence of modeling uncertainties and external disturbances is used to execute control commands sent from the reinforcement learning mechanism. Simulations based on deburring compliance tasks demonstrate the effectiveness of the proposed scheme.

17.
In this paper, we investigate the use of hierarchical reinforcement learning (HRL) to speed up the acquisition of cooperative multi-agent tasks. We introduce a hierarchical multi-agent reinforcement learning (RL) framework, and propose a hierarchical multi-agent RL algorithm called Cooperative HRL. In this framework, agents are cooperative and homogeneous (use the same task decomposition). Learning is decentralized, with each agent learning three interrelated skills: how to perform each individual subtask, the order in which to carry them out, and how to coordinate with other agents. We define cooperative subtasks to be those subtasks in which coordination among agents significantly improves the performance of the overall task. Those levels of the hierarchy which include cooperative subtasks are called cooperation levels. A fundamental property of the proposed approach is that it allows agents to learn coordination faster by sharing information at the level of cooperative subtasks, rather than attempting to learn coordination at the level of primitive actions. We study the empirical performance of the Cooperative HRL algorithm using two testbeds: a simulated two-robot trash collection task, and a larger four-agent automated guided vehicle (AGV) scheduling problem. We compare the performance and speed of Cooperative HRL with other learning algorithms, as well as several well-known industrial AGV heuristics. We also address the issue of rational communication behavior among autonomous agents in this paper. The goal is for agents to learn both action and communication policies that together optimize the task given a communication cost. We extend the multi-agent HRL framework to include communication decisions and propose a cooperative multi-agent HRL algorithm called COM-Cooperative HRL. In this algorithm, we add a communication level to the hierarchical decomposition of the problem below each cooperation level. Before an agent makes a decision at a cooperative subtask, it decides if it is worthwhile to perform a communication action. A communication action has a certain cost and provides the agent with the actions selected by the other agents at a cooperation level. We demonstrate the efficiency of the COM-Cooperative HRL algorithm as well as the relation between the communication cost and the learned communication policy using a multi-agent taxi problem.

18.
沙宗轩  薛菲  朱杰 《计算机应用》2019,39(2):501-508
To address the slow convergence of reinforcement learning when robots face tasks with large state spaces, a priority-based scheduling strategy for parallel reinforcement learning tasks is proposed. First, the convergence of Q-learning under an asynchronous parallel computing model is proved. Then, the complex problem is partitioned by state space; a scheduling center matches sub-problems to computing nodes according to the proposed strategy, and each node carries out the reinforcement learning task for its sub-problem and reports the result back to the scheduling center, realizing parallel reinforcement learning on a computer cluster. Finally, an experimental environment is built on CloudSim to determine parameters such as the optimal step size, discount rate, and sub-problem size, and the performance of the proposed strategy with different numbers of computing nodes is demonstrated by solving a practical problem. With 64 computing nodes, the proposed strategy improves efficiency by 61% and 86% over round-robin scheduling and random scheduling, respectively. The experimental results show that the strategy effectively speeds up convergence under parallel computation, and further show that obtaining the optimal policy for a control problem with a state space on the order of one million states takes about 1.6×10^5 s.
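A minimal sketch of the asynchronous parallel pattern described above is given below: several worker threads run tabular Q-learning on their own slices of the state space and update a shared table. This only illustrates the computation pattern; the environment interface, the partitioning, and the absence of a priority-based scheduler are simplifying assumptions, and it is not the paper's CloudSim-based implementation.

```python
import threading
import numpy as np

def async_q_learning(env_step, n_states, n_actions, partitions,
                     episodes=1000, alpha=0.1, gamma=0.95, eps=0.1):
    """Asynchronous tabular Q-learning: one worker per state-space partition.

    env_step(s, a) -> (next_state, reward, done) is assumed to be thread-safe.
    partitions: list of lists of start states, one list per worker.
    """
    Q = np.zeros((n_states, n_actions))  # shared table, updated asynchronously

    def worker(start_states):
        rng = np.random.default_rng()
        for _ in range(episodes):
            s = int(rng.choice(start_states))
            done = False
            while not done:
                a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())
                s2, r, done = env_step(s, a)
                # Lock-free update, as in asynchronous Q-learning analyses.
                Q[s, a] += alpha * (r + gamma * (0 if done else Q[s2].max()) - Q[s, a])
                s = s2

    threads = [threading.Thread(target=worker, args=(p,)) for p in partitions]
    for t in threads: t.start()
    for t in threads: t.join()
    return Q
```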

19.

Website hacking is a frequent attack type used by malicious actors to obtain confidential information, modify the integrity of web pages or make websites unavailable. The tools used by attackers are becoming more and more automated and sophisticated, and malicious machine learning agents seem to be the next development in this line. In order to provide ethical hackers with similar tools, and to understand the impact and the limitations of artificial agents, we present in this paper a model that formalizes web hacking tasks for reinforcement learning agents. Our model, named Agent Web Model, considers web hacking as a capture-the-flag style challenge, and it defines reinforcement learning problems at seven different levels of abstraction. We discuss the complexity of these problems in terms of the actions and states an agent has to deal with, and we show that such a model makes it possible to represent most of the relevant web vulnerabilities. Aware that the driver of advances in reinforcement learning is the availability of standardized challenges, we provide an implementation for the first three abstraction layers, in the hope that the community will consider these challenges in order to develop intelligent web hacking agents.
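To give a flavour of the capture-the-flag formulation, here is a toy environment at the lowest level of abstraction, where each action requests one file and exactly one file contains the flag, paired with a plain tabular Q-learning loop. It is an illustrative sketch inspired by the setting described above, not the Agent Web Model implementation; all names and the reward scheme are assumptions.

```python
import random

class ToyCTFEnv:
    """Toy web capture-the-flag: action i = request file i; one file holds the flag."""

    def __init__(self, n_files=10):
        self.n_files = n_files
        self.flag_file = random.randrange(n_files)  # fixed target for this "site"
        self.tried = frozenset()

    def reset(self):
        self.tried = frozenset()
        return self.tried

    def step(self, action):
        reward = 10.0 if action == self.flag_file else -1.0  # cost per request
        done = action == self.flag_file
        self.tried = self.tried | {action}
        return self.tried, reward, done

def train(env, episodes=500, alpha=0.2, gamma=0.9, eps=0.2):
    Q = {}  # state (frozenset of tried files) -> list of action values
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            q = Q.setdefault(s, [0.0] * env.n_files)
            a = random.randrange(env.n_files) if random.random() < eps else q.index(max(q))
            s2, r, done = env.step(a)
            q2 = Q.setdefault(s2, [0.0] * env.n_files)
            q[a] += alpha * (r + gamma * (0.0 if done else max(q2)) - q[a])
            s = s2
    return Q
```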


20.
Key Scientific Problems in Multi-Agent Deep Reinforcement Learning   (Total citations: 6; self-citations: 0; citations by others: 6)
孙长银  穆朝絮 《自动化学报》2020,46(7):1301-1312
Reinforcement learning has decades of history as a method for solving model-free sequential decision problems, but it often faces great challenges when dealing with high-dimensional variables. In recent years, the rapid development of deep learning has made it possible for reinforcement learning methods to provide optimized decision policies for complex, high-dimensional multi-agent systems and to execute target tasks efficiently in challenging environments. This paper reviews the principles of reinforcement learning and deep reinforcement learning, proposes a closed-loop control framework for learning systems, and analyzes several important problems in multi-agent deep reinforcement learning and their solutions, including algorithmic structures for multi-agent reinforcement learning, environment non-stationarity, and partial observability; the strengths and weaknesses of the surveyed methods and their applications are analyzed and discussed. Finally, future research directions for multi-agent deep reinforcement learning are provided, offering ideas for developing more powerful and more readily applicable multi-agent reinforcement learning control systems.

