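To address the construction of the experience cache in deep reinforcement learning algorithms, a resampling-based preferential cache mechanism driven by TD error is proposed. To counter the training-set collapse that this mechanism exhibits, a rank-based stratified sampling algorithm is proposed as an improvement, and the mechanism is then used to improve several typical DQN-based deep reinforcement learning algorithms. Comparative simulation experiments on the Cart-Pole learning control problem on the OpenAI Gym platform show that the preferential mechanism improves the quality of training samples and achieves effective approximation of the value function, with good learning efficiency and generalization performance; both convergence speed and training performance improve markedly.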
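
The abstract above names rank-based stratified sampling over TD errors but gives no details. The following is a minimal sketch of one plausible reading, not the authors' algorithm. It assumes that transitions are ranked by absolute TD error, that the ranking is cut into equally sized strata, and that one transition is drawn uniformly from each stratum. The function name `rank_stratified_sample` and its parameters are hypothetical.

```python
import random


def rank_stratified_sample(td_errors, batch_size, rng=random):
    """Return one buffer index per rank stratum (assumed scheme, see lead-in).

    td_errors:  list of TD errors, one per stored transition.
    batch_size: number of strata, which is also the number of indices returned.
    rng:        the random module or a random.Random instance.
    """
    n = len(td_errors)
    if not 0 < batch_size <= n:
        raise ValueError("batch_size must be between 1 and len(td_errors)")
    # Rank transitions from largest to smallest absolute TD error.
    ranked = sorted(range(n), key=lambda i: abs(td_errors[i]), reverse=True)
    batch = []
    for k in range(batch_size):
        # Stratum k covers ranks [lo, hi); the cuts split n as evenly as possible.
        lo = k * n // batch_size
        hi = (k + 1) * n // batch_size
        batch.append(ranked[rng.randrange(lo, hi)])
    return batch
```

Because every stratum contributes exactly one sample, a batch always mixes high-error and low-error transitions, which is one way to keep the training set from collapsing onto a few large-error samples.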

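As a machine learning method that does not require training data to be obtained in advance, reinforcement learning (RL) searches for the optimal policy through continual interaction between an agent and its environment, and is an important approach to sequential decision-making problems. Combined with deep learning (DL), deep reinforcement learning (DRL) has both strong perception and strong decision-making capabilities, and is widely applied in many fields to solve complex decision problems. Off-policy reinforcement learning stores and replays interaction experience, which separates exploration from exploitation and makes it easier to find the global optimum. How to use experience reasonably and efficiently is the key to improving the efficiency of off-policy reinforcement learning methods. This paper first introduces the basic theory of reinforcement learning; it then briefly introduces on-policy and off-policy reinforcement learning algorithms; next, it presents the two mainstream solutions to the experience replay (ER) problem, namely experience utilization and experience augmentation; finally, it summarizes related research and discusses future directions.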

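A survey of the sparse reward problem in deep reinforcement learning
Reinforcement learning, an important branch of machine learning, is a class of methods that search for the optimal policy through interaction with the environment. In recent years reinforcement learning has been widely combined with deep learning, forming the research field of deep reinforcement learning. As a brand-new machine learning method, deep reinforcement learning can both perceive complex inputs and solve for optimal policies, and can be applied to complex decision problems such as robot control. The sparse reward problem is a core problem that deep reinforcement learning faces when solving tasks; in practice...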

Policy iteration, which iteratively evaluates and improves the control policy, is a reinforcement learning method. Policy evaluation with the least-squares method can draw more useful information from the empirical data and therefore improve the data validity. However, most existing online least-squares policy iteration methods use each sample only once, resulting in a low utilization rate. To improve utilization efficiency, we propose experience replay for least-squares policy iteration (ERLSPI) and prove its convergence. ERLSPI combines online least-squares policy iteration with experience replay: it stores the samples generated online and reuses them with the least-squares method to update the control policy. We apply ERLSPI to the inverted pendulum system, a typical benchmark. The experimental results show that the method effectively exploits previous experience and knowledge, improves the utilization efficiency of experience, and accelerates convergence.

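In contrast to conventional deep reinforcement learning, which trains on state-transition samples selected one at a time from the experience replay unit, this work targets a deep Q-network (DQN) that uses whole sequence trajectories as training samples, and proposes a method that augments sequence samples with the crossover operation of genetic algorithms. Sequence trajectories are generated during the agent's trial-and-error decision process while interacting with the environment, and they contain similar key states. Taking similar states in two sequence trajectories as crossover points produces sequence trajectories that have not yet appeared, which increases both the number and the diversity of sequence samples, and in turn strengthens the agent's exploration ability and improves sample efficiency. Compared with a deep Q-network that randomly samples training samples and with Episodic Backward Update (EBU), an algorithm that updates backward over sequence samples, the proposed method achieves higher rewards in playing Atari 2600 video games.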
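
The crossover step described in the abstract above can be illustrated with a small sketch. It is not the authors' implementation and rests on assumptions: each trajectory is a list of `(state, action, reward)` steps, states are tuples of floats, and two states count as "similar" when their Euclidean distance is at most `tol`. The function name `crossover` and the parameter `tol` are hypothetical.

```python
import math


def crossover(traj_a, traj_b, tol=1e-3):
    """Splice the head of traj_a onto the tail of traj_b at a similar state.

    Each step is (state, action, reward). If state i of traj_a is within
    `tol` of state j of traj_b, the child follows traj_a up to (but not
    including) step i and then continues with traj_b from step j onward.
    Returns None when no pair of similar states exists.
    """
    for i, (state_a, _, _) in enumerate(traj_a):
        if i == 0:
            # Skip the first step so the child keeps at least one step of traj_a.
            continue
        for j, (state_b, _, _) in enumerate(traj_b):
            if math.dist(state_a, state_b) <= tol:
                return traj_a[:i] + traj_b[j:]
    return None
```

The child is consistent only to the extent that the two crossover states really are interchangeable, so the similarity threshold is the main design choice in a scheme like this.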

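An automatic Option generation algorithm for hierarchical reinforcement learning
Hierarchical reinforcement learning currently has three main methods, Option, HAM, and MAXQ, and the automatic hierarchy construction problem has not been effectively solved for any of them. For the first method, this paper proposes an automatic Option generation algorithm. The algorithm takes the state space explored by the Agent in the initial learning stage as input, clusters it with artificial immune network techniques, and learns a set of internal policies by experience replay on each clustered state subset, thereby generating Options. Simulation experiments verify the effectiveness of the algorithm.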

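To improve retrieval of multimodal information in big data, a multimodal information retrieval algorithm based on deep neural networks is proposed. A deep autoencoder is designed to project data of different modalities into a common generalized subspace; sparse coding is used to reduce the dimensionality of the shared feature vectors and filter out redundant and noisy features; the data are reconstructed through deconvolution and upsampling operations. Experimental results on public modality recognition datasets show that the algorithm can effectively learn and generalize multi...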

This paper reviews exploration techniques in deep reinforcement learning. Exploration techniques are of primary importance when solving sparse reward problems. In sparse reward problems, the reward is rare, which means that an agent acting randomly will seldom find it. In such a scenario, it is challenging for reinforcement learning to learn the association between rewards and actions, so more sophisticated exploration methods need to be devised. This review provides a comprehensive overview of existing exploration approaches, which are categorised by their key contributions as: reward novel states, reward diverse behaviours, goal-based methods, probabilistic methods, imitation-based methods, safe exploration and random-based methods. Then, unsolved challenges are discussed to provide valuable future research directions. Finally, the approaches of different categories are compared in terms of complexity, computational effort and overall performance.

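To reduce the training time of the deep Q-network algorithm, a DQN method combining prioritized experience replay with the dueling network architecture is adopted and studied on two classic control problems on the OpenAI Gym platform, cart pole and mountain car. Experience replay uses a rank-based mechanism, and the dueling architecture uses deep neural networks. Simulation results show that, compared with the conventional DQN algorithm, the DQN method based on the dueling network architecture alone, and the DQN method based on prioritized experience replay alone, this method has better learning performance and the shortest training time. The influence of the algorithm parameters on learning performance is also analyzed in detail, providing a valuable reference for practical use.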
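
The abstract above does not state which formulas it uses for the two ingredients it combines. As a sketch, the code below assumes the commonly cited forms: rank-based prioritization with sampling probability proportional to `(1 / rank) ** alpha`, and the dueling aggregation `Q(s, a) = V(s) + A(s, a) - mean(A)`. The function names and the default `alpha` are hypothetical, and plain lists stand in for network outputs.

```python
def rank_based_probabilities(td_errors, alpha=0.7):
    """Sampling probabilities P(i) proportional to (1 / rank(i)) ** alpha.

    Rank 1 is the transition with the largest absolute TD error.
    """
    if not td_errors:
        raise ValueError("td_errors must not be empty")
    order = sorted(range(len(td_errors)),
                   key=lambda i: abs(td_errors[i]), reverse=True)
    weights = [0.0] * len(td_errors)
    for rank, i in enumerate(order, start=1):
        weights[i] = (1.0 / rank) ** alpha
    total = sum(weights)
    return [w / total for w in weights]


def dueling_q_values(state_value, advantages):
    """Combine V(s) and A(s, a) into Q(s, a) = V(s) + A(s, a) - mean(A)."""
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + a - mean_advantage for a in advantages]
```

Ranks depend only on the ordering of the TD errors, not on their magnitudes, so a single outlier error cannot dominate the sampling distribution.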

王童, 李骜, 宋海荦, 刘伟, 王明会. 《控制与决策》 (Control and Decision), 2022, 37(11): 2799-2807
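Existing hierarchical navigation methods based on deep reinforcement learning (DRL) perform poorly in complex environments containing structures such as long corridors and dead ends. To address this, a mobile robot navigation method based on option-based hierarchical deep reinforcement learning (HDRL) is proposed. The model framework is divided into a high level and a low level: the low-level obstacle avoidance and goal-driven control models implement the two behavior policies of obstacle avoidance and goal approaching, respectively, while the high-level behavior selection model automatically learns a stable and reliable behavior selection policy, which effectively avoids dependence on hand-designed regulation rules. In addition, the obstacle avoidance control model is trained with optimization so that the learned obstacle avoidance policy is better suited to navigation tasks in complex environments. In comparative experiments against existing DRL methods, the proposed method achieves the highest navigation success rate in all simulated test environments and an overall advantage on the other metrics, showing that it can effectively solve the problem of poor navigation in complex environments and has strong generalization ability. Tests in a real environment further verify the potential application value of the proposed method.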

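A new automatic Option generation algorithm for hierarchical reinforcement learning (HRL) is proposed. It takes the state space explored by the Agent in the initial learning stage as input, clusters it with an improved ant colony clustering algorithm (ACCA), and learns a set of internal policies by experience replay on each clustered state subset, thereby generating Options. Simulation experiments verify that the algorithm is effective.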

Compared with a single robot, multi-robot systems (MRSs) can undertake more challenging tasks in complex scenarios, benefiting from increased transportation capacity and fault tolerance. This paper presents a hierarchical framework for multi-robot navigation and formation in unknown environments with static and dynamic obstacles, where the robots compute and maintain the optimized formation while making progress to the target together. In the proposed framework, each robot can navigate to the global target in unknown environments based on its local perception, and only limited communication among robots is required to obtain the optimal formation. Accordingly, the framework includes three modules. Firstly, we design a learning network based on Deep Deterministic Policy Gradient (DDPG) to address the global navigation task for a single robot, which derives end-to-end policies that map the robot’s local perception into its velocity commands. To handle complex obstacle distributions (e.g. narrow/zigzag passages and local minima) and stabilize the training process, the strategies of Curriculum Learning (CL) and Reward Shaping (RS) are combined. Secondly, for an expected formation, its real-time configuration is optimized by a distributed optimization. This configuration considers surrounding obstacles and the current formation status, and provides each robot with its formation target. Finally, a velocity adjustment method that considers the robot kinematics is designed; it adjusts the navigation velocity of each robot according to its formation target, so that all robots navigate to their targets while maintaining the expected formation. This framework allows for online formation reconfiguration and scales with the number of robots. Extensive simulations and 3-D evaluations verify that our method can navigate the MRS in unknown environments while maintaining the optimal formation.

This paper presents a hybrid path planning algorithm for the design of autonomous vehicles such as mobile robots. The hybrid planner is based on the Potential Field method and the Voronoi Diagram approach and supports concurrent navigation and map building. The system controller (Look-ahead Control) with the Potential Field method guarantees that the robot generates a smooth and safe path to an expected position. The Voronoi Diagram approach is adopted to help the mobile robot avoid being trapped by concave environments while exploring a route to a target. This approach allows the mobile robot to accomplish an autonomous navigation task with only essential exploration between a start and a goal position. Based on the existing topological map, the mobile robot is able to construct sub-goals between the predefined start and goal, and follows a smooth and safe trajectory in a flexible manner when stationary and moving obstacles co-exist.
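
The potential field part of such a planner can be sketched in a few lines. The code below is an illustration under stated assumptions, not the controller from the paper: it uses the classical quadratic attractive potential and the classical repulsive potential that acts only within an influence distance `d0`, in a 2-D workspace with point obstacles. The function name and the gains `k_att`, `k_rep`, `d0` are hypothetical.

```python
import math


def potential_field_force(pos, goal, obstacles, k_att=1.0, k_rep=1.0, d0=2.0):
    """Return the 2-D force (fx, fy) acting on a robot at `pos`.

    Attraction: k_att * (goal - pos).
    Repulsion from each obstacle closer than d0:
        k_rep * (1/d - 1/d0) / d**2, directed away from the obstacle.
    """
    fx = k_att * (goal[0] - pos[0])
    fy = k_att * (goal[1] - pos[1])
    for ox, oy in obstacles:
        dx, dy = pos[0] - ox, pos[1] - oy
        d = math.hypot(dx, dy)
        if 0.0 < d < d0:
            magnitude = k_rep * (1.0 / d - 1.0 / d0) / (d * d)
            # (dx / d, dy / d) is the unit vector pointing away from the obstacle.
            fx += magnitude * dx / d
            fy += magnitude * dy / d
    return fx, fy
```

Attraction and repulsion can cancel inside a concave obstacle, leaving the robot stuck at zero net force. That is the trap the abstract refers to, and the reason it adds a Voronoi-based component.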

In this paper, a data-based feedback relearning algorithm is proposed for the robust control problem of uncertain nonlinear systems. Motivated by the classical on-policy and off-policy algorithms of reinforcement learning, the online feedback relearning (FR) algorithm is developed, in which the collected data include the influence of disturbance signals. The FR algorithm adapts better to environmental changes (such as control channel disturbances) than the off-policy algorithm, and has higher computational efficiency and better convergence performance than the on-policy algorithm. Data processing based on experience replay is used to improve data efficiency and convergence stability. Comparative simulation experiments illustrate the convergence stability, optimality and algorithmic performance of the FR algorithm.

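Deep reinforcement learning (DRL), formed by combining deep learning (DL) and reinforcement learning (RL), is currently a hot topic in artificial intelligence. Deep reinforcement learning has achieved major breakthroughs in solving for optimal policies in tasks with high-dimensional inputs. To reduce the temporal correlation between transitions, the conventional deep Q-network uses an experience replay sampling mechanism that randomly samples transitions from the replay memory. However, random sampling ignores the priority of individual transitions in the replay memory, so network training may use too many low-information samples while neglecting some high-information ones, which both lengthens training and yields unsatisfactory results. To address this problem, the concept of priority is introduced into the conventional deep Q-network and a sampling algorithm based on the upper confidence bound (UCB) is proposed. Reward, time step, and sample count jointly determine the priority of samples in the experience pool, raising the selection probability of samples that have not yet been selected, samples with higher information value, and samples that perform well; this ensures the diversity of the sampled data and lets the agent select actions more effectively. Finally, simulation experiments in several Atari 2600 game environments verify the effectiveness of the algorithm.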
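
The abstract above says that reward, time step, and sample count jointly set each sample's priority, but it does not give the formula. The sketch below therefore borrows the general shape of the UCB1 bandit rule as an assumption: `score_i = reward_i + c * sqrt(ln(t) / (count_i + 1))`. It should be read as an illustration of how an upper confidence bound trades off the three quantities, not as the paper's algorithm. All names and the constant `c` are hypothetical.

```python
import math


def ucb_priorities(rewards, sample_counts, time_step, c=1.0):
    """Assumed UCB-style score: reward plus a bonus for rarely sampled items."""
    if time_step < 1:
        raise ValueError("time_step must be at least 1")
    log_t = math.log(time_step)
    return [r + c * math.sqrt(log_t / (n + 1))
            for r, n in zip(rewards, sample_counts)]


def pick_batch(rewards, sample_counts, time_step, batch_size, c=1.0):
    """Pick the batch_size highest-scoring transitions and update their counts."""
    scores = ucb_priorities(rewards, sample_counts, time_step, c)
    order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    chosen = order[:batch_size]
    for i in chosen:
        sample_counts[i] += 1  # sampled items get a smaller bonus next time
    return chosen
```

With this form, a transition that has never been sampled can outrank a higher-reward transition that has already been replayed many times, which matches the abstract's stated aim of raising the selection probability of unselected samples.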

For several years, dynamic movement primitives (DMPs) have increasingly become a focus of interest for flexible movement control in robotics. In this study we introduce sensory feedback together with a predictive learning mechanism, which allows tightly coupled dual-agent systems to learn an adaptive, sensor-driven interaction based on DMPs. The coupled conventional (no sensors, no learning) DMP system automatically equilibrates and can still be solved analytically, allowing us to derive conditions for stability. When adaptive sensor control is added, we can show that both agents learn to cooperate. Simulations as well as real-robot experiments are shown. Interestingly, all these mechanisms are based entirely on low-level interactions without any planning or cognitive component.

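In recent years deep reinforcement learning has developed rapidly. To improve the ability of deep reinforcement learning to handle high-dimensional state spaces or dynamic, complex environments, researchers have introduced memory-augmented neural networks into deep reinforcement learning and proposed various memory-augmented deep reinforcement learning algorithms, which have become a current research focus. According to the type of memory-augmented neural network, this paper divides memory-augmented deep reinforcement learning into four categories: deep reinforcement learning based on experience replay...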

In this paper, a finite-time optimal tracking control scheme based on integral reinforcement learning is developed for partially unknown nonlinear systems. To realize the prescribed performance, the original system is transformed into an equivalent unconstrained system so that a composite system can be constructed. Subsequently, a modified nonlinear quadratic performance function containing the auxiliary tracking error is designed. Furthermore, experience replay is used to update the critic neural network, which removes the persistence of excitation condition required by traditional optimal methods. By combining prescribed performance control with the finite-time optimization control technique, the tracking error is driven to a desired performance in finite time. Stability analysis then shows that all signals in the partially unknown nonlinear system are semiglobally practical finite-time stable. Finally, comparative simulation results verify the effectiveness of the developed control scheme.

In human-robot collaboration in manufacturing, the operator's safety is the primary issue during manufacturing operations. This paper presents a deep reinforcement learning approach to real-time collision-free motion planning of an industrial robot for human-robot collaboration. Firstly, the safe human-robot collaboration manufacturing problem is formulated as a Markov decision process, and the mathematical expression of the reward function design problem is given. The goal is for the robot to autonomously learn a policy that reduces the accumulated risk and assures the task completion time during human-robot collaboration. To transform this optimization objective into a reward function that guides the robot to learn the expected behaviour, a reward function optimizing approach based on the deterministic policy gradient is proposed to learn a parameterized intrinsic reward function. The reward function from which the agent learns the policy is the sum of the intrinsic reward function and the extrinsic reward function. Then, a deep reinforcement learning algorithm, intrinsic reward-deep deterministic policy gradient (IRDDPG), which combines the DDPG algorithm with the reward function optimizing approach, is proposed to learn the expected collision avoidance policy. Finally, the proposed algorithm is tested in a simulation environment, and the results show that the industrial robot can learn the expected policy and assure safety in industrial human-robot collaboration without missing the original target. Moreover, the reward function optimizing approach can compensate for shortcomings in the designed reward function and improve policy performance.

