Found 20 similar documents (search time: 15 ms)
1.
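To reduce the training time of the deep Q-network (DQN) algorithm, a DQN method that combines prioritized experience replay with a dueling network architecture is adopted and studied on two classic control problems from the OpenAI Gym platform, cart pole and mountain car. The experience replay uses a rank-based mechanism, and the dueling architecture uses deep neural networks. Simulation results show that, compared with the conventional DQN algorithm, the DQN method based on the dueling network architecture, and the DQN method based on prioritized experience replay, the proposed method achieves better learning performance and the shortest training time. The influence of the algorithm parameters on learning performance is also analyzed in detail, providing a valuable reference for practical use.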
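The two ingredients named in this abstract, rank-based prioritized replay and the dueling architecture, can be illustrated with a minimal sketch. It follows the standard formulations (sampling probability proportional to (1/rank)^alpha, and Q = V + A - mean(A)); the function names and the value alpha=0.7 are assumptions for illustration and are not taken from the paper.

```python
def rank_based_probabilities(td_errors, alpha=0.7):
    """Sampling probabilities proportional to (1 / rank) ** alpha.

    Rank 1 is assigned to the transition with the largest absolute TD error.
    alpha=0.7 is an illustrative default, not a value from the abstract.
    """
    order = sorted(range(len(td_errors)),
                   key=lambda i: abs(td_errors[i]),
                   reverse=True)
    priorities = [0.0] * len(td_errors)
    for rank, index in enumerate(order, start=1):
        priorities[index] = (1.0 / rank) ** alpha
    total = sum(priorities)
    return [p / total for p in priorities]


def dueling_q_values(state_value, advantages):
    """Combine a state value V(s) and advantages A(s, a) into Q(s, a).

    Subtracting the mean advantage keeps V and A identifiable.
    """
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + a - mean_advantage for a in advantages]
```

In a full agent, the probabilities would drive minibatch sampling from the replay buffer, and the aggregation would be the last layer of the Q-network.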
2.
Compared with a single robot, multi-robot systems (MRSs) can undertake more challenging tasks in complex scenarios, benefiting from increased transportation capacity and fault tolerance. This paper presents a hierarchical framework for multi-robot navigation and formation in unknown environments with static and dynamic obstacles, where the robots compute and maintain the optimized formation while making progress toward the target together. In the proposed framework, each robot is capable of navigating to the global target in unknown environments based on its local perception, and only limited communication among robots is required to obtain the optimal formation. Accordingly, the framework includes three modules. First, we design a learning network based on Deep Deterministic Policy Gradient (DDPG) to address the global navigation task for a single robot, which derives end-to-end policies that map the robot’s local perception into its velocity commands. To handle complex obstacle distributions (e.g., narrow or zigzag passages and local minima) and stabilize the training process, Curriculum Learning (CL) and Reward Shaping (RS) strategies are combined. Second, for an expected formation, its real-time configuration is optimized by distributed optimization. This configuration considers surrounding obstacles and the current formation status, and provides each robot with its formation target. Finally, a velocity adjustment method that considers the robot kinematics is designed; it adjusts the navigation velocity of each robot according to its formation target, so that all robots navigate to their targets while maintaining the expected formation. This framework allows for online formation reconfiguration and scales with the number of robots. Extensive simulations and 3-D evaluations verify that our method can navigate the MRS in unknown environments while maintaining the optimal formation.
3.
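This paper outlines the autonomous navigation algorithms commonly used for mobile robots, together with their advantages and disadvantages, and on that basis proposes a reinforcement learning method. It describes the principle of the reinforcement learning algorithm and uses a neural network to address the generalization problem. A reinforcement learning method for autonomous robot navigation based on obstacle-detection sensor information is designed, and mathematical models are given for each element of the learning algorithm. Simulation verifies that the algorithm is correct and effective, with good convergence and generalization ability.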
4.
In this paper, an integral reinforcement learning (IRL) algorithm on an actor–critic structure is developed to learn online the solution to the Hamilton–Jacobi–Bellman equation for partially-unknown constrained-input systems. The technique of experience replay is used to update the critic weights to solve an IRL Bellman equation. This means, unlike existing reinforcement learning algorithms, recorded past experiences are used concurrently with current data for adaptation of the critic weights. It is shown that using this technique, instead of the traditional persistence of excitation condition which is often difficult or impossible to verify online, an easy-to-check condition on the richness of the recorded data is sufficient to guarantee convergence to a near-optimal control law. Stability of the proposed feedback control law is shown and the effectiveness of the proposed method is illustrated with simulation examples.
5.
This paper reviews exploration techniques in deep reinforcement learning. Exploration techniques are of primary importance when solving sparse reward problems. In sparse reward problems, the reward is rare, which means that the agent will not find the reward often by acting randomly. In such a scenario, it is challenging for reinforcement learning to learn the association between rewards and actions. Thus, more sophisticated exploration methods need to be devised. This review provides a comprehensive overview of existing exploration approaches, which are categorised based on their key contributions as: reward novel states, reward diverse behaviours, goal-based methods, probabilistic methods, imitation-based methods, safe exploration, and random-based methods. Then, unsolved challenges are discussed to provide valuable future research directions. Finally, the approaches of different categories are compared in terms of complexity, computational effort, and overall performance.
6.
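To address autonomous navigation and delivery by existing AGVs in large-scale, unknown, complex environments, an intelligent AGV navigation system is designed based on deep reinforcement learning. First, sensors detect and perceive the surrounding obstacles, and the DDPG (deep deterministic policy gradient) algorithm is used to control the AGV directly from environmental perception input to action output, helping the AGV complete autonomous navigation and obstacle avoidance tasks. In addition, because training samples are easily affected by environmental disturbances, a novel DL (disturb learning)-DDPG algorithm is proposed: the relevant data in the learning samples are preprocessed with Gaussian noise, which helps the agent adapt to a noisy training environment and improves the robustness of the AGV in real environments. Simulation experiments show that the improved DL-DDPG algorithm provides more efficient online decision-making for the AGV navigation system, enabling the AGV to achieve autonomous navigation and intelligent control.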
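As a rough illustration of the Gaussian-noise preprocessing idea, the sketch below perturbs a sensor observation before it is stored as a training sample. The abstract does not say which fields are perturbed or what noise scale is used, so the function name `disturb_state`, the choice to perturb the state vector, and sigma=0.05 are all assumptions.

```python
import random


def disturb_state(state, sigma=0.05, rng=None):
    """Return a copy of `state` with zero-mean Gaussian noise added per element.

    `state` is a list of floats (e.g., range-sensor readings); sigma is a
    hypothetical noise scale chosen only for illustration.
    """
    rng = rng or random.Random()
    return [x + rng.gauss(0.0, sigma) for x in state]
```

A training loop might call `disturb_state(observation)` before pushing the transition into the replay buffer, so the policy is fitted on perturbed inputs.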
7.
The comprehensive utilization of incomplete multi-modality data is a difficult problem with strong practical value. Most previous multimodal learning algorithms require massive training data with complete modalities and annotated labels, which greatly limits their practicality. Although some existing algorithms can be used for data imputation, they still have two disadvantages: (1) they cannot control the semantics of the imputed modalities accurately; and (2) they need to establish multiple independent converters between any two modalities when extended to multimodal cases. To overcome these limitations, we propose a novel doubly semi-supervised multimodal learning (DSML) framework. Specifically, DSML uses a modality-shared latent space and multiple modality-specific generators to associate multiple modalities. We divide the shared latent space into two independent parts, the semantic labels and the semantic-free styles, which allows us to easily control the semantics of generated samples. In addition, each modality has its own separate encoder and classifier to infer the corresponding semantic and semantic-free latent variables. The DSML framework can be adversarially trained using our specially designed softmax-based discriminators. Extensive experimental results show that DSML obtains better performance than the baselines on three tasks: semi-supervised classification, missing modality imputation, and cross-modality retrieval.
8.
Health sensing systems (HSSs), which offer a variety of health services, have attracted considerable research attention in the area of smart healthcare. However, continuous sensing inevitably causes high energy consumption in mobile sensing devices. On the other hand, reducing the sensing duration causes excessive delay in detecting a change in user state and can miss critical physiological signals. Thus, the trade-off between energy consumption and delay constitutes a primary challenge in the design of an HSS. In this paper, we propose an adaptive sensing strategy to intelligently determine the trigger time for sensing physiological parameters at an HSS. Furthermore, human context recognition (HCR) is adopted to design a context-aware sensing strategy, where the health condition, sensing requirements, and dependence on physiological data are considered simultaneously. To devise the sensing strategy, we first generate a dynamic observation model. Next, we propose a sort retention double-DQN (SRD-DQN) based sensing strategy. In comparison to traditional double-DQN, the proposed approach can effectively enhance learning stability and sample efficiency. With SRD-DQN, we can obtain the optimized schedule of the successive window according to the current state. We implement blood pressure and heart rate monitoring simulations to evaluate the performance of the proposed sensing strategy. Simulation results reveal that the sensing strategy can effectively restrain energy consumption and delay, and that SRD-DQN converges faster than traditional DQN.
9.
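Current machine-learning-based motion planning for autonomous driving requires large numbers of samples, does not associate temporal information, and does not use global navigation information. To address these problems, a directional-navigation motion planning algorithm for autonomous driving based on a deep spatio-temporal Q-network is proposed. First, to extract the spatial image features and the temporal information between consecutive frames, a new deep spatio-temporal Q-network is proposed that builds on the original deep Q-network and incorporates a long short-term memory network. Then, to make full use of global navigation information, a direction signal is added to the images from which environment information is extracted, achieving directional navigation. Finally, based on the proposed deep spatio-temporal Q-network, a learning strategy for the autonomous-driving motion planning model is designed to achieve end-to-end motion planning, predicting the vehicle's steering wheel angle and throttle/brake data from the input image sequence. Training and testing in the Carla driving simulator show that the average deviation of the algorithm is less than 0.7 m on all four test roads and that its stability is better than that of the four comparison algorithms. The algorithm has good learning ability, stability, and real-time performance, and can achieve autonomous-driving motion planning along a global navigation route.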
10.
At actual working sites, equipment often operates under varying working conditions, and the manufacturing system is rather complicated. However, traditional multi-label learning methods in the fault diagnosis domain need to use a pre-defined label sequence or synchronously predict all labels of the input sample. Deep reinforcement learning (DRL) combines the perception ability of deep learning and the decision-making ability of reinforcement learning. Moreover, the curriculum learning mechanism follows the human approach of learning from easy to complex. Consequently, an improved proximal policy optimization (PPO) method, PPO being a typical DRL algorithm, is proposed in this paper as a novel method for multi-label classification. The improved PPO method can build a relationship between the predicted labels of an input sample by means of an action history vector, which encodes all past actions selected by the agent up to the current time step. In two rolling bearing experiments, the diagnostic results demonstrate that the proposed method provides higher accuracy than traditional multi-label methods for fault recognition under complicated working conditions. In addition, compared with the same network using a pre-defined label sequence, the proposed method can distinguish the multiple labels of input samples following the curriculum mechanism from easy to complex.
11.
The predictive hotspot mapping of sparse spatio-temporal events (e.g., crime and traffic accidents) aims to forecast areas or locations with higher average risk of event occurrence, which offers important insight for preventative strategies. Although a network-based structure can better capture the micro-level variation of spatio-temporal events, existing deep learning methods for sparse event forecasting are based on either area or grid units, owing to the data sparsity in both space and time and the complex network topology. To overcome these challenges, this paper develops the first deep learning (DL) model for network-based predictive mapping of sparse spatio-temporal events. Leveraging a graph-based representation of the network-structured data, a gated localised diffusion network (GLDNet) is introduced, which integrates a gated network to model the temporal propagation and a novel localised diffusion network to model the spatial propagation confined by the network topology. To deal with the sparsity issue, we reformulate the research problem as an imbalanced regression task and employ a weighted loss function to train the DL model. The framework is validated on a crime forecasting case in South Chicago, USA, where it outperforms the state-of-the-art benchmark by 12% and 25% in terms of the mean hit rate at the 10% and 20% coverage levels, respectively.
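One simple way to realise a weighted loss for imbalanced regression is to up-weight the rare units where an event actually occurred. The sketch below is a generic weighted mean squared error under that assumption; the abstract does not give the actual loss used for GLDNet, and the name `weighted_mse` and the weight of 10.0 are hypothetical.

```python
def weighted_mse(predictions, targets, event_weight=10.0):
    """Weighted mean squared error for sparse targets.

    Units with an observed event (target > 0) receive `event_weight`;
    all other units receive weight 1. The weights are illustrative only.
    """
    total = 0.0
    weight_sum = 0.0
    for pred, target in zip(predictions, targets):
        weight = event_weight if target > 0 else 1.0
        total += weight * (pred - target) ** 2
        weight_sum += weight
    return total / weight_sum
```

With `event_weight=1.0` this reduces to the ordinary mean squared error, which makes the effect of the weighting easy to compare.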
12.
To benefit from the accurate simulation and high-throughput data contributed by advanced digital twin technologies in modern smart plants, the deep reinforcement learning (DRL) method is an appropriate choice to generate a self-optimizing scheduling policy. This study employs the deep Q-network (DQN), a successful DRL method, to solve the dynamic scheduling problem of flexible manufacturing systems (FMSs) involving shared resources, route flexibility, and stochastic arrivals of raw products. To model the system in consideration of both manufacturing efficiency and deadlock avoidance, we use a class of Petri nets combining timed-place Petri nets and a system of simple sequential processes with resources (S3PR), named the timed S3PR. The dynamic scheduling problem of the timed S3PR is defined as a Markov decision process (MDP) that can be solved by the DQN. To construct deep neural networks that approximate the DQN action-value function, which maps the timed S3PR states to scheduling rewards, we employ a graph convolutional network (GCN) as the timed S3PR state approximator by proposing a novel graph convolution layer called a Petri-net convolution (PNC) layer. The PNC layer uses the input and output matrices of the timed S3PR to compute the propagation of features from places to transitions and from transitions to places, thereby reducing the number of parameters to be trained and ensuring robust convergence of the learning process. Experimental results verify that the proposed DQN with a PNC network can provide better solutions for dynamic scheduling problems in terms of manufacturing performance, computational efficiency, and adaptability compared with heuristic methods and a DQN with basic multilayer perceptrons.
13.
Fault diagnosis methods for rotating machinery have always been a hot research topic, and artificial intelligence-based approaches have attracted increasing attention from both researchers and engineers. Among those related studies and methods, artificial neural networks, especially deep learning-based methods, are widely used to extract fault features or classify fault features obtained by other signal processing techniques. Although such methods could solve the fault diagnosis problems of rotating machinery, there are still two deficiencies. (1) Unable to establish direct linear or non-linear mapping between raw data and the corresponding fault modes, the performance of such fault diagnosis methods highly depends on the quality of the extracted features. (2) The optimization of neural network architecture and parameters, especially for deep neural networks, requires considerable manual modification and expert experience, which limits the applicability and generalization of such methods. As a remarkable breakthrough in artificial intelligence, AlphaGo, a representative achievement of deep reinforcement learning, provides inspiration and direction for the aforementioned shortcomings. Combining the advantages of deep learning and reinforcement learning, deep reinforcement learning is able to build an end-to-end fault diagnosis architecture that can directly map raw fault data to the corresponding fault modes. Thus, based on deep reinforcement learning, a novel intelligent diagnosis method is proposed that is able to overcome the shortcomings of the aforementioned diagnosis methods. Validation tests of the proposed method are carried out using datasets of two types of rotating machinery, rolling bearings and hydraulic pumps, which contain a large number of measured raw vibration signals under different health states and working conditions. 
The diagnosis results show that the proposed method is able to obtain intelligent fault diagnosis agents that can mine the relationships between the raw vibration signals and fault modes autonomously and effectively. Considering that the learning process of the proposed method depends only on the replayed memories of the agent and the overall rewards, which represent much weaker feedback than that obtained by the supervised learning-based method, the proposed method is promising in establishing a general fault diagnosis architecture for rotating machinery.
14.
The integration of reinforcement learning (RL) and imitation learning (IL) is an important problem that has long been studied in the field of intelligent robotics. RL optimizes policies to maximize the cumulative reward, whereas IL attempts to extract general knowledge about the trajectories demonstrated by experts, i.e., demonstrators. Because each has its own drawbacks, many methods that combine them and compensate for each set of drawbacks have been explored thus far. However, many of these methods are heuristic and do not have a solid theoretical basis. This paper presents a new theory for integrating RL and IL by extending the probabilistic graphical model (PGM) framework for RL, control as inference. We develop a new PGM for RL with multiple types of rewards, called probabilistic graphical model for Markov decision processes with multiple optimality emissions (pMDP-MO). Furthermore, we demonstrate that the integrated learning method of RL and IL can be formulated as a probabilistic inference of policies on pMDP-MO by considering the discriminator in generative adversarial imitation learning (GAIL) as an additional optimality emission. We adapt the GAIL and task-achievement reward to our proposed framework, achieving significantly better performance than policies trained with baseline methods.
15.
Distributed manufacturing plays an important role in helping large-scale companies reduce production and transportation costs for globalized orders. However, assigning dynamic orders to distributed workshops properly and in real time is a challenging problem. To provide real-time and intelligent scheduling decisions for distributed flowshops, we studied the distributed permutation flowshop scheduling problem (DPFSP) with dynamic job arrivals using deep reinforcement learning (DRL). The objective is to minimize the total tardiness cost of all jobs. We provided the training and execution procedures of intelligent scheduling based on DRL for the dynamic DPFSP. In addition, we established a DRL-based scheduling model for distributed flowshops by designing a suitable reward function, scheduling actions, and state features. A novel reward function is designed to relate directly to the objective. Various problem-specific dispatching rules are introduced to provide efficient actions for different production states. Furthermore, four efficient DRL algorithms, namely deep Q-network (DQN), double DQN (DbDQN), dueling DQN (DlDQN), and advantage actor-critic (A2C), are adapted to train the scheduling agent. The training curves show that the agent learned to generate better solutions effectively and validate that the system design is reasonable. After training, all DRL algorithms outperform traditional meta-heuristics and well-known priority dispatching rules (PDRs) by a large margin in terms of solution quality and computational efficiency. This work shows the effectiveness of DRL for the real-time scheduling of the dynamic DPFSP.
16.
Machine vision, especially deep learning methods, has become a hot topic for product surface inspection. In practice, capturing high-quality images is the basis for defect detection. This turns out to be challenging for complex products, as image quality suffers from occlusion, illumination, and other issues. Multiple images from different viewpoints are often required in this scenario to cover all the important areas of the products. Reducing the number of viewpoints while ensuring coverage is the key to making the inspection system more efficient in production. This paper proposes a highly efficient view planning method based on deep reinforcement learning to solve this problem. First, a visibility estimation method is developed so that the visible areas can be quickly identified for a given viewpoint. Then, a new reward function is designed, and the Asynchronous Advantage Actor-Critic method is applied to solve the view planning problem. The effectiveness and efficiency of the proposed method are verified with a set of experiments. The proposed method could also potentially be applied to other similar vision-based tasks.
17.
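As massive amounts of renewable energy are connected to microgrids, the parameter space of the microgrid system model grows multiplicatively, and the computational difficulty of optimal energy scheduling keeps rising. At the same time, the uncertainty of renewable power output poses a great challenge to optimal microgrid scheduling. To address these problems, this paper proposes a real-time optimal scheduling strategy for microgrids based on distributed deep reinforcement learning. First, under a distributed architecture, the main grid and each distributed power source are treated as independent agents. Second, each agent has a local learning model and builds its state and action spaces from local data; a reward function, together with its constraints, is designed for multi-objective optimization covering generation cost, trading electricity price, and power source service life. Finally, each agent seeks its locally optimal policy by interacting with the environment, while the agents learn value-network parameters from one another to optimize local action selection, ultimately achieving the goal of minimizing the operating cost of the microgrid system. Simulation results show that, compared with the Deep Deterministic Policy Gradient (DDPG) algorithm, the proposed method increases training speed by 17.6% and reduces the cost function value by 67% while ensuring system stability and solution accuracy, achieving real-time optimal scheduling of the microgrid.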
18.
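Existing policy-gradient-based deep reinforcement learning methods suffer from long training times and low learning efficiency when applied to robot navigation in complex indoor scenes such as offices and corridors. To address this, this paper proposes a deep reinforcement learning navigation algorithm that combines an advantage structure with a minimized target Q-value. The algorithm introduces the advantage structure into policy-gradient-based deep reinforcement learning to distinguish differences among actions under the same state value and thereby improve learning efficiency; in multi-goal navigation scenarios, the state value is estimated separately and map information is used to provide a more accurate value judgment. In addition, because the methods used in discrete control to mitigate target Q-value overestimation are difficult to make work under the mainstream Actor-Critic framework of reinforcement learning, a minimum target Q-value method based on Gaussian smoothing is designed to reduce the impact of overestimation on training. Experimental results show that the proposed algorithm effectively speeds up learning: during single-goal and multi-goal continuous navigation training, it converges faster than the soft actor-critic (SAC), twin delayed deep deterministic policy gradient (TD3), and deep deterministic policy gradient (DDPG) algorithms; it also keeps the mobile robot effectively away from obstacles, and the trained navigation model has good generalization ability.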
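The abstract does not spell out how the Gaussian-smoothed minimum target Q-value is computed. For orientation only, the sketch below shows the closely related and well-known TD3-style target, which takes the minimum of two target critics evaluated at a noise-smoothed target action. It is not the paper's method; the callables `target_actor`, `target_q1`, `target_q2`, the action range [-1, 1], and all default values are assumptions.

```python
import random


def smoothed_min_target(reward, next_state, done,
                        target_actor, target_q1, target_q2,
                        gamma=0.99, sigma=0.2, noise_clip=0.5, rng=None):
    """TD3-style target: r + gamma * min(Q1, Q2)(s', a' + clipped noise).

    target_actor(state) returns a list of action components (hypothetical);
    target_q1/target_q2(state, action) return scalar Q estimates (hypothetical).
    Actions are assumed to lie in [-1, 1].
    """
    rng = rng or random.Random()
    noisy_action = []
    for a in target_actor(next_state):
        # Clipped zero-mean Gaussian noise smooths the target action.
        noise = max(-noise_clip, min(noise_clip, rng.gauss(0.0, sigma)))
        noisy_action.append(max(-1.0, min(1.0, a + noise)))
    # Taking the smaller of the two estimates counteracts overestimation.
    q_min = min(target_q1(next_state, noisy_action),
                target_q2(next_state, noisy_action))
    not_done = 0.0 if done else 1.0
    return reward + gamma * not_done * q_min
```

Both critics would then be regressed toward this shared target, while terminal transitions contribute only their immediate reward.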
19.
At complex working sites, bearings, an important part of a machine, can simultaneously have faults at several positions. Consequently, multi-label learning, which fully considers the correlation between the different fault positions of bearings, has become a popular learning pattern. Deep reinforcement learning (DRL), which combines the perception ability of deep learning and the decision-making ability of reinforcement learning, can be adapted to compound fault diagnosis and has a strong ability to extract fault features from raw data. However, DRL is difficult to converge and is prone to unstable training. Therefore, this paper integrates the feature extraction ability of DRL with the knowledge transfer ability of transfer learning (TL) and proposes multi-label transfer reinforcement learning (ML-TRL). In detail, the proposed method uses an improved trust region policy optimization (TRPO) as the basic DRL framework and pre-trains the fixed convolutional networks of ML-TRL using a multi-label convolutional neural network method. In a compound fault experiment, the final results demonstrate that the proposed method achieves higher accuracy than other multi-label learning methods. Hence, the proposed method is a promising alternative for recognizing compound faults of bearings.
20.
The design of a System of Systems (SoS) has attracted wide attention in recent years, especially in military applications. This problem is also known as SoS architecting, which can be boiled down to two subproblems: selecting a number of systems from a set of candidates and specifying the tasks to be completed by each selected system. Essentially, such a problem can be reduced to a combinatorial optimization problem. Traditional exact solvers such as the branch-and-bound algorithm are not efficient enough to deal with large-scale cases. Heuristic algorithms are more scalable, but if the input changes, they have to restart the search process. Repeating the search may take a long time and interfere with the mission achievement of the SoS in highly dynamic scenarios, e.g., in Mosaic Warfare. In this paper, we combine artificial intelligence with SoS architecting and propose a deep reinforcement learning approach, DRL-SoSDP, for SoS design. Deep neural networks and actor–critic algorithms are used to find the optimal solution under constraints. Evaluation results show that the proposed approach is superior to heuristic algorithms in both solution quality and computation time, especially in large-scale cases. DRL-SoSDP can find high-quality solutions in near real time, showing great potential for cases that require an instant reply. DRL-SoSDP also shows good generalization ability and can find better results than heuristic algorithms even when the scale of the SoS is much larger than that in the training data.