This paper proposes a three-layered parallel fuzzy inference model called reinforcement fuzzy neural network with distributed prediction scheme (RFNN-DPS), which performs reinforcement learning with a novel distributed prediction scheme. In RFNN-DPS, an additional predictor for predicting the external reinforcement signal is not necessary, and the internal reinforcement information is distributed into fuzzy rules (rule nodes). Therefore, using RFNN-DPS, only one network is needed to construct a fuzzy logic system with the abilities of parallel inference and reinforcement learning. Basically, the information for prediction in RFNN-DPS is composed of credit values stored in fuzzy rule nodes, where each node holds a credit vector to represent the reliability of the corresponding fuzzy rule. The credit values are not only accessed for predicting external reinforcement signals, but also provide a more profitable internal reinforcement signal to each fuzzy rule itself. RFNN-DPS performs a credit-based exploratory algorithm to adjust its internal status according to the internal reinforcement signal. During learning, the RFNN-DPS network is constructed by a single-step or multistep reinforcement learning algorithm based on the ART concept. According to our experimental results, RFNN-DPS shows the advantages of simple network structure, fast learning speed, and explicit representation of rule reliability.

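In multi-robot systems, the learning space of reinforcement learning for cooperative environment exploration grows exponentially with the number of robots, and this very large learning space makes convergence extremely slow. To address this problem, a reinforcement learning method based on action prediction, together with an action selection strategy, is applied to multi-robot cooperation: the probabilities of the actions a robot may execute are predicted in order to speed up the convergence of the learning algorithm. Experimental results show that the action-prediction-based reinforcement learning method acquires multi-robot cooperation strategies faster than the original algorithm.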

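Deep reinforcement learning (DRL) can be widely applied to urban traffic signal control, but in existing studies the great majority of DRL agents make decisions using only the current traffic state, which limits control performance when traffic flow varies substantially. A DRL signal control algorithm combined with state prediction is proposed. First, one-hot encoding is used to design a concise and efficient traffic state representation; then, a long short-term memory (LSTM) network predicts the future traffic state; finally, the agent makes the optimal decision based on the current state and the predicted state. Experiments on the SUMO (Simulation of Urban Mobility) simulation platform show that, under a variety of traffic flow conditions at single and multiple intersections, the proposed algorithm achieves the best performance compared with three typical signal control algorithms on metrics including average waiting time, travel time, fuel consumption, and CO2 emissions.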

Making complex decisions in real world problems often involves assigning values to sets of interdependent variables where an expressive dependency structure among these can influence, or even dictate, what assignments are possible. Commonly used models typically ignore expressive dependencies since the traditional way of incorporating non-local dependencies is inefficient and hence leads to expensive training and inference. The contribution of this paper is two-fold. First, this paper presents Constrained Conditional Models (CCMs), a framework that augments linear models with declarative constraints as a way to support decisions in an expressive output space while maintaining modularity and tractability of training. The paper develops, analyzes and compares novel algorithms for CCMs based on Hidden Markov Models and Structured Perceptron. The proposed CCM framework is also compared to task-tailored models, such as semi-CRFs. Second, we propose CoDL, a constraint-driven learning algorithm, which makes use of constraints to guide semi-supervised learning. We provide theoretical justification for CoDL along with empirical results which show the advantage of using declarative constraints in the context of semi-supervised training of probabilistic models.

Zhang  Wei 《Applied Intelligence》2021,51(11):7990-8009

When reinforcement learning with a deep neural network is applied to heuristic search, the search becomes a learning search. In a learning search system, there are two key components: (1) a deep neural network with sufficient expression ability as a heuristic function approximator that estimates the distance from any state to a goal; (2) a strategy to guide the interaction of an agent with its environment to obtain more efficient simulated experience to update the Q-value or V-value function of reinforcement learning. To date, neither component has been sufficiently discussed. This study theoretically discusses the size of a deep neural network for approximating a product function of p piecewise multivariate linear functions. The existence of such a deep neural network with O(n + p) layers and O(dn + dnp + dp) neurons has been proven, where d is the number of variables of the multivariate function being approximated, ε is the approximation error, and n = O(p + log2(pd/ε)). For the second component, this study proposes a general propagational reinforcement-learning-based learning search method that improves the estimate h(.) according to the newly observed distance information about the goals, propagates the improvement bidirectionally in the search tree, and consequently obtains a sequence of more accurate V-values for a sequence of states. Experiments on the maze problems show that our method increases the convergence rate of reinforcement learning by a factor of 2.06 and reduces the number of learning episodes to 1/4 that of other nonpropagating methods.


Prediction of wind speed can provide a reference for the reliable utilization of wind energy. This study focuses on 1-hour, 1-step ahead deterministic wind speed prediction with only wind speed as input. To consider the time-varying characteristics of wind speed series, a dynamic ensemble wind speed prediction model based on deep reinforcement learning is proposed. It includes ensemble learning, multi-objective optimization, and deep reinforcement learning to ensure effectiveness. In part A, deep echo state network enhanced by real-time wavelet packet decomposition is used to construct base models with different vanishing moments. The variety of vanishing moments naturally guarantees the diversity of base models. In part B, multi-objective optimization is adopted to determine the combination weights of base models. The bias and variance of ensemble model are synchronously minimized to improve generalization ability. In part C, the non-dominated solutions of combination weights are embedded into a deep reinforcement learning environment to achieve dynamic selection. By reasonably designing the reinforcement learning environment, it can dynamically select non-dominated solution in each prediction according to the time-varying characteristics of wind speed. Four actual wind speed series are used to validate the proposed dynamic ensemble model. The results show that: (a) The proposed dynamic ensemble model is competitive for wind speed prediction. It significantly outperforms five classic intelligent prediction models and six ensemble methods; (b) Every part of the proposed model is indispensable to improve the prediction accuracy.

We investigate a recently developed abstraction of genetic algorithms (GAs) in which a population of GAs in any generation is represented by a single vector whose elements are the probabilities of the corresponding bit positions being equivalent to 1. The process of evolution is represented by learning the elements of the probability vector; the method is clearly linked to the artificial neural network (ANN) method of competitive learning. We use techniques from ANNs to extend the applicability of the method to non-static problems, to multi-objective criteria, to multi-modal problems and to creating an order on a set of sub-populations.
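The probability-vector idea in the abstract above can be illustrated with a minimal sketch. The update rule used here (moving each probability toward the best sampled bit string, in the style of population-based incremental learning) is an assumption chosen for illustration, not necessarily the authors' exact rule; the names `update_probabilities`, `evolve`, `lr`, and `samples` are hypothetical.

```python
import random


def sample_bits(prob, rng):
    """Draw one bit string; bit i is 1 with probability prob[i]."""
    return [1 if rng.random() < p else 0 for p in prob]


def update_probabilities(prob, best, lr):
    """Move each probability a fraction lr toward the best string's bit.

    Assumed competitive-learning-style rule, not taken from the paper.
    """
    return [(1.0 - lr) * p + lr * b for p, b in zip(prob, best)]


def evolve(fitness, n_bits, generations=200, samples=20, lr=0.1, seed=0):
    """Learn a probability vector that stands in for a whole GA population."""
    rng = random.Random(seed)
    prob = [0.5] * n_bits  # start with no preference for 0 or 1
    for _ in range(generations):
        population = [sample_bits(prob, rng) for _ in range(samples)]
        best = max(population, key=fitness)  # winner of this generation
        prob = update_probabilities(prob, best, lr)
    return prob


if __name__ == "__main__":
    # OneMax toy problem: fitness is the number of 1 bits.
    print(evolve(sum, n_bits=8))
```

The single vector replaces an explicit population, so memory use is independent of the number of sampled individuals; the learning rate `lr` plays the role that selection pressure plays in a conventional GA.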

Devin Schwab  Soumya Ray 《Machine Learning》2017,106(9-10):1569-1598
In this work, we build upon the observation that offline reinforcement learning (RL) is synergistic with task hierarchies that decompose large Markov decision processes (MDPs). Task hierarchies can allow more efficient sample collection from large MDPs, while offline algorithms can learn better policies than the so-called “recursively optimal” or even hierarchically optimal policies learned by standard hierarchical RL algorithms. To enable this synergy, we study sample collection strategies for offline RL that are consistent with a provided task hierarchy while still providing good exploration of the state-action space. We show that naïve extensions of uniform random sampling do not work well in this case and design a strategy that has provably good convergence properties. We also augment the initial set of samples using additional information from the task hierarchy, such as state abstraction. We use the augmented set of samples to learn a policy offline. Given a capable offline RL algorithm, this policy is then guaranteed to have a value greater than or equal to the value of the hierarchically optimal policy. We evaluate our approach on several domains and show that samples generated using a task hierarchy with a suitable strategy allow significantly more sample-efficient convergence than standard offline RL. Further, our approach also shows more sample-efficient convergence to policies with value greater than or equal to hierarchically optimal policies found through an online hierarchical RL approach.

Recently, many models of reinforcement learning with hierarchical or modular structures have been proposed. They decompose a task into simpler subtasks and solve them by using multiple agents. However, these models impose certain restrictions on the topological relations of agents and so on. By relaxing these restrictions, we propose networked reinforcement learning, where each agent in a network acts autonomously by regarding the other agents as a part of its environment. Although convergence to an optimal policy is no longer assured, by means of numerical simulations, we show that our model functions appropriately, at least in certain simple situations. This work was presented in part at the 13th International Symposium on Artificial Life and Robotics, Oita, Japan, January 31–February 2, 2008.


This article is about deep learning (DL) and deep reinforcement learning (DRL) works applied to robotics. Both tools have been shown to be successful in delivering data-driven solutions for robotics tasks, as well as providing a natural way to develop an end-to-end pipeline from the robot’s sensing to its actuation, passing through the generation of a policy to perform the given task. These frameworks have been proven to be able to deal with real-world complications such as noise in sensing, imprecise actuation, variability in the scenarios where the robot is being deployed, among others. Following that vein, and given the growing interest in DL and DRL, the present work starts by providing a brief tutorial on deep reinforcement learning, where the goal is to understand the main concepts and approaches followed in the field. Later, the article describes the main, recent, and most promising approaches of DL and DRL in robotics, with sufficient technical detail to understand the core of the works and to motivate interested readers to initiate their own research in the area. Then, to provide a comparative analysis, we present several taxonomies in which the references can be classified, according to high-level features, the task that the work addresses, the type of system, and the learning techniques used in the work. We conclude by presenting promising research directions in both DL and DRL.


This letter proposes a new reinforcement learning (RL) paradigm that explicitly takes into account input disturbance as well as modeling errors. The use of environmental models in RL is quite popular for both offline learning using simulations and for online action planning. However, the difference between the model and the real environment can lead to unpredictable, and often unwanted, results. Based on the theory of H∞ control, we consider a differential game in which a "disturbing" agent tries to make the worst possible disturbance while a "control" agent tries to make the best control input. The problem is formulated as finding a min-max solution of a value function that takes into account the amount of the reward and the norm of the disturbance. We derive online learning algorithms for estimating the value function and for calculating the worst disturbance and the best control in reference to the value function. We tested the paradigm, which we call robust reinforcement learning (RRL), on the control task of an inverted pendulum. In the linear domain, the policy and the value function learned by online algorithms coincided with those derived analytically by the linear H∞ control theory. For a fully nonlinear swing-up task, RRL achieved robust performance with changes in the pendulum weight and friction, while a standard reinforcement learning algorithm could not deal with these changes. We also applied RRL to the cart-pole swing-up task, and a robust swing-up policy was acquired.
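The min-max problem described in the abstract above can be written down in one plausible form. The abstract itself gives no formula, so everything here is an assumption made for illustration: the symbols $x$ (state), $u$ (control input), $w$ (disturbance), $r$ (reward), $\gamma$ (weight on the disturbance norm), and $\tau$ (discount time constant), as well as the continuous-time discounted form.

```latex
q(t) = r\bigl(x(t), u(t)\bigr) + \gamma^{2}\,\lVert w(t)\rVert^{2},
\qquad
V^{*}(x) = \max_{u}\,\min_{w}\;
\mathbb{E}\!\left[\int_{t}^{\infty} e^{-(s-t)/\tau}\, q(s)\,\mathrm{d}s
\;\middle|\; x(t) = x\right]
```

Under this reading, the disturbing agent lowers the value through its effect on the dynamics but pays $\gamma^{2}\lVert w\rVert^{2}$ for doing so, which keeps the worst-case disturbance bounded; a larger $\gamma$ corresponds to a weaker adversary.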

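Deep reinforcement learning explores a large number of environment samples during training, which makes convergence slow, whereas reusing or transferring knowledge learned from previous tasks (source tasks) has the potential to speed up convergence when learning a new task (target task). To improve learning efficiency, a transfer reinforcement learning algorithm with double Q-network learning is proposed. Based on the actor-critic framework, it transfers knowledge of the optimal value function from the source task, so that the value function network in the target task evaluates the policy more accurately and guides the policy to update quickly toward the optimal policy. The algorithm was applied to OpenAI Gym and to an experiment in which a robotic arm reaches the position of a target object in three-dimensional space, and it achieved better results than conventional deep reinforcement learning algorithms. The experiments show that the proposed algorithm converges faster and explores more stably during training.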

Adaptive fuzzy command acquisition with reinforcement learning
Proposes a four-layered adaptive fuzzy command acquisition network (AFCAN) for adaptively acquiring fuzzy commands via interactions with the user or environment. It can catch the intended information from a sentence (command) given in natural language with fuzzy predicates. The intended information includes a meaningful semantic action and the fuzzy linguistic information of that action. The proposed AFCAN has three important features. First, no restrictions are placed on the fuzzy command input, which is used to specify the desired information, and the network requires no acoustic, prosodic, grammar, or syntactic structure. Second, the linguistic information of an action is learned adaptively and is represented by fuzzy numbers based on α-level sets. Third, the network can learn during the course of performing the task. The AFCAN can perform off-line as well as online learning. For the off-line learning, the mutual-information (MI) supervised learning scheme and the fuzzy backpropagation (FBP) learning scheme are employed when the training data are available in advance. The former learning scheme is used to learn meaningful semantic actions and the latter to learn linguistic information. The AFCAN can also perform online learning interactively when it is in use for fuzzy command acquisition. For the online learning, the MI-reinforcement learning scheme and the fuzzy reinforcement learning scheme are developed for the online learning of meaningful actions and linguistic information, respectively. An experimental system is constructed to illustrate the performance and applicability of the proposed AFCAN.

Shape grammars are a powerful and appealing formalism for automatic shape generation in computer-based design systems. This paper presents a proposal complementing the generative power of shape grammars with reinforcement learning techniques. We use simple (naive) shape grammars capable of generating a large variety of different designs. In order to generate those designs that comply with given design requirements, the grammar is subject to a process of machine learning using reinforcement learning techniques. Based on this method, we have developed a system for architectural design, aimed at generating two-dimensional layout schemes of single-family housing units. Using relatively simple grammar rules, we learn to generate schemes that satisfy a set of requirements stated in a design guideline. Obtained results are presented and discussed.

Shaping multi-agent systems with gradient reinforcement learning
An original reinforcement learning (RL) methodology is proposed for the design of multi-agent systems. In the realistic setting of situated agents with local perception, the task of automatically building a coordinated system is of crucial importance. To that end, we design simple reactive agents in a decentralized way as independent learners. But to cope with the difficulties inherent to RL used in that framework, we have developed an incremental learning algorithm where agents face a sequence of progressively more complex tasks. We illustrate this general framework by computer experiments where agents have to coordinate to reach a global goal. This work has been conducted in part in NICTA’s Canberra laboratory.

In this paper, we present a technique for ensuring the stability of a large class of adaptively controlled systems. We combine IQC models of both the controlled system and the controller with a method of filtering control parameter updates to ensure stable behavior of the controlled system under adaptation of the controller. We present a specific application to a system that uses recurrent neural networks adapted via reinforcement learning techniques. The work presented extends earlier works on stable reinforcement learning with neural networks. Specifically, we apply an improved IQC analysis for RNNs with time-varying weights and evaluate the approach on a more complex control system.

Ren  Changwei  An  Lixingjian  Gu  Zhanquan  Wang  Yuexuan  Gao  Yunjun 《World Wide Web》2020,23(4):2491-2511
With the sharing economy boom, there is a notable increase in the number of car-sharing corporations, which provided a variety of travel options and improved convenience and...

In complex working environments, bearings, which are important machine components, can develop faults at several positions simultaneously. Consequently, multi-label learning, which fully considers the correlation between the different fault positions of a bearing, has become a popular learning paradigm. Deep reinforcement learning (DRL), which combines the perception ability of deep learning with the decision-making ability of reinforcement learning, can be adapted to compound fault diagnosis and is well suited to extracting fault features from raw data. However, DRL is difficult to converge and is prone to unstable training. Therefore, this paper integrates the feature extraction ability of DRL with the knowledge transfer ability of transfer learning (TL) and proposes multi-label transfer reinforcement learning (ML-TRL). Specifically, the proposed method uses an improved trust region policy optimization (TRPO) as the basic DRL framework and pre-trains the fixed convolutional networks of ML-TRL using a multi-label convolutional neural network method. In a compound fault experiment, the results show that the proposed method achieves higher accuracy than other multi-label learning methods. Hence, the proposed method is a compelling alternative for recognizing compound bearing faults.

In this paper, we investigate the use of hierarchical reinforcement learning (HRL) to speed up the acquisition of cooperative multi-agent tasks. We introduce a hierarchical multi-agent reinforcement learning (RL) framework, and propose a hierarchical multi-agent RL algorithm called Cooperative HRL. In this framework, agents are cooperative and homogeneous (use the same task decomposition). Learning is decentralized, with each agent learning three interrelated skills: how to perform each individual subtask, the order in which to carry them out, and how to coordinate with other agents. We define cooperative subtasks to be those subtasks in which coordination among agents significantly improves the performance of the overall task. Those levels of the hierarchy which include cooperative subtasks are called cooperation levels. A fundamental property of the proposed approach is that it allows agents to learn coordination faster by sharing information at the level of cooperative subtasks, rather than attempting to learn coordination at the level of primitive actions. We study the empirical performance of the Cooperative HRL algorithm using two testbeds: a simulated two-robot trash collection task, and a larger four-agent automated guided vehicle (AGV) scheduling problem. We compare the performance and speed of Cooperative HRL with other learning algorithms, as well as several well-known industrial AGV heuristics. We also address the issue of rational communication behavior among autonomous agents in this paper. The goal is for agents to learn both action and communication policies that together optimize the task given a communication cost. We extend the multi-agent HRL framework to include communication decisions and propose a cooperative multi-agent HRL algorithm called COM-Cooperative HRL. In this algorithm, we add a communication level to the hierarchical decomposition of the problem below each cooperation level. 
Before an agent makes a decision at a cooperative subtask, it decides if it is worthwhile to perform a communication action. A communication action has a certain cost and provides the agent with the actions selected by the other agents at a cooperation level. We demonstrate the efficiency of the COM-Cooperative HRL algorithm as well as the relation between the communication cost and the learned communication policy using a multi-agent taxi problem.

Multiple model-based reinforcement learning
We propose a modular reinforcement learning architecture for nonlinear, nonstationary control tasks, which we call multiple model-based reinforcement learning (MMRL). The basic idea is to decompose a complex task into multiple domains in space and time based on the predictability of the environmental dynamics. The system is composed of multiple modules, each of which consists of a state prediction model and a reinforcement learning controller. The "responsibility signal," which is given by the softmax function of the prediction errors, is used to weight the outputs of multiple modules, as well as to gate the learning of the prediction models and the reinforcement learning controllers. We formulate MMRL for both the discrete-time, finite-state case and the continuous-time, continuous-state case. The performance of MMRL was demonstrated for the discrete case in a nonstationary hunting task in a grid world and for the continuous case in a nonlinear, nonstationary control task of swinging up a pendulum with variable physical parameters.
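The responsibility signal described above can be sketched in a few lines. The abstract only states that it is a softmax of the prediction errors; the Gaussian-style score `-(error ** 2) / (2 * sigma ** 2)`, the width parameter `sigma`, and the function names are assumptions made for this illustration.

```python
import math


def responsibility_signals(prediction_errors, sigma=1.0):
    """Softmax over negative squared prediction errors.

    Modules that predict the dynamics better (smaller error) receive a
    larger weight. The squared-error score and sigma are assumed here.
    """
    scores = [-(e ** 2) / (2.0 * sigma ** 2) for e in prediction_errors]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def blend_outputs(module_outputs, responsibilities):
    """Weight each module's control output by its responsibility."""
    return sum(w * u for w, u in zip(responsibilities, module_outputs))


if __name__ == "__main__":
    weights = responsibility_signals([0.1, 1.0, 2.0])
    print(weights)
    print(blend_outputs([0.5, -0.2, 0.9], weights))
```

The same weights could also scale each module's learning rate, which is one way to realize the gating of learning mentioned in the abstract; a smaller `sigma` makes the module selection sharper.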

