期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Multiple mini-robots navigation using a collaborative multiagent reinforcement learning framework

Piyabhum Chaysri Kostas Vlachos 《Advanced Robotics》2020,34(13):902-916

In this work we investigate the use of a reinforcement learning (RL) framework for the autonomous navigation of a group of mini-robots in a multi-agent collaborative environment. Each mini-robot is driven by inertial forces provided by two vibration motors that are controlled by a simple and efficient low-level speed controller. The action of the RL agent is the direction of each mini-robot, and it is based on the position of each mini-robot, the distance between them and the sign of the distance gradient between each mini-robot and the nearest one. Each mini-robot is considered a moving obstacle that must be avoided by the others. We propose suitable state space and reward function that result in an efficient collaborative RL framework. The classical and the double Q-learning algorithms are employed, where the latter is considered to learn optimal policies of mini-robots that offers more stable and reliable learning process. A simulation environment is created, using the ROS framework, that include a group of four mini-robots. The dynamic model of each mini-robot and of the vibration motors is also included. Several application scenarios are simulated and the results are presented to demonstrate the performance of the proposed approach. 相似文献

2.

Inverse reinforcement learning-based time-dependent A* planner for human-aware robot navigation with local vision

Shiying Sun Xiaoguang Zhao Qianzhong Li Min Tan 《Advanced Robotics》2020,34(13):888-901

In an environment where robots coexist with humans, mobile robots should be human-aware and comply with humans' behavioural norms so as to not disturb humans' personal space and activities. In this work, we propose an inverse reinforcement learning-based time-dependent A* planner for human-aware robot navigation with local vision. In this method, the planning process of time-dependent A* is regarded as a Markov decision process and the cost function of the time-dependent A* is learned using the inverse reinforcement learning via capturing humans' demonstration trajectories. With this method, a robot can plan a path that complies with humans' behaviour patterns and the robot's kinematics. When constructing feature vectors of the cost function, considering the local vision characteristics, we propose a visual coverage feature for enabling robots to learn from how humans move in a limited visual field. The effectiveness of the proposed method has been validated by experiments in real-world scenarios: using this approach robots can effectively mimic human motion patterns when avoiding pedestrians; furthermore, in a limited visual field, robots can learn to choose a path that enables them to have the larger visual coverage which shows a better navigation performance. 相似文献

3.

Efficient vision-based navigation

Armin Hornung Maren Bennewitz Hauke Strasdat 《Autonomous Robots》2010,29(2):137-149

In this article, we present a novel approach to learning efficient navigation policies for mobile robots that use visual features for localization. As fast movements of a mobile robot typically introduce inherent motion blur in the acquired images, the uncertainty of the robot about its pose increases in such situations. As a result, it cannot be ensured anymore that a navigation task can be executed efficiently since the robot’s pose estimate might not correspond to its true location. We present a reinforcement learning approach to determine a navigation policy to reach the destination reliably and, at the same time, as fast as possible. Using our technique, the robot learns to trade off velocity against localization accuracy and implicitly takes the impact of motion blur on observations into account. We furthermore developed a method to compress the learned policy via a clustering approach. In this way, the size of the policy representation is significantly reduced, which is especially desirable in the context of memory-constrained systems. Extensive simulated and real-world experiments carried out with two different robots demonstrate that our learned policy significantly outperforms policies using a constant velocity and more advanced heuristics. We furthermore show that the policy is generally applicable to different indoor and outdoor scenarios with varying landmark densities as well as to navigation tasks of different complexity. 相似文献

4.

Learning to Acquire Whole-Body Humanoid Center of Mass Movements to Achieve Dynamic Tasks

《Advanced Robotics》2013,27(10):1125-1142

This paper presents a novel approach for acquiring dynamic whole-body movements on humanoid robots focused on learning a control policy for the center of mass (CoM). In our approach, we combine both a model-based CoM controller and a model-free reinforcement learning (RL) method to acquire dynamic whole-body movements in humanoid robots. (i) To cope with high dimensionality, we use a model-based CoM controller as a basic controller that derives joint angular velocities from the desired CoM velocity. The balancing issue can also be considered in the controller. (ii) The RL method is used to acquire a controller that generates the desired CoM velocity based on the current state. To demonstrate the effectiveness of our approach, we apply it to a ball-punching task on a simulated humanoid robot model. The acquired whole-body punching movement was also demonstrated on Fujitsu's Hoap-2 humanoid robot. 相似文献

5.

Preference-based reinforcement learning: a formal framework and a policy iteration algorithm

Johannes Fürnkranz Eyke Hüllermeier Weiwei Cheng Sang-Hyeun Park 《Machine Learning》2012,89(1-2):123-156

This paper makes a first step toward the integration of two subfields of machine learning, namely preference learning and reinforcement learning (RL). An important motivation for a preference-based approach to reinforcement learning is the observation that in many real-world domains, numerical feedback signals are not readily available, or are defined arbitrarily in order to satisfy the needs of conventional RL algorithms. Instead, we propose an alternative framework for reinforcement learning, in which qualitative reward signals can be directly used by the learner. The framework may be viewed as a generalization of the conventional RL framework in which only a partial order between policies is required instead of the total order induced by their respective expected long-term reward. Therefore, building on novel methods for preference learning, our general goal is to equip the RL agent with qualitative policy models, such as ranking functions that allow for sorting its available actions from most to least promising, as well as algorithms for learning such models from qualitative feedback. As a proof of concept, we realize a first simple instantiation of this framework that defines preferences based on utilities observed for trajectories. To that end, we build on an existing method for approximate policy iteration based on roll-outs. While this approach is based on the use of classification methods for generalization and policy learning, we make use of a specific type of preference learning method called label ranking. Advantages of preference-based approximate policy iteration are illustrated by means of two case studies. 相似文献

6.

Developing reinforcement learning for adaptive co-construction of continuous high-dimensional state and action spaces

Masato Nagayoshi Hajime Murao Hisashi Tamaki 《Artificial Life and Robotics》2012,17(2):204-210

Engineers and researchers are paying more attention to reinforcement learning (RL) as a key technique for realizing adaptive and autonomous decentralized systems. In general, however, it is not easy to put RL into practical use. Our approach mainly deals with the problem of designing state and action spaces. Previously, an adaptive state space construction method which is called a ??state space filter?? and an adaptive action space construction method which is called ??switching RL??, have been proposed after the other space has been fixed. Then, we have reconstituted these two construction methods as one method by treating the former method and the latter method as a combined method for mimicking an infant??s perceptual and motor developments and we have proposed a method which is based on introducing and referring to ??entropy??. In this paper, a computational experiment was conducted using a so-called ??robot navigation problem?? with three-dimensional continuous state space and two-dimensional continuous action space which is more complicated than a so-called ??path planning problem??. As a result, the validity of the proposed method has been confirmed. 相似文献

7.

Deep imitation learning for 3D navigation tasks

Hussein Ahmed Elyan Eyad Gaber Mohamed Medhat Jayne Chrisina 《Neural computing & applications》2018,29(7):389-404

Deep learning techniques have shown success in learning from raw high-dimensional data in various applications. While deep reinforcement learning is recently gaining popularity as a method to train intelligent agents, utilizing deep learning in imitation learning has been scarcely explored. Imitation learning can be an efficient method to teach intelligent agents by providing a set of demonstrations to learn from. However, generalizing to situations that are not represented in the demonstrations can be challenging, especially in 3D environments. In this paper, we propose a deep imitation learning method to learn navigation tasks from demonstrations in a 3D environment. The supervised policy is refined using active learning in order to generalize to unseen situations. This approach is compared to two popular deep reinforcement learning techniques: deep-Q-networks and Asynchronous actor-critic (A3C). The proposed method as well as the reinforcement learning methods employ deep convolutional neural networks and learn directly from raw visual input. Methods for combining learning from demonstrations and experience are also investigated. This combination aims to join the generalization ability of learning by experience with the efficiency of learning by imitation. The proposed methods are evaluated on 4 navigation tasks in a 3D simulated environment. Navigation tasks are a typical problem that is relevant to many real applications. They pose the challenge of requiring demonstrations of long trajectories to reach the target and only providing delayed rewards (usually terminal) to the agent. The experiments show that the proposed method can successfully learn navigation tasks from raw visual input while learning from experience methods fail to learn an effective policy. Moreover, it is shown that active learning can significantly improve the performance of the initially learned policy using a small number of active samples.

相似文献

8.

A topological reinforcement learning agent for navigation

Arthur P. S. Braga Aluízio F. R. Araújo 《Neural computing & applications》2003,12(3-4):220-236

This article proposes a reinforcement learning procedure for mobile robot navigation using a latent-like learning schema. Latent learning refers to learning that occurs in the absence of reinforcement signals and is not apparent until reinforcement is introduced. This concept considers that part of a task can be learned before the agent receives any indication of how to perform such a task. In the proposed topological reinforcement learning agent (TRLA), a topological map is used to perform the latent learning. The propagation of the reinforcement signal throughout the topological neighborhoods of the map permits the estimation of a value function which takes in average less trials and with less updatings per trial than six of the main temporal difference reinforcement learning algorithms: Q-learning, SARSA, Q(λ)-learning, SARSA(λ), Dyna-Q and fast Q(λ)-learning. The RL agents were tested in four different environments designed to consider a growing level of complexity in accomplishing navigation tasks. The tests suggested that the TRLA chooses shorter trajectories (in the number of steps) and/or requires less value function updatings in each trial than the other six reinforcement learning (RL) algorithms. 相似文献

9.

Offline reinforcement learning with anderson acceleration for robotic tasks

Zuo Guoyu Huang Shuai Li Jiangeng Gong Daoxiong 《Applied Intelligence》2022,52(9):9885-9898

Applied Intelligence - Offline reinforcement learning (RL) can learn effective policy from a fixed batch of data without interaction. However, the real-world requirements, such as better... 相似文献

10.

A robust reinforcement learning using the concept of sliding mode control

M. Obayashi N. Nakahara T. Kuremoto K. Kobayashi 《Artificial Life and Robotics》2009,13(2):526-530

In this article, we propose a new control method using reinforcement learning (RL) with the concept of sliding mode control (SMC). Some remarkable characteristics of the SMC method are good robustness and stability for deviations from control conditions. On the other hand, RL may be applicable to complex systems that are difficult to model. However, applying reinforcement learning to a real system has a serious problem, i.e., many trials are required for learning. We intend to develop a new control method with good characteristics for both these methods. To realize it, we employ the actor-critic method, a kind of RL, to unite with the SMC. We are able to verify the effectiveness of the proposed control method through a computer simulation of inverted pendulum control without the use of inverted pendulum dynamics. In particular, it is shown that the proposed method enables the RL to learn in fewer trials than the reinforcement learning method. This work was presented in part at the 13th International Symposium on Artificial Life and Robotics, Oita, Japan, January 31–February 2, 2008 相似文献

11.

Supervised fuzzy reinforcement learning for robot navigation

《Applied Soft Computing》2016

This paper addresses a new method for combination of supervised learning and reinforcement learning (RL). Applying supervised learning in robot navigation encounters serious challenges such as inconsistent and noisy data, difficulty for gathering training data, and high error in training data. RL capabilities such as training only by one evaluation scalar signal, and high degree of exploration have encouraged researchers to use RL in robot navigation problem. However, RL algorithms are time consuming as well as suffer from high failure rate in the training phase. Here, we propose Supervised Fuzzy Sarsa Learning (SFSL) as a novel idea for utilizing advantages of both supervised and reinforcement learning algorithms. A zero order Takagi–Sugeno fuzzy controller with some candidate actions for each rule is considered as the main module of robot's controller. The aim of training is to find the best action for each fuzzy rule. In the first step, a human supervisor drives an E-puck robot within the environment and the training data are gathered. In the second step as a hard tuning, the training data are used for initializing the value (worth) of each candidate action in the fuzzy rules. Afterwards, the fuzzy Sarsa learning module, as a critic-only based fuzzy reinforcement learner, fine tunes the parameters of conclusion parts of the fuzzy controller online. The proposed algorithm is used for driving E-puck robot in the environment with obstacles. The experiment results show that the proposed approach decreases the learning time and the number of failures; also it improves the quality of the robot's motion in the testing environments. 相似文献

12.

A consideration of human immunity-based reinforcement learning with continuous states 总被引：1，自引：1，他引：0

Shu Hosokawa Kazushi Nakano Kazunori Sakurama 《Artificial Life and Robotics》2010,15(4):560-564

Many reinforcement learning methods have been studied on the assumption that a state is discretized and the environment size is predetermined. However, an operating environment may have a continuous state and its size may not be known in advance, e.g., in robot navigation and control. When applying these methods to the environment described above, we may need a large amount of time for learning or failing to learn. In this study, we improve our previous human immunity-based reinforcement learning method so that it will work in continuous state space environments. Since our method selects an action based on the distance between the present state and the memorized action, information about the environment (e.g., environment size) is not required in advance. The validity of our method is demonstrated through simulations for the swingup control of an inverted pendulum. 相似文献

13.

Model-based reinforcement learning for partially observable games with sampling-based state estimation

Fujita H Ishii S 《Neural computation》2007,19(11):3051-3087

Games constitute a challenging domain of reinforcement learning (RL) for acquiring strategies because many of them include multiple players and many unobservable variables in a large state space. The difficulty of solving such realistic multiagent problems with partial observability arises mainly from the fact that the computational cost for the estimation and prediction in the whole state space, including unobservable variables, is too heavy. To overcome this intractability and enable an agent to learn in an unknown environment, an effective approximation method is required with explicit learning of the environmental model. We present a model-based RL scheme for large-scale multiagent problems with partial observability and apply it to a card game, hearts. This game is a well-defined example of an imperfect information game and can be approximately formulated as a partially observable Markov decision process (POMDP) for a single learning agent. To reduce the computational cost, we use a sampling technique in which the heavy integration required for the estimation and prediction can be approximated by a plausible number of samples. Computer simulation results show that our method is effective in solving such a difficult, partially observable multiagent problem. 相似文献

14.

TEXPLORE: real-time sample-efficient reinforcement learning for robots

Todd Hester Peter Stone 《Machine Learning》2013,90(3):385-429

The use of robots in society could be expanded by using reinforcement learning (RL) to allow robots to learn and adapt to new situations online. RL is a paradigm for learning sequential decision making tasks, usually formulated as a Markov Decision Process (MDP). For an RL algorithm to be practical for robotic control tasks, it must learn in very few samples, while continually taking actions in real-time. In addition, the algorithm must learn efficiently in the face of noise, sensor/actuator delays, and continuous state features. In this article, we present texplore, the first algorithm to address all of these challenges together. texplore is a model-based RL method that learns a random forest model of the domain which generalizes dynamics to unseen states. The agent explores states that are promising for the final policy, while ignoring states that do not appear promising. With sample-based planning and a novel parallel architecture, texplore can select actions continually in real-time whenever necessary. We empirically evaluate the importance of each component of texplore in isolation and then demonstrate the complete algorithm learning to control the velocity of an autonomous vehicle in real-time. 相似文献

15.

强化学习在足球机器人基本动作学习中的应用 总被引：1，自引：0，他引：1

段勇杨淮清崔宝侠徐心和《机器人》2008,30(5):1

主要研究了强化学习算法及其在机器人足球比赛技术动作学习问题中的应用．强化学习的状态空间和动作空间过大或变量连续,往往导致学习的速度过慢甚至难于收敛．针对这一问题,提出了基于T-S 模型模糊神经网络的强化学习方法,能够有效地实现强化学习状态空间到动作空间的映射．此外,使用提出的强化学习方法设计了足球机器人的技术动作,研究了在不需要专家知识和环境模型情况下机器人的行为学习问题．最后,通过实验证明了所研究方法的有效性,其能够满足机器人足球比赛的需要．相似文献

16.

一种基于自生成样本学习的奖赏塑形方法

钱煜俞扬周志华《软件学报》2013,24(11):2667-2675

强化学习通过从以往的决策反馈中学习,使Agent 做出正确的短期决策,以最大化其获得的累积奖赏值.以往研究发现,奖赏塑形方法通过提供简单、易学的奖赏替代函数(即奖赏塑性函数)来替换真实的环境奖赏,能够有效地提高强化学习性能.然而奖赏塑形函数通常是在领域知识或者最优策略示例的基础上建立的,均需要专家参与,代价高昂.研究是否可以在强化学习过程中自动地学习有效的奖赏塑形函数.通常,强化学习算法在学习过程中会采集大量样本.这些样本虽然有很多是失败的尝试,但对构造奖赏塑形函数可能提供有用信息.提出了针对奖赏塑形的新型最优策略不变条件,并在此基础上提出了RFPotential 方法,从自生成样本中学习奖赏塑形.在多个强化学习算法和问题上进行了实验,其结果表明,该方法可以加速强化学习过程. 相似文献

17.

An approach to the design of reinforcement functions in real world,agent-based applications 总被引：2，自引：0，他引：2

Bonarini A. Bonacina C. Matteucci M. 《IEEE transactions on systems, man, and cybernetics. Part B, Cybernetics》2001,31(3):288-301

The success of any reinforcement learning (RL) application is in large part due to the design of an appropriate reinforcement function. A methodological framework to support the design of reinforcement functions has not been defined yet, and this critical and often underestimated activity is left to the ability of the RL application designer. We propose an approach to support reinforcement function design in RL applications concerning learning behaviors for autonomous agents. We define some dimensions along which we can describe reinforcement functions; we consider the distribution of reinforcement values, their coherence and their matching with the designer's perspective. We give hints to define measures that objectively describe the reinforcement function; we discuss the trade-offs that should be considered to improve learning and we introduce the dimensions along which this improvement can be expected. The approach we are presenting is general enough to be adopted in a large number of RL projects. We show how to apply it in the design of learning classifier systems (LCS) applications. We consider a simple, but quite complete case study in evolutionary robotics, and we discuss reinforcement function design issues in this sample context. 相似文献

18.

Reinforcement learning for dynamic environment: a classification of dynamic environments and a detection method of environmental changes

Masato Nagayoshi Hajime Murao H. Tamaki 《Artificial Life and Robotics》2013,18(1-2):104-108

Engineers and researchers are paying more attention to reinforcement learning (RL) as a key technique for realizing computational intelligence such as adaptive and autonomous decentralized systems. In general, it is not easy to put RL into practical use. In prior research our approach mainly dealt with the problem of designing state and action spaces and we have proposed an adaptive co-construction method of state and action spaces. However, it is more difficult to design state and action spaces in dynamic environments than in static ones. Therefore, it is even more effective to use an adaptive co-construction method of state and action spaces in dynamic environments. In this paper, our approach mainly deals with a problem of adaptation in dynamic environments. First, we classify tasks of dynamic environments and propose a detection method of environmental changes to adapt to dynamic environments. Next, we conducted computational experiments using a so-called “path planning problem” with a slowly changing environment where the aging of the system is assumed. The performances of a conventional RL method and the proposed detection method were confirmed. 相似文献

19.

Evolutionary Development of Hierarchical Learning Structures

Elfwing S. Uchibe E. Doya K. Christensen H.I. 《Evolutionary Computation, IEEE Transactions on》2007,11(2):249-264

Hierarchical reinforcement learning (RL) algorithms can learn a policy faster than standard RL algorithms. However, the applicability of hierarchical RL algorithms is limited by the fact that the task decomposition has to be performed in advance by the human designer. We propose a Lamarckian evolutionary approach for automatic development of the learning structure in hierarchical RL. The proposed method combines the MAXQ hierarchical RL method and genetic programming (GP). In the MAXQ framework, a subtask can optimize the policy independently of its parent task's policy, which makes it possible to reuse learned policies of the subtasks. In the proposed method, the MAXQ method learns the policy based on the task hierarchies obtained by GP, while the GP explores the appropriate hierarchies using the result of the MAXQ method. To show the validity of the proposed method, we have performed simulation experiments for a foraging task in three different environmental settings. The results show strong interconnection between the obtained learning structures and the given task environments. The main conclusion of the experiments is that the GP can find a minimal strategy, i.e., a hierarchy that minimizes the number of primitive subtasks that can be executed for each type of situation. The experimental results for the most challenging environment also show that the policies of the subtasks can continue to improve, even after the structure of the hierarchy has been evolutionary stabilized, as an effect of Lamarckian mechanisms 相似文献

20.

Continuous-action reinforcement learning with fast policy search and adaptive basis function selection

Xin Xu Chunming Liu Dewen Hu 《Soft Computing - A Fusion of Foundations, Methodologies and Applications》2011,15(6):1055-1070

As an important approach to solving complex sequential decision problems, reinforcement learning (RL) has been widely studied in the community of artificial intelligence and machine learning. However, the generalization ability of RL is still an open problem and it is difficult for existing RL algorithms to solve Markov decision problems (MDPs) with both continuous state and action spaces. In this paper, a novel RL approach with fast policy search and adaptive basis function selection, which is called Continuous-action Approximate Policy Iteration (CAPI), is proposed for RL in MDPs with both continuous state and action spaces. In CAPI, based on the value functions estimated by temporal-difference learning, a fast policy search technique is suggested to search for optimal actions in continuous spaces, which is computationally efficient and easy to implement. To improve the generalization ability and learning efficiency of CAPI, two adaptive basis function selection methods are developed so that sparse approximation of value functions can be obtained efficiently both for linear function approximators and kernel machines. Simulation results on benchmark learning control tasks with continuous state and action spaces show that the proposed approach not only can converge to a near-optimal policy in a few iterations but also can obtain comparable or even better performance than Sarsa-learning, and previous approximate policy iteration methods such as LSPI and KLSPI. 相似文献