Similar Documents
20 similar documents found.
1.
While Reinforcement Learning (RL) is not traditionally designed for interactive supervisory input from a human teacher, several works on both robot and software agents have adapted it for human input by letting a human trainer control the reward signal. In this work, we experimentally examine the assumption underlying these works, namely that the human-given reward is compatible with the traditional RL reward signal. We describe an experimental platform with a simulated RL robot and present an analysis of real-time human teaching behavior found in a study in which untrained subjects taught the robot to perform a new task. We report three main observations on how people administer feedback when teaching a Reinforcement Learning agent: (a) they use the reward channel not only for feedback, but also for future-directed guidance; (b) they have a positive bias to their feedback, possibly using the signal as a motivational channel; and (c) they change their behavior as they develop a mental model of the robotic learner. Given this, we made specific modifications to the simulated RL robot, and analyzed and evaluated its learning behavior in four follow-up experiments with human trainers. We report significant improvements on several learning measures. This work demonstrates the importance of understanding the human-teacher/robot-learner partnership in order to design algorithms that support how people want to teach and simultaneously improve the robot's learning behavior.
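
A minimal sketch (not the authors' implementation) of the kind of learner studied here: a tabular Q-learning agent whose reward comes from a human trainer and whose action selection can be pre-empted by future-directed guidance. The names human_feedback, human_guidance and the toy grid task are hypothetical placeholders.

import random
from collections import defaultdict

ACTIONS = ["up", "down", "left", "right"]
alpha, gamma, epsilon = 0.3, 0.9, 0.1
Q = defaultdict(float)                     # Q[(state, action)]

def human_feedback(state, action):
    # Placeholder for the trainer's reward channel (e.g., key presses in [-1, 1]).
    return 0.0

def human_guidance(state):
    # Placeholder for future-directed guidance: the trainer may suggest an action.
    return None

def choose_action(state):
    suggested = human_guidance(state)
    if suggested is not None:              # guidance pre-empts exploration
        return suggested
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def step(state, action):
    # Placeholder environment transition for a toy 5x5 grid task.
    x, y = state
    dx, dy = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}[action]
    return (min(max(x + dx, 0), 4), min(max(y + dy, 0), 4))

def episode():
    state = (0, 0)
    for _ in range(50):
        action = choose_action(state)
        next_state = step(state, action)
        r = human_feedback(state, action)  # human-delivered scalar reward
        td_target = r + gamma * max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (td_target - Q[(state, action)])
        state = next_state

for _ in range(100):
    episode()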

2.
Machine learning is traditionally formalized and investigated as the study of learning concepts and decision functions from labeled examples, requiring a representation that encodes information about the domain of the decision function to be learned. We are interested in providing a way for a human teacher to interact with an automated learner using natural instructions, thus allowing the teacher to communicate the relevant domain expertise to the learner without necessarily knowing anything about the internal representations used in the learning process. In this paper we suggest viewing the process of learning a decision function as a natural language lesson interpretation problem, as opposed to learning from labeled examples. This view of machine learning is motivated by human learning processes, in which the learner is given a lesson describing the target concept directly and a few instances exemplifying it. We introduce a learning algorithm for the lesson interpretation problem that receives feedback from its performance on the final task, while learning jointly (1) how to interpret the lesson and (2) how to use this interpretation to do well on the final task. This setting differs from traditional machine learning by focusing on supplying the learner only with information that can be provided by a task expert. We evaluate our approach by applying it to the rules of the Solitaire card game. We show that our learning approach can eventually use natural language instructions to learn the target concept and play the game legally. Furthermore, we show that the learned semantic interpreter also generalizes to previously unseen instructions.

3.
This paper takes a first step toward the integration of two subfields of machine learning, namely preference learning and reinforcement learning (RL). An important motivation for a preference-based approach to reinforcement learning is the observation that in many real-world domains, numerical feedback signals are not readily available, or are defined arbitrarily in order to satisfy the needs of conventional RL algorithms. Instead, we propose an alternative framework for reinforcement learning, in which qualitative reward signals can be directly used by the learner. The framework may be viewed as a generalization of the conventional RL framework in which only a partial order between policies is required instead of the total order induced by their respective expected long-term reward. Therefore, building on novel methods for preference learning, our general goal is to equip the RL agent with qualitative policy models, such as ranking functions that allow for sorting its available actions from most to least promising, as well as algorithms for learning such models from qualitative feedback. As a proof of concept, we realize a first simple instantiation of this framework that defines preferences based on utilities observed for trajectories. To that end, we build on an existing method for approximate policy iteration based on roll-outs. While that approach relies on classification methods for generalization and policy learning, we make use of a specific type of preference learning method called label ranking. Advantages of preference-based approximate policy iteration are illustrated by means of two case studies.
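
A toy sketch of the rollout-based idea under strong simplifying assumptions: trajectories started with different actions are compared only qualitatively (better or worse, no reward magnitudes), and the per-state action ranking induced by the pairwise wins serves as the policy. The environment and the comparison rule below are hypothetical stand-ins, not the authors' case studies.

import random
from itertools import combinations
from collections import defaultdict

ACTIONS = [0, 1]          # toy task: move left/right on a line; the goal is position 5

def rollout(state, first_action, horizon=10):
    # Trajectory obtained by taking first_action, then acting randomly.
    traj, pos, action = [state], state, first_action
    for _ in range(horizon):
        pos = pos + (1 if action == 1 else -1)
        traj.append(pos)
        if pos == 5:
            break
        action = random.choice(ACTIONS)
    return traj

def preferred(traj_a, traj_b):
    # Qualitative comparison only: a trajectory reaching the goal sooner is preferred.
    ta = len(traj_a) if traj_a[-1] == 5 else float("inf")
    tb = len(traj_b) if traj_b[-1] == 5 else float("inf")
    if ta == tb:
        return None
    return "a" if ta < tb else "b"

def rank_actions(state, n_rollouts=20):
    # Aggregate pairwise preferences into a ranking of the available actions.
    wins = defaultdict(int)
    for a, b in combinations(ACTIONS, 2):
        for _ in range(n_rollouts):
            result = preferred(rollout(state, a), rollout(state, b))
            if result == "a":
                wins[a] += 1
            elif result == "b":
                wins[b] += 1
    return sorted(ACTIONS, key=lambda act: wins[act], reverse=True)

policy = {s: rank_actions(s)[0] for s in range(5)}   # greedy w.r.t. the learned ranking
print(policy)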

4.
The behavior of reinforcement learning (RL) algorithms is best understood in completely observable, discrete-time controlled Markov chains with finite state and action spaces. In contrast, robot-learning domains are inherently continuous both in time and space, and moreover are partially observable. Here we suggest a systematic approach to solving such problems, in which the available qualitative and quantitative knowledge is used to reduce the complexity of the learning task. The steps of the design process are to: (i) decompose the task into subtasks using the qualitative knowledge at hand; (ii) design local controllers to solve the subtasks using the available quantitative knowledge; and (iii) learn a coordination of these controllers by means of reinforcement learning. It is argued that the approach enables fast, semi-automatic, but still high-quality robot control, as no fine-tuning of the local controllers is needed. The approach was verified on a non-trivial real-life robot task. Several RL algorithms were compared by ANOVA, and it was found that the model-based approach worked significantly better than the model-free approach. The learnt switching strategy performed comparably to a handcrafted version. Moreover, the learnt strategy seemed to exploit certain properties of the environment which were not foreseen in advance, thus supporting the view that adaptive algorithms are advantageous over non-adaptive ones in complex environments.
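
A minimal sketch of step (iii) only, assuming hand-designed local controllers already exist: a Q-learner chooses which controller to run in each coarse state, runs it for several primitive steps, and updates toward the accumulated reward, SMDP-style. The controllers and the 1-D toy task are hypothetical placeholders, not the robot task from the paper.

import random
from collections import defaultdict

alpha, gamma, epsilon = 0.2, 0.95, 0.1
Q = defaultdict(float)                          # Q[(coarse_state, controller_index)]

def controller_coarse(x): return x + 2          # hypothetical fast, coarse local controller
def controller_fine(x):   return x + 1          # hypothetical slow, precise local controller
CONTROLLERS = [controller_coarse, controller_fine]

def coarse_state(x):
    return "far" if x < 8 else "near"           # qualitative discretization of the state

def run_controller(x, idx, k=3):
    # Execute the chosen controller for up to k primitive steps; accumulate discounted reward.
    total, discount = 0.0, 1.0
    for _ in range(k):
        x = CONTROLLERS[idx](x)
        total += discount * (1.0 if x >= 10 else -0.01)
        discount *= gamma
        if x >= 10:
            break
    return x, total, discount

for _ in range(500):                            # learn the switching (coordination) strategy
    x = 0
    while x < 10:
        s = coarse_state(x)
        if random.random() < epsilon:
            idx = random.randrange(len(CONTROLLERS))
        else:
            idx = max(range(len(CONTROLLERS)), key=lambda i: Q[(s, i)])
        x, reward, discount = run_controller(x, idx)
        best_next = 0.0 if x >= 10 else max(Q[(coarse_state(x), i)] for i in range(len(CONTROLLERS)))
        Q[(s, idx)] += alpha * (reward + discount * best_next - Q[(s, idx)])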

5.
Imitation is an important learning mechanism of widespread utility and common occurrence. This article presents a theory and working computational model of the detailed mechanisms of imitation. The model is in the restricted domain of the learning of pencil-and-paper procedures. The task that is modelled is that of a teacher demonstrating the steps of a procedure, such as long division, to a student by means of one or more examples. Such a task can be learned by an imitation-learning mechanism, but the mechanism has a much wider range of application. Imitation is treated as a four-stage process: the events performed by the teacher are segmented by the learner; the events are encoded and explained in terms of spatial relations between objects; repeated patterns in the events are recognized; and finally, different examples are merged together. This model is implemented as a computer program, Learning Algorithms from Worked Examples (LAWE).

6.
The behavior of reinforcement learning (RL) algorithms is best understood in completely observable, discrete-time controlled Markov chains with finite state and action spaces. In contrast, robot-learning domains are inherently continuous both in time and space, and moreover are partially observable. Here we suggest a systematic approach to solving such problems, in which the available qualitative and quantitative knowledge is used to reduce the complexity of the learning task. The steps of the design process are to: (i) decompose the task into subtasks using the qualitative knowledge at hand; (ii) design local controllers to solve the subtasks using the available quantitative knowledge; and (iii) learn a coordination of these controllers by means of reinforcement learning. It is argued that the approach enables fast, semi-automatic, but still high-quality robot control, as no fine-tuning of the local controllers is needed. The approach was verified on a non-trivial real-life robot task. Several RL algorithms were compared by ANOVA, and it was found that the model-based approach worked significantly better than the model-free approach. The learnt switching strategy performed comparably to a handcrafted version. Moreover, the learnt strategy seemed to exploit certain properties of the environment which were not foreseen in advance, thus supporting the view that adaptive algorithms are advantageous over non-adaptive ones in complex environments.

7.
S. Ben-David. Algorithmica, 1998, 22(1-2): 3-17
Commonly, in learning theory, the task of the learner is to identify an unknown target object. We consider a variant of this basic task in which the learner is required only to decide whether the unknown target has a certain property. We allow an infinite learning process in which the learner is required to eventually arrive at the correct answer. We say that a problem for which such a learning algorithm exists is Decidable In the Limit (DIL). We analyze the class of DIL problems and provide a necessary and sufficient condition for the membership of a decision problem in this class. We offer an algorithm for any DIL problem, and apply it to several types of learning tasks. We introduce an extension of the usual Inductive Inference learning model: Inductive Inference with a Cheating Teacher. In this model the teacher may choose to present to the learner not only a language belonging to the agreed-upon family of languages, but also an arbitrary language outside this family. In such a case we require that the learner will be able to eventually detect the faulty choice made by the teacher. We show that such a strong type of learning is possible, and that there exist learning algorithms that will fail only on arbitrarily small sets of faulty languages. Furthermore, if an a priori probability distribution P, according to which the target f is being chosen, is available to the algorithm, then it can be strengthened into a finite algorithm. More precisely, for many distributions P, there exists a polynomial function l such that, for every 0 < δ < 1, there is an algorithm using at most l(log(1/δ)) many probes that succeeds on more than (1-δ) of the f's (as measured by P). We believe that the new approach presented here will be found useful for many further applications.

8.
Application of Reinforcement Learning to Basic Action Learning for Soccer Robots
This paper studies reinforcement learning algorithms and their application to learning technical actions for robot soccer competition. When the state and action spaces of reinforcement learning are too large, or the variables are continuous, learning often becomes very slow or even fails to converge. To address this problem, a reinforcement learning method based on a T-S model fuzzy neural network is proposed, which can effectively realize the mapping from the reinforcement learning state space to the action space. In addition, the proposed reinforcement learning method is used to design the technical actions of soccer robots, and robot behavior learning without expert knowledge or an environment model is studied. Finally, experiments demonstrate the effectiveness of the studied method, which can meet the needs of robot soccer competition.

9.
This paper addresses a new method for combining supervised learning and reinforcement learning (RL). Applying supervised learning to robot navigation encounters serious challenges such as inconsistent and noisy data, difficulty in gathering training data, and high error in training data. RL capabilities, such as training from only a single scalar evaluation signal and a high degree of exploration, have encouraged researchers to use RL for the robot navigation problem. However, RL algorithms are time-consuming and suffer from a high failure rate in the training phase. Here, we propose Supervised Fuzzy Sarsa Learning (SFSL) as a novel way to exploit the advantages of both supervised and reinforcement learning algorithms. A zero-order Takagi–Sugeno fuzzy controller with several candidate actions for each rule is considered as the main module of the robot's controller. The aim of training is to find the best action for each fuzzy rule. In the first step, a human supervisor drives an E-puck robot within the environment and the training data are gathered. In the second step, as hard tuning, the training data are used to initialize the value (worth) of each candidate action in the fuzzy rules. Afterwards, the fuzzy Sarsa learning module, as a critic-only fuzzy reinforcement learner, fine-tunes the parameters of the conclusion parts of the fuzzy controller online. The proposed algorithm is used for driving an E-puck robot in an environment with obstacles. The experimental results show that the proposed approach decreases the learning time and the number of failures, and also improves the quality of the robot's motion in the testing environments.
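
A compressed sketch of the two-step idea under simplifying assumptions: simple triangular memberships stand in for the Takagi–Sugeno rule structure, the demonstration data and the 1-D toy task are invented for illustration, and they are not the authors' robot setup. Demonstrations first initialize the worth of each candidate action per rule (hard tuning); a Sarsa-style update then fine-tunes those values online.

import random

CENTERS = [0.0, 0.5, 1.0]                 # rule centers of a zero-order fuzzy controller (1-D state)
ACTIONS = [-1, 0, 1]                      # candidate actions attached to every rule
alpha, gamma, epsilon = 0.1, 0.9, 0.05
q = [[0.0 for _ in ACTIONS] for _ in CENTERS]   # value of each candidate action per rule

def memberships(s):
    w = [max(0.0, 1.0 - 2.0 * abs(s - c)) for c in CENTERS]   # triangular memberships
    total = sum(w) or 1.0
    return [wi / total for wi in w]

def q_value(s, a_idx):
    w = memberships(s)
    return sum(w[i] * q[i][a_idx] for i in range(len(CENTERS)))

# Step 1 (hard tuning): initialize candidate-action values from supervised demonstrations.
demonstrations = [(0.1, 2), (0.2, 2), (0.8, 0), (0.9, 0)]   # hypothetical (state, action index) pairs
for s, a_idx in demonstrations:
    w = memberships(s)
    for i in range(len(CENTERS)):
        q[i][a_idx] += w[i]               # demonstrated actions start with higher worth

# Step 2: fuzzy Sarsa fine-tuning online (toy task: drive the state toward 0.5).
def select(s):
    if random.random() < epsilon:
        return random.randrange(len(ACTIONS))
    return max(range(len(ACTIONS)), key=lambda a: q_value(s, a))

for _ in range(200):
    s = 0.1
    a = select(s)
    for _ in range(30):
        s2 = min(1.0, max(0.0, s + 0.1 * ACTIONS[a]))
        r = -abs(s2 - 0.5)
        a2 = select(s2)
        delta = r + gamma * q_value(s2, a2) - q_value(s, a)
        w = memberships(s)
        for i in range(len(CENTERS)):
            q[i][a] += alpha * w[i] * delta   # update weighted by rule activation
        s, a = s2, a2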

10.
We discuss the role of state focus in reinforcement learning (RL) systems that are applicable to mechanical systems, including robots. Although the concept of state focus is similar to attention or focusing in visual domains, its implementation requires some theoretical background based on RL. We propose an RL system that effectively learns how to choose the focus simultaneously with how to achieve a task. This RL system does not need heuristics for the adaptation of its focus. We conducted a capture experiment to compare the learning speed of the proposed system with that of traditional SARSA systems, and a navigation experiment to confirm the applicability of the proposed system to a realistic task. In the capture experiment, the proposed system learned faster than the SARSA systems. We visualized the developmental process of the focusing strategy in the proposed system using a Q-value analysis technique. In the navigation experiment, the proposed system again demonstrated faster learning than the SARSA systems on the realistic task. The proposed approach is applicable to a wide class of RL systems for mechanical systems, including robots.
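
For reference, the SARSA systems used as the baseline here refer to the standard on-policy update; a minimal tabular sketch looks like this (the toy chain task is a hypothetical stand-in for the capture and navigation environments):

import random
from collections import defaultdict

alpha, gamma, epsilon = 0.1, 0.95, 0.1
Q = defaultdict(float)
ACTIONS = range(4)

def eps_greedy(s):
    if random.random() < epsilon:
        return random.choice(list(ACTIONS))
    return max(ACTIONS, key=lambda a: Q[(s, a)])

def sarsa_episode(env_step, s0, max_steps=200):
    # env_step(s, a) -> (next_state, reward, done); supplied by the task at hand.
    s, a = s0, eps_greedy(s0)
    for _ in range(max_steps):
        s2, r, done = env_step(s, a)
        a2 = eps_greedy(s2)
        # On-policy update: bootstrap on the action actually taken next, not the greedy one.
        Q[(s, a)] += alpha * (r + gamma * (0.0 if done else Q[(s2, a2)]) - Q[(s, a)])
        s, a = s2, a2
        if done:
            break

def chain_step(s, a):                      # toy 1-D chain: reach state 5 by choosing action 1
    s2 = s + 1 if a == 1 else max(0, s - 1)
    return s2, (1.0 if s2 == 5 else 0.0), s2 == 5

for _ in range(300):
    sarsa_episode(chain_step, 0)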

11.
Cooperative Multi-Agent Learning: The State of the Art
Cooperative multi-agent systems (MAS) are ones in which several agents attempt, through their interaction, to jointly solve tasks or to maximize utility. Due to the interactions among the agents, multi-agent problem complexity can rise rapidly with the number of agents or their behavioral sophistication. The challenge this presents to the task of programming solutions to MAS problems has spawned increasing interest in machine learning techniques to automate the search and optimization process. We provide a broad survey of the cooperative multi-agent learning literature. Previous surveys of this area have largely focused on issues common to specific subareas (for example, reinforcement learning (RL) or robotics). In this survey we attempt to draw from multi-agent learning work in a spectrum of areas, including RL, evolutionary computation, game theory, complex systems, agent modeling, and robotics. We find that this broad view leads to a division of the work into two categories, each with its own special issues: applying a single learner to discover joint solutions to multi-agent problems (team learning), or using multiple simultaneous learners, often one per agent (concurrent learning). Additionally, we discuss direct and indirect communication in connection with learning, plus open issues in task decomposition, scalability, and adaptive dynamics. We conclude with a presentation of multi-agent learning problem domains, and a list of multi-agent learning resources.

12.
Hierarchical reinforcement learning (RL) algorithms can learn a policy faster than standard RL algorithms. However, the applicability of hierarchical RL algorithms is limited by the fact that the task decomposition has to be performed in advance by the human designer. We propose a Lamarckian evolutionary approach for automatic development of the learning structure in hierarchical RL. The proposed method combines the MAXQ hierarchical RL method and genetic programming (GP). In the MAXQ framework, a subtask can optimize its policy independently of its parent task's policy, which makes it possible to reuse learned policies of the subtasks. In the proposed method, the MAXQ method learns the policy based on the task hierarchies obtained by GP, while the GP explores appropriate hierarchies using the results of the MAXQ method. To show the validity of the proposed method, we have performed simulation experiments for a foraging task in three different environmental settings. The results show a strong interconnection between the obtained learning structures and the given task environments. The main conclusion of the experiments is that the GP can find a minimal strategy, i.e., a hierarchy that minimizes the number of primitive subtasks that can be executed for each type of situation. The experimental results for the most challenging environment also show that the policies of the subtasks can continue to improve, even after the structure of the hierarchy has been evolutionarily stabilized, as an effect of the Lamarckian mechanisms.

13.
Reinforcement learning (RL) for robot control is an important technology for future robots, since it enables us to design a robot's behavior using the reward function. However, RL for high degree-of-freedom robot control is still an open issue. This paper proposes a discrete action space, DCOB, which is generated from the basis functions (BFs) given for approximating a value function. The remarkable feature is that, by reducing the number of BFs so that the robot can learn the value function quickly, the size of DCOB is also reduced, which improves the learning speed. In addition, a method called WF-DCOB is proposed to enhance performance, where wire-fitting is utilized to search for continuous actions around each discrete action of DCOB. We apply the proposed methods to motion learning tasks of a simulated humanoid robot and a real spider robot. The experimental results demonstrate outstanding performance.

14.
This article proposes a field application of a Reinforcement Learning (RL) control system for solving the action selection problem of an autonomous robot in a cable tracking task. The Ictineu Autonomous Underwater Vehicle (AUV) learns to perform a vision-based cable tracking task in a two-step learning process. First, a policy is computed by means of simulation, where a hydrodynamic model of the vehicle simulates the cable following task. The identification procedure follows a specially designed Least Squares (LS) technique. Once the simulated results are accurate enough, in a second step the learnt-in-simulation policy is transferred to the vehicle, where the learning procedure continues in a real environment, improving the initial policy. The Natural Actor–Critic (NAC) algorithm has been selected to solve the problem. This Actor–Critic (AC) algorithm aims to take advantage of Policy Gradient (PG) and Value Function (VF) techniques for fast convergence. The work presented includes extensive real-world experimentation. The main objective of this work is to demonstrate the feasibility of RL techniques for learning autonomous underwater tasks; the cable tracking task was selected because of an increasing industrial demand for technology to survey and maintain underwater structures.
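
A bare-bones sketch of the actor–critic structure referred to here: a plain one-step actor–critic with a softmax policy, not the natural-gradient NAC variant used in the paper, and a hypothetical toy chain task standing in for the AUV simulator.

import math, random
from collections import defaultdict

gamma, alpha_critic, alpha_actor = 0.95, 0.1, 0.05
V = defaultdict(float)                     # critic: state values (value function)
theta = defaultdict(float)                 # actor: preferences theta[(state, action)]
ACTIONS = [0, 1]

def policy(s):
    prefs = [theta[(s, a)] for a in ACTIONS]
    m = max(prefs)
    exp = [math.exp(p - m) for p in prefs]
    z = sum(exp)
    return [e / z for e in exp]

def sample_action(s):
    probs, u, acc = policy(s), random.random(), 0.0
    for a, p in zip(ACTIONS, probs):
        acc += p
        if u <= acc:
            return a
    return ACTIONS[-1]

def step(s, a):                            # toy chain task standing in for the vehicle model
    s2 = min(5, s + 1) if a == 1 else max(0, s - 1)
    return s2, (1.0 if s2 == 5 else 0.0), s2 == 5

for _ in range(500):
    s = 0
    for _ in range(50):
        a = sample_action(s)
        s2, r, done = step(s, a)
        td_error = r + (0.0 if done else gamma * V[s2]) - V[s]
        V[s] += alpha_critic * td_error                       # critic: value-function update
        probs = policy(s)
        for b in ACTIONS:                                     # actor: policy-gradient update
            grad = (1.0 if b == a else 0.0) - probs[b]
            theta[(s, b)] += alpha_actor * td_error * grad
        s = s2
        if done:
            break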

15.
Research Progress on Reinforcement Learning for Autonomous Robots
陈卫东, 席裕庚, 顾冬雷. 《机器人》, 2001, 23(4): 379-384
Although behavior-based autonomous robots are fairly robust, they lack the adaptability required for dynamic environments. Reinforcement learning enables a robot to accomplish its task through learning, without requiring the designer to specify all of the robot's actions completely in advance. It is a novel learning method developed from the combination of dynamic programming and supervised learning: through trial-and-error interaction between the robot and its environment, reward and punishment signals drawn from successful and failed experiences are used to continually improve the robot's performance toward the goal, and delayed evaluation is tolerated. Owing to its outstanding ability to solve complex problems, reinforcement learning has become a very promising approach to robot learning. This paper systematically reviews the state of research on reinforcement learning for autonomous robots, points out existing problems, analyzes several possible solutions, and discusses future development trends.

16.
Reinforcement learning (RL) attracts much attention as a technique for realizing computational intelligence, such as adaptive and autonomous decentralized systems. In general, however, it is not easy to put RL to practical use. One such difficulty is the problem of designing a suitable action space for an agent, i.e., satisfying two conflicting requirements: (i) keeping the characteristics (or structure) of the original search space as much as possible, in order to seek strategies that lie close to the optimum, and (ii) reducing the search space as much as possible, in order to expedite the learning process. In order to design a suitable action space adaptively, in this article we propose an RL model with switching controllers, based on Q-learning and an actor-critic, that mimics the process of an infant's motor development, in which gross motor skills develop before fine motor skills. A method for switching controllers is then constructed by introducing and referring to the entropy. Further, computational experiments on a path-planning problem with a continuous action space confirm the validity and potential of the proposed method.
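
One way to picture the entropy-based switch (a hypothetical sketch, not the authors' model): compute the entropy of the softmax action distribution in the current state and hand control to the coarse controller while the entropy is high (the agent is still uncertain), switching to the fine controller once it drops.

import math

def softmax(values, temperature=1.0):
    m = max(values)
    exp = [math.exp((v - m) / temperature) for v in values]
    z = sum(exp)
    return [e / z for e in exp]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def pick_controller(q_values, threshold=0.8):
    # High entropy -> gross (coarse) controller; low entropy -> fine controller,
    # mimicking gross motor skills developing before fine motor skills.
    probs = softmax(q_values)
    return "coarse" if entropy(probs) > threshold else "fine"

print(pick_controller([0.1, 0.1, 0.1]))    # near-uniform action values -> coarse
print(pick_controller([2.0, 0.1, 0.1]))    # one dominant action -> fine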

17.
The robot soccer game has been proposed as a benchmark problem for artificial intelligence and robotics research. The decision-making system is the most important part of the robot soccer system. As the environment is dynamic and complex, a reinforcement learning (RL) method named FNN-RL is employed to learn the decision-making strategy. The FNN-RL system consists of a fuzzy neural network (FNN) and RL. RL is used for structure identification and parameter tuning of the FNN. On the other hand, the curse-of-dimensionality problem of RL can be addressed by the function approximation capability of the FNN. Furthermore, the residual algorithm is used to calculate the gradient of the FNN-RL method in order to guarantee convergence and rapid learning. The complex decision-making task is divided into multiple learning subtasks, including dynamic role assignment, action selection, and action implementation, which together constitute a hierarchical learning system. We apply the proposed FNN-RL method to soccer agents that attempt to learn each subtask at the various layers. The effectiveness of the proposed method is demonstrated by simulation and real experiments.

18.
We formalize the problem of Structured Prediction as a Reinforcement Learning task. We first define a Structured Prediction Markov Decision Process (SP-MDP), an instantiation of Markov Decision Processes for Structured Prediction, and show that learning an optimal policy for this SP-MDP is equivalent to minimizing the empirical loss. This link between the supervised learning formulation of structured prediction and reinforcement learning (RL) allows us to use approximate RL methods for learning the policy. The proposed model makes weak assumptions both on the nature of the Structured Prediction problem and on the supervision process. It does not make any assumption on the decomposition of loss functions, on data encoding, or on the availability of optimal policies for training. It thus allows us to cope with a large range of structured prediction problems. Moreover, it scales well and can be used for solving both complex and large-scale real-world problems. We describe two series of experiments. The first provides an analysis of RL on classical sequence prediction benchmarks and compares our approach with state-of-the-art SP algorithms. The second introduces a tree transformation problem where most previous models fail; this is a complex instance of the general labeled tree mapping problem. We show that RL exploration is effective and leads to successful results on this challenging task. This is a clear confirmation that RL can be used for large and complex structured prediction problems.
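
A toy sketch of the SP-MDP view under strong simplifying assumptions (the task, features, and corpus are invented for illustration): the structured output is built one decision at a time, no per-step decomposition of the loss is assumed, and a Monte Carlo return computed from the final loss updates the action values of every decision taken.

import random
from collections import defaultdict

# Toy task: tag each character of a string as vowel ("V") or consonant ("C").
LABELS = ["V", "C"]
alpha, epsilon = 0.1, 0.1
Q = defaultdict(float)                      # Q[(observation, label)], observation = current character

def gold(ch):
    return "V" if ch in "aeiou" else "C"

def predict(word):
    # Build the structured output sequentially: one action (label) per position.
    return [max(LABELS, key=lambda l: Q[(ch, l)]) if random.random() >= epsilon
            else random.choice(LABELS) for ch in word]

def final_loss(word, labels):
    # The loss is only available for the complete output (no per-step decomposition assumed).
    return sum(1 for ch, l in zip(word, labels) if l != gold(ch))

corpus = ["structure", "prediction", "reinforcement", "learning"]
for _ in range(2000):
    word = random.choice(corpus)
    labels = predict(word)
    ret = -final_loss(word, labels)         # episode return derived from the final loss
    for ch, l in zip(word, labels):         # Monte Carlo update of every decision taken
        Q[(ch, l)] += alpha * (ret - Q[(ch, l)])

print(predict("policy"))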

19.
For many forms of e-learning environments, the system's behavior can be viewed as a sequential decision process wherein, at each discrete step, the system is responsible for selecting the next action to take. Pedagogical strategies are policies for deciding the next system action when multiple actions are available. In this project we present a Reinforcement Learning (RL) approach for inducing effective pedagogical strategies, together with empirical evaluations of the induced strategies. This paper addresses the technical challenges in applying RL to Cordillera, a Natural Language Tutoring System that teaches students introductory college physics. The algorithm chosen for this project is a model-based RL approach, Policy Iteration, and the training corpus for the RL approach is an exploratory corpus, which was collected by letting the system make random decisions when interacting with real students. Overall, our results show that, even with a rather small training corpus, the RL-induced strategies measurably improved the effectiveness of Cordillera, in that the RL-induced policies improved students' learning gains significantly.
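
A compact sketch of the model-based recipe described here, on a made-up corpus: transition probabilities and expected rewards are first estimated from logged (state, action, reward, next-state) tuples gathered under random decisions, and policy iteration is then run on the estimated model. The state and action names below are hypothetical, not Cordillera's actual features.

from collections import defaultdict

gamma = 0.9
# Hypothetical exploratory corpus of (state, action, reward, next_state) tuples
# collected while the system made random decisions.
corpus = [
    ("confused", "elicit", 0.0, "partial"), ("confused", "tell", 0.0, "confused"),
    ("partial", "elicit", 1.0, "mastered"), ("partial", "tell", 0.0, "partial"),
    ("confused", "elicit", 0.0, "confused"), ("partial", "elicit", 0.0, "partial"),
]
states = {s for s, _, _, _ in corpus} | {s2 for _, _, _, s2 in corpus}
actions = {a for _, a, _, _ in corpus}

# Estimate the model: transition counts and mean rewards per (state, action).
counts = defaultdict(lambda: defaultdict(int))
rewards = defaultdict(list)
for s, a, r, s2 in corpus:
    counts[(s, a)][s2] += 1
    rewards[(s, a)].append(r)

def P(s, a, s2):
    total = sum(counts[(s, a)].values())
    return counts[(s, a)][s2] / total if total else 0.0

def R(s, a):
    return sum(rewards[(s, a)]) / len(rewards[(s, a)]) if rewards[(s, a)] else 0.0

# Policy iteration on the estimated MDP.
policy = {s: next(iter(actions)) for s in states}
V = {s: 0.0 for s in states}
for _ in range(50):
    for _ in range(50):                     # policy evaluation (iterative)
        V = {s: R(s, policy[s]) + gamma * sum(P(s, policy[s], s2) * V[s2] for s2 in states)
             for s in states}
    policy = {s: max(actions, key=lambda a: R(s, a) + gamma *
                     sum(P(s, a, s2) * V[s2] for s2 in states))
              for s in states}              # policy improvement

print(policy)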

20.
A programmed instruction approach to knowledge acquisition is operationalized by student–teacher interactions that involve monitoring and managing the moment-by-moment progress of a learner throughout the process of achieving a criterion of mastery in a knowledge domain. The teacher is generic and may include another person, a structured text, or a computer. When the steps and increments leading to a task's completion and mastery can be enumerated, the process of interaction-based learning lends itself to implementation with computer-based tutoring systems. The present paper describes a computer-based programmed instruction tutoring system that teaches a learner how to write a simple Java™ computer program. The system is based on a series of Java Applets, which are computer programs that are downloaded from a network server and executed by a client browser such as Netscape®. The use of Java Applets makes the tutoring system available to learners over the World Wide Web. Performance acquisition data are presented to show an individual learner's progression to task completion and mastery using the tutoring system. Survey data are presented to show the subjective responses of a class of students to the use of the tutoring system. The adoption of this computer-based tutoring system is justified as one component within a personalized system of instruction that also includes lectures and collaborative learning experiences.

