Similar Literature
1.
In this paper, we address the problem of suboptimal behavior during online partially observable Markov decision process (POMDP) planning caused by time constraints on planning. Taking inspiration from the related field of reinforcement learning (RL), our solution is to shape the agent’s reward function in order to lead the agent to large future rewards without having to spend as much time explicitly estimating cumulative future rewards, enabling the agent to save time, improve the breadth of its planning, and build higher-quality plans. Specifically, we extend potential-based reward shaping (PBRS) from RL to online POMDP planning. In our extension, information about belief states is added to the function optimized by the agent during planning. This information provides hints of where the agent might find high future rewards beyond its planning horizon, and thus achieve greater cumulative rewards. We develop novel potential functions measuring information useful to agent metareasoning in POMDPs (reflecting on agent knowledge and/or histories of experience with the environment), theoretically prove several important properties and benefits of using PBRS for online POMDP planning, and empirically demonstrate these results on a range of classic benchmark POMDP planning problems.
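A minimal sketch of how PBRS carries over to belief space, assuming a hypothetical negative-entropy potential (the paper develops its own metareasoning-based potentials):

```python
import numpy as np

def belief_potential(belief):
    """Hypothetical potential: negative belief entropy, so that
    less-uncertain beliefs receive higher potential."""
    p = belief[belief > 0]
    return float(np.sum(p * np.log(p)))  # = -H(b)

def shaped_reward(reward, belief, next_belief, gamma=0.95):
    """Potential-based shaping: add F = gamma*Phi(b') - Phi(b) to the
    immediate reward used during planning lookahead."""
    return reward + gamma * belief_potential(next_belief) - belief_potential(belief)
```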

2.
In active perception tasks, an agent aims to select sensory actions that reduce its uncertainty about one or more hidden variables. For example, a mobile robot takes sensory actions to efficiently navigate in a new environment. While partially observable Markov decision processes (POMDPs) provide a natural model for such problems, reward functions that directly penalize uncertainty in the agent’s belief can remove the piecewise-linear and convex (PWLC) property of the value function required by most POMDP planners. Furthermore, as the number of sensors available to the agent grows, the computational cost of POMDP planning grows exponentially with it, making POMDP planning infeasible with traditional methods. In this article, we address the twofold challenge of modeling and planning for active perception tasks. We analyze \(\rho \)POMDP and POMDP-IR, two frameworks for modeling active perception tasks that restore the PWLC property of the value function. We show the mathematical equivalence of these two frameworks by showing that a \(\rho \)POMDP along with a policy can be reduced to a POMDP-IR and an equivalent policy (and vice versa). We prove that the value function for the given \(\rho \)POMDP (and the given policy) and the reduced POMDP-IR (and the reduced policy) is the same. To efficiently plan for active perception tasks, we identify and exploit the independence properties of POMDP-IR to reduce the computational cost of solving POMDP-IR (and \(\rho \)POMDP). We propose greedy point-based value iteration (PBVI), a new POMDP planning method that uses greedy maximization to greatly improve scalability in the action space of an active perception POMDP. Furthermore, we show that, under certain conditions, including submodularity, the value function computed using greedy PBVI is guaranteed to have bounded error with respect to the optimal value function. We establish the conditions under which the value function of an active perception POMDP is guaranteed to be submodular. Finally, we present a detailed empirical analysis on a dataset collected from a multi-camera tracking system employed in a shopping mall. Our method achieves similar performance to existing methods but at a fraction of the computational cost, leading to better scalability for solving active perception tasks.
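A sketch of the greedy maximization step at the heart of greedy PBVI, assuming `value` estimates the (submodular) value of a sensor subset; the names here are illustrative:

```python
def greedy_max(sensors, k, value):
    """Build a k-sensor subset by repeatedly adding the sensor with
    the largest marginal gain; for submodular `value` this is within
    a (1 - 1/e) factor of the best subset."""
    chosen = []
    for _ in range(k):
        remaining = [s for s in sensors if s not in chosen]
        if not remaining:
            break
        best = max(remaining, key=lambda s: value(chosen + [s]) - value(chosen))
        chosen.append(best)
    return chosen
```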

3.
Reinforcement Learning (RL) is a well-known technique for learning the solutions of control problems from the interactions of an agent with its domain. However, RL is known to be inefficient in real-world problems where the state space and the set of actions grow quickly. Recently, heuristics, case-based reasoning (CBR), and transfer learning have been used as tools to accelerate the RL process. This paper investigates a class of algorithms called Transfer Learning Heuristically Accelerated Reinforcement Learning (TLHARL) that uses CBR as heuristics within a transfer learning setting to accelerate RL. The main contributions of this work are the proposal of a new TLHARL algorithm based on the traditional RL algorithm Q(λ) and the application of TLHARL to two distinct real-robot domains: robot soccer with small-scale robots and humanoid-robot stability learning. Experimental results show that our proposed method led to a significant improvement of the learning rate in both domains.
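A sketch of how a case-based heuristic can bias action selection in heuristically accelerated RL (illustrative, not the paper's exact TLHARL rule; `H` is assumed to be derived from retrieved cases):

```python
import numpy as np

def ha_action(Q, H, state, xi=1.0, epsilon=0.1, rng=np.random):
    """Epsilon-greedy selection biased by a heuristic: H steers
    exploration toward case-suggested actions without changing the
    learned Q-values themselves."""
    if rng.random() < epsilon:
        return rng.randint(Q.shape[1])
    return int(np.argmax(Q[state] + xi * H[state]))
```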

4.
Partially observable Markov decision processes (POMDPs) provide a mathematical framework for agent planning under stochastic and partially observable environments. The classic Bayesian optimal solution can be obtained by transforming the problem into a Markov decision process (MDP) using belief states. However, because the belief state space is continuous and multi-dimensional, the problem is highly intractable. Many practical heuristic-based methods have been proposed, but most of them require a complete POMDP model of the environment, which is not always practical. This article introduces a modified memory-based reinforcement learning algorithm called modified U-Tree that is capable of learning from raw sensor experiences with minimal prior knowledge. This article describes an enhancement of the original U-Tree’s state generation process that makes the generated model more compact, and also proposes a modification of the statistical test for reward estimation, which allows the algorithm to be benchmarked against some traditional model-based algorithms on a set of well-known POMDP problems.
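For reference, the standard Bayes-filter belief update that turns a POMDP into a belief-state MDP (tensor shapes noted in the comments):

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """b'(s') ∝ O[a, s', o] * sum_s T[a, s, s'] * b(s).
    b: length-|S| belief vector; T: |A|x|S|x|S| transition tensor;
    O: |A|x|S|x|O| observation tensor."""
    predicted = b @ T[a]           # predictive distribution over s'
    new_b = O[a, :, o] * predicted
    return new_b / new_b.sum()
```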

5.
As an important branch of machine learning and artificial intelligence, multi-agent hierarchical reinforcement learning combines, in a general form, the cooperative capabilities of multiple agents with the decision-making capabilities of reinforcement learning, and it can effectively mitigate the curse of dimensionality by decomposing a complex reinforcement learning problem into several subproblems that are solved separately. This makes multi-agent hierarchical reinforcement learning a promising approach to intelligent decision-making problems in large-scale, complex settings. This article first describes the main techniques involved in multi-agent hierarchical reinforcement learning, including reinforcement learning, semi-Markov decision processes, and multi-agent reinforcement learning. It then surveys, from the perspective of the hierarchy, the algorithmic principles and research status of four families of multi-agent hierarchical reinforcement learning methods: option-based, hierarchical-abstract-machine-based, value-function-decomposition-based, and end-to-end. Finally, it reviews applications of multi-agent hierarchical reinforcement learning in robot control, game decision-making, and task planning.

6.
Partially observable Markov decision processes (POMDPs) provide a rich mathematical framework for planning tasks in partially observable stochastic environments. The notion of the covering number, a metric capturing the search-space size of a POMDP planning problem, has been proposed as a complexity measure of approximate POMDP planning. Existing theoretical results are based on POMDPs with finite and discrete state spaces, are measured in the \(l_1\)-metric space, and assume that any heuristics considered are admissible. This paper extends the theoretical results on the covering numbers of different search spaces, including the newly defined space reachable under inadmissible heuristics, to the \(l_n\)-metric spaces. We provide a simple but scalable algorithm for estimating covering numbers. Experimentally, we provide estimated covering numbers of the search spaces reachable by following different policies on several benchmark problems, and analyze their ability to predict the runtime of POMDP planning algorithms.
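A simple greedy scheme in the spirit of the estimation algorithm mentioned above (a sketch; the paper's algorithm may differ): repeatedly take an uncovered sample point as a ball center and discard everything within δ of it. The resulting count upper-bounds the sample's covering number at radius δ and lower-bounds it at radius δ/2.

```python
import numpy as np

def estimate_covering_number(points, delta, p=1):
    """Greedy ball-cover count over sampled (belief) points in the
    l_p metric; `points` is a list of equal-length numpy vectors."""
    uncovered = list(points)
    count = 0
    while uncovered:
        center = uncovered.pop(0)
        uncovered = [x for x in uncovered
                     if np.linalg.norm(x - center, ord=p) > delta]
        count += 1
    return count
```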

7.
Under the assumption that the state space satisfies a structural condition, the complex original MDP problem is hierarchically decomposed, via a partition of the state-space dimensions, directly into a set of simpler MDP or SMDP subproblems, and the hierarchical structure is refined online. Embedding different reinforcement learning methods into the hierarchy yields different hierarchical learners. The proposed method retains the advantages of hierarchical reinforcement learning, such as fast learning and ease of sharing, while reducing the dependence on prior knowledge and alleviating the sparsity of rewards in the early stage of learning.
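An illustrative dimension-wise decomposition in this spirit, assuming a toy 2-D grid world split into two independent 1-D Q-learning subproblems (the goal coordinates and reward values are hypothetical, and the actual method refines the hierarchy online):

```python
import numpy as np

N = 10
goals = (7, 3)                               # hypothetical per-dimension goals
Q = [np.zeros((N, 2)) for _ in range(2)]     # one Q-table per state dimension

def subproblem_step(dim, pos, action, alpha=0.1, gamma=0.9):
    """One Q-learning update in the 1-D subproblem for dimension
    `dim`; actions: 0 = decrement, 1 = increment the coordinate."""
    nxt = max(0, min(N - 1, pos + (1 if action == 1 else -1)))
    r = 1.0 if nxt == goals[dim] else -0.01
    Q[dim][pos, action] += alpha * (r + gamma * Q[dim][nxt].max() - Q[dim][pos, action])
    return nxt
```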

8.
9.
Flexible, general-purpose robots need to autonomously tailor their sensing and information processing to the task at hand. We pose this challenge as the task of planning under uncertainty. In our domain, the goal is to plan a sequence of visual operators to apply on regions of interest (ROIs) in images of a scene, so that a human and a robot can jointly manipulate and converse about objects on a tabletop. We pose visual processing management as an instance of probabilistic sequential decision making, and specifically as a Partially Observable Markov Decision Process (POMDP). The POMDP formulation uses models that quantitatively capture the unreliability of the operators and enable a robot to reason precisely about the trade-offs between plan reliability and plan execution time. Since planning in practical-sized POMDPs is intractable, we partially ameliorate this intractability for visual processing by defining a novel hierarchical POMDP based on the cognitive requirements of the corresponding planning task. We compare our hierarchical POMDP planning system (HiPPo) with a non-hierarchical POMDP formulation and the Continual Planning (CP) framework that handles uncertainty in a qualitative manner. We show empirically that HiPPo and CP outperform the naive application of all visual operators on all ROIs. The key result is that the POMDP methods produce more robust plans than CP or the naive visual processing. In summary, visual processing problems represent a challenging domain for planning techniques and our hierarchical POMDP-based approach for visual processing management opens up a promising new line of research.

10.
This paper presents a method for learning decision-theoretic models of human behaviors from video data. Our system learns relationships between the movements of a person, the context in which they are acting, and a utility function. This learning makes explicit that the meaning of a behavior to an observer is contained in its relationship to actions and outcomes. An agent wishing to capitalize on these relationships must learn to distinguish the behaviors according to how they help the agent to maximize utility. The model we use is a partially observable Markov decision process, or POMDP. The video observations are integrated into the POMDP using a dynamic Bayesian network that creates spatial and temporal abstractions amenable to decision making at the high level. The parameters of the model are learned from training data using an a posteriori constrained optimization technique based on the expectation-maximization algorithm. The system automatically discovers classes of behaviors and determines which are important for choosing actions that optimize over the utility of possible outcomes. This type of learning obviates the need for labeled data from expert knowledge about which behaviors are significant and removes bias about what behaviors may be useful to recognize in a particular situation. We show results in three interactions: a single-player imitation game, a gestural robotic control problem, and a card game played by two people.

11.
We obtain the conditions for the emergence of the swarm intelligence effect in an interactive game of restless multi-armed bandits (rMAB). A player competes with multiple agents. Each bandit has a payoff that changes with a probability \(p_c\) per round. The agents and the player choose one of three options: (1) Exploit (a good bandit), (2) Innovate (asocial learning for a good bandit among \(n_I\) randomly chosen bandits), and (3) Observe (social learning for a good bandit). Each agent has two parameters \((c, p_{obs})\) to specify the decision: (i) c, the threshold value for Exploit, and (ii) \(p_{obs}\), the probability of choosing Observe when learning. The parameters \((c, p_{obs})\) are uniformly distributed. We determine the optimal strategies for the player using complete knowledge about the rMAB. We show where in the \((p_c, n_I)\) space social or asocial learning is optimal, and define the swarm intelligence effect. We conduct a laboratory experiment (67 subjects) and observe the swarm intelligence effect only if \((p_c, n_I)\) are chosen so that social learning is far more effective than asocial learning.
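The per-round decision rule of an agent in this model, following the description above (a sketch):

```python
import random

def agent_decision(current_payoff, c, p_obs, rng=random):
    """Exploit when the current bandit's payoff clears the threshold
    c; otherwise learn socially (Observe) with probability p_obs, or
    asocially (Innovate) otherwise."""
    if current_payoff >= c:
        return "Exploit"
    return "Observe" if rng.random() < p_obs else "Innovate"
```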

12.
We address the problem of how a set of agents can decide to share a resource, represented as a unit-sized pie. The pie can be generated by the entire set but also by some of its subsets. We investigate a finite-horizon non-cooperative bargaining game, in which the players take turns making proposals on how the resource should be allocated, and the other players vote on whether or not to accept the allocation. Voting is modelled as a Bayesian weighted voting game with uncertainty about the players’ weights. The agenda (i.e., the order in which the players are called to make offers) is defined exogenously. We focus on impatient players with heterogeneous discount factors. In the case of a conflict (i.e., no agreement by the deadline), no player receives anything. We provide a Bayesian subgame perfect equilibrium for the bargaining game and conduct an ex-ante analysis of the resulting outcome. We show that the equilibrium is unique, computable in polynomial time, results in an instant Pareto-optimal outcome, and, under certain conditions, provides a foundation for the core and also the nucleolus of the Bayesian voting game. In addition, our analysis leads to insights on how an individual’s bargained share is influenced by his position on the agenda. Finally, we show that, if the conflict point of the bargaining game changes, then the problem of determining the non-cooperative equilibrium becomes NP-hard even under the perfect-information assumption. Our research also reveals how this change in conflict point impacts the above-mentioned results.
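A backward-induction sketch for a drastically simplified special case, assuming unanimity voting with known weights instead of the paper's Bayesian weighted voting (so this illustrates only the recursion behind such equilibria, not the paper's construction):

```python
def bargaining_shares(discounts, agenda, horizon):
    """Each round's proposer offers every other player the discounted
    value of that player's continuation payoff; on conflict (no
    agreement by the deadline) everyone gets 0."""
    n = len(discounts)
    shares = [0.0] * n
    shares[agenda[(horizon - 1) % len(agenda)]] = 1.0  # deadline proposer takes all
    for t in range(horizon - 2, -1, -1):
        proposer = agenda[t % len(agenda)]
        offers = [discounts[j] * shares[j] for j in range(n)]
        offers[proposer] = 0.0
        shares = offers[:]
        shares[proposer] = 1.0 - sum(offers)
    return shares

# e.g. bargaining_shares([0.9, 0.8], agenda=[0, 1], horizon=2) -> [0.2, 0.8]
```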

13.
In this paper we propose a machine learning technique for real-time robot path planning for an autonomous robot in a planar environment with obstacles, where the robot possesses no a priori map of its environment. Our main insight is that a robot’s path planning times can be significantly reduced if it can refer to previous maneuvers it used to avoid obstacles during earlier missions, and adapt that information to avoid obstacles during its current navigation. We propose an online path planning algorithm called LearnerRRT that utilizes a pattern matching technique called Sample Consensus Initial Alignment (SAC-IA) in combination with an experience-based learning technique to adapt obstacle boundary patterns encountered in previous environments to the current scenario, followed by corresponding adaptations of the obstacle-avoidance paths. LearnerRRT works as a learning-based reactive path planning technique that enables robots to improve their overall path planning performance by locally improving maneuvers around commonly encountered obstacle patterns, using previously accumulated environmental information. We have conducted several experiments in simulation and hardware to verify the performance of the LearnerRRT algorithm and compared it with a state-of-the-art sampling-based planner. In simulation, LearnerRRT on average takes approximately 10% of the planning time and 14% of the total time taken by the sampling-based planner to solve the same navigation task; in hardware, it takes only 33% of the planning time and 46% of the total time of the sampling-based planner, while covering 95% of its total distance.
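The experience-retrieval step in sketch form, with a pairwise-distance histogram standing in for the point-cloud features that SAC-IA matches in the actual system (the descriptor and library layout are assumptions):

```python
import numpy as np
from scipy.spatial.distance import pdist

def shape_descriptor(points, bins=16, r_max=5.0):
    """Crude stand-in for SAC-IA's point-cloud features: a normalized
    histogram of pairwise distances between obstacle boundary points."""
    h, _ = np.histogram(pdist(points), bins=bins, range=(0.0, r_max))
    return h / max(h.sum(), 1)

def retrieve_maneuver(library, obstacle_points):
    """Return the stored avoidance path whose obstacle pattern best
    matches the current obstacle boundary."""
    d = shape_descriptor(obstacle_points)
    best = min(library, key=lambda e: np.linalg.norm(e["descriptor"] - d))
    return best["path"]
```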

14.
A Reinforcement Learning (RL) algorithm based on the eXtended Classifier System (XCS) is used to navigate a spherical robot. Traditional motion planning strategies rely on pre-planned optimal trajectories and feedback control techniques. The proposed learning-agent approach instead follows a direct, model-free methodology that enables the robot to function in dynamic and/or partially observable environments. The agent uses a set of guard-action rules that determines the motion inputs at each step. Using a number of control inputs (actions) and the developed RL scheme, the agent learns to make near-optimal moves in response to the incoming position/orientation signals. The proposed method employs an improved variant of the XCS as its learning agent. Results of several simulated experiments for the spherical robot show that this approach is capable of planning a near-optimal path to a predefined target from any given position/orientation.
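The fitness-weighted prediction array used for action selection in XCS-style systems (a standard sketch; the paper's improved variant adds further machinery on top of this):

```python
def select_action(match_set, actions):
    """Each action's system prediction is the fitness-weighted mean of
    the predictions of the matching guard-action rules advocating it;
    the action with the highest system prediction is chosen."""
    pa = {}
    for a in actions:
        advocates = [cl for cl in match_set if cl["action"] == a]
        total_fitness = sum(cl["fitness"] for cl in advocates)
        if total_fitness > 0:
            pa[a] = sum(cl["prediction"] * cl["fitness"] for cl in advocates) / total_fitness
    return max(pa, key=pa.get) if pa else None
```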

15.
Recent Large-Scale Multimedia Retrieval (LSMR) methods seem to rely heavily on analysing a large amount of data using high-performance machines. This paper aims to caution against this research trend. We advocate that such methods are useful only for recognising certain primitive meanings; knowledge about human interpretation is necessary to derive high-level meanings from primitive ones. We emphasise this by conducting a retrospective survey of machine-based methods, which build classifiers based on features, and human-based methods, which exploit user annotation and interaction. Our survey reveals that, by prioritising generality and scalability for large-scale data, recent methods leave out knowledge about human interpretation, whereas classical methods used it fully. Thus, we defend the importance of human-machine cooperation, which incorporates the above knowledge into LSMR. In particular, we define its three future directions (cognition-based, ontology-based and adaptive learning) depending on the type of knowledge involved, and suggest exploring each direction by considering its relation to the others.

16.
This paper describes the design and ecologically valid evaluation of a learner model that lies at the heart of an intelligent learning environment called iTalk2Learn. A core objective of the learner model is to adapt formative feedback based on students’ affective states. Types of adaptation include what type of formative feedback should be provided and how it should be presented. Two Bayesian networks trained with data gathered in a series of Wizard-of-Oz studies are used for the adaptation process. This paper reports results from a quasi-experimental evaluation, in authentic classroom settings, which compared a version of iTalk2Learn that adapted feedback based on students’ affective states as they were talking aloud with the system (the affect condition) with one that provided feedback based only on the students’ performance (the non-affect condition). Our results suggest that affect-aware support contributes to reducing boredom and off-task behavior, and may have an effect on learning. We discuss the internal and ecological validity of the study, in light of pedagogical considerations that informed the design of the two conditions. Overall, the results of the study have implications both for the design of educational technology and for classroom approaches to teaching, because they highlight the important role that affect-aware modelling plays in the adaptive delivery of formative feedback to support learning.

17.
Collaborative privacy-preserving planning (CPPP) is a multi-agent planning task in which agents need to achieve a common set of goals without revealing certain private information. In many CPPP algorithms, the individual agents reason about a projection of the multi-agent problem onto a single-agent classical planning problem. For example, an agent can plan as if it controls the public actions of other agents, ignoring any private preconditions and effects these actions may have, and use the cost of this plan as a heuristic estimate of the cost of the full, multi-agent plan. Using such a projection, however, ignores some dependencies between agents’ public actions. In particular, it does not capture dependencies between public actions of other agents caused by their private facts. We propose a projection in which these private dependencies are maintained. The benefit of our dependency-preserving projection is demonstrated by using it to produce high-level plans in a new privacy-preserving planner, and as a heuristic for guiding forward-search privacy-preserving algorithms. Both are able to solve more benchmark problems than any other state-of-the-art privacy-preserving planner. This more informed projection does not explicitly expose any private fact, action, or precondition. In addition, we show that even if an adversary agent knows that an agent has some private objects of a given type (e.g., trucks), it cannot infer the number of such private objects that the agent controls. This introduces a novel form of strong privacy, which we call object-cardinality privacy, that is motivated by real-world requirements.
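For contrast with the dependency-preserving projection, a sketch of the basic projection described above, which simply drops private preconditions and effects from other agents' public actions (the STRIPS-style types are assumptions):

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Action:
    name: str
    pre: frozenset       # precondition facts
    add: frozenset       # add effects
    delete: frozenset    # delete effects

def project_public(action, public_facts):
    """Basic projection: keep only the public parts of another agent's
    public action; the paper's projection additionally preserves the
    dependencies that private facts induce between public actions."""
    return replace(action,
                   pre=action.pre & public_facts,
                   add=action.add & public_facts,
                   delete=action.delete & public_facts)
```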

18.
We consider the problem of continuum-armed bandits, where the arms are indexed by a compact subset of \(\mathbb {R}^{d}\). For large d, it is well known that mere smoothness assumptions on the reward functions lead to regret bounds that suffer from the curse of dimensionality. A typical way to tackle this in the literature has been to make further assumptions on the structure of the reward functions. In this work we assume the reward functions to be intrinsically of low dimension \(k \ll d\) and consider two models: (i) the reward functions depend on only an unknown subset of k coordinate variables, and (ii) a generalization of (i) in which the reward functions depend on an unknown k-dimensional subspace of \(\mathbb {R}^{d}\). By placing suitable assumptions on the smoothness of the rewards we derive randomized algorithms for both problems that achieve nearly optimal regret bounds in terms of the number of rounds n.

19.
Reinforcement learning (RL) for solving large and complex problems faces the curse of dimensionality. To overcome this problem, frameworks based on temporal abstraction have been presented, each with its own advantages and disadvantages. This paper proposes a new method, in the spirit of the strategies introduced in hierarchical abstract machines (HAMs), that creates a high-level controller layer of reinforcement learning which uses options. The proposed framework employs a non-deterministic automaton as a controller to make more effective use of temporally extended actions and state-space clustering. The method can be viewed as a bridge between the option and HAM frameworks: it suggests a new framework that reduces the disadvantages of both by creating connection structures between them, while at the same time retaining their advantages. Experimental results on different test environments show the significant efficiency of the proposed method.
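For reference, the two ingredients the proposed framework bridges: an option (a temporally extended action) and the SMDP Q-learning update applied when an option runs for k steps (a standard sketch, not the paper's controller):

```python
from dataclasses import dataclass
from typing import Callable, Set

@dataclass
class Option:
    """An option: where it may start, how it acts, and when it stops."""
    initiation: Set[int]              # states where the option can start
    policy: Callable[[int], int]      # internal state -> action mapping
    terminate: Callable[[int], bool]  # termination condition

def smdp_q_update(Q, s, o, reward_sum, k, s_next, alpha=0.1, gamma=0.9):
    """Q is a dict of dicts: Q[state][option]. `reward_sum` is the
    discounted reward accumulated over the k steps the option ran."""
    target = reward_sum + gamma ** k * max(Q[s_next].values())
    Q[s][o] += alpha * (target - Q[s][o])
```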

20.
The robot soccer game has been proposed as a benchmark problem for artificial intelligence and robotics research. The decision-making system is the most important part of a robot soccer system. As the environment is dynamic and complex, a reinforcement learning (RL) method named FNN-RL is employed to learn the decision-making strategy. The FNN-RL system consists of a fuzzy neural network (FNN) and RL. RL is used for structure identification and parameter tuning of the FNN. On the other hand, the curse-of-dimensionality problem of RL can be alleviated by the function approximation characteristics of the FNN. Furthermore, the residual algorithm is used to calculate the gradient of the FNN-RL method in order to guarantee the convergence and rapidity of learning. The complex decision-making task is divided into multiple learning subtasks, including dynamic role assignment, action selection, and action implementation, which together constitute a hierarchical learning system. We apply the proposed FNN-RL method to soccer agents that attempt to learn each subtask at the various layers. The effectiveness of the proposed method is demonstrated by simulations and real experiments.
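A sketch of the residual-gradient step for a linear approximator, in the spirit of the residual algorithm the authors rely on for convergence (Baird's full residual algorithm interpolates between this update and the direct method):

```python
import numpy as np

def residual_update(w, phi_s, phi_next, r, alpha=0.05, gamma=0.9):
    """Gradient descent on the squared Bellman residual delta^2 for a
    linear value function V(s) = w . phi(s)."""
    delta = r + gamma * w @ phi_next - w @ phi_s
    return w - alpha * delta * (gamma * phi_next - phi_s)
```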
