Similar Documents
20 similar documents found
1.
When modeling real-world decision-theoretic planning problems in the Markov Decision Process (MDP) framework, it is often impossible to obtain a completely accurate estimate of transition probabilities. For example, uncertainty arises naturally in the transition specification when MDP transition models are elicited from an expert or estimated from data, and non-stationary transition distributions can arise from insufficient state knowledge. To obtain the most robust policy under transition uncertainty, Markov Decision Processes with Imprecise Transition Probabilities (MDP-IPs) have been introduced to model such scenarios. Unfortunately, while various solution algorithms exist for MDP-IPs, they often require external calls to optimization routines and thus can be extremely time-consuming in practice. To address this deficiency, we introduce the factored MDP-IP and propose efficient dynamic programming methods to exploit its structure. Noting that the key computational bottleneck in the solution of factored MDP-IPs is the need to repeatedly solve nonlinear constrained optimization problems, we show how to target approximation techniques to drastically reduce the computational overhead of the nonlinear solver while producing bounded, approximately optimal solutions. Our results show up to two orders of magnitude speedup in comparison to traditional “flat” dynamic programming approaches and up to an order of magnitude speedup over the extension of factored MDP approximate value iteration techniques to MDP-IPs, while producing the lowest error of any approximation algorithm evaluated.
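As background to the robust objective used here, the sketch below shows max-min value iteration for the simplest class of MDP-IPs, where each transition probability is only known to lie in an interval; nature's inner minimization then has a greedy closed form (pile probability mass onto the lowest-valued successors). This is an illustrative flat solver, not the factored algorithm of the paper, and all array shapes and names are assumptions.

```python
import numpy as np

def worst_case_distribution(p_lo, p_hi, values):
    """Nature's adversarial choice inside the interval credal set:
    the p in [p_lo, p_hi] with sum(p) = 1 minimizing E_p[values].
    Assumes sum(p_lo) <= 1 <= sum(p_hi)."""
    order = np.argsort(values)           # low-value successors first
    p = p_lo.copy()
    slack = 1.0 - p.sum()                # mass left to distribute
    for s in order:
        add = min(p_hi[s] - p_lo[s], slack)
        p[s] += add
        slack -= add
    return p

def robust_value_iteration(P_lo, P_hi, R, gamma=0.95, eps=1e-6):
    """Max-min (robust) value iteration for an interval MDP-IP.
    P_lo, P_hi: (n_actions, n_states, n_states) interval bounds;
    R: (n_actions, n_states) rewards."""
    n_actions, n_states = R.shape
    V = np.zeros(n_states)
    while True:
        Q = np.empty((n_actions, n_states))
        for a in range(n_actions):
            for s in range(n_states):
                p = worst_case_distribution(P_lo[a, s], P_hi[a, s], V)
                Q[a, s] = R[a, s] + gamma * (p @ V)
        V_new = Q.max(axis=0)            # agent maximizes after nature minimizes
        if np.abs(V_new - V).max() < eps:
            return V_new
        V = V_new
```

The nested min-max backup above is exactly the structure whose inner optimization the paper's factored methods approximate and accelerate.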

2.
In automated planning, finding a plan is in general NP-hard or even NP-complete, and when action effects are uncertain, as in planning problems modeled as Markov Decision Processes, solving becomes harder still. Existing MDP planning algorithms typically describe the actual outcome of an action with a single monolithic state node, trying to sidestep the internal complexity of states, yet in practice many actions produce multiple propositional effects corresponding to multiple proposition nodes. To handle this, the concepts of image actions, image path nodes, and image planning graphs are introduced, and on that basis an ant colony planning algorithm for Markov Decision Processes is proposed. It is further proved that the plans the algorithm returns are reliable with at least a guaranteed probability, even under an uncertain execution environment.
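The abstract gives no pseudocode, so the following is only a generic sketch of the ant-colony idea applied to probabilistic planning: ants sample action sequences guided by pheromone, and pheromone is reinforced along the most reliable plan found so far. `actions`, `succ_prob`, `goal_test`, and the callable action objects are hypothetical problem callbacks, not the paper's API.

```python
import random

def aco_plan(init, actions, succ_prob, goal_test,
             n_ants=50, n_iters=100, evaporation=0.1, horizon=20):
    """Ant-colony search over a probabilistic planning space (sketch).
    actions(s) -> list of stable, hashable action objects; applying
    a(s) gives the intended successor; succ_prob(s, a) is the chance
    that all of the action's effects actually hold."""
    pheromone = {}                        # (state, action) -> level
    best = None                           # (plan, reliability)
    for _ in range(n_iters):
        for _ in range(n_ants):
            state, plan, reliability = init, [], 1.0
            for _ in range(horizon):
                if goal_test(state):
                    break
                acts = actions(state)
                if not acts:
                    break
                weights = [pheromone.get((state, a), 1.0) for a in acts]
                a = random.choices(acts, weights=weights)[0]
                plan.append(a)
                reliability *= succ_prob(state, a)
                state = a(state)
            if goal_test(state) and (best is None or reliability > best[1]):
                best = (plan, reliability)
        for key in list(pheromone):       # evaporate everywhere
            pheromone[key] *= 1.0 - evaporation
        if best:                          # reinforce the most reliable plan
            state = init
            for a in best[0]:
                pheromone[(state, a)] = pheromone.get((state, a), 1.0) + best[1]
                state = a(state)
    return best
```

Tracking `reliability` as the product of per-action success probabilities mirrors the abstract's guarantee that the returned plan succeeds with at least a known probability.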

3.
A reinforcement learning agent solves decision problems by learning an optimal policy mapping states to actions; reinforcement learning methods let the agent improve its behavior through trial-and-error interaction with the environment. The Markov Decision Process (MDP) model is the general framework for reinforcement learning problems, and dynamic programming provides policy-dependent value-function learning algorithms for agents in Markovian environments. However, because the agent must memorize the entire value function during learning, the required memory grows enormously as the state space increases. This paper proposes a forgetting algorithm for dynamic-programming-based reinforcement learning: by introducing basic principles of forgetting from memory psychology into value-function learning, it derives a practical class of dynamic programming methods for reinforcement learning, the Forget-DP algorithm.
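The paper's exact forgetting rule is not given in the abstract; below is only a plausible illustration of the idea, a value table whose entries lose retention strength Ebbinghaus-style with time since last access and are evicted once negligible, bounding memory. The class name, decay law, and thresholds are all assumptions.

```python
import math

class ForgettingValueTable:
    """Illustrative value storage with an Ebbinghaus-style retention
    curve: each entry's retention strength decays exponentially with
    time since last access; weak entries are evicted to bound memory."""
    def __init__(self, decay=0.01, floor=1e-3, default=0.0):
        self.entries = {}                 # state -> (value, last_access)
        self.decay, self.floor, self.default = decay, floor, default
        self.clock = 0

    def get(self, s):
        self.clock += 1
        if s not in self.entries:
            return self.default
        value, last = self.entries[s]
        retention = math.exp(-self.decay * (self.clock - last))
        if retention < self.floor:        # forgotten: reclaim the entry
            del self.entries[s]
            return self.default
        self.entries[s] = (value, self.clock)   # access acts as rehearsal
        return value

    def set(self, s, v):
        self.clock += 1
        self.entries[s] = (v, self.clock)
```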

4.
An approximation method is proposed that solves a class of Decentralized hybrid Markov Decision Processes (DEC-HMDPs). These DEC-HMDPs have both discrete and continuous state variables and represent individual agents with continuous measurable state spaces, such as resources. Adding to the natural complexity of decentralized problems, continuous state variables lead to a blowup in potential decision points. Representing value functions as Rectangular Piecewise Constant (RPWC) functions, we formalize and detail an extension to the Coverage Set Algorithm (CSA) (Becker et al., J Artif Intell Res, 22, 2004) that solves transition-independent DEC-HMDPs with controlled error. The resource constraints of each agent lead to problems that are over-subscribed in the number of agents, that is, where some agents have no role to play. Based on our extension to the CSA, two heuristics are proposed that allow A*-like search to find the minimal optimal team of agents that solves a given problem. We apply and test our algorithms on a range of multi-robot exploration problems with continuous resource constraints.

5.
Planning under uncertainty typically assumes that the transition probabilities between world states are known exactly, yet modelers must sometimes work with incomplete information: quantitative knowledge of transition uncertainty can only be obtained through guesswork or incomplete statistics, and sometimes only qualitative information is available. Building on LAO, the winning system of the 2004 probabilistic planning competition, the JLU-RLAO and JLU-QLAO systems were designed to preserve the robustness of plan solving even when exact state-transition probabilities cannot be obtained. Experimental results show that JLU-RLAO and JLU-QLAO solve these uncertain planning problems quickly and efficiently.

6.
Verification of reachability properties for probabilistic systems is usually based on variants of Markov processes. Current methods assume an exact model of the dynamic behavior and are not suitable for realistic systems that operate in the presence of uncertainty and variability. This research note extends existing methods for Bounded-parameter Markov Decision Processes (BMDPs) to solve the reachability problem. BMDPs are a generalization of MDPs that allows modeling uncertainty. Our results show that interval value iteration converges under the undiscounted reward criterion required to formulate the problems of maximizing the probability of reaching a set of desirable states or minimizing the probability of reaching an unsafe set. An analysis of the computational complexity is also presented.
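A minimal sketch of the pessimistic direction of interval value iteration for maximum reachability follows, under the usual assumption that each transition probability lies in an interval whose lower bounds sum to at most 1 and upper bounds to at least 1. The inner minimization over the interval set uses the standard greedy construction (push mass onto the lowest-valued successors); shapes and names are illustrative, not from the note.

```python
import numpy as np

def pessimistic_expectation(p_lo, p_hi, v):
    """min over distributions p in [p_lo, p_hi] (summing to 1) of E_p[v]:
    fill probability mass greedily onto the lowest-valued successors."""
    order = np.argsort(v)
    p = p_lo.copy()
    slack = 1.0 - p.sum()
    for s in order:
        add = min(p_hi[s] - p_lo[s], slack)
        p[s] += add
        slack -= add
    return p @ v

def interval_max_reach(P_lo, P_hi, goal, tol=1e-8, max_iters=100_000):
    """Interval value iteration for the maximum probability of reaching
    `goal` in a bounded-parameter MDP, undiscounted:
    V(s) = 1 on goal, else max_a min_{P in [P_lo, P_hi]} sum_s' P V(s')."""
    n_actions, n_states, _ = P_lo.shape
    is_goal = np.array([s in goal for s in range(n_states)])
    V = is_goal.astype(float)
    for _ in range(max_iters):
        Q = np.empty((n_actions, n_states))
        for a in range(n_actions):
            for s in range(n_states):
                Q[a, s] = pessimistic_expectation(P_lo[a, s], P_hi[a, s], V)
        V_new = np.where(is_goal, 1.0, Q.max(axis=0))
        if np.abs(V_new - V).max() < tol:
            break
        V = V_new
    return V
```

Replacing the inner minimization with the symmetric greedy maximization gives the optimistic bound, so the pair brackets the true reachability probability.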

7.
戴帅, 殷苌茗, 张欣. 计算机工程 (Computer Engineering), 2009, 35(13): 190-192
A new factored TD(λ) algorithm is proposed. Its basic idea is a factored state representation: the state-transition probability function of the Markov Decision Process (MDP) is represented by a dynamic Bayesian network, combined with a decision-tree representation of the state-value function in TD(λ). This lowers the search and computation complexity over the state space, making the approach suitable for solving MDPs with large state spaces. Experiments show that the representation is effective.
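For reference, the underlying tabular TD(λ) update with accumulating eligibility traces looks as follows; the paper's contribution is to replace the table with a decision tree over DBN-factored states, which this sketch does not attempt. The episode format is an assumption.

```python
import collections

def td_lambda(episodes, alpha=0.1, gamma=0.95, lam=0.8):
    """Plain tabular TD(lambda) with accumulating eligibility traces.
    `episodes` is an iterable of episodes, each a list of
    (state, reward, next_state) triples; terminal successors simply
    have value 0 in the defaultdict."""
    V = collections.defaultdict(float)
    for episode in episodes:
        e = collections.defaultdict(float)      # eligibility traces
        for s, r, s_next in episode:
            delta = r + gamma * V[s_next] - V[s]
            e[s] += 1.0                         # accumulating trace
            for x in list(e):
                V[x] += alpha * delta * e[x]
                e[x] *= gamma * lam             # decay every trace
        # traces reset at episode boundaries
    return V
```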

8.
We address the problem of reinforcement learning in which observations may exhibit an arbitrary form of stochastic dependence on past observations and actions, i.e. environments more general than (PO)MDPs. The task for an agent is to attain the best possible asymptotic reward where the true generating environment is unknown, but belongs to a known countable family of environments. We find some sufficient conditions on the class of environments under which an agent exists which attains the best asymptotic reward for any environment in the class. We analyze how tight these conditions are, and how they relate to different probabilistic assumptions known in reinforcement learning and related fields, such as Markov Decision Processes and mixing conditions.

9.
于丹宁, 倪坤, 刘云龙. 计算机工程 (Computer Engineering), 2021, 47(2): 90-94, 102
QMDP-net, a value iteration algorithm for Partially Observable Markov Decision Processes (POMDPs) based on convolutional neural networks, performs well without prior knowledge, but it suffers from optimization problems such as unstable training and parameter sensitivity. This paper proposes RQMDP-net, a POMDP value iteration algorithm based on recurrent convolutional neural networks: a gated recurrent unit network implements the value iteration update, preserving the convolutional character of the input and recurrent weight matrices while strengthening the network's temporal...

10.
Communication is an important resource for multiagent coordination. Interactive Dynamic Influence Diagrams (I-DIDs) have been used extensively in multiagent planning under uncertainty and are recognized graphical representations of Interactive Partially Observable Markov Decision Processes (I-POMDPs). We establish a communication model among multiple agents based on the I-DID framework. We use the AND-communication method, assuming a separate communication and action phase in each step rather than replacing domain actions, so that communication facilitates better domain-action selection. We use a synchronized communication type: when an agent initiates communication, all of the agent's teammates synchronize to share their recent observations. We give a general algorithm to compute the communication decision from a single-agent perspective by comparing expected rewards with and without communication. Finally, we use the multiagent “tiger” and “concert” problems to validate the model's effectiveness.
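The core decision rule, comparing expected rewards with and without communication, can be phrased compactly. The sketch below is a generic value-of-communication test; `q_value` and `sync_update` are hypothetical callbacks standing in for whatever I-DID evaluation the paper actually uses.

```python
def should_communicate(belief, domain_actions, comm_cost, q_value, sync_update):
    """Initiate communication iff the best expected reward after the
    synchronized observation exchange, net of the message cost, beats
    the best expected reward under the current (silent) belief.
    q_value(belief, a) and sync_update(belief) are assumed callbacks."""
    v_silent = max(q_value(belief, a) for a in domain_actions)
    v_shared = max(q_value(sync_update(belief), a) for a in domain_actions)
    return v_shared - comm_cost > v_silent
```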

11.
In recent years, crowd animation has been widely studied and applied in robotics, film, games, and related fields, but traditional crowd-animation techniques involve complex motion planning or collision-avoidance operations and are computationally expensive. This paper proposes a trajectory generation algorithm for crowd animation based on Markov Decision Processes (MDPs) that produces collision-free trajectories for all agents without explicit collision detection. An improved value iteration algorithm is also proposed for computing the MDP's state values; experiments in grid environments show that it is clearly more efficient than value iteration with a Euclidean-distance heuristic and than Dijkstra's algorithm. Crowd-animation experiments in three-dimensional (3D) scenes show that the trajectory generation algorithm moves the crowd toward its goals without collisions while retaining diverse behavior.
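The abstract does not specify the improved value iteration, so the sketch below shows only the baseline idea: value iteration on an occupancy grid, after which each agent simply climbs the value function, yielding goal-directed trajectories that avoid static obstacles without per-step collision checks against the map. The grid encoding and costs are assumptions.

```python
import numpy as np

def grid_value_iteration(grid, goal, gamma=0.95, step_cost=1.0, eps=1e-4):
    """Value iteration on an occupancy grid (numpy array, 1 = obstacle).
    The goal cell is pinned at value 0; every other free cell converges
    to the (discounted) negative cost-to-go, so greedy ascent of V
    steers an agent to the goal around obstacles."""
    H, W = grid.shape
    V = np.zeros((H, W))
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    while True:
        V_new = V.copy()
        for i in range(H):
            for j in range(W):
                if grid[i, j] == 1 or (i, j) == goal:
                    continue
                best = -np.inf
                for di, dj in moves:
                    ni, nj = i + di, j + dj
                    if 0 <= ni < H and 0 <= nj < W and grid[ni, nj] == 0:
                        best = max(best, -step_cost + gamma * V[ni, nj])
                if best > -np.inf:
                    V_new[i, j] = best
        if np.abs(V_new - V).max() < eps:
            return V_new
        V = V_new
```

An agent at a cell moves to the free neighbor with the highest value; breaking ties randomly could be one source of the trajectory diversity mentioned above, though the paper's mechanism may differ.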

12.
A continuous-time Markov decision process (CTMDP) is a generalization of a continuous-time Markov chain in which both probabilistic and nondeterministic choices co-exist. This paper presents an efficient algorithm to compute the maximum (or minimum) probability to reach a set of goal states within a given time bound in a uniform CTMDP, i.e., a CTMDP in which the delay time distribution per state visit is the same for all states. It furthermore proves that these probabilities coincide for (time-abstract) history-dependent and Markovian schedulers that resolve nondeterminism either deterministically or in a randomized way.
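For a uniform CTMDP, time-bounded reachability can be organized as a finite-horizon backward recursion over the number of jumps of the embedded chain, since the jump count in [0, t] is Poisson(rate·t). The sketch below follows that idea with goal states treated as absorbing; it is a simplified illustration of this style of algorithm, with the array shapes and truncation rule chosen for clarity rather than taken from the paper.

```python
import math
import numpy as np

def ctmdp_max_reach(P, rate, goal, t, eps=1e-6):
    """Maximum probability of reaching `goal` within time t in a uniform
    CTMDP (every state has exit rate `rate`). P: embedded-chain kernels,
    shape (n_actions, n_states, n_states). Conditioning on the Poisson
    jump count reduces the problem to a finite-horizon MDP; V_j(s) holds
    sum_{k >= j} Poisson(k) * Pr(in goal after k jumps | state s at jump j)."""
    lam_t = rate * t
    q = [math.exp(-lam_t)]              # Poisson(k; lam_t) for k = 0
    total = q[0]
    while total < 1.0 - eps:            # truncate once the tail < eps
        q.append(q[-1] * lam_t / len(q))
        total += q[-1]
    N = len(q)
    n_states = P.shape[1]
    in_goal = np.array([s in goal for s in range(n_states)])
    tail = 1.0 - total                  # mass of k >= N, at most eps
    V = np.where(in_goal, tail, 0.0)    # V_N: goal is absorbing
    for j in range(N - 1, -1, -1):
        tail += q[j]                    # now tail = sum_{k >= j} q_k
        step = np.max(P @ V, axis=0)    # best action before the next jump
        V = np.where(in_goal, tail, step)
    return V                            # V[s] = max Pr(reach goal by time t)
```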

13.
Belogolovsky, Stav; Korsunsky, Philip; Mannor, Shie; Tessler, Chen; Zahavy, Tom. Machine Learning, 2021, 110(9): 2295-2334
We consider the task of Inverse Reinforcement Learning in Contextual Markov Decision Processes (MDPs). In this setting, contexts, which define the reward and transition kernel,...

14.
15.
This research presents an optimization technique for route planning and exploration in unknown environments. It employs a hybrid architecture that implements detection, avoidance, and planning using autonomous agents with coordination capabilities. When these agents work toward a common objective they require a robust information-interchange module for coordination; working independently, they cannot achieve the goal, and the coordination module enhances their performance and efficiency. Multi-agent systems can be employed for searching for items in unknown environments; the search for unexploded ordnance such as land mines is an important application where they are best employed. The hybrid architecture incorporates the Learning Real-Time A* (LRTA*) algorithm for route planning and compares it with the A* search algorithm. LRTA* shows better results in the multi-agent environment and proves to be an efficient and robust algorithm. A simulated ant-agent system is presented for route planning and optimization and proves efficient and robust for large and complex environments.
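Korf-style LRTA*, the real-time variant compared above, interleaves a one-step lookahead update of the heuristic with movement. The sketch below is a standard single-agent formulation; `neighbors` and `h0` are assumed problem callbacks, not this paper's interface.

```python
def lrta_star(start, goal_test, neighbors, h0, max_steps=10_000):
    """Learning Real-Time A*: raise the current state's heuristic to the
    best one-step lookahead value, then move to the minimizing neighbor.
    neighbors(s) yields (next_state, edge_cost) pairs; h0 is an
    admissible initial heuristic. Returns the traversed path."""
    h = {}                               # learned heuristic overrides
    s = start
    path = [s]
    for _ in range(max_steps):
        if goal_test(s):
            return path
        best_s, best_f = None, float("inf")
        for s2, cost in neighbors(s):
            f = cost + h.get(s2, h0(s2))
            if f < best_f:
                best_s, best_f = s2, f
        h[s] = best_f                    # learning step
        s = best_s
        path.append(s)
    return None                          # step budget exhausted
```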

16.
Learning automata (LA) were recently shown to be valuable tools for designing multiagent reinforcement learning algorithms. One of the principal contributions of the LA theory is that a set of decentralized independent LA is able to control a finite Markov chain with unknown transition probabilities and rewards. In this paper, we propose to extend this algorithm to Markov games—a straightforward extension of single-agent Markov decision problems to distributed multiagent decision problems. We show that under the same ergodic assumptions of the original theorem, the extended algorithm will converge to a pure equilibrium point between agent policies.
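The decentralized result rests on simple reward-inaction automata: each automaton nudges its action probabilities only after favorable feedback. A minimal linear reward-inaction (L_R-I) update is sketched below; the binary-reward interface is an assumption.

```python
import random

class LinearRewardInaction:
    """L_R-I learning automaton: on success (reward 1), shift probability
    mass toward the chosen action; on failure (reward 0), change nothing."""
    def __init__(self, n_actions, lr=0.05):
        self.p = [1.0 / n_actions] * n_actions
        self.lr = lr

    def choose(self):
        return random.choices(range(len(self.p)), weights=self.p)[0]

    def update(self, action, reward):
        if reward:   # reward-inaction: move only on success
            self.p = [pi + self.lr * ((i == action) - pi)
                      for i, pi in enumerate(self.p)]
```

In the scheme discussed above, one such automaton per system state (per agent) suffices; under the cited ergodicity assumptions the team converges to a pure equilibrium between the agents' policies.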

17.
Robust control policies for closed queueing networks with uncertain routing probabilities
The paper is concerned with robust control problems for exponential controlled closed queueing networks (CCQNs) under uncertain routing probabilities. As the rows of some parameter matrices, such as infinitesimal generators, may be dependent, we first transform the objective vector under the discounted-cost criterion into a weighted-average cost. Through the solution of the Poisson equation, i.e., Markov performance potentials, we then unify the discounted-cost and average-cost problems and derive the gradient formula of the new objective function with respect to the routing probabilities. Solution techniques for searching for the optimal robust control policy are then discussed. Finally, a numerical example is presented and analyzed.
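The performance potentials at the heart of this approach solve a Poisson equation for the underlying chain. As a numerical illustration only (the paper works with continuous-time generators and a gradient formula not reproduced here), the following solves the discrete-time Poisson equation (I - P)g = f - η·1 with the normalization π·g = 0.

```python
import numpy as np

def performance_potentials(P, f):
    """Potentials g and average cost eta for an ergodic chain with
    transition matrix P and per-state cost vector f, solving
    (I - P) g = f - eta * 1 with the normalization pi @ g = 0."""
    n = P.shape[0]
    # stationary distribution: left null vector of (I - P), normalized
    A = np.vstack([(np.eye(n) - P).T, np.ones(n)])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi = np.linalg.lstsq(A, b, rcond=None)[0]
    eta = pi @ f
    # the Poisson equation alone is singular; pin down g with pi @ g = 0
    B = np.vstack([np.eye(n) - P, pi])
    c = np.concatenate([f - eta, [0.0]])
    g = np.linalg.lstsq(B, c, rcond=None)[0]
    return g, eta
```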

18.

19.
Sorting-and-transport robots face low efficiency and frequent errors in path optimization when route planning involves many target points; in particular, when multiple minima exist, the A* algorithm cannot guarantee an optimal path. To address this, an improved A* algorithm that combines the ant colony algorithm with A* is proposed. First, A* selects an optimized route along which pheromone is distributed, simplifying A*'s path-planning computation. Then, taking the selected route as a basis, three general schemes combining the ant colony algorithm are designed for different situations and used for the concrete path planning, resolving A*'s tendency to carry along large amounts of duplicated data. Simulations and physical experiments verify that the proposed improved A* algorithm meets the needs of automated sorting and transport and is practical to deploy.
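For reference, the seed stage of such a hybrid is plain A*; in the scheme above, the returned route would receive the initial pheromone for the ant-colony refinement. `neighbors(s)` and the admissible heuristic `h` are assumed callbacks.

```python
import heapq
import itertools

def a_star(start, goal, neighbors, h):
    """Plain A* with lazy deletion. neighbors(s) yields
    (next_state, edge_cost) pairs; h is an admissible heuristic."""
    tie = itertools.count()              # avoids comparing states/paths
    frontier = [(h(start), 0.0, next(tie), start, [start])]
    best_g = {}                          # cheapest g at which s was expanded
    while frontier:
        f, g, _, s, path = heapq.heappop(frontier)
        if s == goal:
            return path
        if best_g.get(s, float("inf")) <= g:
            continue                     # already expanded more cheaply
        best_g[s] = g
        for s2, cost in neighbors(s):
            g2 = g + cost
            if g2 < best_g.get(s2, float("inf")):
                heapq.heappush(frontier,
                               (g2 + h(s2), g2, next(tie), s2, path + [s2]))
    return None                          # no route to the goal
```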

20.
We formalize the problem of Structured Prediction as a Reinforcement Learning task. We first define a Structured Prediction Markov Decision Process (SP-MDP), an instantiation of Markov Decision Processes for Structured Prediction, and show that learning an optimal policy for this SP-MDP is equivalent to minimizing the empirical loss. This link between the supervised learning formulation of structured prediction and reinforcement learning (RL) allows us to use approximate RL methods for learning the policy. The proposed model makes weak assumptions both on the nature of the Structured Prediction problem and on the supervision process: it assumes nothing about the decomposition of loss functions, about data encoding, or about the availability of optimal policies for training. It therefore copes with a large range of structured prediction problems; besides, it scales well and can be used for solving both complex and large-scale real-world problems. We describe two series of experiments. The first provides an analysis of RL on classical sequence prediction benchmarks and compares our approach with state-of-the-art SP algorithms. The second introduces a tree transformation problem where most previous models fail, a complex instance of the general labeled tree mapping problem. We show that RL exploration is effective and leads to successful results on this challenging task, a clear confirmation that RL can be used for large-scale and complex structured prediction problems.
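To make the reduction concrete, here is a hedged sketch of one SP-MDP episode for sequence labeling: the state is the input plus the partial output, each action emits one label, and the terminal reward is the negated structured loss, so maximizing return minimizes empirical loss. `policy` and `loss` are hypothetical callbacks, not the paper's interface.

```python
def sp_mdp_episode(x_tokens, y_gold, policy, loss):
    """One SP-MDP episode for sequence labeling (illustrative).
    State = (input, partial output, position); an action appends one
    label; the episode's return is the negated structured loss."""
    y_partial = []
    for t in range(len(x_tokens)):
        state = (tuple(x_tokens), tuple(y_partial), t)
        y_partial.append(policy(state))      # one structured decision
    return y_partial, -loss(y_partial, y_gold)
```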
