20 similar documents found; search took 15 ms.
1.
Karina Valdivia Delgado Scott Sanner Leliane Nunes de Barros 《Artificial Intelligence》2011,175(9-10):1498-1527
When modeling real-world decision-theoretic planning problems in the Markov Decision Process (MDP) framework, it is often impossible to obtain a completely accurate estimate of transition probabilities. For example, natural uncertainty arises in the transition specification due to elicitation of MDP transition models from an expert or estimation from data, or non-stationary transition distributions arising from insufficient state knowledge. In the interest of obtaining the most robust policy under transition uncertainty, the Markov Decision Process with Imprecise Transition Probabilities (MDP-IP) has been introduced to model such scenarios. Unfortunately, while various solution algorithms exist for MDP-IPs, they often require external calls to optimization routines and thus can be extremely time-consuming in practice. To address this deficiency, we introduce the factored MDP-IP and propose efficient dynamic programming methods to exploit its structure. Noting that the key computational bottleneck in the solution of factored MDP-IPs is the need to repeatedly solve nonlinear constrained optimization problems, we show how to target approximation techniques to drastically reduce the computational overhead of the nonlinear solver while producing bounded, approximately optimal solutions. Our results show up to two orders of magnitude speedup in comparison to traditional “flat” dynamic programming approaches and up to an order of magnitude speedup over the extension of factored MDP approximate value iteration techniques to MDP-IPs, while producing the lowest error of any approximation algorithm evaluated.
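The factored representation and the paper's nonlinear-solver approximations are beyond a short sketch, but the core robust backup — maximizing over actions while "nature" adversarially picks transition probabilities from given intervals — can be illustrated for a flat interval MDP. Everything below (function names, the greedy inner solver, the toy shapes) is an illustrative assumption, not the paper's algorithm:

```python
import numpy as np

def worst_case_expectation(v, lo, hi):
    """Minimize sum_j p_j * v[j] subject to lo <= p <= hi and sum(p) = 1.
    For box (interval) constraints this is solved greedily: start from the
    lower bounds and push the remaining mass toward the lowest-value states."""
    order = np.argsort(v)              # lowest values first
    p = lo.copy()
    budget = 1.0 - lo.sum()            # probability mass still to assign
    for j in order:
        add = min(hi[j] - lo[j], budget)
        p[j] += add
        budget -= add
        if budget <= 0:
            break
    return p @ v

def robust_value_iteration(R, P_lo, P_hi, gamma=0.9, tol=1e-6):
    """Max-min value iteration. R[s,a] is the reward; P_lo/P_hi[s,a,:] bound
    the transition distribution for each state-action pair."""
    S, A = R.shape
    v = np.zeros(S)
    while True:
        q = np.array([[R[s, a] + gamma * worst_case_expectation(v, P_lo[s, a], P_hi[s, a])
                       for a in range(A)] for s in range(S)])
        v_new = q.max(axis=1)          # agent maximizes, nature minimized above
        if np.abs(v_new - v).max() < tol:
            return v_new
        v = v_new
```

The greedy inner step replaces the external optimization call the abstract describes as the bottleneck; for general (non-interval) credal sets a nonlinear solver would still be needed.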
2.
In intelligent planning, finding a solution plan is NP-hard or even NP-complete, and when action effects are uncertain, as in planning problems modeled as Markov decision processes, solving becomes harder still. Existing planning algorithms for Markov decision processes typically describe the actual effect of an action with a single aggregate state node, trying to sidestep the internal complexity of states, yet in reality many actions produce multiple propositional effects corresponding to multiple proposition nodes. To handle this, the concepts of image actions, image path nodes, and image planning graphs are proposed, and on their basis an ant-colony planning algorithm for Markov decision processes is presented, solving this problem. It is further proved that the solutions obtained by the algorithm remain reliable with probability no lower than a given bound, even in uncertain execution environments.
3.
A reinforcement learning agent solves decision problems by learning an optimal policy mapping states to actions; reinforcement learning methods let the agent improve its behavior through trial interactions with the environment. The Markov decision process (MDP) model is the standard framework for reinforcement learning problems, and dynamic programming provides policy-dependent value-function learning algorithms for agents in Markovian environments. However, because the agent must memorize the entire value function during learning, the required memory grows enormously with the size of the state space. This paper proposes a forgetting algorithm for reinforcement learning based on dynamic programming: by introducing basic principles of forgetting from memory psychology into value-function learning, it derives an effective dynamic-programming method for reinforcement learning, the Forget-DP algorithm.
4.
Emmanuel Benazera 《Autonomous Agents and Multi-Agent Systems》2011,23(1):71-113
An approximation method is proposed that solves a class of Decentralized Hybrid Markov Decision Processes (DEC-HMDPs). These DEC-HMDPs have both discrete and continuous state variables and represent individual agents with continuous measurable state spaces, such as resources. Adding to the natural complexity of decentralized problems, continuous state variables lead to a blowup in potential decision points. Representing value functions as Rectangular Piecewise Constant (RPWC) functions, we formalize and detail an extension to the Coverage Set Algorithm (CSA) (Becker et al., J Artif Intell Res, 22, 2004) that solves transition-independent DEC-HMDPs with controlled error. The resource constraints of each agent lead to problems that are over-subscribed in the number of agents, i.e., where some agents have no role to play. Based on our extension to the CSA, two heuristics are proposed that allow A*-like search to find the minimal optimal team of agents that solves a given problem. We apply and test our algorithms on a range of multi-robot exploration problems with continuous resource constraints.
5.
Intelligent planning under uncertainty usually assumes that the transition probabilities between world states are known exactly. In practice, however, modeling experts must sometimes work with incomplete information: quantitative information about transition uncertainty can only be obtained through guesswork or incomplete statistics, and sometimes only qualitative information is available. Building on LAO, the winning system of the 2004 probabilistic planning competition, the JLU-RLAO and JLU-QLAO systems are designed to guarantee robust planning even when exact transition probabilities cannot be obtained. Experimental results show that JLU-RLAO and JLU-QLAO solve such uncertain planning problems quickly and efficiently.
6.
Verification of reachability properties for probabilistic systems is usually based on variants of Markov processes. Current methods assume an exact model of the dynamic behavior and are not suitable for realistic systems that operate in the presence of uncertainty and variability. This research note extends existing methods for Bounded-parameter Markov Decision Processes (BMDPs) to solve the reachability problem. BMDPs are a generalization of MDPs that allows modeling uncertainty. Our results show that interval value iteration converges in the case of an undiscounted reward criterion, which is required to formulate the problems of maximizing the probability of reaching a set of desirable states or minimizing the probability of reaching an unsafe set. An analysis of the computational complexity is also presented.
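As a rough illustration of the interval value iteration the note builds on, the sketch below computes a pessimistic (adversarial-nature) maximum-reachability probability for a tiny interval MDP. The function name, the fixed iteration count, and the greedy inner minimization are assumptions for the sketch, not the authors' formulation:

```python
import numpy as np

def interval_max_reach(P_lo, P_hi, goal, n_iter=500):
    """Pessimistic max probability of reaching `goal` in an interval MDP.
    P_lo/P_hi[s,a,:] bound each transition distribution; nature picks the
    worst admissible distribution at every step (undiscounted criterion)."""
    S, A = P_lo.shape[0], P_lo.shape[1]
    v = np.zeros(S)
    v[list(goal)] = 1.0                      # goal states have probability 1
    for _ in range(n_iter):
        new = v.copy()
        for s in range(S):
            if s in goal:
                continue
            best = 0.0
            for a in range(A):
                # adversary pushes free mass toward low-probability successors
                order = np.argsort(v)
                p = P_lo[s, a].copy()
                budget = 1.0 - p.sum()
                for j in order:
                    add = min(P_hi[s, a, j] - p[j], budget)
                    p[j] += add
                    budget -= add
                best = max(best, p @ v)      # agent maximizes over actions
            new[s] = best
        v = new
    return v
```

The note's convergence result is what justifies iterating this undiscounted operator to a fixed point; the fixed `n_iter` here is a crude stand-in for a proper convergence test.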
7.
8.
We address the problem of reinforcement learning in which observations may exhibit an arbitrary form of stochastic dependence on past observations and actions, i.e. environments more general than (PO)MDPs. The task for an agent is to attain the best possible asymptotic reward where the true generating environment is unknown, but belongs to a known countable family of environments. We find some sufficient conditions on the class of environments under which an agent exists which attains the best asymptotic reward for any environment in the class. We analyze how tight these conditions are, and how they relate to different probabilistic assumptions known in reinforcement learning and related fields, such as Markov Decision Processes and mixing conditions.
9.
10.
Communication is an important resource for multiagent coordination. Interactive Dynamic Influence Diagrams (I-DIDs) have been used extensively in multiagent planning under uncertainty, and they are recognized graphical representations of Interactive Partially Observable Markov Decision Processes (I-POMDPs). We establish a communication model among multiple agents based on the I-DID framework. We use the AND-communication method, assuming a separate communication and action phase in each step rather than replacing domain actions, so that communication facilitates better domain-action selection. We use a synchronized communication type: when an agent initiates communication, all of the agent’s teammates synchronize to share their recent observations. We give a general algorithm to compute the communication decision from a single-agent perspective by comparing expected rewards with and without communication. Finally, we use the multiagent “tiger” and “concert” problems to validate the model’s effectiveness.
11.
In recent years, crowd animation has been widely studied and applied in robotics, film, games, and other fields, but traditional crowd-animation techniques involve complex motion planning or collision-avoidance operations and are computationally expensive. This paper proposes a trajectory-generation algorithm for crowd animation based on Markov decision processes (MDPs) that generates collision-free trajectories for the agents without collision detection. An improved value-iteration algorithm is also proposed for computing the state values of the MDP; experiments in grid environments show that it is clearly more efficient than value iteration using Euclidean distance as the heuristic and than Dijkstra's algorithm. Crowd-animation experiments in three-dimensional (3D) scenes using the proposed trajectory-generation algorithm show that it moves the crowd toward the goal without collisions while retaining diverse motion.
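A minimal sketch of the underlying idea: compute one value function over the grid by value iteration, then let every agent greedily climb it toward the goal. The 4-connected moves, uniform step cost, and function name are assumptions for illustration; the paper's improved value iteration and 3D setup are not reproduced here:

```python
import numpy as np

def grid_values(grid, goal, gamma=0.95, step_cost=-1.0, n_iter=200):
    """Value iteration on a 4-connected grid. grid[r, c] == 1 marks an
    obstacle. Agents that greedily follow the returned values head toward
    `goal` while never stepping into obstacle cells."""
    R, C = grid.shape
    v = np.zeros((R, C))
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    for _ in range(n_iter):
        new = v.copy()
        for r in range(R):
            for c in range(C):
                if (r, c) == goal or grid[r, c]:
                    continue                      # goal and obstacles are fixed
                best = -np.inf
                for dr, dc in moves:
                    nr, nc = r + dr, c + dc
                    if 0 <= nr < R and 0 <= nc < C and not grid[nr, nc]:
                        best = max(best, step_cost + gamma * v[nr, nc])
                if best > -np.inf:
                    new[r, c] = best
        v = new
    return v
```

Because all agents share one value function, per-pair collision checks are avoided, which matches the abstract's claim of trajectory generation without collision detection.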
12.
《Theoretical computer science》2005,345(1):2-26
A continuous-time Markov decision process (CTMDP) is a generalization of a continuous-time Markov chain in which both probabilistic and nondeterministic choices co-exist. This paper presents an efficient algorithm to compute the maximum (or minimum) probability to reach a set of goal states within a given time bound in a uniform CTMDP, i.e., a CTMDP in which the delay time distribution per state visit is the same for all states. It furthermore proves that these probabilities coincide for (time-abstract) history-dependent and Markovian schedulers that resolve nondeterminism either deterministically or in a randomized way.
13.
Belogolovsky Stav Korsunsky Philip Mannor Shie Tessler Chen Zahavy Tom 《Machine Learning》2021,110(9):2295-2334
We consider the task of Inverse Reinforcement Learning in Contextual Markov Decision Processes (MDPs). In this setting, contexts, which define the reward and transition kernel,...
14.
15.
This research presents an optimization technique for route planning and exploration in unknown environments. It employs a hybrid architecture that implements detection, avoidance, and planning using autonomous agents with coordination capabilities. When these agents work toward a common objective, they require a robust information-interchange module for coordination; they cannot achieve the goal when working independently, and the coordination module enhances their performance and efficiency. Multi-agent systems can be employed for searching for items in unknown environments; the search for unexploded ordnance such as land mines is an important application where they can be best employed. The hybrid architecture incorporates the Learning Real-Time A* (LRTA*) algorithm for route planning and compares it with the A* search algorithm. LRTA* shows better results in the multi-agent environment and proves to be an efficient and robust algorithm. A simulated ant-agent system is presented for route planning and optimization and proves efficient and robust for large and complex environments.
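The Learning Real-Time A* update the abstract refers to is simple to sketch: at each step the agent moves to the neighbor minimizing cost-plus-heuristic and raises its stored heuristic estimate to that minimum, so repeated trials learn increasingly accurate values. The code below is a generic single-trial, single-agent sketch (names and the `neighbors` callback are assumptions), not the paper's multi-agent implementation:

```python
def lrta_star(start, goal, neighbors, h):
    """One LRTA* trial. `neighbors(s)` yields (successor, edge_cost) pairs;
    `h` is a dict of heuristic estimates, updated in place (missing entries
    default to 0, an admissible baseline). Returns the path travelled."""
    path = [start]
    s = start
    while s != goal:
        best_f, best_n = float('inf'), None
        for n, cost in neighbors(s):
            f = cost + h.get(n, 0)
            if f < best_f:
                best_f, best_n = f, n
        h[s] = max(h.get(s, 0), best_f)   # learning step: raise h toward truth
        s = best_n
        path.append(s)
    return path
```

Running the trial repeatedly from the same start (keeping `h` between trials) converges to an optimal route in finite, safely explorable spaces, which is the property that makes LRTA* attractive for online multi-agent search.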
16.
Vrancx P. Verbeeck K. Nowe A. 《IEEE transactions on systems, man, and cybernetics. Part B, Cybernetics》2008,38(4):976-981
17.
Robust Control Policy for Closed Queuing Networks with Uncertain Routing Probabilities (Cited: 1; self-citations: 0; citations by others: 1)
The paper is concerned with robust control problems for exponential controlled closed queuing networks (CCQNs) under uncertain routing probabilities. As the rows of some parameter matrices, such as infinitesimal generators, may be dependent, we first transform the objective vector under the discounted-cost criterion into a weighted-average cost. Through the solution to the Poisson equation, i.e., Markov performance potentials, we then unify the discounted-cost and average-cost problems, and derive the gradient formula of the new objective function with respect to the routing probabilities. Solution techniques for finding the optimal robust control policy are also discussed. Finally, a numerical example is presented and analyzed.
18.
Robust Control Policy for Closed Queuing Networks with Uncertain Routing Probabilities (Cited: 1; self-citations: 0; citations by others: 1)
The paper is concerned with robust control problems for exponential controlled closed queuing networks (CCQNs) under uncertain routing probabilities. As the rows of some parameter matrices, such as infinitesimal generators, may be dependent, we first transform the objective vector under the discounted-cost criterion into a weighted-average cost. Through the solution to the Poisson equation, i.e., Markov performance potentials, we then unify the discounted-cost and average-cost problems, and derive the gradient formula of the new objective function with respect to the routing probabilities. Solution techniques for finding the optimal robust control policy are also discussed. Finally, a numerical example is presented and analyzed.
19.
Sorting-and-transport robots suffer from low path-finding efficiency and frequent errors when path planning involves many target points, and the A* algorithm cannot produce an optimal path when multiple minima exist. To address these problems, an improved A* algorithm combining the ant colony algorithm with A* is proposed. First, A* selects an optimized route along which pheromone is distributed, simplifying A*'s path-planning computation. Then, taking the selected route as a basis, three general schemes combining the ant colony algorithm are designed for different situations and used for concrete path planning, resolving A*'s inherent tendency to carry in large amounts of redundant data. Simulations and physical experiments verify that the proposed improved A* algorithm meets the requirements of automated sorting and transport and is worth wider adoption.
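The first stage of such a hybrid — running A* once and depositing pheromone along the resulting route so that later ants are biased toward it — can be sketched as follows. The function names, constants, and grid encoding are illustrative assumptions; the abstract's three scheme variants are not reproduced:

```python
import heapq

def astar(grid, start, goal):
    """Plain A* on a 4-connected grid (0 = free, 1 = obstacle) with a
    Manhattan-distance heuristic. Returns a shortest path as a list of cells."""
    R, C = len(grid), len(grid[0])
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])
    frontier = [(h(start), start, [start])]   # (f = g + h, cell, path so far)
    seen = set()
    while frontier:
        f, p, path = heapq.heappop(frontier)
        if p == goal:
            return path
        if p in seen:
            continue
        seen.add(p)
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            n = (p[0] + dr, p[1] + dc)
            if 0 <= n[0] < R and 0 <= n[1] < C and not grid[n[0]][n[1]] and n not in seen:
                heapq.heappush(frontier, (len(path) + h(n), n, path + [n]))
    return None

def seed_pheromone(path, tau0=0.1, boost=1.0):
    """Deposit extra pheromone on each edge of the A*-selected route, giving
    subsequent ants a strong prior toward that route."""
    tau = {}
    for a, b in zip(path, path[1:]):
        tau[(a, b)] = tau.get((a, b), tau0) + boost
    return tau
```

An ant colony stage would then sample successor cells with probability proportional to pheromone (times a heuristic desirability) and evaporate/redeposit `tau` per tour; seeding from A* spares the ants the long initial exploration phase.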
20.
We formalize the problem of Structured Prediction as a Reinforcement Learning task. We first define a Structured Prediction Markov Decision Process (SP-MDP), an instantiation of Markov Decision Processes for Structured Prediction and show that learning an optimal policy for this SP-MDP is equivalent to minimizing the empirical loss. This link between the supervised learning formulation of structured prediction and reinforcement learning (RL) allows us to use approximate RL methods for learning the policy. The proposed model makes weak assumptions both on the nature of the Structured Prediction problem and on the supervision process. It does not make any assumption on the decomposition of loss functions, on data encoding, or on the availability of optimal policies for training. It then allows us to cope with a large range of structured prediction problems. Besides, it scales well and can be used for solving both complex and large-scale real-world problems. We describe two series of experiments. The first one provides an analysis of RL on classical sequence prediction benchmarks and compares our approach with state-of-the-art SP algorithms. The second one introduces a tree transformation problem where most previous models fail. This is a complex instance of the general labeled tree mapping problem. We show that RL exploration is effective and leads to successful results on this challenging task. This is a clear confirmation that RL could be used for large size and complex structured prediction problems.