Found 20 similar documents; search took 78 ms
1.
Hyeong Soo Chang 《Automatic Control, IEEE Transactions on》2008,53(1):350-355
2.
We compare the computational performance of linear programming (LP) and the policy iteration algorithm (PIA) for solving discrete-time infinite-horizon Markov decision process (MDP) models with total expected discounted reward. We use randomly generated test problems as well as a real-life health-care problem to show empirically that, contrary to previous reports, barrier methods for LP provide a viable tool for optimally solving such MDPs. The dimensions of comparison include transition probability matrix structure, state and action size, and the LP solution method.
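As a reference point for the PIA side of this comparison, here is a minimal policy-iteration sketch for a discounted tabular MDP. The 2-state, 2-action model and all of its numbers are hypothetical, not taken from the paper.

```python
# Policy iteration for a small discounted MDP (hypothetical example).
GAMMA = 0.9

# P[s][a][s'] = transition probability, R[s][a] = expected one-step reward
P = [
    [[0.8, 0.2], [0.2, 0.8]],   # state 0, actions 0 / 1
    [[0.5, 0.5], [0.1, 0.9]],   # state 1, actions 0 / 1
]
R = [
    [1.0, 0.0],                 # state 0
    [0.0, 2.0],                 # state 1
]
S, A = 2, 2

def evaluate(policy, tol=1e-10):
    """Policy evaluation by successive approximation (a contraction)."""
    V = [0.0] * S
    while True:
        newV = [R[s][policy[s]]
                + GAMMA * sum(P[s][policy[s]][t] * V[t] for t in range(S))
                for s in range(S)]
        if max(abs(newV[s] - V[s]) for s in range(S)) < tol:
            return newV
        V = newV

def policy_iteration():
    policy = [0] * S
    while True:
        V = evaluate(policy)
        # greedy improvement step
        improved = [max(range(A),
                        key=lambda a: R[s][a]
                        + GAMMA * sum(P[s][a][t] * V[t] for t in range(S)))
                    for s in range(S)]
        if improved == policy:
            return policy, V
        policy = improved

pi, V = policy_iteration()
print(pi)   # optimal action per state
```

The LP alternative studied in the paper would instead minimize Σ V(s) subject to V(s) ≥ R(s,a) + γ Σ P(s'|s,a) V(s') for all s, a, typically with a barrier-method solver.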
3.
Certain methods for constructing embedded Markov decision processes (MDPs) lead to performance measures that are the ratio of two long-run averages. For such MDPs with finite state and action spaces and under an ergodicity assumption, this note presents algorithms for computing optimal policies based on policy iterations, linear programming, value iterations and Q-learning.
4.
Masanori Hosaka, Jun-ichi Nakagami & Masami Kurano 《International Transactions in Operational Research》2002,9(1):113-123
In this paper, the average case of finite-state controlled Markov set-chains with R^p-set-valued rewards is considered. Optimization is carried out with respect to a pseudo-order relation on the set of all convex and compact subsets of R^p induced by a closed convex cone. We introduce a v-step contractive property (minorization condition) for the average case and use it to characterize the average expected reward set of a periodic policy. Applying the scalarization technique, a Pareto optimal policy is obtained. A numerical example is also given to illustrate the results.
5.
Kan-Jian Zhang, Yan-Kai Xu, Xi Chen & Xi-Ren Cao 《Automatica》2008,44(4):1055-1061
It is well known that stochastic control systems can be viewed as Markov decision processes (MDPs) with continuous state spaces. In this paper, we propose to apply the policy iteration approach of MDPs to the optimal control problem of stochastic systems. We first provide an optimality equation based on performance potentials and develop a policy iteration procedure. Then we apply policy iteration to the jump linear quadratic problem and obtain the coupled Riccati equations for its optimal solutions. The approach is applicable to linear as well as nonlinear systems and can be implemented on-line on real-world systems without identifying all the system structure and parameters.
6.
《IEEE transactions on systems, man, and cybernetics. Part B, Cybernetics》2008,38(6):1645-1651
7.
Dmitri A. Dolgov & Edmund H. Durfee 《Annals of Mathematics and Artificial Intelligence》2006,47(3-4):273-293
A weakness of classical Markov decision processes (MDPs) is that they scale very poorly due to the flat state-space representation. Factored MDPs address this representational problem by exploiting problem structure to specify the transition and reward functions of an MDP in a compact manner. However, in general, solutions to factored MDPs do not retain the structure and compactness of the problem representation, forcing approximate solutions, with approximate linear programming (ALP) emerging as a promising MDP-approximation technique. To date, most ALP work has focused on the primal-LP formulation, while the dual LP, which forms the basis for solving constrained Markov problems, has received much less attention. We show that a straightforward linear approximation of the dual optimization variables is problematic, because some of the required computations cannot be carried out efficiently. Nonetheless, we develop a composite approach that symmetrically approximates the primal and dual optimization variables (effectively approximating both the objective function and the feasible region of the LP), leading to a formulation that is computationally feasible and suitable for solving constrained MDPs. We empirically show that this new ALP formulation also performs well on unconstrained problems.
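For reference, the standard primal and dual LPs of a discounted MDP, on which the factored and approximate versions discussed above build, can be written as follows (textbook notation, not necessarily the paper's):

```latex
% Primal LP over value functions V (\alpha is a state-relevance distribution):
\min_{V}\ \sum_{s}\alpha(s)\,V(s)
\quad\text{s.t.}\quad
V(s)\ \ge\ R(s,a)+\gamma\sum_{s'}P(s'\mid s,a)\,V(s')\qquad\forall\, s,a .

% Dual LP over occupation measures x(s,a), the basis for constrained MDPs:
\max_{x\ge 0}\ \sum_{s,a}x(s,a)\,R(s,a)
\quad\text{s.t.}\quad
\sum_{a}x(s',a)-\gamma\sum_{s,a}P(s'\mid s,a)\,x(s,a)=\alpha(s')\qquad\forall\, s' .

% ALP restricts the primal to a linear architecture
% V(s) \approx \sum_{i} w_i\,\phi_i(s) over basis functions \phi_i.
```

The composite approach of the paper approximates both x and V, i.e. both the dual feasible region and the primal objective.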
8.
Xin Xu, Chunming Liu & Dewen Hu 《Soft Computing - A Fusion of Foundations, Methodologies and Applications》2011,15(6):1055-1070
As an important approach to solving complex sequential decision problems, reinforcement learning (RL) has been widely studied in the community of artificial intelligence and machine learning. However, the generalization ability of RL is still an open problem and it is difficult for existing RL algorithms to solve Markov decision problems (MDPs) with both continuous state and action spaces. In this paper, a novel RL approach with fast policy search and adaptive basis function selection, called Continuous-action Approximate Policy Iteration (CAPI), is proposed for RL in MDPs with both continuous state and action spaces. In CAPI, based on the value functions estimated by temporal-difference learning, a fast policy search technique is suggested to search for optimal actions in continuous spaces, which is computationally efficient and easy to implement. To improve the generalization ability and learning efficiency of CAPI, two adaptive basis function selection methods are developed so that sparse approximations of value functions can be obtained efficiently both for linear function approximators and kernel machines. Simulation results on benchmark learning control tasks with continuous state and action spaces show that the proposed approach not only can converge to a near-optimal policy in a few iterations but also can obtain comparable or even better performance than Sarsa-learning and previous approximate policy iteration methods such as LSPI and KLSPI.
9.
Stephen D. Patek 《Computers & Operations Research》2004,31(14):930
We introduce and analyze several new policy-iteration-type algorithms for average-cost Markov decision processes (MDPs). We limit attention to “recurrent state” processes, where there exists a state which is recurrent under all stationary policies, and our analysis applies to finite-state problems with compact constraint sets, continuous transition probability functions, and lower-semicontinuous cost functions. The analysis makes use of an underlying relationship between recurrent-state MDPs and the so-called stochastic shortest path problems of Bertsekas and Tsitsiklis (Math. Oper. Res. 16(3) (1991) 580). After extending this relationship, we establish the convergence of the new policy-iteration-type algorithms either to optimality or to within ε > 0 of the optimal average cost.
10.
Two Abstraction Modes of Markov Decision Processes  Cited by: 2 (self-citations: 1, other citations: 1)
The introduction of Markov decision processes at an abstraction level allows complex Markov decision processes to be expressed concisely and declaratively, addressing the large state-space representation problem that conventional Markov decision processes (MDPs) encounter in practice. This paper introduces the basic concepts of two types of abstract MDPs, structured and aggregate, along with exact or approximate optimal-policy algorithms for various typical abstract MDPs, including one algorithm fundamentally different from those for conventional MDPs: the extension of the Bellman equation to abstract state spaces. It also reviews the research history of these models and outlines prospects for their development, so as to give readers a thorough, comprehensive, and focused understanding of them.
11.
Daniel A. M. Moreira, Karina Valdivia Delgado & Leliane Nunes de Barros 《Applied Intelligence》2016,45(3):662-672
In probabilistic planning problems, which are usually modeled as Markov Decision Processes (MDPs), it is often difficult, or impossible, to obtain an accurate estimate of the state transition probabilities. This limitation can be overcome by modeling these problems as Markov Decision Processes with imprecise probabilities (MDP-IPs). Robust LAO* and Robust LRTDP are efficient algorithms for solving a special class of MDP-IPs where the probabilities lie in a given interval, known as Bounded-Parameter Stochastic-Shortest-Path MDPs (BSSP-MDPs). However, they do not make clear what assumptions must be made to find a robust solution (the best policy under the worst model). In this paper, we propose a new efficient algorithm for BSSP-MDPs, called Robust ILAO*, which has better performance than Robust LAO* and Robust LRTDP, considered the state of the art in robust probabilistic planning. We also define the assumptions required to ensure a robust solution and prove that the Robust ILAO* algorithm converges to optimal values if the initial value of all states is admissible.
12.
Xi-Ren Cao 《Discrete Event Dynamic Systems》2003,13(1-2):9-39
The goals of perturbation analysis (PA), Markov decision processes (MDPs), and reinforcement learning (RL) are common: to make decisions that improve system performance based on information obtained by analyzing the current system behavior. In this paper, we study the relations among these closely related fields. We show that MDP solutions can be derived naturally from the performance sensitivity analysis provided by PA. Performance potential plays an important role in both PA and MDPs; it also offers a clear, intuitive interpretation of many results. Reinforcement learning, TD(λ), neuro-dynamic programming, etc., are efficient ways of estimating the performance potentials and related quantities based on sample paths. The sensitivity point of view of PA, MDPs, and RL brings new insight to the area of learning and optimization. In particular, gradient-based optimization can be applied to parameterized systems with large state spaces, and gradient-based policy iteration can be applied to some nonstandard MDPs such as systems with correlated actions. Potential-based on-line approaches and their advantages are also discussed.
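As a concrete illustration of performance potentials, the sketch below estimates the average reward η and the potential vector g of a small ergodic chain by truncating the series g = Σ_k P^k (r − η·e). The 2-state chain and its numbers are hypothetical; a policy-improvement step would then compare actions via r(s,a) + Σ_{s'} P(s'|s,a) g(s').

```python
# Performance potentials of an ergodic 2-state chain (hypothetical example).
S = 2
P = [[0.7, 0.3], [0.4, 0.6]]      # transition matrix under the current policy
r = [1.0, 3.0]                    # one-step rewards under the current policy

# stationary distribution by power iteration (row vector times P)
pi = [0.5, 0.5]
for _ in range(200):
    pi = [sum(pi[s] * P[s][t] for s in range(S)) for t in range(S)]

eta = sum(pi[s] * r[s] for s in range(S))   # long-run average reward

# potentials g = sum_k P^k (r - eta*e), truncated geometric series
g = [0.0] * S
d = [r[s] - eta for s in range(S)]          # d_k = P^k (r - eta*e)
for _ in range(500):
    g = [g[s] + d[s] for s in range(S)]
    d = [sum(P[s][t] * d[t] for t in range(S)) for s in range(S)]

print(eta, g)
```

The resulting g satisfies the Poisson equation g + η·e = r + P g (up to an additive constant), which is the quantity PA, MDP policy iteration, and TD learning all estimate in their own ways.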
13.
Yat-wah Wan et al. 《Automatica》2006,42(3):393-403
The solution of Markov Decision Processes (MDPs) often relies on special properties of the processes. For two-level MDPs, the difference in the rates of state changes of the upper and lower levels has led to limiting or approximate solutions of such problems. In this paper, we solve a two-level MDP without making any assumption on the rates of state changes of the two levels. We first show that such a two-level MDP is a non-standard one where the optimal actions of different states can be related to each other. Then we give assumptions (conditions) under which such a specially constrained MDP can be solved by policy iteration. We further show that the computational effort can be reduced by decomposing the MDP. A two-level MDP with M upper-level states can be decomposed into one MDP for the upper level and M to M(M-1) MDPs for the lower level, depending on the structure of the two-level MDP. The upper-level MDP is solved by time aggregation, a technique introduced in a recent paper [Cao, X.-R., Ren, Z. Y., Bhatnagar, S., Fu, M., & Marcus, S. (2002). A time aggregation approach to Markov decision processes. Automatica, 38(6), 929-943.], and the lower-level MDPs are solved by embedded Markov chains.
14.
Xi-Ren Cao & Junyu Zhang 《Automatic Control, IEEE Transactions on》2008,53(2):496-508
In this paper, we propose a new approach to the theory of finite multichain Markov decision processes (MDPs) with different performance optimization criteria. We first propose the concept of nth-order bias; then, using the average reward and bias difference formulas derived in this paper, we develop an optimization theory for finite MDPs that covers a complete spectrum from average optimality, bias optimality, to all high-order bias optimality, in a unified way. The approach is simple, direct, natural, and intuitive; it depends neither on Laurent series expansion nor on discounted MDPs. We also propose one-phase policy iteration algorithms for bias and high-order bias optimal policies, which are more efficient than the two-phase algorithms in the literature. Furthermore, we derive high-order bias optimality equations. This research is a part of our effort in developing sensitivity-based learning and optimization theory.
15.
16.
Reinforcement learning (RL) is concerned with the identification of optimal controls in Markov decision processes (MDPs) where no explicit model of the transition probabilities is available. We propose a class of RL algorithms which always produces stable estimates of the value function. In detail, we use "local averaging" methods to construct an approximate dynamic programming (ADP) algorithm. Nearest-neighbor regression, grid-based approximations, and trees can all be used as the basis of this approximation. We provide a thorough theoretical analysis of this approach and we demonstrate that ADP converges to a unique approximation in continuous-state average-cost MDPs. In addition, we prove that our method is consistent in the sense that an optimal approximate strategy is identified asymptotically. With regard to a practical implementation, we suggest a reduction of ADP to standard dynamic programming in an artificial finite-state MDP.
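The grid-based variant of this local-averaging idea can be sketched as follows. The 1-D toy problem, its dynamics, and its reward are hypothetical stand-ins; the paper's construction (and its average-cost analysis) is far more general.

```python
# Grid-based "local averaging" ADP on a hypothetical continuous-state MDP.
GAMMA = 0.95
GRID = [i / 20 for i in range(21)]   # 21-point grid on [0, 1]
ACTIONS = [-0.1, 0.1]

def step(x, a):
    """Deterministic toy dynamics; reward is highest near x = 0.5."""
    x2 = min(1.0, max(0.0, x + a))
    return x2, -abs(x2 - 0.5)

def nearest(x):
    """Nearest-neighbor 'local averaging' onto the finite grid."""
    return min(range(len(GRID)), key=lambda i: abs(GRID[i] - x))

# approximate value iteration on the artificial finite-state MDP
V = [0.0] * len(GRID)
for _ in range(300):
    V = [max(step(x, a)[1] + GAMMA * V[nearest(step(x, a)[0])]
             for a in ACTIONS)
         for x in GRID]

def greedy(x):
    return max(ACTIONS,
               key=lambda a: step(x, a)[1] + GAMMA * V[nearest(step(x, a)[0])])

print(greedy(0.2), greedy(0.8))   # should steer toward the center
```

The `nearest` projection is what reduces the continuous-state problem to standard dynamic programming on a finite MDP, exactly the practical reduction the abstract suggests; trees or nearest-neighbor regression could replace the grid.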
17.
Reinforcement Learning (RL) is a simulation-based counterpart of stochastic dynamic programming. In recent years, it has been used in solving complex Markov decision problems (MDPs). Watkins’ Q-Learning is by far the most popular RL algorithm used for solving discounted-reward MDPs. The boundedness of the iterates in Q-Learning plays a critical role in its convergence analysis and in making the algorithm stable, which makes it extremely attractive in numerical solutions. Previous results show boundedness asymptotically in an almost sure sense. We present a new result that shows boundedness in an absolute sense under some weaker conditions for the step size. Also, our proof is based on some simple induction arguments.
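To make the boundedness property concrete: in the tabular Q-learning sketch below (a hypothetical 2-state MDP, not from the paper), rewards lie in [0, R_MAX] and Q is initialized at zero, so every iterate stays in [0, R_MAX/(1 − γ)] by a simple induction on the update.

```python
import random

# Watkins' Q-learning on a hypothetical 2-state, 2-action discounted MDP.
random.seed(0)
GAMMA = 0.9
P = [[[0.9, 0.1], [0.1, 0.9]],     # P[s][a][s'] transition probabilities
     [[0.8, 0.2], [0.3, 0.7]]]
R = [[0.0, 1.0], [2.0, 0.0]]       # rewards in [0, R_MAX]
R_MAX = 2.0

Q = [[0.0, 0.0], [0.0, 0.0]]
visits = [[0, 0], [0, 0]]
s = 0
for _ in range(50000):
    # epsilon-greedy behaviour policy
    if random.random() < 0.1:
        a = random.randrange(2)
    else:
        a = max(range(2), key=lambda a: Q[s][a])
    s2 = 0 if random.random() < P[s][a][0] else 1
    visits[s][a] += 1
    alpha = 1.0 / visits[s][a]     # 1/n step sizes
    target = R[s][a] + GAMMA * max(Q[s2])
    Q[s][a] += alpha * (target - Q[s][a])
    s = s2

print(max(max(row) for row in Q))  # never exceeds R_MAX / (1 - GAMMA)
```

Each update is a convex combination of the old estimate and a target bounded by R_MAX + γ·max Q, which is exactly the induction step behind the absolute bound.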
18.
Reinforcement learning is a branch of machine learning that improves a policy through interaction with the environment; its online and adaptive nature makes it a powerful tool for policy optimization. Multi-agent systems are a research focus in artificial intelligence, and the study of multi-agent learning techniques must be built on a model of the system environment. Because of the mutual influence among multiple agents, a multi-agent system is highly complex, and its environment is a non-deterministic Markov model, so it is inappropriate to apply reinforcement-learning techniques based on the deterministic Markov model directly to multi-agent systems. Based on independent learning mechanisms among agents, this paper proposes an improved multi-agent Q-learning algorithm suited to non-deterministic Markov environments and studies the application of this learning technique in the multi-agent system RoboCup. Experiments demonstrate the effectiveness and generalization ability of the technique. Finally, directions and further work in multi-agent reinforcement-learning research are briefly given.
19.
Robustness of policies in constrained Markov decision processes  Cited by: 1 (self-citations: 0, other citations: 1)
We consider the optimization of finite-state, finite-action Markov decision processes (MDPs), under constraints. Cost and constraints are discounted. We introduce a new method for investigating the continuity, and a certain type of robustness, of the optimal cost and the optimal policy under changes in the constraints. This method is also applicable for other cost criteria such as finite horizon and infinite horizon average cost.
20.
Markov Decision Processes (MDPs) are a formulation for optimization problems in sequential decision making. Solving MDPs often requires implementing a simulator for optimization algorithms to invoke when updating decision-making rules known as policies. The combination of simulator and optimizer is subject to failures of specification, implementation, integration, and optimization that may produce invalid policies. We present these failures as queries for a visual analytic system (MDPVIS). MDPVIS addresses three visualization research gaps. First, the data acquisition gap is addressed through a general simulator-visualization interface. Second, the data analysis gap is addressed through a generalized MDP information visualization. Finally, the cognition gap is addressed by exposing model components to the user. MDPVIS generalizes a visualization for wildfire management. We use that problem to illustrate MDPVIS and show the visualization's generality by connecting it to two reinforcement learning frameworks that implement many different MDPs of interest in the research community.