1.
This communique provides an exact iterative search algorithm for the NP-hard problem of finding an optimal feasible stationary Markovian pure policy that maximizes the value averaged over an initial state distribution in finite constrained Markov decision processes. It is based on a novel characterization of the entire feasible-policy space and follows the spirit of policy iteration (PI): it generates a sequence of monotonically improving feasible policies that converges to an optimal policy within, in the worst case, a number of iterations equal to the size of the policy space. Unlike PI, an unconstrained MDP must be solved at each iteration that involves feasible policies, and the current best policy improves upon all feasible policies contained in the union of the policy spaces associated with those unconstrained MDPs.
2.
Xin Xu, Chunming Liu, Dewen Hu 《Soft Computing - A Fusion of Foundations, Methodologies and Applications》2011,15(6):1055-1070
As an important approach to solving complex sequential decision problems, reinforcement learning (RL) has been widely studied in the artificial intelligence and machine learning communities. However, the generalization ability of RL remains an open problem, and it is difficult for existing RL algorithms to solve Markov decision problems (MDPs) with both continuous state and action spaces. In this paper, a novel RL approach with fast policy search and adaptive basis function selection, called Continuous-action Approximate Policy Iteration (CAPI), is proposed for RL in MDPs with both continuous state and action spaces. In CAPI, based on value functions estimated by temporal-difference learning, a fast policy search technique is suggested to search for optimal actions in continuous spaces; it is computationally efficient and easy to implement. To improve the generalization ability and learning efficiency of CAPI, two adaptive basis function selection methods are developed so that sparse approximations of value functions can be obtained efficiently for both linear function approximators and kernel machines. Simulation results on benchmark learning control tasks with continuous state and action spaces show that the proposed approach not only converges to a near-optimal policy in a few iterations but also obtains performance comparable to or better than Sarsa-learning and previous approximate policy iteration methods such as LSPI and KLSPI.
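The abstract does not spell out the fast policy-search step; purely as an illustration, a coarse-to-fine search for a greedy continuous action under an estimated Q-function might look like the following minimal sketch (the function name, the grid-refinement scheme, and all parameters are assumptions, not the authors' CAPI implementation):

```python
import numpy as np

def greedy_action(q_func, state, a_low, a_high, levels=4, samples=11):
    """Illustrative sketch: coarse-to-fine search for argmax_a Q(state, a).

    q_func: callable (state, action) -> estimated Q-value.
    Each level samples the current interval [lo, hi] uniformly, then
    zooms in around the best sample, shrinking the interval geometrically.
    """
    lo, hi = a_low, a_high
    best_a = 0.5 * (lo + hi)
    for _ in range(levels):
        candidates = np.linspace(lo, hi, samples)
        values = [q_func(state, a) for a in candidates]
        best_a = candidates[int(np.argmax(values))]
        radius = (hi - lo) / (samples - 1)  # zoom to one grid cell around the best
        lo, hi = max(a_low, best_a - radius), min(a_high, best_a + radius)
    return best_a
```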
3.
A value iteration algorithm for time-aggregated Markov decision processes (MDPs) is developed to solve problems with large state spaces. The algorithm is based on a novel approach that solves a time-aggregated MDP by incrementally solving a set of standard MDPs. The algorithm therefore converges under the same assumption as standard value iteration, an assumption much weaker than that required by the existing time-aggregated value iteration algorithm. The algorithms developed in this paper are also applicable to MDPs with fractional costs.
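The incremental scheme is not detailed in the abstract; for background, a minimal sketch of the standard tabular value-iteration step that such an algorithm would invoke on each embedded MDP (array layout and tolerance are assumptions):

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    """Illustrative sketch: standard value iteration on a tabular MDP.

    P: transition tensor, shape (A, S, S); P[a, s, s2] = Pr(s2 | s, a).
    R: reward matrix, shape (A, S).
    Returns the value vector and a greedy policy.
    """
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * P @ V          # shape (A, S): one backup per action
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)
        V = V_new
```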
4.
Two Abstraction Modes for Markov Decision Processes
The introduction of Markov decision processes at the abstraction level allows complex Markov decision processes to be expressed concisely and declaratively, addressing the large state-space representation problem that conventional Markov decision processes (MDPs) encounter in practice. This paper introduces the basic concepts of two different types of abstract Markov decision processes, structured and aggregated, together with exact or approximate algorithms for optimal policies in various typical abstract MDPs, including one algorithm fundamentally different from those for conventional MDPs: extending the Bellman equation to abstract state spaces. The paper also reviews the research history of these models and offers some prospects for their development, so as to give readers a thorough, comprehensive, and focused understanding of them.
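For reference, the standard Bellman optimality equation that the surveyed work lifts to abstract state spaces reads (standard textbook form, not the generalized version from the survey):

```latex
V^*(s) \;=\; \max_{a \in A(s)} \Bigl[\, r(s,a) + \gamma \sum_{s' \in S} p(s' \mid s,a)\, V^*(s') \,\Bigr]
```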
5.
Weighted Markov decision processes (MDPs) have long been used to model quantitative aspects of systems in the presence of uncertainty. Much of the literature on such MDPs, however, takes a monolithic approach: a system is modelled as a particular MDP, and properties of the system are then inferred by analysis of that particular MDP. In contrast, in this paper we develop compositional methods for reasoning about weighted MDPs, as a possible basis for compositional reasoning about their quantitative behaviour. In particular, we approach these systems from a process-algebraic point of view and define a coinductive, simulation-based behavioural preorder that is compositional in the sense that it is preserved by the structural operators for constructing weighted MDPs from components.
6.
《Computers & Operations Research》2005,32(1):127-142
The value iteration algorithm is a well-known technique for generating solutions to discounted Markov decision process (MDP) models. Although simple to implement, the approach is nevertheless limited in situations where many Markov decision processes must be solved, such as real-time state-based control problems or simulation/optimization problems, because of the potentially large number of iterations required for the value function to converge to an ε-optimal solution. Experimental results suggest, however, that the sequence of solution policies associated with each iteration of the algorithm converges much more rapidly than does the value function. This behavior has significant implications for designing solution approaches for MDPs, yet it has neither been explicitly characterized in the literature nor generated significant discussion. This paper seeks to generate such discussion by providing comparative empirical convergence results and by exploring several predictors that allow the speed of policy convergence to be estimated from existing MDP parameters.
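The reported behaviour is straightforward to probe empirically; a minimal sketch (an assumed tabular setup, not the paper's experimental code) that records the iteration at which the greedy policy last changes versus the iteration at which the value function meets an ε stopping rule:

```python
import numpy as np

def convergence_gap(P, R, gamma=0.95, eps=1e-6, max_iter=100000):
    """Illustrative sketch: policy vs. value convergence in value iteration.

    P: (A, S, S) transition tensor; R: (A, S) rewards.
    Returns (iteration at which the greedy policy last changed,
             iteration at which the value sweep met the eps stopping rule).
    """
    V = np.zeros(P.shape[1])
    policy = np.zeros(P.shape[1], dtype=int)
    last_policy_change = 0
    for k in range(1, max_iter + 1):
        Q = R + gamma * P @ V
        new_policy = Q.argmax(axis=0)
        if not np.array_equal(new_policy, policy):
            last_policy_change = k
        policy = new_policy
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < eps:
            return last_policy_change, k
        V = V_new
    return last_policy_change, max_iter
```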
7.
Dmitri A. Dolgov Edmund H. Durfee 《Annals of Mathematics and Artificial Intelligence》2006,47(3-4):273-293
A weakness of classical Markov decision processes (MDPs) is that they scale very poorly due to the flat state-space representation. Factored MDPs address this representational problem by exploiting problem structure to specify the transition and reward functions of an MDP in a compact manner. However, in general, solutions to factored MDPs do not retain the structure and compactness of the problem representation, forcing approximate solutions, with approximate linear programming (ALP) emerging as a promising MDP-approximation technique. To date, most ALP work has focused on the primal-LP formulation, while the dual LP, which forms the basis for solving constrained Markov problems, has received much less attention. We show that a straightforward linear approximation of the dual optimization variables is problematic, because some of the required computations cannot be carried out efficiently. Nonetheless, we develop a composite approach that symmetrically approximates the primal and dual optimization variables (effectively approximating both the objective function and the feasible region of the LP), leading to a formulation that is computationally feasible and suitable for solving constrained MDPs. We empirically show that this new ALP formulation also performs well on unconstrained problems.
8.
Reinforcement learning is a branch of machine learning that improves a policy through interaction with the environment; its online and adaptive nature makes it a powerful tool for policy optimization problems. Multi-agent systems are a research focus in artificial intelligence, and research on multi-agent learning techniques must be grounded in a model of the system environment. Because multiple agents are present and influence one another, a multi-agent system is highly complex, and its environment follows a non-deterministic Markov model; it is therefore inappropriate to apply reinforcement learning techniques based on the standard Markov model to multi-agent systems directly. Based on an independent learning mechanism for each agent, this paper proposes an improved multi-agent Q-learning algorithm suited to non-deterministic Markov environments and studies its application in the multi-agent system RoboCup. Experiments demonstrate the effectiveness and generalization ability of the technique. Finally, directions and further work in multi-agent reinforcement learning research are briefly outlined.
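The abstract does not reproduce the improved update rule; as a baseline, the independent tabular Q-learning that each agent would run looks like the following minimal sketch (standard textbook form; the paper's modification for non-deterministic Markov environments is not shown):

```python
import numpy as np

class IndependentQLearner:
    """Illustrative sketch: one agent's tabular Q-learning, run
    independently of the other agents."""

    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.9,
                 epsilon=0.1, seed=None):
        self.Q = np.zeros((n_states, n_actions))
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.rng = np.random.default_rng(seed)

    def act(self, state):
        # epsilon-greedy action selection
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(self.Q.shape[1]))
        return int(self.Q[state].argmax())

    def update(self, s, a, r, s_next):
        # standard one-step Q-learning backup
        td_target = r + self.gamma * self.Q[s_next].max()
        self.Q[s, a] += self.alpha * (td_target - self.Q[s, a])
```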
9.
Xi-Ren Cao, Junyu Zhang 《IEEE Transactions on Automatic Control》2008,53(2):496-508
In this paper, we propose a new approach to the theory of finite multichain Markov decision processes (MDPs) with different performance optimization criteria. We first propose the concept of nth-order bias; then, using the average-reward and bias difference formulas derived in this paper, we develop an optimization theory for finite MDPs that covers, in a unified way, a complete spectrum from average optimality and bias optimality to all high-order bias optimality. The approach is simple, direct, natural, and intuitive; it depends neither on Laurent series expansion nor on discounted MDPs. We also propose one-phase policy iteration algorithms for bias and high-order bias optimal policies, which are more efficient than the two-phase algorithms in the literature. Furthermore, we derive high-order bias optimality equations. This research is part of our effort to develop sensitivity-based learning and optimization theory.
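For orientation, the first-order bias of a policy π with average reward η^π is, in the ergodic (unichain, aperiodic) case, the standard quantity below; the paper's nth-order biases generalize it recursively:

```latex
b^{\pi}(s) \;=\; \lim_{N \to \infty} \mathbb{E}\!\left[\, \sum_{t=0}^{N-1} \bigl( r\bigl(X_t, \pi(X_t)\bigr) - \eta^{\pi} \bigr) \,\Bigm|\, X_0 = s \right]
```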
10.
11.
Markov Decision Processes (MDPs) are a formulation for optimization problems in sequential decision making. Solving MDPs often requires implementing a simulator for optimization algorithms to invoke when updating decision-making rules known as policies. The combination of simulator and optimizer is subject to failures of specification, implementation, integration, and optimization that may produce invalid policies. We present these failures as queries for a visual analytic system (MDPVIS). MDPVIS addresses three visualization research gaps. First, the data acquisition gap is addressed through a general simulator-visualization interface. Second, the data analysis gap is addressed through a generalized MDP information visualization. Finally, the cognition gap is addressed by exposing model components to the user. MDPVIS generalizes a visualization for wildfire management. We use that problem to illustrate MDPVIS and show the visualization's generality by connecting it to two reinforcement learning frameworks that implement many different MDPs of interest in the research community.
12.
We consider Markov decision processes (MDPs) in which the state transition probability distributions are not uniquely known, but are known to belong to given intervals (so-called "controlled Markov set-chains"), with infinite-horizon discounted reward criteria. We present formal methods to improve multiple policies for solving such controlled Markov set-chains. Our multipolicy improvement methods follow the spirit of parallel rollout and policy switching for solving MDPs. In particular, these methods are useful for online control of Markov set-chains and for designing policy iteration (PI)-type algorithms. We develop a PI-type algorithm and prove that it converges to an optimal policy.
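Policy switching itself has a compact generic form; the following minimal sketch shows it for standard MDPs with already-evaluated policies (the interval machinery of Markov set-chains is not modelled here, and all names are assumptions):

```python
import numpy as np

def policy_switching(value_vectors, policies):
    """Illustrative sketch: combine several evaluated policies into one
    that dominates them all.

    value_vectors: list of arrays V_i with V_i[s] = value of policy i at s.
    policies: list of arrays pi_i with pi_i[s] = action of policy i at s.
    The switched policy follows, in each state, whichever input policy has
    the highest value there; its value dominates every input policy.
    """
    V = np.stack(value_vectors)            # shape (n_policies, n_states)
    best = V.argmax(axis=0)                # best input policy per state
    return np.array([policies[best[s]][s] for s in range(V.shape[1])])
```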
13.
Yat-wah Wan 《Automatica》2006,42(3):393-403
The solution of Markov Decision Processes (MDPs) often relies on special properties of the processes. For two-level MDPs, the difference in the rates of state changes of the upper and lower levels has led to limiting or approximate solutions of such problems. In this paper, we solve a two-level MDP without making any assumption on the rates of state changes of the two levels. We first show that such a two-level MDP is a non-standard one where the optimal actions of different states can be related to each other. Then we give assumptions (conditions) under which such a specially constrained MDP can be solved by policy iteration. We further show that the computational effort can be reduced by decomposing the MDP. A two-level MDP with M upper-level states can be decomposed into one MDP for the upper level and M to M(M-1) MDPs for the lower level, depending on the structure of the two-level MDP. The upper-level MDP is solved by time aggregation, a technique introduced in a recent paper [Cao, X.-R., Ren, Z. Y., Bhatnagar, S., Fu, M., & Marcus, S. (2002). A time aggregation approach to Markov decision processes. Automatica, 38(6), 929-943.], and the lower-level MDPs are solved by embedded Markov chains.
14.
This article addresses reinforcement learning problems based on factored Markov decision processes (MDPs) in which the agent must choose among a set of candidate abstractions, each built up from a different combination of state components. We present and evaluate a new approach that can perform effective abstraction selection and that is more resource-efficient and/or more general than existing approaches. The core of the approach is to make selection of an abstraction part of the learning agent's decision-making process by augmenting the agent's action space with internal actions that select the abstraction it uses. We prove that under certain conditions this approach results in a derived MDP whose solution yields both the optimal abstraction for the original MDP and the optimal policy under that abstraction. We examine our approach in three domains of increasing complexity: contextual bandit problems, episodic MDPs, and general MDPs with context-specific structure.
15.
16.
This paper studies a design method for reinforcement learning controllers based on a triangulation-based cerebellar model (CMAC) and applies it to learning control of single-robot ball interception in robot soccer. By introducing the simplex-based Kuhn triangulation into the state space of the Markov decision process, the method realizes triangulation-based linear value-function approximation, which effectively improves the generalization performance of the reinforcement learning controller on Markov decision problems with continuous state spaces. Simulation studies of robot interception learning control show that a reinforcement learning controller using triangulation-based CMAC value-function approximation achieves better learning efficiency and generalization performance than existing CMAC methods based on uniform coding.
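The representation rests on barycentric interpolation over the Kuhn triangulation of each grid cell; a minimal sketch of that interpolation step (the CMAC layering and the learning rule from the paper are omitted; all names are assumptions):

```python
import numpy as np

def kuhn_interpolate(values, x):
    """Illustrative sketch: piecewise-linear interpolation of a value table
    over the Kuhn triangulation of the integer grid.

    values: d-dimensional array of values at integer grid vertices.
    x: query point of shape (d,), strictly inside the grid.
    The simplex containing x is found by sorting the fractional
    coordinates; the result is a convex combination of its d+1 vertices.
    """
    base = np.floor(x).astype(int)
    f = np.asarray(x) - base                   # fractional coordinates in [0, 1)
    order = np.argsort(-f)                     # descending sort picks the simplex
    vertex = base.copy()
    result = (1.0 - f[order[0]]) * values[tuple(vertex)]
    f_sorted = np.append(f[order], 0.0)
    for k, dim in enumerate(order):
        vertex[dim] += 1                       # walk along the simplex edges
        result += (f_sorted[k] - f_sorted[k + 1]) * values[tuple(vertex)]
    return result
```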
17.
《IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics》2008,38(6):1645-1651
18.
Kan-Jian Zhang, Yan-Kai Xu, Xi Chen, Xi-Ren Cao 《Automatica》2008,44(4):1055-1061
It is well known that stochastic control systems can be viewed as Markov decision processes (MDPs) with continuous state spaces. In this paper, we propose to apply the policy iteration approach of MDPs to the optimal control problem of stochastic systems. We first provide an optimality equation based on performance potentials and develop a policy iteration procedure. We then apply policy iteration to the jump linear quadratic problem and obtain the coupled Riccati equations for its optimal solutions. The approach is applicable to linear as well as nonlinear systems and can be implemented online on real-world systems without identifying the entire system structure and parameters.
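For background, the performance potentials g^π mentioned here solve, in the average-reward setting, the Poisson equation below (standard form), and policy iteration improves a policy by acting greedily with respect to them:

```latex
g^{\pi}(s) = r\bigl(s, \pi(s)\bigr) - \eta^{\pi} + \sum_{s'} p\bigl(s' \mid s, \pi(s)\bigr)\, g^{\pi}(s'),
\qquad
\pi'(s) \in \arg\max_{a} \Bigl[\, r(s,a) + \sum_{s'} p(s' \mid s, a)\, g^{\pi}(s') \,\Bigr]
```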
19.
Xi-Ren Cao 《Discrete Event Dynamic Systems》2003,13(1-2):9-39
The goals of perturbation analysis (PA), Markov decision processes (MDPs), and reinforcement learning (RL) are common: to make decisions that improve system performance based on information obtained by analyzing the current system behavior. In this paper, we study the relations among these closely related fields. We show that MDP solutions can be derived naturally from the performance sensitivity analysis provided by PA. Performance potential plays an important role in both PA and MDPs; it also offers a clear intuitive interpretation for many results. Reinforcement learning, TD(λ), neuro-dynamic programming, and related methods are efficient ways of estimating the performance potentials and related quantities based on sample paths. The sensitivity point of view of PA, MDPs, and RL brings new insight to the area of learning and optimization. In particular, gradient-based optimization can be applied to parameterized systems with large state spaces, and gradient-based policy iteration can be applied to some nonstandard MDPs such as systems with correlated actions. Potential-based online approaches and their advantages are also discussed.
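Sample-path estimation of potentials is easy to sketch; a minimal differential TD(0) estimator of the potentials along one trajectory (step sizes and the running average-reward estimate are illustrative assumptions, not a method taken from the paper):

```python
import numpy as np

def estimate_potentials(trajectory, n_states, alpha=0.05, beta=0.01):
    """Illustrative sketch: differential TD(0) estimation of performance
    potentials from a sample path.

    trajectory: iterable of (state, reward, next_state) transitions.
    Maintains a running average-reward estimate eta and updates the
    potentials g with the adjusted temporal difference r - eta + g[s'] - g[s].
    Potentials are identified only up to an additive constant.
    """
    g = np.zeros(n_states)
    eta = 0.0
    for s, r, s_next in trajectory:
        delta = r - eta + g[s_next] - g[s]
        g[s] += alpha * delta
        eta += beta * (r - eta)      # running estimate of average reward
    return g, eta
```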