期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

An exact iterative search algorithm for constrained Markov decision processes

Hyeong Soo Chang 《Automatica》2014

This communique provides an exact iterative search algorithm for the NP-hard problem of obtaining an optimal feasible stationary Markovian pure policy that achieves the maximum value averaged over an initial state distribution in finite constrained Markov decision processes. It is based on a novel characterization of the entire feasible policy space and takes the spirit of policy iteration (PI) in that a sequence of monotonically improving feasible policies is generated and converges to an optimal policy in iterations of the size of the policy space at the worst case. Unlike PI, an unconstrained MDP needs to be solved at iterations involved with feasible policies and the current best policy improves all feasible policies included in the union of the policy spaces associated with the unconstrained MDPs. 相似文献

2.

Continuous-action reinforcement learning with fast policy search and adaptive basis function selection

Xin Xu Chunming Liu Dewen Hu 《Soft Computing - A Fusion of Foundations, Methodologies and Applications》2011,15(6):1055-1070

As an important approach to solving complex sequential decision problems, reinforcement learning (RL) has been widely studied in the community of artificial intelligence and machine learning. However, the generalization ability of RL is still an open problem and it is difficult for existing RL algorithms to solve Markov decision problems (MDPs) with both continuous state and action spaces. In this paper, a novel RL approach with fast policy search and adaptive basis function selection, which is called Continuous-action Approximate Policy Iteration (CAPI), is proposed for RL in MDPs with both continuous state and action spaces. In CAPI, based on the value functions estimated by temporal-difference learning, a fast policy search technique is suggested to search for optimal actions in continuous spaces, which is computationally efficient and easy to implement. To improve the generalization ability and learning efficiency of CAPI, two adaptive basis function selection methods are developed so that sparse approximation of value functions can be obtained efficiently both for linear function approximators and kernel machines. Simulation results on benchmark learning control tasks with continuous state and action spaces show that the proposed approach not only can converge to a near-optimal policy in a few iterations but also can obtain comparable or even better performance than Sarsa-learning, and previous approximate policy iteration methods such as LSPI and KLSPI. 相似文献

3.

Incremental Value Iteration for Time-Aggregated Markov-Decision Processes

Tao Sun Qianchuan Zhao Luh P.B. 《Automatic Control, IEEE Transactions on》2007,52(11):2177-2182

A value iteration algorithm for time-aggregated Markov-decision processes (MDPs) is developed to solve problems with large state spaces. The algorithm is based on a novel approach which solves a time aggregated MDP by incrementally solving a set of standard MDPs. Therefore, the algorithm converges under the same assumption as standard value iteration. Such assumption is much weaker than that required by the existing time aggregated value iteration algorithm. The algorithms developed in this paper are also applicable to MDPs with fractional costs. 相似文献

4.

马尔可夫决策过程两种抽象模式 总被引：1，自引：1，他引：1

王蓁蓁邢汉承张志政倪庆剑《计算机科学》2008,35(10):6-14

抽象层次上马尔可夫决策过程的引入,使得人们可简洁地、陈述地表达复杂的马尔可夫决策过程,解决常规马尔可夫决策过程(MDPs)在实际中所遇到的大型状态空间的表达问题.介绍了结构型和概括型两种不同类型抽象马尔可夫决策过程基本概念以及在各种典型抽象MDPs中的最优策略的精确或近似算法,其中包括与常规MDPs根本不同的一个算法:把Bellman方程推广到抽象状态空间的方法,并且对它们的研究历史进行总结和对它们的发展做一些展望,使得人们对它们有一个透彻的、全面而又重点的理解. 相似文献

5.

Compositional reasoning for weighted Markov decision processes

Yuxin Deng Matthew Hennessy 《Science of Computer Programming》2013

Weighted Markov decision processes (MDPs) have long been used to model quantitative aspects of systems in the presence of uncertainty. However, much of the literature on such MDPs takes a monolithic approach, by modelling a system as a particular MDP; properties of the system are then inferred by analysis of that particular MDP. In contrast in this paper we develop compositional methods for reasoning about weighted MDPs, as a possible basis for compositional reasoning about their quantitative behaviour. In particular we approach these systems from a process algebraic point of view. For these we define a coinductive simulation-based behavioural preorder which is compositional in the sense that it is preserved by structural operators for constructing weighted MDPs from components. 相似文献

6.

An empirical study of policy convergence in Markov decision process value iteration

《Computers & Operations Research》2005,32(1):127-142

The value iteration algorithm is a well-known technique for generating solutions to discounted Markov decision process (MDP) models. Although simple to implement, the approach is nevertheless limited in situations where many Markov decision processes must be solved, such as in real-time state-based control problems or in simulation/optimization problems, because of the potentially large number of iterations required for the value function to converge to an ε-optimal solution. Experimental results suggest, however, that the sequence of solution policies associated with each iteration of the algorithm converges much more rapidly than does the value function. This behavior has significant implications for designing solution approaches for MDPs, yet it has not been explicitly characterized in the literature nor generated significant discussion. This paper seeks to generate such discussion by providing comparative empirical convergence results and exploring several predictors that allow estimation of policy convergence speed based on existing MDP parameters. 相似文献

7.

Symmetric approximate linear programming for factored MDPs with application to constrained problems

Dmitri A. Dolgov Edmund H. Durfee 《Annals of Mathematics and Artificial Intelligence》2006,47(3-4):273-293

A weakness of classical Markov decision processes (MDPs) is that they scale very poorly due to the flat state-space representation. Factored MDPs address this representational problem by exploiting problem structure to specify the transition and reward functions of an MDP in a compact manner. However, in general, solutions to factored MDPs do not retain the structure and compactness of the problem representation, forcing approximate solutions, with approximate linear programming (ALP) emerging as a promising MDP-approximation technique. To date, most ALP work has focused on the primal-LP formulation, while the dual LP, which forms the basis for solving constrained Markov problems, has received much less attention. We show that a straightforward linear approximation of the dual optimization variables is problematic, because some of the required computations cannot be carried out efficiently. Nonetheless, we develop a composite approach that symmetrically approximates the primal and dual optimization variables (effectively approximating both the objective function and the feasible region of the LP), leading to a formulation that is computationally feasible and suitable for solving constrained MDPs. We empirically show that this new ALP formulation also performs well on unconstrained problems. 相似文献

8.

增强Q学习在非确定马尔可夫系统寻优问题中的应用

郭锐彭军吴敏《计算机工程与应用》2005,41(13):36-38,146

增强学习属于机器学习的一种,它通过与环境的交互获得策略的改进,其在线学习和自适应学习的特点使其成为解决策略寻优问题有力的工具。多智能体系统是人工智能领域的一个研究热点,对于多智能体学习技术的研究需要建立在系统环境模型的基础之上,由于多个智能体的存在,智能体之间的相互影响使得多智能体系统高度复杂,多智能体系统环境属于非确定马尔可夫模型,因此直接把基于马尔可夫模型的增强学习技术引入多智能体系统是不合适的。论文基于智能体间独立的学习机制,提出了一种改进的多智能体Q学习算法,使其适用于非确定马尔可夫环境,并对该学习技术在多智能体系统RoboCup中的应用进行了研究,实验证明了该学习技术的有效性与泛化能力,最后简要给出了多智能体增强学习研究的方向及进一步的工作。相似文献

9.

The $n$th-Order Bias Optimality for Multichain Markov Decision Processes

Xi-Ren Cao Junyu Zhang 《Automatic Control, IEEE Transactions on》2008,53(2):496-508

In this paper, we propose a new approach to the theory of finite multichain Markov decision processes (MDPs) with different performance optimization criteria. We first propose the concept of nth-order bias; then, using the average reward and bias difference formulas derived in this paper, we develop an optimization theory for finite MDPs that covers a complete spectrum from average optimality, bias optimality, to all high-order bias optimality, in a unified way. The approach is simple, direct, natural, and intuitive; it depends neither on Laurent series expansion nor on discounted MDPs. We also propose one-phase policy iteration algorithms for bias and high-order bias optimal policies, which are more efficient than the two-phase algorithms in the literature. Furthermore, we derive high-order bias optimality equations. This research is a part of our effort in developing sensitivity-based learning and optimization theory. 相似文献

10.

Game-Based Probabilistic Predicate Abstraction in PRISM

Mark Kattenbelt Marta Kwiatkowska Gethin Norman David Parker 《Electronic Notes in Theoretical Computer Science》2008,220(3):5

相似文献

11.

Interactive visualization for testing Markov Decision Processes: MDPVIS

《Journal of Visual Languages and Computing》2017

Markov Decision Processes (MDPs) are a formulation for optimization problems in sequential decision making. Solving MDPs often requires implementing a simulator for optimization algorithms to invoke when updating decision making rules known as policies. The combination of simulator and optimizer are subject to failures of specification, implementation, integration, and optimization that may produce invalid policies. We present these failures as queries for a visual analytic system (MDPVIS). MDPVIS addresses three visualization research gaps. First, the data acquisition gap is addressed through a general simulator-visualization interface. Second, the data analysis gap is addressed through a generalized MDP information visualization. Finally, the cognition gap is addressed by exposing model components to the user. MDPVIS generalizes a visualization for wildfire management. We use that problem to illustrate MDPVIS and show the visualization's generality by connecting it to two reinforcement learning frameworks that implement many different MDPs of interest in the research community. 相似文献

12.

Solving Controlled Markov Set-Chains With Discounting via Multipolicy Improvement

Hyeong Soo Chang Chong E.K.P. 《Automatic Control, IEEE Transactions on》2007,52(3):564-569

We consider Markov decision processes (MDPs) where the state transition probability distributions are not uniquely known, but are known to belong to some intervals-so called "controlled Markov set-chains"-with infinite-horizon discounted reward criteria. We present formal methods to improve multiple policies for solving such controlled Markov set-chains. Our multipolicy improvement methods follow the spirit of parallel rollout and policy switching for solving MDPs. In particular, these methods are useful for online control of Markov set-chains and for designing policy iteration (PI) type algorithms. We develop a PI-type algorithm and prove that it converges to an optimal policy 相似文献

13.

The control of a two-level Markov decision process by time aggregation

Yat-wah Wan Author Vitae Author Vitae 《Automatica》2006,42(3):393-403

The solution of Markov Decision Processes (MDPs) often relies on special properties of the processes. For two-level MDPs, the difference in the rates of state changes of the upper and lower levels has led to limiting or approximate solutions of such problems. In this paper, we solve a two-level MDP without making any assumption on the rates of state changes of the two levels. We first show that such a two-level MDP is a non-standard one where the optimal actions of different states can be related to each other. Then we give assumptions (conditions) under which such a specially constrained MDP can be solved by policy iteration. We further show that the computational effort can be reduced by decomposing the MDP. A two-level MDP with M upper-level states can be decomposed into one MDP for the upper level and M to M(M-1) MDPs for the lower level, depending on the structure of the two-level MDP. The upper-level MDP is solved by time aggregation, a technique introduced in a recent paper [Cao, X.-R., Ren, Z. Y., Bhatnagar, S., Fu, M., & Marcus, S. (2002). A time aggregation approach to Markov decision processes. Automatica, 38(6), 929-943.], and the lower-level MDPs are solved by embedded Markov chains. 相似文献

14.

EFFICIENT ABSTRACTION SELECTION IN REINFORCEMENT LEARNING

Harm van Seijen Shimon Whiteson Leon Kester 《Computational Intelligence》2014,30(4):657-699

This article addresses reinforcement learning problems based on factored Markov decision processes (MDPs) in which the agent must choose among a set of candidate abstractions, each build up from a different combination of state components. We present and evaluate a new approach that can perform effective abstraction selection that is more resource‐efficient and/or more general than existing approaches. The core of the approach is to make selection of an abstraction part of the learning agent's decision‐making process by augmenting the agent's action space with internal actions that select the abstraction it uses. We prove that under certain conditions this approach results in a derived MDP whose solution yields both the optimal abstraction for the original MDP and the optimal policy under that abstraction. We examine our approach in three domains of increasing complexity: contextual bandit problems, episodic MDPs, and general MDPs with context‐specific structure. © 2013 Wiley Periodicals, Inc. 相似文献

15.

基于因素化表示的TD（λ）算法

下载免费PDF全文

戴帅殷苌茗张欣《计算机工程》2009,35(13):190-192

提出一种新的基于因素法方法的TD（λ）算法。其基本思想是状态因素化表示,通过动态贝叶斯网络表示Markov决策过程（MDP）中的状态转移概率函数,结合决策树表示TD（λ）算法中的状态值函数,降低状态空间的搜索与计算复杂度,因而适用于求解大状态空间的MDPs问题,实验证明该表示方法是有效的。相似文献

16.

基于三角剖分的小脑模型在增强学习中的应用

孙方义郑志强《计算机应用》2009,29(3):871-873

研究了一种基于三角剖分的小脑模型的增强学习控制器设计方法,并应用于机器人足球中单机器人截球的学习控制中。该方法通过在Markov决策过程状态空间中引入基于单纯形的库恩三角化,实现基于三角剖分的线性值函数逼近,从而有效提高了增强学习控制器对连续状态空间马氏决策问题的泛化性能。针对机器人截球学习控制的仿真研究表明,采用基于三角剖分的小脑模型进行值函数逼近的增强学习控制器能够获得优于已有基于均匀编码的小脑模型方法的学习效率和泛化性能。相似文献

17.

Partially Observable Markov Decision Processes and Performance Sensitivity Analysis

《IEEE transactions on systems, man, and cybernetics. Part B, Cybernetics》2008,38(6):1645-1651

The sensitivity-based optimization of Markov systems has become an increasingly important area. From the perspective of performance sensitivity analysis, policy-iteration algorithms and gradient estimation methods can be directly obtained for Markov decision processes (MDPs). In this correspondence, the sensitivity-based optimization is extended to average reward partially observable MDPs (POMDPs). We derive the performance-difference and performance-derivative formulas of POMDPs. On the basis of the performance-derivative formula, we present a new method to estimate the performance gradients. From the performance-difference formula, we obtain a sufficient optimality condition without the discounted reward formulation. We also propose a policy-iteration algorithm to obtain a nearly optimal finite-state-controller policy. 相似文献

18.

Policy iteration based feedback control

Kan-Jian Zhang Author Vitae Yan-Kai Xu Author Vitae Xi Chen Author Vitae Xi-Ren Cao Author Vitae 《Automatica》2008,44(4):1055-1061

It is well known that stochastic control systems can be viewed as Markov decision processes (MDPs) with continuous state spaces. In this paper, we propose to apply the policy iteration approach in MDPs to the optimal control problem of stochastic systems. We first provide an optimality equation based on performance potentials and develop a policy iteration procedure. Then we apply policy iteration to the jump linear quadratic problem and obtain the coupled Riccati equations for their optimal solutions. The approach is applicable to linear as well as nonlinear systems and can be implemented on-line on real world systems without identifying all the system structure and parameters. 相似文献

19.

From Perturbation Analysis to Markov Decision Processes and Reinforcement Learning 总被引：8，自引：0，他引：8

Xi-Ren Cao 《Discrete Event Dynamic Systems》2003,13(1-2):9-39

The goals of perturbation analysis (PA), Markov decision processes (MDPs), and reinforcement learning (RL) are common: to make decisions to improve the system performance based on the information obtained by analyzing the current system behavior. In this paper, we study the relations among these closely related fields. We show that MDP solutions can be derived naturally from performance sensitivity analysis provided by PA. Performance potential plays an important role in both PA and MDPs; it also offers a clear intuitive interpretation for many results. Reinforcement learning, TD(), neuro-dynamic programming, etc., are efficient ways of estimating the performance potentials and related quantities based on sample paths. The sensitivity point of view of PA, MDP, and RL brings in some new insight to the area of learning and optimization. In particular, gradient-based optimization can be applied to parameterized systems with large state spaces, and gradient-based policy iteration can be applied to some nonstandard MDPs such as systems with correlated actions, etc. Potential-based on-line approaches and their advantages are also discussed. 相似文献

20.

一种新的多智能体Q学习算法 总被引：2，自引：0，他引：2

郭锐吴敏彭军彭姣曹卫华《自动化学报》2007,33(4):367-372

针对非确定马尔可夫环境下的多智能体系统,提出了一种新的多智能体Q学习算法.算法中通过对联合动作的统计来学习其它智能体的行为策略,并利用智能体策略向量的全概率分布保证了对联合最优动作的选择. 同时对算法的收敛性和学习性能进行了分析.该算法在多智能体系统RoboCup中的应用进一步表明了算法的有效性与泛化能力. 相似文献