1.
Y. J. Liu G. Yin Q. Zhang J. B. Moore 《Mathematics of Control, Signals, and Systems (MCSS)》2007,19(3):207-234
In this work, we establish a framework for balanced realization of linear systems subject to regime switching modulated by a continuous-time Markov chain with a finite state space. First, a definition of balanced realization is given. Then a ρ-balanced realization is developed to approximate the system of balancing equations, which is a system of time-varying algebraic equations. When the state space of the Markov chain is large, the computational effort becomes a real concern. To resolve this problem, we introduce a two-time-scale formulation and use decomposition/aggregation and averaging techniques to reduce the computational complexity. Based on the two-time-scale formulation, further approximation procedures are developed. Numerical examples are also presented for demonstration.
2.
In this paper, the optimal maintenance policy for a multi-state system with no observation is considered. Unlike most existing works, only a limited number of imperfect preventive maintenance actions can be performed between two successive replacements. It is assumed that the system's deterioration state cannot be observed during its operation except after each replacement, and that it evolves as a discrete-time Markov chain with a finite state space. After choosing the information state as the state variable, the problem is formulated as a Markov decision process over the infinite time horizon. To increase computational efficiency, several key structural properties are developed by minimising the total expected cost per unit time. The existence of an optimal threshold-type maintenance policy is proved and the monotonicity of the threshold is obtained. Finally, a numerical example is given to illustrate the optimal policy.
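As an illustrative aside (not taken from the paper, which works with information states and the average-cost criterion), the dynamic-programming core of such a finite MDP can be sketched with standard value iteration for a discounted model; the two-state maintenance example in the test is invented:

```python
import numpy as np

def value_iteration(P, c, beta=0.95, tol=1e-8):
    """Solve a finite-state, finite-action discounted MDP by value iteration.

    P[a] is the transition matrix under action a; c[a] the per-state cost
    vector. Returns the optimal value function and a greedy policy.
    """
    n = P[0].shape[0]
    V = np.zeros(n)
    while True:
        # Q[a, s] = immediate cost + discounted expected future cost
        Q = np.array([c[a] + beta * P[a] @ V for a in range(len(P))])
        V_new = Q.min(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmin(axis=0)
        V = V_new
```

For a toy two-state system (good/bad) with a "do nothing" and a "replace" action, the routine recovers a threshold-like policy that replaces only in the bad state.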
3.
Chris M. Strickland Ian W. Turner Robert Denham Kerrie L. Mengersen 《Computational statistics & data analysis》2009,53(12):4116-4125
A Bayesian Markov chain Monte Carlo methodology is developed for the estimation of multivariate linear Gaussian state space models. In particular, an efficient simulation smoothing algorithm is proposed that makes use of the univariate representation of the state space model. Substantial gains in computational efficiency over existing algorithms are achieved using the new simulation smoother for the analysis of high-dimensional multivariate time series. The methodology is used to analyse a multivariate time series dataset of the Normalised Difference Vegetation Index (NDVI), a proxy for the level of live vegetation, for a grazing property located in Queensland, Australia.
4.
In this paper we present a new algorithm for the approximate transient analysis of large stochastic models. The algorithm is based on the self-correcting analysis principle for continuous-time Markov chains (CTMCs). The approach uses different time-dependent aggregations of the CTMC of a stochastic model. Using uniformization, the transient state probabilities of each aggregated CTMC are calculated for one time step. The derived probabilities are used to construct stronger aggregations, which in turn correct the transition rates of the previous aggregations. This proceeds step by step until the final time is reached. Strong aggregation of the original CTMC yields time- and space-efficient computation, so the approximate transient analysis method based on self-correcting aggregation can be applied to models with large state spaces. For queuing networks with phase-type distributions of the service times, the newly developed algorithm is implemented in WinPEPSY-QNS, a tool for performance evaluation and prediction of stochastic models based on queuing networks. The tool consists of a graphical editor for constructing queuing networks and an easy-to-use evaluation component that offers suitable analysis methods. The new algorithm is applied to several examples, and the results are compared with the results of simulation runs in cases where exact values cannot be obtained.
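Uniformization, which the algorithm applies to each aggregated CTMC, converts transient analysis of a generator Q into a Poisson-weighted sum of powers of a discrete-time chain. A minimal generic sketch (not the self-correcting aggregation scheme itself; the truncation level K is an assumption):

```python
import numpy as np

def transient_probs(Q, pi0, t, K=200):
    """Transient distribution of a CTMC with generator Q at time t,
    via uniformization: pi(t) = sum_k Poisson(Lam*t; k) * pi0 @ P^k."""
    Lam = max(-Q.diagonal())          # uniformization rate
    P = np.eye(Q.shape[0]) + Q / Lam  # uniformized DTMC
    w = np.exp(-Lam * t)              # Poisson weight for k = 0
    pi = w * pi0
    v = pi0.copy()
    for k in range(1, K + 1):
        v = v @ P                     # v = pi0 @ P^k
        w *= Lam * t / k              # next Poisson weight, computed iteratively
        pi += w * v
    return pi
```

For large t the result approaches the stationary distribution, which gives a convenient sanity check.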
5.
《Computers & Mathematics with Applications》2006,51(2):189-198
This paper studies a job search problem on a partially observable Markov chain, which can be considered an extension of the job search in a dynamic economy in [1]. The problem is formulated so that the state changes according to a partially observable Markov chain: the current state cannot be observed, but some information regarding the present state is available. All information about the unobservable state is summarized by probability distributions on the state space, and Bayes' theorem is employed as the learning procedure. Total positivity of order two, or simply TP2, is a fundamental property in the study of sequential decision problems, and it also plays an important role in the Bayesian learning procedure for a partially observable Markov process. Using this property, we examine relationships among the prior information, the posterior information, and the optimal policy. We also examine the probabilities of making a transition into each state after some additional transitions under the optimal policy. In a stock-market setting, suppose the states correspond to the business situation of a company and one state designates default; the problem is then when to sell the stock before bankruptcy, and the probability of bankruptcy is also examined.
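The Bayesian learning procedure referred to here, summarizing all information about the hidden state by a distribution and updating it after each observation, has a compact generic form. An illustrative sketch (the transition and observation matrices in the test are invented, not from the paper):

```python
import numpy as np

def belief_update(b, P, O, y):
    """One Bayes step for a partially observable Markov chain.

    b: current belief over hidden states; P: transition matrix;
    O[s, y]: probability of observing signal y in hidden state s.
    """
    pred = b @ P               # predict the next hidden state
    post = pred * O[:, y]      # weight by the observation likelihood
    return post / post.sum()   # normalize (Bayes' theorem)
```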
6.
For the optimal control of linear hybrid switching systems with a diffusion term, a Monte Carlo statistical prediction method is proposed to reduce the computational complexity of the optimization. First, the continuous-time optimal control problem is converted into a discrete-time Markov decision process by numerical solution techniques. Then, within several finite state subspaces, the optimal control policy of each subspace is solved using a reflecting-boundary technique. Finally, based on the structural properties of the optimal control policy, a statistical prediction method is used to predict the optimal control policy over the whole state space. The method effectively reduces the computational complexity of solving optimal control problems for linear hybrid switching systems involving large state spaces and multidimensional variables; simulation results at the end of the paper verify its effectiveness.
7.
Yat-wah Wan 《Automatica》2006,42(3):393-403
The solution of Markov Decision Processes (MDPs) often relies on special properties of the processes. For two-level MDPs, the difference in the rates of state changes of the upper and lower levels has led to limiting or approximate solutions of such problems. In this paper, we solve a two-level MDP without making any assumption on the rates of state changes of the two levels. We first show that such a two-level MDP is a non-standard one in which the optimal actions of different states can be related to each other. We then give assumptions (conditions) under which such a specially constrained MDP can be solved by policy iteration. We further show that the computational effort can be reduced by decomposing the MDP: a two-level MDP with M upper-level states can be decomposed into one MDP for the upper level and between M and M(M-1) MDPs for the lower level, depending on the structure of the two-level MDP. The upper-level MDP is solved by time aggregation, a technique introduced in [Cao, X.-R., Ren, Z. Y., Bhatnagar, S., Fu, M., & Marcus, S. (2002). A time aggregation approach to Markov decision processes. Automatica, 38(6), 929-943], and the lower-level MDPs are solved by embedded Markov chains.
8.
This paper studies maintenance policies for multi-component systems with failure interaction among their components: the failure of one component may accelerate the deterioration processes, or induce instantaneous failures, of the remaining components. We formulate this maintenance problem as a Markov decision process (MDP) with the objective of minimising the total discounted maintenance cost. However, the action set and state space of the MDP grow exponentially with the number of components, which makes traditional approaches computationally intractable. To deal with this curse of dimensionality, a modified iterative aggregation procedure (MIAP) is proposed. We prove mathematically that the iterations of MIAP converge and that the policy obtained is optimal. Numerical case studies show that failure interaction should not be ignored in maintenance policy decision making, and that the proposed MIAP is faster and requires less memory than linear programming.
9.
State-based analysis of Markovian models faces the problem of state space explosion. To handle huge state spaces, compositional modeling and aggregation of components are often used. Exact aggregation, yielding exact transient or stationary results, is possible only in some cases, namely when the Markov process is lumpable; therefore approximate aggregation is often applied to reduce the state space. Several approximate aggregation methods exist, usually based on heuristics. This paper presents a new aggregation approach for Markovian components which computes aggregates that minimize an algebraically defined function describing the difference between the component and the aggregate. If the difference becomes zero, the aggregation is exact, meaning that component and aggregate are indistinguishable in the sense that transient and stationary results in any environment are identical. For the computation of aggregates, an alternating least squares approach is presented which minimizes the norm-wise difference between the original component and the aggregate. Algorithms to compute aggregates are introduced, and the quality of the approximation is evaluated by means of several examples.
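A common baseline that such approximation methods refine is to collapse a transition matrix over a state partition, weighting the states inside each block (for instance by the stationary distribution). A generic sketch of this baseline (not the paper's alternating-least-squares method; the weighting choice is an assumption):

```python
import numpy as np

def aggregate_chain(P, partition, pi):
    """Aggregate a Markov transition matrix P over a state partition,
    weighting the states inside each block by the distribution pi."""
    k = len(partition)
    Pa = np.zeros((k, k))
    for i, Bi in enumerate(partition):
        wi = pi[Bi] / pi[Bi].sum()            # weights within block Bi
        for j, Bj in enumerate(partition):
            # aggregated transition probability from block Bi to block Bj
            Pa[i, j] = wi @ P[np.ix_(Bi, Bj)].sum(axis=1)
    return Pa
```

When the chain happens to be lumpable with respect to the partition, the aggregate is exact regardless of the weights.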
10.
Annalisa Cesaro Dario Pacciarelli 《Computers & Industrial Engineering》2011,61(1):150-160
This study compares approximation techniques for estimating the operational availability of a corrective maintenance system. The assessment is based on the practical maintenance of safety equipment in operation at 38 Italian airports. A single-echelon, one-for-one ordering policy with complete pooling is analyzed, with a deterministic rule for lateral transshipments. Under this policy, the state probabilities of the associated Markov model cannot be expressed in product form. Since exact computation of the state probabilities becomes impractical as the number of states in the Markov chain increases, this study describes three approximation techniques and assesses their performance in terms of computational effort, memory requirement, and error with respect to the exact value. The first two techniques are based on a method by Alfredsson and Verrijdt and on the Equivalent Random Traffic method, respectively; the idea of both is to approximate the state probabilities with a product form so that the Markov chain can be decomposed. The third technique is based on the multi-dimensional scaling-down approach, which studies an equivalent reduced Markov chain rather than decomposing the original one.
11.
12.
This paper studies an optimal Markov control system, i.e., a system whose statistical law is described by a Markov chain depending on a sequence of decisions; we call such a decision sequence a policy. There exists a target state with the following property: once the system reaches this state, the state never changes again. Our aim is to choose a policy that maximizes, for every initial state, the probability of eventually reaching the target state. We first propose a policy iteration method for finding an optimal policy within the set of stationary policies, and then prove that this policy is also optimal over the set of all policies, stationary or not.
13.
Antonio Pietrabissa 《International journal of control》2013,86(10):1814-1827
This article presents a connection admission control (CAC) algorithm for UMTS networks based on the Markov decision process (MDP) approach. To deal with the non-stationary environment due to the time-varying statistical characteristics of the offered traffic, the admission policy has to be recomputed periodically from on-line measurements, and computing the optimal policy is too time-consuming to be performed on-line. This article therefore proposes a reduction of the policy space, coupled with an aggregation of the state space, for the fast computation of a sub-optimal admission policy. Theoretical results and numerical simulations show the effectiveness of the proposed approach.
14.
Feature-Based Aggregation and Deep Reinforcement Learning: A Survey and Some New Implementations
Dimitri P. Bertsekas 《IEEE/CAA Journal of Automatica Sinica》2019,6(1):1-31
In this paper we discuss policy iteration methods for approximate solution of a finite-state discounted Markov decision problem, with a focus on feature-based aggregation methods and their connection with deep reinforcement learning schemes. We introduce features of the states of the original problem, and we formulate a smaller "aggregate" Markov decision problem whose states relate to the features. We discuss properties and possible implementations of this type of aggregation, including a new approach to approximate policy iteration, in which the policy improvement operation combines feature-based aggregation with feature construction using deep neural networks or other calculations. We argue that the cost function of a policy may be approximated much more accurately by the nonlinear function of the features provided by aggregation than by the linear function of the features provided by neural network-based reinforcement learning, thereby potentially leading to more effective policy improvement.
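The construction of the smaller "aggregate" problem can be illustrated with hard aggregation, its simplest special case: a feature map assigns each original state to one aggregate state, and aggregation/disaggregation matrices transform the transition and cost data. An illustrative sketch with uniform disaggregation weights (an assumption; the paper allows general weights and feature constructions):

```python
import numpy as np

def aggregate_mdp(P, c, phi, k):
    """Build an aggregate MDP by hard aggregation.

    phi maps each original state to one of k aggregate states; original
    states in a block receive uniform disaggregation weights here.
    """
    n, A = P[0].shape[0], len(P)
    D = np.zeros((k, n))                 # disaggregation weights
    for s in range(n):
        D[phi[s], s] = 1.0
    D /= D.sum(axis=1, keepdims=True)    # uniform within each block
    Phi = np.zeros((n, k))               # aggregation (membership) matrix
    for s in range(n):
        Phi[s, phi[s]] = 1.0
    Pa = [D @ P[a] @ Phi for a in range(A)]   # aggregate transitions
    ca = [D @ c[a] for a in range(A)]         # aggregate costs
    return Pa, ca
```

Solving the aggregate problem gives values V_agg, lifted back to the original problem as V(s) ≈ V_agg[phi[s]].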
15.
A stochastic games framework for verification and control of discrete time stochastic hybrid systems
Jerry Ding Maryam Kamgarpour Sean Summers Alessandro Abate John Lygeros Claire Tomlin 《Automatica》2013
We describe a framework for analyzing probabilistic reachability and safety problems for discrete time stochastic hybrid systems within a dynamic games setting. In particular, we consider finite horizon zero-sum stochastic games in which a control has the objective of reaching a target set while avoiding an unsafe set in the hybrid state space, and a rational adversary has the opposing objective. We derive an algorithm for computing the maximal probability of achieving the control objective, subject to the worst-case adversary behavior. From this algorithm, sufficient conditions of optimality are also derived for the synthesis of optimal control policies and worst-case disturbance strategies. These results are then specialized to the safety problem, in which the control objective is to remain within a safe set. We illustrate our modeling framework and computational approach using both a tutorial example with jump Markov dynamics and a practical application in the domain of air traffic management.
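For the safety specialization described above, the max-min dynamic programming recursion has a compact form once state and action sets are finite. A sketch for that finite case (illustrative only; the paper treats general hybrid state spaces, and the test model is invented):

```python
import numpy as np

def safety_value(P, safe, N):
    """Backward recursion for the max-min probability of remaining in a
    safe set over N steps. P[u][d] is the transition matrix when the
    controller plays u and the adversary plays d; safe is a 0/1 vector."""
    V = safe.astype(float)                  # indicator of the safe set
    for _ in range(N):
        # Q[u, d, s] = expected next-step safe value under (u, d)
        Q = np.array([[P[u][d] @ (safe * V) for d in range(len(P[0]))]
                      for u in range(len(P))])
        V = Q.min(axis=1).max(axis=0)       # worst-case adversary, best control
    return safe * V
```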
16.
17.
A primary challenge of agent-based policy learning in complex and uncertain environments is the escalating computational complexity with the size of the task space (action choices and world states) and the number of agents. Nonetheless, there is ample evidence in the natural world that high-functioning social mammals learn to solve complex problems with ease, both individually and cooperatively. This ability to solve computationally intractable problems stems from brain circuits for hierarchical representation of state and action spaces and learned policies, as well as from constraints imposed by social cognition. Using biologically derived mechanisms for state representation and mammalian social intelligence, we constrain state-action choices in reinforcement learning in order to improve learning efficiency. Analysis results bound the reduction in computational complexity due to state abstraction, hierarchical representation, and socially constrained action selection in agent-based learning problems that can be described as variants of Markov decision processes. Investigation of two task domains, single-robot herding and multi-robot foraging, shows that the theoretical bounds hold and that acceptable policies emerge which reduce task completion time, computational cost, and/or memory resources compared to learning without hierarchical representations and with no social knowledge.
18.
Zobel C.W. Scherer W.T. 《IEEE transactions on systems, man, and cybernetics. Part A, Systems and humans : a publication of the IEEE Systems, Man, and Cybernetics Society》2001,31(6):609-622
This paper presents a new problem-solving approach, termed simulation-based policy generation (SPG), that is able to generate solutions to problems that may otherwise be computationally intractable. The SPG method uses a simulation of the original problem to create an approximating Markov decision process (MDP) model which is then solved via traditional MDP solution approaches. Since this approximating MDP is a fairly rich and robust sequential optimization model, solution policies can be created which represent an intelligent and structured search of the policy space. An important feature of the SPG approach is its adaptive nature, in that it uses the original simulation model to generate improved aggregation schemes, allowing the approach to be applied in situations where the underlying problem structure is largely unknown. In order to illustrate the performance of the SPG methodology, we apply it to a common but computationally complex problem of inventory control, and we briefly discuss its application to a large-scale telephone network routing problem.
19.
We consider average reward Markov decision processes with discrete time parameter and denumerable state space. We are concerned with the following problem: Find necessary and sufficient conditions so that, for arbitrary bounded reward function, the corresponding average reward optimality equation has a bounded solution. This problem is solved for a class of systems including the case in which, under the action of any stationary policy, the state space is an irreducible positive recurrent class.
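For finite (rather than denumerable) state spaces, bounded solutions of the average reward optimality equation can be approximated numerically by relative value iteration, which subtracts the value at a reference state to keep the iterates bounded. An illustrative sketch (the single-action test chain is invented; iteration count is a fixed assumption rather than a convergence test):

```python
import numpy as np

def relative_value_iteration(P, r, iters=500):
    """Relative value iteration for a finite average-reward MDP.

    P[a] is the transition matrix under action a; r[a] the per-state
    reward. Returns the gain estimate and relative value function.
    """
    n = P[0].shape[0]
    V = np.zeros(n)
    for _ in range(iters):
        # One application of the average-reward optimality operator
        V = np.array([r[a] + P[a] @ V for a in range(len(P))]).max(axis=0)
        g = V[0]          # gain estimate at the reference state 0
        V = V - g         # re-center so the iterates stay bounded
    return g, V
```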
20.
This paper is concerned with filtering of hidden Markov processes (HMP) which possess (or approximately possess) the property of lumpability. This property is a generalization of the lumpability of a Markov chain, which has previously been addressed by others. In essence, lumpability means that there is a partition of the (atomic) states of the Markov chain into aggregated sets which act in a similar manner as far as the state dynamics and observation statistics are concerned. We prove necessary and sufficient conditions on the HMP for exact lumpability to hold. For a particular class of hidden Markov models (HMM), namely finite output alphabet models, conditions for lumpability of all HMP representable by a specified HMM are given. The corresponding optimal filter algorithms for the aggregated states are then derived. The paper also describes an approach to efficient suboptimal filtering for HMP which are approximately lumpable, meaning that the HMM generating the process may be approximated by a lumpable HMM. This approach involves directly finding a lumped HMM which approximates the original HMM well, in a matrix norm sense. An alternative approach for model reduction, based on approximating a given HMM by an exactly lumpable HMM, is also derived; this method is based on the alternating convex projections algorithm. Some simulation examples are presented which illustrate the performance of the suboptimal filtering algorithms.
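Strong lumpability of the underlying Markov chain, the special case this paper generalizes, can be checked directly from the transition matrix: every state in a block must carry the same total transition probability into each block. A minimal numeric check (illustrative; the paper's HMP lumpability conditions also involve the observation statistics):

```python
import numpy as np

def is_lumpable(P, partition, tol=1e-10):
    """Check strong lumpability of transition matrix P with respect to a
    partition of the state indices."""
    for Bj in partition:
        col = P[:, Bj].sum(axis=1)      # probability of jumping into block Bj
        for Bi in partition:
            if np.ptp(col[Bi]) > tol:   # all states in Bi must agree
                return False
    return True
```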