20 similar documents retrieved; search time: 15 ms.
1.
Hyeong Soo Chang, Hong-Gi Lee, M. C. Fu, S. I. Marcus. IEEE Transactions on Automatic Control, 2005, 50(11): 1804-1808
We propose a novel algorithm called evolutionary policy iteration (EPI) for solving infinite-horizon discounted-reward Markov decision processes. EPI inherits the spirit of policy iteration (PI) but eliminates the need to maximize over the entire action space in the policy improvement step, so it should be most effective for problems with very large action spaces. EPI iteratively generates a "population", or set of policies, such that the performance of the "elite policy" of a population monotonically improves with respect to a defined fitness function. EPI converges with probability one to a population whose elite policy is an optimal policy. EPI is naturally parallelizable, and along these lines a distributed variant of PI is also studied.
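For contrast, here is a minimal sketch of the standard policy-improvement step whose full sweep over the action space EPI is designed to avoid; the function name and array shapes are illustrative assumptions, not taken from the paper, and EPI would replace the argmax below with selection among a mutated population of policies.

import numpy as np

def policy_improvement_full_argmax(P, R, V, gamma):
    # Standard policy-improvement step: for every state, maximize the
    # one-step lookahead value over the ENTIRE action space.  This sweep
    # is the costly operation when the action space is very large.
    # P: transition probabilities, shape (S, A, S); R: rewards, shape (S, A).
    Q = R + gamma * np.einsum('sat,t->sa', P, V)   # Q(s, a) lookahead values
    return Q.argmax(axis=1)                        # greedy policy over all A actions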
2.
Stephen D. Patek. Computers & Operations Research, 2004, 31(14): 930
We introduce and analyze several new policy iteration type algorithms for average-cost Markov decision processes (MDPs). We limit attention to "recurrent state" processes, where there exists a state which is recurrent under all stationary policies, and our analysis applies to finite-state problems with compact constraint sets, continuous transition probability functions, and lower-semicontinuous cost functions. The analysis makes use of an underlying relationship between recurrent-state MDPs and the so-called stochastic shortest path problems of Bertsekas and Tsitsiklis (Math. Oper. Res. 16(3) (1991) 580). After extending this relationship, we establish the convergence of the new policy iteration type algorithms either to optimality or to within ε > 0 of the optimal average cost.
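For orientation, the average-cost optimality equation that such algorithms target, in generic textbook notation (the symbols below are not the paper's):

\[
\lambda^* + h(i) \;=\; \min_{a \in A(i)} \Big[\, c(i,a) + \sum_{j \in S} p_{ij}(a)\, h(j) \Big], \qquad i \in S,
\]

where \(\lambda^*\) is the optimal average cost and \(h\) is a relative (differential) cost vector; with a recurrent reference state, subtracting \(\lambda^*\) from each stage cost essentially turns the problem into a stochastic shortest path problem, which is the connection the paper exploits.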
4.
This communique presents an algorithm called "value set iteration" (VSI) for solving infinite-horizon discounted Markov decision processes with finite state and action spaces, as a simple generalization of value iteration (VI) and as a counterpart to Chang's policy set iteration. A sequence of value functions is generated by VSI by manipulating a set of value functions at each iteration, and it converges to the optimal value function. VSI preserves the convergence properties of VI while converging no slower than VI; in particular, if the set used in VSI contains the value functions of independently generated sample policies from a given distribution and of a properly defined "policy switching" policy, a probabilistic exponential convergence rate of VSI can be established. Because the set used in VSI can contain the value functions of any policies generated by other existing algorithms, VSI is also a general framework for combining multiple solution methods.
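As a reference point, a minimal sketch of plain VI, the special case that VSI generalizes; the function names and array shapes are illustrative assumptions, and the set-manipulation step that distinguishes VSI is only indicated in a comment, since its exact form is specified in the paper.

import numpy as np

def bellman_backup(P, R, V, gamma):
    # One VI sweep: V'(s) = max_a [ R(s, a) + gamma * sum_t P(s, a, t) V(t) ].
    return (R + gamma * np.einsum('sat,t->sa', P, V)).max(axis=1)

def value_iteration(P, R, gamma, tol=1e-8):
    # Plain value iteration; VSI would instead carry a SET of value functions
    # and combine them at each iteration before (or instead of) this backup.
    V = np.zeros(P.shape[0])
    while True:
        V_new = bellman_backup(P, R, V, gamma)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new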
5.
Ma. de Guadalupe Garcia-Hernandez, Jose Ruiz-Pinales, Eva Onaindia, J. Gabriel Aviña-Cervantes, Sergio Ledesma-Orozco, Edgar Alvarado-Mendez, Alberto Reyes-Ballesteros. Artificial Intelligence Review, 2012, 37(2): 157-167
The problem of solving large Markov decision processes accurately and quickly is challenging. Since the computational effort incurred is considerable, current research focuses on finding superior acceleration techniques. For instance, the convergence properties of current solution methods depend, to a great extent, on the order of backup operations. On one hand, algorithms such as topological sorting are able to find good orderings, but their overhead is usually high. On the other hand, shortest-path methods, such as Dijkstra's algorithm, which is based on priority queues, have been applied successfully to the solution of deterministic shortest-path Markov decision processes. Here, we propose an improved value iteration algorithm based on Dijkstra's algorithm for solving shortest-path Markov decision processes. The experimental results on a stochastic shortest-path problem show the feasibility of our approach.
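To illustrate the idea of ordering backups with a priority queue, here is a generic prioritized-sweeping-style sketch; it is not the authors' algorithm, and the function name, array shapes, and priority rule are illustrative assumptions.

import heapq
import numpy as np

def prioritized_value_iteration(P, R, gamma, tol=1e-6, max_backups=100000):
    # Generic priority-queue ordering of value-iteration backups: states with
    # the largest recent Bellman residual are backed up first, and residuals
    # are propagated to predecessor states.
    # P: transitions, shape (S, A, S); R: rewards, shape (S, A).
    S = P.shape[0]
    V = np.zeros(S)
    # Predecessors of each state t: states s with P(s, a, t) > 0 for some a.
    preds = [set(np.nonzero(P[:, :, t].sum(axis=1))[0]) for t in range(S)]
    heap = [(0.0, s) for s in range(S)]        # (negated priority, state); min-heap
    heapq.heapify(heap)
    backups = 0
    while heap and backups < max_backups:
        _, s = heapq.heappop(heap)
        q = R[s] + gamma * P[s] @ V            # one-step lookahead values for state s
        new_v = q.max()
        residual = abs(new_v - V[s])
        V[s] = new_v
        backups += 1
        if residual > tol:
            for p in preds[s]:                 # predecessors may now have stale values
                heapq.heappush(heap, (-residual, p))
    return V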
6.
We consider the problem of control of hierarchical Markov decision processes and develop a simulation-based two-timescale actor-critic algorithm in a general framework. We also develop certain approximation algorithms that require less computation and satisfy a performance bound. One of the approximation algorithms is a three-timescale actor-critic algorithm, while the other is a two-timescale algorithm that operates in two separate stages. All our algorithms recursively update randomized policies using the simultaneous perturbation stochastic approximation (SPSA) methodology. We briefly present the convergence analysis of our algorithms. We then present numerical experiments on a problem of production planning in semiconductor fabs, on which we compare the performance of all algorithms together with policy iteration. Algorithms based on certain Hadamard-matrix-based deterministic perturbations are found to show the best results.
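For reference, the standard two-sided SPSA gradient estimate on which such actor updates are typically built (generic textbook notation, not necessarily the paper's exact estimator):

\[
\hat{g}_i(\theta_k) \;=\; \frac{J(\theta_k + c_k \Delta_k) \;-\; J(\theta_k - c_k \Delta_k)}{2\, c_k\, \Delta_{k,i}}, \qquad i = 1, \dots, d,
\]

where \(J\) is the (simulated) performance of the parameter vector \(\theta_k\), \(c_k\) is a decreasing gain, and \(\Delta_k\) is a perturbation vector with \(\pm 1\) entries, drawn randomly or taken from Hadamard-matrix-based deterministic sequences as in the experiments above.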
7.
《Computers & Operations Research》2005,32(1):127-142
The value iteration algorithm is a well-known technique for generating solutions to discounted Markov decision process (MDP) models. Although simple to implement, the approach is nevertheless limited in situations where many Markov decision processes must be solved, such as in real-time state-based control problems or in simulation/optimization problems, because of the potentially large number of iterations required for the value function to converge to an ε-optimal solution. Experimental results suggest, however, that the sequence of solution policies associated with each iteration of the algorithm converges much more rapidly than does the value function. This behavior has significant implications for designing solution approaches for MDPs, yet it has not been explicitly characterized in the literature, nor has it generated significant discussion. This paper seeks to generate such discussion by providing comparative empirical convergence results and exploring several predictors that allow estimation of policy convergence speed based on existing MDP parameters.
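A minimal sketch of how the gap between policy convergence and value convergence can be observed empirically; the function name, array shapes, and stopping rule are illustrative assumptions rather than the paper's experimental setup.

import numpy as np

def vi_with_policy_tracking(P, R, gamma, eps=1e-4, max_iter=100000):
    # Runs value iteration while recording (i) the last iteration at which the
    # greedy policy changed and (ii) the iteration at which the value function
    # met the usual eps-optimality stopping rule, to expose the gap discussed above.
    # P: transitions, shape (S, A, S); R: rewards, shape (S, A).
    S = P.shape[0]
    V = np.zeros(S)
    policy = np.zeros(S, dtype=int)
    last_policy_change = 0
    for k in range(1, max_iter + 1):
        Q = R + gamma * np.einsum('sat,t->sa', P, V)
        new_policy, new_V = Q.argmax(axis=1), Q.max(axis=1)
        if not np.array_equal(new_policy, policy):
            last_policy_change = k
        if np.max(np.abs(new_V - V)) < eps * (1 - gamma) / (2 * gamma):
            return new_policy, last_policy_change, k   # policy, policy-convergence iter, value-convergence iter
        V, policy = new_V, new_policy
    return policy, last_policy_change, max_iter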
8.
9.
Opacity is a generic security property that has been defined on (non-probabilistic) transition systems and later on Markov chains with labels. For a secret predicate, given as a subset of runs, and a function describing the view of an external observer, the value of interest for opacity is a measure of the set of runs disclosing the secret. We extend this definition to the richer framework of Markov decision processes, where non-deterministic choice is combined with probabilistic transitions, and we study related decidability problems under partial or complete observation hypotheses for the schedulers. We prove that all questions are decidable with complete observation and ω-regular secrets. With partial observation, we prove that all quantitative questions are undecidable, but the question whether a system is almost surely non-opaque becomes decidable for a restricted class of ω-regular secrets, as well as for all ω-regular secrets under finite-memory schedulers.
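One common way to formalize the quantity described above (generic notation, stated only for orientation and not necessarily the paper's exact definitions): with secret \(\varphi \subseteq \mathit{Runs}\) and observation function \(\mathcal{O}\), a run discloses the secret when every run with the same observation lies in the secret,

\[
\mathit{Disclose}(\varphi, \mathcal{O}) \;=\; \{\rho \in \mathit{Runs} : \mathcal{O}^{-1}(\mathcal{O}(\rho)) \subseteq \varphi\},
\]

and for an MDP one measures \(\mathbb{P}^{\sigma}\big(\mathit{Disclose}(\varphi, \mathcal{O})\big)\) under a scheduler \(\sigma\), optimizing over the class of schedulers under consideration.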
10.
Engineering Applications of Artificial Intelligence, 2001, 14(3): 287-300
In this article we propose a synthesis of recent works concerning a qualitative approach, based on possibility theory, to multi-stage decision under uncertainty. Our framework is a qualitative possibilistic counterpart to Markov decision processes (MDPs), for which we propose dynamic programming-like algorithms. The classical MDP algorithms and their possibilistic counterparts are then experimentally compared on a family of benchmark examples. Finally, we also explore the case of partial observability, thus providing qualitative counterparts to the partially observable Markov decision process (POMDP) framework.
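As an illustration of what a possibilistic dynamic-programming recursion looks like, here is one standard pessimistic finite-horizon formulation from the possibility-theory literature (generic notation; the paper's exact criteria and algorithms may differ):

\[
u_t(s) \;=\; \max_{a}\; \min_{s'}\; \max\!\big(1 - \pi(s' \mid s, a),\; u_{t+1}(s')\big), \qquad u_N(s) = \mu(s),
\]

where \(\pi(\cdot \mid s, a)\) is a possibility distribution over successor states, \(\mu\) is a qualitative utility on terminal states, and all quantities live on a common scale (e.g. [0, 1]).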
11.
Weighted Markov decision processes (MDPs) have long been used to model quantitative aspects of systems in the presence of uncertainty. However, much of the literature on such MDPs takes a monolithic approach, modelling a system as a particular MDP; properties of the system are then inferred by analysis of that particular MDP. In contrast, in this paper we develop compositional methods for reasoning about weighted MDPs, as a possible basis for compositional reasoning about their quantitative behaviour. In particular, we approach these systems from a process algebraic point of view. For these we define a coinductive simulation-based behavioural preorder which is compositional in the sense that it is preserved by structural operators for constructing weighted MDPs from components.
12.
Hyeong Soo Chang, P. J. Fard, S. I. Marcus, M. Shayman. IEEE Transactions on Automatic Control, 2003, 48(6): 976-987
This paper proposes a simple analytical model called M-time-scale Markov decision processes (MMDPs) for hierarchically structured sequential decision-making processes, where decisions at each level of the M-level hierarchy are made on M different discrete time scales. In this model, the state space and the control space of each level in the hierarchy are non-overlapping with those of the other levels, and the hierarchy is structured in a "pyramid" sense: a decision made at a level-m (slower time scale) state, and/or the state itself, affects the evolution of the decision-making process at the lower level m+1 (faster time scale) until a new decision is made at the higher level, but the lower-level decisions themselves do not affect the transition dynamics of the higher levels. The performance produced by the lower-level decisions does, however, affect the higher-level decisions. A hierarchical objective function is defined such that the finite-horizon value of following a (nonstationary) policy at level m+1 over a decision epoch of level m, plus an immediate reward at level m, is the single-step reward for the decision-making process at level m. From this we define a "multi-level optimal value function" and derive a "multi-level optimality equation." We discuss how to solve MMDPs exactly and study approximation methods, along with heuristic sampling-based schemes, for solving MMDPs.
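The hierarchical objective described above can be written schematically as follows; the symbols are illustrative and not the paper's notation:

\[
R_m(x_m, a_m) \;=\; r_m(x_m, a_m) \;+\; V^{\pi_{m+1}}_{H_m}(x_{m+1}),
\]

where \(r_m\) is the immediate level-m reward, \(V^{\pi_{m+1}}_{H_m}\) is the finite-horizon value of following the (nonstationary) level-(m+1) policy \(\pi_{m+1}\) over one level-m decision epoch of length \(H_m\), and \(R_m\) then plays the role of the single-step reward in the level-m decision process.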
13.
A basic formula for online policy gradient algorithms
Xi-Ren Cao. IEEE Transactions on Automatic Control, 2005, 50(5): 696-699
This note presents a (new) basic formula for sample-path-based estimates of performance gradients for Markov systems (called policy gradients in the reinforcement learning literature). With this basic formula, many policy-gradient algorithms, including those that have previously appeared in the literature, can be easily developed. The formula follows naturally from a sensitivity equation in perturbation analysis. New research directions are discussed.
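For orientation, the classical performance-derivative formula of perturbation analysis from which such sample-path gradient estimates are usually derived (standard textbook form in generic notation; not necessarily the note's exact statement): for transition matrices \(P\) and \(P'\) and the parametrized mixture \(P_\delta = P + \delta (P' - P)\),

\[
\frac{d\eta_\delta}{d\delta}\bigg|_{\delta = 0} \;=\; \pi\, (P' - P)\, g,
\]

where \(\eta_\delta\) is the long-run average performance under \(P_\delta\), \(\pi\) is the stationary distribution of \(P\), and \(g\) is the performance potential (relative value) vector of \(P\); sample-path policy-gradient estimators replace \(\pi\) and \(g\) by quantities estimated along a single trajectory.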
14.
A steady-state optimal control problem is considered for nearly completely decomposable Markov chains. In order to apply the policy iteration method of R. A. Howard (Dynamic Programming and Markov Processes, Cambridge, MA: MIT Press, 1960), a high-dimensional, ill-conditioned system of algebraic equations must be solved in the value-determination step. Although algorithms exist for aggregation of the steady-state probability distribution problem, they only provide methods for computing the cost, not the dual variables. Using a singular perturbation approach, an aggregation method for the value-determination equation is developed. The aggregation method is developed in three steps. First, a class of similarity transformations that transform the system into a singularly perturbed form is developed. Second, an aggregation method to compute the steady-state probability distribution is derived. Third, this aggregation method is applied to the value-determination step of Howard's method (1960).
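For reference, the value-determination equations of Howard's policy iteration in their standard average-cost form (generic notation, not the article's):

\[
g \;+\; v(i) \;=\; c_\mu(i) \;+\; \sum_{j} p_{ij}(\mu)\, v(j), \qquad i \in S, \qquad v(i_0) = 0,
\]

where \(\mu\) is the stationary policy under evaluation, \(g\) is its average cost, \(v\) is the relative value (bias) vector, and the normalization \(v(i_0) = 0\) at a reference state pins down the solution; for nearly completely decomposable chains this linear system is high-dimensional and ill-conditioned, which is what the aggregation method addresses.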
15.
16.
An actor-critic type reinforcement learning algorithm is proposed and analyzed for constrained controlled Markov decision processes. The analysis uses multiscale stochastic approximation theory and the 'envelope theorem' of mathematical economics.
17.
Desirable properties of the infinite histories of a finite-state Markov decision process are specified in terms of a finite number of events represented as ω-regular sets. An infinite history of the process produces a reward which depends on the properties it satisfies. The authors investigate the existence of optimal policies and provide algorithms for the construction of such policies.
18.
19.
Machine Learning - Although the notion of task similarity is potentially interesting in a wide range of areas such as curriculum learning or automated planning, it has mostly been tied to transfer...