2.
We propose a unified framework for Markov decision problems and performance sensitivity analysis for multichain Markov processes with both discounted and average-cost performance criteria. With the fundamental concept of performance potentials, we derive both performance-gradient and performance-difference formulas, which play the central role in performance optimization. The standard policy iteration algorithms for both discounted- and average-reward MDPs can be established using the performance-difference formulas in a simple and intuitive way, and the performance-gradient formulas together with stochastic approximation may lead to new optimization schemes. This sensitivity-based point of view of performance optimization provides insights that link perturbation analysis, Markov decision processes, and reinforcement learning together. The research is an extension of previous work on ergodic Markov chains (Cao, Automatica 36 (2000) 771).
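For orientation, the two formulas referred to above take the following familiar form in the ergodic, average-reward case (the paper's multichain and discounted versions generalize these); the notation below is a standard choice, not necessarily the paper's.

```latex
\[
  (I - P)\,g + \eta e = f
  \qquad \text{(Poisson equation defining the potential } g\text{, with } e \text{ the all-ones vector)}
\]
\[
  \eta' - \eta \;=\; \pi'\big[(P' - P)\,g + (f' - f)\big]
  \qquad \text{(performance-difference formula)}
\]
\[
  \frac{d\eta_\delta}{d\delta}\Big|_{\delta = 0} \;=\; \pi\big[(P' - P)\,g + (f' - f)\big],
  \qquad P_\delta = P + \delta(P' - P),\; f_\delta = f + \delta(f' - f)
  \qquad \text{(performance-derivative formula)}
\]
```

Here $P, P'$ are the transition matrices of two policies with reward vectors $f, f'$, stationary distributions $\pi, \pi'$, and average rewards $\eta = \pi f$, $\eta' = \pi' f'$.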
3.
This communique provides an exact iterative search algorithm for the NP-hard problem of finding an optimal feasible stationary Markovian pure policy that maximizes the value averaged over an initial state distribution in finite constrained Markov decision processes. It is based on a novel characterization of the entire feasible policy space and takes the spirit of policy iteration (PI): a sequence of monotonically improving feasible policies is generated and converges to an optimal policy within a number of iterations bounded, in the worst case, by the size of the policy space. Unlike PI, an unconstrained MDP must be solved at the iterations that involve feasible policies, and the current best policy improves upon all feasible policies contained in the union of the policy spaces associated with those unconstrained MDPs.
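Since the search repeatedly solves unconstrained MDPs in the spirit of PI, plain policy iteration is the natural subroutine. The sketch below is the textbook routine only, not the paper's constrained search; the array-based interface is an assumption.

```python
import numpy as np

def policy_iteration(P, r, gamma):
    """Textbook policy iteration for a finite discounted MDP.

    P : array (A, S, S) of transition probabilities, r : array (A, S) of rewards.
    Returns a deterministic policy (array of actions) and its value function.
    """
    A, S, _ = P.shape
    policy = np.zeros(S, dtype=int)
    while True:
        # policy evaluation: solve (I - gamma * P_pi) v = r_pi
        P_pi = P[policy, np.arange(S), :]
        r_pi = r[policy, np.arange(S)]
        v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        # policy improvement: greedy with respect to v
        q = r + gamma * np.einsum("ast,t->as", P, v)
        new_policy = q.argmax(axis=0)
        if np.array_equal(new_policy, policy):
            return policy, v
        policy = new_policy
```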
4.
This article proposes several two-timescale simulation-based actor-critic algorithms for the solution of infinite horizon Markov decision processes with finite state space under the average cost criterion. Two of the algorithms are for the compact (non-discrete) action setting while the rest are for finite action spaces. On the slower timescale, all the algorithms perform a gradient search over the corresponding policy spaces using two different Simultaneous Perturbation Stochastic Approximation (SPSA) gradient estimates. On the faster timescale, the differential cost function corresponding to a given stationary policy is updated, and an additional averaging is performed for enhanced performance. A proof of convergence to a locally optimal policy is presented. Next, we discuss a memory-efficient implementation that uses a feature-based representation of the state space and performs TD(0) learning along the faster timescale. The TD(0) algorithm does not follow an on-line sampling of states but is observed to do well in our setting. Numerical experiments on a problem of rate-based flow control are presented using the proposed algorithms. We consider here the model of a single bottleneck node in the continuous time queueing framework. We show performance comparisons of our algorithms with the two-timescale actor-critic algorithms of Konda and Borkar (1999) and Bhatnagar and Kumar (2004). Our algorithms exhibit more than an order of magnitude better performance over those of Konda and Borkar (1999).
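The slower-timescale gradient search rests on SPSA. The following is a minimal sketch of a two-sided SPSA gradient estimate; the performance oracle J, the perturbation size delta, and the Rademacher perturbation are generic illustrative choices rather than the article's exact construction.

```python
import numpy as np

def spsa_gradient(J, theta, delta=0.1, rng=None):
    """Two-sided SPSA estimate of the gradient of a performance function J at theta.

    J     : callable returning a (possibly noisy) scalar performance estimate
    theta : current parameter vector of the policy
    delta : perturbation size
    """
    rng = rng or np.random.default_rng()
    # Rademacher (+/-1) perturbation, one random direction for all coordinates
    Delta = rng.choice([-1.0, 1.0], size=theta.shape)
    j_plus = J(theta + delta * Delta)    # simulate with positively perturbed parameters
    j_minus = J(theta - delta * Delta)   # simulate with negatively perturbed parameters
    # All partial derivatives are estimated from only two simulations
    return (j_plus - j_minus) / (2.0 * delta * Delta)
```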
5.
Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results
This paper presents a detailed study of average reward reinforcement learning, an undiscounted optimality framework that is more appropriate for cyclical tasks than the much better studied discounted framework. A wide spectrum of average reward algorithms are described, ranging from synchronous dynamic programming methods to several (provably convergent) asynchronous algorithms from optimal control and learning automata. A general sensitive discount optimality metric called n-discount-optimality is introduced and used to compare the various algorithms. The overview identifies a key similarity across several asynchronous algorithms that is crucial to their convergence, namely independent estimation of the average reward and the relative values. The overview also uncovers a surprising limitation shared by the different algorithms: while several algorithms can provably generate gain-optimal policies that maximize average reward, none of them can reliably filter these to produce bias-optimal (or T-optimal) policies that also maximize the finite reward to absorbing goal states. This paper also presents a detailed empirical study of R-learning, an average reward reinforcement learning method, using two empirical testbeds: a stochastic grid world domain and a simulated robot environment. A detailed sensitivity analysis of R-learning is carried out to test its dependence on learning rates and exploration levels. The results suggest that R-learning is quite sensitive to exploration strategies and can fall into sub-optimal limit cycles. The performance of R-learning is also compared with that of Q-learning, the best studied discounted RL method. Here, the results suggest that R-learning can be fine-tuned to give better performance than Q-learning in both domains.
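As a reference point for the empirical study described above, the R-learning update is commonly stated as follows (after Schwartz); the tabular arrays and the step sizes alpha and beta in this sketch are illustrative assumptions.

```python
import numpy as np

def r_learning_step(R, rho, s, a, r, s_next, alpha=0.01, beta=0.1):
    """One tabular R-learning update (average-reward analogue of Q-learning).

    R   : 2-D array of relative action values, indexed by [state, action]
    rho : current scalar estimate of the average reward per step
    """
    best_next = R[s_next].max()
    best_here = R[s].max()
    greedy = R[s, a] >= best_here            # was the executed action greedy?
    # relative-value update with average-adjusted, undiscounted target
    R[s, a] += beta * (r - rho + best_next - R[s, a])
    if greedy:
        # update the average-reward estimate only on greedy steps
        rho += alpha * (r + best_next - best_here - rho)
    return rho
```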
6.
An actor-critic type reinforcement learning algorithm is proposed and analyzed for constrained controlled Markov decision processes. The analysis uses multiscale stochastic approximation theory and the “envelope theorem” of mathematical economics.
7.
Xi-Ren Cao, Zhiyuan Ren, Shalabh Bhatnagar, Michael Fu, Steven Marcus 《Automatica》2002,38(6):929-943
We propose a time aggregation approach for the solution of infinite horizon average cost Markov decision processes via policy iteration. In this approach, policy update is only carried out when the process visits a subset of the state space. As in state aggregation, this approach leads to a reduced state space, which may lead to a substantial reduction in computational and storage requirements, especially for problems with certain structural properties. However, in contrast to state aggregation, which generally results in an approximate model due to the loss of Markov property, time aggregation suffers no loss of accuracy, because the Markov property is preserved. Single sample path-based estimation algorithms are developed that allow the time aggregation approach to be implemented on-line for practical systems. Some numerical and simulation examples are presented to illustrate the ideas and potential computational savings.
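The accuracy claim rests on the fact that, for an ergodic chain and a recurrent subset of states, the long-run average cost can be written as a ratio over the cycles defined by successive visits to that subset. In standard notation (ours, not necessarily the paper's), with $T_k$ the epoch of the $k$-th visit, $f$ the cost function, and expectations taken under the stationary law of the embedded chain at the visit epochs:

```latex
\[
  \eta \;=\; \lim_{N\to\infty}\frac{1}{N}\sum_{t=0}^{N-1} f(X_t)
       \;=\; \frac{\mathbb{E}\!\left[\sum_{t=T_k}^{T_{k+1}-1} f(X_t)\right]}
                  {\mathbb{E}\!\left[\,T_{k+1}-T_k\,\right]}.
\]
```

Policy updates performed only at the visit epochs therefore work with an exact, not approximate, representation of the average cost.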
8.
Policy iteration for customer-average performance optimization of closed queueing systems
Li Xia 《Automatica》2009,45(7):1639
We consider the optimization of queueing systems with service rates depending on system states. The optimization criterion is the long-run customer-average performance, an important performance metric that differs from the traditional time-average performance. We first establish, with perturbation analysis, a difference equation of the customer-average performance in closed networks with exponentially distributed service times and state-dependent service rates. Then we propose a policy iteration optimization algorithm based on this difference equation. This algorithm can be implemented on-line with a single sample path and does not require knowing the routing probabilities of the queueing systems. Finally, we give numerical experiments that demonstrate the efficiency of our algorithm. This paper opens a new direction for efficiently optimizing the “customer-centric” performance of queueing systems.
9.
This communique presents an algorithm called “value set iteration” (VSI) for solving infinite horizon discounted Markov decision processes with finite state and action spaces, as a simple generalization of value iteration (VI) and as a counterpart to Chang’s policy set iteration. VSI generates a sequence of value functions by manipulating a set of value functions at each iteration, and the sequence converges to the optimal value function. VSI preserves the convergence properties of VI while converging no slower than VI; in particular, if the set used in VSI contains the value functions of sample policies generated independently from a given distribution and of a properly defined policy obtained by policy switching, a probabilistic exponential convergence rate for VSI can be established. Because the set used in VSI can contain the value functions of any policies generated by other existing algorithms, VSI is also a general framework for combining multiple solution methods.
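For context, the baseline operator that VSI generalizes is the usual Bellman backup of VI. The array-based sketch below shows plain VI only, under an assumed (A, S, S) model interface; VSI's manipulation of a set of value functions is not reproduced here.

```python
import numpy as np

def value_iteration(P, r, gamma, tol=1e-8):
    """Plain value iteration for a finite discounted MDP.

    P : array (A, S, S), P[a, s, s'] = transition probability
    r : array (A, S), immediate reward for action a in state s
    """
    A, S, _ = P.shape
    V = np.zeros(S)
    while True:
        Q = r + gamma * np.einsum("ast,t->as", P, V)  # one-step Bellman lookahead
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```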
10.
Xi-Ren Cao 《Annual Reviews in Control》2009,33(1):11-24
We introduce a sensitivity-based view of the area of learning and optimization of stochastic dynamic systems. We show that this sensitivity-based view provides a unified framework for many different disciplines in this area, including perturbation analysis, Markov decision processes, reinforcement learning, identification and adaptive control, and singular stochastic control; and that this unified framework applies to both discrete event dynamic systems and continuous-time continuous-state systems. Many results in these disciplines can be simply derived and intuitively explained by using two performance sensitivity formulas. In addition, we show that this sensitivity-based view leads to new results and opens up new directions for future research. For example, the nth-bias optimality of Markov processes has been established, and event-based optimization may be developed; this approach has computational and other advantages over state-based approaches.
11.
The introduction of logical Markov decision processes and relational Markov decision processes makes it possible to express complex Markov decision processes concisely and declaratively. This paper first introduces the basic concepts of logical MDPs and relational MDPs, and then focuses on several algorithms that differ fundamentally from those for ordinary MDPs: (1) conversion methods that rely on reinforcement learning over the underlying ground state space; (2) methods that extend the Bellman equation to abstract state spaces; (3) methods that search for near-optimal policies using a policy-bias space. Finally, the current state of research on these models is summarized and some prospects for their future development are given.
12.
13.
Ronald Ortner 《Minds and Machines》2008,18(4):521-526
We give an example from the theory of Markov decision processes which shows that the “optimism in the face of uncertainty” heuristic may fail to make any progress. This is due to the impossibility of falsifying a belief that a (transition) probability is larger than 0. Our example shows the utility of Popper’s demand for falsifiability of hypotheses in the area of artificial intelligence.
14.
A critical issue for the application of Markov decision processes (MDPs) to realistic problems is how the complexity of planning scales with the size of the MDP. In stochastic environments with very large or infinite state spaces, traditional planning and reinforcement learning algorithms may be inapplicable, since their running time typically grows linearly with the state space size in the worst case. In this paper we present a new algorithm that, given only a generative model (a natural and common type of simulator) for an arbitrary MDP, performs on-line, near-optimal planning with a per-state running time that has no dependence on the number of states. The running time is exponential in the horizon time (which depends only on the discount factor and the desired degree of approximation to the optimal policy). Our algorithm thus provides a different complexity trade-off than classical algorithms such as value iteration: rather than scaling linearly in both horizon time and state space size, our running time trades an exponential dependence on the former in exchange for no dependence on the latter. Our algorithm is based on the idea of sparse sampling. We prove that a randomly sampled look-ahead tree that covers only a vanishing fraction of the full look-ahead tree nevertheless suffices to compute near-optimal actions from any state of an MDP. Practical implementations of the algorithm are discussed, and we draw ties to our related recent results on finding a near-best strategy from a given class of strategies in very large partially observable MDPs (Kearns, Mansour, & Ng, Neural Information Processing Systems 13, to appear).
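A minimal recursive sketch of the sparse-sampling planner just described, assuming a generative model exposed as model(s, a) -> (next_state, reward); the parameter names H (lookahead horizon) and C (samples per action) follow common usage, and the flat recursion ignores refinements discussed in the paper.

```python
import random

def sparse_sample_q(model, s, actions, gamma, H, C):
    """Estimate Q(s, a) for all actions by sparse sampling with a generative model."""
    if H == 0:
        return {a: 0.0 for a in actions}
    q = {}
    for a in actions:
        total = 0.0
        for _ in range(C):                       # C independent successor samples per action
            s_next, reward = model(s, a)
            v_next = max(sparse_sample_q(model, s_next, actions, gamma, H - 1, C).values())
            total += reward + gamma * v_next
        q[a] = total / C                         # Monte Carlo average of the backups
    return q

def sparse_sample_action(model, s, actions, gamma, H, C):
    """Near-optimal action at s; per-call cost depends on (A*C)**H, not on the state space size."""
    q = sparse_sample_q(model, s, actions, gamma, H, C)
    return max(q, key=q.get)
```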
15.
Yat-wah Wan 《Automatica》2006,42(3):393-403
The solution of Markov Decision Processes (MDPs) often relies on special properties of the processes. For two-level MDPs, the difference in the rates of state changes of the upper and lower levels has led to limiting or approximate solutions of such problems. In this paper, we solve a two-level MDP without making any assumption on the rates of state changes of the two levels. We first show that such a two-level MDP is a non-standard one where the optimal actions of different states can be related to each other. Then we give assumptions (conditions) under which such a specially constrained MDP can be solved by policy iteration. We further show that the computational effort can be reduced by decomposing the MDP. A two-level MDP with M upper-level states can be decomposed into one MDP for the upper level and M to M(M-1) MDPs for the lower level, depending on the structure of the two-level MDP. The upper-level MDP is solved by time aggregation, a technique introduced in a recent paper [Cao, X.-R., Ren, Z. Y., Bhatnagar, S., Fu, M., & Marcus, S. (2002). A time aggregation approach to Markov decision processes. Automatica, 38(6), 929-943.], and the lower-level MDPs are solved by embedded Markov chains.
16.
Thomas J. Walsh, Ali Nouri, Lihong Li, Michael L. Littman 《Autonomous Agents and Multi-Agent Systems》2009,18(1):83-105
This work considers the problems of learning and planning in Markovian environments with constant observation and reward delays. We provide a hardness result for the general planning problem and positive results for several special cases with deterministic or otherwise constrained dynamics. We present an algorithm, Model Based Simulation, for planning in such environments and use model-based reinforcement learning to extend this approach to the learning setting in both finite and continuous environments. Empirical comparisons show this algorithm holds significant advantages over others for decision making in delayed-observation environments.
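The core trick for the deterministic special case is to roll a learned model forward from the last observed state through the actions already issued but not yet reflected in an observation, then act as if undelayed in the resulting state. A minimal sketch under that assumption; the names model, pending_actions, and greedy_policy are illustrative, not the paper's API.

```python
def estimate_current_state(model, s_observed, pending_actions):
    """Roll a deterministic model forward from the delayed observation
    through the actions taken since that observation was generated."""
    s = s_observed
    for a in pending_actions:
        s = model(s, a)   # deterministic transition assumed
    return s

# Illustrative usage: choose the next action in the estimated current state.
# a = greedy_policy(estimate_current_state(model, s_observed, pending_actions))
```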
17.
This paper analyzes problems that arise in discounted reinforcement learning, reports comparative experiments on the effect of discounting in the SARSA(λ) algorithm for MDPs, and discusses the influence of the average-reward constant on the undiscounted SARSA(λ) algorithm.
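For reference, the standard tabular SARSA(λ) update with accumulating traces is sketched below in its discounted form; the undiscounted variant discussed above would set γ = 1 and subtract an average-reward term from the TD error. The array-based interface and step sizes are illustrative assumptions.

```python
import numpy as np

def sarsa_lambda_step(Q, E, s, a, r, s_next, a_next,
                      alpha=0.1, gamma=0.95, lam=0.9):
    """One tabular SARSA(lambda) update with accumulating eligibility traces.

    Q, E : 2-D arrays indexed by [state, action] (action values and traces).
    """
    delta = r + gamma * Q[s_next, a_next] - Q[s, a]  # TD error
    E[s, a] += 1.0                                   # accumulate the trace
    Q += alpha * delta * E                           # credit all traced pairs
    E *= gamma * lam                                 # decay traces
    return Q, E
```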
18.
Shaping multi-agent systems with gradient reinforcement learning
Olivier Buffet, Alain Dutech, François Charpillet 《Autonomous Agents and Multi-Agent Systems》2007,15(2):197-220
An original reinforcement learning (RL) methodology is proposed for the design of multi-agent systems. In the realistic setting of situated agents with local perception, the task of automatically building a coordinated system is of crucial importance. To that end, we design simple reactive agents in a decentralized way as independent learners. But to cope with the difficulties inherent in using RL in that framework, we have developed an incremental learning algorithm in which agents face a sequence of progressively more complex tasks. We illustrate this general framework with computer experiments where agents have to coordinate to reach a global goal. This work was conducted in part in NICTA’s Canberra laboratory.
19.
Humans often handle a problem on two levels: first grasping it as a whole, that is, forming a rough plan, and then carrying it out in detail. In other words, humans are an excellent example of a multi-resolution intelligent system, able to generalize bottom-up across several levels (viewing the problem at a coarser granularity, akin to abstraction) and to instantiate top-down (viewing it at a finer granularity, akin to concretization). On this basis, a semi-Markov decision process is constructed from Markov decision processes that run separately on two levels, an ideal space (generalization) and an actual space (instantiation); we call it the joint bi-level Markov decision process model. We then discuss an optimal-policy algorithm for this joint model, and finally give an example showing that the joint bi-level model can economize on “thought” and offers a good compromise between computational effectiveness and feasibility.
20.
Basic Ideas for Event-Based Optimization of Markov Systems
The goal of this paper is two-fold. First, we present a sensitivity point of view on the optimization of Markov systems. We show that Markov decision processes (MDPs) and the policy-gradient approach, or perturbation analysis (PA), can be derived easily from two fundamental sensitivity formulas, and that such formulas can be constructed flexibly, by first principles, with performance potentials as building blocks. Second, with this sensitivity view we propose an event-based optimization approach, including event-based sensitivity analysis and event-based policy iteration. This approach utilizes the special features of a system characterized by events, and illustrates how the potentials can be aggregated using these features and how the aggregated potentials can be used in policy iteration. Compared with the traditional MDP approach, the event-based approach has several advantages: the number of aggregated potentials may scale with the system size even though the number of states grows exponentially in the system size, which reduces the policy space and saves computation; the approach does not require actions at different states to be independent; and it utilizes the special features of a system and does not need to know the exact transition probability matrix. The main ideas of the approach are illustrated by an admission control problem. Supported in part by a grant from Hong Kong UGC.