Related Articles
1.
We propose a unified framework for Markov decision problems and performance sensitivity analysis for multichain Markov processes with both discounted and average-cost performance criteria. With the fundamental concept of performance potentials, we derive both performance-gradient and performance-difference formulas, which play the central role in performance optimization. The standard policy iteration algorithms for both discounted- and average-reward MDPs can be established using the performance-difference formulas in a simple and intuitive way, and the performance-gradient formulas together with stochastic approximation may lead to new optimization schemes. This sensitivity-based point of view of performance optimization provides insights that link perturbation analysis, Markov decision processes, and reinforcement learning together. The research is an extension of previous work on ergodic Markov chains (Cao, Automatica 36 (2000) 771).
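For the ergodic case covered by the earlier work cited above, the two central formulas can be stated compactly. The following is a sketch of the standard single-chain forms (notation assumed here: \pi the stationary distribution, f the reward vector, e the all-ones vector, g the potential vector); the multichain and discounted variants in the paper generalize these:

\eta' - \eta = \pi' \left[ (f' - f) + (P' - P)\, g \right], \qquad
\left.\frac{d\eta}{d\delta}\right|_{\delta=0} = \pi \left[ (f' - f) + (P' - P)\, g \right],

where g solves the Poisson equation (I - P + e\pi)\, g = f and the derivative is taken along the mixture P(\delta) = P + \delta (P' - P), f(\delta) = f + \delta (f' - f). Policy improvement follows directly from the difference formula: if f' + P'g \ge f + Pg componentwise with strict inequality somewhere, then \eta' > \eta.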

2.
Semi-Markov decision problems and performance sensitivity analysis
Recent research indicates that Markov decision processes (MDPs) can be viewed from a sensitivity point of view, and that perturbation analysis (PA), MDPs, and reinforcement learning (RL) are three closely related areas in the optimization of discrete-event dynamic systems that can be modeled as Markov processes. The goal of this paper is two-fold. First, we develop the PA theory for semi-Markov processes (SMPs); we then extend the aforementioned results on the relation among PA, MDPs, and RL to SMPs. In particular, we show that performance sensitivity formulas and policy iteration algorithms for semi-Markov decision processes can be derived based on the performance potential and the realization matrix. Both the long-run average and discounted-cost problems are considered. This approach provides a unified framework for both problems, with the long-run average problem corresponding to the discount factor being zero. The results indicate that performance sensitivities and optimization depend only on first-order statistics. Single-sample-path-based implementations are discussed.
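One way to see the claimed unification (a sketch under assumptions consistent with the abstract, not a quotation of the paper's formulas): define the \alpha-potential of state i, for discount rate \alpha \ge 0, by

g_\alpha(i) = E\left[ \int_0^\infty e^{-\alpha t} \bigl( f(X_t) - \eta \bigr)\, dt \;\Big|\; X_0 = i \right],

where \eta is the long-run average reward. For \alpha > 0 this is a discounted potential, and letting \alpha \to 0 recovers the average-reward potential, matching the statement that the long-run average problem corresponds to the discount factor being zero.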

3.
The standard approach to stochastic control is dynamic programming. In this paper, we introduce an alternative approach based on direct comparison of the performance of any two policies. This is achieved by modeling the state process as a continuous-time, continuous-state Markov process and applying the same ideas as in the discrete-time, discrete-state case. This approach is simple and intuitively clear; it applies in the same way to different problems with finite and infinite horizons, discounted and long-run-average performance, and continuous and jump diffusions. Discounting is not needed when dealing with long-run average performance. The approach provides a unified framework for stochastic control and other optimization theories and methodologies, including Markov decision processes, perturbation analysis, and reinforcement learning.

4.
The sensitivity-based optimization of Markov systems has become an increasingly important area. From the perspective of performance sensitivity analysis, policy-iteration algorithms and gradient estimation methods can be obtained directly for Markov decision processes (MDPs). In this correspondence, sensitivity-based optimization is extended to average-reward partially observable MDPs (POMDPs). We derive the performance-difference and performance-derivative formulas of POMDPs. On the basis of the performance-derivative formula, we present a new method to estimate the performance gradients. From the performance-difference formula, we obtain a sufficient optimality condition without the discounted-reward formulation. We also propose a policy-iteration algorithm to obtain a nearly optimal finite-state-controller policy.

5.
In this paper, we propose a new approach to the theory of finite multichain Markov decision processes (MDPs) with different performance optimization criteria. We first propose the concept of nth-order bias; then, using the average reward and bias difference formulas derived in this paper, we develop an optimization theory for finite MDPs that covers a complete spectrum, from average optimality and bias optimality to all high-order bias optimality, in a unified way. The approach is simple, direct, natural, and intuitive; it depends neither on Laurent series expansion nor on discounted MDPs. We also propose one-phase policy iteration algorithms for bias and high-order bias optimal policies, which are more efficient than the two-phase algorithms in the literature. Furthermore, we derive high-order bias optimality equations. This research is part of our effort to develop a sensitivity-based learning and optimization theory.
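For orientation, a sketch of the quantities involved in the unichain, aperiodic case (an assumption of this sketch; the paper treats general multichain MDPs): the average reward is \eta = \pi f, and the (first-order) bias g solves the Poisson equation

(I - P)\, g + e\eta = f, \qquad \pi g = 0, \quad\text{equivalently}\quad g = \sum_{t=0}^{\infty} (P^t - e\pi)\, f.

Higher-order biases can then be defined recursively via Poisson equations whose right-hand sides involve the next-lower-order bias; the nth-order bias difference formulas in the paper compare these quantities across policies.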

6.
In many practical systems, control or decision making is triggered by certain events. The performance optimization of such systems generally differs from traditional optimization approaches such as Markov decision processes or dynamic programming. The goal of this tutorial is to introduce, in an intuitive manner, a new optimization framework called event-based optimization, which is widely applicable to such systems. With performance potentials as building blocks, we develop two intuitive optimization algorithms to solve the event-based optimization problem. The algorithms are proposed on the basis of an intuitive principle, and theoretical justifications are given with a performance-sensitivity-based approach. Finally, we provide a few practical examples to demonstrate the effectiveness of the event-based optimization framework. We hope this framework may provide a new perspective on the performance optimization of event-triggered dynamic systems.

7.
Prioritized Sweeping: Reinforcement Learning with Less Data and Less Time
We present a new algorithm, prioritized sweeping, for efficient prediction and control of stochastic Markov systems. Incremental learning methods such as temporal differencing and Q-learning have real-time performance. Classical methods are slower but more accurate, because they make full use of the observations. Prioritized sweeping aims for the best of both worlds. It uses all previous experiences both to prioritize important dynamic programming sweeps and to guide the exploration of state space. We compare prioritized sweeping with other reinforcement learning schemes on a number of different stochastic optimal control problems. It successfully solves large state-space real-time problems with which other methods have difficulty.
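A minimal sketch of the planning loop, assuming a deterministic learned model for brevity (the paper's algorithm handles stochastic models and interleaves these sweeps with real experience; model, theta, and n_backups are illustrative names, not the paper's notation):

import heapq
import itertools
from collections import defaultdict

def prioritized_sweeping(model, actions, gamma=0.95, theta=1e-4, n_backups=100):
    """Planning by prioritized sweeping on a learned deterministic model.

    model[(s, a)] -> (reward, next_state). Backups are ordered by a
    priority queue keyed on the magnitude of the value change each
    backup would cause (its Bellman error).
    """
    Q = defaultdict(float)
    predecessors = defaultdict(set)  # predecessors[s'] = {(s, a) leading to s'}
    for (s, a), (_, s2) in model.items():
        predecessors[s2].add((s, a))

    tie = itertools.count()  # tie-breaker so the heap never compares states
    pq = []
    for (s, a), (r, s2) in model.items():
        p = abs(r + gamma * max(Q[(s2, b)] for b in actions) - Q[(s, a)])
        if p > theta:
            heapq.heappush(pq, (-p, next(tie), (s, a)))

    for _ in range(n_backups):
        if not pq:
            break
        _, _, (s, a) = heapq.heappop(pq)
        r, s2 = model[(s, a)]
        Q[(s, a)] = r + gamma * max(Q[(s2, b)] for b in actions)
        # The value of s just changed, so its predecessors may need backups too.
        for (ps, pa) in predecessors[s]:
            pr, _ = model[(ps, pa)]
            p = abs(pr + gamma * max(Q[(s, b)] for b in actions) - Q[(ps, pa)])
            if p > theta:
                heapq.heappush(pq, (-p, next(tie), (ps, pa)))
    return Q

A toy call would be prioritized_sweeping({(0, 'a'): (1.0, 1), (1, 'a'): (0.0, 0)}, actions=['a']); the point of the priority queue is that computation concentrates on the states whose values are actually in flux.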

8.
We study the structure of sample paths of Markov systems by using performance potentials as the fundamental units. With a sample-path-based approach, we show that performance sensitivity formulas (performance gradients and performance differences) of Markov systems can be constructed intuitively, by first principles, with performance potentials (or, equivalently, perturbation realization factors) as building blocks. In particular, we derive sensitivity formulas for two Markov chains with possibly different state spaces. The proposed approach can be used to obtain, flexibly, the sensitivity formulas for a wide range of problems, including those with partial information. These formulas are the basis for performance optimization of discrete-event dynamic systems, including perturbation analysis, Markov decision processes, and reinforcement learning. The approach thus provides insight into on-line learning and performance optimization and opens up new research directions. Sample-path-based algorithms can be developed.

9.
An adaptive optimization algorithm for timeout policies in dynamic power management
Based on reinforcement learning, we propose an adaptive online optimization algorithm for timeout policies in dynamic power management. We build a semi-Markov control process model of the timeout-policy-based dynamic power management system, transforming the power management problem into a constrained optimization problem. Exploiting the dynamic structure of this model, we combine online gradient estimation with stochastic approximation to derive an online optimization algorithm for the timeout policy. The algorithm is highly adaptive, computationally light, and globally convergent. Its effectiveness is verified through a simulated application to dynamic power management of a wireless network communication node.
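A minimal sketch of the stochastic-approximation loop the abstract describes (hypothetical names: estimate_gradient stands in for the paper's online gradient estimator, and the bounds model the constraint set; this illustrates the update scheme, not the paper's algorithm):

def optimize_timeout(estimate_gradient, tau0=5.0, tau_min=0.0, tau_max=60.0,
                     n_iters=200):
    """Robbins-Monro update of a single timeout parameter tau.

    estimate_gradient(tau) should return a noisy estimate of
    d(average cost)/d(tau), e.g. computed online from one observed
    busy/idle cycle of the power-managed device.
    """
    tau = tau0
    for k in range(1, n_iters + 1):
        g = estimate_gradient(tau)
        step = 1.0 / k  # step sizes: sum diverges, sum of squares converges
        tau -= step * g
        tau = min(max(tau, tau_min), tau_max)  # project onto the constraint set
    return tau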

10.
A goal-oriented action that is performed by a biocybernetical system, with minimum losses, subject to some constraints, is termed a cybernetical action. Actions of this kind (often related to survival, e.g., defense or predatory actions) are widespread in nature. A mathematical model of these actions, based on a special kind of operator, termed message operators (or optimization operators), is investigated in connection with the customary transition matrices used in the theory of Markov processes. It is shown that, under certain conditions, transition matrices may be regarded as a special kind of message operator; hence, most of the properties of Markov chains can be expressed in terms of the latter. Accordingly, a new class of "stochastic processes with optimization" is defined, which enables Markovian as well as non-Markovian evolutionary processes to be modeled in a general way. Furthermore, by taking into account some recovery phenomena (related to biological rhythms), a new model of sequential actions is suggested. It is shown that the corresponding stochastic process with optimization always involves adaptive learning (that is, the action is stored in the memory of the system in its optimal form). Thus, by including in the mathematical model the fundamental "optimization principle" inherent to the nature of any biological system, a new tool is obtained that permits investigating some unknown aspects of biocybernetical systems in a more general manner than the customary means.

11.
The goals of perturbation analysis (PA), Markov decision processes (MDPs), and reinforcement learning (RL) are common: to make decisions that improve system performance based on information obtained by analyzing the current system behavior. In this paper, we study the relations among these closely related fields. We show that MDP solutions can be derived naturally from the performance sensitivity analysis provided by PA. Performance potential plays an important role in both PA and MDPs; it also offers a clear, intuitive interpretation of many results. Reinforcement learning, TD(λ), neuro-dynamic programming, etc., are efficient ways of estimating the performance potentials and related quantities based on sample paths. The sensitivity point of view of PA, MDPs, and RL brings new insight into the area of learning and optimization. In particular, gradient-based optimization can be applied to parameterized systems with large state spaces, and gradient-based policy iteration can be applied to some nonstandard MDPs, such as systems with correlated actions. Potential-based on-line approaches and their advantages are also discussed.

12.
We present a non-equilibrium analysis and control approach for the Active Queue Management (AQM) problem in communication networks. Using simplified fluid models, we carry out a bifurcation study of the complex dynamic queue behavior to show that non-equilibrium methods are essential for analysis and optimization in the AQM problem. We investigate an ergodic-theoretic framework for stochastic modeling of the non-equilibrium behavior in deterministic models and use it to identify parameters of a fluid model from packet-level simulations. For computational tractability, we use set-oriented numerical methods to construct finite-dimensional Markov models, including control Markov chains and hidden Markov models. Subsequently, we develop and analyze an example AQM algorithm using a Markov decision process (MDP) based control framework. The control scheme developed is optimal with respect to a reward function defined over the queue size and aggregate flow rate. We implement and simulate our illustrative AQM algorithm in the ns-2 network simulator. The results confirm the theoretical analysis and exhibit promising performance compared with well-known alternative schemes under persistent non-equilibrium queue behavior.
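The set-oriented step can be illustrated compactly. A sketch assuming a scalar trajectory of normalized queue lengths sampled from a packet-level simulation, with illustrative bin edges (the paper's construction works on the fluid-model state space and also builds control and hidden Markov models):

import numpy as np

def empirical_markov_model(samples, n_bins=50, lo=0.0, hi=1.0):
    """Finite-dimensional Markov model via set-oriented discretization:
    partition the state interval into boxes and count the observed
    one-step transitions between boxes along the trajectory.
    """
    edges = np.linspace(lo, hi, n_bins + 1)
    idx = np.clip(np.digitize(samples, edges) - 1, 0, n_bins - 1)
    counts = np.zeros((n_bins, n_bins))
    for i, j in zip(idx[:-1], idx[1:]):
        counts[i, j] += 1.0
    rows = counts.sum(axis=1, keepdims=True)
    # Unvisited boxes get a uniform row so the matrix stays stochastic.
    P = np.divide(counts, rows, out=np.full_like(counts, 1.0 / n_bins),
                  where=rows > 0)
    return P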

13.
This paper concerns a new methodology for the adaptive optimization of piecewise deterministic non-Markovian systems, via a simple example of interest in manufacturing. The methodology takes into account the fact that piecewise deterministic systems are rarely Markovian and that classical control theory, based on dynamic programming for Markovian systems, cannot provide quantitative answers in most realistic situations. The example considered is a single-machine/single-part production system, viewed from the standpoint of the control theory of Kimemia and Gershwin (1983). We obtain stochastic gradient estimates via the perturbation analysis of Ho and Cao (1991), using the "regenerative" method of Konstantopoulos and Zazanis (1992), with a view to stochastic optimization. The mathematical justification requires a careful study of the "regenerative" structure of the process. In particular, we give necessary and sufficient conditions for the stability of this system via the method of Loynes (1962).

14.
A unified NDP approach based on TD(0) learning for MDPs under average and discounted criteria
To meet the needs of practical large-scale Markov systems, we discuss simulation-based learning optimization for Markov decision processes (MDPs). Starting from the definition, we establish a unified temporal-difference formula for performance potentials under the average and discounted performance criteria, represent the estimated potentials with a neural network, and derive a parametric TD(0) learning formula and algorithm for approximate policy evaluation. Based on the approximated potentials, approximate policy iteration then yields a unified neuro-dynamic programming (NDP) optimization method for both criteria. The results also apply to semi-Markov decision processes. A numerical example shows that the neuron policy-iteration algorithm works for both criteria and verifies that the average problem is the limiting case of the discounted problem as the discount factor tends to zero.
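A tabular sketch of a TD(0) update for potentials (the paper represents the potentials with a neural network and gives a single unified formula for both criteria; the form below, with a running average-reward estimate and a discount parameter beta, is a simplified illustration in that spirit, not the paper's exact formula):

def td0_potentials(states, rewards, n_states, lr=0.05, beta=0.99):
    """Tabular TD(0) estimation of performance potentials g from one
    sample path. states[n] is the state visited at step n and
    rewards[n] the reward received on leaving it.
    """
    g = [0.0] * n_states
    eta = 0.0
    for n in range(len(states) - 1):
        s, s2, r = states[n], states[n + 1], rewards[n]
        eta += (r - eta) / (n + 1)           # running average reward
        d = (r - eta) + beta * g[s2] - g[s]  # temporal difference
        g[s] += lr * d                       # potential update
    return g, eta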

15.
Inspired by the great success of margin-based classifiers, there is a trend to incorporate the margin concept into hidden Markov modeling for speech recognition. Several attempts based on margin maximization have been proposed recently. In this paper, a new discriminative learning framework, called soft margin estimation (SME), is proposed for estimating the parameters of continuous-density hidden Markov models. The proposed method makes direct use of the successful ideas of soft margins in support vector machines to improve generalization capability, and of decision feedback learning in minimum classification error training to enhance model separation in classifier design. SME is illustrated from the perspective of statistical learning theory. By including a margin in formulating the SME objective function, SME is capable of directly minimizing an approximate test risk bound. Frame selection, utterance selection, and discriminative separation are unified into a single objective function that can be optimized using the generalized probabilistic descent algorithm. Tested on the TIDIGITS connected-digit recognition task, the proposed SME approach achieves a string accuracy of 99.43%. On the 5k-word Wall Street Journal task, SME obtains relative word error rate reductions of about 10% over our best baseline results in different experimental configurations. We believe this is the first attempt to show the effectiveness of margin-based acoustic modeling for large-vocabulary continuous speech recognition in a hidden Markov model framework. Further improvements are expected, because the approximate test risk bound minimization principle offers a flexible and rigorous framework to facilitate the incorporation of new margin-based optimization criteria into hidden Markov model training.

16.
Wennekers T, Ay N. Neural Computation, 2005, 17(10): 2258-2290
We extend Linsker's Infomax principle for feedforward neural networks to a measure of stochastic interdependence that captures spatial and temporal signal properties in recurrent systems. This measure, stochastic interaction, quantifies the Kullback-Leibler divergence of a Markov chain from a product of split chains for the single-unit processes. For unconstrained Markov chains, the maximization of stochastic interaction, also called Temporal Infomax, has previously been shown to result in almost deterministic dynamics. This letter considers Temporal Infomax on constrained Markov chains, where some of the units are clamped to prescribed stochastic processes providing input to the system. Temporal Infomax in that case leads to finite-state automata, either completely deterministic or weakly nondeterministic. Transitions between internal states of these systems are almost perfectly predictable given the complete current state and the input, but the activity of each single unit alone is virtually random. The results are demonstrated by means of computer simulations and confirmed analytically. It is furthermore shown numerically that Temporal Infomax leads to a high information flow from the input to the internal units and that a simple temporal learning rule can approximately achieve the optimization of temporal interaction. We relate these results to experimental data concerning the correlation dynamics and functional connectivities observed in multiple-electrode recordings.
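The measure can be written down compactly. A sketch assuming a stationary chain X = (X^1, ..., X^N) with joint transition kernel P, split (single-unit) kernels P_k, and stationary distribution p:

I(X) = \sum_{x} p(x) \sum_{x'} P(x' \mid x)\, \log \frac{P(x' \mid x)}{\prod_{k=1}^{N} P_k(x'_k \mid x_k)}.

Temporal Infomax maximizes I(X) over the free transition probabilities; in the constrained setting studied here, the rows corresponding to the clamped input units are held fixed.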

17.
We present a framework for performance prediction of distributed and mobile systems. We rely on process calculi and their structural operational semantics. The dynamic behaviour is described through transition systems whose transitions are labelled by encodings of their proofs, which we then map into stochastic processes. We enhance related work by allowing general continuous distributions, resorting to a notion of enabling between transitions. We also discuss how the number of available resources affects the overall model. Finally, we introduce a notion of bisimulation that takes stochastic information into account and prove it to be a congruence. When only exponential distributions are of interest, our equivalence induces a lumpable partition on the underlying Markov process.

18.
We consider the dynamic control of two queues competing for the services of one server. The problem is to design a server time-allocation strategy when the sizes of the queues are not observable. The performance criterion used is total expected aggregate delay. The server is assumed to observe arrivals but not departures. This problem is formulated as a stochastic optimal control problem with partial observations. The framework we adopt is that of stochastic control in discrete time and countable "state space." The observations are modeled as discrete-time, 0-1 point processes with rates that are influenced by a Markov chain. Examples from computer control of urban traffic are given to illustrate the practical motivation behind the present work and to relate it to our earlier work on the subject. A particular feature of the formulation is that the observations are influenced by transitions of the state of the Markov chain. The classical tools of the simple Bayes rule and dynamic programming suffice for the analysis. In particular, we show that the "one-step" predicted density for the state of the Markov chain, given the point-process observations, is a sufficient statistic for control. This framework is then applied to the specific problem of two queues competing for the services of one server. We obtain explicit solutions for the finite-time expected aggregate delay problem. The implications of these results for practical applications, as well as implementation aspects of the resulting optimal control laws, are discussed.
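The recursion for the sufficient statistic can be sketched generically. The snippet below (illustrative names; a plain HMM-style filter, whereas the paper's observations are 0-1 point processes whose rates are influenced by the chain's transitions) shows the two ingredients the abstract names, a Bayes correction followed by a one-step prediction:

import numpy as np

def predicted_density_step(pi_pred, A, lik):
    """One update of the one-step predicted density of the chain state.

    pi_pred: current predicted density over states, shape (n,)
    A:       transition matrix, A[i, j] = P(next = j | current = i)
    lik:     lik[i] = P(observation | state = i) for the new observation
    """
    post = pi_pred * lik
    post = post / post.sum()  # Bayes rule: condition on the observation
    return post @ A           # propagate one step through the chain dynamics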

19.
We analyze the practice of train operation planning and speed adjustment, and study modeling and solution techniques for train operation dispatching. To address the relatively static nature of current train dispatching models, we build an optimization model that integrates stations with block sections and accounts for the dynamic running conditions of trains, and we propose a train dispatching and control framework featuring predictive control, rolling-horizon optimization, and real-time feedback. The framework can increase traffic density, withstand stochastic disturbances, and improve the operational performance of the railway network.

20.
A stochastic switching model and policy optimization for dynamic power management
We propose a policy optimization method for dynamic power management based on continuous-time Markov decision processes. By building a stochastic switching model of the dynamic power management system, we transform the power management problem into a constrained policy optimization problem, and we give a policy-gradient optimization algorithm based on vector composition. The stochastic switching model describes dynamic power management systems accurately, and the policy optimization algorithm is simple and effective; it supports both offline computation and online optimization. Simulation experiments verify the effectiveness of the method.
