Similar Documents
20 similar documents found.
1.
We propose a novel approach, called parallel rollout, to solving (partially observable) Markov decision processes. Our approach generalizes the rollout algorithm of Bertsekas and Castanon (1999) by rolling out a set of multiple heuristic policies rather than a single policy. In particular, the parallel rollout approach aims at the class of problems where we have multiple heuristic policies available such that each policy performs near-optimally for a different set of system paths. Parallel rollout automatically combines the given multiple policies to create a new policy that adapts to the different system paths and improves the performance of each policy in the set. We formally prove this claim for two criteria: total expected reward and infinite horizon discounted reward. The parallel rollout approach also resolves the key issue of selecting which policy to roll out among multiple heuristic policies whose performances cannot be predicted in advance. We present two example problems to illustrate the effectiveness of the parallel rollout approach: a buffer management problem and a multiclass scheduling problem.
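A minimal sketch of the parallel rollout decision rule described above, assuming a user-supplied simulator `simulate(state, action, policy, horizon)` that returns the cumulative reward of taking `action` in `state` and then following `policy`; the function names and sampling parameters are illustrative, not taken from the paper:

```python
def parallel_rollout_action(state, actions, heuristic_policies, simulate,
                            horizon=20, num_samples=16):
    """Choose an action by one-step lookahead followed by rollouts of several
    heuristic (base) policies: the value of each candidate action is the best,
    over the base policies, of the average rollout return. Rolling out the
    whole set of policies is what distinguishes parallel rollout from ordinary
    rollout of a single base policy."""
    best_action, best_value = None, float("-inf")
    for a in actions:
        # For each base policy, estimate the return of taking `a` now and
        # following that policy afterwards; keep the maximum over policies.
        value = max(
            sum(simulate(state, a, pi, horizon) for _ in range(num_samples)) / num_samples
            for pi in heuristic_policies
        )
        if value > best_value:
            best_action, best_value = a, value
    return best_action
```

Because the lookahead takes the maximum over the base policies, the resulting policy adapts to whichever base policy is best for the system path at hand, which is the improvement property the abstract refers to.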

2.
Kearns, Michael; Mansour, Yishay; Ng, Andrew Y. Machine Learning, 2002, 49(2-3): 193-208
A critical issue for the application of Markov decision processes (MDPs) to realistic problems is how the complexity of planning scales with the size of the MDP. In stochastic environments with very large or infinite state spaces, traditional planning and reinforcement learning algorithms may be inapplicable, since their running time typically grows linearly with the state space size in the worst case. In this paper we present a new algorithm that, given only a generative model (a natural and common type of simulator) for an arbitrary MDP, performs on-line, near-optimal planning with a per-state running time that has no dependence on the number of states. The running time is exponential in the horizon time (which depends only on the discount factor and the desired degree of approximation to the optimal policy). Our algorithm thus provides a different complexity trade-off than classical algorithms such as value iteration—rather than scaling linearly in both horizon time and state space size, our running time trades an exponential dependence on the former in exchange for no dependence on the latter. Our algorithm is based on the idea of sparse sampling. We prove that a randomly sampled look-ahead tree that covers only a vanishing fraction of the full look-ahead tree nevertheless suffices to compute near-optimal actions from any state of an MDP. Practical implementations of the algorithm are discussed, and we draw ties to our related recent results on finding a near-best strategy from a given class of strategies in very large partially observable MDPs (Kearns, Mansour, & Ng. Neural information processing systems 13, to appear).
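A minimal sketch of the sparse sampling idea, assuming a generative model `generative_model(state, action)` that returns one sampled `(next_state, reward)` pair; the depth and width defaults are placeholders rather than the values prescribed by the paper's analysis:

```python
def sparse_sampling_q(state, action, actions, generative_model,
                      gamma=0.95, depth=3, width=8):
    """Sampled Q-value: draw `width` successors of (state, action) from the
    generative model and recurse to a fixed depth. The per-decision cost is
    exponential in the depth but independent of the number of states."""
    if depth == 0:
        return 0.0
    total = 0.0
    for _ in range(width):
        next_state, reward = generative_model(state, action)
        total += reward + gamma * max(
            sparse_sampling_q(next_state, a, actions, generative_model,
                              gamma, depth - 1, width)
            for a in actions)
    return total / width


def sparse_sampling_action(state, actions, generative_model, **params):
    """On-line planning: pick the root action with the largest sampled Q-value."""
    return max(actions,
               key=lambda a: sparse_sampling_q(state, a, actions,
                                               generative_model, **params))
```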

3.
We discuss an expected utility of rewards generated by Markov decision processes. This is applied to the optimal stopping problem with a utility treatment. A combined model of the decision process and the stopping problem, called a stopped Markov decision process, is also considered under this utility.

4.
Hong Ye; Bian Xinqian. Computer Simulation, 2007, 24(6): 146-149
When an autonomous underwater vehicle (AUV) navigates a complex ocean environment, it must find a near-optimal path from a given start point to a goal point that passes all obstacles safely and without collision. This paper proposes a new global path planning method based on partially observable Markov decision processes combined with prediction of obstacle motion. A mathematical model of the partially observable Markov decision process is given; a tree-structured, hierarchical POMDP model is built and applied to path planning; and two methods, short-term prediction and long-term prediction, are proposed for forecasting the trajectories of underwater obstacles. Finally, simulation experiments verify the AUV's global path planning capability, laying a solid foundation for future trials on a real vehicle.

5.
By combining Markov decision processes with performance potentials, a policy optimization algorithm for call admission control is presented. The resulting optimal policy is state dependent; compared with a policy that chooses actions based only on the bandwidth already occupied at a node, the state-dependent policy achieves a better performance value, and the algorithm converges quickly.
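For illustration, a generic tabular sketch of policy iteration driven by performance potentials for an average-reward MDP; the matrix layout, ergodicity assumption, and variable names are assumptions of this sketch and do not reproduce the paper's call admission control model:

```python
import numpy as np

def policy_iteration_with_potentials(P, r, num_iters=50):
    """Average-reward policy iteration using performance potentials.
    P[a] is the transition matrix under action a, r[a] the reward vector.
    For the current policy, solve the Poisson equation
        (I - P_pi + 1 pi^T) g = r_pi - eta
    for the potential vector g, then improve the policy greedily."""
    num_actions, num_states, _ = np.shape(P)
    policy = np.zeros(num_states, dtype=int)
    eta, g = 0.0, np.zeros(num_states)
    for _ in range(num_iters):
        P_pi = np.array([P[policy[s], s] for s in range(num_states)])
        r_pi = np.array([r[policy[s], s] for s in range(num_states)])
        # Stationary distribution of P_pi (assumes an ergodic chain).
        evals, evecs = np.linalg.eig(P_pi.T)
        pi_stat = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
        pi_stat = pi_stat / pi_stat.sum()
        eta = pi_stat @ r_pi                      # average reward of current policy
        A = np.eye(num_states) - P_pi + np.outer(np.ones(num_states), pi_stat)
        g = np.linalg.solve(A, r_pi - eta)        # performance potentials
        # Greedy policy improvement using the potentials.
        new_policy = np.array([
            int(np.argmax([r[a, s] + P[a, s] @ g for a in range(num_actions)]))
            for s in range(num_states)])
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy, eta, g
```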

6.
Adaptive sensing involves actively managing sensor resources to achieve a sensing task, such as object detection, classification, and tracking, and represents a promising direction for new applications of discrete event system methods. We describe an approach to adaptive sensing based on approximately solving a partially observable Markov decision process (POMDP) formulation of the problem. Such approximations are necessary because of the very large state space involved in practical adaptive sensing problems, precluding exact computation of optimal solutions. We review the theory of POMDPs and show how the theory applies to adaptive sensing problems. We then describe a variety of approximation methods, with examples to illustrate their application in adaptive sensing. The examples also demonstrate the gains that are possible from nonmyopic methods relative to myopic methods, and highlight some insights into the dependence of such gains on the sensing resources and environment.
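As a concrete reference point for the POMDP formulation reviewed above, a standard Bayes belief update; the array shapes are assumptions for this sketch and are not tied to any particular sensing model from the article:

```python
import numpy as np

def belief_update(belief, action, observation, T, O):
    """Standard POMDP belief update (Bayes filter):
    b'(s') is proportional to O(o | s', a) * sum_s T(s' | s, a) b(s).
    T has shape (A, S, S) with T[a, s, s'] = P(s' | s, a);
    O has shape (A, S, num_obs) with O[a, s', o] = P(o | s', a)."""
    predicted = belief @ T[action]                  # sum_s b(s) T(s' | s, a)
    unnormalized = O[action][:, observation] * predicted
    return unnormalized / unnormalized.sum()
```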

Edwin K. P. Chong received the BE(Hons) degree with First Class Honors from the University of Adelaide, South Australia, in 1987; and the MA and PhD degrees in 1989 and 1991, respectively, both from Princeton University, where he held an IBM Fellowship. He joined the School of Electrical and Computer Engineering at Purdue University in 1991, where he was named a University Faculty Scholar in 1999, and was promoted to Professor in 2001. Since August 2001, he has been a Professor of Electrical and Computer Engineering and a Professor of Mathematics at Colorado State University. His research interests span the areas of communication and sensor networks, stochastic modeling and control, and optimization methods. He coauthored the recent best-selling book, An Introduction to Optimization, 3rd Edition, Wiley-Interscience, 2008. He is currently on the editorial board of the IEEE Transactions on Automatic Control, Computer Networks, Journal of Control Science and Engineering, and IEEE Expert Now. He is a Fellow of the IEEE, and served as an IEEE Control Systems Society Distinguished Lecturer. He received the NSF CAREER Award in 1995 and the ASEE Frederick Emmons Terman Award in 1998. He was a co-recipient of the 2004 Best Paper Award for a paper in the journal Computer Networks. He has served as Principal Investigator for numerous funded projects from NSF, DARPA, and other funding agencies.

Christopher M. Kreucher received the BS, MS, and PhD degrees in Electrical Engineering from the University of Michigan in 1997, 1998, and 2005, respectively. He is currently a Senior Systems Engineer at Integrity Applications Incorporated in Ann Arbor, Michigan. His current research interests include nonlinear filtering (specifically particle filtering), Bayesian methods of fusion and multitarget tracking, self localization, information theoretic sensor management, and distributed swarm management.

Alfred O. Hero III received the BS (summa cum laude) from Boston University (1980) and the PhD from Princeton University (1984), both in Electrical Engineering. Since 1984 he has been with the University of Michigan, Ann Arbor, where he is a Professor in the Department of Electrical Engineering and Computer Science and, by courtesy, in the Department of Biomedical Engineering and the Department of Statistics. He has held visiting positions at Massachusetts Institute of Technology (2006), Boston University, I3S University of Nice, Sophia-Antipolis, France (2001), Ecole Normale Superieure de Lyon (1999), Ecole Nationale Superieure des Telecommunications, Paris (1999), Scientific Research Labs of the Ford Motor Company, Dearborn, Michigan (1993), Ecole Nationale Superieure des Techniques Avancees (ENSTA), Ecole Superieure d’Electricite, Paris (1990), and M.I.T. Lincoln Laboratory (1987–1989). His recent research interests have been in areas including inference for sensor networks, adaptive sensing, bioinformatics, inverse problems, and statistical signal and image processing. He is a Fellow of the Institute of Electrical and Electronics Engineers (IEEE), a member of Tau Beta Pi, the American Statistical Association (ASA), the Society for Industrial and Applied Mathematics (SIAM), and the US National Commission (Commission C) of the International Union of Radio Science (URSI). He has received an IEEE Signal Processing Society Meritorious Service Award (1998), an IEEE Signal Processing Society Best Paper Award (1998), an IEEE Third Millennium Medal, and a 2002 IEEE Signal Processing Society Distinguished Lectureship. He was President of the IEEE Signal Processing Society (2006–2007) and during his term served on the TAB Periodicals Committee (2006). He was a member of the IEEE TAB Society Review Committee (2008) and is Director-elect of IEEE for Division IX (2009).

7.
Wireless sensor networks have been widely applied in recent years, and energy consumption is a hot research topic in this field. Meanwhile, as wireless sensor network technology continues to develop, transmission in sensor networks is now commonly carried out at multiple rates, and this multi-rate capability offers an opportunity to further improve the network's energy efficiency. This paper proposes a Markov-decision-process-based scheme for controlling the switching among the multiple transmission rates of a wireless sensor network, so as to make the network more energy efficient. Simulation results show that the network's energy performance is improved without affecting communication quality.

8.
The goal of dialogue management in a spoken dialogue system is to take actions based on observations and inferred beliefs. To ensure that the actions optimize the performance or robustness of the system, researchers have turned to reinforcement learning methods to learn policies for action selection. To derive an optimal policy from data, the dynamics of the system is often represented as a Markov Decision Process (MDP), which assumes that the state of the dialogue depends only on the previous state and action. In this article, we investigate whether constraining the state space by the Markov assumption, especially when the structure of the state space may be unknown, truly affords the highest reward. In simulation experiments conducted in the context of a dialogue system for interacting with a speech-enabled web browser, models under the Markov assumption did not perform as well as an alternative model which classifies the total reward with accumulating features. We discuss the implications of the study as well as its limitations.

9.
This article proposes several two-timescale simulation-based actor-critic algorithms for solution of infinite horizon Markov Decision Processes with finite state-space under the average cost criterion. Two of the algorithms are for the compact (non-discrete) action setting while the rest are for finite-action spaces. On the slower timescale, all the algorithms perform a gradient search over corresponding policy spaces using two different Simultaneous Perturbation Stochastic Approximation (SPSA) gradient estimates. On the faster timescale, the differential cost function corresponding to a given stationary policy is updated and an additional averaging is performed for enhanced performance. A proof of convergence to a locally optimal policy is presented. Next, we discuss a memory efficient implementation that uses a feature-based representation of the state-space and performs TD(0) learning along the faster timescale. The TD(0) algorithm does not follow an on-line sampling of states but is observed to do well on our setting. Numerical experiments on a problem of rate based flow control are presented using the proposed algorithms. We consider here the model of a single bottleneck node in the continuous time queueing framework. We show performance comparisons of our algorithms with the two-timescale actor-critic algorithms of Konda and Borkar (1999) and Bhatnagar and Kumar (2004). Our algorithms exhibit more than an order of magnitude better performance over those of Konda and Borkar (1999).
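A minimal sketch of the SPSA gradient estimate used on the slower (actor) timescale, assuming a routine `average_cost(theta)` that returns a simulation-based estimate of the long-run average cost under the policy parameterized by `theta`; the critic recursion, step-size schedules, and averaging are omitted, so this is an illustration rather than the paper's algorithm:

```python
import numpy as np

def spsa_gradient(theta, average_cost, delta=0.1):
    """One Simultaneous Perturbation Stochastic Approximation (SPSA) gradient
    estimate: perturb all parameters at once with a random +/-1 vector and use
    two cost evaluations, regardless of the parameter dimension."""
    perturbation = np.random.choice([-1.0, 1.0], size=theta.shape)
    c_plus = average_cost(theta + delta * perturbation)
    c_minus = average_cost(theta - delta * perturbation)
    return (c_plus - c_minus) / (2.0 * delta * perturbation)

def actor_step(theta, average_cost, step_size=0.01):
    """Slow-timescale actor update: move the policy parameters against the
    SPSA estimate of the average-cost gradient."""
    return theta - step_size * spsa_gradient(theta, average_cost)
```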
Shalabh Bhatnagar (Corresponding author)

10.
To better solve a special class of agent decision problems, the concept of an action-driven Markov decision process is proposed and its theoretical model is analysed. In addition, solution algorithms for the problems associated with action-driven Markov decision processes are proposed and tested on the close-control dribbling problem in RoboCup simulation 2D matches. Experimental results show that the new algorithm considerably improves the agent's dribbling performance. The new algorithm has been adopted by the USTC Blue Eagle (蓝鹰) RoboCup simulation 2D robot soccer team and has achieved good results in competition.

11.
Several algorithms for learning near-optimal policies in Markov Decision Processes have been analyzed and proven efficient. Empirical results have suggested that Model-based Interval Estimation (MBIE) learns efficiently in practice, effectively balancing exploration and exploitation. This paper presents a theoretical analysis of MBIE and a new variation called MBIE-EB, proving their efficiency even under worst-case conditions. The paper also introduces a new performance metric, average loss, and relates it to its less “online” cousins from the literature.
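A simplified sketch of the MBIE-EB idea of adding an exploration bonus proportional to 1/sqrt(n(s, a)) when planning on the empirical model; the class name, default constants, and the planning loop are illustrative simplifications, not the paper's exact algorithm or its theoretical constants:

```python
import math
from collections import defaultdict

class MBIEEBAgent:
    """Keep empirical transition and reward estimates and add an exploration
    bonus beta / sqrt(n(s, a)) to the estimated reward when solving the model,
    so rarely tried state-action pairs look optimistic and get explored."""

    def __init__(self, states, actions, gamma=0.95, beta=1.0):
        self.states, self.actions = states, actions
        self.gamma, self.beta = gamma, beta
        self.counts = defaultdict(int)                      # n(s, a)
        self.trans = defaultdict(lambda: defaultdict(int))  # n(s, a, s')
        self.rew_sum = defaultdict(float)                   # sum of rewards at (s, a)
        self.Q = defaultdict(float)

    def observe(self, s, a, r, s_next):
        self.counts[(s, a)] += 1
        self.trans[(s, a)][s_next] += 1
        self.rew_sum[(s, a)] += r

    def plan(self, sweeps=50):
        """Value-iteration sweeps on the empirical model with the bonus."""
        for _ in range(sweeps):
            for s in self.states:
                for a in self.actions:
                    n = self.counts[(s, a)]
                    if n == 0:
                        # Untried pairs get an optimistic default value.
                        self.Q[(s, a)] = self.beta / (1.0 - self.gamma)
                        continue
                    r_hat = self.rew_sum[(s, a)] / n
                    bonus = self.beta / math.sqrt(n)
                    expected_next = sum(
                        (c / n) * max(self.Q[(s2, a2)] for a2 in self.actions)
                        for s2, c in self.trans[(s, a)].items())
                    self.Q[(s, a)] = r_hat + bonus + self.gamma * expected_next
```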

12.
Insight into the global structure of a state space is of great help in the analysis of the underlying process. We advocate the use of visualization for this purpose and present a method to visualize the structure of very large state spaces with millions of nodes. The method uses a clustering based on an equivalence relation to obtain a simplified representation, which is used as a backbone for the display of the entire state space. With this visualization we are able to answer questions about the global structure of a state space that cannot easily be answered by conventional methods. We show this by presenting a number of visualizations of real-world protocols.

13.
Weighted Markov decision processes (MDPs) have long been used to model quantitative aspects of systems in the presence of uncertainty. However, much of the literature on such MDPs takes a monolithic approach, by modelling a system as a particular MDP; properties of the system are then inferred by analysis of that particular MDP. In contrast, in this paper we develop compositional methods for reasoning about weighted MDPs, as a possible basis for compositional reasoning about their quantitative behaviour. In particular we approach these systems from a process algebraic point of view. For these we define a coinductive simulation-based behavioural preorder which is compositional in the sense that it is preserved by structural operators for constructing weighted MDPs from components.

14.
In this paper, two single sample path-based recursive approaches for Markov decision problems are proposed. One is based on the simultaneous perturbation approach and can be applied to the general state problem, but its convergence rate is low. In this algorithm, a small perturbation of the current parameters is necessary to obtain another sample path for comparison, but it may worsen the system. Hence, we introduce another approach, which directly estimates the gradient of the performance for optimization by “potential” theory. This algorithm is limited to finite state space systems, but its convergence speed is higher than that of the first one. The gradient estimate can be obtained from the sample path under the current parameters without any perturbation. This approach is more acceptable for practical applications.

15.
To address the uncertainty in the business logic and quality of service of Web services, together with temporal-order and time-window constraints, this paper proposes a method that uses Markov decision theory to solve the optimal policy planning problem in Web service composition. The method first describes the Web service composition as a task network represented by a directed acyclic graph, in which each node represents a task. Each task is realized by a corresponding Web service; the arcs between tasks represent temporal-order constraints, and task execution must satisfy the time-window constraints. On this basis, a Markov decision model of the Web service composition is built, from which the optimal composition policy is obtained.

16.
We generalize and build on the PAC Learning framework for Markov Decision Processes developed in Jain and Varaiya (2006). We consider the reward function to depend on both the state and the action. Both the state and action spaces can potentially be countably infinite. We obtain an estimate for the value function of a Markov decision process, which assigns to each policy its expected discounted reward. This expected reward can be estimated as the empirical average of the reward over many independent simulation runs. We derive bounds on the number of runs needed for the convergence of the empirical average to the expected reward uniformly for a class of policies, in terms of the VC or pseudo-dimension of the policy class. We then propose a framework to obtain an ε-optimal policy from simulation. We provide the sample complexity of such an approach.
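A minimal sketch of the simulation-based estimate underlying this framework: the expected discounted reward of a policy approximated by the empirical average over independent runs. The `simulate_episode` interface and the default constants are assumptions of this sketch:

```python
def estimate_policy_value(policy, simulate_episode, gamma=0.95,
                          horizon=200, num_runs=1000):
    """Estimate the expected discounted reward of `policy` as the empirical
    average over independent simulation runs. `simulate_episode(policy, horizon)`
    is assumed to return the list of rewards collected along one trajectory."""
    total = 0.0
    for _ in range(num_runs):
        rewards = simulate_episode(policy, horizon)
        total += sum((gamma ** t) * r for t, r in enumerate(rewards))
    return total / num_runs
```

The sample-complexity bounds in the paper answer how large `num_runs` must be for such averages to be uniformly close to the true values over an entire policy class.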

17.
This paper considers a Markov decision process in Borel state and action spaces with the aggregated (or iterated) coherent risk measure to be minimised. For this problem, we establish the Bellman optimality equation as well as the value and policy iteration algorithms, and show the existence of a deterministic stationary optimal policy. The cost function, while being allowed to be unbounded from below (in the sense that its negative part needs to be bounded by some nonnegative real-valued, possibly unbounded weight function), can be arbitrarily unbounded from above and possibly infinitely valued.

18.
It is known that performance potentials (or equivalently, perturbation realization factors) can be used as building blocks for performance sensitivities of Markov systems. In parameterized systems, changes in parameters may only affect some states, and the explicit transition probability matrix may not be known. In this paper, we use an example to show that potentials can be used to construct performance sensitivities in a more flexible way: only the potentials at the affected states need to be estimated, and the transition probability matrix need not be known. Policy iteration algorithms, which are simpler than the standard one, can be established.

19.
For a countable-state Markov decision process we introduce an embedding which produces a finite-state Markov decision process. The finite-state embedded process has the same optimal cost, and moreover, it has the same dynamics as the original process when restricting to the approximating set. The embedded process can be used as an approximation which, being finite, is more convenient for computation and implementation.

20.
Honeypots, which are traps designed to resemble easy-to-compromise computer systems, have become essential tools for security professionals and researchers because of their significant contribution to disclosing the underworld of cybercrimes. However, recent years have witnessed the development of several anti-honeypot technologies. Botmasters can exploit the fact that honeypots should not participate in illegal actions by commanding the compromised machine to act maliciously against specific targets which are used as sensors to measure the execution of these commands. A machine that does not allow the execution of such attacks is more likely to be a honeypot. Consequently, honeypot operators need to choose the optimal response that balances between being disclosed and being liable for participating in illicit actions. In this paper, we consider the optimal response strategy for honeypot operators. In particular, we model the interaction between botmasters and honeypots by a Markov Decision Process (MDP) and then determine the optimal policy for honeypots responding to the commands of botmasters. The model is then extended using a Partially Observable Markov Decision Process (POMDP) which allows operators of honeypots to model the uncertainty of the honeypot state as determined by botmasters. The analysis of our model confirms that exploiting the legal liability of honeypots allows botmasters to have the upper hand in their conflict with honeypots. Despite this deficiency in current honeypot designs, our model can help operators of honeypots determine the optimal strategy for responding to botmasters’ commands. We also provide simulation results that show the honeypots’ optimal response strategies and their expected rewards under different attack scenarios.
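For reference, a generic value iteration routine of the kind used to compute an optimal response policy from such an MDP model; the array layout is a placeholder and does not encode the paper's honeypot/botmaster model:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Generic value iteration for a finite MDP.
    P[a, s, s'] are transition probabilities, R[a, s] expected rewards.
    Returns the optimal values and a greedy (optimal) deterministic policy."""
    num_actions, num_states, _ = P.shape
    V = np.zeros(num_states)
    while True:
        # Q[a, s] = R[a, s] + gamma * sum_{s'} P[a, s, s'] V[s']
        Q = R + gamma * np.einsum("ast,t->as", P, V)
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)
        V = V_new
```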
