Similar Literature
20 similar documents retrieved.
1.
Research on Average-Reward Reinforcement Learning Algorithms   (Total citations: 7; self-citations: 0; citations by others: 7)
高阳  周如益  王皓  曹志新 《计算机学报》2007,30(8):1372-1378
Sequential decision problems are commonly modeled as Markov decision processes (MDPs). When decisions extend from discrete time points to continuous time, the classical MDP model generalizes to the semi-Markov decision process (SMDP) model. When the system parameters are unknown, reinforcement learning is used to learn an optimal policy. Based on performance-potential theory, this paper proves an approximation theorem for average-reward reinforcement learning. By approximating the performance-potential value function relative to a reference state, a new average-reward reinforcement learning algorithm, G-learning, is developed. G-learning applies to both MDPs and SMDPs. Unlike the classical R-learning algorithm, G-learning replaces the relative value function (values relative to the average reward) with the performance-potential value function relative to a reference state. In simulation experiments on customer admission control and production-inventory control, G-learning outperforms both R-learning and SMART.
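For orientation, here is a minimal tabular sketch of the classical R-learning update that the abstract compares against; the environment interface, step sizes, and exploration scheme are illustrative assumptions, and the G-learning substitution of a reference-state performance potential is only noted in the docstring, not implemented.

```python
import numpy as np

def r_learning(env, n_states, n_actions, alpha=0.1, beta=0.01,
               epsilon=0.1, steps=100_000, seed=0):
    """Tabular R-learning (Schwartz-style) for average-reward MDPs.

    Q[s, a] estimates the relative action value and rho the long-run
    average reward. The abstract's G-learning instead estimates the
    performance potential relative to a fixed reference state; that
    substitution is not reproduced here.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    rho = 0.0
    s = env.reset()                    # assumed interface: returns a state index
    for _ in range(steps):
        greedy = int(np.argmax(Q[s]))
        a = rng.integers(n_actions) if rng.random() < epsilon else greedy
        s_next, r = env.step(a)        # assumed interface: (next state, reward)
        delta = r - rho + Q[s_next].max() - Q[s, a]
        Q[s, a] += alpha * delta
        if a == greedy:                # update the average-reward estimate on greedy steps
            rho += beta * delta
        s = s_next
    return Q, rho
```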

2.
Performance-Potential-Based Average-Cost Optimal Policies for Markov Control Processes   (Total citations: 2; self-citations: 1; citations by others: 2)
This paper studies the average-cost optimal control problem for a class of discrete-time Markov control processes. Using basic properties of the Markov performance potential, under fairly general assumptions it directly derives the optimality equation of the infinite-horizon average-cost model over compact action sets, together with an existence theorem for its solution. An iterative algorithm for computing an optimal stationary policy is proposed and its convergence is discussed. Finally, an example is analyzed to illustrate the application of the algorithm.
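For reference, the average-cost optimality equation referred to here (and in several later abstracts) is usually written as follows, with average cost g, potential (relative value) function h, one-step cost c, and transition kernel P; the notation is a standard choice rather than the paper's own:

```latex
g + h(x) \;=\; \min_{a \in A(x)} \Big\{\, c(x,a) + \sum_{y \in S} P(y \mid x,a)\, h(y) \,\Big\}, \qquad x \in S,
```

and any stationary policy attaining the minimum is average-cost optimal.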

3.
For Markov decision processes (MDPs) with a countable state space, an optimal (stationary) policy need not exist under the average criterion. This paper studies optimal policies that satisfy optimality inequalities in countable-state MDPs under the average criterion. Unlike the vanishing-discount-factor approach, the main results are derived using a discrete Dynkin formula. First, the Poisson equation for ergodic Markov chains and two examples of null-recurrent Markov chains are given, and the existence of optimal policies satisfying optimality inequalities in two opposite directions is proved. Second, via two comparison lemmas and a performance-difference formula, the existence of optimal policies for positive-recurrent and multichain models is proved and further extended to other settings. In particular, several application examples illustrate the performance-sensitive nature of the average criterion. These results complete the theory of optimality inequalities for countable-state MDPs under the average criterion.
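As a reminder of the objects named above (standard notation, assumed here rather than taken from the paper): for a stationary policy f with average cost g and potential h, the Poisson equation and the two opposite-direction optimality inequalities read

```latex
g + h(x) = c(x, f(x)) + \sum_{y} P(y \mid x, f(x))\, h(y), \qquad
g + h(x) \;\gtrless\; \min_{a}\Big\{ c(x,a) + \sum_{y} P(y \mid x,a)\, h(y) \Big\}.
```

The abstract concerns the existence of optimal policies satisfying one of these two inequalities when the optimality equation itself may fail to hold.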

4.
This paper discusses the control of information flow to parallel servers with different service rates in a discrete-time system. A fuzzy controller is built to determine the optimal policy for allocating information flow in the queueing system so that the mean sojourn time of customers in the communication system is minimized. Simulation results demonstrate the effectiveness of the fuzzy controller.
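The abstract does not spell out the controller's rule base, so the following is only a toy Python sketch of a fuzzy dispatch rule for parallel servers; the membership function, the weights, and the "slowness" term are illustrative assumptions, not the paper's design.

```python
def long_queue_membership(queue_len, capacity=20):
    """Degree to which a queue is 'long' (simple linear membership in [0, 1])."""
    return min(max(queue_len / capacity, 0.0), 1.0)

def fuzzy_dispatch(queues, service_rates):
    """Route the next arrival to the server with the lowest fuzzy congestion score.

    Congestion aggregates 'the queue is long' with 'the server is slow'
    (slowness taken as one minus the normalized service rate).
    """
    max_rate = max(service_rates)
    scores = []
    for q, mu in zip(queues, service_rates):
        long_q = long_queue_membership(q)
        slow = 1.0 - mu / max_rate            # 0 for the fastest server
        scores.append(0.7 * long_q + 0.3 * slow)  # simple weighted aggregation
    return scores.index(min(scores))          # index of the chosen server

# Example: two servers with rates 2.0 and 1.0, current queue lengths 3 and 1
print(fuzzy_dispatch([3, 1], [2.0, 1.0]))     # -> 0 (the faster server wins despite its longer queue)
```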

5.
胡圣波  张建瑞 《计算机工程》2009,35(13):81-83,8
This paper describes the basic components and functions of autonomic communication, viewing the autonomic communication system as a special feedback control system composed of four units: collection, analysis, decision, and action, with emphasis on the decision unit. A QoS-based Markov optimal decision policy with multiple control variables and multiple observation variables is studied. Simulation results show that the Markov optimal decision policy improves the system's ability to adapt to changes in the network environment.

6.
高江  戴冠中 《自动化学报》1995,21(6):691-695
This paper studies minimum-variance filtering and smoothing for a class of noncausal systems. By introducing an equivalence theorem between forward and backward Markov processes, conditions under which the equivalent system achieves optimal filtering are obtained, and optimal filtering and smoothing algorithms are given. The algorithm can be applied to a range of practical problems.

7.
In problem solving, humans often work at two levels: first grasping the problem as a whole, i.e., proposing a rough plan, and then carrying it out concretely. In other words, humans are an excellent example of a multi-resolution intelligent system: they can generalize bottom-up across several levels (the granularity of viewing the problem becomes "coarser", akin to abstraction) and instantiate top-down (the view becomes "finer", akin to concretization). On this basis, a semi-Markov decision process is constructed from Markov decision processes running separately on two levels (the ideal space, i.e., generalization, and the actual space, i.e., instantiation), called the joint bi-level Markov decision process model. The optimal-policy algorithm for this joint model is then discussed, and an example is given showing that the joint bi-level Markov decision model can economize "thought" and offers a good compromise between computational efficiency and feasibility.

8.
Using the Markov chain from (1), this paper designs a FORTRAN program to compute the average time for a stock price to rise or fall, and, using Markov decision theory, implements a practical program for an optimal stock buy/sell strategy.
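The paper's FORTRAN code is not reproduced here; the following is a small Python sketch of the underlying computation, the mean first-passage time of a Markov chain to a target state. The three-state "up / flat / down" transition matrix is purely an illustrative assumption.

```python
import numpy as np

def mean_hitting_time(P, target):
    """Expected number of steps to first reach `target` from every other state.

    Solves (I - Q) m = 1, where Q is the transition matrix restricted to the
    non-target states; this is the standard first-passage-time linear system.
    """
    n = P.shape[0]
    keep = [i for i in range(n) if i != target]
    Q = P[np.ix_(keep, keep)]
    m = np.linalg.solve(np.eye(len(keep)) - Q, np.ones(len(keep)))
    times = np.zeros(n)
    times[keep] = m          # the target state itself has hitting time 0
    return times

# Illustrative 3-state chain: 0 = price up, 1 = flat, 2 = price down
P = np.array([[0.5, 0.3, 0.2],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])
print(mean_hitting_time(P, target=0))   # expected steps until an 'up' state
```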

9.
To address excessive energy consumption and unsatisfactory processing speed when cloud computing systems execute tasks, this paper proposes a distributed optimal storage and allocation strategy for multi-path networked streaming media based on cloud computing. A cloud-computing-based mathematical model is built to analyze energy consumption during task execution. A distributed optimal storage strategy based on a virtual scheduling mechanism reduces the total energy consumed by servers for storage while meeting storage demand, minimizing storage cost, and an allocation strategy based on dynamic decision rules schedules tasks according to server power, performance, and load, so that the cloud system makes full use of operating energy and avoids excessive idle energy consumption while meeting quality-of-service requirements. Experiments and analysis show that the proposed distributed optimal storage and allocation strategy performs well in saving energy and increasing running speed.

10.
This paper studies optimal stationary policies for a class of Markov control processes with countable state space under the infinite-horizon average-cost criterion. For such processes, a discounted Poisson equation is introduced; using the infinitesimal generator matrix and basic properties of the performance potential, the optimality equation of the average-cost model over compact action sets is derived, and an existence theorem for its solution is proved.
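For continuous-time models of this kind, the generator-based form of the average-cost optimality equation is commonly written as follows (standard notation, assumed here rather than taken from the paper): with transition rates q(y | x, a), cost rate c, average cost g, and potential h,

```latex
g \;=\; \min_{a \in A(x)} \Big\{\, c(x,a) + \sum_{y \in S} q(y \mid x, a)\, h(y) \,\Big\}, \qquad x \in S .
```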

11.
We consider Markov decision processes with denumerable state space and finite control sets; the performance index of a control policy is a long-run expected average cost criterion and the cost function is bounded below. For these models, the existence of average optimal stationary policies was recently established in [11] under very general assumptions. Such a result was obtained via an optimality inequality. Here, we use a simple example to prove that the conditions in [11] do not imply the existence of a solution to the average cost optimality equation.

12.
This article deals with multiconstrained continuous-time Markov decision processes in a denumerable state space, with unbounded cost and transition rates. The criterion to be optimised is the long-run expected average cost, and several kinds of constraints are imposed on some associated costs. The existence of a constrained optimal policy is ensured under suitable conditions by using a martingale technique and introducing an occupation measure. Furthermore, for the unichain model, we transform this multiconstrained problem into an equivalent linear programming problem, then construct a constrained optimal policy from an optimal solution of the linear program. Finally, we use an example of a controlled queueing system to illustrate an application of our results.
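For the unichain case mentioned above, the linear-programming reformulation via occupation measures typically looks as follows (a standard form, assumed rather than quoted from the article): with occupation measure μ over state–action pairs, cost rates c and c_1, …, c_p, constraint levels d_1, …, d_p, and transition rates q,

```latex
\begin{aligned}
\min_{\mu \ge 0}\quad & \sum_{x,a} c(x,a)\,\mu(x,a) \\
\text{s.t.}\quad & \sum_{x,a} q(y \mid x,a)\,\mu(x,a) = 0 \quad \forall\, y \in S, \\
& \sum_{x,a} \mu(x,a) = 1, \qquad
\sum_{x,a} c_i(x,a)\,\mu(x,a) \le d_i, \quad i = 1,\dots,p .
\end{aligned}
```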

13.
Necessary conditions are given for the existence of a bounded solution to the optimality equation arising in Markov decision processes, under a long-run, expected average cost criterion. The relationships of some of our results to known sufficient conditions are also shown.

14.
This note deals with continuous-time Markov decision processes with a denumerable state space and the average cost criterion. The transition rates are allowed to be unbounded, and the action set is a Borel space. We give a new set of conditions under which the existence of optimal stationary policies is ensured by using the optimality inequality. Our results are illustrated with a controlled queueing model. Moreover, we use an example to show that our conditions do not imply the existence of a solution to the optimality equations in the previous literature.

15.
We consider average reward Markov decision processes with discrete time parameter and denumerable state space. We are concerned with the following problem: Find necessary and sufficient conditions so that, for arbitrary bounded reward function, the corresponding average reward optimality equation has a bounded solution. This problem is solved for a class of systems including the case in which, under the action of any stationary policy, the state space is an irreducible positive recurrent class.

16.
This paper deals with discrete-time Markov control processes with Borel state space, allowing unbounded costs and noncompact control sets. For these models, the existence of average optimal stationary policies has been recently established under very general assumptions, using an optimality inequality. Here we give a condition, which is a strengthened version of a variant of the 'vanishing discount factor' approach, for the optimality equation to hold.
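The 'vanishing discount factor' device referred to above is usually set up as follows (standard notation, an assumption here rather than the paper's own): with discounted value function V_α, a fixed reference state z, and α ↑ 1,

```latex
h_\alpha(x) := V_\alpha(x) - V_\alpha(z), \qquad
g := \lim_{\alpha \uparrow 1} (1-\alpha)\, V_\alpha(z), \qquad
h(x) := \liminf_{\alpha \uparrow 1} h_\alpha(x),
```

and one shows that the pair (g, h) satisfies the average-cost optimality inequality, or, under stronger conditions, the optimality equation itself.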

17.
For an assemble-to-order system with two components and three classes of customer demand, this paper studies the joint component production control and inventory allocation problem. Under the assumptions that each class of demand arrives according to a Poisson process and that each component's processing time is exponentially distributed, a Markov decision model of infinite-horizon discounted total cost is formulated; the corresponding normalized discrete optimality equation is obtained via the Lippman transformation, and the structural properties of the jointly optimal production and allocation policy are analyzed on this basis. The paper proves that the optimal policy is a dynamic, state-dependent policy. The optimal production policy for a component is a dynamic base-stock policy, where the base-stock level is a nondecreasing function of the other component's inventory level. The optimal allocation policy is a dynamic threshold policy: for demand that requires only one component, the allocation threshold for that component is an increasing function of the other component's inventory level; for demand that requires both components, each component's allocation threshold is a decreasing function of the other component's inventory level. Finally, numerical experiments show how the various parameters affect the jointly optimal control policy, and corresponding managerial insights are obtained.
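To make the structural result concrete, here is a small Python sketch of what a state-dependent base-stock production rule and a threshold allocation rule look like; the particular base-stock and threshold functions below are placeholders, since the paper characterizes them through the value function rather than in closed form.

```python
def produce_component(i, inventory, base_stock):
    """Dynamic base-stock rule: produce component i only while its inventory
    is below a base-stock level that depends on the other component's stock."""
    other = inventory[1 - i]
    return inventory[i] < base_stock[i](other)

def fill_single_demand(i, inventory, threshold):
    """Threshold rationing for a demand that needs only component i:
    satisfy it only if inventory of i exceeds a state-dependent threshold."""
    other = inventory[1 - i]
    return inventory[i] > threshold[i](other)

# Placeholder policy parameters (illustrative only):
base_stock = {0: lambda y: 3 + min(y, 2),      # nondecreasing in the other stock
              1: lambda x: 2 + min(x, 3)}
threshold  = {0: lambda y: 1 + (y >= 2),       # increasing in the other stock
              1: lambda x: 1 + (x >= 2)}

inventory = [2, 1]
print(produce_component(0, inventory, base_stock))   # True: 2 < 3 + min(1, 2) = 4
print(fill_single_demand(0, inventory, threshold))   # True: 2 > 1 + 0 = 1
```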

18.
This paper mainly discusses an agent reinforcement-learning strategy based on dynamic fuzzy sets, covering the objective of agent reinforcement learning, state-value and action-value functions, optimization of Markov decision processes, and learning strategies.
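The state-value and action-value functions mentioned above are the standard objects of a discounted MDP; with reward r, discount factor γ, transition kernel P, and policy π, they satisfy (notation assumed)

```latex
V^{\pi}(s) = \sum_{a} \pi(a \mid s) \Big[ r(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V^{\pi}(s') \Big],
\qquad
Q^{\pi}(s,a) = r(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V^{\pi}(s').
```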

19.
This paper focuses on bias optimality in unichain Markov decision processes with finite state and action spaces. Using relative value functions, we present methods for evaluating optimal bias; this leads to a probabilistic analysis which transforms the original reward problem into a minimum average cost problem. The result is an explanation of how and why bias implicitly discounts future rewards.
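For reference, the bias of a stationary policy π with gain (average reward) g^π is commonly defined as follows (standard definition, stated here for an aperiodic unichain model; a Cesàro limit is used in general):

```latex
h^{\pi}(x) \;=\; \lim_{N \to \infty} \mathbb{E}^{\pi}_{x}\!\left[\, \sum_{t=0}^{N-1} \big( r(x_t, a_t) - g^{\pi} \big) \right].
```

Bias optimality then selects, among gain-optimal policies, those that also maximize this accumulated deviation of rewards from the long-run average.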

20.
In this paper, we apply two methods to derive necessary and sufficient decentralized optimality conditions for stochastic differential decision problems with multiple Decision Makers (DMs), which aim at optimizing a common pay-off, based on the notions of decentralized global optimality and decentralized person-by-person (PbP) optimality. Method 1: We utilize the stochastic maximum principle to derive necessary and sufficient conditions which consist of forward and backward Stochastic Differential Equations (SDEs), and conditional variational Hamiltonians, conditioned on the information structures of the DMs. The sufficient conditions for decentralized PbP optimality are local conditions, closely related to the necessary conditions for decentralized PbP optimality. However, under a certain convexity condition on the Hamiltonian, and a global version of the sufficient conditions for decentralized PbP optimality, we show decentralized global optimality. Method 2: We utilize the value processes of decentralized PbP optimal policies, relate them to solutions of backward SDEs, identify sufficient conditions for decentralized PbP optimality, and show that these are precisely the conditions derived via the maximum principle. For both methods, as usual, we utilize Girsanov's theorem to transform the initial decentralized stochastic optimal decision problems into equivalent decentralized stochastic optimal decision problems on a reference probability space, in which the controlled process and the information processes which generate part of the information structures of the DMs are independent of any of the decisions.
