Simulation optimization algorithm for semi-Markov decision processes under randomized stationary policies
Cite this article: DAI Gui-ping, TANG Hao, XI Hong-sheng. Simulation optimization algorithm for SMDPs with parameterized randomized stationary policies[J]. Control Theory & Applications, 2006, 23(4): 547-551.
Authors: DAI Gui-ping, TANG Hao, XI Hong-sheng
Affiliations: College of Electronic Information and Control Engineering, Beijing University of Technology, Beijing 100022, China; Department of Automation, University of Science and Technology of China, Hefei, Anhui 230027, China; Department of Computer Science, Hefei University of Technology, Hefei, Anhui 230009, China
Foundation items: National Natural Science Foundation of China (60274012); Doctoral Research Start-up Foundation of Beijing University of Technology (00194)
Abstract: Based on performance potential theory and the equivalent-Markov-process method, a simulation optimization algorithm is studied for a class of semi-Markov decision processes (SMDPs) under parameterized randomized stationary policies, and the convergence of the algorithm is briefly analyzed. Through the equivalent Markov process of the SMDP, a uniformized Markov chain is defined; the gradient of the average-cost performance criterion with respect to the policy parameters is then estimated from a single sample path of this uniformized chain, so as to search for an optimal (or suboptimal) policy. The algorithm uses a neural network to approximate the parameterized randomized stationary policy, which saves memory and avoids the "curse of dimensionality", making it suitable for performance optimization of systems with large state spaces. Finally, a simulation example illustrates the application of the algorithm.
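For context, the uniformization step mentioned above is, in its standard textbook form, the following (a sketch in notation of our own choosing; the paper's exact construction may differ). If the equivalent Markov process has infinitesimal generator $A = [a_{ij}]$, pick a uniformization rate $\lambda \ge \max_i \lvert a_{ii} \rvert$ and define the discrete-time transition matrix

\[
P^{(\lambda)} \;=\; I + \tfrac{1}{\lambda} A .
\]

Since $P^{(\lambda)}$ has the same stationary distribution $\pi$ as the original process, the average-cost criterion $\eta = \pi f$ (with $f$ the cost vector) can be estimated along a single sample path of $P^{(\lambda)}$.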

Keywords: randomized stationary policy; equivalent Markov process; uniformized Markov chain; neuro-dynamic programming; simulation optimization
Article ID: 1000-8152(2006)04-0547-05
Received: 2004-10-10
Revised: 2005-10-21

Simulation optimization algorithm for SMDPs with parameterized randomized stationary policies
DAI Gui-ping,TANG Hao,XI Hong-sheng.Simulation optimization algorithm for SMDPs with parameterized randomized stationary policies[J].Control Theory & Applications,2006,23(4):547-551.
Authors:DAI Gui-ping  TANG Hao  XI Hong-sheng
Affiliation: College of Electronic Information and Control Engineering, Beijing University of Technology, Beijing 100022, China; Department of Automation, University of Science and Technology of China, Hefei, Anhui 230027, China; Department of Computer Science, Hefei University of Technology, Hefei, Anhui 230009, China
Abstract: Based on the theory of performance potentials and the method of equivalent Markov processes, the performance optimization problem is discussed for a class of semi-Markov decision processes (SMDPs) with parameterized randomized stationary policies, and a simulation optimization algorithm is proposed. First, a uniformized Markov chain is defined through the equivalent Markov process. Second, the gradient of the average-cost performance with respect to the policy parameters is estimated by simulating a single sample path of the uniformized Markov chain, so that an optimal (or suboptimal) randomized stationary policy can be found by iterating the parameters. An artificial neural network is used to approximate the parameterized randomized stationary policies and to avoid the curse of dimensionality, so the derived algorithm can meet the performance-optimization requirements of many systems with large state spaces. Finally, convergence of the algorithm with probability one along an infinite sample path is considered, and a numerical example is provided to illustrate the application of the algorithm.
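To make the single-sample-path idea concrete, here is a minimal, self-contained sketch of a likelihood-ratio (policy-gradient) estimator of the average-cost gradient on an already-uniformized chain. Everything in it is an illustrative assumption, not the paper's method: a tabular softmax policy stands in for the authors' neural-network parameterization, the arrays P and c are synthetic, and the estimator follows the generic Marbach-Tsitsiklis style of single-path gradient estimation.

import numpy as np

# All names and data below are illustrative assumptions, not the paper's.
rng = np.random.default_rng(0)
S, A = 5, 3                                  # number of states / actions
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a, :] next-state probabilities
c = rng.uniform(0.0, 1.0, size=(S, A))       # one-step costs of the uniformized chain

def policy(theta, s):
    # Softmax randomized stationary policy over actions in state s.
    z = theta[s] - theta[s].max()
    p = np.exp(z)
    return p / p.sum()

def grad_log_policy(theta, s, a):
    # Score function: gradient of log pi_theta(a|s) with respect to theta.
    g = np.zeros_like(theta)
    g[s] = -policy(theta, s)
    g[s, a] += 1.0
    return g

def estimate_gradient(theta, T=200000, beta=0.99):
    # Estimate d(eta)/d(theta) from ONE sample path of the uniformized chain.
    s, eta = 0, 0.0
    z = np.zeros_like(theta)       # discounted eligibility trace
    grad = np.zeros_like(theta)
    for t in range(1, T + 1):
        a = rng.choice(A, p=policy(theta, s))
        cost = c[s, a]
        eta += (cost - eta) / t                # running average cost
        z = beta * z + grad_log_policy(theta, s, a)
        grad += ((cost - eta) * z - grad) / t  # running gradient estimate
        s = rng.choice(S, p=P[s, a])           # next state of the chain
    return grad, eta

A gradient-descent loop would then iterate theta <- theta - step * grad, re-estimating the gradient on a fresh (or continuing) sample path at each step; the neural-network version of the abstract replaces the tabular theta with network weights and grad_log_policy with backpropagated scores.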
Keywords: randomized stationary policies; equivalent Markov process; uniformized Markov chain; neuro-dynamic programming; simulation optimization
This article is indexed by CNKI, VIP (Weipu), Wanfang Data, and other databases.