Similar Documents
19 similar documents found.
1.
Optimal Policies for Continuous-Time MCPs on Compact Action Sets   Cited by: 8 (self-citations: 2, citations by others: 8)
This paper studies the optimal control problem for a class of continuous-time Markov control processes (CTMCPs) under the infinite-horizon average-cost criterion. Using the infinitesimal generator and the basic properties of performance potentials, the optimality equation of the average-cost model on compact action sets and an existence theorem for its solutions are derived directly. A numerical iteration algorithm for computing ε-optimal stationary control policies is proposed, and its convergence is proved. Finally, a numerical example is analysed to illustrate the application of the method.
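For orientation (an editorial gloss, not text from the paper), the average-cost optimality equation in this setting is usually written with the transition rates q(j|i,a) taken from the infinitesimal generator and the performance potential g as

$$ \eta^{*} = \min_{a\in A(i)} \Big\{ f(i,a) + \sum_{j\in S} q(j\mid i,a)\, g(j) \Big\}, \qquad i \in S, $$

where η* is the optimal average cost and f(i,a) the cost rate; an ε-optimal stationary policy then selects, in each state i, an action whose right-hand side lies within ε of the minimum.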

2.
Performance-Potential-Based Optimality Equations for a Class of Controlled Closed Queueing Networks   Cited by: 1 (self-citations: 0, citations by others: 1)
This paper studies performance optimization for a class of controlled closed queueing network systems. Two basic concepts are introduced: the α-discounted-cost performance potential and the average-cost performance potential, and a relation between the two is discussed. Under general assumptions, the basic properties of performance potentials are used to establish directly the optimality equation of the infinite-horizon average-cost model, and the existence of optimal solutions on compact sets is proved. Finally, an iterative policy optimization algorithm is given, and a practical example illustrates its effectiveness.

3.
Design of Guaranteed-Cost Iterative Learning Algorithms Based on the LMI Method   Cited by: 4 (self-citations: 0, citations by others: 4)
This paper studies the design and optimization of performance-based iterative learning algorithms. A quadratic performance function in the iteration domain is first defined, and an iteration-domain optimal iterative learning algorithm is given for linear discrete-time systems. Based on the linear matrix inequality (LMI) method, a guaranteed-cost iterative learning algorithm and its optimization procedure are then given for uncertain linear discrete-time systems. For both classes of algorithms, the convergence rate of iterative learning can be tuned simply by adjusting the weighting matrices in the performance function. Moreover, the design and optimization of the guaranteed-cost iterative learning algorithm can be solved conveniently with the MATLAB toolbox.
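A minimal sketch of the kind of iteration-domain quadratic-cost (norm-optimal) ILC update described here, for a nominal linear discrete-time SISO plant in lifted form. The plant matrices, reference, and weights below are illustrative assumptions, not values from the paper, and the LMI-based guaranteed-cost design for the uncertain case is not reproduced.

```python
import numpy as np

# Illustrative plant (not from the paper): x_{t+1} = A x_t + B u_t, y_t = C x_t
A = np.array([[1.0, 0.1], [0.0, 0.9]])
B = np.array([[0.0], [0.1]])
C = np.array([[0.0, 1.0]])

N = 50                                        # trial length
# Lifted map y = G u over one trial (x_0 = 0): y[i] is the output at time i+1
G = np.zeros((N, N))
for i in range(N):
    for j in range(i + 1):
        G[i, j] = (C @ np.linalg.matrix_power(A, i - j) @ B).item()

y_d = np.sin(np.linspace(0, 2 * np.pi, N))    # reference trajectory (illustrative)
Q = np.eye(N)                                 # tracking-error weight
R = 0.1 * np.eye(N)                           # input-change weight; tunes convergence speed

u = np.zeros(N)
for trial in range(20):
    e = y_d - G @ u                           # trial error
    # Norm-optimal update: minimises e'Qe + (u_new - u)'R(u_new - u) over the next trial
    u = u + np.linalg.solve(G.T @ Q @ G + R, G.T @ Q @ e)
    print(f"trial {trial:2d}  ||e|| = {np.linalg.norm(e):.3e}")
```

Increasing R slows (and robustifies) the trial-to-trial convergence, which mirrors the abstract's point that the weighting matrices in the performance function tune the learning speed.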

4.
蔡霞, 马社祥, 孟鑫. 《计算机应用研究》, 2012, 29(11): 4232-4234
To address the sharp growth in computational complexity and the rapid degradation in performance of traditional algorithms when processing large-scale signals in sensor networks, a heuristic synchronous adaptive iterative thresholding reconstruction algorithm is proposed. A heuristic error-control function selects the least-cost direction and shrinks row by row, synchronously approaching the optimal solution, combined with a nonlinear thresholding function determined by an adaptively decreasing power-exponent parameter to further check and correct the reconstructed signal. Simulation results show that the algorithm reconstructs the signal with fewer measurements and fewer iterations, improving the signal-to-noise ratio by 60 dB.
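The paper's heuristic synchronous adaptive variant is not spelled out in the abstract; as a baseline for comparison, a plain iterative hard thresholding (IHT) recovery loop of the kind such schemes refine looks roughly like the sketch below. The problem sizes, sparsity level, and step-size rule are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m, s = 256, 96, 10                          # signal length, measurements, sparsity (illustrative)
x_true = np.zeros(n)
support = rng.choice(n, s, replace=False)
x_true[support] = rng.standard_normal(s)

A = rng.standard_normal((m, n)) / np.sqrt(m)   # random sensing matrix
y = A @ x_true                                 # noiseless measurements, for simplicity

def hard_threshold(v, k):
    """Keep the k largest-magnitude entries of v, zero the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

x = np.zeros(n)
mu = 1.0 / np.linalg.norm(A, 2) ** 2           # conservative fixed step size
for it in range(300):
    # Gradient step on ||y - Ax||^2 followed by projection onto s-sparse vectors
    x = hard_threshold(x + mu * A.T @ (y - A @ x), s)

print("relative error:", np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
```

With a fixed conservative step the loop typically needs many iterations; adaptive thresholding schemes such as the one summarised above aim precisely at cutting that iteration count for large-scale signals.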

5.
An intelligent algorithm based on discounted generalised value iteration is designed to solve the optimal tracking control problem for a class of complex nonlinear systems. By choosing a suitable initial value, the cost function generated during value iteration converges to the optimal cost function in a monotonically decreasing manner. Based on this monotonically decreasing value iteration algorithm, the admissibility of the iterative tracking control laws and the asymptotic stability of the error system are discussed under different discount factors. To facilitate implementation of the algorithm, a data-driven model network is built to ...
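For context (editorial, not text from the paper), discounted value iteration in this ADP setting updates a cost function and a control law of the form

$$ V_{i+1}(x)=\min_{u}\big\{U(x,u)+\gamma V_{i}(F(x,u))\big\},\qquad u_{i}(x)=\arg\min_{u}\big\{U(x,u)+\gamma V_{i}(F(x,u))\big\}, $$

where F denotes the (transformed) error dynamics, U the utility, and γ the discount factor; initialising with a sufficiently large V_0, one that dominates its own Bellman update, is what produces the monotonically decreasing convergence the abstract refers to.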

6.
Performance-Potential-Based Average-Cost Optimal Policies for Markov Control Processes   Cited by: 2 (self-citations: 1, citations by others: 2)
This paper studies the optimal control problem for a class of discrete-time Markov control processes under the average-cost criterion. Applying the basic properties of Markov performance potentials, the optimality equation of the infinite-horizon average-cost model on compact action sets and an existence theorem for its solutions are derived directly under very general assumptions. An iterative algorithm for computing optimal stationary control policies is proposed and its convergence is discussed. Finally, an example is analysed to illustrate the application of the algorithm.
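A minimal tabular sketch, on a made-up three-state example, of average-cost policy iteration with performance potentials in the spirit of the algorithm described here: policy evaluation solves the Poisson equation for the potential g, and policy improvement minimizes f(i,a) + Σ_j P(j|i,a) g(j). The transition probabilities and costs are illustrative assumptions.

```python
import numpy as np

# Toy finite MDP (illustrative): P[a, i, j] = transition probability, f[a, i] = one-step cost
S, A = 3, 2
P = np.array([
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.3, 0.5]],   # action 0
    [[0.3, 0.4, 0.3], [0.5, 0.4, 0.1], [0.1, 0.1, 0.8]],   # action 1
])
f = np.array([
    [2.0, 1.0, 3.0],    # costs under action 0
    [1.5, 2.5, 0.5],    # costs under action 1
])

def evaluate(policy):
    """Solve the Poisson equation (I - P_d) g + eta * 1 = f_d with g[0] = 0."""
    Pd = P[policy, np.arange(S)]          # S x S transition matrix of the policy
    fd = f[policy, np.arange(S)]
    # Unknowns: g[1], ..., g[S-1] and eta (g[0] pinned to 0 for uniqueness)
    M = np.hstack([(np.eye(S) - Pd)[:, 1:], np.ones((S, 1))])
    sol = np.linalg.solve(M, fd)
    return np.concatenate([[0.0], sol[:-1]]), sol[-1]

policy = np.zeros(S, dtype=int)
while True:
    g, eta = evaluate(policy)
    # Improvement: minimize the right-hand side of the optimality equation state by state
    q = f + P @ g                          # q[a, i] = f(i,a) + sum_j P(j|i,a) g(j)
    new_policy = q.argmin(axis=0)
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy

print("optimal average cost:", eta, " policy:", policy)
```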

7.
This paper studies the application of iterative optimization methods to node localization in wireless sensor networks. For multidimensional scaling (MDS) localization and the classical gradient iteration method, a functional relationship between the iteration step size and the network connectivity is determined through numerical experiments, and a connectivity-based distributed weighted multidimensional scaling node localization algorithm, dwMDS(C), is proposed. The algorithm first determines the iteration step size from the average network connectivity and then optimizes the local cost function of each unknown node. Experiments show that the iteration converges quickly and stably, and that its localization accuracy is clearly better than that of the SMACOF-based dwMDS(G) algorithm.
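A simplified sketch (editorial; the connectivity-to-step-size mapping found in the paper is not reproduced) of the local cost being optimized: each unknown node runs gradient descent on a weighted stress function built from measured ranges to its neighbours, with a fixed step size standing in for the connectivity-dependent one. Anchor positions, noise level, and weights are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative setup: 4 anchors with known positions, 1 unknown node
anchors = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
true_pos = np.array([6.0, 4.0])
d_meas = np.linalg.norm(anchors - true_pos, axis=1) + 0.1 * rng.standard_normal(4)
w = np.ones(4)                       # per-edge weights (confidence in each range)

def stress(x):
    """Weighted stress: sum_j w_j (d_j - ||x - a_j||)^2."""
    d = np.linalg.norm(anchors - x, axis=1)
    return np.sum(w * (d_meas - d) ** 2)

x = np.array([5.0, 5.0])             # initial guess for the unknown node
step = 0.05                          # in dwMDS(C) this would come from the average connectivity
for it in range(500):
    d = np.linalg.norm(anchors - x, axis=1) + 1e-12
    # Gradient of the stress with respect to the node position
    grad = np.sum((2 * w * (d - d_meas) / d)[:, None] * (x - anchors), axis=0)
    x = x - step * grad

print("estimate:", x, " stress:", stress(x), " true position:", true_pos)
```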

8.
Complex mining algorithms on large-scale graphs usually require frequent iterative analysis, and distributed computing, which scales well in both computation and storage, is an effective way to improve processing efficiency. However, the arbitrarily distributed edges between graph vertices generate a large number of messages between distributed computing tasks, and the resulting communication overhead during iterations severely limits the performance gains. Existing work reduces communication cost with techniques such as message merging and replication under the traditional push-based messaging framework, but it mainly targets structurally simple, easily optimized algorithms with single-dimensional messages; it does not suit structurally complex algorithms with multi-dimensional messages, and it is hard to make compatible with the current state-of-the-art on-demand pull-based messaging framework. This paper therefore proposes a new lightweight vertex replication mechanism: through on-demand synchronization of replicated vertices and on-demand generation of local messages, it fully inherits the advantages of the pull-based framework in fault tolerance and memory management while significantly reducing communication cost. In addition, by weighing the communication benefit against the cost of load skew, an optimal threshold can be computed to improve overall performance. Finally, the effectiveness of these techniques is validated on a large number of real datasets.

9.
To reduce the average number of iterations in the low-SNR region and to damp the oscillation of log-likelihood ratio (LLR) values, this paper analyses how the check-node LLRs and check sums of erroneous frames evolve for short- and medium-length LDPC codes, and proposes an improved BP decoding algorithm based on message oscillation and check-sum updating. The algorithm reduces the average number of iterations in the low-SNR region through an early-termination criterion for iterative decoding, and improves decoding performance by modifying the check-node updates so as to damp LLR oscillation. Simulation results show that, compared with the BP algorithm, it reduces the average number of iterations in the low-SNR region without loss of decoding performance, while in the medium- and high-SNR regions it improves decoding performance without increasing the average number of iterations.
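The oscillation-based stopping rule itself is not given in the abstract; for reference, the standard early-termination check that such criteria extend simply tests the syndrome of the current hard decision after every BP iteration. H and the LLR vector below are placeholders; the paper's rule additionally cuts short decoding attempts in the low-SNR region.

```python
import numpy as np

def syndrome_ok(H, llr):
    """Standard BP early stop: succeed when the hard decision of the
    current LLRs satisfies all parity checks, i.e. H x^T = 0 (mod 2)."""
    x_hat = (llr < 0).astype(int)        # hard decision: negative LLR -> bit 1
    return not np.any((H @ x_hat) % 2)

# Inside a BP decoding loop one would write, once per iteration:
#     if syndrome_ok(H, posterior_llr):
#         break                          # a valid codeword was found, stop iterating
```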

10.
To reconstruct signals from noisy measurement vectors, this paper studies the cosparse analysis model and an iterative hard thresholding algorithm with orthogonal projection. The iterative pursuit step of the analysis-model iterative hard thresholding algorithm is modified with a cosparse orthogonal projection; the influence of the iteration step size and the cosparsity level on convergence speed and reconstruction performance is analysed, and a method for selecting the optimal step size and the best cosparsity level is given. An iterative hard thresholding orthogonal projection algorithm under the cosparse analysis model is proposed and implemented, together with a sufficient condition for convergence and a bound on the reconstruction error. Simulation results show that the average running time of the algorithm is only 19%, 11%, and 10% of that of the AIHT, AL1, and GAP algorithms, respectively; the overall average peak signal-to-noise ratio (PSNR) of the reconstructed signals is 0.89 dB higher than that of the AIHT algorithm, though slightly inferior to the AIHT and AL1 algorithms. Under the stated conditions the algorithm reconstructs noisy signals with high probability; its overall average PSNR shows no obvious degradation compared with typical algorithms, while the running time is greatly reduced and convergence is faster.

11.
Partially observable Markov decision processes (POMDPs) convert a non-Markovian problem into a Markovian one by introducing a belief state space; their ability to describe the real world makes them an important branch of research on stochastic decision processes. This paper introduces the basic principles and decision procedure of POMDPs and proposes a POMDP algorithm based on policy iteration and value iteration. Using ideas from linear programming and dynamic programming, the algorithm alleviates the "curse of dimensionality" that arises when the belief state space is large and obtains an approximately optimal Markov decision solution. Experimental data show that the algorithm is feasible and effective.
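For reference (editorial), the belief-state construction that makes the process Markovian is the Bayes update

$$ b'(s') = \frac{O(o \mid s', a) \sum_{s} T(s' \mid s, a)\, b(s)}{\Pr(o \mid b, a)}, \qquad \Pr(o \mid b, a) = \sum_{s'} O(o \mid s', a) \sum_{s} T(s' \mid s, a)\, b(s), $$

applied after taking action a and observing o; value and policy iteration are then carried out over these continuous belief states, which is exactly where the curse of dimensionality mentioned in the abstract comes from.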

12.
In this paper, a novel iterative adaptive dynamic programming (ADP) algorithm, called the generalised policy iteration ADP algorithm, is developed to solve optimal tracking control problems for discrete-time nonlinear systems. The idea is to use two iteration procedures, an i-iteration and a j-iteration, to obtain the iterative tracking control laws and the iterative value functions. By system transformation, we first convert the optimal tracking control problem into an optimal regulation problem. Then the generalised policy iteration ADP algorithm, a general scheme in which policy iteration and value iteration interact, is introduced to deal with the optimal regulation problem. The convergence and optimality properties of the generalised policy iteration algorithm are analysed. Three neural networks are used to implement the developed algorithm. Finally, simulation examples are given to illustrate the performance of the present algorithm.
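As a tabular analogue of the generalised-policy-iteration idea (an editorial sketch on a made-up finite MDP, not the paper's neural-network implementation for nonlinear tracking): each outer i-iteration improves the policy, while the inner j-iteration performs only a finite number of evaluation sweeps instead of solving the Bellman equation exactly.

```python
import numpy as np

# Toy discounted MDP (illustrative): P[a, i, j] transition probabilities, c[a, i] costs
S, A, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)
P = rng.random((A, S, S))
P /= P.sum(axis=2, keepdims=True)
c = rng.random((A, S))

V = np.zeros(S)
policy = np.zeros(S, dtype=int)
J = 5                                   # inner evaluation sweeps per outer iteration

for i in range(100):                    # outer "i-iteration": policy improvement
    for j in range(J):                  # inner "j-iteration": partial policy evaluation
        V = c[policy, np.arange(S)] + gamma * (P[policy, np.arange(S)] @ V)
    Q = c + gamma * (P @ V)             # Q[a, i] under the approximately evaluated V
    policy = Q.argmin(axis=0)           # greedy improvement

print("value function:", np.round(V, 3))
print("policy:", policy)
```

Setting J = 1 recovers value iteration and letting J grow recovers policy iteration, which is the sense in which the two procedures "interact" in generalised policy iteration.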

13.
We present a Reinforcement Learning (RL) algorithm based on policy iteration for solving average reward Markov and semi-Markov decision problems. In the literature on discounted reward RL, algorithms based on policy iteration and actor-critic algorithms have appeared. Our algorithm is an asynchronous, model-free algorithm (which can be used on large-scale problems) that hinges on the idea of computing the value function of a given policy and searching over policy space. In the applied operations research community, RL has been used to derive good solutions to problems previously considered intractable. Hence in this paper, we have tested the proposed algorithm on a commercially significant case study related to a real-world problem from the airline industry. It focuses on yield management, which has been hailed as the key factor for generating profits in the airline industry. In the experiments conducted, we use our algorithm with a nearest-neighbor approach to tackle a large state space. We also present a convergence analysis of the algorithm via an ordinary differential equation method.

14.
This article proposes three novel time-varying policy iteration algorithms for the finite-horizon optimal control problem of continuous-time affine nonlinear systems. We first propose a model-based time-varying policy iteration algorithm. The method considers time-varying solutions to the Hamilton–Jacobi–Bellman equation for finite-horizon optimal control. Based on this algorithm, value function approximation is applied to the Bellman equation by establishing neural networks with time-varying weights. A novel update law for the time-varying weights is put forward based on the idea of iterative learning control, which obtains optimal solutions more efficiently than previous works. Considering that system models may be unknown in real applications, we propose a partially model-free time-varying policy iteration algorithm that applies integral reinforcement learning to acquire the time-varying value function. Moreover, analysis of convergence, stability, and optimality is provided for every algorithm. Finally, simulations for different cases are given to verify the convenience and effectiveness of the proposed algorithms.
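For orientation (editorial), the finite-horizon problem behind these algorithms is governed by a time-varying Hamilton–Jacobi–Bellman equation. For affine dynamics $\dot{x}=f(x)+g(x)u$ and cost $\int_t^T \big(q(x)+u^{\top}Ru\big)\,d\tau + \phi(x(T))$,

$$ -\frac{\partial V^{*}}{\partial t} = \min_{u}\Big\{ q(x) + u^{\top}Ru + \nabla_{x}V^{*\top}\big(f(x)+g(x)u\big) \Big\}, \qquad V^{*}(x,T)=\phi(x), $$

with minimizing control $u^{*} = -\tfrac{1}{2}R^{-1}g(x)^{\top}\nabla_{x}V^{*}$; time-varying policy iteration alternates between evaluating such a value function along the current policy and updating the policy from its gradient.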

15.
We propose a unified framework for Markov decision problems and performance sensitivity analysis for multichain Markov processes with both discounted and average-cost performance criteria. With the fundamental concept of performance potentials, we derive both performance-gradient and performance-difference formulas, which play the central role in performance optimization. The standard policy iteration algorithms for both discounted- and average-reward MDPs can be established using the performance-difference formulas in a simple and intuitive way; and the performance-gradient formulas together with stochastic approximation may lead to new optimization schemes. This sensitivity-based point of view of performance optimization provides some insights that link perturbation analysis, Markov decision processes, and reinforcement learning together. The research is an extension of the previous work on ergodic Markov chains (Cao, Automatica 36 (2000) 771).
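For the ergodic case (an editorial restatement; the paper extends this to multichain processes and to both criteria), the average-reward performance-difference formula at the centre of this framework reads

$$ \eta' - \eta = \pi' \big[ (f' - f) + (P' - P)\, g \big], $$

where (P, f) and (P', f') are the transition matrix and reward function of the two policies, π' is the stationary distribution under the second policy, and g is the performance potential of the first. Since π' is a probability vector, any policy that improves f + Pg componentwise improves η, which is exactly the policy-improvement step of policy iteration.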

16.
This communique presents an algorithm called “value set iteration” (VSI) for solving infinite horizon discounted Markov decision processes with finite state and action spaces as a simple generalization of value iteration (VI) and as a counterpart to Chang’s policy set iteration. A sequence of value functions is generated by VSI based on manipulating a set of value functions at each iteration and it converges to the optimal value function. VSI preserves convergence properties of VI while converging no slower than VI and in particular, if the set used in VSI contains the value functions of independently generated sample-policies from a given distribution and a properly defined policy switching policy, a probabilistic exponential convergence rate of VSI can be established. Because the set used in VSI can contain the value functions of any policies generated by other existing algorithms, VSI is also a general framework of combining multiple solution methods.

17.
Optimization algorithms are studied for a class of semi-Markov control processes (SMCPs) with compact action set. Both the long-run average and discounted cost problems are considered. Formulas of performance potentials and optimality equations for SMCPs are derived. A policy iteration algorithm and a value iteration algorithm are proposed, which can lead to an optimal or suboptimal stationary policy in a finite number of iterations. The convergence of these algorithms is established, without the assumption of the corresponding iteration operator being a span-contraction. In addition, an online policy iteration optimization algorithm based on a single sample path is provided to escape the ‘curse of dimensionality’. Finally, a numerical example is provided to illustrate the application of the algorithms.

18.
In this paper we discuss an online algorithm based on policy iteration for learning the continuous-time (CT) optimal control solution with infinite horizon cost for nonlinear systems with known dynamics. That is, the algorithm learns online in real-time the solution to the optimal control design HJ equation. This method finds in real-time suitable approximations of both the optimal cost and the optimal control policy, while also guaranteeing closed-loop stability. We present an online adaptive algorithm implemented as an actor/critic structure which involves simultaneous continuous-time adaptation of both actor and critic neural networks. We call this ‘synchronous’ policy iteration. A persistence of excitation condition is shown to guarantee convergence of the critic to the actual optimal value function. Novel tuning algorithms are given for both critic and actor networks, with extra nonstandard terms in the actor tuning law being required to guarantee closed-loop dynamical stability. The convergence to the optimal controller is proven, and the stability of the system is also guaranteed. Simulation examples show the effectiveness of the new algorithm.

19.
In this paper, we propose two multirate generalised policy iteration (GPI) algorithms applied to discrete-time linear quadratic regulation problems. The proposed algorithms are extensions of the existing GPI algorithm that consists of the approximate policy evaluation and policy improvement steps. The two proposed schemes, named heuristic dynamic programming (HDP) and dual HDP (DHP), based on multirate GPI, use multi-step estimation (the M-step Bellman equation) at the approximate policy evaluation step to estimate the value function and its gradient (the costate), respectively. Then, we show that these two methods with the same update horizon can be considered equivalent in the iteration domain. Furthermore, monotonically increasing and decreasing convergences, the so-called value iteration (VI)-mode and policy iteration (PI)-mode convergences, are proved to hold for the proposed multirate GPIs. Further, general convergence properties in terms of eigenvalues are also studied. The data-driven online implementation methods for the proposed HDP and DHP are demonstrated and, finally, we present the results of numerical simulations performed to verify the effectiveness of the proposed methods.
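A minimal model-based sketch (editorial; the paper's HDP/DHP schemes are data-driven and multirate) of the underlying LQR policy iteration that these GPI variants generalise: evaluate the current gain by solving a discrete Lyapunov equation, then improve it, repeating until the Riccati solution is reached. The system matrices below are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

# Illustrative discrete-time LQR problem (not from the paper)
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.array([[1.0]])

K = np.array([[1.0, 1.0]])              # any stabilising initial gain, u = -K x
for i in range(30):
    Acl = A - B @ K
    # Policy evaluation: P solves  P = Acl' P Acl + Q + K' R K
    P = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)
    # Policy improvement: K <- (R + B' P B)^{-1} B' P A
    K_new = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    if np.linalg.norm(K_new - K) < 1e-10:
        break
    K = K_new

print("converged gain K:", K)
print("cost matrix P:\n", P)
```

The multirate GPI schemes in the paper replace the exact Lyapunov solve with an M-step approximate evaluation learned from data, which is where the VI-mode and PI-mode convergence behaviour comes from.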
