Similar Articles
19 similar articles retrieved (search time: 109 ms)
1.
Optimal Policies for Continuous-Time MCPs on Compact Action Sets    Total citations: 8 (self-citations: 2, by others: 8)
This paper studies the optimal control problem for a class of continuous-time Markov control processes (CTMCPs) under the infinite-horizon average-cost criterion. Using the infinitesimal generator and the basic properties of performance potentials, the optimality equation of the average-cost model on a compact action set and an existence theorem for its solutions are derived directly. A numerical iteration algorithm for computing ε-optimal stationary control policies is proposed, and its convergence is proved. Finally, a numerical example is analysed to illustrate the application of the method.
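The abstract does not include the algorithm itself; the following is a minimal sketch of one standard route to an ε-optimal stationary policy for a finite-state model of this kind (uniformization followed by relative value iteration). The array shapes, the uniformization constant, and the convergence test are assumptions for illustration, not taken from the paper.

```python
import numpy as np

def relative_value_iteration(Q, c, eps=1e-6, max_iter=10_000):
    """Sketch: epsilon-optimal stationary policy for a finite-state CTMCP
    under the average-cost criterion (hypothetical inputs, not the paper's code).

    Q : (A, S, S) generator matrix for each discretized action.
    c : (A, S) cost rate for each action/state pair.
    """
    A, S, _ = Q.shape
    lam = 1.1 * np.max(-Q[:, np.arange(S), np.arange(S)])   # uniformization constant
    P = np.eye(S)[None, :, :] + Q / lam                     # (A, S, S) discrete-time kernels
    h = np.zeros(S)                                         # relative value (potential) estimate
    g = 0.0
    for _ in range(max_iter):
        q = c / lam + np.einsum('aij,j->ai', P, h)          # Q-values of the uniformized chain
        v = q.min(axis=0)
        g, h_new = v[0], v - v[0]                           # gain estimate and new potentials
        if np.abs(h_new - h).max() < eps:
            h = h_new
            break
        h = h_new
    policy = (c / lam + np.einsum('aij,j->ai', P, h)).argmin(axis=0)
    return policy, lam * g, h                               # policy, average cost rate, potentials
```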

2.
Performance-Potential-Based Optimality Equations for a Class of Controlled Closed Queueing Networks    Total citations: 1 (self-citations: 0, by others: 1)
This paper studies the performance optimization problem for a class of controlled closed queueing networks. Two basic concepts are introduced: the discounted-cost α-performance potential and the average-cost performance potential, and a relation between the two potentials is discussed. Under general assumptions, the basic properties of performance potentials are used to establish directly the optimality equation of the infinite-horizon average-cost model, and the existence of an optimal solution on a compact set is proved. Finally, an iterative policy optimization algorithm is given, and a practical example illustrates its effectiveness.

3.
蔡霞  马社祥  孟鑫 《计算机应用研究》2012,29(11):4232-4234
To address the sharp increase in computational complexity and the severe performance degradation of traditional algorithms when processing large-scale signals in sensor networks, a heuristic synchronous adaptive iterative thresholding reconstruction algorithm is proposed. A heuristic error-control function selects the least-cost direction and shrinks toward the optimal solution row by row in a synchronous manner; this is combined with a nonlinear thresholding function determined by an adaptively decreasing power-exponent parameter to further check and correct the reconstructed signal. Simulation results show that the proposed algorithm reconstructs the signal with fewer measurements and fewer iterations, improving the signal-to-noise ratio by 60 dB.
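As a rough illustration of the iterative-thresholding pattern the abstract builds on (a gradient step followed by a shrinking nonlinear threshold), here is a generic sparse-recovery sketch. It is not the authors' heuristic algorithm, and the parameters p0 and decay are hypothetical.

```python
import numpy as np

def iterative_thresholding(Phi, y, n_iter=100, mu=None, p0=0.9, decay=0.97):
    """Generic iterative shrinkage-thresholding sketch for y = Phi @ x + noise.

    Not the paper's algorithm; it only shows the common pattern: a Landweber
    (gradient) step followed by an adaptively decreasing soft threshold.
    """
    m, n = Phi.shape
    mu = mu or 1.0 / np.linalg.norm(Phi, 2) ** 2   # step size from the spectral norm
    x = np.zeros(n)
    for t in range(n_iter):
        r = y - Phi @ x                            # residual
        z = x + mu * Phi.T @ r                     # gradient step
        tau = p0 * np.max(np.abs(z)) * decay ** t  # adaptively decreasing threshold
        x = np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)   # soft shrinkage
    return x
```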

4.
Design of Guaranteed-Cost Iterative Learning Algorithms Based on the LMI Method    Total citations: 4 (self-citations: 0, by others: 4)
This paper studies the design and optimization of performance-based iterative learning algorithms. A quadratic performance function in the iteration domain is first defined, and an iteration-domain optimal iterative learning algorithm is given for linear discrete-time systems. Based on the linear matrix inequality (LMI) method, a guaranteed-cost iterative learning algorithm and its optimization procedure are then given for uncertain linear discrete-time systems. For both classes of algorithms, the convergence rate of iterative learning can be tuned simply by adjusting the weighting matrices in the performance function. Moreover, the design and optimization of the guaranteed-cost iterative learning algorithm can be carried out conveniently with the MATLAB toolbox.
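For orientation only, a minimal P-type iterative learning control loop for a discrete-time linear system is sketched below. The paper's contribution, choosing the learning gain via LMIs to guarantee an iteration-domain quadratic cost, is not reproduced; the scalar gain L is simply assumed given, and the single-input, single-output shapes are assumptions.

```python
import numpy as np

def p_type_ilc(A, B, C, r, L, n_trials=30):
    """Minimal P-type ILC sketch for x_{t+1} = A x + B u, y = C x (SISO).

    A : (n, n), B : (n,), C : (n,), r : reference of length T, L : learning gain.
    Iteration-domain update: u_{k+1}(t) = u_k(t) + L * e_k(t+1).
    """
    T = len(r)
    n = A.shape[0]
    u = np.zeros(T)
    e = np.zeros(T)
    for _ in range(n_trials):
        x = np.zeros(n)
        y = np.zeros(T)
        for t in range(T):
            y[t] = C @ x                  # output before applying u[t]
            x = A @ x + B * u[t]          # state update
        e = r - y                         # tracking error on this trial
        u[:-1] = u[:-1] + L * e[1:]       # P-type update with one-step-shifted error
    return u, e
```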

5.
An intelligent algorithm based on discounted generalised value iteration is designed to solve the optimal tracking control problem for a class of complex nonlinear systems. By choosing suitable initial values, the cost function in the value iteration procedure converges to the optimal cost function in a monotonically decreasing manner. Based on this monotonically decreasing value iteration algorithm, the admissibility of the iterative tracking control law and the asymptotic stability of the error system are discussed under different discount factors. To facilitate implementation of the algorithm, a data-driven model network is built to…

6.
Average-Cost Optimal Policies for Markov Control Processes Based on Performance Potentials    Total citations: 2 (self-citations: 1, by others: 2)
This paper studies the optimal control problem for a class of discrete-time Markov control processes under the average-cost criterion. Applying the basic properties of Markov performance potentials, the optimality equation of the infinite-horizon average-cost model on a compact action set and an existence theorem for its solutions are derived directly under fairly general assumptions. An iterative algorithm for computing optimal stationary control policies is proposed, and its convergence is discussed. Finally, an example is analysed to illustrate the application of the algorithm.
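A compact sketch of average-cost policy iteration with potentials obtained from the Poisson equation is given below for a finite unichain MDP; the finiteness and unichain assumptions are mine, not the paper's.

```python
import numpy as np

def policy_iteration_average_cost(P, c, max_iter=100):
    """Policy iteration sketch for a finite unichain average-cost MDP via potentials.

    P : (A, S, S) transition matrices, c : (A, S) one-step costs.
    Policy evaluation solves the Poisson equation (I - P_d) g + eta * 1 = c_d
    with the normalisation g[0] = 0; the vector g holds the performance potentials.
    """
    A, S, _ = P.shape
    d = np.zeros(S, dtype=int)                          # initial stationary policy
    eta, g = 0.0, np.zeros(S)
    for _ in range(max_iter):
        Pd = P[d, np.arange(S)]                         # (S, S) transitions of current policy
        cd = c[d, np.arange(S)]
        M = np.hstack([np.eye(S) - Pd, np.ones((S, 1))])
        M = np.vstack([M, np.eye(1, S + 1)])            # append the constraint g[0] = 0
        sol = np.linalg.lstsq(M, np.append(cd, 0.0), rcond=None)[0]
        g, eta = sol[:S], sol[S]
        d_new = (c + np.einsum('aij,j->ai', P, g)).argmin(axis=0)   # improvement step
        if np.array_equal(d_new, d):
            break
        d = d_new
    return d, eta, g
```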

7.
This paper studies the application of iterative optimization methods to node localization in wireless sensor networks. For multidimensional scaling (MDS) localization and the traditional gradient-based iterative optimization method, a functional relationship between the iteration step size and the network connectivity is determined through numerical experiments, and a connectivity-based distributed weighted multidimensional scaling algorithm, dwMDS(C), is proposed. The algorithm first sets the iteration step size according to the average connectivity of the network, and then optimizes the local cost function of each unknown node. Experiments show that the iterative algorithm converges quickly and stably, and achieves clearly higher localization accuracy than the SMACOF-based dwMDS(G) algorithm.

8.
Complex mining algorithms on large-scale graphs usually require high-frequency iterative analysis, and distributed computing, which scales well in both computation and storage, is an effective way to improve processing efficiency. However, the freely distributed edges between graph vertices generate a large volume of messages between distributed computing tasks, and the resulting communication overhead during iteration severely limits the performance gains. Existing work reduces communication cost with techniques such as message combining and vertex backup under the traditional message-push framework, but it mainly targets single-dimensional message algorithms with simple, easily optimized structure; it is not suitable for structurally complex multi-dimensional message algorithms and is hard to make compatible with the state-of-the-art on-demand message-pull framework. This paper therefore proposes a new lightweight vertex backup mechanism that, through on-demand synchronization of backup vertices and on-demand generation of local messages, fully inherits the system advantages of the pull framework in fault tolerance and memory management while significantly reducing communication cost. Furthermore, by weighing the communication benefit against the cost of load skew, an optimal threshold can be computed to improve overall performance. Finally, the effectiveness of the proposed techniques is verified on a large number of real-world datasets.

9.
To reduce the average number of iterations in the low-SNR region and to suppress oscillation of the log-likelihood ratio (LLR) values, the behaviour of the check-node LLRs and check sums of erroneous frames of short- and medium-length LDPC codes is analysed, and an improved BP decoding algorithm based on message oscillation and check-sum updates is proposed. The algorithm reduces the average number of iterations in the low-SNR region through an early-termination criterion for iterative decoding, and improves decoding performance by modifying the check-node update to damp the oscillation of the LLR values. Simulation results show that, compared with the standard BP algorithm, the proposed algorithm reduces the average number of iterations in the low-SNR region without any loss in decoding performance, while in the medium- and high-SNR regions it improves decoding performance without increasing the average number of iterations.
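The min-sum BP sketch below only illustrates the two ideas named in the abstract: damping the check-to-variable messages against LLR oscillation, and stopping early when the number of unsatisfied checks stalls. The paper's exact oscillation correction and termination criterion differ, and the damping and stall_window parameters are hypothetical; check degrees of at least 2 are assumed.

```python
import numpy as np

def bp_decode(H, llr_ch, max_iter=50, damping=0.75, stall_window=5):
    """Min-sum BP sketch with message damping and an early-stop rule.

    H : (M, N) parity-check matrix over {0, 1}; llr_ch : channel LLRs, length N.
    """
    M, N = H.shape
    msgs = np.zeros((M, N))                      # check-to-variable messages
    post = llr_ch.copy()
    history = []
    hard = (post < 0).astype(int)
    for it in range(max_iter):
        for m in range(M):                       # check-node update (min-sum)
            idx = np.flatnonzero(H[m])
            ext = post[idx] - msgs[m, idx]       # variable-to-check messages
            s = np.where(ext >= 0, 1.0, -1.0)
            total_sign = np.prod(s)
            mag = np.abs(ext)
            new = np.array([total_sign * s[k] * np.min(np.delete(mag, k))
                            for k in range(len(idx))])
            msgs[m, idx] = damping * new + (1 - damping) * msgs[m, idx]   # damp oscillation
        post = llr_ch + msgs.sum(axis=0)         # posterior LLRs
        hard = (post < 0).astype(int)
        unsat = int(np.sum((H @ hard) % 2))
        if unsat == 0:                           # valid codeword: stop
            return hard, it + 1
        history.append(unsat)
        if (len(history) >= stall_window and
                min(history[-stall_window:]) >= history[-stall_window]):
            break                                # no progress: give up early (low-SNR case)
    return hard, it + 1
```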

10.
An Iterative Particle Swarm Algorithm and Its Application to Robust Optimization of Batch Processes    Total citations: 1 (self-citations: 0, by others: 1)
For robust optimization problems of batch processes with only state-independent constraints and terminal constraints, an iterative particle swarm algorithm is proposed by combining an iterative scheme with particle swarm optimization. The control variables are first discretized, and the standard particle swarm optimization algorithm searches for the optimal values of the discretized control variables. In subsequent iterations, the reference point is moved to the optimum just obtained while the search domain of the control variables is shrunk, so that the performance index and the control trajectory approach the optimal solution over the iterations. The algorithm is simple, feasible, and efficient, and avoids solving large systems of differential equations. Simulation results for a batch process show that the iterative particle swarm algorithm can effectively solve robust optimization problems of batch processes with only state-independent constraints and terminal constraints.
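As a sketch of the iterate-then-shrink idea (not the paper's implementation), the loop below runs standard PSO, re-centres the search box on the best solution found, shrinks it, and repeats. The cost function stands for whatever performance index a batch-process simulation returns, and all numeric parameters are illustrative.

```python
import numpy as np

def iterative_pso(cost, lb, ub, n_iter=5, swarm=30, steps=200, shrink=0.5):
    """Iterative PSO sketch: PSO inner loop, then shrink and re-centre the box."""
    lb, ub = np.asarray(lb, float), np.asarray(ub, float)
    best_x, best_f = None, np.inf
    for _ in range(n_iter):
        dim = lb.size
        x = lb + (ub - lb) * np.random.rand(swarm, dim)      # initial swarm in the box
        v = np.zeros((swarm, dim))
        pbest, pbest_f = x.copy(), np.array([cost(p) for p in x])
        g = pbest[pbest_f.argmin()].copy()
        for _ in range(steps):                               # standard PSO inner loop
            r1, r2 = np.random.rand(swarm, dim), np.random.rand(swarm, dim)
            v = 0.7 * v + 1.5 * r1 * (pbest - x) + 1.5 * r2 * (g - x)
            x = np.clip(x + v, lb, ub)
            f = np.array([cost(p) for p in x])
            better = f < pbest_f
            pbest[better], pbest_f[better] = x[better], f[better]
            g = pbest[pbest_f.argmin()].copy()
        if pbest_f.min() < best_f:
            best_x, best_f = g.copy(), pbest_f.min()
        half = shrink * (ub - lb) / 2                        # shrink the search box
        lb, ub = best_x - half, best_x + half                # re-centre on the best point
    return best_x, best_f
```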

11.
Partially observable Markov decision processes (POMDPs) are solved by introducing a belief state space that converts a non-Markovian problem into a Markov chain problem; their ability to describe the real world makes them an important branch of research on stochastic decision processes. This paper introduces the basic principles and decision procedure of POMDPs and proposes a POMDP algorithm based on policy iteration and value iteration. Using ideas from linear programming and dynamic programming, the algorithm alleviates the "curse of dimensionality" that arises when the belief state space is large and obtains an approximately optimal Markov decision solution. Experimental data show that the algorithm is feasible and effective.
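The belief-state update that turns a POMDP into a Markov problem over beliefs, as described above, can be written in a few lines; the matrix layout below is an assumption for illustration.

```python
import numpy as np

def belief_update(b, a, o, T, Z):
    """Belief-state update that makes a POMDP a Markov problem over beliefs.

    T[a] : (S, S) transition matrix for action a, T[a][s, s'] = P(s' | s, a)
    Z[a] : (S, O) observation matrix,             Z[a][s', o] = P(o | s', a)
    b    : current belief, a probability vector over the S hidden states.
    """
    b_pred = b @ T[a]                 # predict: P(s' | b, a)
    b_new = b_pred * Z[a][:, o]       # correct with the observation likelihood
    return b_new / b_new.sum()        # normalise (denominator is P(o | b, a))
```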

12.
In this paper, a novel iterative adaptive dynamic programming (ADP) algorithm, called generalised policy iteration ADP algorithm, is developed to solve optimal tracking control problems for discrete-time nonlinear systems. The idea is to use two iteration procedures, including an i-iteration and a j-iteration, to obtain the iterative tracking control laws and the iterative value functions. By system transformation, we first convert the optimal tracking control problem into an optimal regulation problem. Then the generalised policy iteration ADP algorithm, which is a general idea of interacting policy and value iteration algorithms, is introduced to deal with the optimal regulation problem. The convergence and optimality properties of the generalised policy iteration algorithm are analysed. Three neural networks are used to implement the developed algorithm. Finally, simulation examples are given to illustrate the performance of the present algorithm.

13.
We present a Reinforcement Learning (RL) algorithm based on policy iteration for solving average reward Markov and semi-Markov decision problems. In the literature on discounted reward RL, algorithms based on policy iteration and actor-critic algorithms have appeared. Our algorithm is an asynchronous, model-free algorithm (which can be used on large-scale problems) that hinges on the idea of computing the value function of a given policy and searching over policy space. In the applied operations research community, RL has been used to derive good solutions to problems previously considered intractable. Hence in this paper, we have tested the proposed algorithm on a commercially significant case study related to a real-world problem from the airline industry. It focuses on yield management, which has been hailed as the key factor for generating profits in the airline industry. In the experiments conducted, we use our algorithm with a nearest-neighbor approach to tackle a large state space. We also present a convergence analysis of the algorithm via an ordinary differential equation method.
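The paper's algorithm itself is not reproduced here. For context, the sketch below shows R-learning, a standard model-free average-reward scheme that exercises the same machinery (a relative value table plus a gain estimate); it is not the authors' policy-iteration-based method, and env.reset() / env.step() are hypothetical environment interfaces.

```python
import random
from collections import defaultdict

def r_learning(env, actions, episodes=500, alpha=0.1, beta=0.01, eps=0.1):
    """R-learning sketch for average-reward RL (assumed env interface:
    env.reset() -> state, env.step(a) -> (next_state, reward, done))."""
    Q = defaultdict(float)          # relative action values
    rho = 0.0                       # running estimate of the average reward
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            greedy_a = max(actions, key=lambda u: Q[(s, u)])
            a = random.choice(actions) if random.random() < eps else greedy_a
            s2, r, done = env.step(a)
            best_next = max(Q[(s2, u)] for u in actions)
            Q[(s, a)] += alpha * (r - rho + best_next - Q[(s, a)])
            if a == greedy_a:       # update the gain estimate only on greedy steps
                best_here = max(Q[(s, u)] for u in actions)
                rho += beta * (r + best_next - best_here - rho)
            s = s2
    return (lambda s: max(actions, key=lambda u: Q[(s, u)])), rho
```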

14.
This article proposes three novel time-varying policy iteration algorithms for the finite-horizon optimal control problem of continuous-time affine nonlinear systems. We first propose a model-based time-varying policy iteration algorithm. The method considers time-varying solutions to the Hamilton–Jacobi–Bellman equation for finite-horizon optimal control. Based on this algorithm, value function approximation is applied to the Bellman equation by establishing neural networks with time-varying weights. A novel update law for time-varying weights is put forward based on the idea of iterative learning control, which obtains optimal solutions more efficiently compared to previous works. Considering that system models may be unknown in real applications, we propose a partially model-free time-varying policy iteration algorithm that applies integral reinforcement learning to acquiring the time-varying value function. Moreover, analysis of convergence, stability, and optimality is provided for every algorithm. Finally, simulations for different cases are given to verify the convenience and effectiveness of the proposed algorithms.

15.
We propose a unified framework to Markov decision problems and performance sensitivity analysis for multichain Markov processes with both discounted and average-cost performance criteria. With the fundamental concept of performance potentials, we derive both performance-gradient and performance-difference formulas, which play the central role in performance optimization. The standard policy iteration algorithms for both discounted- and average-reward MDPs can be established using the performance-difference formulas in a simple and intuitive way; and the performance-gradient formulas together with stochastic approximation may lead to new optimization schemes. This sensitivity-based point of view of performance optimization provides some insights that link perturbation analysis, Markov decision processes, and reinforcement learning together. The research is an extension of the previous work on ergodic Markov chains (Cao, Automatica 36 (2000) 771).
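In common notation for the ergodic average-reward case (not copied verbatim from the paper), the potentials and the two central formulas take the following form, with π, π′ the stationary distributions and f, f′ the reward functions of the two policies:

```latex
% Ergodic average-reward case; g denotes the performance potentials.
\begin{align*}
  & (I - P)\,g + \eta\,\mathbf{1} = f
      && \text{(Poisson equation defining the potentials } g\text{)} \\
  & \eta' - \eta = \pi' \big[ (P' - P)\,g + (f' - f) \big]
      && \text{(performance-difference formula)} \\
  & \left.\frac{d\eta_\delta}{d\delta}\right|_{\delta = 0}
      = \pi \big[ (P' - P)\,g + (f' - f) \big],
      \qquad P_\delta = P + \delta\,(P' - P)
      && \text{(performance-derivative formula)}
\end{align*}
```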

16.
This communique presents an algorithm called “value set iteration” (VSI) for solving infinite horizon discounted Markov decision processes with finite state and action spaces as a simple generalization of value iteration (VI) and as a counterpart to Chang’s policy set iteration. A sequence of value functions is generated by VSI based on manipulating a set of value functions at each iteration and it converges to the optimal value function. VSI preserves convergence properties of VI while converging no slower than VI and in particular, if the set used in VSI contains the value functions of independently generated sample-policies from a given distribution and a properly defined policy switching policy, a probabilistic exponential convergence rate of VSI can be established. Because the set used in VSI can contain the value functions of any policies generated by other existing algorithms, VSI is also a general framework of combining multiple solution methods.
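One plausible reading of the VSI update, hedged because the communique's exact operator may differ: apply the Bellman operator to the current iterate and to every value function in the supplied set, and keep the pointwise maximum. With an empty set this reduces to ordinary value iteration.

```python
import numpy as np

def value_set_iteration(P, r, gamma, sample_values, n_iter=200):
    """Sketch of a VSI-style update for a finite discounted MDP (rewards maximised).

    P : (A, S, S) transitions, r : (A, S) rewards, sample_values : list of value
    vectors (e.g. of sample policies) used to accelerate the iteration.
    """
    A, S, _ = P.shape

    def bellman(v):
        return (r + gamma * np.einsum('aij,j->ai', P, v)).max(axis=0)

    v = np.zeros(S)
    for _ in range(n_iter):
        candidates = [bellman(v)] + [bellman(w) for w in sample_values]
        v = np.maximum.reduce(candidates)        # elementwise max over the value set
    policy = (r + gamma * np.einsum('aij,j->ai', P, v)).argmax(axis=0)
    return v, policy
```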

17.
Optimization algorithms are studied for a class of semi-Markov control processes (SMCPs) with a compact action set. Both the long-run average and discounted cost problems are considered. Formulas of performance potentials and optimality equations for SMCPs are derived. A policy iteration algorithm and a value iteration algorithm are proposed, which can lead to an optimal or suboptimal stationary policy in a finite number of iterations. The convergence of these algorithms is established, without the assumption of the corresponding iteration operator being a span-contraction. In addition, an online policy iteration optimization algorithm based on a single sample path is provided to escape the ‘curse of dimensionality’. In the end, a numerical example is provided to illustrate the application of the algorithms.

18.
In this paper, we propose two multirate generalised policy iteration (GPI) algorithms applied to discrete-time linear quadratic regulation problems. The proposed algorithms are extensions of the existing GPI algorithm that consists of the approximate policy evaluation and policy improvement steps. The two proposed schemes, named heuristic dynamic programming (HDP) and dual HDP (DHP), based on multirate GPI, use multi-step estimation (M-step Bellman equation) at the approximate policy evaluation step for estimating the value function and its gradient called costate, respectively. Then, we show that these two methods with the same update horizon can be considered equivalent in the iteration domain. Furthermore, monotonically increasing and decreasing convergences, so-called value iteration (VI)-mode and policy iteration (PI)-mode convergences, are proved to hold for the proposed multirate GPIs. Further, general convergence properties in terms of eigenvalues are also studied. The data-driven online implementation methods for the proposed HDP and DHP are demonstrated, and finally we present the results of numerical simulations performed to verify the effectiveness of the proposed methods.
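A model-based, HDP-flavoured sketch of the multirate idea for LQR is given below: approximate policy evaluation applies the policy's Lyapunov recursion M times instead of solving it exactly, so M = 1 resembles value iteration and large M resembles policy iteration. The paper's data-driven implementation estimates the same quantities online; the zero initial gain here is an assumption for illustration.

```python
import numpy as np

def multirate_gpi_lqr(A, B, Q, R, M=5, n_iter=50):
    """Multirate GPI sketch for x_{t+1} = A x + B u with cost x'Qx + u'Ru.

    M-step approximate policy evaluation followed by policy improvement.
    """
    n, m = B.shape
    P = np.zeros((n, n))                          # value function matrix (x'Px)
    K = np.zeros((m, n))                          # initial gain (assumed admissible)
    for _ in range(n_iter):
        Ak = A - B @ K                            # closed-loop matrix of current policy
        Qk = Q + K.T @ R @ K                      # per-step cost of current policy
        for _ in range(M):                        # M-step approximate policy evaluation
            P = Qk + Ak.T @ P @ Ak
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # policy improvement
    return P, K
```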

19.
We discuss the solution of complex multistage decision problems using methods that are based on the idea of policy iteration (PI), i.e., start from some base policy and generate an improved policy. Rollout is the simplest method of this type, where just one improved policy is generated. We can view PI as repeated application of rollout, where the rollout policy at each iteration serves as the base policy for the next iteration. In contrast with PI, rollout has a robustness property: it can be applied on-line and is suitable for on-line replanning. Moreover, rollout can use as base policy one of the policies produced by PI, thereby improving on that policy. This is the type of scheme underlying the prominently successful AlphaZero chess program.

In this paper we focus on rollout and PI-like methods for problems where the control consists of multiple components each selected (conceptually) by a separate agent. This is the class of multiagent problems where the agents have a shared objective function, and a shared and perfect state information. Based on a problem reformulation that trades off control space complexity with state space complexity, we develop an approach, whereby at every stage, the agents sequentially (one-at-a-time) execute a local rollout algorithm that uses a base policy, together with some coordinating information from the other agents. The amount of total computation required at every stage grows linearly with the number of agents. By contrast, in the standard rollout algorithm, the amount of total computation grows exponentially with the number of agents. Despite the dramatic reduction in required computation, we show that our multiagent rollout algorithm has the fundamental cost improvement property of standard rollout: it guarantees an improved performance relative to the base policy. We also discuss autonomous multiagent rollout schemes that allow the agents to make decisions autonomously through the use of precomputed signaling information, which is sufficient to maintain the cost improvement property, without any on-line coordination of control selection between the agents.

For discounted and other infinite horizon problems, we also consider exact and approximate PI algorithms involving a new type of one-agent-at-a-time policy improvement operation. For one of our PI algorithms, we prove convergence to an agent-by-agent optimal policy, thus establishing a connection with the theory of teams. For another PI algorithm, which is executed over a more complex state space, we prove convergence to an optimal policy. Approximate forms of these algorithms are also given, based on the use of policy and value neural networks. These PI algorithms, in both their exact and their approximate form, are strictly off-line methods, but they can be used to provide a base policy for use in an on-line multiagent rollout scheme.
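A minimal sketch of the one-agent-at-a-time rollout step described above; base_policy and q_value are hypothetical interfaces (the latter standing for a simulation of the base policy from the current state after applying a joint control), and the loop over a single agent's actions is what keeps the per-stage computation linear in the number of agents.

```python
def multiagent_rollout(state, agents, base_policy, q_value):
    """One-at-a-time multiagent rollout sketch (assumed interfaces).

    base_policy(j, state)            -> base control of agent j at `state`
    base_policy.action_space(i, state) -> candidate controls for agent i
    q_value(state, controls)         -> expected cost-to-go of applying the joint
                                        control `controls` and then following the
                                        base policy (obtained by simulation).
    """
    chosen = {}
    for i in agents:                                   # fixed agent order
        best_u, best_q = None, float('inf')
        for u in base_policy.action_space(i, state):   # only agent i's component varies
            controls = {**chosen,
                        i: u,
                        **{j: base_policy(j, state) for j in agents
                           if j != i and j not in chosen}}
            q = q_value(state, controls)               # cost-to-go under the base policy
            if q < best_q:
                best_u, best_q = u, q
        chosen[i] = best_u                             # freeze agent i's choice
    return chosen
```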
