Similar literature (20 records found)
1.
王涛  张化光 《控制与决策》2015,30(9):1674-1678

For stochastic linear continuous-time systems whose model parameters are partially unknown, the infinite-horizon stochastic linear-quadratic (LQ) optimal control problem is solved by a policy iteration algorithm. Solving the stochastic LQ optimal control problem is equivalent to solving the stochastic algebraic Riccati equation (SARE). First, Itô's formula is used to transform the stochastic differential equation into a deterministic one, and the policy iteration algorithm generates a sequence of approximate solutions of the SARE. It is then proved that this sequence converges to the solution of the SARE and that the system remains mean-square stabilizable during the iterations. Finally, a simulation example demonstrates the feasibility of the policy iteration algorithm.
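As a concrete illustration of the policy-iteration idea (a minimal sketch of the deterministic continuous-time LQ analogue, not the authors' stochastic SARE algorithm; all matrices and the initial stabilizing gain below are made-up examples), each iteration evaluates the current gain by solving a Lyapunov equation and then improves the gain from the resulting cost matrix:

import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def lq_policy_iteration(A, B, Q, R, K0, n_iter=20):
    # Kleinman-style policy iteration for the deterministic continuous-time ARE.
    K = K0
    for _ in range(n_iter):
        Ac = A - B @ K                                    # closed loop under current policy
        # Policy evaluation: Ac^T P + P Ac + Q + K^T R K = 0
        P = solve_continuous_lyapunov(Ac.T, -(Q + K.T @ R @ K))
        # Policy improvement: K <- R^{-1} B^T P
        K = np.linalg.solve(R, B.T @ P)
    return P, K

# hypothetical data; K0 must stabilize A - B K0
A = np.array([[0.0, 1.0], [-1.0, 2.0]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
K0 = np.array([[0.0, 5.0]])
P, K = lq_policy_iteration(A, B, Q, R, K0)

The stochastic version treated in the paper adds diffusion-dependent terms to the evaluation equation; the sketch only shows the shared evaluate/improve structure.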


2.
Kernel-based least squares policy iteration for reinforcement learning (total citations: 4; self-citations: 0; citations by others: 4)
In this paper, we present a kernel-based least squares policy iteration (KLSPI) algorithm for reinforcement learning (RL) in large or continuous state spaces, which can be used to realize adaptive feedback control of uncertain dynamic systems. By using KLSPI, near-optimal control policies can be obtained without much a priori knowledge of the dynamic models of control plants. In KLSPI, Mercer kernels are used in the policy evaluation of a policy iteration process, where a new kernel-based least squares temporal-difference algorithm called KLSTD-Q is proposed for efficient policy evaluation. To keep the sparsity and improve the generalization ability of KLSTD-Q solutions, a kernel sparsification procedure based on approximate linear dependency (ALD) is performed. Compared with previous work on approximate RL methods, KLSPI makes two advances that address the main difficulties of existing results. One is the better convergence and (near) optimality guarantee obtained by using the KLSTD-Q algorithm for high-precision policy evaluation. The other is automatic feature selection using the ALD-based kernel sparsification. Therefore, the KLSPI algorithm provides a general RL method with generalization performance and a convergence guarantee for large-scale Markov decision problems (MDPs). Experimental results on a typical RL task for a stochastic chain problem demonstrate that KLSPI can consistently achieve better learning efficiency and policy quality than the previous least squares policy iteration (LSPI) algorithm. Furthermore, the KLSPI method was also evaluated on two nonlinear feedback control problems, including a ship heading control problem and the swing-up control of a double-link underactuated pendulum called the acrobot. Simulation results illustrate that the proposed method can optimize controller performance using little a priori information about uncertain dynamic systems. It is also demonstrated that KLSPI can be applied to online learning control by incorporating an initial controller to ensure online performance.
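The least-squares policy-evaluation step at the core of such algorithms can be sketched as follows (a minimal LSTD-Q sketch with generic linear features standing in for the paper's Mercer kernels and ALD sparsification; all names are illustrative):

import numpy as np

def lstdq(samples, phi, policy, gamma, n_features, reg=1e-6):
    # samples: iterable of (s, a, r, s_next) transitions collected from the plant
    # phi:     feature map phi(s, a) -> vector of length n_features
    # policy:  the policy being evaluated, s -> a
    A = reg * np.eye(n_features)
    b = np.zeros(n_features)
    for s, a, r, s_next in samples:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))
        A += np.outer(f, f - gamma * f_next)   # accumulate the LSTD-Q matrix
        b += r * f
    return np.linalg.solve(A, b)               # weights w with Q(s, a) ~= w . phi(s, a)

Policy improvement then picks, at each state, an action maximizing the approximated Q, and the two steps are iterated.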

3.
As an important approach to solving complex sequential decision problems, reinforcement learning (RL) has been widely studied in the artificial intelligence and machine learning communities. However, the generalization ability of RL is still an open problem, and it is difficult for existing RL algorithms to solve Markov decision problems (MDPs) with both continuous state and action spaces. In this paper, a novel RL approach with fast policy search and adaptive basis function selection, called Continuous-action Approximate Policy Iteration (CAPI), is proposed for RL in MDPs with both continuous state and action spaces. In CAPI, based on the value functions estimated by temporal-difference learning, a fast policy search technique is suggested to search for optimal actions in continuous spaces, which is computationally efficient and easy to implement. To improve the generalization ability and learning efficiency of CAPI, two adaptive basis function selection methods are developed so that sparse approximations of value functions can be obtained efficiently both for linear function approximators and for kernel machines. Simulation results on benchmark learning control tasks with continuous state and action spaces show that the proposed approach not only converges to a near-optimal policy in a few iterations but also obtains performance comparable to or even better than Sarsa-learning and previous approximate policy iteration methods such as LSPI and KLSPI.

4.
Basic Ideas for Event-Based Optimization of Markov Systems (total citations: 5; self-citations: 0; citations by others: 5)
The goal of this paper is two-fold. First, we present a sensitivity point of view on the optimization of Markov systems. We show that Markov decision processes (MDPs) and the policy-gradient approach, or perturbation analysis (PA), can be derived easily from two fundamental sensitivity formulas, and that such formulas can be flexibly constructed, from first principles, with performance potentials as building blocks. Second, with this sensitivity view we propose an event-based optimization approach, including event-based sensitivity analysis and event-based policy iteration. This approach exploits the special feature of a system characterized by events and illustrates how the potentials can be aggregated using this feature and how the aggregated potentials can be used in policy iteration. Compared with the traditional MDP approach, the event-based approach has several advantages: the number of aggregated potentials may scale with the system size even though the number of states grows exponentially in the system size, which reduces the policy space and saves computation; the approach does not require actions at different states to be independent; and it utilizes the special feature of a system and does not need to know the exact transition probability matrix. The main ideas of the approach are illustrated by an admission control problem. (Supported in part by a grant from Hong Kong UGC.)
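For reference, the potential-based sensitivity formulas alluded to above are commonly written as follows (standard average-reward forms; notation may differ from the paper). With P, f and P', f' the transition matrices and reward functions of two policies, \pi and \pi' their stationary distributions, and g the vector of performance potentials:

\eta' - \eta = \pi' \left[ (P' - P)\, g + (f' - f) \right]    (performance difference formula)
\left. \frac{d\eta_\delta}{d\delta} \right|_{\delta = 0} = \pi \left[ (P' - P)\, g + (f' - f) \right]    (performance derivative formula along P_\delta = P + \delta (P' - P),\ f_\delta = f + \delta (f' - f))

Policy iteration and policy-gradient (PA) methods follow from the difference and derivative formulas, respectively.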

5.
Semi-Markov decision problems and performance sensitivity analysis (total citations: 1; self-citations: 0; citations by others: 1)
Recent research indicates that Markov decision processes (MDPs) can be viewed from a sensitivity point of view, and that perturbation analysis (PA), MDPs, and reinforcement learning (RL) are three closely related areas in the optimization of discrete-event dynamic systems that can be modeled as Markov processes. The goal of this paper is two-fold. First, we develop the PA theory for semi-Markov processes (SMPs); we then extend the aforementioned results on the relation among PA, MDPs, and RL to SMPs. In particular, we show that performance sensitivity formulas and policy iteration algorithms for semi-Markov decision processes can be derived based on the performance potential and the realization matrix. Both the long-run average and discounted-cost problems are considered. This approach provides a unified framework for both problems, with the long-run average problem corresponding to the discount factor being zero. The results indicate that performance sensitivities and optimization depend only on first-order statistics. Single-sample-path-based implementations are discussed.

6.
This paper aims at characterizing the most destabilizing switching law for discrete-time switched systems governed by a set of bounded linear operators. The switched system is embedded in a special class of discrete-time bilinear control systems. This allows us to apply the variational approach to the bilinear control system associated with a Mayer-type optimal control problem, and a second-order necessary optimality condition is derived. Optimal equivalence between the bilinear system and the switched system is analyzed, which shows that any optimal control law can be equivalently expressed as a switching law. This specific switching law is the most destabilizing one for the switched system and can thus be used to determine stability under arbitrary switching. Based on the second-order moment of the state, the proposed approach is applied to analyze uniform mean-square stability of discrete-time switched linear stochastic systems. Numerical simulations are presented to verify the usefulness of the theoretical results.

7.
The path integral method originates from stochastic optimal control; it is a numerical iterative method that can solve optimal control problems for continuous nonlinear systems, does not depend on a system model, and converges quickly. This paper applies a policy improvement method based on path-integral reinforcement learning to the goal-directed locomotion of a snake-like robot. Path-integral reinforcement learning is used to learn the parameters of the snake robot's gait equation; the robot not only avoids obstacles and reaches the target point in a simulation environment, but, by exploiting prior knowledge gained in simulation, can also quickly accomplish the same task in the real environment. Experimental results verify the correctness of the method.
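For context, a path-integral (PI^2-style) policy-improvement step on a parameter vector typically has the following shape (a hedged sketch of the generic PI^2 update, not the paper's exact gait-learning procedure; rollout_cost and all constants are placeholders):

import numpy as np

def pi2_update(theta, rollout_cost, n_rollouts=20, noise_std=0.1, lam=1.0):
    # rollout_cost(theta) runs one episode (e.g. a simulated gait) and returns its scalar cost
    eps = noise_std * np.random.randn(n_rollouts, theta.size)     # exploration noise
    costs = np.array([rollout_cost(theta + e) for e in eps])
    costs = costs - costs.min()                                   # shift for numerical stability
    w = np.exp(-costs / lam)
    w /= w.sum()                                                  # softmax weighting over rollouts
    return theta + w @ eps                                        # cost-weighted average of the noise

Low-cost rollouts receive most of the weight, so the gait parameters drift toward behaviours that reach the goal without requiring a system model or gradients.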

8.
A unified NDP approach based on TD(0) learning for MDPs under average and discounted criteria (total citations: 3; self-citations: 0; citations by others: 3)
To meet the needs of practical large-scale Markov systems, simulation-based learning optimization of Markov decision processes (MDPs) is discussed. Starting from the defining expression, a unified temporal-difference formula for the performance potential under the average and discounted criteria is established, a neural network is used to represent the estimate of the performance potential, and a parameterized TD(0) learning formula and algorithm are derived for approximate policy evaluation. Then, based on the approximated potentials, approximate policy iteration yields a unified neuro-dynamic programming (NDP) optimization method for both criteria. The results also apply to semi-Markov decision processes. A numerical example shows that the neuro-policy-iteration algorithm in this paper applies to both criteria and verifies that the average-reward problem is the limiting case of the discounted problem as the discount factor tends to zero.
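The average-criterion case of the parameterized TD(0) evaluation described above can be sketched as follows (a linear approximator stands in for the paper's neural network; names and step sizes are illustrative):

import numpy as np

def td0_potential_step(w, eta, x, r, x_next, features, lr_w=0.05, lr_eta=0.01):
    # One TD(0) update of a parameterized performance-potential estimate
    # g(x) ~= w . features(x) under the average-reward criterion.
    g, g_next = w @ features(x), w @ features(x_next)
    delta = r - eta + g_next - g          # differential TD error
    w = w + lr_w * delta * features(x)    # potential (critic) update
    eta = eta + lr_eta * (r - eta)        # running estimate of the average reward
    return w, eta

The discounted criterion changes only the form of the TD error; approximate policy iteration then improves the policy from the estimated potentials.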

9.
In this paper, we examine the problem of optimal state estimation, or filtering, in stochastic systems using an approach based on information-theoretic measures. In this setting, the traditional minimum mean-square measure is compared with information-theoretic measures, Kalman filtering theory is reexamined, and some new interpretations are offered. We show that for a linear Gaussian system, the Kalman filter is the optimal filter not only for the mean-square error measure but also for several information-theoretic measures introduced in this work. For nonlinear systems, these measures are generally in conflict with each other, and the feedback control policy has a dual role with regard to regulation and estimation. For linear stochastic systems with general noise processes, a lower bound on the achievable mutual information between the estimation error and the observation is derived. The properties of an optimal (probing) control law and the associated optimal filter that achieve this lower bound, and the relationships between them, are investigated. It is shown that for a linear stochastic system with an affine linear filter for the homogeneous system, under some reachability and observability conditions, zero mutual information between the estimation error and the observations can be achieved only when the system is Gaussian.
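The minimum mean-square estimator being reinterpreted here is the standard Kalman filter; one predict/update cycle (shown in discrete time, textbook form, for reference) is:

import numpy as np

def kalman_step(x, P, u, y, A, B, C, Q, R):
    # x_{k+1} = A x_k + B u_k + w_k,  y_k = C x_k + v_k,  w ~ N(0, Q), v ~ N(0, R)
    x_pred = A @ x + B @ u                        # state prediction
    P_pred = A @ P @ A.T + Q                      # covariance prediction
    S = C @ P_pred @ C.T + R                      # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)           # Kalman gain
    x_new = x_pred + K @ (y - C @ x_pred)         # measurement update
    P_new = (np.eye(P.shape[0]) - K @ C) @ P_pred
    return x_new, P_new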

10.
In this paper, we propose a new approach to the theory of finite multichain Markov decision processes (MDPs) with different performance optimization criteria. We first propose the concept of nth-order bias; then, using the average-reward and bias difference formulas derived in this paper, we develop an optimization theory for finite MDPs that covers the complete spectrum from average optimality and bias optimality to all high-order bias optimality in a unified way. The approach is simple, direct, natural, and intuitive; it depends neither on Laurent series expansions nor on discounted MDPs. We also propose one-phase policy iteration algorithms for bias and high-order bias optimal policies, which are more efficient than the two-phase algorithms in the literature. Furthermore, we derive high-order bias optimality equations. This research is part of our effort to develop sensitivity-based learning and optimization theory.

11.
We introduce and analyze several new policy-iteration-type algorithms for average-cost Markov decision processes (MDPs). We limit attention to "recurrent state" processes, for which there exists a state that is recurrent under all stationary policies, and our analysis applies to finite-state problems with compact constraint sets, continuous transition probability functions, and lower-semicontinuous cost functions. The analysis makes use of an underlying relationship between recurrent-state MDPs and the so-called stochastic shortest path problems of Bertsekas and Tsitsiklis (Math. Oper. Res. 16(3) (1991) 580). After extending this relationship, we establish the convergence of the new policy-iteration-type algorithms either to optimality or to within ε > 0 of the optimal average cost.

12.
《Automatica》2014,50(12):3281-3290
This paper addresses the model-free nonlinear optimal control problem from data by introducing the reinforcement learning (RL) technique. It is known that the nonlinear optimal control problem relies on the solution of the Hamilton–Jacobi–Bellman (HJB) equation, a nonlinear partial differential equation that is generally impossible to solve analytically. Even worse, most practical systems are too complicated for an accurate mathematical model to be established. To overcome these difficulties, we propose a data-based approximate policy iteration (API) method that uses real system data rather than a system model. First, a model-free policy iteration algorithm is derived and its convergence is proved. The implementation of the algorithm is based on the actor–critic structure, where actor and critic neural networks (NNs) are employed to approximate the control policy and the cost function, respectively. To update the weights of the actor and critic NNs, a least-squares approach is developed based on the method of weighted residuals. The data-based API is an off-policy RL method, in which "exploration" is improved by arbitrarily sampling data over the state and input domain. Finally, we test the data-based API control design method on a simple nonlinear system and further apply it to a rotational/translational actuator system. The simulation results demonstrate the effectiveness of the proposed method.

13.
In this note, a stochastic production model containing processes with different time scales is developed. It is shown that if the time scales of the processes are very different, hierarchical algorithms that are much more efficient than the standard policy iteration method can be developed to find the optimal production control. Moreover, if the time scales are far apart, the optimal control of a deterministic limiting problem, which depends only on the mean characteristics of the processes, can be used to approximate the optimal control of the original problem. The limiting problem has much lower dimension than its original counterpart and is thus much easier to solve. A numerical example is used to illustrate the potential of the proposed approach.

14.
This paper deals with the optimal control problem for a class of affine nonlinear discrete-time systems. By introducing a sensitivity parameter and expanding the system variables into a Maclaurin series around it, we transform the original optimal control problem for affine nonlinear discrete-time systems into an optimal control problem for a sequence of linear discrete-time systems. The optimal control law consists of an exact linear term and a nonlinear compensating term, which is an infinite sequence of adjoint vectors. In the present approach, iteration is required only for the nonlinear compensation series. By truncating the series to a finite sum, we obtain a suboptimal control law that reduces the complexity of the calculations. A numerical simulation shows that the algorithm can be easily implemented and has a fast convergence rate.

15.
This paper presents a novel Markov switching state-space control model for dynamically switching the resource configuration scheme to achieve power conservation in multimedia server cluster systems. The model exploits the hierarchical dynamic structure of the network system, and its construction is flexible and scalable. Using this analytical model, the power conservation problem is posed as a constrained stochastic optimization problem with the goal of minimizing the average power consumption subject to a constraint on the average blocking ratio. Applying a Lagrangian approach and online estimation of the performance gradient, a policy iteration algorithm is proposed to search for the optimal policy online. This algorithm does not depend on any prior knowledge of the system parameters and converges to the optimal solution. Simulation results demonstrate the convergence of the proposed algorithm and its effectiveness under different access workloads.

16.
This paper studies a continuous-time stochastic linear-quadratic (SLQ) optimal control problem on an infinite horizon. Combining Kronecker product theory with an existing policy iteration algorithm, a data-driven policy iteration algorithm is proposed to solve the problem. In contrast to most existing methods, which need full information about the system coefficients, the proposed algorithm eliminates the requirement for the three system matrices by utilizing data from the stochastic system. More specifically, the algorithm uses the collected data to iteratively approximate the optimal control and a solution of the stochastic algebraic Riccati equation (SARE) corresponding to the SLQ optimal control problem. A rigorous convergence analysis of the obtained algorithm is given, and a simulation example is provided to illustrate the effectiveness and applicability of the algorithm.
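The Kronecker-product machinery mentioned above typically rests on the vectorization identity vec(A X B) = (Bᵀ ⊗ A) vec(X), which turns a linear matrix equation in an unknown matrix into an ordinary linear system in its vectorized form, so that it can be fit to collected data by least squares. A quick numerical check of the identity (random matrices, illustrative only):

import numpy as np

A = np.random.randn(3, 3)
B = np.random.randn(3, 3)
X = np.random.randn(3, 3)

vec = lambda M: M.flatten(order="F")   # column-stacking vectorization
assert np.allclose(vec(A @ X @ B), np.kron(B.T, A) @ vec(X))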

17.
For the optimal tracking control problem of a class of bilinear systems with quadratic performance indices, an approximate method for designing the optimal control law via successive approximation is proposed. First, the optimal tracking problem for a bilinear system whose state vector contains time delays is transformed into an optimal regulation problem. Then, using the successive approximation algorithm, the two-point boundary value problem containing both delayed and advanced terms is transformed into a family of linear two-point boundary value problems without delayed or advanced terms, yielding the optimal control law of the regulated system; a feedforward-feedback suboptimal control law for the regulated system can be obtained by truncating the optimal control sequence to finitely many terms. Finally, the optimal control problem is converted back into the optimal tracking problem. Simulation results show that the method achieves good tracking performance.

18.
Tuhin Das  Ranjan Mukherjee   《Automatica》2008,44(5):1437-1441
In this paper we address the problem of optimal switching for switched linear systems. The uniqueness of our approach lies in describing the switching action by multiple control inputs. This allows us to embed the switched system in a larger family of systems and apply Pontryagin's Minimum Principle to solve the optimal control problem. The approach imposes no restriction on the switching sequence or the number of switchings, in contrast to search-based algorithms where a fixed number of switchings is set a priori. In our approach, the optimal solution can be determined by solving the ensuing two-point boundary value problem. Results of numerical simulations are provided to support the proposed method.

19.
This paper presents a new design approach to achieve decentralized optimal control of high-dimensional complex singular systems with dynamic uncertainties. Based on the robust adaptive dynamic programming (robust ADP) method, controllers for solving the optimal control problem of the singular systems are designed. The proposed algorithm works well when the system model is not exactly known but the input and output data can be measured. The policy iteration of each controller uses only its own state and input information for learning and does not need to know the dynamics of the whole system. Simulation results on the New England 10-machine 39-bus test system show the effectiveness of the designed controller.

20.

In this paper, we investigate optimal control problems for delayed doubly stochastic control systems. We first discuss the existence and uniqueness of the delayed doubly stochastic differential equation via the martingale representation theorem and the contraction mapping principle. As a necessary condition for optimality, we deduce a stochastic maximum principle under certain assumptions. A sufficient condition for optimality is also obtained using the duality method. At the end of the paper, we apply the stochastic maximum principle to a class of linear-quadratic optimal control problems and obtain an explicit expression for the optimal control.
