Similar literature
20 similar documents found
1.
Munos  Rémi 《Machine Learning》2000,40(3):265-299
This paper proposes a study of Reinforcement Learning (RL) for continuous state-space and time control problems, based on the theoretical framework of viscosity solutions (VSs). We use the method of dynamic programming (DP) which introduces the value function (VF), expectation of the best future cumulative reinforcement. In the continuous case, the value function satisfies a non-linear first (or second) order (depending on the deterministic or stochastic aspect of the process) differential equation called the Hamilton-Jacobi-Bellman (HJB) equation. It is well known that there exists an infinity of generalized solutions (differentiable almost everywhere) to this equation, other than the VF. We show that gradient-descent methods may converge to one of these generalized solutions, thus failing to find the optimal control. In order to solve the HJB equation, we use the powerful framework of viscosity solutions and state that there exists a unique viscosity solution to the HJB equation, which is the value function. Then, we use another main result of VSs (their stability when passing to the limit) to prove the convergence of numerical approximation schemes based on finite difference (FD) and finite element (FE) methods. These methods discretize, at some resolution, the HJB equation into a DP equation of a Markov Decision Process (MDP), which can be solved by DP methods (thanks to a strong contraction property) if all the initial data (the state dynamics and the reinforcement function) were perfectly known. However, in the RL approach, as we consider a system in interaction with some a priori (at least partially) unknown environment, which learns from experience, the initial data are not perfectly known but have to be approximated during learning. The main contribution of this work is to derive a general convergence theorem for RL algorithms when one uses only approximations (in a sense of satisfying some weak contraction property) of the initial data.
This result can be used for model-based or model-free RL algorithms, with off-line or on-line updating methods, for deterministic or stochastic state dynamics (though this latter case is not described here), and based on FE or FD discretization methods. It is illustrated with several RL algorithms and one numerical simulation for the Car on the Hill problem.
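
The contraction property mentioned in this abstract can be illustrated on a toy problem. The following is a minimal sketch, not the paper's scheme: it assumes a hypothetical five-cell deterministic chain (standing in for a discretized state space), a reward of 1 for entering the right-most cell, and a discount factor `gamma`. All names and parameters are illustrative assumptions.

```python
def value_iteration(n_states=5, gamma=0.5, tol=1e-12, max_iter=1000):
    """Value iteration on a hypothetical deterministic chain MDP.

    States 0..n_states-1; the last state is terminal with value 0.
    Actions move one cell left or right (clipped to the grid).
    Entering the terminal state yields reward 1, everything else 0.
    Returns the value estimates and the sup-norm change per sweep.
    """
    terminal = n_states - 1
    values = [0.0] * n_states
    gaps = []
    for _ in range(max_iter):
        new_values = values[:]
        for s in range(terminal):
            candidates = []
            for move in (-1, 1):
                nxt = min(max(s + move, 0), terminal)
                reward = 1.0 if nxt == terminal else 0.0
                candidates.append(reward + gamma * values[nxt])
            new_values[s] = max(candidates)  # Bellman optimality backup
        gap = max(abs(a - b) for a, b in zip(new_values, values))
        gaps.append(gap)
        values = new_values
        if gap < tol:
            break
    return values, gaps


values, gaps = value_iteration()
print(values)  # value of each cell under the optimal policy
print(gaps)    # successive sup-norm changes shrink by at least a factor gamma
```

Because the Bellman operator is a contraction with modulus `gamma` in the sup norm, each entry of `gaps` is at most `gamma` times the previous one, which is what guarantees convergence when the dynamics and rewards are known exactly.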

2.

In this technical note, we revisit the risk-sensitive optimal control problem for Markov jump linear systems (MJLSs). We first demonstrate the inherent difficulty in solving the risk-sensitive optimal control problem even if the system is linear and the cost function is quadratic. This is due to the nonlinear nature of the coupled set of Hamilton-Jacobi-Bellman (HJB) equations, stemming from the presence of the jump process. It thus follows that the standard quadratic form of the value function with a set of coupled Riccati differential equations cannot be a candidate solution to the coupled HJB equations. We subsequently show that there is no equivalence relationship between the problems of risk-sensitive control and H∞ control of MJLSs, which are shown to be equivalent in the absence of any jumps. Finally, we show that neither a large deviation limit nor a risk-neutral limit of the risk-sensitive optimal control problem exists, due to the presence of a nonlinear coupling term in the HJB equations.


3.
This paper considers an infinite horizon investment-consumption model in which a single agent consumes and distributes his wealth between two assets, a bond and a stock. The problem of maximization of the total utility from consumption is treated, when state (amount allocated in assets) and control (consumption, rates of trading) constraints are present. The value function is characterized as the unique viscosity solution of the Hamilton-Jacobi-Bellman equation which, actually, is a Variational Inequality with gradient constraints. Numerical schemes are then constructed in order to compute the value function and the location of the free boundaries of the so-called transaction regions. These schemes are a combination of implicit and explicit schemes; their convergence is obtained from the uniqueness of viscosity solutions to the HJB equation.

4.
We discuss optimal control problems with integral state-control constraints. We rewrite the problem in an equivalent form as an optimal control problem with state constraints for an extended system, and prove that the value function, although possibly discontinuous, is the unique viscosity solution of the constrained boundary value problem for the corresponding Hamilton–Jacobi equation. The state constraint is the epigraph of the minimal solution of a second Hamilton–Jacobi equation. Our framework applies, for instance, to systems with design uncertainties.

5.
In this article, optimal control problems of differential equations with delays are investigated for which the associated Hamilton–Jacobi–Bellman (HJB) equations are nonlinear partial differential equations with delays. This type of HJB equation has not been previously studied and is difficult to solve because the state equations do not possess smoothing properties. We introduce a new notion of viscosity solutions and identify the value functional of the optimal control problems as the unique solution to the associated HJB equations. An analytical example is given as an application.

6.
We consider a controlled Schrödinger equation with a dipolar and a polarizability term, used when the dipolar approximation is not valid. The control is the amplitude of the external electric field, and it acts nonlinearly on the state. We extend to this infinite-dimensional framework techniques previously used by Coron, Grigoriu, Lefter and Turinici for stabilization in finite dimension. We consider a highly oscillating control and prove the semi-global weak $H^2$ stabilization of the averaged system using a Lyapunov function introduced by Nersesyan. Then it is proved that the solutions of the Schrödinger equation and of the averaged equation stay close on every finite time horizon provided that the control is oscillating enough. Combining these two results, we get approximate controllability to the ground state for the polarizability system with explicit controls. Numerical simulations are presented to illustrate those theoretical results.

7.
The Hamilton–Jacobi–Bellman (HJB) equation can be solved to obtain optimal closed-loop control policies for general nonlinear systems. As it is seldom possible to solve the HJB equation exactly for nonlinear systems, either analytically or numerically, methods to build approximate solutions through simulation-based learning have been studied under various names such as neurodynamic programming (NDP) and approximate dynamic programming (ADP). The aspect of learning connects these methods to reinforcement learning (RL), which also tries to learn optimal decision policies through trial-and-error-based learning. This study develops a model-based RL method, which iteratively learns the solution to the HJB and its associated equations. We focus particularly on the control-affine system with a quadratic objective function and the finite horizon optimal control (FHOC) problem with time-varying reference trajectories. The HJB solutions for such systems involve time-varying value, costate, and policy functions subject to boundary conditions. To represent the time-varying HJB solution in high-dimensional state space in a general and efficient way, deep neural networks (DNNs) are employed. It is shown that the use of DNNs, compared to shallow neural networks (SNNs), can significantly improve the performance of a learned policy in the presence of uncertain initial state and state noise. Examples involving a batch chemical reactor and a one-dimensional diffusion-convection-reaction system are used to demonstrate this and other key aspects of the method.
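
For orientation, the structure of the HJB equation in the control-affine, quadratic-cost, finite-horizon setting can be written out as follows. This is a generic textbook form under assumed notation, not taken from the paper: $f$, $g$, $q$, $R$, and the terminal cost $\phi$ are illustrative symbols, with $R$ symmetric positive definite and the time-varying reference absorbed into $q(x,t)$.

```latex
\begin{align}
  \dot{x} &= f(x) + g(x)\,u, \qquad
  J = \phi\bigl(x(T)\bigr) + \int_0^T \bigl( q(x,t) + u^\top R\,u \bigr)\,dt, \\
  -\frac{\partial V}{\partial t}
    &= \min_u \Bigl[\, q(x,t) + u^\top R\,u
       + \nabla_x V^\top \bigl( f(x) + g(x)\,u \bigr) \Bigr],
    \qquad V(x,T) = \phi(x), \\
  u^*(x,t) &= -\tfrac{1}{2}\, R^{-1} g(x)^\top \nabla_x V(x,t), \\
  -\frac{\partial V}{\partial t}
    &= q(x,t) + \nabla_x V^\top f(x)
       - \tfrac{1}{4}\, \nabla_x V^\top g(x)\, R^{-1} g(x)^\top \nabla_x V .
\end{align}
```

The third line follows from setting the gradient of the bracketed expression with respect to $u$ to zero. The costate $\nabla_x V$ therefore determines the policy directly, which is why the value, costate, and policy functions appear together in this setting.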

8.
Bounded operator abstraction is a language construct relevant to object oriented programming languages and to ML2000, the successor to Standard ML. In this paper, we introduce , a variant of F<:ω with this feature and with Cardelli and Wegner’s kernel Fun rule for quantifiers. We define a typed-operational semantics with subtyping and prove that it is equivalent with , using logical relations to prove soundness. The typed-operational semantics provides a powerful and uniform technique to study metatheoretic properties of , such as Church–Rosser, subject reduction, the admissibility of structural rules, and the equivalence with the algorithmic presentation of the system that performs weak-head reductions. Furthermore, we can show decidability of subtyping using the typed-operational semantics and its equivalence with the usual presentation. Hence, this paper demonstrates for the first time that logical relations can be used to show decidability of subtyping.

9.
An approach to solve finite time horizon suboptimal feedback control problems for partial differential equations is proposed by solving dynamic programming equations on adaptive sparse grids. A semi-discrete optimal control problem is introduced and the feedback control is derived from the corresponding value function. The value function can be characterized as the solution of an evolutionary Hamilton–Jacobi–Bellman (HJB) equation which is defined over a state space whose dimension is equal to the dimension of the underlying semi-discrete system. Besides a low dimensional semi-discretization it is important to solve the HJB equation efficiently to address the curse of dimensionality. We propose to apply a semi-Lagrangian scheme using spatially adaptive sparse grids. Sparse grids allow the discretization of the value functions in (higher) space dimensions since the curse of dimensionality of full grid methods arises to a much smaller extent. For additional efficiency an adaptive grid refinement procedure is explored. The approach is illustrated for the wave equation and an extension to equations of Schrödinger type is indicated. We present several numerical examples studying the effect the parameters characterizing the sparse grid have on the accuracy of the value function and the optimal trajectory.
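
As a reminder of what a semi-Lagrangian scheme looks like for a finite-horizon problem, a generic first-order form is sketched below. The notation is assumed rather than taken from the paper: $\ell$ is the running cost, $f$ the semi-discrete dynamics, $\phi$ the terminal cost, $\Delta t$ the time step, $U$ the control set, and $\mathcal{I}$ an interpolation operator on the grid nodes $x_i$ (a sparse grid interpolant in the setting described above).

```latex
\begin{align}
  V^{N}(x_i) &= \phi(x_i), \\
  V^{n}(x_i) &= \min_{u \in U} \Bigl\{ \Delta t\, \ell(x_i, u)
      + \mathcal{I}\bigl[ V^{n+1} \bigr]\bigl( x_i + \Delta t\, f(x_i, u) \bigr) \Bigr\},
      \qquad n = N-1, \dots, 0 .
\end{align}
```

Each step follows the dynamics for one time step from a grid node and interpolates the next-stage value at the arrival point, so the cost per step scales with the number of grid nodes. This is where the reduced node count of a sparse grid matters.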

10.
In this paper, we present an empirical study of iterative least squares minimization of the Hamilton-Jacobi-Bellman (HJB) residual with a neural network (NN) approximation of the value function. Although the nonlinearities in the optimal control problem and NN approximator preclude theoretical guarantees and raise concerns of numerical instabilities, we present two simple methods for promoting convergence, the effectiveness of which is presented in a series of experiments. The first method involves the gradual increase of the horizon time scale, with a corresponding gradual increase in value function complexity. The second method involves the assumption of stochastic dynamics which introduces a regularizing second derivative term to the HJB equation. A gradual reduction of this term provides further stabilization of the convergence. We demonstrate the solution of several problems, including the 4-D inverted-pendulum system with bounded control. Our approach requires no initial stabilizing policy or any restrictive assumptions on the plant or cost function, only knowledge of the plant dynamics. In the Appendix, we provide the equations for first- and second-order differential backpropagation.
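
The regularizing second-derivative term can be made concrete with a generic form of the stochastic HJB equation and the associated least-squares objective. The notation below is an assumption for illustration, not the paper's: dynamics $dx = f(x,u)\,dt + \sigma\,dW$ with isotropic noise level $\sigma$, running cost $\ell$, a network $V_\theta$ with parameters $\theta$, and sample points $(x_i, t_i)$.

```latex
\begin{align}
  -\frac{\partial V}{\partial t}
    &= \min_u \Bigl[\, \ell(x,u) + \nabla_x V^\top f(x,u) \Bigr]
       + \frac{\sigma^2}{2}\, \Delta_x V, \\
  r_\theta(x,t)
    &= \frac{\partial V_\theta}{\partial t}
       + \min_u \Bigl[\, \ell(x,u) + \nabla_x V_\theta^\top f(x,u) \Bigr]
       + \frac{\sigma^2}{2}\, \Delta_x V_\theta, \\
  \theta^\star &\in \arg\min_\theta \sum_i r_\theta(x_i, t_i)^2 .
\end{align}
```

Letting $\sigma \to 0$ recovers the deterministic first-order equation, which matches the gradual reduction of the second-derivative term described in the abstract.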

11.
H∞-like control for nonlinear stochastic systems   Cited by: 1 (self-citations: 0, citations by others: 1)
In this paper we develop an H∞-type theory, from the dissipation point of view, for a large class of time-continuous stochastic nonlinear systems. In particular, we introduce the notion of stochastic dissipative systems analogously to the familiar notion of dissipation associated with deterministic systems and utilize it as a basis for the development of our theory. Having discussed certain properties of stochastic dissipative systems, we consider time-varying nonlinear systems for which we establish a connection between what is called the $L_2$-gain property and the solution to a certain Hamilton–Jacobi inequality (HJI), that may be viewed as a bounded real lemma for stochastic nonlinear systems. The time-invariant case with infinite horizon is also considered, and for this case we synthesize a worst-case-based stabilizing controller. Stability in this case is taken to be in the mean-square sense. In the stationary case, the problem of robust state feedback control is considered in the case of norm-bounded uncertainties. A solution is then derived in terms of linear matrix inequalities.

12.
A central topic in query learning is to determine which classes of Boolean formulas are efficiently learnable with membership and equivalence queries. We consider the class k consisting of conjunctions of k unate DNF formulas. This class generalizes the class of k-clause CNF formulas and the class of unate DNF formulas, both of which are known to be learnable in polynomial time with membership and equivalence queries. We prove that 2 can be properly learned with a polynomial number of polynomial-size membership and equivalence queries, but can be properly learned in polynomial time with such queries if and only if P=NP. Thus the barrier to properly learning 2 with membership and equivalence queries is computational rather than informational. Few results of this type are known. In our proofs, we use recent results of Hellerstein et al. (1997, J. Assoc. Comput. Mach. 43(5), 840–862), characterizing the classes that are polynomial-query learnable, together with work of Bshouty on the monotone dimension of Boolean functions. We extend some of our results to k and pose open questions on learning DNF formulas of small monotone dimension. We also prove structural results for k. We construct, for any fixed k ≥ 2, a class of functions f that cannot be represented by any formula in k, but which cannot be “easily” shown to have this property. More precisely, for any function f on n variables in the class, the value of f on any polynomial-size set of points in its domain is not a witness that f cannot be represented by a formula in k. Our construction is based on BCH codes.

13.
In this paper we study the problem of ergodic impulsive control of Feller processes with costly information. We prove continuity of the value functions for optimal stopping and impulsive control with long run average cost. We characterize the value functions as generalized solutions of the respective quasi-variational inequalities and describe optimal policies. We also study an equation associated with impulsive control with long run average cost.

14.
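This paper studies the finite-time optimal control problem for a class of switched systems with terminal constraints. Because of the terminal constraints, the value function of the optimal control problem is no longer differentiable everywhere and may even be discontinuous. Consequently, the earlier result that the value function on an infinite time horizon is a viscosity solution of the Bensoussan-Lions quasi-variational inequality (QVI) no longer applies. Using the dynamic programming method and the viability theorem, this paper extends solutions of the QVI to the lower semicontinuous case and establishes that the value function of the finite-time optimal switching control problem is a lower semicontinuous solution of the QVI.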

15.
In this paper, we study a contextual labelled transition semantics for higher-order process calculi. The labelled transition semantics is relatively clean and simple, and the corresponding bisimulation equivalence can be easily formulated based on it. In addition, we develop a novel approach to reasoning about the behaviour of a higher-order substituted process P{Q/X}, based on which we can directly prove a very important result, the factorisation theorem. To show the correspondence between our semantics and the well-established ones, we characterize our bisimulation in a version of barbed equivalence.

16.
In this paper we consider the capacitated lot-sizing problem (CLSP) with linear costs. It is known that this problem is NP-hard, but there exist special cases that can be solved in polynomial time. We derive a new $O(T^2)$ algorithm for the CLSP with non-increasing setup costs, general holding costs, non-increasing production costs and non-decreasing capacities over time, where T is the length of the model horizon. We show that in every iteration we do not consider more candidate solutions than the $O(T^2)$ algorithm proposed by [Chung and Lin, 1988. Management Science 34, 420–6]. We also develop a variant of our algorithm that is more efficient in the case of relatively large capacities. Numerical tests show the superior performance of our algorithms compared to the algorithm of [Chung and Lin, 1988. Management Science 34, 420–6].

17.
This paper considers mobile to base station power control for lognormal fading channels in wireless communication systems within a centralized information stochastic optimal control framework. Under a bounded power rate of change constraint, the stochastic control problem and its associated Hamilton-Jacobi-Bellman (HJB) equation are analyzed by the viscosity solution method; then the degenerate HJB equation is perturbed to admit a classical solution and a suboptimal control law is designed based on the perturbed HJB equation. When a quadratic type cost is used without a bound constraint on the control, the value function is a classical solution to the degenerate HJB equation and the feedback control is affine in the system power. In addition, in this case we develop approximate, but highly scalable, solutions to the HJB equation in terms of a local polynomial expansion of the exact solution. When the channel parameters are not known a priori, one can obtain on-line estimates of the parameters and get adaptive versions of the control laws. In numerical experiments with both of the above cost functions, the following phenomenon is observed: whenever the users have different initial conditions, there is an initial convergence of the power levels to a common level and then subsequent approximately equal behavior which converges toward a stochastically varying optimum.

18.
In this paper, we study statistical properties of fluid flows that are actively controlled. Statistical properties such as Lagrangian and Eulerian time-averages are important flow quantities in fluid flows, particularly during mixing processes. Due to the assumption of incompressibility, the transformations in the state space can be described by a sequence of measure preserving transformations on a measure space. The classical Birkhoff's pointwise ergodic theorem does not necessarily apply in the context of sequences of transformations. We call B-regular a sequence for which this theorem holds. Motivated by mixing control concepts, we define three notions of asymptotic equivalence for sequences of transformations. We show an example in which Birkhoff's pointwise ergodic theorem does not hold even when a ‘strong’ asymptotic equivalence to a B-regular sequence is assumed. Under a ‘very strong’ asymptotic equivalence condition, we prove B-regularity. In the context of optimize-then-stabilize strategy for mixing control, we also prove that very strong asymptotic equivalence to a mixing sequence implies mixing. The mean ergodic theorem and the Poincaré recurrence theorem are also proven for sequences of transformations under suitable asymptotic equivalence assumptions.

19.
It is shown that any stabilizing certainty equivalence control used within an adaptive control system causes the familiar interconnection of a controlled process and associated output estimator to be detectable through the estimator’s output error $e_p$, for every frozen value of the index or parameter vector p upon which both the estimator and controller dynamics depend. The fact that certainty equivalence implies detectability has been known for some time – this has been shown to be so whenever the process model is linear and the controller and estimator models are also linear for every frozen value of p. In this paper, use is made of recently introduced concepts of input-to-state stability and detectability for nonlinear systems to prove that the same implication is valid in a more general, nonlinear setting.

20.
A method is presented for solving the infinite time Hamilton-Jacobi-Bellman (HJB) equation for certain state-constrained stochastic problems. The HJB equation is reformulated as an eigenvalue problem, such that the principal eigenvalue corresponds to the expected cost per unit time, and the corresponding eigenfunction gives the value function (up to an additive constant) for the optimal control policy. The eigenvalue problem is linear and hence there are fast numerical methods available for finding the solution.
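
The idea of reading the average cost and the value function off a linear eigenvalue problem can be illustrated with a discrete analogue. The sketch below is not the paper's method: it swaps in a finite-state, linearly solvable average-cost problem, where an assumed passive transition matrix `passive` and state costs `state_cost` define the matrix M = diag(exp(-q)) P. Its principal eigenvalue λ gives the average cost as -log λ, and its eigenvector z gives the value function as -log z, up to an additive constant. All names are hypothetical.

```python
import math


def principal_eigenpair(matrix, max_iter=10_000, tol=1e-13):
    """Power iteration for the principal eigenpair of a positive matrix.

    The eigenvector is kept normalized to sum to 1, so the sum of the
    product M z is the current eigenvalue estimate.
    """
    n = len(matrix)
    z = [1.0 / n] * n
    lam = 0.0
    for _ in range(max_iter):
        w = [sum(row[j] * z[j] for j in range(n)) for row in matrix]
        new_lam = sum(w)
        z = [wi / new_lam for wi in w]
        converged = abs(new_lam - lam) < tol
        lam = new_lam
        if converged:
            break
    return lam, z


def average_cost_solution(passive, state_cost):
    """Average cost and relative values from the linear eigenproblem."""
    # M = diag(exp(-q)) P: scale each row of P by exp(-cost of that state).
    m = [[math.exp(-state_cost[i]) * p for p in passive[i]]
         for i in range(len(passive))]
    lam, z = principal_eigenpair(m)
    avg_cost = -math.log(lam)               # cost per step
    values = [-math.log(zi) for zi in z]    # defined up to a constant
    return avg_cost, values


# Hypothetical two-state example: uniform passive dynamics,
# state 1 costs log(2) more per step than state 0.
passive = [[0.5, 0.5], [0.5, 0.5]]
state_cost = [0.0, math.log(2.0)]
avg_cost, values = average_cost_solution(passive, state_cost)
print(avg_cost, values[1] - values[0])
```

Power iteration is used here only because it needs no external libraries. Any standard linear eigensolver would serve, which is the practical benefit of the linear reformulation the abstract points to.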
