Similar Documents
20 similar documents found.
1.
We consider the revenue management problem of capacity control under customer choice behavior. An exact solution of the underlying stochastic dynamic program is difficult because of the multi-dimensional state space and, thus, approximate dynamic programming (ADP) techniques are widely used. The key idea of ADP is to encode the multi-dimensional state space by a small number of basis functions, often leading to a parametric approximation of the dynamic program’s value function. In general, two classes of ADP techniques for learning value function approximations exist: mathematical programming and simulation. So far, the literature on capacity control largely focuses on the first class. In this paper, we develop a least squares approximate policy iteration (API) approach which belongs to the second class. Thereby, we suggest value function approximations that are linear in the parameters, and we estimate the parameters via linear least squares regression. Exploiting both exact and heuristic knowledge of the value function, we enforce structural constraints on the parameters to facilitate learning a good policy. We perform an extensive simulation study to investigate the performance of our approach. The results show that it obtains revenues competitive with, and often exceeding, those of state-of-the-art capacity control methods in reasonable computational time. Depending on the scarcity of capacity and the point in time, revenue improvements of around 1% or more can be observed. Furthermore, the proposed approach contributes to simulation-based ADP, bringing forth research on numerically estimating piecewise linear value function approximations and their application in revenue management environments.
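A minimal sketch of the kind of simulation-based estimation step this abstract describes: sample states, observe noisy value estimates, and fit a parameter vector that is linear in the basis functions by least squares. The basis functions, state layout (remaining capacity and time), and synthetic data below are illustrative assumptions, not the paper's actual design.

```python
import numpy as np

# Fit a value function approximation that is linear in its parameters from
# simulated (state, observed value) pairs, as in least squares approximate
# policy iteration. Basis functions here are illustrative assumptions.

def basis(states):
    """Map states (capacity c, time t) to basis functions [1, c, t, c*t]."""
    c, t = states[:, 0], states[:, 1]
    return np.column_stack([np.ones_like(c), c, t, c * t])

rng = np.random.default_rng(0)
states = rng.uniform([0, 0], [100, 50], size=(500, 2))   # sampled (capacity, time)
v_observed = (3.0 * states[:, 0] + 0.5 * states[:, 0] * states[:, 1] / 50
              + rng.normal(scale=5.0, size=500))          # noisy value observations

Phi = basis(states)
theta, *_ = np.linalg.lstsq(Phi, v_observed, rcond=None)  # least squares fit

def v_hat(state):
    return basis(np.atleast_2d(state)) @ theta             # approximate value
```

In the paper's setting, structural constraints (e.g., monotonicity of the value function in remaining capacity) would additionally be imposed on `theta`; the unconstrained solve above only shows the basic regression step.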

2.
Approximate dynamic programming (ADP) commonly employs value function approximation to numerically solve complex dynamic programming problems. A statistical perspective of value function approximation employs a design and analysis of computer experiments (DACE) approach, where the “computer experiment” yields points on the value function curve. The DACE approach has been used to numerically solve high-dimensional, continuous-state stochastic dynamic programming, and performs two tasks primarily: (1) design of experiments and (2) statistical modeling. The use of design of experiments enables more efficient discretization. However, identifying the appropriate sample size is not straightforward. Furthermore, identifying the appropriate model structure is a well-known problem in the field of statistics. In this paper, we present a sequential method that can adaptively determine both sample size and model structure. Number-theoretic methods (NTM) are used to sequentially grow the experimental design because of their ability to fill the design space. Feed-forward neural networks (NNs) are used for statistical modeling because of their adjustable structural complexity. This adaptive value function approximation (AVFA) method must be automated to enable efficient implementation within ADP. An AVFA algorithm is introduced that increments the size of the state space training data in each sequential step and, for each sample size, performs a successive model search to find an optimal NN model. The new algorithm is tested on a nine-dimensional inventory forecasting problem.
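The following sketch illustrates the overall AVFA loop under stated assumptions: a Halton low-discrepancy sequence stands in for the paper's number-theoretic design growth, scikit-learn's MLPRegressor stands in for its feed-forward NN model search, and `true_value` is a hypothetical placeholder for the expensive DP "computer experiment".

```python
import numpy as np
from scipy.stats import qmc
from sklearn.neural_network import MLPRegressor

def true_value(x):                        # hypothetical stand-in for the DP solve
    return np.sin(x @ np.arange(1, x.shape[1] + 1))

dim = 9                                   # e.g., a nine-dimensional state space
halton = qmc.Halton(d=dim, seed=1)        # space-filling, sequentially extensible
X, best = np.empty((0, dim)), None
for step in range(4):                     # sequential design growth
    X = np.vstack([X, halton.random(64)])
    y = true_value(X)
    for width in (8, 16, 32):             # successive model-structure search
        nn = MLPRegressor((width,), max_iter=2000, random_state=0).fit(X, y)
        err = np.mean((nn.predict(X) - y) ** 2)  # a real implementation would
        if best is None or err < best[0]:        # score on a validation split
            best = (err, width, step)
```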

3.
We address the problem of determining optimal stepsizes for estimating parameters in the context of approximate dynamic programming. The sufficient conditions for convergence of the stepsize rules have been known for 50 years, but practical computational work tends to use formulas with parameters that have to be tuned for specific applications. The problem is that in most applications in dynamic programming, observations for estimating a value function typically come from a data series that can be initially highly transient. The degree of transience affects the choice of stepsize parameters that produce the fastest convergence. In addition, the degree of initial transience can vary widely among the value function parameters for the same dynamic program. This paper reviews the literature on deterministic and stochastic stepsize rules, and derives formulas for optimal stepsizes that minimize estimation error. These formulas assume certain parameters are known, and an approximation is proposed for the case where they are unknown. Experimental work shows that the approximation provides faster convergence than other popular formulas.
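As a hedged illustration of the setting, the sketch below applies two standard stepsize rules from this literature (generalized harmonic and McClain's rule; the tunable constants are illustrative) to the usual smoothing update on an initially transient series:

```python
import numpy as np

# Smoothing update  v_n = (1 - a_n) * v_{n-1} + a_n * x_n  with two standard
# stepsize rules. Constants are illustrative tuning choices.

def harmonic(n, a=10.0):
    """Generalized harmonic rule: larger `a` slows the decline of the stepsize."""
    return a / (a + n - 1)

def mcclain(alpha_prev, target=0.1):
    """McClain's rule: stepsizes decline but converge to the constant `target`."""
    return alpha_prev / (1.0 + alpha_prev - target)

rng = np.random.default_rng(0)
v, alpha = 0.0, 1.0
for n in range(1, 101):
    x = 10.0 * (1 - np.exp(-n / 20)) + rng.normal()  # initially transient series
    alpha = mcclain(alpha)                           # or: alpha = harmonic(n)
    v = (1 - alpha) * v + alpha * x                  # smoothed estimate
```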

4.
Polynomial-time approximation algorithms with nontrivial performance guarantees are presented for the problems of (a) partitioning the vertices of a weighted graph into k blocks so as to maximize the weight of crossing edges, and (b) partitioning the vertices of a weighted graph into two blocks of equal cardinality, again so as to maximize the weight of crossing edges. The approach, pioneered by Goemans and Williamson, is via a semidefinite programming relaxation. The first author was supported in part by NSF Grant CCR-9225008. The work described here was undertaken while the second author was visiting Carnegie Mellon University; at that time he was a Nuffield Science Research Fellow, and was supported in part by Grant GR/F 90363 of the UK Science and Engineering Research Council, and Esprit Working Group 7097 “RAND”.
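A small sketch of the rounding step in this approach, for the two-block case: given unit vectors produced by a solved semidefinite relaxation (the SDP solve itself, e.g. with an interior-point solver, is assumed and not shown), vertices are split by the side of a random hyperplane and the best of several roundings is kept. The graph data here is synthetic.

```python
import numpy as np

def round_hyperplane(V, W, rng):
    """V: (n, d) unit vectors from the SDP; W: edge weights with W[i, j] for i < j."""
    r = rng.normal(size=V.shape[1])          # random hyperplane normal
    side = V @ r >= 0                        # block assignment by hyperplane side
    cut = sum(W[i, j]
              for i in range(len(side))
              for j in range(i + 1, len(side))
              if side[i] != side[j])         # weight of crossing edges
    return side, cut

rng = np.random.default_rng(0)
V = rng.normal(size=(6, 6))
V /= np.linalg.norm(V, axis=1, keepdims=True)   # placeholder "SDP" vectors
W = np.triu(rng.random((6, 6)), 1)
best = max((round_hyperplane(V, W, rng) for _ in range(20)), key=lambda t: t[1])
```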

5.
Automatica, 2014, 50(12): 3281–3290
This paper addresses the model-free nonlinear optimal control problem based on data by introducing the reinforcement learning (RL) technique. It is known that the nonlinear optimal control problem relies on the solution of the Hamilton–Jacobi–Bellman (HJB) equation, a nonlinear partial differential equation that is generally impossible to solve analytically. Even worse, most practical systems are too complicated to model accurately. To overcome these difficulties, we propose a data-based approximate policy iteration (API) method that uses real system data rather than a system model. First, a model-free policy iteration algorithm is derived and its convergence is proved. The implementation of the algorithm is based on the actor–critic structure, where actor and critic neural networks (NNs) are employed to approximate the control policy and cost function, respectively. To update the weights of the actor and critic NNs, a least-squares approach is developed based on the method of weighted residuals. The data-based API is an off-policy RL method, where “exploration” is improved by arbitrarily sampling data on the state and input domain. Finally, we test the data-based API control design method on a simple nonlinear system, and further apply it to a rotational/translational actuator system. The simulation results demonstrate the effectiveness of the proposed method.

6.
The segment minimization problem consists of representing an integer matrix as the sum of the fewest integer matrices, each of which has the property that the non-zeroes in each row are consecutive. This has direct applications to an effective form of cancer treatment. Using several insights, we extend previous results to obtain constant-factor improvements in the approximation guarantees. We show that these improvements yield better performance by providing an experimental evaluation of all known approximation algorithms using both synthetic and real-world clinical data. Our algorithms are superior for 76% of instances, and we argue for their utility alongside the heuristic approaches used in practice.
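For intuition, a standard single-row observation from this literature: decomposing one row into unit-weight segments (0/1 rows with consecutive ones) takes exactly the sum of the row's positive increments. The sketch below computes that count; the weighted, full-matrix problem studied in the paper is NP-hard and not solved here.

```python
def unit_segments(row):
    """Minimum number of unit-weight consecutive-ones segments summing to `row`."""
    prev, count = 0, 0
    for a in row:
        if a > prev:
            count += a - prev   # this many new segments must open at this column
        prev = a
    return count

print(unit_segments([0, 2, 3, 1, 4, 0]))  # -> 2 + 1 + 3 = 6
```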

7.
Dyna is an effective reinforcement learning (RL) approach that combines value function evaluation with model learning. However, existing works on Dyna mostly discuss only its efficiency in RL problems with discrete action spaces. This paper proposes a novel Dyna variant, called Dyna-LSTD-PA, aiming to handle problems with continuous action spaces. Dyna-LSTD-PA stands for Dyna based on least-squares temporal difference (LSTD) and policy approximation. Dyna-LSTD-PA consists of two simultaneous, interacting processes. The learning process determines the probability distribution over action spaces using the Gaussian distribution; estimates the underlying value function, policy, and model by linear representation; and updates their parameter vectors online by LSTD(λ). The planning process updates the parameter vector of the value function again by using offline LSTD(λ). Dyna-LSTD-PA also uses the Sherman–Morrison formula to improve the efficiency of LSTD(λ), and weights the parameter vector of the value function to bring the two processes together. Theoretically, the global error bound is derived by considering approximation, estimation, and model errors. Experimentally, Dyna-LSTD-PA outperforms two representative methods in terms of convergence rate, success rate, and stability performance on four benchmark RL problems.
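The Sherman–Morrison device mentioned here can be sketched concretely: rather than re-inverting the LSTD(λ) matrix A after each rank-one update, its inverse is maintained directly. The feature vectors, rewards, and initialization below are illustrative, not Dyna-LSTD-PA itself.

```python
import numpy as np

# Recursive LSTD(lambda) with a Sherman-Morrison rank-one inverse update:
# (A + u v^T)^{-1} = A^{-1} - A^{-1} u v^T A^{-1} / (1 + v^T A^{-1} u).

def lstd_step(A_inv, b, z, phi, phi_next, reward, gamma):
    u = z                                   # eligibility trace (column direction)
    v = phi - gamma * phi_next              # temporal-difference feature direction
    Au = A_inv @ u
    A_inv = A_inv - np.outer(Au, v @ A_inv) / (1.0 + v @ Au)
    b = b + reward * z
    return A_inv, b

k = 4                                       # number of features (illustrative)
A_inv = np.eye(k) / 1e-3                    # inverse of A0 = epsilon * I
b, z = np.zeros(k), np.zeros(k)
gamma, lam = 0.95, 0.7
rng = np.random.default_rng(0)
phi = rng.random(k)
for _ in range(100):
    phi_next, reward = rng.random(k), rng.random()   # placeholder transitions
    z = gamma * lam * z + phi               # accumulate the eligibility trace
    A_inv, b = lstd_step(A_inv, b, z, phi, phi_next, reward, gamma)
    phi = phi_next
theta = A_inv @ b                           # value function weights
```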

8.
The maximum weight matching problem is a fundamental problem in graph theory with a variety of important applications. Recently Manne and Mjelde presented the first self-stabilizing algorithm computing a 2-approximation of the optimal solution. They established that their algorithm stabilizes after O(2^n) (resp. O(3^n)) moves under a central (resp. distributed) scheduler. This paper contributes a new analysis, improving these bounds considerably. In particular, it is shown that the algorithm stabilizes after O(nm) moves under the central scheduler and that a modified version of the algorithm also stabilizes after O(nm) moves under the distributed scheduler. The paper presents a new proof technique based on graph reduction for analyzing the complexity of self-stabilizing algorithms.

9.
This study investigates the global optimality of approximate dynamic programming (ADP) based solutions using neural networks for optimal control problems with fixed final time. Issues including whether or not the cost function terms and the system dynamics need to be convex functions with respect to their respective inputs are discussed and sufficient conditions for global optimality of the result are derived. Next, a new idea is presented to use ADP with neural networks for optimization of non-convex smooth functions. It is shown that any initial guess leads to direct movement toward the proximity of the global optimum of the function. This behavior is in contrast with gradient based optimization methods in which the movement is guided by the shape of the local level curves. Illustrative examples are provided with single and multi-variable functions that demonstrate the potential of the proposed method.

10.
We investigate the optimal control of a stochastic system in the presence of both exogenous (control-independent) stochastic state variables and endogenous (control-dependent) state variables. Our solution approach relies on simulations and regressions with respect to the state variables, but also grafts the endogenous state variable into the simulation paths. That is, unlike most other simulation approaches found in the literature, no discretization of the endogenous variable is required. The approach is meant to handle several stochastic variables, offers a high level of flexibility in their modeling, and should be at its best in non-time-homogeneous cases, when the optimal policy structure changes with time. We provide numerical results for a dam-based hydropower application, where the exogenous variable is the stochastic spot price of power, and the endogenous variable is the water level in the reservoir.
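The regression ingredient can be sketched as follows, under illustrative assumptions: at one backward-induction stage, the continuation value E[V_{t+1} | price, level] is estimated by least squares on polynomial features of the exogenous price and the endogenous reservoir level, so the level enters the regression directly rather than through a discretized grid.

```python
import numpy as np

def poly_features(price, level, deg=2):
    """All monomials price^i * level^j with i + j <= deg."""
    return np.column_stack([price**i * level**j
                            for i in range(deg + 1)
                            for j in range(deg + 1 - i)])

rng = np.random.default_rng(0)
price = rng.lognormal(mean=3.0, sigma=0.3, size=2000)      # simulated spot prices
level = rng.uniform(0.0, 1.0, size=2000)                   # grafted reservoir levels
v_next = price * level + rng.normal(scale=1.0, size=2000)  # placeholder next values

X = poly_features(price, level)
beta, *_ = np.linalg.lstsq(X, v_next, rcond=None)          # regression step

def continuation(p, l):
    return poly_features(np.atleast_1d(p), np.atleast_1d(l)) @ beta
```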

11.
In this work, a “policy iteration algorithm” (PIA) that does not require a mathematical model of the plant is applied to control arterial oxygen saturation. The technique is based on nonlinear optimal control and solves the Hamilton–Jacobi–Bellman equation. The controller is synthesized in a state feedback configuration based on an unidentified model of the complex pathophysiology of the pulmonary system in order to control gas exchange in ventilated patients, since under some circumstances (such as emergencies) a proper, individualized model for designing and tuning controllers may not be available in time. The simulation results demonstrate optimal control of oxygenation by the proposed PIA, which iteratively evaluates the Hamiltonian cost functions and synthesizes the control actions until the converged optimal criteria are achieved. Furthermore, as a practical example, we examine the performance of this control strategy on an interconnected three-tank system as a real nonlinear system.

12.
In this paper, a finite-horizon neuro-optimal tracking control strategy for a class of discrete-time nonlinear systems is proposed. Through system transformation, the optimal tracking problem is converted into designing a finite-horizon optimal regulator for the tracking error dynamics. Then, with convergence analysis in terms of cost function and control law, the iterative adaptive dynamic programming (ADP) algorithm via the heuristic dynamic programming (HDP) technique is introduced to obtain the finite-horizon optimal tracking controller, which brings the cost function to within an ε-error bound of its optimal value. Three neural networks are used as parametric structures to implement the algorithm, approximating the cost function, the control law, and the error dynamics, respectively. Two simulation examples are included to complement the theoretical discussions.

13.
In this article, an adaptive critic scheme with a novel performance index function is developed to solve the tracking control problem; it eliminates the tracking error and possesses an adjustable convergence rate in the offline learning process. Under some conditions, the convergence and monotonicity of the accelerated value function sequence can be guaranteed. Combining the advantages of the adjustable and general value iteration schemes, an integrated algorithm with guaranteed fast convergence is proposed, which involves two stages, namely the acceleration stage and the convergence stage. Moreover, an effective approach is given to adaptively determine the acceleration interval. With this operation, the fast convergence of the new value iteration scheme can be fully utilized. Finally, compared with general value iteration, numerical results are presented to verify the fast convergence and the tracking performance of the developed adaptive critic design.

14.
In this paper, we analyse the convergence and stability properties of generalised policy iteration (GPI) applied to discrete-time linear quadratic regulation problems. GPI is one kind of generalised adaptive dynamic programming method used for solving optimal control problems, and is composed of policy evaluation and policy improvement steps. To analyse the convergence and stability of GPI, the dynamic programming (DP) operator is defined. Then, GPI and its equivalent formulas are presented in terms of the DP operator. The convergence of the approximate value function to the exact one in policy evaluation is proven based on the equivalent formulas. Furthermore, the positive semi-definiteness, stability, and monotone convergence (PI-mode and VI-mode convergence) of GPI are established under certain conditions on the initial value function. An online least-squares method is also presented for the implementation of GPI. Finally, some numerical simulations are carried out to verify the effectiveness of GPI and to further investigate its convergence and stability properties.
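For the PI-mode extreme of GPI on a concrete LQR instance, policy evaluation reduces to a discrete Lyapunov equation and policy improvement to a matrix formula. The sketch below assumes illustrative system matrices and a stabilizing initial gain:

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

# Policy iteration for discrete-time LQR: minimize sum of x'Qx + u'Ru
# subject to x_{k+1} = A x_k + B u_k, with policy u = -K x.

A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)

K = np.array([[1.0, 2.0]])                  # assumed stabilizing initial policy
for _ in range(30):
    Acl = A - B @ K
    # policy evaluation: solve P = Acl' P Acl + Q + K' R K
    P = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)
    # policy improvement: K = (R + B' P B)^{-1} B' P A
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
print(K)                                    # converges to the optimal LQR gain
```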

15.
Motivated by the problem in computational biology of reconstructing the series of chromosome inversions by which one organism evolved from another, we consider the problem of computing the shortest series of reversals that transform one permutation to another. The permutations describe the order of genes on corresponding chromosomes, and a reversal takes an arbitrary substring of elements and reverses their order. For this problem, we develop two algorithms: a greedy approximation algorithm that finds a solution provably close to optimal in O(n^2) time and O(n) space for n-element permutations, and a branch-and-bound exact algorithm that finds an optimal solution in O(mL(n, n)) time and O(n^2) space, where m is the size of the branch-and-bound search tree and L(n, n) is the time to solve a linear program of n variables and n constraints. The greedy algorithm is the first to come within a constant factor of the optimum; it guarantees a solution that uses no more than twice the minimum number of reversals. The lower and upper bounds of the branch-and-bound algorithm are a novel application of maximum-weight matchings, shortest paths, and linear programming. In a series of experiments, we study the performance of an implementation on random permutations, and permutations generated by random reversals. For permutations differing by k random reversals, we find that the average upper bound on reversal distance estimates k to within one reversal for k < n/2 and n < 100. For the difficult case of random permutations, we find that the average difference between the upper and lower bounds is less than three reversals for n < 50. Due to the tightness of these bounds, we can solve, to optimality, problems on 30 elements in a few minutes of computer time. This approaches the scale of mitochondrial genomes. This research was supported by a postdoctoral fellowship from the Program in Mathematics and Molecular Biology of the University of California at Berkeley under National Science Foundation Grant DMS-8720208, and by a fellowship from the Centre de recherches mathématiques of the Université de Montréal. This research was supported by grants from the Natural Sciences and Engineering Research Council of Canada, and the Fonds pour la formation de chercheurs et l'aide à la recherche (Québec). The author is a fellow of the Canadian Institute for Advanced Research.
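A toy rendering of the breakpoint-greedy idea behind the approximation algorithm: repeatedly apply the reversal that removes the most breakpoints (adjacent pairs that are not consecutive integers, with sentinels added). This is an illustration only; the paper's 2-approximation additionally relies on careful handling of decreasing strips.

```python
def breakpoints(p):
    """Count adjacent pairs that are not consecutive integers, with sentinels."""
    ext = [0] + list(p) + [len(p) + 1]
    return sum(abs(a - b) != 1 for a, b in zip(ext, ext[1:]))

def greedy_sort_by_reversals(p):
    """Greedily pick the reversal minimizing the resulting breakpoint count."""
    p, moves = list(p), []
    while breakpoints(p) > 0:
        _, i, j = min((breakpoints(p[:i] + p[i:j][::-1] + p[j:]), i, j)
                      for i in range(len(p))
                      for j in range(i + 2, len(p) + 1))
        p = p[:i] + p[i:j][::-1] + p[j:]
        moves.append((i, j))
    return moves

print(greedy_sort_by_reversals([3, 1, 2, 4]))  # -> [(0, 2), (1, 3)]
```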

16.
Given a digital contour with N vertices, this paper discusses selecting k of them to construct an approximating polygon such that the loss of shape information of the object contour is minimized. The contributions are: 1) the polygonal approximation aims to minimize the loss of the contour's shape information, whereas traditional methods minimize the area difference between the approximating polygon and the original figure; 2) each point is regarded as carrying a certain amount of shape information, and a 0-1 programming model minimizing the lost shape information is established; 3) two methods are used to perform the polygonal approximation, and their results are compared.

17.
Given a digital contour with N vertices, this paper discusses selecting k of them to construct an approximating polygon such that the loss of shape information of the object contour is minimized. The contributions are: 1) the polygonal approximation aims to minimize the loss of the contour's shape information, whereas traditional methods minimize the area difference between the approximating polygon and the original figure; 2) each point is regarded as carrying a certain amount of shape information, and a 0-1 programming model minimizing the lost shape information is established; 3) two methods are used to perform the polygonal approximation, and their results are compared.
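A loose sketch of the selection idea shared by this entry and the previous one, under a strong simplifying assumption: the turning angle at each vertex serves as a crude proxy for its "shape information", and the k highest-scoring vertices form the approximating polygon. The paper instead solves a 0-1 program minimizing the lost shape information; this greedy stand-in is only illustrative.

```python
import numpy as np

def turning_angle(prev_pt, pt, next_pt):
    """Angle between incoming and outgoing edges at a contour vertex."""
    a, b = pt - prev_pt, next_pt - pt
    cosang = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.arccos(np.clip(cosang, -1.0, 1.0))

def approximate_contour(points, k):
    n = len(points)
    scores = [turning_angle(points[i - 1], points[i], points[(i + 1) % n])
              for i in range(n)]                      # per-vertex "information"
    keep = sorted(np.argsort(scores)[-k:])            # k most informative vertices
    return points[keep]

theta = np.linspace(0, 2 * np.pi, 40, endpoint=False)
contour = np.column_stack([np.cos(theta), 0.6 * np.sin(theta)])
print(approximate_contour(contour, 8))
```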

18.
This paper investigates the choice of function approximator for an approximate dynamic programming (ADP) based control strategy. The ADP strategy allows the user to derive an improved control policy given a simulation model and some starting control policy (or alternatively, closed-loop identification data), while circumventing the ‘curse of dimensionality’ of the traditional dynamic programming approach. In ADP, one fits a function approximator to state vs. ‘cost-to-go’ data and solves the Bellman equation with the approximator in an iterative manner. A proper choice and design of function approximator is critical for convergence of the iteration and for the quality of the final learned control policy, because approximation error can grow quickly in the loop of optimization and function approximation. Typical classes of approximators used in related approaches are parameterized global approximators (e.g. artificial neural networks) and nonparametric local averagers (e.g. k-nearest neighbor). In this paper, we assert, on the basis of some case studies and a theoretical result, that a certain type of local averager should be preferred over global approximators, as the former ensures monotonic convergence of the iteration. However, a converged cost-to-go function does not necessarily lead to a stable control policy online, due to the problem of over-extrapolation. To cope with this difficulty, we propose that a penalty term be included in the objective function of each minimization to discourage the optimizer from finding a solution in regions of the state space where the local data density is inadequately low. A nonparametric density estimator, which can be naturally combined with a local averager, is employed for this purpose.
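The proposed combination can be sketched with off-the-shelf pieces: a k-nearest-neighbor regressor as the local averager for the cost-to-go, and a kernel density estimate whose low-density regions incur a penalty, steering the optimizer back toward well-sampled states. Data, bandwidth, and the penalty form are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import KernelDensity, KNeighborsRegressor

rng = np.random.default_rng(0)
states = rng.uniform(-1, 1, size=(400, 2))               # sampled states
cost_to_go = np.sum(states**2, axis=1) + rng.normal(scale=0.05, size=400)

averager = KNeighborsRegressor(n_neighbors=10).fit(states, cost_to_go)
density = KernelDensity(bandwidth=0.2).fit(states)
tau = np.median(density.score_samples(states))           # typical in-sample density

def penalized_cost(x, weight=1.0):
    x = np.atleast_2d(x)
    log_dens = density.score_samples(x)                  # log local data density
    # low density -> large penalty, discouraging over-extrapolation
    return averager.predict(x) + weight * np.maximum(0.0, tau - log_dens)

print(penalized_cost([0.1, -0.2]), penalized_cost([5.0, 5.0]))
```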

19.
In this paper, a comparative analysis of the performance of the Genetic Algorithm (GA) and Directed Grid Search (DGS) methods for optimal parametric design is presented. A genetic algorithm is a guided random search mechanism based on the principles of natural selection and population genetics. The Directed Grid Search method uses a selective directed search of grid points in the direction of descent to find the minimum of a real function, when an initial estimate of the location of the minimum and the bounds of the design variables are specified. An experimental comparison and a discussion of the performance of these two methods on a set of eight test functions are presented.

20.
This study investigates the application of learning-based and simulation-based Approximate Dynamic Programming (ADP) approaches to an inventory problem under the Generalized Autoregressive Conditional Heteroscedasticity (GARCH) model. Specifically, we explore the robustness of a learning-based ADP method, Sarsa, with a GARCH(1,1) demand model, and provide an empirical comparison between Sarsa and two simulation-based ADP methods: Rollout and Hindsight Optimization (HO). Our findings assuage a concern regarding the effect of GARCH(1,1) latent state variables on learning-based ADP and provide practical strategies for designing an appropriate ADP method for inventory problems. In addition, we expose a relationship between ADP parameters and conservative behavior. Our empirical results are based on a variety of problem settings, including demand correlations, demand variances, and cost structures.
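A minimal sketch of the demand side of this setup: a GARCH(1,1) recursion drives the conditional variance of demand shocks, and that variance is a latent state which a learning method such as Sarsa never observes directly. Parameter values are illustrative.

```python
import numpy as np

# GARCH(1,1) demand:  sigma2_t = omega + alpha * eps_{t-1}^2 + beta * sigma2_{t-1},
# demand_t = base + eps_t with eps_t ~ N(0, sigma2_t). Requires alpha + beta < 1.

def simulate_garch_demand(T, base=50.0, omega=1.0, alpha=0.1, beta=0.85, seed=0):
    rng = np.random.default_rng(seed)
    sigma2, eps = omega / (1 - alpha - beta), 0.0   # start at unconditional variance
    demands = []
    for _ in range(T):
        sigma2 = omega + alpha * eps**2 + beta * sigma2  # latent state variable
        eps = rng.normal(scale=np.sqrt(sigma2))
        demands.append(max(0.0, base + eps))             # non-negative demand
    return np.array(demands)

print(simulate_garch_demand(10))
```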
