Similar literature: 20 matching records found.
1.
The sensitivity-based optimization of Markov systems has become an increasingly important area. From the perspective of performance sensitivity analysis, policy-iteration algorithms and gradient estimation methods can be directly obtained for Markov decision processes (MDPs). In this correspondence, the sensitivity-based optimization is extended to average reward partially observable MDPs (POMDPs). We derive the performance-difference and performance-derivative formulas of POMDPs. On the basis of the performance-derivative formula, we present a new method to estimate the performance gradients. From the performance-difference formula, we obtain a sufficient optimality condition without the discounted reward formulation. We also propose a policy-iteration algorithm to obtain a nearly optimal finite-state-controller policy.
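As a concrete illustration of the sensitivity-based viewpoint, the sketch below implements policy iteration for a fully observed, finite average-reward MDP using performance potentials obtained from the Poisson equation. It is not the paper's finite-state-controller POMDP algorithm; the data layout (`P_sa`, `r_sa`) and the dense linear-algebra solves are assumptions made for illustration.

```python
import numpy as np

def stationary_distribution(P):
    """Stationary distribution pi with pi P = pi and sum(pi) = 1."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.zeros(n + 1); b[-1] = 1.0
    return np.linalg.lstsq(A, b, rcond=None)[0]

def potentials(P, r):
    """Average reward eta and potentials g solving (I - P) g + eta e = r with pi g = 0."""
    n = P.shape[0]
    pi = stationary_distribution(P)
    eta = pi @ r
    g = np.linalg.solve(np.eye(n) - P + np.outer(np.ones(n), pi), r - eta)
    return eta, g

def policy_iteration(P_sa, r_sa, max_iter=100):
    """P_sa[a] is the n x n transition matrix under action a; r_sa[a] the reward vector."""
    n = P_sa[0].shape[0]
    policy = np.zeros(n, dtype=int)
    for _ in range(max_iter):
        P = np.array([P_sa[policy[s]][s] for s in range(n)])
        r = np.array([r_sa[policy[s]][s] for s in range(n)])
        eta, g = potentials(P, r)
        # improvement step: greedy with respect to r(s, a) + sum_s' P(s'|s, a) g(s')
        Q = np.array([[r_sa[a][s] + P_sa[a][s] @ g for a in range(len(P_sa))]
                      for s in range(n)])
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy, eta
```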

2.
Semi-Markov decision problems and performance sensitivity analysis
Recent research indicates that Markov decision processes (MDPs) can be viewed from a sensitivity point of view; and the perturbation analysis (PA), MDPs, and reinforcement learning (RL) are three closely related areas in optimization of discrete-event dynamic systems that can be modeled as Markov processes. The goal of this paper is two-fold. First, we develop the PA theory for semi-Markov processes (SMPs); and then we extend the aforementioned results about the relation among PA, MDP, and RL to SMPs. In particular, we show that performance sensitivity formulas and policy iteration algorithms of semi-Markov decision processes can be derived based on the performance potential and realization matrix. Both the long-run average and discounted-cost problems are considered. This approach provides a unified framework for both problems, and the long-run average problem corresponds to the discount factor being zero. The results indicate that performance sensitivities and optimization depend only on first-order statistics. Single sample path-based implementations are discussed.

3.
It is well known that stochastic control systems can be viewed as Markov decision processes (MDPs) with continuous state spaces. In this paper, we propose to apply the policy iteration approach in MDPs to the optimal control problem of stochastic systems. We first provide an optimality equation based on performance potentials and develop a policy iteration procedure. Then we apply policy iteration to the jump linear quadratic problem and obtain the coupled Riccati equations for their optimal solutions. The approach is applicable to linear as well as nonlinear systems and can be implemented online on real-world systems without identifying all the system structure and parameters.

4.
We introduce and analyze several new policy iteration type algorithms for average cost Markov decision processes (MDPs). We limit attention to “recurrent state” processes where there exists a state which is recurrent under all stationary policies, and our analysis applies to finite-state problems with compact constraint sets, continuous transition probability functions, and lower-semicontinuous cost functions. The analysis makes use of an underlying relationship between recurrent state MDPs and the so-called stochastic shortest path problems of Bertsekas and Tsitsiklis (Math. Oper. Res. 16(3) (1991) 580). After extending this relationship, we establish the convergence of the new policy iteration type algorithms either to optimality or to within any ε > 0 of the optimal average cost.

5.
Markov decision processes (MDPs) and their variants are widely studied in the theory of controls for stochastic discrete-event systems driven by Markov chains. Much of the literature focuses on the risk-neutral criterion in which the expected rewards, either average or discounted, are maximized. There exists some literature on MDPs that takes risks into account. Much of this addresses the exponential utility (EU) function and mechanisms to penalize different forms of variance of the rewards. EU functions have some numerical deficiencies, while variance measures variability both above and below the mean rewards; the variability above mean rewards is usually beneficial and should not be penalized/avoided. As such, risk metrics that account for pre-specified targets (thresholds) for rewards have been considered in the literature, where the goal is to penalize the risks of revenues falling below those targets. Existing work on MDPs that takes targets into account seeks to minimize risks of this nature. Minimizing risks can lead to poor solutions where the risk is zero or near zero, but the average rewards are also rather low. Hence, in this paper we study a risk-averse criterion, in particular the so-called downside risk, which equals the probability of the revenues falling below a given target, where, in contrast to minimizing such risks, we only reduce this risk at the cost of slightly lowered average rewards. A solution where the risk is low and the average reward is quite high, although not at its maximum attainable value, is very attractive in practice. To be more specific, in our formulation, the objective function is the expected value of the rewards minus a scalar times the downside risk. In this setting, we analyze the infinite horizon MDP, the finite horizon MDP, and the infinite horizon semi-MDP (SMDP). We develop dynamic programming and reinforcement learning algorithms for the finite and infinite horizon. The algorithms are tested in numerical studies and show encouraging performance.
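A minimal Monte-Carlo sketch of the risk-adjusted criterion described above: the score of a policy is its estimated expected reward minus a scalar weight times the downside risk, i.e. the empirical probability that the episode reward falls below the target. The simulator interface (`simulate_episode`) and all parameter names are assumptions; the paper's dynamic programming and reinforcement learning algorithms optimize this criterion rather than merely evaluating it.

```python
import numpy as np

def downside_risk_score(simulate_episode, policy, n_episodes=1000,
                        target=0.0, risk_weight=1.0, seed=0):
    """Estimate E[R] - risk_weight * P(R < target) for a given policy.

    simulate_episode(policy, rng) is assumed to return the total reward
    of one episode run under `policy`.
    """
    rng = np.random.default_rng(seed)
    rewards = np.array([simulate_episode(policy, rng) for _ in range(n_episodes)])
    expected_reward = rewards.mean()
    downside_risk = (rewards < target).mean()   # empirical P(R < target)
    return expected_reward - risk_weight * downside_risk
```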

6.
Reinforcement learning (RL) is concerned with the identification of optimal controls in Markov decision processes (MDPs) where no explicit model of the transition probabilities is available. We propose a class of RL algorithms which always produces stable estimates of the value function. In detail, we use "local averaging" methods to construct an approximate dynamic programming (ADP) algorithm. Nearest-neighbor regression, grid-based approximations, and trees can all be used as the basis of this approximation. We provide a thorough theoretical analysis of this approach and we demonstrate that ADP converges to a unique approximation in continuous-state average-cost MDPs. In addition, we prove that our method is consistent in the sense that an optimal approximate strategy is identified asymptotically. With regard to a practical implementation, we suggest a reduction of ADP to standard dynamic programming in an artificial finite-state MDP.
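A toy version of the "local averaging" representation mentioned above, using nearest-neighbor regression: the value at a query state is the mean of the stored values at its k nearest sampled states. The metric, the choice of k, and the array layout are assumptions used only to make the idea concrete; because the estimate is a convex combination of stored values, the induced approximation operator is non-expansive, which is the kind of property behind the stability claims.

```python
import numpy as np

def knn_value(query, sample_states, sample_values, k=5):
    """Approximate V(query) as the average value of the k nearest sampled states.

    sample_states : array of shape (m, d), previously visited states
    sample_values : array of shape (m,), value estimates at those states
    """
    dists = np.linalg.norm(sample_states - query, axis=1)
    nearest = np.argsort(dists)[:k]
    return sample_values[nearest].mean()
```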

7.
Many partitioned scientific programs can be modeled as iterative executions of computational tasks and represented by iterative task graphs (ITGs). An ITG may or may not have dependence cycles. In this paper, we consider the symbolic scheduling of ITGs on distributed memory architectures with nonzero communication overhead and propose heuristic algorithms for scheduling both cyclic and acyclic ITGs without searching an entire iteration space. Our approach incorporates techniques of software pipelining, graph unfolding, directed acyclic graph (DAG) scheduling, and load balancing. We analyze the asymptotic optimality of the algorithms to show that the derived schedules are competitive with optimal solutions. We also study the sensitivity of scheduling performance to inaccurate weights. Finally, we present experimental results to demonstrate the effectiveness of the optimization techniques.

8.
The hyper-cube framework for ant colony optimization.
Ant colony optimization is a metaheuristic approach belonging to the class of model-based search algorithms. In this paper, we propose a new framework for implementing ant colony optimization algorithms called the hyper-cube framework for ant colony optimization. In contrast to the usual way of implementing ant colony optimization algorithms, this framework limits the pheromone values to the interval [0,1]. This is obtained by introducing changes in the pheromone value update rule. These changes can in general be applied to any pheromone value update rule used in ant colony optimization. We discuss the benefits coming with this new framework. The benefits are twofold. On the theoretical side, the new framework allows us to prove that in Ant System, the ancestor of all ant colony optimization algorithms, the average quality of the solutions produced increases in expectation over time when applied to unconstrained problems. On the practical side, the new framework automatically handles the scaling of the objective function values. We experimentally show that this leads on average to a more robust behavior of ant colony optimization algorithms.
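A sketch, under assumed data structures, of a hyper-cube style pheromone update: each solution's deposit is normalized by the total quality of the update set, so the update is a convex combination and pheromone values initialized in [0,1] remain in [0,1]. The component keys, the quality function, and the evaporation rate rho are illustrative assumptions, not the paper's exact rule.

```python
def hypercube_pheromone_update(tau, solutions, qualities, rho=0.1):
    """tau[c] <- (1 - rho) * tau[c] + rho * delta[c], with normalized deposits.

    tau       : dict mapping solution components (e.g. edges) to values in [0, 1]
    solutions : list of solutions, each given as a set of components
    qualities : list of non-negative solution qualities f(s) (higher is better)
    """
    total = sum(qualities)
    if total == 0:
        return tau
    weights = [q / total for q in qualities]           # weights sum to 1
    for c in tau:
        delta = sum(w for s, w in zip(solutions, weights) if c in s)
        tau[c] = (1 - rho) * tau[c] + rho * delta      # stays within [0, 1]
    return tau
```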

9.
In this paper, we propose a way of exploiting Operations Research techniques within global constraints for cost-based domain filtering. In Constraint Programming, constraint propagation is aimed at removing from variable domains combinations of values which are proven infeasible. Pruning derives from feasibility reasoning. When coping with optimization problems, pruning can also be performed on the basis of costs, i.e., optimality reasoning. Cost-based filtering removes combinations of values which are proven sub-optimal. For this purpose, we encapsulate in global constraints optimization components representing suitable relaxations of the constraint itself. These components embed efficient Operations Research algorithms computing the optimal solution of the relaxed problem and a gradient function representing the estimated cost of each variable-value assignment. We exploit these pieces of information for pruning and for guiding the search. We have applied these techniques to a couple of ILOG Solver global constraints (a constraint of difference and a path constraint) and tested the approach on a variety of combinatorial optimization problems such as Timetabling, Travelling Salesman Problems, and Scheduling Problems with sequence dependent setup times. Comparisons with pure Constraint Programming approaches and related literature clearly show the benefits of the proposed approach.
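The pruning rule itself is simple enough to sketch: if the relaxation's lower bound plus the reduced cost (gradient) of forcing a variable-value assignment already reaches the cost of the incumbent solution, that assignment is provably sub-optimal and the value can be removed. The dictionary-based interface below is an assumption for illustration; in the paper this logic is embedded inside dedicated global constraints backed by efficient OR algorithms.

```python
def cost_based_filtering(domains, lower_bound, reduced_cost, incumbent):
    """Remove values whose assignment cannot improve on the incumbent (minimization).

    domains      : dict var -> set of candidate values
    lower_bound  : optimal cost of the relaxed problem
    reduced_cost : dict (var, value) -> estimated extra cost of forcing var = value
    incumbent    : cost of the best feasible solution found so far
    """
    for var, dom in domains.items():
        for val in list(dom):
            if lower_bound + reduced_cost[(var, val)] >= incumbent:
                dom.discard(val)        # provably sub-optimal assignment
    return domains
```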

10.
Bandit problems and the exploration/exploitation tradeoff
We explore the two-armed bandit with Gaussian payoffs as a theoretical model for optimization. The problem is formulated from a Bayesian perspective, and the optimal strategy for both one and two pulls is provided. We present regions of parameter space where a greedy strategy is provably optimal. We also compare the greedy and optimal strategies to one based on a genetic algorithm. In doing so, we correct a previous error in the literature concerning the Gaussian bandit problem and the supposed optimality of genetic algorithms for this problem. Finally, we provide an analytically simple bandit model that is more directly applicable to optimization theory than the traditional bandit problem and determine a near-optimal strategy for that model.
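For concreteness, a sketch of the greedy strategy the abstract compares against: keep an independent Gaussian posterior over each arm's mean (observation variance assumed known) and always pull the arm with the higher posterior mean. The prior parameters and the `pull` interface are assumptions; the paper's optimal one- and two-pull strategies are not reproduced here.

```python
import numpy as np

def greedy_gaussian_bandit(pull, n_pulls, prior_mean=(0.0, 0.0),
                           prior_var=(1.0, 1.0), obs_var=1.0):
    """Greedy play of a two-armed Gaussian bandit with conjugate posterior updates.

    pull(arm) is assumed to return a noisy Gaussian reward for the chosen arm.
    """
    mean, var = list(prior_mean), list(prior_var)
    total = 0.0
    for _ in range(n_pulls):
        arm = int(np.argmax(mean))                  # greedy: highest posterior mean
        reward = pull(arm)
        total += reward
        # conjugate Normal update of the chosen arm's posterior
        precision = 1.0 / var[arm] + 1.0 / obs_var
        mean[arm] = (mean[arm] / var[arm] + reward / obs_var) / precision
        var[arm] = 1.0 / precision
    return total, mean
```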

11.
This paper tackles the difficult but important task of objective algorithm performance assessment for optimization. Rather than reporting average performance of algorithms across a set of chosen instances, which may bias conclusions, we propose a methodology to enable the strengths and weaknesses of different optimization algorithms to be compared across a broader instance space. The results reported in a recent Computers and Operations Research paper comparing the performance of graph coloring heuristics are revisited with this new methodology to demonstrate (i) how pockets of the instance space can be found where algorithm performance varies significantly from the average performance of an algorithm; (ii) how the properties of the instances can be used to predict algorithm performance on previously unseen instances with high accuracy; and (iii) how the relative strengths and weaknesses of each algorithm can be visualized and measured objectively.

12.
In this paper, we propose a practical and efficient method for finding the globally optimal solution to the problem of determining the pose of an object. We present a framework that allows us to use point-to-point, point-to-line, and point-to-plane correspondences for solving various types of pose and registration problems involving Euclidean (or similarity) transformations. Traditional methods such as the iterative closest point algorithm or bundle adjustment methods for camera pose may get trapped in local minima due to the nonconvexity of the corresponding optimization problem. Our approach of solving the mathematical optimization problems guarantees global optimality. The optimization scheme is based on ideas from global optimization theory, in particular convex underestimators in combination with branch-and-bound methods. We provide a provably optimal algorithm and demonstrate good performance on both synthetic and real data. We also give examples of where traditional methods fail due to the local minima problem.

13.
Data-Flow models are attracting renewed attention because they lend themselves to efficient mapping on multi-core architectures. The key problem of finding a maximum-throughput allocation and scheduling of Synchronous Data-Flow graphs (SDFGs) onto a multi-core architecture is NP-hard and has been traditionally solved by means of heuristic (incomplete) algorithms with no guarantee of global optimality. In this paper we propose an exact (complete) algorithm for the computation of a maximum-throughput mapping of applications specified as SDFGs onto multi-core architectures. This is, to the best of our knowledge, the first complete algorithm for generic SDF graphs, including those with loops and a finite iteration bound. Our approach is based on Constraint Programming; it guarantees optimality and can handle realistic instances in terms of size and complexity. Extensive experiments on a large number of SDFGs demonstrate that our approach is effective and robust.

14.
We propose a unified framework for Markov decision problems and performance sensitivity analysis for multichain Markov processes with both discounted and average-cost performance criteria. With the fundamental concept of performance potentials, we derive both performance-gradient and performance-difference formulas, which play the central role in performance optimization. The standard policy iteration algorithms for both discounted- and average-reward MDPs can be established using the performance-difference formulas in a simple and intuitive way; and the performance-gradient formulas together with stochastic approximation may lead to new optimization schemes. This sensitivity-based point of view of performance optimization provides some insights that link perturbation analysis, Markov decision processes, and reinforcement learning together. The research is an extension of the previous work on ergodic Markov chains (Cao, Automatica 36 (2000) 771).
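For the ergodic single-chain, average-reward case, the two formulas referred to above take the following standard forms (a sketch in the usual potential-based notation; the paper's multichain and discounted versions are more general). Here (P, f) and (P', f') are two policies, pi' is the stationary distribution under P', eta is the average reward under (P, f), and g solves the Poisson equation (I - P)g + eta*e = f:

```latex
% Performance-difference formula (the basis of policy iteration)
\eta' - \eta = \pi' \bigl[ (f' - f) + (P' - P)\, g \bigr]

% Performance-derivative formula along the direction
% P_\delta = P + \delta (P' - P),  f_\delta = f + \delta (f' - f)
\left. \frac{d\eta_\delta}{d\delta} \right|_{\delta = 0}
   = \pi \bigl[ (f' - f) + (P' - P)\, g \bigr]
```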

15.
A weakness of classical Markov decision processes (MDPs) is that they scale very poorly due to the flat state-space representation. Factored MDPs address this representational problem by exploiting problem structure to specify the transition and reward functions of an MDP in a compact manner. However, in general, solutions to factored MDPs do not retain the structure and compactness of the problem representation, forcing approximate solutions, with approximate linear programming (ALP) emerging as a promising MDP-approximation technique. To date, most ALP work has focused on the primal-LP formulation, while the dual LP, which forms the basis for solving constrained Markov problems, has received much less attention. We show that a straightforward linear approximation of the dual optimization variables is problematic, because some of the required computations cannot be carried out efficiently. Nonetheless, we develop a composite approach that symmetrically approximates the primal and dual optimization variables (effectively approximating both the objective function and the feasible region of the LP), leading to a formulation that is computationally feasible and suitable for solving constrained MDPs. We empirically show that this new ALP formulation also performs well on unconstrained problems.

16.
This paper focuses on bias optimality in unichain, finite state and action space Markov decision processes. Using relative value functions, we present methods for evaluating optimal bias; this leads to a probabilistic analysis which transforms the original reward problem into a minimum average cost problem. The result is an explanation of how and why bias implicitly discounts future rewards.

17.
A value iteration algorithm for time-aggregated Markov decision processes (MDPs) is developed to solve problems with large state spaces. The algorithm is based on a novel approach which solves a time-aggregated MDP by incrementally solving a set of standard MDPs. Therefore, the algorithm converges under the same assumption as standard value iteration. This assumption is much weaker than the one required by the existing time-aggregated value iteration algorithm. The algorithms developed in this paper are also applicable to MDPs with fractional costs.
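As a reference point, the sketch below is plain (discounted) value iteration of the kind the incremental scheme repeatedly invokes on standard MDPs; it is not the time-aggregated algorithm itself, and the array layout, discount factor, and stopping tolerance are assumptions.

```python
import numpy as np

def value_iteration(P, r, gamma=0.95, tol=1e-8, max_iter=10_000):
    """Standard value iteration.

    P : array of shape (n_actions, n_states, n_states), transition matrices
    r : array of shape (n_actions, n_states), expected one-step rewards
    Returns the optimal value function and a greedy policy.
    """
    V = np.zeros(P.shape[1])
    for _ in range(max_iter):
        Q = r + gamma * (P @ V)          # Q has shape (n_actions, n_states)
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return V, Q.argmax(axis=0)
```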

18.
In this paper the topology optimization problem is solved in a finite strain setting using a polyconvex hyperelastic material. Since finite strains are considered, the definition of the stiffness is not unique. In the present contribution, the objective of the optimization is minimization of the end-displacement for a given amount of material. The problem is regularized using the phase-field approach, which leads to an optimality criterion defined by a second-order partial differential equation. Both the elastic boundary value problem and the optimality criterion are solved using the finite element method. To approach the optimal state, a steepest-descent approach is utilized. The interfaces between void and full material are resolved using an adaptive finite element scheme. The paper closes with numerical examples that clearly illustrate that the presented method is able to find optimal solutions for finite strain topology optimization problems.

19.
Accurate streamline tracing and travel time computation are essential ingredients of streamline methods for groundwater transport and petroleum reservoir simulation. In this paper we present a unified formulation for the development of high-order accurate streamline tracing algorithms on unstructured triangular and quadrilateral grids. The main result of this paper is the identification of velocity spaces that are suitable for streamline tracing. The essential requirement is that the divergence-free part of the velocity must induce a stream function. We recognize several classes of velocity spaces satisfying this requirement from the theory of mixed finite element methods and, for each class, we obtain the precise functional form of the stream function. Not surprisingly, the most widely used tracing algorithm (Pollock’s method) emanates in fact from the lowest-order admissible velocity approximation. Therefore, we provide a sound theoretical justification for the low-order algorithms currently in use, and we show how to achieve higher-order accuracy both in the streamline tracing and the travel time computation.
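To make the lowest-order case concrete, the sketch below performs one step of Pollock-style tracing through a single rectangular cell: the face-normal velocity varies linearly between opposite faces, so the exit time along each axis has a closed form and the exit point follows by evaluating the analytical trajectory at the smaller of the two times. The data layout, degenerate-case tolerance, and restriction to a 2-D axis-aligned cell are assumptions for illustration, not the paper's higher-order construction.

```python
import numpy as np

def _exit_time_1d(p, lo, hi, v_lo, v_hi):
    """Time for a particle at p to leave [lo, hi] when the velocity varies
    linearly from v_lo (at face lo) to v_hi (at face hi)."""
    A = (v_hi - v_lo) / (hi - lo)            # velocity gradient across the cell
    v_p = v_lo + A * (p - lo)                # particle velocity at p
    if v_p > 0:
        v_exit = v_hi
        target = hi
    elif v_p < 0:
        v_exit = v_lo
        target = lo
    else:
        return np.inf                        # no motion along this axis
    if v_exit * v_p <= 0:
        return np.inf                        # stagnation point before the face
    if abs(A) < 1e-12:
        return (target - p) / v_p            # uniform velocity: linear travel
    return np.log(v_exit / v_p) / A          # semi-analytical exit time

def _position_1d(p, lo, hi, v_lo, v_hi, t):
    """Position after time t under the same linear velocity field."""
    A = (v_hi - v_lo) / (hi - lo)
    v_p = v_lo + A * (p - lo)
    if abs(A) < 1e-12:
        return p + v_p * t
    return lo + (v_p * np.exp(A * t) - v_lo) / A

def pollock_step(p, cell, face_velocities):
    """One tracing step: returns (exit_point, travel_time) through the cell.

    cell            : ((x_lo, x_hi), (y_lo, y_hi))
    face_velocities : ((vx_lo, vx_hi), (vy_lo, vy_hi)) normal velocities on the faces
    """
    (x_lo, x_hi), (y_lo, y_hi) = cell
    (vx_lo, vx_hi), (vy_lo, vy_hi) = face_velocities
    tx = _exit_time_1d(p[0], x_lo, x_hi, vx_lo, vx_hi)
    ty = _exit_time_1d(p[1], y_lo, y_hi, vy_lo, vy_hi)
    t = min(tx, ty)
    exit_point = (_position_1d(p[0], x_lo, x_hi, vx_lo, vx_hi, t),
                  _position_1d(p[1], y_lo, y_hi, vy_lo, vy_hi, t))
    return exit_point, t
```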

20.
Robustness of policies in constrained Markov decision processes
We consider the optimization of finite-state, finite-action Markov decision processes (MDPs), under constraints. Cost and constraints are discounted. We introduce a new method for investigating the continuity, and a certain type of robustness, of the optimal cost and the optimal policy under changes in the constraints. This method is also applicable for other cost criteria such as finite horizon and infinite horizon average cost.
