Similar Articles
1.
This communique provides an exact iterative search algorithm for the NP-hard problem of finding an optimal feasible stationary Markovian pure policy that maximizes the value averaged over an initial state distribution in finite constrained Markov decision processes (MDPs). The algorithm is based on a novel characterization of the entire feasible policy space and follows the spirit of policy iteration (PI): it generates a sequence of monotonically improving feasible policies that converges to an optimal policy within a number of iterations bounded, in the worst case, by the size of the policy space. Unlike PI, each iteration involving a feasible policy requires solving an unconstrained MDP, and the current best policy improves upon all feasible policies contained in the union of the policy spaces associated with those unconstrained MDPs.
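For orientation only, here is a brute-force sketch of the problem the communique addresses, not its algorithm: on a toy constrained MDP (all data and names hypothetical) it enumerates every stationary pure policy, keeps the ones meeting a constraint on expected discounted cost, and returns the feasible policy with the largest value averaged over the initial distribution. This is exactly the exponential search that an exact iterative algorithm aims to improve upon.

```python
import itertools
import numpy as np

# Toy constrained MDP (hypothetical data): 3 states, 2 actions.
# P[a][s, s'] = transition probability, r[s, a] = reward,
# c[s, a] = constraint cost, discount gamma, cost budget C_MAX.
P = [np.array([[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]]),
     np.array([[0.2, 0.7, 0.1], [0.0, 0.5, 0.5], [0.3, 0.0, 0.7]])]
r = np.array([[1.0, 2.0], [0.5, 1.5], [2.0, 0.0]])
c = np.array([[0.0, 1.0], [0.2, 0.8], [1.0, 0.1]])
gamma, C_MAX = 0.9, 5.0
beta = np.array([1 / 3, 1 / 3, 1 / 3])  # initial state distribution

def stationary_value(pi, rewards):
    """Discounted value vector of a deterministic policy pi (one action per state)."""
    n = len(pi)
    P_pi = np.array([P[pi[s]][s] for s in range(n)])
    r_pi = np.array([rewards[s, pi[s]] for s in range(n)])
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

best_pi, best_val = None, -np.inf
for pi in itertools.product(range(2), repeat=3):    # all pure policies
    if beta @ stationary_value(pi, c) <= C_MAX:     # feasibility check
        v = beta @ stationary_value(pi, r)          # objective to maximize
        if v > best_val:
            best_pi, best_val = pi, v
print(best_pi, best_val)
```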

This communique presents an algorithm called "value set iteration" (VSI) for solving infinite-horizon discounted Markov decision processes with finite state and action spaces, as a simple generalization of value iteration (VI) and as a counterpart to Chang's policy set iteration. VSI generates a sequence of value functions by manipulating a set of value functions at each iteration, and this sequence converges to the optimal value function. VSI preserves the convergence properties of VI while converging no more slowly than VI; in particular, if the set used by VSI contains the value functions of sample policies drawn independently from a given distribution, together with a properly defined policy obtained by policy switching, a probabilistic exponential convergence rate of VSI can be established. Because the set used by VSI can contain the value functions of policies generated by any other existing algorithm, VSI also serves as a general framework for combining multiple solution methods.
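A minimal sketch of one possible reading of the VSI idea, on a random toy MDP and with a simplified set construction (the communique's exact set may differ): at each iteration the next value function is the elementwise maximum over the Bellman backup of the current iterate and the value functions of a few independently sampled policies. Since each policy's value is a lower bound on the optimal value and the backup is monotone, the iterates stay sandwiched between plain VI and the optimum, so the scheme converges no more slowly than VI.

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, gamma = 4, 3, 0.95
# Hypothetical random MDP: P[a] is a row-stochastic matrix, r[s, a] >= 0.
P = [rng.dirichlet(np.ones(n_s), size=n_s) for _ in range(n_a)]
r = rng.random((n_s, n_a))

def bellman(v):
    """One Bellman optimality backup: (Tv)(s) = max_a [r(s,a) + gamma * P_a(s,.) v]."""
    return np.max(np.array([r[:, a] + gamma * P[a] @ v for a in range(n_a)]), axis=0)

def policy_value(pi):
    """Exact value of a deterministic policy, by solving the linear system."""
    P_pi = np.array([P[pi[s]][s] for s in range(n_s)])
    r_pi = np.array([r[s, pi[s]] for s in range(n_s)])
    return np.linalg.solve(np.eye(n_s) - gamma * P_pi, r_pi)

v = np.zeros(n_s)  # starting below the optimum, since rewards are nonnegative
for _ in range(200):
    sample_vals = [policy_value(rng.integers(n_a, size=n_s)) for _ in range(5)]
    # "Value set" step: elementwise max over the backup and the sampled
    # policies' values -- never below the plain VI iterate, never above V*.
    v = np.max(np.vstack([bellman(v)] + sample_vals), axis=0)
print(v)
```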

2.
This paper deals with Markov decision processes with a target set for nonpositive rewards, under two types of threshold probability criteria. The first criterion is the probability that the total reward is not greater than a given initial threshold value; the second is the probability that the total reward is strictly less than it. Our first (resp. second) optimization problem is to minimize the first (resp. second) threshold probability. In both problems the threshold value can be read as a permissible level of total reward for reaching the goal (the target set); that is, we would like to reach the target set with a total reward above that level, if possible. For both problems, we show that (1) the optimal threshold probability is the unique solution to an optimality equation, (2) an optimal deterministic stationary policy exists, and (3) value iteration and policy space iteration algorithms are available. In addition, we prove that the first (resp. second) optimal threshold probability is a monotone increasing and right- (resp. left-) continuous function of the initial threshold value, and we propose a method for obtaining an optimal policy and the optimal threshold probability of the first problem from those of the second.
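The following Monte Carlo sketch (toy data, hypothetical throughout; not the paper's value iteration) illustrates the two criteria for a fixed policy: it simulates the total reward accumulated until the target set is reached and estimates both P(total <= t) and P(total < t) for several thresholds t. The two estimates differ only at atoms of the total-reward distribution, which is exactly where the right- vs. left-continuity in t matters.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy chain under a fixed policy, with nonpositive rewards and an
# absorbing target state 3 (all data hypothetical).
P = np.array([[0.5, 0.3, 0.1, 0.1],
              [0.2, 0.5, 0.2, 0.1],
              [0.1, 0.2, 0.4, 0.3],
              [0.0, 0.0, 0.0, 1.0]])
reward = np.array([-1.0, -2.0, -1.0, 0.0])
TARGET = 3

def total_reward(s0, horizon=200):
    """Nonpositive reward accumulated until the target set is reached
    (capped at `horizon` steps, which this chain reaches almost surely)."""
    s, total = s0, 0.0
    for _ in range(horizon):
        if s == TARGET:
            break
        total += reward[s]
        s = rng.choice(4, p=P[s])
    return total

samples = np.array([total_reward(0) for _ in range(20000)])
for t in (-3.0, -5.0, -10.0):
    # First criterion: P(total <= t); second criterion: P(total < t).
    print(t, np.mean(samples <= t), np.mean(samples < t))
```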

3.
We propose a time aggregation approach for solving infinite-horizon average-cost Markov decision processes via policy iteration. In this approach, the policy is updated only when the process visits a designated subset of the state space. As in state aggregation, this leads to a reduced state space, which can substantially cut computational and storage requirements, especially for problems with certain structural properties. In contrast to state aggregation, however, which generally yields an approximate model because the Markov property is lost, time aggregation incurs no loss of accuracy, since the Markov property is preserved. Estimation algorithms based on a single sample path are developed so that the time aggregation approach can be implemented online for practical systems. Numerical and simulation examples illustrate the ideas and the potential computational savings.
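A minimal single-sample-path sketch of the time-aggregation bookkeeping under a fixed policy (toy chain, hypothetical data; the paper's estimation algorithms are more elaborate): the path is cut into segments at successive visits to the subset S0, and by the renewal-reward argument the long-run average cost is estimated as the ratio of mean segment cost to mean segment length.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical 4-state chain under a fixed policy; per-state step costs.
P = np.array([[0.6, 0.2, 0.1, 0.1],
              [0.3, 0.4, 0.2, 0.1],
              [0.1, 0.3, 0.4, 0.2],
              [0.2, 0.2, 0.2, 0.4]])
cost = np.array([1.0, 3.0, 2.0, 4.0])
S0 = {0}  # aggregation set: the chain is observed only at visits here

seg_costs, seg_lens = [], []
s, acc_cost, acc_len = 0, 0.0, 0
for _ in range(100000):
    acc_cost += cost[s]
    acc_len += 1
    s = rng.choice(4, p=P[s])
    if s in S0:                # a visit to S0 closes one aggregated segment
        seg_costs.append(acc_cost)
        seg_lens.append(acc_len)
        acc_cost, acc_len = 0.0, 0

# Renewal-reward: long-run average cost = E[segment cost] / E[segment length].
print(np.mean(seg_costs) / np.mean(seg_lens))
```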
