Similar documents
10 similar documents found (search time: 343 ms)
1.
In this paper we study the average cost criterion induced by a regular utility function (the U-average cost criterion) for continuous-time Markov decision processes. This criterion generalizes both the risk-sensitive average cost and the expected average cost criteria. We first introduce an auxiliary risk-sensitive first passage optimization problem and establish the properties of the corresponding optimal value function under mild conditions. We then show that the pair of optimal value functions of the risk-sensitive average cost criterion and the risk-sensitive first passage criterion solves the optimality equation of the risk-sensitive average cost criterion for any nonzero value of the risk-sensitivity parameter. Moreover, the optimal value function of the risk-sensitive average cost criterion is continuous with respect to the risk-sensitivity parameter. Finally, we give the connections between the U-average cost criterion and the average cost criteria induced by the identity function and the exponential utility function, and prove the existence of a U-average optimal deterministic stationary policy in the class of all randomized Markov policies.
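For orientation only, the risk-sensitive long-run average cost mentioned above is usually written as an exponential (log-moment) average of accumulated costs. The display below is a generic sketch of that criterion, not the paper's exact continuous-time formulation; the symbols are standard choices, with λ the risk-sensitivity parameter and c the cost rate.

```latex
% Generic risk-sensitive average cost criterion (a sketch, not the
% paper's exact formulation); \lambda \neq 0 is the risk-sensitivity
% parameter, c the cost rate, and \pi a policy.
\[
  J_\lambda(x,\pi) = \limsup_{T\to\infty} \frac{1}{\lambda T}
    \log \mathbb{E}_x^{\pi}\!\left[\exp\!\Big(\lambda \int_0^T c(x_t,a_t)\,dt\Big)\right],
  \qquad
  J_\lambda^{*}(x) = \inf_{\pi} J_\lambda(x,\pi).
\]
```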

2.
In this paper, we consider the problem of optimal control for a class of nonlinear stochastic systems with multiplicative noise. The nonlinearity consists of quadratic terms in the state and control variables. The optimality criteria are of a risk-sensitive and generalised risk-sensitive type. The optimal control is found in explicit closed form by the completion-of-squares and change-of-measure methods. As applications, we outline two special cases of our results. We show that a subset of the class of models we consider leads to a generalised quadratic–affine term structure model (QATSM) for interest rates. We also demonstrate how our results lead to a generalisation of exponential utility as a criterion in optimal investment.
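For context, a risk-sensitive criterion of the kind referred to here typically takes an exponential-of-integral form. The display below is a generic sketch under that assumption, not the paper's model; ℓ and Φ stand for a running and a terminal cost, here taken quadratic in state and control.

```latex
% Generic exponential-of-integral (risk-sensitive) criterion (a sketch):
% \theta is the risk-sensitivity parameter, \ell a running cost, and
% \Phi a terminal cost.
\[
  J_\theta(u) = \frac{1}{\theta}
    \log \mathbb{E}\!\left[\exp\!\Big(\theta \int_0^T \ell(x_t,u_t)\,dt
      + \theta\,\Phi(x_T)\Big)\right].
\]
```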

3.
Systems & Control Letters, 2007, 56(11-12): 663-668
According to Assaf, a dynamic programming problem is called invariant if its transition mechanism depends only on the chosen action. This paper studies properties of risk-sensitive invariant problems with a general state space. The main result establishes the optimality equation for the risk-sensitive average cost criterion without any restrictions on the risk factor. Moreover, a practical algorithm is provided for solving the optimality equation in the case of a finite action space.
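As a point of reference, the risk-sensitive average cost optimality equation usually takes a multiplicative form. The display below is a standard textbook sketch, not necessarily the exact equation derived in the paper; for invariant problems, the transition kernel q(· | x, a) depends on the action a only.

```latex
% Generic risk-sensitive average cost optimality equation (discrete-state
% sketch); g is the optimal average cost, h a relative value function,
% and \gamma \neq 0 the risk factor.
\[
  e^{\gamma\,(g + h(x))}
    = \inf_{a \in A}\; e^{\gamma\, c(x,a)} \int_S e^{\gamma\, h(y)}\, q(dy \mid x, a).
\]
```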

4.
Near-Optimal Reinforcement Learning in Polynomial Time
Kearns, Michael; Singh, Satinder. Machine Learning, 2002, 49(2-3): 209-232
We present new algorithms for reinforcement learning and prove that they have polynomial bounds on the resources required to achieve near-optimal return in general Markov decision processes. After observing that the number of actions required to approach the optimal return is lower bounded by the mixing time T of the optimal policy (in the undiscounted case) or by the horizon time T (in the discounted case), we give algorithms requiring a number of actions and total computation time that are only polynomial in T and the number of states and actions, for both the undiscounted and discounted cases. An interesting aspect of our algorithms is their explicit handling of the exploration-exploitation trade-off.
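The abstract gives no pseudocode; the snippet below is a hypothetical sketch of the model-based "known state" bookkeeping that polynomial-time algorithms of this kind build on: visit counts decide when a state-action pair has been explored enough, and planning is then done on the estimated model. The class name, the threshold `m_known`, and the least-tried-action tie-break are illustrative assumptions, not the authors' algorithm.

```python
import random
from collections import defaultdict

class KnownStateModel:
    """Hypothetical sketch: estimate an MDP from experience and mark a
    state 'known' once every action there has been tried often enough."""

    def __init__(self, n_states, n_actions, m_known=50):
        self.nS, self.nA, self.m_known = n_states, n_actions, m_known
        self.counts = defaultdict(int)                       # (s, a) -> visits
        self.trans = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s': count}
        self.reward_sum = defaultdict(float)                 # (s, a) -> summed reward

    def update(self, s, a, r, s_next):
        self.counts[(s, a)] += 1
        self.trans[(s, a)][s_next] += 1
        self.reward_sum[(s, a)] += r

    def is_known(self, s):
        # A state is 'known' when every action has been sampled m_known times.
        return all(self.counts[(s, a)] >= self.m_known for a in range(self.nA))

    def estimated_model(self, s, a):
        # Empirical transition probabilities and mean reward for (s, a).
        n = self.counts[(s, a)]
        p = {s2: c / n for s2, c in self.trans[(s, a)].items()}
        return p, self.reward_sum[(s, a)] / n

    def act(self, s):
        # On unknown states: balanced wandering, i.e. pick the least-tried action.
        if not self.is_known(s):
            return min(range(self.nA), key=lambda a: self.counts[(s, a)])
        # On known states a planner (e.g. value iteration on the estimated
        # model, truncated to the mixing/horizon time T) would choose the
        # action; that step is omitted in this sketch.
        return random.randrange(self.nA)
```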

5.
In this paper, the H∞ tracking control of linear discrete-time systems is studied via reinforcement learning. By defining an improved value function, the tracking game algebraic Riccati equation with a discount factor is obtained, which is solved by iterative learning algorithms. In particular, Q-learning based on value iteration is presented for H∞ tracking control, which requires neither the system model information nor an initial admissible control policy. In addition, to improve the practicality of the algorithm, a convergence analysis of the proposed algorithm with a discount factor is given. Finally, the feasibility of the proposed algorithms is verified by simulation examples.
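For reference, a generic discounted H∞ tracking performance index is sketched below; this is a standard zero-sum game formulation with the control minimizing and the disturbance maximizing, not the paper's exact value function. The saddle point of such an index is characterized by a game algebraic Riccati equation, which, as the abstract notes, can be solved iteratively from data without model information.

```latex
% Generic discounted H-infinity tracking performance index (a sketch):
% e_k tracking error, u_k control, w_k disturbance, \gamma discount
% factor, \beta attenuation level.
\[
  V(e_0) = \sum_{k=0}^{\infty} \gamma^{k}
    \left( e_k^{\top} Q\, e_k + u_k^{\top} R\, u_k - \beta^{2}\, w_k^{\top} w_k \right).
\]
```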

6.
Kearns, Michael; Mansour, Yishay; Ng, Andrew Y. Machine Learning, 2002, 49(2-3): 193-208
A critical issue for the application of Markov decision processes (MDPs) to realistic problems is how the complexity of planning scales with the size of the MDP. In stochastic environments with very large or infinite state spaces, traditional planning and reinforcement learning algorithms may be inapplicable, since their running time typically grows linearly with the state space size in the worst case. In this paper we present a new algorithm that, given only a generative model (a natural and common type of simulator) for an arbitrary MDP, performs on-line, near-optimal planning with a per-state running time that has no dependence on the number of states. The running time is exponential in the horizon time (which depends only on the discount factor and the desired degree of approximation to the optimal policy). Our algorithm thus provides a different complexity trade-off than classical algorithms such as value iteration: rather than scaling linearly in both horizon time and state space size, our running time trades an exponential dependence on the former in exchange for no dependence on the latter.

Our algorithm is based on the idea of sparse sampling. We prove that a randomly sampled look-ahead tree that covers only a vanishing fraction of the full look-ahead tree nevertheless suffices to compute near-optimal actions from any state of an MDP. Practical implementations of the algorithm are discussed, and we draw ties to our related recent results on finding a near-best strategy from a given class of strategies in very large partially observable MDPs (Kearns, Mansour, & Ng, Neural Information Processing Systems 13, to appear).
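As a rough illustration of the sparse-sampling idea (a hypothetical sketch, not the authors' exact procedure), the recursion below estimates Q-values at a single state by sampling a fixed number of successors per action from a generative model `sim(state, action) -> (next_state, reward)`. Function names and the default depth, width, and discount values are assumptions.

```python
import random

def sparse_sample_q(sim, state, actions, depth, width, gamma):
    """Estimate Q-values at `state` by sampling `width` successors per
    action and recursing to `depth`.  Per-call cost is
    O((width * len(actions)) ** depth) and is independent of |S|."""
    if depth == 0:
        return {a: 0.0 for a in actions}
    q = {}
    for a in actions:
        total = 0.0
        for _ in range(width):                     # sample a few next states
            s_next, r = sim(state, a)
            q_next = sparse_sample_q(sim, s_next, actions, depth - 1, width, gamma)
            total += r + gamma * max(q_next.values())
        q[a] = total / width
    return q

def sparse_sample_action(sim, state, actions, depth=3, width=5, gamma=0.9):
    """Pick a near-greedy action from the sampled look-ahead tree."""
    q = sparse_sample_q(sim, state, actions, depth, width, gamma)
    return max(q, key=q.get)
```

Because the planner is called anew from whatever state the agent is currently in, the per-state cost stays independent of the total number of states, matching the trade-off described in the abstract.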

7.
In this paper, a data-based feedback relearning algorithm is proposed for the robust control problem of uncertain nonlinear systems. Motivated by the classical on-policy and off-policy algorithms of reinforcement learning, the online feedback relearning (FR) algorithm is developed, where the collected data include the influence of disturbance signals. The FR algorithm adapts better to environmental changes (such as control channel disturbances) than the off-policy algorithm, and has higher computational efficiency and better convergence performance than the on-policy algorithm. Data processing based on experience replay is used to improve data efficiency and convergence stability. Simulation experiments are presented to illustrate the convergence stability, optimality, and performance of the FR algorithm through comparisons.
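The abstract does not detail the replay mechanism; the snippet below is a minimal sketch of generic experience-replay bookkeeping (buffer capacity and batch size are illustrative), showing why replayed, uniformly sampled transitions help data efficiency and convergence stability.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal sketch of an experience-replay buffer; capacity and
    batch size are illustrative choices, not the paper's settings."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform sampling reuses past data and breaks temporal correlation,
        # which is where the data-efficiency and stability benefits come from.
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```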

8.
Long-Ji Lin. Machine Learning, 1992, 8(3-4): 293-321
To date, reinforcement learning has mostly been studied in the context of simple learning tasks, and the methods studied so far typically converge slowly. The purpose of this work is thus two-fold: 1) to investigate the utility of reinforcement learning in solving much more complicated learning tasks than previously studied, and 2) to investigate methods that will speed up reinforcement learning.

This paper compares eight reinforcement learning frameworks: adaptive heuristic critic (AHC) learning due to Sutton, Q-learning due to Watkins, and three extensions to each of the two basic methods for speeding up learning. The three extensions are experience replay, learning action models for planning, and teaching. The frameworks were investigated using connectionism as an approach to generalization. To evaluate the performance of the different frameworks, a dynamic environment was used as a testbed. The environment is moderately complex and nondeterministic. This paper describes these frameworks and algorithms in detail and presents an empirical evaluation of the frameworks.
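For concreteness, here is a minimal sketch of the Watkins-style Q-learning update together with a replay pass over stored transitions, in the spirit of the experience-replay extension compared in the paper. The step sizes, the backward replay order, and the helper names are illustrative assumptions rather than the paper's exact procedure.

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    """One-step tabular Q-learning update (Watkins)."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def replay(Q, experiences, actions, n_sweeps=10):
    """Sketch of experience replay: stored (s, a, r, s') transitions are
    re-presented to the learner, here in reverse order so that credit
    propagates back through a trajectory more quickly."""
    for _ in range(n_sweeps):
        for (s, a, r, s_next) in reversed(experiences):
            q_update(Q, s, a, r, s_next, actions)

Q = defaultdict(float)   # tabular action-value estimates, default 0.0
```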

9.
Instance-Based Learning Algorithms
Storing and using specific instances improves the performance of several supervised learning algorithms. These include algorithms that learn decision trees, classification rules, and distributed networks. However, no investigation has analyzed algorithms that use only specific instances to solve incremental learning tasks. In this paper, we describe a framework and methodology, called instance-based learning, that generates classification predictions using only specific instances. Instance-based learning algorithms do not maintain a set of abstractions derived from specific instances. This approach extends the nearest neighbor algorithm, which has large storage requirements. We describe how storage requirements can be significantly reduced with, at most, minor sacrifices in learning rate and classification accuracy. While the storage-reducing algorithm performs well on several real-world databases, its performance degrades rapidly with the level of attribute noise in training instances. Therefore, we extended it with a significance test to distinguish noisy instances. This extended algorithm's performance degrades gracefully with increasing noise levels and compares favorably with a noise-tolerant decision tree algorithm.
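As a hedged illustration of the storage-reduction idea (not the paper's exact algorithms), the sketch below keeps an instance only when the instances stored so far would have misclassified it; the significance test used to filter noisy instances is omitted, and all names are illustrative.

```python
import math

def _dist(x, y):
    """Euclidean distance between two numeric feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

class StorageReducingNN:
    """Rough sketch of a storage-reducing nearest-neighbor learner:
    an incoming instance is stored only if the current memory would
    misclassify it, which keeps storage far below 'store everything'."""

    def __init__(self):
        self.instances = []              # list of (features, label)

    def _predict_one(self, x):
        _, label = min(self.instances, key=lambda inst: _dist(inst[0], x))
        return label

    def train(self, stream):
        # Incremental, instance-by-instance learning over (features, label) pairs.
        for x, y in stream:
            if not self.instances or self._predict_one(x) != y:
                self.instances.append((x, y))

    def predict(self, x):
        return self._predict_one(x)
```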

10.
Mahadevan, Sridhar. Machine Learning, 1996, 22(1-3): 159-195
This paper presents a detailed study of average reward reinforcement learning, an undiscounted optimality framework that is more appropriate for cyclical tasks than the much better studied discounted framework. A wide spectrum of average reward algorithms are described, ranging from synchronous dynamic programming methods to several (provably convergent) asynchronous algorithms from optimal control and learning automata. A general sensitive discount optimality metric called n-discount-optimality is introduced and used to compare the various algorithms. The overview identifies a key similarity across several asynchronous algorithms that is crucial to their convergence, namely independent estimation of the average reward and the relative values. The overview also uncovers a surprising limitation shared by the different algorithms: while several algorithms can provably generate gain-optimal policies that maximize average reward, none of them can reliably filter these to produce bias-optimal (or T-optimal) policies that also maximize the finite reward to absorbing goal states. This paper also presents a detailed empirical study of R-learning, an average reward reinforcement learning method, using two empirical testbeds: a stochastic grid world domain and a simulated robot environment. A detailed sensitivity analysis of R-learning is carried out to test its dependence on learning rates and exploration levels. The results suggest that R-learning is quite sensitive to exploration strategies and can fall into sub-optimal limit cycles. The performance of R-learning is also compared with that of Q-learning, the best studied discounted RL method. Here, the results suggest that R-learning can be fine-tuned to give better performance than Q-learning in both domains.
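For reference, a minimal sketch of the R-learning update in the form usually attributed to Schwartz, which the study above evaluates empirically. `R` holds relative action values and `rho` the average-reward estimate; the step sizes are illustrative, and (as the abstract notes) the method is sensitive to them and to the exploration strategy.

```python
from collections import defaultdict

def r_learning_update(R, rho, s, a, r, s_next, actions, alpha=0.1, beta=0.01):
    """One R-learning step: update the relative action value R(s, a) and,
    on greedy steps, the average-reward estimate rho.  Returns the new rho."""
    best_next = max(R[(s_next, b)] for b in actions)
    best_here = max(R[(s, b)] for b in actions)
    R[(s, a)] += alpha * (r - rho + best_next - R[(s, a)])
    # The average-reward estimate is adjusted only when the chosen action
    # is greedy with respect to the current value estimates.
    if R[(s, a)] == max(R[(s, b)] for b in actions):
        rho += beta * (r - rho + best_next - best_here)
    return rho

R = defaultdict(float)   # relative action-value table, default 0.0
rho = 0.0                # average-reward estimate
```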

