10 similar documents found (search time: 125 ms)
1.
2.
3.
A Recursive Q-Learning Algorithm with Associated Values for Finite Samples and a Proof of Its Convergence  Total citations: 5 (self-citations: 0, by others: 5)
A reinforcement-learning agent solves problems by learning an optimal policy that maps states to actions. Optimal decisions can generally be obtained in two ways: by maximizing reward or by minimizing cost. Using the optimal-cost-function approach, this paper presents a new Q-learning algorithm. Q-learning is an effective reinforcement-learning method for Markov decision problems with incomplete information. Watkins proposed the basic Q-learning algorithm and proved that its iterative Q-value update converges under certain conditions, but his algorithm does not account for the influence that the choice of initial state and initial action during iteration has on subsequent learning. The associated-value recursive Q-learning algorithm proposed here therefore improves on the original algorithm and has better convergence properties. Starting from the optimal-cost-function approach, the paper derives an associated-value recursive form of Q-learning, a construction that allows many results from dynamic programming (DP) to be carried over directly to the study of Q-learning.
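The cost-based formulation is described above only in prose. As a rough sketch, tabular Q-learning over costs simply replaces the usual max over successor actions with a min; the `env` interface (`states`, `actions`, `reset`, `step`) below is a hypothetical stand-in, not anything from the paper, and the paper's associated-value recursion is not reproduced.

```python
import random

def q_learning_min_cost(env, n_episodes=500, alpha=0.1, epsilon=0.1):
    """Tabular Q-learning that minimizes expected cost.

    `env` is assumed (hypothetically) to expose:
      env.states, env.actions, env.reset() -> state,
      env.step(state, action) -> (next_state, cost, done).
    """
    Q = {(s, a): 0.0 for s in env.states for a in env.actions}
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy over costs: explore, else pick the cheapest action
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = min(env.actions, key=lambda act: Q[(s, act)])
            s2, cost, done = env.step(s, a)
            # cost-based target: immediate cost plus minimal estimated future cost
            target = cost + (0.0 if done else min(Q[(s2, b)] for b in env.actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```

The only structural change from reward-maximizing Q-learning is the direction of the optimization.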
4.
An Optimal-Cost Associated-Value Recursive Q-Learning Algorithm Based on Finite Samples  Total citations: 4 (self-citations: 2, by others: 4)
A reinforcement-learning agent solves decision problems by learning an optimal policy that maps states to actions. Optimal decisions can generally be obtained in two ways: by maximizing reward or by minimizing cost. This paper uses the optimal-cost-function approach to derive a new Q-learning algorithm. Q-learning is an effective reinforcement-learning method for Markov decision problems with incomplete information. Starting from the optimal-cost-function approach, the paper presents an associated-value recursive Q-learning algorithm, a construction that allows many results from dynamic programming (DP) to be carried over directly to the study of Q-learning.
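For reference, the optimal-cost Bellman equation and the corresponding Q-learning update that both of these entries build on can be written in the standard textbook form below; the notation is assumed, not reproduced from either paper.

```latex
% Optimal-cost Bellman equation: a minimum over actions replaces the usual maximum
\[
  Q^*(s,a) \;=\; c(s,a) \;+\; \gamma \sum_{s'} P(s' \mid s, a)\, \min_{b} Q^*(s', b)
\]
% Sampled Q-learning update driven toward that fixed point
\[
  Q_{t+1}(s_t, a_t) \;=\; Q_t(s_t, a_t)
    \;+\; \alpha_t \Bigl[\, c_t + \gamma \min_{b} Q_t(s_{t+1}, b) - Q_t(s_t, a_t) \Bigr]
\]
% Watkins-style convergence additionally requires step sizes with
% \(\sum_t \alpha_t = \infty\) and \(\sum_t \alpha_t^2 < \infty\)
% for every state-action pair visited infinitely often.
```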
5.
Reinforcement learning is the problem of generating optimal behavior in a sequential decision-making environment given the opportunity of interacting with it. Many algorithms for solving reinforcement-learning problems work by computing improved estimates of the optimal value function. We extend prior analyses of reinforcement-learning algorithms and present a powerful new theorem that can provide a unified analysis of such value-function-based reinforcement-learning algorithms. The usefulness of the theorem lies in how it allows the convergence of a complex asynchronous reinforcement-learning algorithm to be proved by verifying that a simpler synchronous algorithm converges. We illustrate the application of the theorem by analyzing the convergence of Q-learning, model-based reinforcement learning, Q-learning with multistate updates, Q-learning for Markov games, and risk-sensitive reinforcement learning.
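A minimal sketch of the distinction the theorem exploits, under assumed array shapes (`P[s, a]` a distribution over next states, `R[s, a]` an expected reward); neither function is code from the paper.

```python
import numpy as np

def synchronous_sweep(Q, P, R, gamma):
    """One synchronous update: every (s, a) entry is recomputed from a
    frozen copy of Q, i.e., classical value iteration on Q-values."""
    old = Q.copy()
    n_states, n_actions = Q.shape
    for s in range(n_states):
        for a in range(n_actions):
            Q[s, a] = R[s, a] + gamma * P[s, a] @ old.max(axis=1)
    return Q

def asynchronous_update(Q, s, a, r, s2, alpha, gamma):
    """One asynchronous Q-learning step: a single entry moves toward a
    noisy sample of the same Bellman operator."""
    Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
    return Q
```

The theorem's payoff is that proving convergence for a sweep like `synchronous_sweep` can suffice to conclude convergence for the sampled, one-entry-at-a-time process.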
6.
Xi-Ren Cao, Junyu Zhang. IEEE Transactions on Automatic Control, 2008, 53(2): 496-508
In this paper, we propose a new approach to the theory of finite multichain Markov decision processes (MDPs) with different performance optimization criteria. We first propose the concept of nth-order bias; then, using the average reward and bias difference formulas derived in this paper, we develop an optimization theory for finite MDPs that covers a complete spectrum from average optimality, bias optimality, to all high-order bias optimality, in a unified way. The approach is simple, direct, natural, and intuitive; it depends neither on Laurent series expansion nor on discounted MDPs. We also propose one-phase policy iteration algorithms for bias and high-order bias optimal policies, which are more efficient than the two-phase algorithms in the literature. Furthermore, we derive high-order bias optimality equations. This research is a part of our effort in developing sensitivity-based learning and optimization theory.
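To make the average-reward and bias terminology concrete, here is a hypothetical policy-evaluation step for a fixed unichain policy, solving for the gain (average reward) g and the first-order bias h. It sketches standard theory only, not the paper's nth-order bias machinery or its one-phase algorithms.

```python
import numpy as np

def evaluate_gain_and_bias(P_pi, r_pi):
    """Solve the average-reward evaluation equations for a unichain policy:
        g * 1 + (I - P_pi) h = r_pi,   normalized by h[0] = 0,
    where P_pi is the policy's transition matrix and r_pi its reward vector."""
    n = len(r_pi)
    A = np.zeros((n, n))
    A[:, 0] = 1.0                 # coefficient of the gain g in every row
    M = np.eye(n) - P_pi
    A[:, 1:] = M[:, 1:]           # h[0] is pinned to 0, so its column drops out
    x = np.linalg.solve(A, np.asarray(r_pi, dtype=float))
    g, h = x[0], np.concatenate(([0.0], x[1:]))
    return g, h
```

Bias-optimal policy iteration then uses h (and its higher-order analogues) to break ties among gain-optimal policies.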
7.
8.
A Temporal-Difference Learning Algorithm for Average-Criterion Problems  Total citations: 2 (self-citations: 0, by others: 2)
We consider a family of online temporal-difference (TD) learning algorithms for average-criterion stochastic dynamic programming (SDP) problems. During learning, the relative value function of the average-criterion problem is the target function the controller has to learn. The proposed algorithms generalize the existing TD(λ) and R-learning algorithms.
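Since the abstract presents the algorithms as generalizations of TD(λ) and R-learning, a sketch of the classic R-learning update (Schwartz's formulation, with variable names and an `actions` mapping that are mine, not the paper's) may help fix ideas.

```python
def r_learning_step(Q, rho, s, a, r, s2, actions, alpha=0.1, beta=0.01):
    """One R-learning step for the average-reward criterion.

    Q[(s, a)] estimates the relative value function the controller has to
    learn; rho estimates the average reward per step. `actions` maps each
    state to its available actions (an assumed interface)."""
    best_next = max(Q[(s2, b)] for b in actions[s2])
    best_here = max(Q[(s, b)] for b in actions[s])
    was_greedy = Q[(s, a)] == best_here   # record greediness before Q changes
    # relative-value update: rewards are measured against the average rho
    Q[(s, a)] += alpha * (r - rho + best_next - Q[(s, a)])
    # the average-reward estimate moves only on greedy (non-exploratory) steps
    if was_greedy:
        rho += beta * (r - rho + best_next - best_here)
    return rho
```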
9.
We provide an analytical comparison between discounted and average reward temporal-difference (TD) learning with linearly parameterized approximations. We first consider the asymptotic behavior of the two algorithms. We show that as the discount factor approaches 1, the value function produced by discounted TD approaches the differential value function generated by average reward TD. We further argue that if the constant function (which is typically used as one of the basis functions in discounted TD) is appropriately scaled, the transient behaviors of the two algorithms are also similar. Our analysis suggests that the computational advantages of average reward TD that have been observed in some prior empirical work may have been caused by inappropriate basis function scaling rather than fundamental differences in problem formulations or algorithms.
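A toy illustration of the scaling point (names and interfaces are illustrative, not the authors' code): in discounted TD(0) with linear features, the learned weight on an unscaled constant feature grows roughly like 1/(1 - gamma), which distorts transients as gamma approaches 1.

```python
import numpy as np

def td0_linear(phi, samples, gamma, alpha=0.05):
    """Discounted TD(0) with a linearly parameterized value function.

    phi(s) returns the feature vector for state s; `samples` is a list of
    (s, r, s2) transitions (an assumed interface). If one component of phi
    is the constant function, scaling that component by roughly
    1 / (1 - gamma) keeps its learned weight on a gamma-independent scale,
    per the comparison above."""
    theta = np.zeros_like(phi(samples[0][0]), dtype=float)
    for s, r, s2 in samples:
        # TD error against the discounted one-step target
        delta = r + gamma * (phi(s2) @ theta) - phi(s) @ theta
        theta = theta + alpha * delta * phi(s)
    return theta
```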
10.
P. L. Lanzi. Soft Computing - A Fusion of Foundations, Methodologies and Applications, 2002, 6(3-4): 162-170
We analyze learning classifier systems in the light of tabular reinforcement learning. We note that although genetic algorithms are the most distinctive feature of learning classifier systems, it is not clear whether genetic algorithms are important to learning classifier systems. In fact, there are models which are strongly based on evolutionary computation (e.g., Wilson's XCS) and others which do not exploit evolutionary computation at all (e.g., Stolzmann's ACS). To find some clarification, we try to develop learning classifier systems "from scratch", i.e., starting from one of the best-known reinforcement-learning techniques, Q-learning. We first consider the basics of reinforcement learning: a problem modeled as a Markov decision process and tabular Q-learning. We introduce a formal framework to define a general-purpose rule-based representation which we use to implement tabular Q-learning. We formally define generalization within rules and discuss possible approaches to extending our rule-based Q-learning with generalization capabilities. We suggest that genetic algorithms are probably the most general approach for adding generalization, although they might not be the only solution.
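To make the rule-based reconstruction of tabular Q-learning concrete, here is a hypothetical minimal version in which each rule is a maximally specific classifier (a condition matching exactly one state, plus an action and a payoff prediction), so the rule population is exactly a Q-table in classifier clothing; no genetic algorithm or generalization is involved.

```python
class Rule:
    """A maximally specific classifier: its condition matches one state."""
    def __init__(self, state, action, prediction=0.0):
        self.state, self.action, self.prediction = state, action, prediction

    def matches(self, state):
        return self.state == state

def q_update(rules, s, a, r, s2, alpha=0.1, gamma=0.9):
    """Q-learning performed on the rule population (assumed to contain one
    rule per state-action pair): the fired rule's prediction plays the
    role of Q(s, a)."""
    fired = next(x for x in rules if x.matches(s) and x.action == a)
    best_next = max((x.prediction for x in rules if x.matches(s2)), default=0.0)
    fired.prediction += alpha * (r + gamma * best_next - fired.prediction)
```

Generalization, in this framing, is whatever lets a condition match more than one state; a genetic algorithm is then one possible search procedure over such conditions.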