Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model
Authors: Mohammad Gheshlaghi Azar, Rémi Munos, Hilbert J. Kappen
Affiliations:1. Department of Biophysics, Radboud University Nijmegen, 6525, EZ Nijmegen, The Netherlands
2. School of Computer Science, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA, 15213, USA
3. INRIA Lille, SequeL Project, 40 avenue Halley, 59650, Villeneuve d’Ascq, France
Abstract: We consider the problems of learning the optimal action-value function and the optimal policy in discounted-reward Markov decision processes (MDPs). We prove new PAC bounds on the sample complexity of two well-known model-based reinforcement learning (RL) algorithms in the presence of a generative model of the MDP: value iteration and policy iteration. The first result indicates that for an MDP with N state-action pairs and discount factor γ ∈ [0,1), only O(N log(N/δ)/((1−γ)^3 ε^2)) state-transition samples are required to find an ε-optimal estimate of the action-value function with probability (w.p.) 1−δ. Further, we prove that, for small values of ε, an order of O(N log(N/δ)/((1−γ)^3 ε^2)) samples is required to find an ε-optimal policy w.p. 1−δ. We also prove a matching lower bound of Θ(N log(N/δ)/((1−γ)^3 ε^2)) on the sample complexity of estimating the optimal action-value function with accuracy ε. To the best of our knowledge, this is the first minimax result on the sample complexity of RL: the upper bounds match the lower bound in terms of N, ε, δ, and 1/(1−γ) up to a constant factor. Also, both our lower bound and upper bound improve on the state of the art in terms of their dependence on 1/(1−γ).
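To see how the stated bound scales in practice, here is a minimal Python sketch that evaluates N log(N/δ)/((1−γ)^3 ε^2). Note the constant factor `c` is a hypothetical placeholder (the abstract gives only the order of growth, not explicit constants), so the absolute numbers are illustrative; the scaling in N, ε, δ, and 1/(1−γ) is what the bound asserts.

```python
import math

def sample_complexity(N, gamma, eps, delta, c=1.0):
    """Order-level bound c * N * log(N/delta) / ((1 - gamma)^3 * eps^2).

    c is an assumed placeholder constant; the paper states only the
    asymptotic order, not the constant factor.
    """
    return c * N * math.log(N / delta) / ((1.0 - gamma) ** 3 * eps ** 2)

# Scaling check: halving eps multiplies the bound by 4 (the 1/eps^2 dependence),
# independently of the unknown constant c.
n1 = sample_complexity(N=100, gamma=0.9, eps=0.1, delta=0.05)
n2 = sample_complexity(N=100, gamma=0.9, eps=0.05, delta=0.05)
print(round(n2 / n1, 6))  # → 4.0
```

The cubic dependence on the effective horizon 1/(1−γ) can be probed the same way: moving γ from 0.9 to 0.95 doubles 1/(1−γ) and hence multiplies the bound by 8.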
This article is indexed in SpringerLink and other databases.