共查询到20条相似文献,搜索用时 0 毫秒
1.
学习、交互及其结合是建立健壮、自治agent的关键必需能力。强化学习是agent学习的重要部分,agent强化学习包括单agent强化学习和多agent强化学习。文章对单agent强化学习与多agent强化学习进行了比较研究,从基本概念、环境框架、学习目标、学习算法等方面进行了对比分析,指出了它们的区别和联系,并讨论了它们所面临的一些开放性的问题。 相似文献
2.
Ann Maria Bell 《Computational Economics》2001,18(1):89-110
This paper examines the performance of simple reinforcement learningalgorithms in a stationary environment and in a repeated game where theenvironment evolves endogenously based on the actions of other agents. Sometypes of reinforcement learning rules can be extremely sensitive to smallchanges in the initial conditions, consequently, events early in a simulationcan affect the performance of the rule over a relatively long time horizon.However, when multiple adaptive agents interact, algorithms that performedpoorly in a stationary environment often converge rapidly to a stableaggregate behaviors despite the slow and erratic behavior of individuallearners. Algorithms that are robust in stationary environments can exhibitslow convergence in an evolving environment. 相似文献
3.
Relational reinforcement learning is presented, a learning technique that combines reinforcement learning with relational learning or inductive logic programming. Due to the use of a more expressive representation language to represent states, actions and Q-functions, relational reinforcement learning can be potentially applied to a new range of learning tasks. One such task that we investigate is planning in the blocks world, where it is assumed that the effects of the actions are unknown to the agent and the agent has to learn a policy. Within this simple domain we show that relational reinforcement learning solves some existing problems with reinforcement learning. In particular, relational reinforcement learning allows us to employ structural representations, to abstract from specific goals pursued and to exploit the results of previous learning phases when addressing new (more complex) situations. 相似文献
4.
结合强化学习技术讨论了单移动Agent学习的过程,然后扩展到多移动Agent学习领域,提出一个多移动Agent学习算法MMAL(MultiMobileAgentLearning)。算法充分考虑了移动Agent学习的特点,使得移动Agent能够在不确定和有冲突目标的上下文中进行决策,解决在学习过程中Agent对移动时机的选择,并且能够大大降低计算代价。目的是使Agent能在随机动态的环境中进行自主、协作的学习。最后,通过仿真试验表明这种学习算法是一种高效、快速的学习方法。 相似文献
5.
Embedding a Priori Knowledge in Reinforcement Learning 总被引:2,自引:0,他引:2
Carlos H. C. Ribeiro 《Journal of Intelligent and Robotic Systems》1998,21(1):51-71
In the last years, temporal differences methods have been put forward as convenient tools for reinforcement learning. Techniques based on temporal differences, however, suffer from a serious drawback: as stochastic adaptive algorithms, they may need extensive exploration of the state-action space before convergence is achieved. Although the basic methods are now reasonably well understood, it is precisely the structural simplicity of the reinforcement learning principle – learning through experimentation – that causes these excessive demands on the learning agent. Additionally, one must consider that the agent is very rarely a tabula rasa: some rough knowledge about characteristics of the surrounding environment is often available. In this paper, I present methods for embedding a priori knowledge in a reinforcement learning technique in such a way that both the mathematical structure of the basic learning algorithm and the capacity to generalise experience across the state-action space are kept. Extensive experimental results show that the resulting variants may lead to good performance, provided a sensible balance between risky use of prior imprecise knowledge and cautious use of learning experience is adopted. 相似文献
6.
Kernel-Based Reinforcement Learning 总被引:5,自引:0,他引:5
We present a kernel-based approach to reinforcement learning that overcomes the stability problems of temporal-difference learning in continuous state-spaces. First, our algorithm converges to a unique solution of an approximate Bellman's equation regardless of its initialization values. Second, the method is consistent in the sense that the resulting policy converges asymptotically to the optimal policy. Parametric value function estimates such as neural networks do not possess this property. Our kernel-based approach also allows us to show that the limiting distribution of the value function estimate is a Gaussian process. This information is useful in studying the bias-variance tradeoff in reinforcement learning. We find that all reinforcement learning approaches to estimating the value function, parametric or non-parametric, are subject to a bias. This bias is typically larger in reinforcement learning than in a comparable regression problem. 相似文献
7.
Reinforcement learning, and Q-learning in particular, encounter two major problems when dealing with large state spaces. First, learning the Q-function in tabular form may be infeasible because of the excessive amount of memory needed to store the table, and because the Q-function only converges after each state has been visited multiple times. Second, rewards in the state space may be so sparse that with random exploration they will only be discovered extremely slowly. The first problem is often solved by learning a generalization of the encountered examples (e.g., using a neural net or decision tree). Relational reinforcement learning (RRL) is such an approach; it makes Q-learning feasible in structural domains by incorporating a relational learner into Q-learning. The problem of sparse rewards has not been addressed for RRL. This paper presents a solution based on the use of reasonable policies to provide guidance. Different types of policies and different strategies to supply guidance through these policies are discussed and evaluated experimentally in several relational domains to show the merits of the approach. 相似文献
8.
9.
《国际自动化与计算杂志》2024,21(3)
With the breakthrough of AlphaGo,deep reinforcement learning has become a recognized technique for solving sequential decision-making problems.Despite its reputation,data inefficiency caused by its trial and error learning mechanism makes deep rein-forcement learning difficult to apply in a wide range of areas.Many methods have been developed for sample efficient deep reinforce-ment learning,such as environment modelling,experience transfer,and distributed modifications,among which distributed deep rein-forcement learning has shown its potential in various applications,such as human-computer gaming and intelligent transportation.In this paper,we conclude the state of this exciting field,by comparing the classical distributed deep reinforcement learning methods and studying important components to achieve efficient distributed learning,covering single player single agent distributed deep reinforce-ment learning to the most complex multiple players multiple agents distributed deep reinforcement learning.Furthermore,we review re-cently released toolboxes that help to realize distributed deep reinforcement learning without many modifications of their non-distrib-uted versions.By analysing their strengths and weaknesses,a multi-player multi-agent distributed deep reinforcement learning toolbox is developed and released,which is further validated on Wargame,a complex environment,showing the usability of the proposed tool-box for multiple players and multiple agents distributed deep reinforcement learning under complex games.Finally,we try to point out challenges and future trends,hoping that this brief review can provide a guide or a spark for researchers who are interested in distrib-uted deep reinforcement learning. 相似文献
10.
Reinforcement Learning with Replacing Eligibility Traces 总被引:26,自引:0,他引:26
The eligibility trace is one of the basic mechanisms used in reinforcement learning to handle delayed reward. In this paper we introduce a new kind of eligibility trace, the replacing trace, analyze it theoretically, and show that it results in faster, more reliable learning than the conventional trace. Both kinds of trace assign credit to prior events according to how recently they occurred, but only the conventional trace gives greater credit to repeated events. Our analysis is for conventional and replace-trace versions of the offline TD(1) algorithm applied to undiscounted absorbing Markov chains. First, we show that these methods converge under repeated presentations of the training set to the same predictions as two well known Monte Carlo methods. We then analyze the relative efficiency of the two Monte Carlo methods. We show that the method corresponding to conventional TD is biased, whereas the method corresponding to replace-trace TD is unbiased. In addition, we show that the method corresponding to replacing traces is closely related to the maximum likelihood solution for these tasks, and that its mean squared error is always lower in the long run. Computational results confirm these analyses and show that they are applicable more generally. In particular, we show that replacing traces significantly improve performance and reduce parameter sensitivity on the "Mountain-Car" task, a full reinforcement-learning problem with a continuous state space, when using a feature-based function approximator. 相似文献
11.
AGV(automated guided vehicle)路径规划问题已成为货物运输、快递分拣等领域中一项关键技术问题。由于在此类场景中需要较多的AGV合作完成,传统的规划模型难以协调多AGV之间的相互作用,采用分而治之的思想或许能获得系统的最优性能。基于此,该文提出一种最大回报频率的多智能体独立强化学习MRF(maximum reward frequency)Q-learning算法,对任务调度和路径规划同时进行优化。在学习阶段AGV不需要知道其他AGV的动作,减轻了联合动作引起的维数灾问题。采用Boltzmann与ε-greedy结合策略,避免收敛到较差路径,另外算法提出采用获得全局最大累积回报的频率作用于Q值更新公式,最大化多AGV的全局累积回报。仿真实验表明,该算法能够收敛到最优解,以最短的时间步长完成路径规划任务。 相似文献
12.
An agent that must learn to act in the world by trial and error faces the reinforcement learning problem, which is quite different from standard concept learning. Although good algorithms exist for this problem in the general case, they are often quite inefficient and do not exhibit generalization. One strategy is to find restricted classes of action policies that can be learned more efficiently. This paper pursues that strategy by developing algorithms that can efficiently learn action maps that are expressible in k-DNF. The algorithms are compared with existing methods in empirical trials and are shown to have very good performance. 相似文献
13.
A multi-agent reinforcement learning algorithm with fuzzy policy is addressed in this paper. This algorithm is used to deal
with some control problems in cooperative multi-robot systems. Specifically, a leader-follower robotic system and a flocking
system are investigated. In the leader-follower robotic system, the leader robot tries to track a desired trajectory, while
the follower robot tries to follow the reader to keep a formation. Two different fuzzy policies are developed for the leader
and follower, respectively. In the flocking system, multiple robots adopt the same fuzzy policy to flock. Initial fuzzy policies
are manually crafted for these cooperative behaviors. The proposed learning algorithm finely tunes the parameters of the fuzzy
policies through the policy gradient approach to improve control performance. Our simulation results demonstrate that the
control performance can be improved after the learning. 相似文献
14.
15.
An agent that must learn to act in the world by trial and error faces thereinforcement learning problem, which is quite different from standard concept learning. Although good algorithms exist for this problem in the general case, they are often quite inefficient and do not exhibit generalization. One strategy is to find restricted classes of action policies that can be learned more efficiently. This paper pursues that strategy by developing an algorithm that performans an on-line search through the space of action mappings, expressed as Boolean formulae. The algorithm is compared with existing methods in empirical trials and is shown to have very good performance. 相似文献
16.
Risk-Sensitive Reinforcement Learning 总被引:3,自引:0,他引:3
Most reinforcement learning algorithms optimize the expected return of a Markov Decision Problem. Practice has taught us the lesson that this criterion is not always the most suitable because many applications require robust control strategies which also take into account the variance of the return. Classical control literature provides several techniques to deal with risk-sensitive optimization goals like the so-called worst-case optimality criterion exclusively focusing on risk-avoiding policies or classical risk-sensitive control, which transforms the returns by exponential utility functions. While the first approach is typically too restrictive, the latter suffers from the absence of an obvious way to design a corresponding model-free reinforcement learning algorithm.Our risk-sensitive reinforcement learning algorithm is based on a very different philosophy. Instead of transforming the return of the process, we transform the temporal differences during learning. While our approach reflects important properties of the classical exponential utility framework, we avoid its serious drawbacks for learning. Based on an extended set of optimality equations we are able to formulate risk-sensitive versions of various well-known reinforcement learning algorithms which converge with probability one under the usual conditions. 相似文献
17.
18.
强化学习(Reinforcement Learning)是学习环境状态到动作的一种映射,并且能够获得最大的奖赏信号。强化学习中有三种方法可以实现回报的最大化:值迭代、策略迭代、策略搜索。该文介绍了强化学习的原理、算法,并对有环境模型和无环境模型的离散空间值迭代算法进行研究,并且把该算法用于固定起点和随机起点的格子世界问题。实验结果表明,相比策略迭代算法,该算法收敛速度快,实验精度好。 相似文献
19.
Reinforcement Learning in the Multi-Robot Domain 总被引:16,自引:4,他引:16
Maja J. Matarić 《Autonomous Robots》1997,4(1):73-83
This paper describes a formulation of reinforcement learning that enables learning in noisy, dynamic environments such as in the complex concurrent multi-robot learning domain. The methodology involves minimizing the learning space through the use of behaviors and conditions, and dealing with the credit assignment problem through shaped reinforcement in the form of heterogeneous reinforcement functions and progress estimators. We experimentally validate the approach on a group of four mobile robots learning a foraging task. 相似文献
20.
交通信号的智能控制是智能交通研究中的热点问题。为更加及时有效地自适应协调交通,文中提出了一种基于分布式深度强化学习的交通信号控制模型,采用深度神经网络框架,利用目标网络、双Q网络、价值分布提升模型表现。将交叉路口的高维实时交通信息离散化建模并与相应车道上的等待时间、队列长度、延迟时间、相位信息等整合作为状态输入,在对相位序列及动作、奖励做出恰当定义的基础上,在线学习交通信号的控制策略,实现交通信号Agent的自适应控制。为验证所提算法,在SUMO(Simulation of Urban Mobility)中相同设置下,将其与3种典型的深度强化学习算法进行对比。实验结果表明,基于分布式的深度强化学习算法在交通信号Agent的控制中具有更好的效率和鲁棒性,且在交叉路口车辆的平均延迟、行驶时间、队列长度、等待时间等方面具有更好的性能表现。 相似文献