Similar Documents
20 similar documents retrieved.
1.
One of the key difficulties in reinforcement learning research is balancing the exploration of unknown actions against the exploitation of the best actions known so far. Bayesian learning is a probabilistic approach that reasons over known probability distributions and observed data to make optimal decisions. Combining reinforcement learning with Bayesian learning therefore allows an agent to decide, based on its accumulated experience and newly acquired knowledge, whether to explore unknown actions or exploit the known optimal ones. This paper surveys single-agent and multi-agent Bayesian reinforcement learning methods: single-agent Bayesian reinforcement learning covers Bayesian Q-learning, Bayesian model learning, and Bayesian dynamic programming; multi-agent Bayesian reinforcement learning covers Bayesian imitation models, Bayesian coordination methods, and Bayesian learning for coalition formation under uncertainty. Finally, open problems for Bayesian methods in reinforcement learning are discussed.
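To make the single-agent Bayesian reinforcement learning idea above concrete, here is a minimal sketch of Bayesian Q-learning in Python: the agent keeps a Gaussian belief over each Q(s, a) and explores by Thompson sampling. The prior, the fixed observation noise, and the conjugate-style update are illustrative assumptions, not the formulation of any specific paper in this list.

```python
import random
from collections import defaultdict

class BayesianQAgent:
    """Toy Bayesian Q-learning: a Gaussian belief over each Q(s, a),
    with exploration driven by Thompson sampling (a simplifying assumption)."""

    def __init__(self, actions, gamma=0.95, obs_noise=1.0):
        self.actions = actions
        self.gamma = gamma
        self.obs_noise = obs_noise                    # assumed known noise of the TD target
        # belief over Q(s, a): (mean, precision); broad prior N(0, 1/1e-2)
        self.belief = defaultdict(lambda: (0.0, 1e-2))

    def _sample_q(self, state, action):
        mean, precision = self.belief[(state, action)]
        return random.gauss(mean, (1.0 / precision) ** 0.5)

    def act(self, state):
        # Thompson sampling: actions with uncertain value are tried more often
        return max(self.actions, key=lambda a: self._sample_q(state, a))

    def update(self, state, action, reward, next_state):
        # TD target built from current posterior means
        next_mean = max(self.belief[(next_state, a)][0] for a in self.actions)
        target = reward + self.gamma * next_mean
        mean, precision = self.belief[(state, action)]
        # conjugate Gaussian update, treating the target as a noisy observation of Q(s, a)
        new_precision = precision + 1.0 / self.obs_noise
        new_mean = (precision * mean + target / self.obs_noise) / new_precision
        self.belief[(state, action)] = (new_mean, new_precision)
```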

2.
Individual learning in an environment where more than one agent exists is a challenging task. In this paper, a single learning agent situated in an environment where multiple agents exist is modeled based on reinforcement learning. The environment is non-stationary and partially accessible from an agent's point of view. Therefore, the learning activities of an agent are influenced by the actions of other cooperative or competitive agents in the environment. A prey-hunter capture game that has the above characteristics is defined and experimented with to simulate the learning process of individual agents. Experimental results show that there are no strict rules for reinforcement learning. We suggest two new methods to improve the performance of agents. These methods decrease the number of states while keeping as much state information as necessary.

3.
The trade-off between exploration and exploitation is one of the central challenges in reinforcement learning. Exploration lets the agent take new actions to further improve its policy, while exploitation lets the agent use information from past experience to maximize the cumulative reward. Deep reinforcement learning commonly handles this trade-off with the ε-greedy strategy, which ignores other factors that influence the agent's decisions and is therefore somewhat blind. To address this, an ε-greedy strategy with an adaptively adjusted exploration factor is proposed: the cumulative reward obtained in each completed episode guides the agent toward sensible exploration or exploitation. A larger episode return indicates that the agent is currently taking more effective actions, so the exploration factor is decreased to make more use of past experience; conversely, a smaller episode return indicates that the current policy still has room for improvement, so the exploration factor is increased to explore more candidate actions. Experiments on Atari 2600 video games show that the improved strategy achieves higher average rewards, indicating that it balances exploration and exploitation better.
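A minimal sketch, in Python, of the adaptive ε-greedy idea described above: the exploration factor is decreased after episodes with a high cumulative reward and increased after episodes with a low one. The step size, the bounds, and the use of the best return so far as the reference point are assumptions for illustration, not the authors' exact schedule.

```python
import random

class AdaptiveEpsilonGreedy:
    """Adjust epsilon from episode returns: lower it when returns are high
    (exploit more), raise it when returns are low (explore more)."""

    def __init__(self, eps=1.0, eps_min=0.05, eps_max=1.0, step=0.05):
        self.eps, self.eps_min, self.eps_max, self.step = eps, eps_min, eps_max, step
        self.best_return = float("-inf")   # reference point (an assumption)

    def select(self, q_values, actions):
        # q_values: dict mapping each action to its current Q estimate
        if random.random() < self.eps:
            return random.choice(actions)
        return max(actions, key=lambda a: q_values[a])

    def end_episode(self, episode_return):
        if episode_return > self.best_return:
            self.best_return = episode_return
            self.eps = max(self.eps_min, self.eps - self.step)   # exploit more
        else:
            self.eps = min(self.eps_max, self.eps + self.step)   # explore more
```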

4.
Reinforcement learning improves a policy through trial-and-error interaction with the environment; its self-learning and online-learning characteristics have made it an important branch of machine learning research. To address the "curse of dimensionality" that has long plagued reinforcement learning, this paper builds on relational reinforcement learning and introduces a heuristic contour-table method: states, actions, and the Q-function are represented with first-order predicates that contain contour tables, the strengths of Prolog lists are exploited, and logical predicate rules are combined with reinforcement learning to form a new logical reinforcement learning method, CCLORRL, whose convergence is proved. The method uses contour-shape predicates to generate a shape-state table, which drastically reduces the state space, and uses heuristic rules to guide action selection, which reduces blind choices of states that do not occur in the samples. The CCLORRL algorithm is applied to Tetris, and experiments show it to be fairly efficient.

5.
This article is related to the research effort of constructing an intelligent agent, i.e., a computer system that is able to sense its environment (world), reason utilizing its internal knowledge, and execute actions upon the world (act). The specific part of this effort presented in this article is reinforcement learning, i.e., the process of acquiring new knowledge based upon an evaluative feedback, called reinforcement, received by the agent through interactions with the world. This article has two objectives: (1) to give a compact overview of reinforcement learning, and (2) to show that the evolution of the reinforcement learning paradigm has been driven by the need for more efficient learning through the addition of more structure to the learning agent. Therefore, both main ideas of reinforcement learning are introduced, and structural solutions to reinforcement learning are reviewed. Several architectural enhancements of the RL paradigm are discussed. These include incorporation of state information in the learning process, architectural solutions to learning with delayed reinforcement, dealing with structurally changing worlds through utilization of multiple models of the world, and focusing the attention of the learning agent through active perception. The paper closes with an overview of directions for applications and for future research in this area. © 1993 John Wiley & Sons, Inc.

6.
We describe a new preteaching method for reinforcement learning using a self-organizing map (SOM). The purpose is to increase the learning rate using a small amount of teaching data generated by a human expert. In our proposed method, the SOM is used to generate the initial teaching data for the reinforcement learning agent from a small amount of teaching data. The reinforcement learning function of the agent is initialized using the teaching data generated by the SOM in order to increase the probability of selecting the optimal actions it estimates. Because the agent can get high rewards from the start of reinforcement learning, it is expected that the learning rate will increase. The results of a mobile robot simulation showed that the learning rate increased even though the human expert had provided only a small amount of teaching data. This work was presented in part at the 7th International Symposium on Artificial Life and Robotics, Oita, Japan, January 16–18, 2002.
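One plausible reading of the SOM-based preteaching step is sketched below: a small one-dimensional SOM is fitted to the expert's (state, action) demonstrations, and the resulting prototypes are used to bias the agent's initial Q-values before ordinary reinforcement learning begins. The SOM size, training schedule, and initialization bonus are assumptions, not the authors' exact procedure.

```python
import numpy as np

def pretrain_q_with_som(demos, n_actions, state_dim, n_units=16,
                        epochs=200, lr=0.5, sigma=2.0, bonus=1.0):
    """Fit a tiny 1-D SOM to expert (state, integer-action) pairs, then build an
    initial Q-function that favours the expert's action near each prototype."""
    states = np.array([s for s, _ in demos], dtype=float)
    actions = np.array([a for _, a in demos])
    weights = np.random.default_rng(0).normal(size=(n_units, state_dim))

    for t in range(epochs):                       # standard SOM updates with decaying lr/sigma
        decay = 1.0 - t / epochs
        for s in states:
            bmu = int(np.argmin(np.linalg.norm(weights - s, axis=1)))
            dist = np.abs(np.arange(n_units) - bmu)
            h = np.exp(-(dist ** 2) / (2 * (sigma * decay + 1e-3) ** 2))
            weights += (lr * decay) * h[:, None] * (s - weights)

    # each SOM unit votes for the expert action seen most often among its samples
    unit_action = np.zeros(n_units, dtype=int)
    bmus = np.argmin(np.linalg.norm(states[:, None, :] - weights[None], axis=2), axis=1)
    for u in range(n_units):
        near = bmus == u
        if near.any():
            unit_action[u] = np.bincount(actions[near], minlength=n_actions).argmax()

    def initial_q(state):
        """Initial Q-values for a state: a small bonus on the expert-preferred action."""
        q = np.zeros(n_actions)
        bmu = int(np.argmin(np.linalg.norm(weights - np.asarray(state, float), axis=1)))
        q[unit_action[bmu]] += bonus
        return q

    return initial_q
```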

7.
Research on Methods for Improving the Speed of Reinforcement Learning
The term reinforcement learning comes from behavioural psychology, which views learning as a trial-and-error process of mapping environment states to actions. This characteristic inevitably makes intelligent systems harder to build and lengthens learning time. Reinforcement learning is slow because there is no explicit supervisory signal: when interacting with the environment, a reinforcement learning system has to rely on trial and error and on external evaluative signals to adjust its behaviour, so an intelligent system necessarily goes through a long learning process. How to speed up reinforcement learning is therefore one of the most important research questions. This paper discusses methods for improving the speed of reinforcement learning from several perspectives.

8.
Research on Multi-Agent Cooperation Based on Reinforcement Learning
Reinforcement learning provides a robust learning method for cooperation among multiple agents. This paper first introduces the principles and components of reinforcement learning, then describes the multi-agent Markov decision process (MMDP) and gives a reinforcement learning model for agents. On this basis, it compares the two reinforcement learning schemes used in multi-agent cooperation, IL (independent learning) and JAL (joint-action learning). Finally, it analyses several coordination mechanisms commonly used in cooperative multi-agent systems when multiple optimal policies exist.
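To make the IL/JAL distinction concrete, the sketch below shows the two update rules side by side for a two-agent setting: an independent learner (IL) keys its Q-table on its own action only, while a joint-action learner (JAL) keys it on the joint action and evaluates states against an empirical model of the other agent's action frequencies. The tabular form and the frequency-based opponent model are assumptions for illustration.

```python
from collections import defaultdict

class IndependentLearner:
    """IL: Q(s, a_i) over the agent's own action; other agents are
    treated as part of the environment."""
    def __init__(self, actions, alpha=0.1, gamma=0.95):
        self.q = defaultdict(float)
        self.actions, self.alpha, self.gamma = actions, alpha, gamma

    def update(self, s, my_a, r, s2):
        target = r + self.gamma * max(self.q[(s2, a)] for a in self.actions)
        self.q[(s, my_a)] += self.alpha * (target - self.q[(s, my_a)])


class JointActionLearner:
    """JAL: Q(s, a_i, a_j) over the joint action, evaluated against an
    empirical model of the other agent's action frequencies."""
    def __init__(self, actions, other_actions, alpha=0.1, gamma=0.95):
        self.q = defaultdict(float)
        self.counts = defaultdict(lambda: defaultdict(int))  # state -> other action -> count
        self.actions, self.other_actions = actions, other_actions
        self.alpha, self.gamma = alpha, gamma

    def _expected_value(self, s, my_a):
        total = sum(self.counts[s].values()) or 1
        return sum(self.q[(s, my_a, aj)] * self.counts[s][aj] / total
                   for aj in self.other_actions)

    def update(self, s, my_a, other_a, r, s2):
        self.counts[s][other_a] += 1
        target = r + self.gamma * max(self._expected_value(s2, a) for a in self.actions)
        self.q[(s, my_a, other_a)] += self.alpha * (target - self.q[(s, my_a, other_a)])
```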

9.
To address the curse of dimensionality and the lack of coordination mechanisms that arise when traditional reinforcement learning is applied to adaptive urban traffic-signal timing, a reinforcement learning algorithm with an interactive coordination mechanism is proposed. An independent Q-learning algorithm for urban traffic-signal timing decisions is designed with average vehicle delay as the performance index. This independent algorithm is then extended with a direct interaction mechanism in which the signal-control agents at adjacent intersections directly exchange timing actions and interaction-point values. Simulation experiments show that reinforcement learning with the interactive coordination mechanism clearly outperforms the independent algorithm, coordination is more effective, the learning algorithm converges well, and the interaction-point values tend to stabilize.

10.
We propose a combination of belief revision and reinforcement learning which leads to a self-learning agent. The agent shows six qualities we deem necessary for a successful and adaptive learner. This is achieved by representing the agent's belief on two different levels, one numerical and one symbolical. While the former is implemented using basic reinforcement learning techniques, the latter is represented by Spohn's ranking functions. To make these ranking functions fit into a reinforcement learning framework, we studied the revision process and identified key weaknesses of the approach used to date. Despite the fact that the revision was modeled to support frequent updates, we propose and justify an alternative revision which leads to more plausible results. We show in an example application the benefits of the new approach, including faster learning and the extraction of learned rules.

11.
RRL is a relational reinforcement learning system based on Q-learning in relational state-action spaces. It aims to enable agents to learn how to act in an environment that has no natural representation as a tuple of constants. For relational reinforcement learning, the learning algorithm used to approximate the mapping between state-action pairs and their so-called Q(uality)-value has to be very reliable, and it has to be able to handle the relational representation of state-action pairs. In this paper we investigate the use of Gaussian processes to approximate the Q-values of state-action pairs. In order to employ Gaussian processes in a relational setting we propose graph kernels as a covariance function between state-action pairs. The standard prediction mechanism for Gaussian processes requires a matrix inversion which can become unstable when the kernel matrix has low rank. These instabilities can be avoided by employing QR-factorization. This leads to better and more stable performance of the algorithm and a more efficient incremental update mechanism. Experiments conducted in the blocks world and with the Tetris game show that Gaussian processes with graph kernels can compete with, and often improve on, regression trees and instance-based regression as a generalization algorithm for RRL.
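The QR-factorization remark in this entry can be illustrated generically: instead of inverting the kernel matrix, the Gaussian process posterior mean is obtained by solving a QR-factorized linear system, which is more stable when the kernel matrix is close to rank-deficient. The noise level and the plain NumPy formulation are assumptions; the paper's graph kernel is not reproduced here.

```python
import numpy as np

def gp_predict_qr(K_train, y_train, k_star, noise=1e-3):
    """GP posterior mean via QR factorization instead of a direct matrix inverse.
    K_train: kernel matrix between training state-action pairs,
    y_train: observed Q-value targets,
    k_star:  kernel vector between the query pair and the training pairs."""
    A = K_train + noise * np.eye(len(K_train))   # jitter for numerical conditioning
    Q, R = np.linalg.qr(A)
    alpha = np.linalg.solve(R, Q.T @ y_train)    # solves A @ alpha = y via QR
    return k_star @ alpha                        # predicted Q-value for the query pair
```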

12.

Website hacking is a frequent attack type used by malicious actors to obtain confidential information, modify the integrity of web pages, or make websites unavailable. The tools used by attackers are becoming more and more automated and sophisticated, and malicious machine learning agents seem to be the next development in this line. In order to provide ethical hackers with similar tools, and to understand the impact and the limitations of artificial agents, we present in this paper a model that formalizes web hacking tasks for reinforcement learning agents. Our model, named Agent Web Model, considers web hacking as a capture-the-flag style challenge, and it defines reinforcement learning problems at seven different levels of abstraction. We discuss the complexity of these problems in terms of the actions and states an agent has to deal with, and we show that such a model allows us to represent most of the relevant web vulnerabilities. Aware that the driver of advances in reinforcement learning is the availability of standardized challenges, we provide an implementation for the first three abstraction layers, in the hope that the community will consider these challenges in order to develop intelligent web hacking agents.

13.
In this paper, we investigated an approach for robots to learn to adapt dance actions to humans' preferences through interaction and feedback. The human's preferences were extracted by analysing the common action patterns that received positive or negative feedback from the human during robot dancing. By using a buffering technique to store the dance actions before a feedback, each individual's preferences can be extracted even when a reward is received late. The extracted preferred dance actions from different people were then combined to generate improved dance sequences, i.e. performing more of what was preferred and less of what was not preferred. Together with the Softmax action-selection method, the Sarsa reinforcement learning algorithm was used as the underlying learning algorithm and to effectively control the trade-off between exploitation of the learnt dance skills and exploration of new dance actions. The results showed that the robot learnt, using interactive reinforcement learning, the preferences of its human partners, and that the dance improved as preferences were extracted from more human partners.
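The underlying learner in this entry is standard Sarsa combined with Softmax (Boltzmann) action selection; a generic tabular sketch of that combination follows. The temperature value and the tabular Q representation are assumptions, and the preference-extraction and buffering steps are not shown.

```python
import math
import random
from collections import defaultdict

class SoftmaxSarsa:
    """Tabular Sarsa with Boltzmann (Softmax) action selection."""

    def __init__(self, actions, alpha=0.1, gamma=0.9, temperature=0.5):
        self.q = defaultdict(float)
        self.actions, self.alpha, self.gamma, self.tau = actions, alpha, gamma, temperature

    def select(self, state):
        prefs = [self.q[(state, a)] / self.tau for a in self.actions]
        m = max(prefs)                                  # subtract max for numerical stability
        weights = [math.exp(p - m) for p in prefs]
        return random.choices(self.actions, weights=weights, k=1)[0]

    def update(self, s, a, r, s2, a2, done=False):
        # on-policy update toward the value of the action actually selected next
        target = r if done else r + self.gamma * self.q[(s2, a2)]
        self.q[(s, a)] += self.alpha * (target - self.q[(s, a)])
```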

14.
Multi-agent reinforcement learning technologies are mainly investigated from two perspectives: concurrency and game theory. The former chiefly applies to cooperative multi-agent systems, while the latter usually applies to coordinated multi-agent systems. However, problems such as credit assignment and multiple Nash equilibria arise with these approaches. In this paper, we propose a new multi-agent reinforcement learning model and algorithm, LMRL, from a layered perspective. The LMRL model is composed of an off-line training layer that employs single-agent reinforcement learning to acquire stationary strategy knowledge, and an online interaction layer that employs multi-agent reinforcement learning together with strategy knowledge that can be revised dynamically to interact with the environment. An agent with LMRL can improve its generalization capability, adaptability and coordination ability. Experiments show that the performance of LMRL can be better than that of single-agent reinforcement learning and Nash-Q.

15.
A reinforcement learning agent solves decision problems by learning an optimal policy that maps states to actions. In reinforcement learning, the agent improves its own behaviour through trial-and-error interaction with the environment. The Markov decision process (MDP) is the standard model for reinforcement learning problems, and dynamic programming provides policy-dependent value-function learning algorithms for agents in Markovian environments. However, because the agent has to store the entire value function during learning, the memory required grows enormously as the state space increases. This paper proposes a forgetting algorithm for reinforcement learning based on dynamic programming: by introducing the basic principles of forgetting from the psychology of memory into value-function learning, it derives a fairly good class of dynamic-programming methods for reinforcement learning problems, namely the Forget-DP algorithm.

16.
In this paper, a multi-agent reinforcement learning method based on predicting the actions of other agents is proposed. In a multi-agent system, the action selection of the learning agent is unavoidably affected by other agents' actions. Therefore, joint states and joint actions are involved in the multi-agent reinforcement learning system. A novel agent action prediction method based on the probabilistic neural network (PNN) is proposed, in which the PNN is used to predict the actions of other agents. Furthermore, a policy-sharing mechanism is used to exchange the learning policies of multiple agents, with the aim of speeding up learning. Finally, the application of the presented method to robot soccer is studied. Through learning, robot players can master the mapping policy from state information to the action space, and coordination and cooperation among multiple robots are well realized.
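A rough sketch of predicting another agent's action with a probabilistic neural network (PNN), here written as a Parzen-window classifier over stored (state, action) examples. The Gaussian kernel width and the feature encoding are assumptions; the authors' network details are not reproduced.

```python
import numpy as np

class PNNActionPredictor:
    """Parzen-window PNN: estimate p(state | action) with Gaussian kernels over
    stored examples and predict the most probable action of the other agent."""

    def __init__(self, sigma=0.5):
        self.sigma = sigma
        self.examples = {}              # action -> array of states observed with that action

    def fit(self, states, actions):
        states = np.asarray(states, dtype=float)
        for a in set(actions):
            mask = np.array([ai == a for ai in actions])
            self.examples[a] = states[mask]

    def predict(self, state):
        state = np.asarray(state, dtype=float)
        scores = {}
        for a, xs in self.examples.items():
            d2 = np.sum((xs - state) ** 2, axis=1)
            scores[a] = np.mean(np.exp(-d2 / (2 * self.sigma ** 2)))
        return max(scores, key=scores.get)

# hypothetical usage: predict the opponent's next action from the current state
# pnn = PNNActionPredictor(); pnn.fit(past_states, past_opponent_actions)
# predicted = pnn.predict(current_state)
```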

17.
This paper presents a hybrid agent architecture that integrates the behaviours of BDI agents, specifically desire and intention, with a neural network based reinforcement learner known as Temporal Difference-Fusion Architecture for Learning and COgNition (TD-FALCON). With the explicit maintenance of goals, the agent performs reinforcement learning with awareness of its objectives instead of relying on external reinforcement signals. More importantly, the intention module equips the hybrid architecture with deliberative planning capabilities, enabling the agent to purposefully maintain an agenda of actions to perform and reducing the need to constantly sense the environment. Through reinforcement learning, plans can also be learned and evaluated without the rigidity of user-defined plans as used in traditional BDI systems. For intention and reinforcement learning to work cooperatively, two strategies are presented for combining the intention module and the reactive learning module for decision making in a real-time environment. Our case study, based on a minefield navigation domain, investigates how the desire and intention modules may cooperatively enhance the capability of a pure reinforcement learner. The empirical results show that the hybrid architecture is able to learn plans efficiently and tap both intentional and reactive action execution to yield a robust performance.

18.
In this paper, we first discuss the meaning of physical embodiment and the complexity of the environment in the context of multi-agent learning. We then propose a vision-based reinforcement learning method that acquires cooperative behaviors in a dynamic environment. We use the robot soccer game initiated by RoboCup (Kitano et al., 1997) to illustrate the effectiveness of our method. Each agent works with other team members to achieve a common goal against opponents. Our method estimates the relationships between a learner's behaviors and those of other agents in the environment through interactions (observations and actions) using a technique from system identification. In order to identify the model of each agent, Akaike's Information Criterion is applied to the results of Canonical Variate Analysis to clarify the relationship between the observed data in terms of actions and future observations. Next, reinforcement learning based on the estimated state vectors is performed to obtain the optimal behavior policy. The proposed method is applied to a soccer playing situation. The method successfully models a rolling ball and other moving agents and acquires the learner's behaviors. Computer simulations and real experiments are shown and a discussion is given.

19.
This paper introduces an approach to off-policy Monte Carlo (MC) learning guided by behaviour patterns gleaned from approximation spaces and the rough set theory introduced by Zdzisław Pawlak in 1981. During reinforcement learning, an agent makes action selections in an effort to maximize a reward signal obtained from the environment. The problem considered in this paper is how to estimate the expected value of cumulative future discounted rewards in evaluating agent actions during reinforcement learning. The solution to this problem results from a form of weighted sampling using a combination of MC methods and approximation spaces to estimate the expected value of returns on actions. This is made possible by considering behaviour patterns of an agent in the context of approximation spaces. The framework provided by an approximation space makes it possible to measure the degree to which agent behaviours are a part of ("covered by") a set of accepted agent behaviours that serve as a behaviour evaluation norm. Furthermore, this article introduces an adaptive action control strategy called run-and-twiddle (RT) (a form of adaptive learning introduced by Oliver Selfridge in 1984), where approximation spaces are constructed on a "need by need" basis. Finally, a monocular vision system has been selected to facilitate the evaluation of the reinforcement learning methods. The goal of the vision system is to track a moving object, and rewards are based on the proximity of the object to the centre of the camera field of view. The contribution of this article is the introduction of an RT form of off-policy MC learning.
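The approximation-space machinery is specific to this paper, but the off-policy Monte Carlo backbone it builds on can be sketched generically. Below is weighted importance-sampling off-policy MC control over recorded episodes, with a greedy target policy; the tabular representation and the episode format are assumptions.

```python
from collections import defaultdict

def off_policy_mc_control(episodes, actions, gamma=0.99):
    """Weighted importance-sampling off-policy MC control.
    `episodes` is a list of episodes, each a list of
    (state, action, reward, behaviour_policy_probability) tuples."""
    q = defaultdict(float)
    c = defaultdict(float)                      # cumulative importance weights
    target = {}                                 # greedy target policy

    for episode in episodes:
        g, w = 0.0, 1.0
        for state, action, reward, b_prob in reversed(episode):
            g = gamma * g + reward
            c[(state, action)] += w
            q[(state, action)] += (w / c[(state, action)]) * (g - q[(state, action)])
            target[state] = max(actions, key=lambda a: q[(state, a)])
            if action != target[state]:
                break                           # target policy would never take this action
            w /= b_prob                         # greedy target assigns probability 1
    return q, target
```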
