Similar Documents
20 similar documents found.
1.
A study on expertise of agents and its effects on cooperative Q-learning.   Cited by: 1 (self-citations: 0, other citations: 1)
Cooperation in learning (CL) can be realized in a multiagent system if agents are capable of learning both from their own experience and from other agents' knowledge and expertise. These extra resources translate into higher efficiency and faster learning in CL compared with individual learning (IL). In the real world, however, implementing CL is not straightforward, in part because agents may differ in their areas of expertise (AOE). In this paper, homogeneous reinforcement-learning agents are considered in an environment with multiple goals or tasks; as a result, they become expert in different domains with different degrees of expertness. Each agent uses a one-step Q-learning algorithm and can exchange its Q-table with those of its teammates. Two crucial questions are addressed: "How can the AOE of an agent be extracted?" and "How can agents improve their performance in CL by knowing their AOEs?" An algorithm is developed to extract the AOE based on state transitions, which serves as a gold standard from a behavioral point of view. Moreover, it is shown that the AOE can be obtained implicitly through agents' expertness at the state level. Three new CL methods based on combining Q-tables are developed and examined for overall performance after CL. Their performance is compared with that of IL, strategy sharing (SS), and weighted SS (WSS). The results show that the AOE-based methods outperform existing CL methods that do not use the notion of AOE, and strongly support the idea that cooperation based on the AOE performs better than general CL methods.
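For illustration only, here is a minimal Python sketch of the weighted strategy-sharing (WSS) idea that the abstract uses as a baseline: teammates' Q-tables are combined by an expertness-weighted average. The function name, weights, and table sizes are hypothetical; this is not the paper's AOE-based combination rule.

```python
import numpy as np

def weighted_strategy_sharing(q_tables, weights):
    """Combine teammates' Q-tables by a weighted average (WSS-style sketch).

    q_tables: list of (n_states, n_actions) arrays.
    weights:  one expertness weight per agent (normalised below).
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()               # normalise expertness weights
    stacked = np.stack(q_tables, axis=0)            # (n_agents, n_states, n_actions)
    return np.tensordot(weights, stacked, axes=1)   # weighted sum over agents

# Example: two agents, 3 states, 2 actions, agent 0 treated as twice as "expert".
qa, qb = np.random.rand(3, 2), np.random.rand(3, 2)
q_new = weighted_strategy_sharing([qa, qb], weights=[2.0, 1.0])
```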

2.
Neuro-fuzzy systems have great application potential in intelligent robot control, but almost all existing construction methods face a severe shortage of training samples. To overcome problems such as the "curse of dimensionality" that traditional construction methods may suffer when samples are hard to obtain, this paper introduces a Q-learning mechanism into a fuzzy neural network and proposes a Q-learning-based fuzzy neural network model, thereby endowing the neuro-fuzzy system with self-learning capability. Simulation results for the control of Sugeno's fuzzy car are given at the end of the paper. The experiments show that embedding the Q-learning mechanism into a neuro-fuzzy system is effective and can be used to realize self-learning of intelligent robot behaviors. It is worth noting that the simulation experiments reported here can just as easily be carried out on a real system, provided the system can supply sensory information to serve as the evaluation signal.

3.
Reinforcement learning (RL) has been applied to many fields and applications, but the trade-off between exploration and exploitation in the action selection policy remains a dilemma. The best-known RL algorithms are Q-learning and Sarsa, and they have different characteristics: generally speaking, Sarsa converges faster, while Q-learning achieves better final performance. However, Sarsa is easily trapped in local minima, and Q-learning needs more time to learn. Most of the literature has investigated the action selection policy. Instead of studying an action selection strategy, this paper focuses on how to combine Q-learning with the Sarsa algorithm and presents a new method, called backward Q-learning, which can be implemented within both Sarsa and Q-learning. The backward Q-learning algorithm directly tunes the Q-values, and these Q-values in turn indirectly affect the action selection policy. The proposed RL algorithms can therefore increase learning speed and improve final performance. Finally, three experiments, cliff walk, mountain car, and cart-pole balancing control, are used to verify the feasibility and effectiveness of the proposed scheme. All the simulations show that the backward Q-learning based RL algorithm outperforms the well-known Q-learning and Sarsa algorithms.
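As background, the two standard tabular updates contrasted in this abstract are sketched below in Python. These are the textbook Q-learning and Sarsa rules, not the paper's backward Q-learning tuning, whose details are not given in the abstract.

```python
import numpy as np

# Q is a (n_states, n_actions) table of action values.

def q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Off-policy target: bootstrap on the greedy value of the next state.
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])

def sarsa_step(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy target: bootstrap on the action actually taken next.
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
```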

4.
A dynamic channel assignment policy through Q-learning   Cited by: 2 (self-citations: 0, other citations: 2)
One of the fundamental issues in the operation of a mobile communication system is the assignment of channels to cells and to calls. This paper presents a novel approach to the dynamic channel assignment (DCA) problem that uses a form of real-time reinforcement learning known as Q-learning in conjunction with a neural network representation. Instead of relying on a known teacher, the system is designed to learn an optimal channel assignment policy by directly interacting with the mobile communication environment. The performance of the Q-learning based DCA was examined through extensive simulation studies on a 49-cell mobile communication system under various conditions. Comparative studies with the fixed channel assignment (FCA) scheme and one of the best dynamic channel assignment strategies, MAXAVAIL, reveal that the proposed approach performs better than FCA in various situations and achieves performance similar to that of MAXAVAIL, but with significantly reduced computational complexity.
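To make the idea concrete, the following is a heavily simplified, hypothetical table-based sketch of Q-learning for channel assignment: each cell keeps a value per channel and picks epsilon-greedily among the currently available channels. The real system described above uses a neural-network Q-function over a richer state, so all names and sizes here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N_CELLS, N_CHANNELS = 49, 70           # hypothetical sizes for illustration
Q = np.zeros((N_CELLS, N_CHANNELS))    # value of assigning channel c in cell i

def assign_channel(cell, available, epsilon=0.1):
    """Pick a channel for an incoming call: epsilon-greedy over Q[cell]."""
    if rng.random() < epsilon:
        return int(rng.choice(available))
    avail_q = Q[cell, available]
    return int(available[int(np.argmax(avail_q))])

def update(cell, channel, reward, alpha=0.1):
    # One-step (bandit-style) value update; reward could be +1 for a served
    # call and a penalty for blocking, which is an assumption of this sketch.
    Q[cell, channel] += alpha * (reward - Q[cell, channel])

# Example: cell 3 has channels 2, 5, 9 free.
c = assign_channel(3, [2, 5, 9]); update(3, c, reward=1.0)
```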

5.
6.
Multi-agent systems have become a popular research area in recent years, and Q-learning is one of the best-known and most widely applied reinforcement learning algorithms. Building on the single-agent Q-learning algorithm, a new cooperative learning algorithm is proposed, and based on this algorithm a new architectural model for multi-agent systems is presented. The most distinctive features of this architecture are a knowledge-sharing mechanism, a team-structure concept, and the introduction of a service-provider notion. Finally, simulation experiments demonstrate the advantages of the proposed architecture.

7.
This research focuses on the relationships between sample data characteristics and metamodel performance for different types of metamodeling methods. In this work, four metamodeling methods (the multivariate polynomial, radial basis function, kriging, and Bayesian neural network methods), three sample quality merits (sample size, uniformity, and noise), and four performance evaluation measures (accuracy, confidence, robustness, and efficiency) are considered. Unlike other comparative studies, quantitative rather than qualitative measures are used to evaluate the characteristics of the sample data. In addition, the Bayesian neural network method, which is rarely used in metamodeling and has never been considered in comparative studies, is selected as a metamodeling method and compared with the others. A simple guideline is also developed for selecting candidate metamodeling methods based on sample quality merits and performance requirements.
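As a small illustration of one of the four methods, the sketch below fits a radial basis function metamodel to a noisy sample and reports root-mean-square error on an independent test set, i.e. an accuracy measure as a function of sample size and noise. The test function, sample sizes, and noise level are assumptions for demonstration, not data from the study.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

rng = np.random.default_rng(1)

def true_response(x):                       # hypothetical test function
    return np.sin(3 * x[:, 0]) + x[:, 1] ** 2

# Training sample: size and noise level are the "sample quality merits"
# one would vary in a study like the one described above.
n_train, noise = 40, 0.05
X_train = rng.uniform(-1, 1, size=(n_train, 2))
y_train = true_response(X_train) + noise * rng.standard_normal(n_train)

metamodel = RBFInterpolator(X_train, y_train, kernel='thin_plate_spline')

# Accuracy measure on an independent test set.
X_test = rng.uniform(-1, 1, size=(500, 2))
rmse = np.sqrt(np.mean((metamodel(X_test) - true_response(X_test)) ** 2))
print(f"RBF metamodel RMSE: {rmse:.4f}")
```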

8.
Network congestion has a negative impact on the performance of on-chip networks due to increased packet latency. Many congestion-aware routing algorithms have been developed to alleviate traffic congestion over the network. In this paper, we propose a congestion-aware routing algorithm based on the Q-learning approach for avoiding congested areas in the network. Through the learning method, local and global congestion information is made available to each switch, and this information is updated dynamically whenever a switch receives a packet. However, the Q-learning approach suffers from high area overhead in NoCs because each switch needs a large routing table. To reduce this overhead, we also present a clustering approach that decreases the number of routing tables by a factor of 4. Results show that the proposed approach achieves a significant performance improvement over the traditional Q-learning, C-routing, DBAR, and Dynamic XY algorithms.
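For background, the classic Q-routing update (in the style of Boyan and Littman) shows how a per-switch Q-table can encode congestion: each entry estimates the delivery time to a destination via a given neighbor and is corrected from the neighbor's own best estimate plus the local queueing delay. This is a generic sketch, not the paper's NoC-specific scheme or its table clustering.

```python
def q_routing_update(Q, node, dest, next_hop, q_wait, t_trans, alpha=0.5):
    """Q-routing style update.

    Q[node][dest][next_hop]: estimated delivery time to `dest` via `next_hop`
    (nested dicts, illustrative layout).
    q_wait:  queueing delay currently seen at `node`.
    t_trans: one-hop transmission time.
    """
    # Best remaining-time estimate reported back from the chosen neighbour.
    t_remaining = min(Q[next_hop][dest].values())
    target = q_wait + t_trans + t_remaining
    Q[node][dest][next_hop] += alpha * (target - Q[node][dest][next_hop])
```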

9.
A new Q-learning algorithm based on the Metropolis criterion   Cited by: 4 (self-citations: 0, other citations: 4)
The balance between exploration and exploitation is one of the key problems of action selection in Q-learning. Pure exploitation causes the agent to settle on locally optimal policies quickly, whereas excessive exploration degrades the performance of the Q-learning algorithm even though it may accelerate the learning process and help avoid locally optimal policies. In this paper, finding the optimal policy in Q-learning is cast as a search for the optimal solution of a combinatorial optimization problem. The Metropolis criterion of the simulated annealing algorithm is introduced to balance exploration and exploitation in Q-learning, and the modified algorithm based on this criterion, SA-Q-learning, is presented. Experiments show that SA-Q-learning converges more quickly than Q-learning or Boltzmann exploration, and that the search does not suffer from performance degradation due to excessive exploration.
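A minimal sketch of Metropolis-criterion action selection, the core idea behind SA-Q-learning: propose the greedy action, then accept a randomly drawn alternative with probability exp((Q(s, a_rand) - Q(s, a_greedy)) / T). The cooling schedule for the temperature T is left to the caller; function and variable names are illustrative.

```python
import math, random

def metropolis_action(Q, state, temperature, actions):
    """Metropolis-criterion action selection (SA-Q-learning style sketch).

    Q: dict-of-dicts Q[state][action]; actions: list of candidate actions.
    """
    a_greedy = max(actions, key=lambda a: Q[state][a])
    a_rand = random.choice(actions)
    delta = Q[state][a_rand] - Q[state][a_greedy]
    # Accept the exploratory action if it is no worse, or probabilistically
    # according to the Metropolis criterion at the current temperature.
    if delta >= 0 or random.random() < math.exp(delta / temperature):
        return a_rand
    return a_greedy
```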

10.
This paper studies a multi-goal Q-learning algorithm for cooperative teams. Each member of the cooperative team is simulated by an agent, and in the virtual cooperative team the agents adapt their knowledge according to cooperative principles. The multi-goal Q-learning algorithm addresses multiple learning goals: in the virtual team, agents learn what knowledge to adopt and how much to learn (by choosing a learning radius, interpreted in Section 3.1). Five basic experiments are conducted to validate the multi-goal Q-learning algorithm. It is found that the learning algorithm causes agents to converge to optimal actions, based on the agents' continually updated cognitive maps of how actions influence the learning goals. It is also shown that the learning algorithm benefits the multiple goals. Furthermore, the paper analyzes how sensitive the learning performance is to the parameter values of the learning algorithm.
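One generic way to handle multiple learning goals with Q-learning, shown below as an assumption-laden sketch rather than the paper's team algorithm (the learning radius and cooperative rules are not reproduced), is to keep one Q-table per goal, update each with its own reward signal, and scalarize the tables when choosing an action.

```python
import numpy as np

def multi_goal_action(q_tables, goal_weights, state):
    """Pick the action maximising a weighted sum of per-goal Q-values."""
    combined = sum(w * Q[state] for w, Q in zip(goal_weights, q_tables))
    return int(np.argmax(combined))

def multi_goal_update(q_tables, rewards, s, a, s_next, alpha=0.1, gamma=0.95):
    # One ordinary Q-learning update per goal, each with its own reward.
    for Q, r in zip(q_tables, rewards):
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
```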

11.
We propose two algorithms for Q-learning that use the two-timescale stochastic approximation methodology. The first updates the Q-values of all feasible state-action pairs at each instant, while the second updates the Q-values of states with actions chosen according to the 'current' randomized policy updates. A proof of convergence of the algorithms is given. Finally, numerical experiments applying the proposed algorithms to routing in communication networks are presented for a few different settings.

12.
A bounded rationality game model based on Q-learning and its application   Cited by: 1 (self-citations: 0, other citations: 1)
Traditional game-theoretic models are built on the assumption of perfect rationality and are therefore difficult to match with reality, whereas bounded-rationality games describe practical problems well. Boundedly rational players entering a game of incomplete information gradually adapt to and learn the rules, the structure, the opponents, and other game information, so the game should be modeled as a dynamically evolving process. To address this, an incomplete-information game model based on the Q-learning algorithm is proposed. Following Littman's minimax principle, a probability distribution for strategy selection under a multi-criteria framework is established; a mathematical model integrating Q-learning and game theory is constructed, with the Q-learning mechanism driving the dynamic evolution of the game model. Finally, the model is applied to a simulation of a two-player pursuit scenario, and the results show that the proposed model reproduces the pursuit situation well.
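For intuition about the minimax principle referenced above, here is a simplified sketch that restricts Littman's rule to pure strategies: in a given state, choose the action whose worst-case Q-value over the opponent's actions is largest. The full minimax-Q solution instead solves a linear program over mixed strategies; this pure-strategy version is only an illustrative assumption.

```python
import numpy as np

def maximin_action(Q_s):
    """Security-strategy choice for one state of a two-player zero-sum game.

    Q_s: (n_my_actions, n_opponent_actions) array of Q-values.
    Returns the row action with the best worst-case value.
    """
    return int(np.argmax(Q_s.min(axis=1)))

# Example payoff-like Q matrix for one state.
print(maximin_action(np.array([[1.0, -2.0], [0.5, 0.2]])))  # -> 1
```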

13.
Design optimization of a layered plate bonding process is conducted with uncertainties in the manufacturing process taken into account, in order to reduce crack failures arising from the difference in thermal expansion coefficients of the adherends. Robust optimization is performed to minimize the mean and variance of the residual stress, which is the major cause of failure, while constraining the distortion and the instantaneous maximum stress to allowable limits. In this optimization, the dimension reduction (DR) method is employed to quantify the uncertainty of the responses in the bonding process; the DR method is expected to benefit the optimization in terms of efficiency, accuracy, and simplicity. The response surface method (RSM) combined with a sequential approximate optimization (SAO) technique is employed as the optimization tool. The obtained robust optimal solution is verified by Monte Carlo simulation.
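The usual robust objective behind such a formulation is a weighted sum of the mean and standard deviation of the response under uncertainty. The sketch below estimates that objective by plain Monte Carlo, i.e. the verification route mentioned in the abstract rather than the cheaper DR method; the response function, noise distribution, and weight k are purely hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

def residual_stress(x, xi):
    """Hypothetical response: design variables x, noise variables xi."""
    return (x[0] - 1.0) ** 2 + 0.5 * x[1] + xi @ np.array([0.3, 0.1])

def robust_objective(x, n_mc=10_000, k=3.0):
    """Monte Carlo estimate of mean + k * std of the response."""
    xi = rng.normal(0.0, 1.0, size=(n_mc, 2))      # assumed noise distribution
    y = np.array([residual_stress(x, xi_i) for xi_i in xi])
    return y.mean() + k * y.std()

print(robust_objective(np.array([1.2, 0.4])))
```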

14.
Because wireless channels are shared, contention among nodes is unavoidable, and in traditional p-persistent carrier sense multiple access (CSMA) the transmission probability strongly affects throughput [1]. The authors design a multi-state reinforcement learning (RL) method and propose three learning types within a multi-state Q-learning model, defining the transmission probability Q as the node's learning policy. Nodes have no prior information about the network and learn the optimal policy using only historical sensing information, including the number of collisions and the successful transmission rate [2-3]. Comprehensive simulations are then used to compare the performance of the Q-learning models under different state definitions.
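A stateless, heavily simplified sketch of the idea of learning a transmission probability from collision and success counts is given below; the paper's multi-state formulation additionally conditions on sensing history, so the probability grid, reward shape, and names here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
P_LEVELS = np.linspace(0.05, 0.95, 10)   # candidate transmission probabilities
Q = np.zeros(len(P_LEVELS))              # one value per probability level

def choose_p(epsilon=0.1):
    """Epsilon-greedy choice of a transmission-probability level."""
    if rng.random() < epsilon:
        return int(rng.integers(len(P_LEVELS)))
    return int(np.argmax(Q))

def update(i, successes, collisions, alpha=0.1):
    # Reward trades successful transmissions against collisions observed
    # while operating at P_LEVELS[i] (an assumed reward definition).
    reward = successes - collisions
    Q[i] += alpha * (reward - Q[i])
```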

15.
This work describes a novel algorithm that integrates an adaptive resonance method (ARM), i.e. an ART-based algorithm with a self-organized design, and a Q-learning algorithm. By dynamically adjusting the size of each neuron's sensitivity region and adaptively eliminating redundant neurons, the ARM can preserve resources, i.e. available neurons, to accommodate additional categories. As a dynamic-programming-based reinforcement learning method, Q-learning uses the learned action-value function Q, which directly approximates Q*, the optimal action-value function, independently of the policy followed. In the proposed algorithm, the ARM acts as a clusterer that classifies input vectors from the outside world; the clustered results are then passed to the Q-learning design, which learns how to apply the optimal actions to the outside world. Simulation results on the well-known control problem of balancing an inverted pendulum on a cart demonstrate the effectiveness of the proposed algorithm.

16.
A modular robot can be built with a shape and function that match the working environment. We developed a four-arm modular robot system that can be configured in a planar structure, with a learning mechanism incorporated in each module constituting the robot. We aim to control the overall shape of the robot through the accumulation of autonomous actions resulting from the individual learning functions. Since the overall shape of a modular robot depends on the learning conditions in each module, this control method can be treated as a dispersion control learning method. The learning objective is cooperative motion between adjacent modules, and the learning process proceeds by trial and error based on Q-learning. We confirmed the effectiveness of the proposed technique by computer simulation.

17.
Production systems continuously deteriorate with age and usage due to corrosion, fatigue, and cumulative wear in production processes, resulting in an increasing probability of producing defective products. To prevent selling defective products, inspection is usually carried out to ensure that the performance of a sold product satisfies customer requirements; nevertheless, some defective products may still be sold in practice. In such cases, warranties are essential in marketing products and can improve an unfavorable image by assuring higher product quality and better customer service. The purpose of this paper is to provide manufacturers with an effective inspection strategy in which quality management accounts for the related costs of production, sampling, inventory, and warranty. A Weibull power law process is used to describe the imperfection of the production system, and negative binomial sampling is adopted to learn the operational states of the production process. A free replacement warranty policy is assumed, and the reworking of defective products before shipment is also discussed. A numerical application demonstrates the usefulness of the proposed approach, and sensitivity analyses are performed to study the effects of some influential factors.
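As a rough worked example of the power-law process mentioned above: for intensity lambda(t) = (beta/theta) * (t/theta)^(beta-1), the expected number of failures by age t is (t/theta)^beta, and a crude free-replacement warranty cost over a warranty length w is the replacement cost times that expectation. This ignores the production, sampling, and inventory terms of the paper's model; parameters and names below are assumptions.

```python
def expected_failures(t, beta, theta):
    """Expected failure count by age t under a power-law (Weibull) process."""
    return (t / theta) ** beta

def expected_warranty_cost(w, beta, theta, cost_per_replacement):
    """Rough free-replacement-warranty cost over warranty length w,
    assuming each failure in (0, w] triggers one replacement."""
    return cost_per_replacement * expected_failures(w, beta, theta)

print(expected_warranty_cost(w=1.0, beta=1.8, theta=2.5, cost_per_replacement=50.0))
```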

18.
In this article, we examine the learning performance of various strategies under different conditions using reward-based Voronoi Q-value elements (VQEs) in a single-agent environment to decide how to act in a given state. To test our hypotheses, we performed computational experiments in several situations, such as various rotation angles of VQEs arranged in a lattice structure, various rotation angles of an agent's four-action set, and a random arrangement of VQEs, in order to correctly evaluate the optimal Q-values for state-action pairs when dealing with continuous-valued inputs. The results show that the learning performance changes with the relative orientation between the VQEs and the agent's actions.
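The underlying idea, attaching tabular Q-values to a set of center points so that a continuous state is handled by the Voronoi cell (nearest center) it falls in, can be sketched generically as below. The class, KD-tree lookup, and parameters are illustrative assumptions; the rotation and arrangement experiments of the article are not modeled.

```python
import numpy as np
from scipy.spatial import KDTree

class VoronoiQTable:
    """Tabular Q-values attached to centre points: a continuous state is
    mapped to its nearest centre, i.e. the Voronoi cell containing it."""

    def __init__(self, centres, n_actions):
        self.tree = KDTree(centres)
        self.Q = np.zeros((len(centres), n_actions))

    def cell(self, state):
        _, idx = self.tree.query(state)   # index of the nearest centre
        return int(idx)

    def update(self, s, a, r, s_next, alpha=0.1, gamma=0.95):
        i, j = self.cell(s), self.cell(s_next)
        self.Q[i, a] += alpha * (r + gamma * np.max(self.Q[j]) - self.Q[i, a])
```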

19.
The medium access control (MAC) protocol coordinates idle-channel access for all cognitive users and is one of the key technologies enabling quality of service (QoS) in cognitive ad-hoc networks. Building on the binary exponential backoff algorithm, a multi-agent Q-learning MAC algorithm supporting service differentiation is proposed. It adjusts the transmission probability in real time so that the system's channel access service is optimized. A Markov chain model of transmission-probability adjustment is established, the relationship between a packet's transmission probability and the protocol parameters is derived, a service-differentiated channel throughput model is given, and a multi-agent Q-learning algorithm that learns the MAC protocol parameters is constructed. Experimental results show that the algorithm satisfies the QoS requirements of high-priority traffic and outperforms the IEEE 802.11e EDCA mechanism in throughput and delay.

20.
We present work on a six-legged walking machine that uses a hierarchical version of Q-learning (HQL) [C.J.C.H. Watkins, Learning from Delayed Rewards, Ph.D. Thesis, Psychology Department, Cambridge University, 1989] to learn both the elementary swing and stance movements of individual legs and the overall coordination scheme for performing forward movements. The architecture consists of a hierarchy of local controllers implemented in layers. The lowest layer consists of control modules performing elementary actions, such as moving a leg up, down, left, or right, to achieve the elementary swing and stance motions of individual legs. The next level consists of controllers that learn to perform more complex tasks, such as forward movement, by using the previously learned lower-level modules. The work is related to similar, although simulation-based, work on hierarchical reinforcement learning [L.J. Lin, Reinforcement Learning for Robots Using Neural Networks, Ph.D. Thesis, Carnegie Mellon University, 1993] and on compositional Q-learning [S.P. Singh, Learning to Solve Markovian Decision Problems, Ph.D. Thesis, Department of Computer Science, University of Massachusetts, 1994]. We report on the HQL architecture as well as on its implementation on the walking machine Sir Arthur. Results from experiments carried out on the real robot show the applicability of the HQL approach to real-world robot problems.
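A minimal sketch of the hierarchical pattern described above, under the assumption that the low-level controllers have already been learned and can be treated as callable skills: the high level then learns, with ordinary Q-learning, which skill to trigger in which state. The class and interfaces are hypothetical, not the HQL implementation on Sir Arthur.

```python
import numpy as np

class HierarchicalQ:
    """High-level Q-learning over a set of pre-learned low-level skills.

    Each skill is a callable mapping a high-level state to
    (next_state, reward) after the skill finishes executing."""

    def __init__(self, n_states, skills, alpha=0.1, gamma=0.95):
        self.skills = skills
        self.Q = np.zeros((n_states, len(skills)))
        self.alpha, self.gamma = alpha, gamma

    def act(self, state, epsilon=0.1):
        if np.random.random() < epsilon:
            return int(np.random.randint(len(self.skills)))
        return int(np.argmax(self.Q[state]))

    def step(self, state, skill_idx):
        next_state, reward = self.skills[skill_idx](state)
        target = reward + self.gamma * np.max(self.Q[next_state])
        self.Q[state, skill_idx] += self.alpha * (target - self.Q[state, skill_idx])
        return next_state
```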

