Similar Documents
20 similar documents found.
1.
Technical Note: Q-Learning (cited by 6: 0 self-citations, 6 by others)
Q-learning (Watkins, 1989) is a simple way for agents to learn how to act optimally in controlled Markovian domains. It amounts to an incremental method for dynamic programming which imposes limited computational demands. It works by successively improving its evaluations of the quality of particular actions at particular states. This paper presents and proves in detail a convergence theorem for Q-learning based on that outlined in Watkins (1989). We show that Q-learning converges to the optimum action-values with probability 1 so long as all actions are repeatedly sampled in all states and the action-values are represented discretely. We also sketch extensions to the cases of non-discounted, but absorbing, Markov environments, and where many Q-values can be changed each iteration, rather than just one.
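For reference, the update covered by the theorem is the standard one-step rule Q(s,a) ← Q(s,a) + α[r + γ max_b Q(s',b) − Q(s,a)]. A minimal tabular sketch follows; the environment interface (reset/step) is an illustrative assumption, not from the paper:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
    """Minimal tabular Q-learning (Watkins, 1989).

    `env` is assumed to expose reset() -> state and
    step(action) -> (next_state, reward, done); this interface is
    illustrative, not from the paper."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy exploration keeps all actions sampled,
            # the condition required by the convergence theorem
            if rng.random() < epsilon:
                a = rng.integers(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # one-step Q-learning update
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q
```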

2.
This paper studies a multi-goal Q-learning algorithm for cooperative teams. Each member of a cooperative team is simulated by an agent. In the virtual cooperative team, agents adapt their knowledge according to cooperative principles. The multi-goal Q-learning algorithm addresses multiple learning goals simultaneously. In the virtual team, agents learn what knowledge to adopt and how much to learn (by choosing a learning radius, interpreted in Section 3.1). Five basic experiments are conducted to demonstrate the validity of the multi-goal Q-learning algorithm. It is found that the learning algorithm causes agents to converge to optimal actions, based on agents' continually updated cognitive maps of how actions influence learning goals. It is also shown that the learning algorithm benefits the multiple goals. Furthermore, the paper analyzes how sensitive the learning performance is to the parameter values of the learning algorithm.
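The abstract does not give the algorithm's exact form; one common way to realize multi-goal Q-learning is to keep one Q-table per goal and scalarize them when acting. A sketch under that assumption:

```python
import numpy as np

def multigoal_action(q_tables, weights, state, epsilon, rng):
    """Pick an action by scalarizing per-goal Q-values.

    One Q-table per learning goal; `weights` expresses the goal
    priorities. This scalarization scheme is an assumption for
    illustration -- the paper's exact aggregation rule is not
    given in the abstract."""
    if rng.random() < epsilon:
        return rng.integers(q_tables[0].shape[1])
    combined = sum(w * q[state] for w, q in zip(weights, q_tables))
    return int(np.argmax(combined))

def multigoal_update(q_tables, state, action, rewards, next_state,
                     alpha=0.1, gamma=0.9):
    """Update each goal's Q-table from its own reward signal."""
    for q, r in zip(q_tables, rewards):
        q[state, action] += alpha * (
            r + gamma * q[next_state].max() - q[state, action])
```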

3.
Reinforcement learning (RL) has received attention in recent years from agent-based researchers because it deals with the problem of how an autonomous agent can learn to select proper actions for achieving its goals through interacting with its environment. Although there have been several successful examples demonstrating the usefulness of RL, its application to manufacturing systems has not been fully explored yet. This paper investigates the application potential of Q-learning, a widely used RL algorithm, to a dispatching rule selection problem on a single machine, to determine whether it can enable a single-machine agent to learn commonly accepted dispatching rules for three example cases in which the best dispatching rules have previously been defined. The study provides encouraging results that show the potential of RL for application to agent-based production scheduling.
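As a sketch of how such a dispatching agent might be set up (the candidate rules, state encoding, and reward below are assumptions for illustration, not the paper's exact design):

```python
import numpy as np

# Candidate dispatching rules the agent chooses among (assumed set)
RULES = ["FIFO", "SPT", "EDD"]

def discretize_queue(jobs, n_bins=5, max_len=20):
    """Map queue length to a coarse state index (illustrative)."""
    return min(len(jobs) * n_bins // max_len, n_bins - 1)

def select_rule(Q, state, epsilon, rng):
    """Epsilon-greedy choice of a dispatching rule at a decision point."""
    if rng.random() < epsilon:
        return rng.integers(len(RULES))
    return int(np.argmax(Q[state]))

def update(Q, state, rule_idx, reward, next_state, alpha=0.1, gamma=0.9):
    """One-step Q-learning update; the reward could be, e.g., negative
    mean tardiness observed since the last decision (an assumption)."""
    Q[state, rule_idx] += alpha * (
        reward + gamma * Q[next_state].max() - Q[state, rule_idx])
```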

4.
Dynamic camera node selection is a difficult problem in camera network applications. This paper proposes a reinforcement learning-based method for dynamic node selection. A Q-learning algorithm for the node selection policy is designed using a visual-information score as the one-step reward. To accelerate convergence, the Q-value table is initialized from the spatial topology of the cameras, and non-greedy exploration is performed according to a Gibbs distribution. A visual evaluation function is designed from target visibility, orientation, sharpness, and the number of switches, reflecting the richness of the video information and visual comfort. Experimental results show that the proposed dynamic node selection method effectively reflects the target state information in the video, produces smooth switching in its selections, and meets the needs of practical applications.
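Two of the described ingredients are easy to sketch: Gibbs-distribution (Boltzmann) exploration and topology-based Q-table initialization. The neighbor-bonus scheme below is an assumption for illustration:

```python
import numpy as np

def gibbs_action(q_row, temperature, rng):
    """Non-greedy exploration via a Gibbs (Boltzmann) distribution:
    higher-valued camera nodes are chosen more often, but every node
    keeps a nonzero selection probability."""
    prefs = q_row / temperature
    prefs -= prefs.max()                 # numerical stability
    probs = np.exp(prefs)
    probs /= probs.sum()
    return rng.choice(len(q_row), p=probs)

def init_q_from_topology(adjacency, neighbor_bonus=1.0):
    """Warm-start the Q-table from the camera spatial topology:
    switching to an adjacent camera gets a higher initial value.
    The bonus value is an assumption for illustration."""
    return neighbor_bonus * adjacency.astype(float)
```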

5.
This paper deals with a new approach based on Q-learning for solving the problem of mobile robot path planning in complex unknown static environments. As a computational approach to learning through interaction with the environment, reinforcement learning algorithms have been widely used for intelligent robot control, especially in the field of autonomous mobile robots. However, the learning process is slow and cumbersome. For practical applications, rapid rates of convergence are required. Aiming at the problem of slow convergence and long learning time for Q-learning based mobile robot path planning, a state-chain sequential feedback Q-learning algorithm is proposed for quickly searching for the optimal path of mobile robots in complex unknown static environments. The state chain is built during the searching process. After one action is chosen and the reward is received, the Q-values of the state-action pairs on the previously built state chain are sequentially updated with one-step Q-learning. With the increasing number of Q-values updated after one action, the number of actual steps for convergence decreases and, thus, the learning time decreases, where a step is a state transition. Extensive simulations validate the efficiency of the newly proposed approach for mobile robot path planning in complex environments. The results show that the new approach has a high convergence speed and that the robot can find the collision-free optimal path in complex unknown static environments in much less time, compared with the one-step Q-learning algorithm and the Q(λ)-learning algorithm.
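A minimal sketch of the state-chain feedback idea, assuming the chain stores full (state, action, reward, next-state) transitions; the exact update order is not specified in the abstract:

```python
def state_chain_update(Q, chain, alpha=0.2, gamma=0.9):
    """Sequential feedback along the state chain: after the newest
    transition is appended, every state-action pair on the chain is
    refreshed with the one-step Q-learning rule, most recent first,
    so new reward information propagates along the searched path
    immediately.

    Q     : numpy array of shape (n_states, n_actions)
    chain : list of (state, action, reward, next_state) tuples,
            most recent last (this layout is an assumption)
    """
    for s, a, r, s_next in reversed(chain):
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
```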

6.
Organisational abstractions have been presented in recent years as common solutions to regulate open multiagent systems. In particular, the concept of norm is defined at design time to assure the correct behaviour of agents in such systems. However, in many cases, the performance of a system depends not only on the correct behaviour of the agents according to the imposed norms but also on other efficiency measures. To tackle this issue, this paper puts forward a novel mechanism that attempts to persuade agents to act according to the system's preferences. This mechanism relies on incentive policies that aim to induce (not enforce) agents to perform those actions that are more appropriate from the system's point of view. In particular, two different policies are presented: on the one hand, a policy that tries to promote the most appropriate action with regard to the global utility of the system, by assigning a positive incentive to it; on the other hand, a policy that assigns incentives to all actions an agent can choose in a given state, with the aim of persuading the agent to choose a "good" action. In addition, incentives are adapted and defined for each individual agent and contextualised by taking into account the state of the system. This task is carried out through a learning process based on Q-learning. Finally, a P2P file-sharing scenario has been chosen to validate our approach.

7.
In this paper we address the problem of simultaneous learning and coordination in multiagent Markov decision problems (MMDPs) with infinite state spaces. We separate this problem into two distinct subproblems: learning and coordination. To tackle the problem of learning, we survey Q-learning with soft-state aggregation (Q-SSA), a well-known method from the reinforcement learning literature (Singh et al. in Advances in neural information processing systems. MIT Press, Cambridge, vol 7, pp 361–368, 1994). Q-SSA allows the agents in the game to approximate the optimal Q-function, from which the optimal policies can be computed. We establish the convergence of Q-SSA and introduce a new result describing the rate of convergence of this method. In tackling the problem of coordination, we start by pointing out that knowledge of the optimal Q-function is not enough to ensure that all agents adopt a jointly optimal policy. We propose a novel coordination mechanism that, given knowledge of the optimal Q-function for an MMDP, ensures that all agents converge to a jointly optimal policy in every relevant state of the game. This coordination mechanism, approximate biased adaptive play (ABAP), extends biased adaptive play (Wang and Sandholm in Advances in neural information processing systems. MIT Press, Cambridge, vol 15, pp 1571–1578, 2003) to MMDPs with infinite state spaces. Finally, we combine Q-SSA with ABAP, leading to a novel algorithm in which learning of the game and coordination take place simultaneously. We discuss several important properties of this new algorithm and establish its convergence with probability 1. We also provide simple illustrative examples of application.
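A sketch of the Q-SSA approximation, where Q(x,a) is a membership-weighted average of per-cluster values; applying the TD update to all clusters in proportion to P(c|x), rather than to a sampled cluster, is a simplifying assumption:

```python
import numpy as np

def q_ssa_value(theta, membership, x, action):
    """Soft-state aggregation: Q(x, a) = sum_c P(c|x) * theta[c, a],
    where `membership(x)` returns the soft cluster probabilities P(c|x)
    as a numpy vector summing to 1 (interface assumed)."""
    return membership(x) @ theta[:, action]

def q_ssa_update(theta, membership, x, a, r, x_next,
                 alpha=0.05, gamma=0.95):
    """TD update spread over clusters in proportion to P(c|x).
    A sketch of Q-SSA (Singh et al., 1994); the weighted rather than
    sampled-cluster update is a simplifying assumption."""
    p = membership(x)
    q_next = np.max([membership(x_next) @ theta[:, b]
                     for b in range(theta.shape[1])])
    td = r + gamma * q_next - (p @ theta[:, a])
    theta[:, a] += alpha * td * p
```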

8.
沈项军, 常青, 姚银, 查正军. Journal of Software (软件学报), 2015, 26(S2): 218-227
Routing queries for resource location in unstructured peer-to-peer (P2P) networks are a major challenge in P2P research, especially when client nodes frequently join and leave, causing the network structure to change dynamically. This paper proposes a new congestion-control-based routing query method for resource lookup in dynamic networks. The method has two parts. The first is a strategy for grouping network resources and re-connecting nodes: nodes holding the same resources are connected to each other, and the number of connections on each node is periodically adjusted to reduce the load on nodes in the same resource group. Under this strategy, the network topology evolves automatically from a random structure into a clustered network organized by resource group, balancing the query load across resource groups. The second part balances the routing load among nodes within a group through cooperative learning between nodes. Using cooperative Q-learning, the method learns not only parameters such as each node's processing capacity, number of connections, and number of resources, but also incorporates the node's congestion state as an important parameter of the cooperative Q-learning model. With this technique, resource queries within a group are purposefully steered away from congested nodes, ultimately balancing queries across the nodes within a resource group. Simulation results show that, compared with the commonly used random-walk resource lookup method, the proposed method locates resources more quickly. The simulations also show that, compared with random walk, the proposed method is more robust and adaptive under high query intensity and under dynamic joining and leaving of network nodes.

9.
The problem of multi-agent learning and adaptation has attracted a great deal of attention in recent years. It has been suggested that the dynamics of multi-agent learning can be studied using replicator equations from population biology. Most existing studies so far have been limited to discrete strategy spaces with a small number of available actions. In many cases, however, the choices available to agents are better characterized by continuous spectra. This paper suggests a generalization of the replicator framework that allows one to study the adaptive dynamics of Q-learning agents with continuous strategy spaces. Instead of probability vectors, agents' strategies are now characterized by probability measures over continuous variables. As a result, the ordinary differential equations for the discrete case are replaced by a system of coupled integro-differential replicator equations that describe the mutual evolution of individual agent strategies. We derive a set of functional equations describing the steady state of the replicator dynamics, examine their solutions for several two-player games, and confirm our analytical results using simulations.
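For orientation, a plausible continuous-strategy generalization of the Boltzmann Q-learning replicator equations (the paper's exact normalization may differ):

```latex
% Sketch: continuous-strategy replicator dynamics for Boltzmann
% Q-learning, generalizing the discrete form of Tuyls et al.;
% the normalization here is an assumption, not the paper's exact form.
\begin{equation}
  \frac{\partial x(u,t)}{\partial t}
  = x(u,t)\Bigl(
      \beta\Bigl[R(u,t) - \!\int\! R(v,t)\,x(v,t)\,dv\Bigr]
      - \Bigl[\ln x(u,t) - \!\int\! x(v,t)\ln x(v,t)\,dv\Bigr]
    \Bigr),
\end{equation}
```

where x(u,t) is the probability density over continuous strategies u, R(u,t) is the expected payoff of u against the co-player's current strategy measure, and β is the inverse temperature of the Boltzmann action selection. The payoff-difference term is the usual replicator selection pressure; the entropy term plays the role of the mutation-like effect of Q-learning exploration.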

10.
With recent Industry 4.0 developments, companies tend to automate their industries. Warehousing companies also take part in this trend. A shuttle-based storage and retrieval system (SBS/RS) is an automated storage and retrieval system technology experiencing drastic recent market growth. This technology is mostly utilized in large distribution centers processing mini-loads. With the recent increase in e-commerce practices, demand for fast delivery of low-volume orders has increased. An SBS/RS provides ultrahigh-speed load handling because it has a large number of shuttles in the system. However, meeting fast-handling targets depends not only on the physical design of an automated warehousing technology but also on the design of its operational policies. In this work, in an effort to increase the performance of an SBS/RS, we apply a machine learning (ML) (i.e., Q-learning) approach to a newly proposed tier-to-tier SBS/RS design, redesigned from a traditional tier-captive SBS/RS. The novelty of this paper is twofold: first, we propose a novel SBS/RS design in which shuttles can travel between tiers in the system; second, due to the complexity of shuttle operation in that newly proposed design, we implement an ML-based algorithm for transaction selection in that system. The ML-based solution is compared with traditional scheduling approaches: the first-in-first-out and shortest-process-time (i.e., travel) scheduling rules. The results indicate that in most cases, the Q-learning approach performs better than the two static scheduling approaches.

11.
A reinforcement agent for object segmentation in ultrasound images (cited by 1: 0 self-citations, 1 by others)
The principal contribution of this work is a general framework for an intelligent system that extracts one object of interest from ultrasound images. The system is based on reinforcement learning. The input image is divided into several sub-images, and the proposed system finds the appropriate local values for each of them so that it can extract the object of interest. The agent learns from a set of images and their ground-truth (manually segmented) versions. A reward function is employed to measure the similarity between the output and the manually segmented images and to provide feedback to the agent. The information obtained is stored as valuable knowledge in the Q-matrix, which the agent can then use for new input images. The experimental results for prostate segmentation in trans-rectal ultrasound images show the high potential of this approach in the field of ultrasound image segmentation.
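The abstract specifies a similarity-based reward against the ground truth; a Dice-coefficient reward is one natural instantiation (an assumption, not necessarily the paper's exact measure):

```python
import numpy as np

def dice_reward(pred_mask, gt_mask, eps=1e-8):
    """Reward as overlap between the agent's segmentation of a
    sub-image and the manually segmented ground truth. Using the
    Dice coefficient here is an illustrative assumption."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + eps)
```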

12.
This paper proposes a new approach for solving the problem of obstacle avoidance during manipulation tasks performed by redundant manipulators. The developed solution is based on a double neural network that uses the Q-learning reinforcement technique. Q-learning has been applied in robotics for achieving obstacle-free navigation and for solving path-planning problems. Most studies solve inverse kinematics and obstacle avoidance using variations of the classical Jacobian matrix approach, or by minimizing redundancy resolution of manipulators operating in known environments. Researchers who have tried to use neural networks for solving inverse kinematics have often dealt with only one obstacle present in the working field. This paper focuses on calculating inverse kinematics and obstacle avoidance for complex unknown environments, with multiple obstacles in the working field. Q-learning is used together with neural networks in order to plan and execute arm movements at each time instant. The algorithm, developed for general redundant kinematic link chains, has been tested on the particular case of the PowerCube manipulator. Before implementing the solution on the real robot, the simulation was integrated in an immersive virtual environment for better movement analysis and safer testing. The study results show that the proposed approach has a good average speed and a satisfactory target-reaching success rate.

13.
We propose two approximate dynamic programming (ADP)-based strategies for control of nonlinear processes using input-output data. In the first strategy, which we term 'J-learning,' one builds an empirical nonlinear model using closed-loop test data and performs dynamic programming with it to derive an improved control policy. In the second strategy, called 'Q-learning,' one tries to learn an improved control policy in a model-less manner. Compared to the conventional model predictive control approach, the new approach offers some practical advantages in using nonlinear empirical models for process control. Besides the potential reduction in the on-line computational burden, it offers a convenient way to control the degree of model extrapolation in the calculation of optimal control moves. One major difficulty associated with using an empirical model within the multi-step predictive control setting is that the model can be excessively extrapolated into regions of the state space where identification data were scarce or nonexistent, leading to performance far worse than predicted by the model. Within the proposed ADP-based strategies, this problem is handled by imposing a penalty term designed on the basis of the local data distribution. A CSTR example is provided to illustrate the proposed approaches.
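A sketch of how a data-density-based extrapolation penalty might look; the distance-to-neighbors form below is an assumption for illustration, not the paper's exact term:

```python
import numpy as np
from scipy.spatial import cKDTree

class ExtrapolationPenalty:
    """Penalize value estimates in regions where identification data
    were scarce, discouraging the optimizer from extrapolating the
    empirical model. The distance-based form is an assumption; the
    paper designs its own term from the local data distribution."""

    def __init__(self, train_states, k=5, weight=10.0):
        self.tree = cKDTree(train_states)   # identification data
        self.k = k
        self.weight = weight

    def __call__(self, state):
        # average distance to the k nearest training points:
        # large where data were scarce or nonexistent
        dists, _ = self.tree.query(np.atleast_2d(state), k=self.k)
        return self.weight * float(np.mean(dists))

# Usage sketch: penalized one-step cost inside J- or Q-learning,
# e.g. cost = stage_cost(x, u) + penalty(x_next)
```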

14.
In this article, an iterative procedure is proposed for the training process of the probabilistic neural network (PNN). In each stage of this procedure, the Q(0)-learning algorithm is utilized to adapt the PNN smoothing parameter (σ). Four classes of PNN models are considered in this study. In the first, simplest model, the smoothing parameter is a scalar; in the second model, σ is a vector whose elements are computed with respect to the class index; the third model uses a smoothing parameter vector whose components are determined for each input attribute; finally, the last and most complex of the analyzed networks uses a matrix of smoothing parameters in which each element depends on both the class and the input feature index. The main idea of the presented approach is the appropriate update of the smoothing parameter values according to the Q(0)-learning algorithm. The proposed procedure is verified on six repository data sets. The prediction ability of the algorithm is assessed by computing the test accuracy on 10%, 20%, 30%, and 40% of examples drawn randomly from each input data set. The results are compared with the test accuracy obtained by a PNN trained using the conjugate gradient procedure, a support vector machine, a gene expression programming classifier, the k-Means method, a multilayer perceptron, a radial basis function neural network, and a learning vector quantization neural network. It is shown that the presented procedure can be applied to the automatic adaptation of the smoothing parameter of each of the considered PNN models and constitutes an alternative training method. A PNN trained by the Q(0)-learning based approach is a classifier that can be treated as one of the top models in data classification problems.
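A sketch of the idea for the simplest (scalar-σ) model, where Q(0)-learning, i.e., one-step Q-learning, nudges σ and is rewarded by the change in held-out accuracy; treating the tuning task as a single-state problem and the action discretization are assumptions:

```python
import numpy as np

def adapt_sigma(eval_acc, sigma=0.5, steps=200,
                actions=(-0.05, 0.0, +0.05),
                alpha=0.2, epsilon=0.1, seed=0):
    """Q(0)-learning (one-step Q-learning) loop that tunes a scalar
    PNN smoothing parameter. Actions nudge sigma; the reward is the
    change in accuracy on held-out data, eval_acc(sigma) -> float.
    Modeling the task as a single state (so the update is bandit-like)
    is a simplifying assumption."""
    rng = np.random.default_rng(seed)
    Q = np.zeros(len(actions))
    acc = eval_acc(sigma)
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.integers(len(actions))
        else:
            a = int(np.argmax(Q))
        new_sigma = max(sigma + actions[a], 1e-3)   # keep sigma positive
        new_acc = eval_acc(new_sigma)
        Q[a] += alpha * ((new_acc - acc) - Q[a])    # reward = accuracy gain
        sigma, acc = new_sigma, new_acc
    return sigma
```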

15.
Hao Xu, S. Jagannathan, F.L. Lewis. Automatica, 2012, 48(6): 1017-1030
In this paper, the stochastic optimal control of a linear networked control system (NCS) with uncertain system dynamics and network imperfections such as random delays and packet losses is derived. The proposed stochastic optimal control method uses an adaptive estimator (AE) and ideas from Q-learning to solve the infinite-horizon optimal regulation of an unknown NCS with time-varying system matrices. Next, a stochastic suboptimal control scheme, which uses the AE and Q-learning, is introduced for the regulation of an unknown linear time-invariant NCS; it is derived using the certainty-equivalence property. Update laws for online tuning of the unknown parameters of the AE to obtain the Q-function are derived. Lyapunov theory is used to show that all signals are asymptotically stable (AS) and that the estimated control signals converge to the optimal or suboptimal control inputs. Simulation results are included to show the effectiveness of the proposed schemes. The result is an optimal control scheme that operates in a forward-in-time manner for unknown linear systems, in contrast with standard Riccati-equation-based schemes, which function backward in time.
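To make the Q-learning mechanics concrete, here is a generic least-squares sketch of Q-function policy iteration for linear-quadratic regulation (in the spirit of Bradtke-style LQR Q-learning); the paper's actual AE update laws and its delay/packet-loss handling are not reproduced:

```python
import numpy as np

def lqr_q_iteration(data, n, m, gamma=0.95, iters=20):
    """Model-free Q-learning for linear-quadratic regulation: fit the
    quadratic Q-function Q(x,u) = z^T H z, z = [x; u], by least squares
    from observed transitions, then improve the policy from the blocks
    of H. A generic sketch, not the paper's exact update laws.

    data: list of (x, u, cost, x_next) tuples from the running system.
    """
    d = n + m
    K = np.zeros((m, n))                     # current feedback gain
    H = np.eye(d)
    for _ in range(iters):
        rows, targets = [], []
        for x, u, c, x_next in data:
            z = np.concatenate([x, u])
            u_next = -K @ x_next             # evaluate the current policy
            z_next = np.concatenate([x_next, u_next])
            # Bellman equation: z^T H z = c + gamma * z'^T H z'
            phi = (np.outer(z, z) - gamma * np.outer(z_next, z_next)).ravel()
            rows.append(phi)
            targets.append(c)
        h, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
        H = h.reshape(d, d)
        H = 0.5 * (H + H.T)                  # symmetrize
        Hux, Huu = H[n:, :n], H[n:, n:]
        K = np.linalg.solve(Huu, Hux)        # u = -K x improves the policy
    return K, H
```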

16.
In this paper, we investigate the use of hierarchical reinforcement learning (HRL) to speed up the acquisition of cooperative multi-agent tasks. We introduce a hierarchical multi-agent reinforcement learning (RL) framework, and propose a hierarchical multi-agent RL algorithm called Cooperative HRL. In this framework, agents are cooperative and homogeneous (they use the same task decomposition). Learning is decentralized, with each agent learning three interrelated skills: how to perform each individual subtask, the order in which to carry them out, and how to coordinate with other agents. We define cooperative subtasks to be those subtasks in which coordination among agents significantly improves the performance of the overall task. The levels of the hierarchy that include cooperative subtasks are called cooperation levels. A fundamental property of the proposed approach is that it allows agents to learn coordination faster by sharing information at the level of cooperative subtasks, rather than attempting to learn coordination at the level of primitive actions. We study the empirical performance of the Cooperative HRL algorithm using two testbeds: a simulated two-robot trash collection task, and a larger four-agent automated guided vehicle (AGV) scheduling problem. We compare the performance and speed of Cooperative HRL with other learning algorithms, as well as several well-known industrial AGV heuristics. This paper also addresses the issue of rational communication behavior among autonomous agents. The goal is for agents to learn both action and communication policies that together optimize the task given a communication cost. We extend the multi-agent HRL framework to include communication decisions and propose a cooperative multi-agent HRL algorithm called COM-Cooperative HRL. In this algorithm, we add a communication level to the hierarchical decomposition of the problem below each cooperation level. Before an agent makes a decision at a cooperative subtask, it decides whether it is worthwhile to perform a communication action. A communication action has a certain cost and provides the agent with the actions selected by the other agents at a cooperation level. We demonstrate the efficiency of the COM-Cooperative HRL algorithm, as well as the relation between the communication cost and the learned communication policy, using a multi-agent taxi problem.

17.
This paper is concerned with a class of discrete-time linear nonzero-sum games with a partially observable system state. As is known, the optimal control policy for nonzero-sum games relies on full state measurement, which is hard to obtain in a partially observable environment. Moreover, achieving optimal control requires an accurate system model. To overcome these deficiencies, this paper develops a data-driven adaptive dynamic programming method via Q-learning that uses measurable input/output data without any system knowledge. First, a representation of the unmeasurable inner system state is built from historical input/output data. Then, based on this representation state, a Q-function-based policy iteration approach with convergence analysis is introduced to approximate the optimal control policy iteratively. A neural network (NN)-based actor-critic framework is applied to implement the developed data-driven approach. Finally, two simulation examples are provided to demonstrate the effectiveness of the developed approach.
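The representation step admits a simple sketch: stack the last N inputs and outputs into a surrogate state; choosing N at least as large as the system's observability index is an assumption here:

```python
import numpy as np

def io_representation(u_hist, y_hist, N):
    """Reconstruct a usable state from the last N inputs and outputs,
    z_k = [u_{k-1},...,u_{k-N}, y_{k-1},...,y_{k-N}] -- the standard
    trick for Q-learning without state measurement. The horizon N
    (at least the system's observability index) is an assumption."""
    return np.concatenate([np.ravel(u_hist[-N:]), np.ravel(y_hist[-N:])])
```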

18.
In this paper, the optimal strategies for discrete-time linear quadratic zero-sum games related to the H-infinity optimal control problem are solved forward in time without knowing the system dynamics matrices. The idea is to solve for an action-dependent value function Q(x,u,w) of the zero-sum game instead of solving for the state-dependent value function V(x), which satisfies a corresponding game algebraic Riccati equation (GARE). Since the state and action spaces are continuous, two action networks and one critic network are used and adaptively tuned forward in time using adaptive critic methods. The result is a Q-learning approximate dynamic programming (ADP) model-free approach that solves the zero-sum game forward in time. It is shown that the critic converges to the game value function and that the action networks converge to the Nash equilibrium of the game. Proofs of convergence of the algorithm are given. It is proven that the algorithm is, in effect, a model-free iterative algorithm for solving the GARE of the linear quadratic discrete-time zero-sum game. The effectiveness of the method is shown by performing an H-infinity control autopilot design for an F-16 aircraft.
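For reference, the action-dependent value function and the resulting saddle-point policies take the following form (the block partition and notation are assumed, not copied from the paper):

```latex
% Sketch of the action-dependent value function for the zero-sum game,
% with z = [x; u; w]; block indices follow the partition of H.
\begin{equation}
  Q(x,u,w) = \begin{bmatrix} x \\ u \\ w \end{bmatrix}^{\!\top}
  \begin{bmatrix} H_{xx} & H_{xu} & H_{xw} \\
                  H_{ux} & H_{uu} & H_{uw} \\
                  H_{wx} & H_{wu} & H_{ww} \end{bmatrix}
  \begin{bmatrix} x \\ u \\ w \end{bmatrix}.
\end{equation}
% Setting dQ/du = 0 and dQ/dw = 0 yields the saddle-point policies:
\begin{align}
  u &= -\bigl(H_{uu} - H_{uw}H_{ww}^{-1}H_{wu}\bigr)^{-1}
        \bigl(H_{ux} - H_{uw}H_{ww}^{-1}H_{wx}\bigr)\,x,\\
  w &= -\bigl(H_{ww} - H_{wu}H_{uu}^{-1}H_{uw}\bigr)^{-1}
        \bigl(H_{wx} - H_{wu}H_{uu}^{-1}H_{ux}\bigr)\,x.
\end{align}
```

Once H is identified from data (the role of the critic), the Nash equilibrium policies follow without knowledge of the system matrices.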

19.
Reasoning with advanced policy rules and its application to access control (cited by 1: 0 self-citations, 1 by others)
This paper presents a formal framework to represent and manage advanced policy rules, which incorporate the notions of provision and obligation. Provisions are conditions that need to be satisfied or actions that must be performed by a user or an agent before a decision is rendered, while obligations are conditions or actions that must be fulfilled by the user, the agent, or the system itself within a certain period of time after the decision. The paper proposes a specific formalism to express provisions and obligations within a policy and investigates a reasoning mechanism within this framework. A policy decision may be supported by more than one rule-based derivation, each associated with a potentially different set of provisions and obligations (called a global PO set). The reasoning mechanism can derive all the global PO sets for each specific policy decision and facilitates the selection of the best one based on numerical weights assigned to provisions and obligations as well as on semantic relationships among them. The formal results presented in the paper hold for many applications requiring the specification of policies, but the use of the proposed policy framework is illustrated in the security domain only.
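A toy sketch of selecting among global PO sets by numerical weight (the semantic-relationship criterion from the paper is omitted; all names are hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class GlobalPOSet:
    """One derivation's provisions and obligations with numeric
    weights; the field names are hypothetical, for illustration."""
    provisions: dict = field(default_factory=dict)   # name -> weight
    obligations: dict = field(default_factory=dict)  # name -> weight

    def total_weight(self) -> float:
        return sum(self.provisions.values()) + sum(self.obligations.values())

def best_po_set(derivations):
    """Among all global PO sets supporting a policy decision, pick the
    least burdensome one by total weight. The paper additionally uses
    semantic relationships among provisions and obligations; that part
    is omitted from this sketch."""
    return min(derivations, key=GlobalPOSet.total_weight)
```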

20.

An agent society of the future is envisioned to be as complex as a human society. Just like human societies, such multiagent systems (MAS) deserve an in-depth study of the dynamics, relationships, and interactions of the constituent agents. An agent in a MAS may have only approximate a priori estimates of the trustworthiness of another agent. But it can learn from interactions with other agents, resulting in more accurate models of these agents and their dependencies, together with the influences of other environmental factors. Such models are proposed to be represented as Bayesian or belief networks. An objective mechanism is presented to enable an agent to elicit crucial information from the environment regarding the true nature of the other agents. This mechanism allows the modeling agent to choose actions that will produce a guaranteed minimal improvement of the model accuracy. The working of the proposed maximum entropy procedure is demonstrated in a multiagent scenario.
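One concrete reading of the information-eliciting mechanism is to pick the action with the largest expected entropy reduction over the belief about the other agent; the `predict` interface below is a hypothetical assumption, not from the paper:

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete belief (in nats)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def most_informative_action(belief, actions, predict):
    """Pick the action whose expected observation most reduces the
    entropy of the belief over another agent's type -- one concrete
    reading of 'eliciting crucial information'. predict(belief, a)
    returns (probability, posterior_belief) pairs; this interface is
    a hypothetical assumption."""
    h0 = entropy(belief)
    def expected_gain(a):
        return h0 - sum(p * entropy(post) for p, post in predict(belief, a))
    return max(actions, key=expected_gain)
```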
