
1.

李学勇欧阳柳波李国徽《南华大学学报(理工版)》2004,18(2):10-16

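When traditional reinforcement learning algorithms are applied to Markov decision process problems with large state and action spaces and complex tasks, they suffer from slow convergence and long training times. Effectively learning and exploiting the bias information contained in a problem can speed up learning and improve learning efficiency. Building on an analysis of the characteristics of bias mechanisms, the paper introduces the concept of implicit bias information, establishes a reinforcement learning model based on learning bias information, and proposes an improved feature-based SARSA(λ) algorithm. Experiments on a box-pushing task show that the improved algorithm markedly improves learning efficiency.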

2.

Multi-step R-learning algorithm

胡光华吴沧浦《北京理工大学学报(英文版)》1999,8(3):245-250

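Objective: To discuss reinforcement learning algorithms for controlled Markov chains under the average-reward criterion: without prior knowledge of the state transition matrix or the reward function, find by trial and error the optimal control policy that maximizes the long-run expected average reward per stage. Methods: Combining the one-step learning algorithm for average-reward problems with the temporal difference learning algorithm, a multi-step reinforcement learning algorithm, the R(λ) learning algorithm, is proposed. Results and conclusion: The new algorithm contains the existing R-learning algorithm as the special case λ=0. It is also a natural extension of the discounted-reward Q(λ) learning algorithm to average-reward problems. Simulation results show that R(λ) learning with intermediate values of λ clearly outperforms one-step R-learning.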
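As a rough illustration of the one-step case (λ=0) mentioned in the abstract above, the following sketch shows one common textbook form of the R-learning update. The function name, the dictionary representation, and the step sizes `alpha` and `beta` are assumptions made for this example; it is not the R(λ) algorithm of the paper, which additionally uses multi-step returns.

```python
def r_learning_step(R, rho, s, a, r, s_next, actions, alpha=0.1, beta=0.01):
    """One-step R-learning update under the average-reward criterion.

    R maps (state, action) to a relative action value and is modified
    in place; rho is the running estimate of the average reward per
    stage. Returns the updated rho.
    """
    best_here = max(R.get((s, b), 0.0) for b in actions)
    best_next = max(R.get((s_next, b), 0.0) for b in actions)
    was_greedy = R.get((s, a), 0.0) == best_here
    # TD error measured relative to the average reward, not discounted.
    delta = r - rho + best_next - R.get((s, a), 0.0)
    R[(s, a)] = R.get((s, a), 0.0) + alpha * delta
    if was_greedy:
        # rho is refreshed only after greedy actions, so exploratory
        # moves do not bias the average-reward estimate.
        rho += beta * (r - rho + best_next - best_here)
    return rho


# Toy check: one state, one action, reward 1 at every stage.
R, rho = {}, 0.0
for _ in range(2000):
    rho = r_learning_step(R, rho, s=0, a=0, r=1.0, s_next=0, actions=[0])
```

In this toy loop the estimate `rho` approaches the true average reward of 1 per stage.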

3.

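A reinforcement learning algorithm for finite-horizon Markov decision processes Total citations: 4; self-citations: 0; citations by others: 4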

李春贵刘永信《广西工学院学报》2003,14(1):1-4

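Studies reinforcement learning algorithms for finite-horizon nonstationary Markov decision processes. By introducing an artificial absorbing state, the finite-horizon problem is converted into an infinite-horizon problem, so that the usual reinforcement learning methods can be used to solve it. Building on the algorithmic idea proposed in reference [3], a new reinforcement learning algorithm for finite-horizon nonstationary Markov decision processes is proposed, and experiments are carried out on an inventory control problem without a complete model.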
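The absorbing-state construction described in the abstract above can be sketched as a small wrapper. This is only an illustration under assumptions: `step(t, s, a)` is a hypothetical user-supplied transition function, the label `ABSORBING` is invented for the example, and the paper's own construction may differ in detail.

```python
ABSORBING = "absorbing"  # hypothetical label for the artificial state


def with_absorbing_state(step, horizon):
    """Wrap a finite-horizon, time-dependent transition function.

    `step(t, s, a)` is assumed to return (reward, next_state) for
    stage t. The wrapper works on augmented states (t, s), which makes
    nonstationary dynamics stationary, and sends every trajectory to
    ABSORBING once `horizon` stages have been played.
    """
    def augmented_step(state, action):
        if state == ABSORBING:
            return 0.0, ABSORBING      # zero reward, never leaves
        t, s = state
        reward, s_next = step(t, s, action)
        if t + 1 >= horizon:
            return reward, ABSORBING   # last stage: jump to the sink
        return reward, (t + 1, s_next)
    return augmented_step


# Toy usage: the reward equals the stage index, the state adds the action.
toy = with_absorbing_state(lambda t, s, a: (float(t), s + a), horizon=2)
```

Because the absorbing state yields zero reward forever, the infinite-horizon return of the wrapped process equals the finite-horizon return of the original one.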

4.

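Learning control of traffic signals at a single intersection based on the SARSA(λ) algorithm Total citations: 1; self-citations: 0; citations by others: 1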

李春贵阳树洪王萌张增芳《广西工学院学报》2008,19(2):10-14

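For complex urban traffic systems that are hard to model, the multi-step reinforcement learning algorithm SARSA(λ) is applied to traffic signal control; decisions are made dynamically from real-time traffic state information, adapting automatically to the environment to achieve better control. Because the state space is too large to store and represent directly, a radial basis function neural network is used for value function approximation; training its adaptive nonlinear processing units yields a good approximate representation, which solves the traffic signal control problem at a single four-way intersection. Simulation experiments show that the control performance is clearly better than that of the traditional fixed-time control strategy.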
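To make the idea of radial basis function value approximation in the abstract above concrete, here is a minimal sketch with fixed Gaussian centers and a linear read-out. The function names, the fixed centers, and the shared `width` are assumptions for illustration; the paper trains adaptive units, which this sketch does not attempt.

```python
import math


def rbf_features(state, centers, width):
    """Gaussian radial basis activations of a state vector."""
    feats = []
    for c in centers:
        sq_dist = sum((x - ci) ** 2 for x, ci in zip(state, c))
        feats.append(math.exp(-sq_dist / (2.0 * width ** 2)))
    return feats


def q_value(weights, feats):
    """Linear read-out: the value is a weighted sum of activations."""
    return sum(w * f for w, f in zip(weights, feats))


def td_update(weights, feats, target, alpha=0.1):
    """Gradient step moving one action's weights toward a TD target."""
    error = target - q_value(weights, feats)
    return [w + alpha * error * f for w, f in zip(weights, feats)]


# Toy usage: two centers in a 2-D state space, repeated updates to target 1.
f = rbf_features((0.0, 0.0), [(0.0, 0.0), (1.0, 0.0)], 1.0)
w = [0.0, 0.0]
for _ in range(200):
    w = td_update(w, f, 1.0)
```

Only a handful of weights are stored regardless of how many distinct states occur, which is the point of using an approximator when the state space is too large for a table.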

5.

Multi-agent cooperation based on reinforcement learning

陈雪江杨东勇《浙江工业大学学报》2004,32(5):516-520

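Reinforcement learning based on Markov processes, as an online learning method, works well in single-agent environments. However, owing to limitations of reinforcement learning theory, the Markov process model no longer applies in multi-agent systems, so reinforcement learning cannot be applied directly to cooperative multi-agent learning problems. This paper proposes a two-layer reinforcement learning method for multi-agent cooperation. The method is implemented mainly by building two layers of reinforcement learning units in each agent. The first-layer unit learns the agents' joint task cooperation policy; the second-layer unit learns the action policy that is most effective from the agent's own point of view. The method is applied to a computer simulation in which three agents cooperate to lift a circular object, and the results show that agents using the proposed method cooperate better than agents using traditional reinforcement learning.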

6.

A brief discussion of the "25-step multi-cycle" study method

奚迪超《重庆通信学院学报》1993,(3):41-43

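This paper introduces the concept, process, and content requirements of the "25-step multi-cycle" study method. The method is suitable for enrolled students at all levels and of all types, including undergraduate, junior college, secondary technical, and noncommissioned officer students.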

7.

Application of reinforcement learning to mobile robot navigation Total citations: 1; self-citations: 0; citations by others: 1

陆军徐莉周小平《哈尔滨工程大学学报》2004,25(2):176-179

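Path planning is one of the key problems for intelligent robots; it comprises global path planning and local path planning. Local path planning is the harder part, and when the environment is complex it is difficult to obtain good results. Here reinforcement learning is applied to local path planning for an autonomous robot, to achieve path planning in complex unknown environments. To overcome shortcomings of the standard Q-learning algorithm such as slow convergence, the multi-step on-policy SARSA(λ) reinforcement learning algorithm is adopted, and its concrete application to local path planning is discussed. The reinforcement learning system is implemented with a CMAC neural network, giving a CMAC-based SARSA(λ) algorithm. A method for switching between two networks, one for path planning and one for wall following, is proposed, which successfully solves the local path planning problem for an autonomous robot in environments with complex obstacles. Simulation results demonstrate the effectiveness of the algorithm; compared with traditional methods it has stronger learning and adaptation abilities.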
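For readers unfamiliar with the multi-step on-policy SARSA(λ) algorithm named in the abstract above, the sketch below shows a plain tabular version with accumulating eligibility traces. It is an assumption-laden illustration: the paper uses a CMAC neural network rather than a table, and `env_step`, `corridor`, and all parameter values are hypothetical.

```python
import random
from collections import defaultdict


def sarsa_lambda(env_step, start, actions, episodes=200, alpha=0.1,
                 gamma=0.95, lam=0.8, epsilon=0.1, max_steps=1000, seed=0):
    """Tabular SARSA(lambda) with accumulating eligibility traces.

    `env_step(s, a)` is assumed to return (reward, next_state, done).
    """
    rng = random.Random(seed)
    Q = defaultdict(float)

    def choose(s):
        # Epsilon-greedy; ties go to the first action listed.
        if rng.random() < epsilon:
            return rng.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        traces = defaultdict(float)
        s, a = start, choose(start)
        done, steps = False, 0
        while not done and steps < max_steps:
            r, s2, done = env_step(s, a)
            a2 = choose(s2)
            target = r if done else r + gamma * Q[(s2, a2)]
            delta = target - Q[(s, a)]
            traces[(s, a)] += 1.0
            # Every recently visited pair shares the TD error,
            # weighted by its decaying trace.
            for key in list(traces):
                Q[key] += alpha * delta * traces[key]
                traces[key] *= gamma * lam
            s, a = s2, a2
            steps += 1
    return Q


def corridor(s, a):
    """Hypothetical 5-cell corridor: start 0, goal 4, reward 1 at goal."""
    s2 = min(max(s + a, 0), 4)
    return (1.0 if s2 == 4 else 0.0), s2, s2 == 4


Q = sarsa_lambda(corridor, start=0, actions=(1, -1))
```

The traces are what make the method multi-step: a reward at the goal updates every state-action pair along the path in one pass, instead of propagating back one step per episode.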

8.

Reinforcement learning: principles, algorithms, and applications

黄炳强曹广益王占全《河北工业大学学报》2006,35(6):34-38

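Reinforcement learning (RL) developed from animal learning theory. It requires no prior knowledge, acquires knowledge through continual interaction with the environment, selects actions autonomously, and thus has the capacity for autonomous learning; it has received wide attention in behavior learning for autonomous robots. This paper surveys the basic principles of reinforcement learning and its various algorithms, including TD algorithms, Q-learning, and R-learning, and finally introduces applications of reinforcement learning and current research topics in multi-robot systems.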

9.

A forgetting algorithm for average asymptotic temporal difference learning based on eligibility traces

殷苌茗王汉兴陈焕文谢丽娟《电力科学与技术学报》2003,18(4):12-16

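An agent solves its decision problem by learning optimal decisions. In reinforcement learning, an agent improves its own behavior by interacting with its environment. The Markov decision process (MDP) model is a general framework for solving reinforcement learning problems, and temporal difference TD(λ) is a policy-dependent algorithm for learning value functions under the MDP model. In general, the agent must remember all of its value function values, and when the state space is very large the amount of memory required is enormous. To solve this problem, a forgetting algorithm is given that introduces the forgetting principle from psychology into reinforcement learning. With the forgetting algorithm, reinforcement learning problems for agents in large state spaces can be solved.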

10.

A survey of reinforcement learning in partially observable Markov environments

谢丽娟陈焕文《电力科学与技术学报》2002,17(2):23-27

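Surveys reinforcement learning techniques for agent learning and planning in uncertain environments. It first introduces partially observable Markov decision processes (POMDPs), the theory used to describe hidden-state problems; after a brief review of other POMDP solution techniques, it focuses on reinforcement learning techniques for the case where the environment model is not known in advance, which fall into two classes: state-based value function learning and direct search in policy space. Finally it analyzes the remaining problems of these methods and points out possible directions for future research.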

11.

A new accelerating algorithm for multi-agent reinforcement learning

张汝波仲宇顾国昌《哈尔滨工业大学学报(英文版)》2005,12(1):48-51

In multi-agent systems, joint actions must be employed to achieve cooperation because the evaluation of an agent's behavior often depends on the other agents' behaviors. However, joint-action reinforcement learning algorithms suffer from slow convergence because of the enormous learning space produced by joint actions. In this article, a prediction-based reinforcement learning algorithm is presented for multi-agent cooperation tasks, which requires all agents to learn to predict the probabilities of the actions that other agents may execute. A multi-robot cooperation experiment is run to test the efficacy of the new algorithm, and the experimental results show that the new algorithm reaches the cooperation policy much faster than the primitive reinforcement learning algorithm.

12.

A reinforcement-learning-based spectrum management algorithm for heterogeneous wireless networks

张文柱邵丽娜《西安电子科技大学学报(自然科学版)》2011,38(4):32-37

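Proposes an adaptive heuristic critic reinforcement learning algorithm based on normalized radial basis functions for autonomous dynamic spectrum allocation in heterogeneous wireless network systems. The algorithm uses normalized radial basis functions to construct the state space adaptively, which speeds up learning, and uses the adaptive heuristic critic mechanism to reduce unnecessary exploration, which improves learning efficiency. By interacting with the wireless environment, the algorithm learns to dynamically allocate suitable frequency bands to the sessions in different access networks. Simulation results show that, under the same network conditions, the algorithm achieves better spectrum utilization and quality of service, outperforming deterministic spectrum allocation strategies and general dynamic spectrum allocation strategies.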

13.

A reinforcement-learning-based cooperative control method for variable lanes at multiple intersections

徐小高夏莹杰朱思雨邝砾《浙江大学学报(工学版)》2022,56(5):987

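To address the inability of traditional variable guidance lane control methods to adapt to complex traffic flows in multi-intersection scenarios, a cooperative control method for variable guidance lanes at multiple intersections, based on multi-agent reinforcement learning, is proposed to relieve congestion at multiple intersections. The method improves the multi-agent reinforcement learning algorithm QMIX: to handle global reward allocation in the variable guidance lane setting, the global reward is decomposed into a basic reward and a performance reward, which improves the accuracy of lane-direction change decisions under congestion. A prioritized experience replay algorithm is introduced to make better use of the transition sequences in the replay buffer and to accelerate convergence. Experimental results show that the proposed method outperforms other control methods on metrics such as queue length, delay time, and waiting time, effectively coordinates the policy switching of variable guidance lanes, and improves the capacity of the road network across multiple intersections.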
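The prioritized experience replay mentioned in the abstract above can be illustrated with a generic proportional scheme, in which a transition is sampled with probability proportional to its priority raised to a power. The class name, the exponent `alpha`, and the list-based storage are assumptions for this sketch; the abstract does not say which variant the authors use, and importance-sampling corrections are omitted here.

```python
import random


class PrioritizedReplay:
    """Minimal proportional prioritized replay buffer (illustrative)."""

    def __init__(self, capacity, alpha=0.6, seed=0):
        self.capacity = capacity
        self.alpha = alpha
        self.items = []
        self.priorities = []
        self.rng = random.Random(seed)

    def add(self, transition):
        # New transitions get the current maximum priority so that
        # they are likely to be replayed at least once.
        p = max(self.priorities, default=1.0)
        if len(self.items) >= self.capacity:
            self.items.pop(0)        # drop the oldest transition
            self.priorities.pop(0)
        self.items.append(transition)
        self.priorities.append(p)

    def probabilities(self):
        scaled = [p ** self.alpha for p in self.priorities]
        total = sum(scaled)
        return [x / total for x in scaled]

    def sample(self, k):
        idx = self.rng.choices(range(len(self.items)),
                               weights=self.probabilities(), k=k)
        return idx, [self.items[i] for i in idx]

    def update(self, indices, td_errors, eps=1e-3):
        # Priority is the absolute TD error plus a small constant,
        # so no transition ends up with zero sampling probability.
        for i, err in zip(indices, td_errors):
            self.priorities[i] = abs(err) + eps


buf = PrioritizedReplay(capacity=3)
for t in ("a", "b", "c"):
    buf.add(t)
buf.update([0, 1, 2], [4.0, 1.0, 0.0])
probs = buf.probabilities()
```

After the update, transition "a" (largest TD error) has the highest sampling probability and "c" the lowest, which is how the buffer focuses training on the most informative transitions.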

14.

Pass-ball trainning based on genetic reinforcement learning

褚海涛洪炳熔《哈尔滨工业大学学报(英文版)》2001,8(3)

0 INTRODUCTION Establishing a independence robot who learn to carry out task depending on visual information has become a primarily challenge of artificial intelligence. Recently, as a kind of robot learning approach that need no transcendental knowledge and has high response and adaptatio…

15.

A crowd evacuation guidance method using deep reinforcement learning with a combinatorial action space

薛怡然吴锐刘家锋《哈尔滨工业大学学报》2021,53(8):29-38

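A crowd evacuation guidance system can effectively protect lives and reduce casualties and property losses when a disaster occurs inside a building. Existing crowd evacuation guidance systems require manually designed models and input parameters, which is labor-intensive and error-prone. To address this, the paper proposes an end-to-end intelligent evacuation guidance method based on deep reinforcement learning and designs a simulated interactive environment for the reinforcement learning agent based on the social force model, so that the agent, taking only scene images as input and interacting with the simulated environment by trial and error, can autonomously...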

16.

Aircraft reinforcement learning multi-mode control in orbit

ZHANG Ying WEI Minfeng WANG Shihui TAO Leiyan CAO Jian ZHANG Xing 《西安电子科技大学学报(自然科学版)》1996,47(2):75-82

To improve the long-term in-orbit flight reliability of the aircraft control system, a multi-mode control scheme based on reinforcement learning is proposed. The system includes a sensor module, a control module, and an execution module. The sensor module feeds the sensitive flight data of the aircraft to the control module in real time. This data is divided into multidimensional structured floating-point data with historical relevance, which can be used directly for aircraft control, and the unique physical representation quantity of a particular sensor. The control module is divided into an input layer, a feature extraction layer, and a fully connected layer. The execution module receives the driving data from the control module in real time, which includes the optimal state value for decision-making and the action output value for evaluation. The system decides which specific execution modules to use based on the optimal return value for decision-making, and the output value of a selected execution module depends on the action output value used for evaluation. The system enables the aircraft to complete long-term in-orbit operation in the multi-mode input and output state with a fast response of 15 ms and a performance per watt of 5.23 GOP/s/W.

17.

A special hierarchical fuzzy neural-networks based reinforcement learning for multi-variables system

张文志吕恬生《哈尔滨工业大学学报(英文版)》2005,12(6):661-666

Proposes a reinforcement learning scheme based on a special Hierarchical Fuzzy Neural Network (HFNN) for solving complicated learning tasks in a continuous multi-variable environment. The output of the previous layer in the HFNN is no longer used as the if-part of the next layer, but only in the then-part. The scheme can thus cope with cases where the output of the previous layer is meaningless or its meaning is uncertain. The proposed HFNN has a minimal number of fuzzy rules, avoids the combinatorial explosion of rules, and reduces the computation and memory requirements. In the learning process, two HFNNs with the same structure perform fuzzy action composition and evaluation function approximation simultaneously, and the parameters of the neural networks are tuned and updated online using a gradient descent algorithm. Simulation of a double inverted pendulum system shows that the reinforcement learning method is correct and feasible.

18.

Multiagent reinforcement learning through merging individually learned value functions

张化祥黄上腾《哈尔滨工业大学学报(英文版)》2005,12(3):346-350

In cooperative multiagent systems, learning the optimal policies of multiple agents is very difficult. As the numbers of states and actions increase exponentially with the number of agents, their action policies become more intractable. By learning these value functions, an agent can learn its optimal action policies for a task. If a task can be decomposed into several subtasks and the agents have learned the optimal value functions for each subtask, this knowledge can help the agents learn the optimal action policies for the whole task when they act simultaneously. By merging the agents' independently learned optimal value functions, a novel multiagent online reinforcement learning algorithm, LU-Q, is proposed. By applying a transformation to the individually learned value functions, the constraints on the optimal value functions of each subtask are loosened. In each learning iteration of LU-Q, the agents' joint action set in a state is processed. Some actions of that state are pruned from the available action set according to the multiagent value function defined in LU-Q. As the available action set of each state is gradually reduced during the iterations of LU-Q, the convergence of the value functions is accelerated. The effectiveness, soundness, and convergence of LU-Q are analyzed, and the experimental results show that the learning performance of LU-Q is better than that of standard Q-learning.

19.

Application of reinforcement learning and neural network in robot navigation

孟伟洪炳熔《哈尔滨工业大学学报(英文版)》2001,8(3)

0 INTRODUCTION Path planning is one of the most important problems in robot navigation. The path planning of the mobile robot is classified into two categories: global path planning based on prior knowledge about environment and local path planning based on unstructured environment. This pape…