
1.

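An Adaptive Fuzzy Actor-Critic Learning Method  Total citations: 1, self-citations: 0, citations by others: 1

王雪松 程玉虎 易建强 《控制与决策》2006,21(9):1068-1072

Proposes an adaptive fuzzy Actor-Critic learning method based on a fuzzy RBF network. A single fuzzy RBF neural network approximates both the Actor's action function and the Critic's value function, addressing the "curse of dimensionality" that tends to arise when generalizing over the state space. The fuzzy RBF network adapts its structure and parameters to changes in the environment state and in the characteristics of the controlled plant, which makes the network more compact; the resulting fuzzy Actor-Critic learning generalizes well, has a simple control structure, and learns efficiently. Simulation results on MountainCar verify the effectiveness of the proposed method.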

2.

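Continuous-Space Q-Learning Based on a Self-Organizing Fuzzy RBF Network

程玉虎 王雪松 易建强 孙伟 《信息与控制》2008,37(1):1-1

To address reinforcement learning control in continuous spaces, a Q-learning method based on a self-organizing fuzzy RBF network is proposed. The network takes the state as input and outputs a continuous action together with its Q-value, thereby realizing a "continuous state to continuous action" mapping. First, the continuous action space is discretized into a fixed number of discrete actions, and a fully greedy policy selects the discrete action with the largest Q-value as the local winning action of each fuzzy rule. Then a command fusion mechanism weights the winning discrete actions by their utility values to obtain the continuous action actually applied to the system. In addition, to simplify the network structure and speed up learning, an improved RAN algorithm and gradient descent are used to adapt the network's structure and parameters online, respectively. Simulation results on inverted pendulum balancing control verify the effectiveness of the proposed Q-learning method.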
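
The greedy-winner and command-fusion steps in this abstract can be illustrated with a minimal Python sketch. The abstract does not define the "utility value" used for weighting, so the sketch assumes the weights are the normalized firing strengths of the fuzzy rules; the function name `fuse_continuous_action` and its arguments are hypothetical and not taken from the paper.

```python
import numpy as np

def fuse_continuous_action(firing_strengths, q_values, discrete_actions):
    """Blend each rule's greedy discrete action into one continuous action.

    firing_strengths: shape (n_rules,), non-negative rule activations
                      (assumed here to play the role of the fusion weights).
    q_values:         shape (n_rules, n_actions), per-rule Q-value of each
                      discrete action.
    discrete_actions: shape (n_actions,), the discretized action values.
    """
    phi = np.asarray(firing_strengths, dtype=float)
    q = np.asarray(q_values, dtype=float)
    actions = np.asarray(discrete_actions, dtype=float)

    # Fully greedy choice: each rule picks its own highest-valued action.
    winners = np.argmax(q, axis=1)

    # Assumption: fusion weights are the normalized firing strengths.
    weights = phi / phi.sum()

    # Weighted blend of the winning actions and of their Q-values.
    action = float(weights @ actions[winners])
    q_action = float(weights @ q[np.arange(len(phi)), winners])
    return action, q_action
```

With two rules firing at strengths 1 and 3, the second rule's winner dominates the blended action, so the output varies smoothly as the firing strengths change even though each rule only chooses among discrete actions.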

3.

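Adaptive PID Control Based on Actor-Critic Learning

陈学松 杨宜民 《控制理论与应用》2011,28(8):1187-1192

To overcome the inability of conventional PID controllers to self-tune their parameters online, an adaptive PID controller structure and learning algorithm based on Actor-Critic (AC) learning is proposed. The controller uses AC learning to tune the PID parameters adaptively, with a single radial basis function network approximating both the Actor's policy function and the Critic's value function. The inputs to the radial basis function network are the system error and its first and second differences. The Actor maps the system state to the PID parameters, while the Critic evaluates the Actor's output and generates the temporal difference (TD) error signal. Based on the AC learning architecture and the TD error performance index, a flowchart of the controller design steps is given. Two simulation experiments show that the AC-learning-based PID controller outperforms the conventional PID controller in response speed and adaptability.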
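
As a rough illustration of the quantities named in this abstract, the following Python sketch builds the three-component state (error, first difference, second difference), applies a PID update, and computes a TD error. The abstract does not give the control law, so the common incremental PID form is assumed; the function names and the discount factor are hypothetical. In the paper the gains would come from the Actor, whereas here they are plain arguments.

```python
def pid_state(e_t, e_prev, e_prev2):
    """Return (error, first difference, second difference) of the error."""
    d1 = e_t - e_prev
    d2 = e_t - 2.0 * e_prev + e_prev2
    return (e_t, d1, d2)

def incremental_pid(u_prev, state, gains):
    """Incremental PID law (assumed form): u(t) = u(t-1) + Kp*d1 + Ki*e + Kd*d2."""
    e, d1, d2 = state
    kp, ki, kd = gains
    return u_prev + kp * d1 + ki * e + kd * d2

def td_error(reward, value, next_value, gamma=0.95):
    """One-step temporal difference error: r + gamma * V(s') - V(s)."""
    return reward + gamma * next_value - value
```

A positive TD error indicates that the outcome was better than the Critic predicted, which is the signal an Actor-Critic scheme uses to reinforce the gains that were just applied.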

4.

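Multi-Agent Reinforcement Learning Based on Attentional Message Sharing

臧嵘 王莉 史腾飞 《计算机应用》2022,42(11):3346-3353

Communication is an important means for agents to cooperate effectively in non-omniscient environments, and when the number of agents is large, the communication process produces redundant messages. To handle communication messages effectively, a multi-agent reinforcement learning algorithm based on attentional message sharing, AMSAC, is proposed. First, a message sharing network for effective communication is built among the agents, and agents share information by reading and writing messages, which addresses the lack of communication among agents in non-omniscient scenarios with complex tasks. Second, within the message sharing network, an attentional message sharing mechanism processes communication messages adaptively and weights messages from different agents differently, which addresses the inability of larger multi-agent systems to identify and exploit messages effectively during communication. Third, in the centralized Critic network, the Native Critic updates the Actor network parameters according to the temporal difference (TD) advantage policy gradient, so that agents' action values are evaluated effectively. Finally, during execution, each agent's distributed Actor network makes decisions from its own observations and the information in the message sharing network. Experiments in the StarCraft II Multi-Agent Challenge (SMAC) environment show that, compared with multi-agent reinforcement learning methods such as naive Actor-Critic (Native AC) and Game Abstraction Communication (GA-Comm), AMSAC improves the average win rate by 4 to 32 percentage points across four different scenarios. AMSAC's attentional message sharing mechanism offers a reasonable approach to handling inter-agent communication messages in multi-agent systems and has broad application prospects in traffic hub control and unmanned aerial vehicle coordination.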

5.

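Research on Autonomous Driving Policy Based on Deep Recurrent Reinforcement Learning

李志航 《工业控制计算机》2020,(4):61-63

A method for learning an autonomous driving policy model based on deep recurrent reinforcement learning is proposed and validated by simulation in the TORCS virtual driving engine. To address overestimation and slow updates in the Actor-Critic framework, the method incorporates clipped double DQN and mitigates overestimation by taking the minimum of the value estimates. To obtain state inputs from multiple time steps and help the agent make better decisions, a recurrent neural network is incorporated, and an Actor policy network and a Critic evaluation network containing LSTM structures are designed. Simulation experiments on the TORCS platform show that the proposed algorithm improves training efficiency compared with the conventional DDPG algorithm.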
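
The "take the minimum estimate" idea mentioned in this abstract can be sketched as a bootstrapped target built from two critics. This is a generic clipped double-Q target, not the paper's exact implementation; the function name, the terminal-state mask, and the default discount factor are assumptions made for illustration.

```python
import numpy as np

def clipped_double_q_target(rewards, dones, q1_next, q2_next, gamma=0.99):
    """TD target that bootstraps from the smaller of two critic estimates.

    rewards, dones, q1_next, q2_next: arrays of shape (batch,).
    dones is 1.0 where the episode ended (no bootstrapping), else 0.0.
    """
    rewards = np.asarray(rewards, dtype=float)
    dones = np.asarray(dones, dtype=float)
    # Element-wise minimum counteracts the upward bias of a single critic.
    q_min = np.minimum(np.asarray(q1_next, dtype=float),
                       np.asarray(q2_next, dtype=float))
    return rewards + gamma * (1.0 - dones) * q_min
```

Both critics would then be regressed toward this shared target, so an overestimate by either one alone cannot inflate the bootstrapped value.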

6.

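A Fuzzy Reinforcement Learning Algorithm and Its Application in RoboCup  Total citations: 1, self-citations: 0, citations by others: 1

高建清 王浩 于磊 方宝富 《计算机工程与应用》2006,42(6):52-54

Traditional reinforcement learning algorithms can only solve learning problems with discrete state and action spaces. This paper proposes a fuzzy reinforcement learning algorithm that maps a continuous state space to a continuous action space through a fuzzy inference system and then obtains a complete rule base through learning. The rule base provides prior knowledge for the Agent's action selection and makes dynamic programming possible. The authors validated the algorithm in the RoboCup environment, where it optimized the ball-kicking strategy.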

7.

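A Study of a DFS-Based Agent Reinforcement Learning Strategy

刘升贵 朱旦晨 《计算机与现代化》2010,(12):25-26,29

Discusses an Agent reinforcement learning strategy based on dynamic fuzzy sets (DFS), and introduces the goal of Agent reinforcement learning, the state value function and action value function, the optimization of Markov decision processes, and the learning strategy.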

8.

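Application of Reinforcement Learning to Basic Action Learning for Soccer Robots  Total citations: 1, self-citations: 0, citations by others: 1

段勇 杨淮清 崔宝侠 徐心和 《机器人》2008,30(5):1

Studies reinforcement learning algorithms and their application to learning technical actions in robot soccer. When the state and action spaces of reinforcement learning are too large or the variables are continuous, learning is often too slow or even fails to converge. To address this problem, a reinforcement learning method based on a T-S model fuzzy neural network is proposed, which effectively realizes the mapping from the state space to the action space. The proposed method is then used to design the technical actions of a soccer robot, and robot behavior learning is studied without expert knowledge or an environment model. Finally, experiments demonstrate the effectiveness of the method and show that it meets the needs of robot soccer matches.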

9.

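Competitive Takagi-Sugeno Fuzzy Reinforcement Learning  Total citations: 4, self-citations: 0, citations by others: 4

晏雄伟 邓志东 孙增圻 《自动化学报》2002,28(6):873-880

For complex learning tasks in continuous spaces, a competitive Takagi-Sugeno fuzzy reinforcement learning network (CTSFRLN) is proposed. The network structure integrates a Takagi-Sugeno fuzzy inference system with reinforcement learning based on action-based evaluation value functions. Two learning algorithms are proposed accordingly, the competitive Takagi-Sugeno fuzzy Q-learning algorithm and the competitive Takagi-Sugeno fuzzy advantage learning algorithm, which train the CTSFRLN into a so-called Takagi-Sugeno fuzzy variable-structure controller. Taking a double inverted pendulum control system as an example, simulation studies show that the proposed learning algorithms outperform other reinforcement learning algorithms.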

10.

Reinforcement learning for parameter estimation in statistical spoken dialogue systems

Filip Jurčíček Blaise Thomson Steve Young 《Computer Speech and Language》2012,26(3):168-192

Reinforcement techniques have been successfully used to maximise the expected cumulative reward of statistical dialogue systems. Typically, reinforcement learning is used to estimate the parameters of a dialogue policy which selects the system's responses based on the inferred dialogue state. However, the inference of the dialogue state itself depends on a dialogue model which describes the expected behaviour of a user when interacting with the system. Ideally the parameters of this dialogue model should be also optimised to maximise the expected cumulative reward. This article presents two novel reinforcement algorithms for learning the parameters of a dialogue model. First, the Natural Belief Critic algorithm is designed to optimise the model parameters while the policy is kept fixed. This algorithm is suitable, for example, in systems using a handcrafted policy, perhaps prescribed by other design considerations. Second, the Natural Actor and Belief Critic algorithm jointly optimises both the model and the policy parameters. The algorithms are evaluated on a statistical dialogue system modelled as a Partially Observable Markov Decision Process in a tourist information domain. The evaluation is performed with a user simulator and with real users. The experiments indicate that model parameters estimated to maximise the expected reward function provide improved performance compared to the baseline handcrafted parameters.

11.

Continuous-action reinforcement learning with fast policy search and adaptive basis function selection

Xin Xu Chunming Liu Dewen Hu 《Soft Computing - A Fusion of Foundations, Methodologies and Applications》2011,15(6):1055-1070

As an important approach to solving complex sequential decision problems, reinforcement learning (RL) has been widely studied in the community of artificial intelligence and machine learning. However, the generalization ability of RL is still an open problem and it is difficult for existing RL algorithms to solve Markov decision problems (MDPs) with both continuous state and action spaces. In this paper, a novel RL approach with fast policy search and adaptive basis function selection, which is called Continuous-action Approximate Policy Iteration (CAPI), is proposed for RL in MDPs with both continuous state and action spaces. In CAPI, based on the value functions estimated by temporal-difference learning, a fast policy search technique is suggested to search for optimal actions in continuous spaces, which is computationally efficient and easy to implement. To improve the generalization ability and learning efficiency of CAPI, two adaptive basis function selection methods are developed so that sparse approximation of value functions can be obtained efficiently both for linear function approximators and kernel machines. Simulation results on benchmark learning control tasks with continuous state and action spaces show that the proposed approach not only can converge to a near-optimal policy in a few iterations but also can obtain comparable or even better performance than Sarsa-learning, and previous approximate policy iteration methods such as LSPI and KLSPI.

12.

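A Fuzzy Neural Network Controller Based on a Reinforcement Algorithm with Automatic Rule Generation

吴耿锋 傅忠谦 《控制理论与应用》2001,18(2):241-244

Presents RBFNNC (reinforcements based fuzzy neural network controller), a fuzzy neural network controller that is based on a reinforcement algorithm and can generate control rules automatically. The controller generates fuzzy control rules automatically through reinforcement learning according to the state of the controlled plant. Simulation experiments applying RBFNNC to a cart-pole balancing system show that the system structure and the reinforcement learning algorithm are effective.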

13.

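A Continuous-Space Actor-Critic Learning Method Based on a Union Neural Network

杨金鸿 皇甫立 谭斌 熊璋 《智能安全》2022,1(2):19-25

In complex continuous-space application scenarios, classical discrete-space reinforcement learning methods can no longer meet practical needs, while existing continuous-space reinforcement learning methods mainly use linear fitting to approximate the state value function and the action selection function, which limits accuracy. A nonlinear actor-critic approach based on a union neural network (UNN-AC) is proposed. The method represents the action selection function and the critic value function as a single union neural network model and uses it to fit the state value function and the action selection probabilities nonlinearly. Compared with existing linear fitting methods, the nonlinear UNN-AC improves the fitting accuracy of the critic value function and the action selection function. Experimental results show that the UNN-AC algorithm can effectively find near-optimal policies in continuous spaces. Compared with classical continuous action space algorithms, it converges faster and is more stable.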

14.

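Heuristic Q-Learning Based on Laplacian Eigenmaps

朱美强 李明 程玉虎 张倩 王雪松 《控制与决策》2014,29(3):425-430

In goal-based reinforcement learning tasks, Euclidean distance is often used as the heuristic function for policy selection, but it performs poorly on tasks whose state space is not continuous in Euclidean space. To address this problem, Laplacian eigenmaps, a manifold learning method with relatively low computational complexity, is introduced, and a heuristic policy selection method based on spectral graph theory is proposed. The proposed method applies to tasks whose state space is continuous on a manifold whose intrinsic dimension is easy to estimate and in which the connections between neighboring states form an undirected graph. Simulation results in a grid world verify the effectiveness of the proposed method.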
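
To make the spectral idea in this abstract concrete, here is a minimal Python sketch that embeds the states of an undirected graph with the low-order eigenvectors of its Laplacian and uses distance in that embedding as a heuristic toward the goal. It uses the unnormalized Laplacian L = D - A for simplicity, which is an assumption (the paper may use a different normalization), and it assumes the state graph is connected. The function names are hypothetical.

```python
import numpy as np

def laplacian_embedding(adjacency, n_dims):
    """Embed graph nodes using eigenvectors of the unnormalized Laplacian.

    adjacency: symmetric (n, n) matrix of an undirected, connected graph.
    Returns an (n, n_dims) array of node coordinates.
    """
    a = np.asarray(adjacency, dtype=float)
    laplacian = np.diag(a.sum(axis=1)) - a
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    _, vecs = np.linalg.eigh(laplacian)
    # Skip the first (constant) eigenvector, which carries no geometry.
    return vecs[:, 1:n_dims + 1]

def heuristic_to_goal(embedding, goal):
    """Distance from every state to the goal state in the embedded space."""
    return np.linalg.norm(embedding - embedding[goal], axis=1)
```

On a chain of states the heuristic shrinks steadily toward the goal. In a grid world with walls, two cells on opposite sides of a wall are not connected in the graph, so they would typically end up farther apart in the embedding than their Euclidean distance suggests.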


15.

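Q-Learning Based on Collaborative Least Squares Support Vector Machines  Total citations: 5, self-citations: 0, citations by others: 5

王雪松 田西兰 程玉虎 易建强 《自动化学报》2009,35(2):214-219

To address the slow convergence of reinforcement learning systems, a Q-learning method based on collaborative least squares support vector machines is proposed for continuous state spaces and discrete action spaces. The Q-learning system consists of a least squares support vector regression machine (LS-SVRM) and a least squares support vector classification machine (LS-SVCM). The LS-SVRM approximates the mapping from state-action pairs to the value function, while the LS-SVCM approximates the mapping from the continuous state space to the discrete action space and provides the LS-SVRM with real-time, dynamic knowledge or advice (suggested action values) to promote learning of the value function. Simulation results on minimum-time mountain car control show that, compared with a Q-learning system based on a single LS-SVRM, the method speeds up learning convergence and achieves better learning performance.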

16.

Supervised fuzzy reinforcement learning for robot navigation

《Applied Soft Computing》2016

This paper addresses a new method for combination of supervised learning and reinforcement learning (RL). Applying supervised learning in robot navigation encounters serious challenges such as inconsistent and noisy data, difficulty for gathering training data, and high error in training data. RL capabilities such as training only by one evaluation scalar signal, and high degree of exploration have encouraged researchers to use RL in robot navigation problem. However, RL algorithms are time consuming as well as suffer from high failure rate in the training phase. Here, we propose Supervised Fuzzy Sarsa Learning (SFSL) as a novel idea for utilizing advantages of both supervised and reinforcement learning algorithms. A zero order Takagi–Sugeno fuzzy controller with some candidate actions for each rule is considered as the main module of robot's controller. The aim of training is to find the best action for each fuzzy rule. In the first step, a human supervisor drives an E-puck robot within the environment and the training data are gathered. In the second step as a hard tuning, the training data are used for initializing the value (worth) of each candidate action in the fuzzy rules. Afterwards, the fuzzy Sarsa learning module, as a critic-only based fuzzy reinforcement learner, fine tunes the parameters of conclusion parts of the fuzzy controller online. The proposed algorithm is used for driving E-puck robot in the environment with obstacles. The experiment results show that the proposed approach decreases the learning time and the number of failures; also it improves the quality of the robot's motion in the testing environments.

17.

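A New Fuzzy Reinforcement Learning Algorithm Based on Ant Colony Optimization

谢光强 陈学松 《计算机应用研究》2011,28(4):1266-1268

Fuzzy Sarsa learning (FSL) is a fuzzy reinforcement learning algorithm derived from Sarsa learning. It approximates the action-value function on-policy, and within each fuzzy rule the next action is selected according to the Softmax formula. For complex learning tasks in continuous spaces, FSL does not balance exploration and exploitation well. This paper therefore proposes a new fuzzy reinforcement learning algorithm based on ant colony optimization (ACO-FSL), whose main contribution is to combine the idea of ant colony optimization (ACO) with the traditional fuzzy reinforcement learning algorithm. The design principle, method, and concrete steps of the algorithm are given, and simulation experiments on the mountain car problem show that ACO-FSL outperforms FSL in learning speed and stability.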
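
The Softmax action selection that this abstract attributes to FSL can be sketched as follows. This is the generic Boltzmann rule applied to one rule's candidate action values, not the ACO-based variant the paper proposes; the temperature parameter and the function names are assumptions for illustration.

```python
import numpy as np

def softmax_probabilities(action_values, temperature=1.0):
    """Boltzmann distribution over candidate actions.

    Higher temperature gives more exploration, lower gives more exploitation.
    """
    z = np.asarray(action_values, dtype=float) / temperature
    z = z - z.max()          # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

def select_action(action_values, temperature=1.0, rng=None):
    """Sample the index of one candidate action from the softmax distribution."""
    rng = np.random.default_rng() if rng is None else rng
    p = softmax_probabilities(action_values, temperature)
    return int(rng.choice(len(p), p=p))
```

The single temperature parameter is the only knob for trading exploration against exploitation here, which gives some intuition for why the abstract describes the balance as hard to get right on complex continuous tasks.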

18.

Self-learning fuzzy logic controllers for pursuit-evasion differential games

Sameh F. Desouky Howard M. Schwartz 《Robotics and Autonomous Systems》2011,59(1):22-33

This paper addresses the problem of tuning the input and the output parameters of a fuzzy logic controller. The system learns autonomously without supervision or a priori training data. Two novel techniques are proposed. The first technique combines Q(λ)-learning with function approximation (fuzzy inference system) to tune the parameters of a fuzzy logic controller operating in continuous state and action spaces. The second technique combines Q(λ)-learning with genetic algorithms to tune the parameters of a fuzzy logic controller in the discrete state and action spaces. The proposed techniques are applied to different pursuit-evasion differential games. The proposed techniques are compared with the classical control strategy, Q(λ)-learning only, reward-based genetic algorithms learning, and with the technique proposed by Dai et al. (2005) [19] in which a neural network is used as a function approximation for Q-learning. Computer simulations show the usefulness of the proposed techniques.

19.

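Multi-Agent Robot Reinforcement Learning with an Adaptive Fuzzy RBF Neural Network  Total citations: 3, self-citations: 0, citations by others: 3

张文志 李智军 吕恬生 罗青 《计算机工程与应用》2003,39(32):111-115

When learning in a multi-robot environment, the robots face continuous states and continuous actions, and several robots are present, so the learning space is enormous and applying the Q-learning algorithm directly rarely yields satisfactory results. For the learning problem of multi-agent robot systems, this paper proposes a reinforcement learning algorithm based on an adaptive fuzzy RBF neural network. The network itself has fuzzy inference capability, strong function approximation capability, and generalization capability; it therefore combines human expert knowledge with machine learning, reduces the complexity of the learning problem, and enables policy learning over continuous state and action spaces.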

20.

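Incremental Nearest-Neighbor Temporal Difference Learning in Continuous Spaces  Total citations: 1, self-citations: 1, citations by others: 0

张春元 朱清新 钟声 《控制与决策》2014,29(12):2121-2128

For reinforcement learning in continuous spaces, an incremental nearest-neighbor temporal difference (TD) learning framework based on locally weighted learning is proposed. A subset of observed states is selected online and incrementally to build an instance dictionary; the value function and policy of a newly observed state are approximated from its range-nearest-neighbor instances; and the TD algorithm iteratively updates the value function and eligibility trace of each instance in the dictionary. Several design options are given for each main component of the framework, and their convergence is analyzed theoretically. Simulation experiments on 24 combinations of these options show that the SNDN combination offers good learning performance and computational efficiency.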