
1.

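朱斐 刘全 傅启明 伏玉琛 《计算机研究与发展》2014,(3)

Solving problems with continuous action spaces is a current research focus and a known difficulty in reinforcement learning. To handle such problems, traditional reinforcement learning algorithms usually use prior information to discretize the continuous action space and then solve for the optimal policy. In many practical applications, however, the prior information needed for discretization is unavailable, so the algorithms perform poorly or fail altogether. To address this, a least-squares actor-critic algorithm (LSAC) is proposed. It represents the value function and the policy with function approximators, uses the least-squares method to solve online and dynamically for the parameters of the approximate value function and the approximate policy, and uses the approximate value function as the critic that guides the solution of the approximate policy parameters. LSAC is applied to two classic continuous-action problems, cart-pole balancing and mountain car, and compared with the Cacla (continuous actor-critic learning automaton) and eNAC (episodic natural actor-critic) algorithms. The results show that LSAC solves continuous-action-space problems effectively and achieves comparatively good execution performance.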

2.

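Continuous-Action Actor-Critic Learning Based on Kernel Methods

陈兴国 高阳 范顺国 俞亚君 《模式识别与人工智能》2014,(2):103-110

Reinforcement learning algorithms often have to handle continuous state and continuous action spaces to achieve precise control. Combining the strengths of the actor-critic method in handling continuous action spaces with those of kernel methods in handling continuous state spaces, this paper proposes a kernel-based continuous-action actor-critic learning algorithm (KCACL). In this algorithm, the actor updates action probabilities according to the reward-inaction principle, and the critic learns the state value function with a kernel-based online selective temporal-difference algorithm. Comparative experiments verify the effectiveness of the algorithm.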

3.

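Multi-Step Sarsa Control Algorithm Based on Radial Basis Function Neural Networks

司彦娜 普杰信 于晓升 司鹏举 孙力帆 《控制与决策》2023,38(4):944-950

For model-free nonlinear systems with continuous state spaces, a multi-step reinforcement learning control algorithm based on a radial basis function (RBF) neural network is proposed. First, a neural network is introduced into the reinforcement learning system, and the function-approximation capability of the RBF network is used to approximate the state-action value function, which solves the problem of representing a continuous state space. Next, an eligibility-trace mechanism is incorporated to form a multi-step Sarsa algorithm, which improves learning efficiency by keeping a record of visited states. Finally, the softmax policy is improved by decaying its temperature parameter, which optimizes the action-selection probabilities and balances exploration against exploitation. Simulation experiments on the MountainCar task show that the proposed algorithm achieves effective control of a continuous nonlinear system in the model-free setting after only a small amount of training. Compared with the single-step algorithm, it needs fewer steps on average to converge and behaves more stably, which indicates that combining nonlinear value function approximation with multi-step algorithms can also perform well in control tasks.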
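
The temperature-decayed softmax policy mentioned in the abstract above can be illustrated with a minimal Python sketch. The exponential decay schedule, the floor `t_min`, and all function names here are assumptions made for illustration; the abstract does not state which schedule the paper uses.

```python
import math
import random


def softmax_probs(q_values, temperature):
    """Boltzmann (softmax) action probabilities for a list of Q-values."""
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(q_values)
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def decayed_temperature(t0, decay, episode, t_min=0.05):
    """Hypothetical schedule: exponential decay with a lower bound t_min."""
    return max(t_min, t0 * decay ** episode)


def select_action(q_values, temperature, rng=random):
    """Sample an action index according to the softmax probabilities."""
    probs = softmax_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]
```

With a high temperature the probabilities are close to uniform (more exploration); as the temperature decays, the distribution concentrates on the action with the largest Q-value (more exploitation).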

4.

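Research on Reinforcement Learning Algorithms  Total citations: 2; self-citations: 0; citations by others: 2

刘忠 李海红 刘全 《计算机工程与设计》2008,29(22)

To address the obstacle-avoidance problem that is common in the motion of intelligent agents, this paper draws on characteristics of reinforcement learning: learning, through trial and error and interaction with the environment, a policy for selecting actions in a given state, and online learning without a teacher. After introducing the principles, taxonomy and main algorithms of reinforcement learning (TD(λ), Q-learning, Dyna, Prioritized Sweeping, Sarsa), it analyses the TD(λ) and Q-learning algorithms and applies them in experiments. The experimental results show that reinforcement learning algorithms such as TD(λ) and Q-learning can solve obstacle avoidance and similar problems efficiently under different conditions.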

5.

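A Parametric Approximate Policy Iteration Algorithm Based on Gaussian Processes

傅启明 刘全 伏玉琛 周谊成 于俊 《软件学报》2013,24(11):2676-2686

In large-scale or continuous state spaces, combining function approximation with reinforcement learning is a current research focus in machine learning; at the same time, how to balance exploration and exploitation during learning remains a difficult research problem in reinforcement learning. To address the exploration-exploitation balance in deterministic environments with large-scale or continuous state spaces, an approximate policy iteration algorithm based on Gaussian processes is proposed. The algorithm models the parametric value function with a Gaussian process and, combined with a generative model, obtains the posterior distribution of the value function through Bayesian inference. During learning, it computes the value-of-information gain of each action from the probability distribution of the value function and selects actions by combining this gain with the expected value of the value function. To some extent, the algorithm can resolve the exploration-exploitation trade-off and accelerate convergence. Applied to the classic Mountain Car problem, the algorithm converges quickly and to good accuracy in the experiments.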

6.

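Incomplete Time Series Prediction Based on a TG–LSTM Neural Network

陈中林 杨翠丽 乔俊飞 《控制理论与应用》2022,39(5):867-878

To address the low accuracy of traditional models when predicting incomplete time series with missing data, a long short-term memory neural network with a time gate (TG–LSTM) is proposed, exploiting the strong temporal modelling capability of LSTM networks. First, a TG–LSTM unit structure is proposed that can estimate input values online and predict output values in real time simultaneously. Second, a forward-propagation algorithm for the network is designed on the basis of the TG–LSTM structure so that input imputation and output prediction proceed together. Then, a learning algorithm for the TG–LSTM network is established to train the input-imputation and output-prediction tasks jointly. Finally, experiments on the Mackey-Glass benchmark dataset, a monthly mean temperature dataset, and effluent ammonia-nitrogen prediction in wastewater treatment show that, compared with traditional methods, the TG–LSTM model imputes and predicts incomplete time series with higher accuracy.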

7.

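Spatial Crowdsourcing Task Allocation Strategy Based on Deep Reinforcement Learning

倪志伟 刘浩 朱旭辉 赵杨 冉家敏 《模式识别与人工智能》2021,34(3):191-205

Dynamic online task allocation strategies struggle to learn effectively from historical data and ignore the effect of current decisions on future returns. To address these problems, a spatial crowdsourcing task allocation strategy based on deep reinforcement learning is proposed. First, with maximizing the long-term cumulative return as the optimization objective, the problem is modelled as a Markov decision process from the perspective of a single crowdsourcing worker, which turns task allocation into solving for the state-action value Q and a one-to-one assignment between workers and tasks. Then, an improved deep reinforcement learning algorithm learns offline from historical task data to build a prediction model for Q values. Finally, during dynamic online allocation, Q values are predicted in real time and used as edge weights in the KM (Kuhn-Munkres) algorithm to obtain an allocation that is optimal in global cumulative return. Experiments on a real taxi trip dataset show that the proposed strategy improves the long-term cumulative return when the number of workers is within a certain scale.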
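
The final step described above, a one-to-one worker-task assignment that maximizes the sum of predicted Q-values, can be sketched in Python as follows. This sketch substitutes an exhaustive search over permutations for the Kuhn-Munkres algorithm: it returns the same optimum on small instances but does not scale. The matrix `q` and the function name are hypothetical, and the sketch assumes there are at least as many tasks as workers.

```python
from itertools import permutations


def best_assignment(q):
    """Return (tasks, total) maximizing the sum of q[worker][task].

    q is a list of rows, one per worker; tasks[w] is the task given to
    worker w. Brute force stand-in for Kuhn-Munkres, small inputs only.
    """
    n_workers, n_tasks = len(q), len(q[0])
    best_value, best_perm = float("-inf"), None
    # Each permutation assigns a distinct task to every worker.
    for perm in permutations(range(n_tasks), n_workers):
        value = sum(q[w][t] for w, t in enumerate(perm))
        if value > best_value:
            best_value, best_perm = value, perm
    return list(best_perm), best_value
```

For `q = [[5, 4], [4, 1]]`, giving worker 0 its individually best task (task 0) yields a total of 6, whereas the joint optimum assigns task 1 to worker 0 and task 0 to worker 1 for a total of 8, which is why a global matching step is used instead of per-worker greedy selection.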

8.

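Heuristic Q-Learning Based on State-Backtracking Cost Analysis

方敏 李浩 《模式识别与人工智能》2013,26(9):838-844

Because learning an action policy is time-consuming in reinforcement learning, a heuristic reinforcement learning method based on state backtracking is proposed. Repeated states in the reinforcement learning process are analysed; by comparing the selection strategies for repeated actions during state backtracking, a cost function is introduced to describe the importance of repeated actions. A new heuristic function is defined by combining action reward and action cost. While emphasizing action importance to speed up learning, this heuristic function uses the cost function to compute the cost of action selection and so reduces unnecessary exploration, improving learning efficiency steadily. The action-selection strategy based on the cost function is proved. Two simulation scenarios are built and the algorithm is applied to simulated robot path planning. The experimental results show that the state-backtracking heuristic method balances the reward obtained against the cost paid and effectively improves the convergence speed of Q-learning.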

9.

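Time-Optimal Trajectory Planning for Excavators Based on Reinforcement Learning

张韵悦 孙志毅 孙前来 王银 《控制与决策》2024,39(5):1433-1440

For autonomous excavator operation, a time-optimal trajectory planning method based on reinforcement learning is proposed. First, a simulation environment is built to generate data; the angles and angular velocities of the boom, arm and bucket joints serve as state observations, the angular accelerations of the joints serve as actions, and the simulation environment interacts with the learning algorithm through the state observations. Next, a reward function is designed from whether the boom, arm and bucket joint motions exceed their permitted ranges, the total task completion time, and the relative distance to the target, and it is used to train the policy network parameters. Finally, an improved proximal policy optimization (PPO) algorithm is used to achieve time-optimal trajectory planning for the excavator. The method is also compared with different reinforcement learning algorithms for continuous action spaces. The experimental results show that the proposed algorithm is more efficient, converges faster and produces smoother working trajectories, which effectively avoids large impacts on the joints and helps the excavator work efficiently and smoothly.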

10.

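Improved DDPG Algorithm Based on Offline Model Pre-Training

张茜 王洪格 倪亮 《计算机工程与设计》2022,43(5):1451-1458

To address the tendency of DDPG (deep deterministic policy gradient) to fall into local minima during online training and to generate large numbers of trial-and-error actions and invalid data, an improved DDPG algorithm based on offline model pre-training is proposed. Existing data are used to train a plant state model and a value-reward model offline, and the actor and critic networks of DDPG are pre-trained in advance, which reduces DDPG's early-stage workload and improves the quality of online learning. A DDQN (double deep Q-learning network) structure is added to counter the overestimation of Q values. In simulation, the average cumulative reward increased by 9.15%, showing that the improved algorithm effectively improves the performance of DDPG.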
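
As background for the Q-value overestimation issue mentioned above, the following Python sketch contrasts the standard Q-learning target with the generic double Q-learning target for a discrete action set. It illustrates the general idea only: how the paper combines a DDQN structure with DDPG's continuous actions is not described in the abstract, and the function and argument names are hypothetical.

```python
def single_q_target(reward, next_q_target, gamma=0.99, done=False):
    """Standard target: the same estimates both select and evaluate."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def double_q_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Double Q-learning target: the online estimates select the action,
    the target estimates evaluate it, which reduces upward bias."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    return reward + gamma * next_q_target[best_action]
```

When the target estimates contain one spuriously large value, the single-estimator target picks it up through the `max`, while the double target uses it only if the online estimates also prefer that action.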

11.

Towards online reinforced learning of assembly sequence planning with interactive guidance systems for industry 4.0 adaptive manufacturing

《Journal of Manufacturing Systems》2021

Literature shows that reinforcement learning (RL) and the well-known optimization algorithms derived from it have been applied to assembly sequence planning (ASP); however, the way this is done, as an offline process, ends up generating optimization methods that are not exploiting the full potential of RL. Today’s assembly lines need to be adaptive to changes, resilient to errors and attentive to the operators’ skills and needs. If all of these aspects need to evolve towards a new paradigm, called Industry 4.0, the way RL is applied to ASP needs to change as well: the RL phase has to be part of the assembly execution phase and be optimized with time and several repetitions of the process. This article presents an agile exploratory experiment in ASP to prove the effectiveness of RL techniques to execute ASP as an adaptive, online and experience-driven optimization process, directly at assembly time. The human-assembly interaction is modelled through the input-outputs of an assembly guidance system built as an assembly digital twin. Experimental assemblies are executed without pre-established assembly sequence plans and adapted to the operators’ needs. The experiments show that precedence and transition matrices for an assembly can be generated from the statistical knowledge of several different assembly executions. When the frequency of a given subassembly reinforces its importance, statistical results obtained from the experiments prove that online RL applications are not only possible but also effective for learning, teaching, executing and improving assembly tasks at the same time. This article paves the way towards the application of online RL algorithms to ASP.

12.

Explanation-Based Learning and Reinforcement Learning: A Unified View  Total citations: 3; self-citations: 0; citations by others: 3

Dietterich Thomas G. Flann Nicholas S. 《Machine Learning》1997,28(2-3):169-210

13.

Supervised fuzzy reinforcement learning for robot navigation

《Applied Soft Computing》2016

This paper presents a new method for combining supervised learning and reinforcement learning (RL). Applying supervised learning to robot navigation encounters serious challenges such as inconsistent and noisy data, difficulty in gathering training data, and high error in training data. RL capabilities such as training from a single scalar evaluation signal and a high degree of exploration have encouraged researchers to use RL for the robot navigation problem. However, RL algorithms are time-consuming and suffer from a high failure rate in the training phase. Here, we propose Supervised Fuzzy Sarsa Learning (SFSL) as a novel idea for combining the advantages of supervised and reinforcement learning algorithms. A zero-order Takagi–Sugeno fuzzy controller with some candidate actions for each rule is considered as the main module of the robot's controller. The aim of training is to find the best action for each fuzzy rule. In the first step, a human supervisor drives an E-puck robot within the environment and the training data are gathered. In the second step, as a hard tuning, the training data are used to initialize the value (worth) of each candidate action in the fuzzy rules. Afterwards, the fuzzy Sarsa learning module, as a critic-only fuzzy reinforcement learner, fine-tunes the parameters of the conclusion parts of the fuzzy controller online. The proposed algorithm is used to drive an E-puck robot in an environment with obstacles. The experimental results show that the proposed approach decreases the learning time and the number of failures, and that it improves the quality of the robot's motion in the testing environments.

14.

Reinforcement learning for resource allocation in LEO satellite networks.

Wipawee Usaha Javier A Barria 《IEEE transactions on systems, man, and cybernetics. Part B, Cybernetics》2007,37(3):515-527

In this paper, we develop and assess online decision-making algorithms for call admission and routing for low Earth orbit (LEO) satellite networks. It has been shown in a recent paper that, in a LEO satellite system, a semi-Markov decision process formulation of the call admission and routing problem can achieve better performance in terms of an average revenue function than existing routing methods. However, the conventional dynamic programming (DP) numerical solution becomes prohibitive as the problem size increases. In this paper, two solution methods based on reinforcement learning (RL) are proposed in order to circumvent the computational burden of DP. The first method is based on an actor-critic method with temporal-difference (TD) learning. The second method is based on a critic-only method, called optimistic TD learning. The algorithms enhance performance in terms of requirements in storage, computational complexity and computational time, and in terms of an overall long-term average revenue function that penalizes blocked calls. Numerical studies are carried out, and the results obtained show that the RL framework can achieve up to 56% higher average revenue over existing routing methods used in LEO satellite networks with reasonable storage and computational requirements.

15.

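Reinforcement Learning and Adaptive Dynamic Programming: A Survey from Fundamental Theory to Applications in Multi-Agent Systems

温广辉 杨涛 周佳玲 付俊杰 徐磊 《控制与决策》2023,38(5):1200-1230

In recent years, the rapid development of reinforcement learning and adaptive dynamic programming algorithms, together with their successful application to a range of challenging problems (such as optimal decision-making and optimal coordinated control in large-scale multi-agent systems), has made them a research focus in artificial intelligence, systems and control, applied mathematics and related fields. In view of this, the survey first briefly introduces the fundamentals and core ideas of reinforcement learning and adaptive dynamic programming, and then reviews the development of these two closely related classes of algorithms across research fields, with emphasis on their progression from sequential decision-making (optimal control) problems for a single agent (controlled plant) to sequential decision-making (optimal coordinated control) problems for multi-agent systems. Further, after briefly describing how the structure of adaptive dynamic programming algorithms has changed and how they have evolved from model-based offline planning to model-free online learning, it reviews research progress on adaptive dynamic programming for optimal coordinated control of multi-agent systems. Finally, it lists some challenging topics that deserve attention in multi-agent reinforcement learning and in the use of adaptive dynamic programming to solve optimal coordinated control problems for multi-agent systems.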

16.

Ensemble Algorithms in Reinforcement Learning  Total citations: 1; self-citations: 0; citations by others: 1

Wiering M.A. van Hasselt H. 《IEEE transactions on systems, man, and cybernetics. Part B, Cybernetics》2008,38(4):930-936

This paper describes several ensemble methods that combine multiple different reinforcement learning (RL) algorithms in a single agent. The aim is to enhance learning speed and final performance by combining the chosen actions or action probabilities of different RL algorithms. We designed and implemented four different ensemble methods combining the following five different RL algorithms: Q-learning, Sarsa, actor–critic (AC), QV-learning, and AC learning automaton. The intuitively designed ensemble methods, namely, majority voting (MV), rank voting, Boltzmann multiplication (BM), and Boltzmann addition, combine the policies derived from the value functions of the different RL algorithms, in contrast to previous work where ensemble methods have been used in RL for representing and learning a single value function. We show experiments on five maze problems of varying complexity; the first problem is simple, but the other four maze tasks are of a dynamic or partially observable nature. The results indicate that the BM and MV ensembles significantly outperform the single RL algorithms.
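
Two of the ensemble rules named above, majority voting and Boltzmann multiplication, can be sketched in Python as follows. This is a simplified illustration, not the paper's exact formulation: the tie-breaking rule, the direct normalization of the product (without a further temperature parameter), and the function names are all assumptions.

```python
from collections import Counter


def majority_vote(greedy_actions):
    """Pick the action preferred by the most algorithms.

    greedy_actions[j] is the action index that algorithm j considers best.
    Ties are broken in favour of the lowest action index (an assumption).
    """
    counts = Counter(greedy_actions)
    return max(sorted(counts), key=counts.__getitem__)


def boltzmann_multiplication(policies):
    """Combine per-algorithm action probabilities by multiplying them.

    policies[j][a] is the probability algorithm j gives to action a.
    The product is renormalized so the result is a distribution.
    """
    n_actions = len(policies[0])
    prefs = [1.0] * n_actions
    for probs in policies:
        for a in range(n_actions):
            prefs[a] *= probs[a]
    total = sum(prefs)
    return [p / total for p in prefs]
```

Multiplication lets any single algorithm that assigns a near-zero probability to an action effectively veto it, whereas voting only counts each algorithm's top choice.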

17.

Reinforcement learning in robotic applications: a comprehensive survey

Singh Bharat Kumar Rajesh Singh Vinay Pratap 《Artificial Intelligence Review》2022,55(2):945-990

In recent years, artificial intelligence (AI) has been used to create complex automated control systems. Still, researchers are trying to build a completely autonomous system that resembles human beings. Researchers working in AI think that there is a strong connection between the learning patterns of humans and AI. They have found that machine learning (ML) algorithms can effectively produce self-learning systems. ML is a sub-field of AI in which reinforcement learning (RL) is the only available methodology that resembles the learning mechanism of the human brain. Therefore, RL must take a key role in the creation of autonomous robotic systems. In recent years, RL has been applied on many robotic platforms (air-based, underwater, land-based and others) and has had considerable success in solving complex tasks. In this paper, a brief overview of the application of reinforcement learning algorithms in robotics is presented. The survey offers a comprehensive review organized into four segments: (1) the development of RL; (2) types of RL algorithm, such as actor-critic, deep RL, multi-agent RL and human-centered algorithms; (3) applications of RL in robotics grouped by platform, such as land-based, water-based and air-based; and (4) the RL algorithms and mechanisms used in robotic applications. Finally, an open discussion is provided that raises a range of potential future research directions in robotics. The objective of this survey is to provide a reference point that steers future research in a more meaningful direction.

18.

Greedy feature replacement for online value function approximation

Feng-fei Zhao Zheng Qin Zhuo Shao Jun Fang Bo-yan Ren 《浙江大学学报:C卷英文版》2014,15(3):223-231

Reinforcement learning (RL) in real-world problems requires function approximations that depend on selecting the appropriate feature representations. Representational expansion techniques can make linear approximators represent value functions more effectively; however, most of these techniques function well only for low dimensional problems. In this paper, we present the greedy feature replacement (GFR), a novel online expansion technique, for value-based RL algorithms that use binary features. Given a simple initial representation, the feature representation is expanded incrementally. New feature dependencies are added automatically to the current representation and conjunctive features are used to replace current features greedily. The virtual temporal difference (TD) error is recorded for each conjunctive feature to judge whether the replacement can improve the approximation. Correctness guarantees and computational complexity analysis are provided for GFR. Experimental results in two domains show that GFR achieves much faster learning and has the capability to handle large-scale problems.

19.

Robust reinforcement learning

Morimoto J Doya K 《Neural computation》2005,17(2):335-359

This letter proposes a new reinforcement learning (RL) paradigm that explicitly takes into account input disturbance as well as modeling errors. The use of environmental models in RL is quite popular for both offline learning using simulations and for online action planning. However, the difference between the model and the real environment can lead to unpredictable, and often unwanted, results. Based on the theory of H∞ control, we consider a differential game in which a "disturbing" agent tries to make the worst possible disturbance while a "control" agent tries to make the best control input. The problem is formulated as finding a min-max solution of a value function that takes into account the amount of the reward and the norm of the disturbance. We derive online learning algorithms for estimating the value function and for calculating the worst disturbance and the best control in reference to the value function. We tested the paradigm, which we call robust reinforcement learning (RRL), on the control task of an inverted pendulum. In the linear domain, the policy and the value function learned by online algorithms coincided with those derived analytically by the linear H∞ control theory. For a fully nonlinear swing-up task, RRL achieved robust performance with changes in the pendulum weight and friction, while a standard reinforcement learning algorithm could not deal with these changes. We also applied RRL to the cart-pole swing-up task, and a robust swing-up policy was acquired.

20.

EXPERIMENTS WITH ONLINE REINFORCEMENT LEARNING IN REAL-TIME STRATEGY GAMES

Kresten Toftgaard Andersen Dennis Dahl Christensen Dung Tran 《Applied Artificial Intelligence》2013,27(9):855-871

Real-time strategy (RTS) games provide a challenging platform for implementing online reinforcement learning (RL) techniques in a real application. The computer, as one of the players, monitors its opponents’ (human or other computers) strategies and then updates its own policy using RL methods. In this article, we first examine the suitability of applying online RL in various computer games. The applicability of reinforcement learning depends on both the RL complexity and the game features. We then propose a multi-layer framework for implementing online RL in an RTS game. The framework significantly reduces RL computational complexity by decomposing the state space in a hierarchical manner. We implement an RTS game, Tank General, and perform a thorough test on the proposed framework. We consider three typical profiles of RTS game players and compare two basic RL techniques applied in the game. The results show the effectiveness of our proposed framework and shed light on relevant issues in using online RL in RTS games.