Similar Articles
18 similar articles retrieved (search time: 140 ms)
1.
Reinforcement learning solves optimal decision-making problems in model-free settings and is one of the key techniques for realizing artificial intelligence, but traditional tabular reinforcement learning methods struggle with control problems that have large-scale, continuous spaces. Approximate reinforcement learning, inspired by the idea of function approximation, parameterizes the value function or the policy function and obtains an optimal behavior policy indirectly through parameter optimization; it has been applied with notable success to video games, board-game play, and robot control. On this basis, this paper surveys the research status and application progress of approximate reinforcement learning algorithms: it introduces the underlying theory, classifies and summarizes the classic algorithms together with their main improvements, reviews progress in applying approximate reinforcement learning to robot control, and summarizes the main open problems, providing a reference for subsequent research.

2.
To address shortcomings in current methods for evaluating guidance-instrument errors and assessing impact-point accuracy, a comprehensive guidance-instrument error evaluation method based on the environment-function method and trajectory simulation is proposed, and the functional and cross-coupling relationships between the individual performance indices of each guidance instrument and the missile's impact-point accuracy are derived. Simulation results show that the method is feasible and effective and allows a more complete evaluation of guidance-instrument errors.

3.
In complex continuous-space application scenarios, classic discrete-space reinforcement learning methods no longer meet practical needs, while existing continuous-space reinforcement learning methods mainly use linear fitting to approximate the state-value function and the action-selection function, which limits accuracy. A nonlinear actor-critic approach based on a union neural network (UNN-AC) is proposed. The method represents the action-selection function and the critic value function as a single joint neural network model and uses this network to nonlinearly fit the state-value function and the action-selection probabilities. Compared with existing linear fitting methods, the nonlinear UNN-AC improves the approximation accuracy of both the critic value function and the action-selection function. Experimental results show that UNN-AC effectively solves the approximately optimal policy problem in continuous spaces and, compared with classic continuous-action-space algorithms, converges faster and is more stable.
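The abstract does not specify the network; a minimal PyTorch sketch of the "joint" idea, a single network with a shared trunk feeding an actor head and a critic head (layer sizes and activations are assumptions, not taken from the paper), might look like:

```python
import torch
import torch.nn as nn

class JointActorCritic(nn.Module):
    """Single network with a shared trunk, an actor head and a critic head
    (illustrative sizes only, not the architecture used in the paper)."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                   nn.Linear(hidden, hidden), nn.Tanh())
        self.actor = nn.Linear(hidden, n_actions)   # action-selection logits
        self.critic = nn.Linear(hidden, 1)          # state-value estimate

    def forward(self, state):
        h = self.trunk(state)
        return torch.softmax(self.actor(h), dim=-1), self.critic(h)

# usage: probs, value = JointActorCritic(4, 2)(torch.randn(1, 4))
```

Both heads are updated through the shared trunk, which is what couples the fitting of the action-selection probabilities and the value estimate.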

4.
To overcome the curse of dimensionality that dynamic programming faces in high-dimensional optimization problems, function approximation is used to obtain the cost function and an approximate dynamic-programming solution is learned through self-learning, which suits decision optimization and control of complex nonlinear systems. The dual heuristic programming (DHP) algorithm is applied to the control of a cement clinkering system: neural networks are used to build a critic module and an action module that optimize the control of the system. A suitable optimization objective is chosen; the critic module judges how good an action is and feeds this back to the action module, which then outputs the adjustment for each parameter. Simulation results show that the system states can be kept stably within a reasonable range.
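The abstract does not restate the DHP equations; for orientation, in standard DHP the critic approximates the costate λ(x) = ∂J/∂x rather than J itself, and (in discrete time, with stage cost U and discount γ) its training target is commonly written as:

```latex
\lambda(x_k) \approx \frac{\partial U(x_k,u_k)}{\partial x_k}
  + \left(\frac{\partial u_k}{\partial x_k}\right)^{\!\top}\frac{\partial U(x_k,u_k)}{\partial u_k}
  + \gamma\left[\frac{\partial x_{k+1}}{\partial x_k}
  + \frac{\partial x_{k+1}}{\partial u_k}\frac{\partial u_k}{\partial x_k}\right]^{\!\top}\lambda(x_{k+1})
```

The action network is then trained to drive ∂U/∂u + γ (∂x_{k+1}/∂u)ᵀ λ(x_{k+1}) toward zero.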

5.
Obstacle avoidance is a key part of motion planning for UAVs and other autonomous unmanned systems, and its core is the design of an effective avoidance control method. To further improve decision optimality and control performance, this paper proposes a reinforcement-learning-based autonomous obstacle-avoidance control method in an optimal control setting, which generates safe trajectories online in an adaptive manner. First, a smooth reward-penalty term is designed in the cost function using the barrier-function method, converting the avoidance problem into an unconstrained optimal control problem. Then, adaptive reinforcement learning is realized with actor-critic neural networks and policy iteration, where the critic network approximates the cost function with state-following kernel functions and the actor network yields a near-optimal control policy; at the same time, simulated experience is obtained by state extrapolation so that the critic network can use experience replay for reliable local exploration. Finally, simulations and comparisons on a simplified UAV system and a nonlinear numerical system show that the proposed avoidance control method generates near-optimal safe trajectories in real time.
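The exact reward-penalty term is not given in the abstract; one illustrative way (an assumption of this sketch, with d(x) the distance to the nearest obstacle and r_s a safety radius) to embed a smooth barrier in the cost is:

```latex
J(x_0) = \int_0^{\infty} \Big( x^{\top} Q x + u^{\top} R u + B(x) \Big)\, dt,
\qquad
B(x) = k \,\ln\!\Big( 1 + \frac{r_s^{2}}{d(x)^{2} - r_s^{2}} \Big), \quad d(x) > r_s
```

The barrier term vanishes far from obstacles and grows without bound as d(x) approaches r_s, so minimizing J keeps the trajectory away from the constraint boundary without an explicit hard constraint.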

6.
A polynomial-function-type recurrent neural network model and its applications
周永权. 《计算机学报》, 2003, 26(9): 1196-1200
Exploiting the fact that a recurrent neural network has both feedforward and feedback paths, this paper sets the activation functions of the hidden-layer neurons to a sequence of adjustable polynomial functions and proposes a new polynomial-function-type recurrent neural network model. The model retains the characteristics of traditional recurrent neural networks while offering stronger function-approximation ability. For recursive computation problems, a learning algorithm for this network is proposed and applied to approximate factorization of multivariate polynomials, where it shows clear advantages. Worked examples show that the algorithm is effective, converges quickly, and achieves high accuracy, making it suitable for recursive computation problems. The proposed model and learning algorithm also provide useful guidance for approximate symbolic-algebraic computation.
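The abstract only states that the hidden activations are adjustable polynomial sequences; in generic (assumed) notation, the activation of hidden unit j is a polynomial whose coefficients are trained along with the network weights:

```latex
f_j(x) = \sum_{k=0}^{n_j} c_{j,k}\, x^{k}
```

Tuning the coefficients c_{j,k} during learning is what gives the model the stronger function-approximation ability the abstract claims over fixed activations.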

7.
To achieve more realistic and effective simulation of infrared imaging guided missiles, and thereby verify and optimize their guidance performance, flight simulation techniques for infrared imaging guided missiles are studied. The basic requirements of missile flight simulation and the principles and design scheme of an infrared imaging guidance simulation system are discussed, several evaluation metrics for missile flight performance are proposed, and a missile flight visualization simulation system is implemented with virtual reality simulation technology. Analysis of the experimental results shows that the simulation system largely meets the requirements of infrared imaging guided missile flight simulation and is of value for research on such missiles.

8.
For model-free nonlinear systems with continuous state spaces, a multi-step reinforcement learning control algorithm based on radial basis function (RBF) neural networks is proposed. First, a neural network is introduced into the reinforcement learning system: the function-approximation capability of the RBF network is used to represent the state-action value function, solving the representation problem in continuous state spaces. Then, an eligibility-trace mechanism is combined with Sarsa to form a multi-step Sarsa algorithm, improving learning efficiency by recording visited states. Finally, the softmax policy is improved with a decaying temperature parameter to tune the action-selection probabilities and balance exploration and exploitation. Simulations on the MountainCar task show that, after a small amount of training, the proposed algorithm effectively controls a continuous nonlinear system without a model; compared with the single-step algorithm, it converges in fewer steps on average and is more stable, indicating that combining nonlinear value-function approximation with multi-step methods also performs well in control tasks.
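A minimal numpy sketch of the ingredients described here, Sarsa(λ) over RBF features with a temperature-decayed softmax policy, assuming a Gymnasium-style MountainCar environment and illustrative centers and hyperparameters (none of these values come from the paper):

```python
import numpy as np

n_actions, n_rbf, width = 3, 50, 0.1
centers = np.random.uniform([-1.2, -0.07], [0.6, 0.07], size=(n_rbf, 2))

def phi(s):                        # RBF feature vector of a state
    return np.exp(-np.sum((centers - s) ** 2, axis=1) / (2 * width ** 2))

def q_values(w, s):                # Q(s, a) = w_a . phi(s)
    return w @ phi(s)

def softmax_action(w, s, tau):     # Boltzmann exploration at temperature tau
    q = q_values(w, s) / tau
    p = np.exp(q - q.max()); p /= p.sum()
    return np.random.choice(n_actions, p=p)

def sarsa_lambda_episode(env, w, alpha=0.05, gamma=0.99, lam=0.9, tau=1.0):
    e = np.zeros_like(w)           # eligibility traces, one per weight
    s, _ = env.reset()
    a = softmax_action(w, s, tau)
    done = False
    while not done:
        s2, r, term, trunc, _ = env.step(a)
        done = term or trunc
        a2 = softmax_action(w, s2, tau)
        delta = r + gamma * q_values(w, s2)[a2] * (not done) - q_values(w, s)[a]
        e *= gamma * lam
        e[a] += phi(s)             # accumulate the trace of the taken action
        w += alpha * delta * e     # multi-step credit assignment via the trace
        s, a = s2, a2
    return w

# across episodes the temperature would be decayed, e.g. tau = max(0.05, 0.99 * tau)
```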

9.
刘力, 刘兴堂, 孙文, 高翔. 《计算机仿真》, 2005, (Z1): 189-190
In missile weapon system simulation, guidance and control system simulation has always occupied a very important position. It has evolved from development-phase simulation to full life-cycle simulation, with simulation now applied, and playing a major role, in every stage of the life cycle; this has gradually led to the concept of a graded hierarchy of evaluation methods for missile system performance, steadily increasing the realism of evaluation. This paper discusses in detail the simulation applications in each stage of the missile guidance and control system life cycle, the choice of simulation methods at different development stages, the composition of the hardware-in-the-loop simulation system used, and the analysis of the final test results, providing a theoretical basis for research on missile guidance and control system simulation.

10.
李晓宝, 赵国荣, 刘帅, 温家鑫. 《控制与决策》, 2020, 35(10): 2336-2344
For the terminal guidance problem of a missile intercepting a maneuvering target, a guidance law with impact-angle and seeker field-of-view constraints is designed based on finite-time sliding-mode control theory. First, the terminal guidance problem is converted into a stabilization problem for a guidance system with state constraints; a new non-singular terminal sliding surface and a time-varying barrier Lyapunov function are designed, a design method for the terminal sliding-mode guidance law is given, and an adaptive estimate of the upper bound of the target maneuver is introduced to handle its uncertainty. Then, stability theory is used to prove that the state variables of the guidance system converge in finite time, and, by combining the time-varying barrier Lyapunov function with the properties of the sliding surface, it is proved that the field-of-view constraint is never violated during terminal guidance. Compared with existing guidance laws that consider the field-of-view constraint, the proposed law involves no command switching, accelerates the convergence of the guidance system, and improves its disturbance rejection. Finally, simulations verify the effectiveness of the proposed guidance method.
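The abstract does not write out the sliding surface; the generic non-singular terminal sliding-mode surface that such designs typically build on (quoted here only for orientation, not as the paper's exact surface) is

```latex
s = x_1 + \frac{1}{\beta}\, x_2^{\,p/q}, \qquad \beta > 0,\ \ p,\, q\ \text{odd},\ \ 1 < p/q < 2
```

which avoids the singularity of the conventional terminal form while still forcing x_1 and x_2 to zero in finite time once s = 0 is reached; the paper's contribution lies in its new surface and the time-varying barrier Lyapunov function that keep the field-of-view constraint satisfied.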

11.
Research on gradient algorithms for reinforcement learning with neural networks
徐昕, 贺汉根. 《计算机学报》, 2003, 26(2): 227-233
For Markov decision problems with continuous state spaces and discrete action spaces, a new gradient-descent reinforcement learning algorithm is proposed that uses a multilayer feedforward neural network for value-function approximation. The algorithm adopts a near-greedy, continuously differentiable Boltzmann-distribution action-selection policy and approximates the optimal value function of the Markov decision process by minimizing the sum of squared Bellman residuals under a non-stationary behavior policy. The convergence of the algorithm and the performance of the near-optimal policy are analyzed theoretically, and simulations on the Mountain-Car learning control problem further verify the algorithm's learning efficiency and generalization performance.
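In notation not given in the abstract, the objective minimized by this family of methods is the sum of squared Bellman residuals, with the residual-gradient update

```latex
E(\theta) = \sum_t \delta_t^{2}, \qquad
\delta_t = r_t + \gamma V_\theta(s_{t+1}) - V_\theta(s_t), \qquad
\theta \leftarrow \theta - \alpha\,\delta_t \big( \gamma \nabla_\theta V_\theta(s_{t+1}) - \nabla_\theta V_\theta(s_t) \big)
```

where V_θ is the feedforward-network value estimate and the actions along the trajectory are drawn from the differentiable Boltzmann policy.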

12.
This paper introduces ANASA (adaptive neural algorithm of stochastic activation), a new, efficient reinforcement learning algorithm for training neural units and networks with continuous output. The proposed method employs concepts found in self-organizing neural network theory and in reinforcement estimator learning algorithms to extract and exploit information about previous input pattern presentations. In addition, it uses an adaptive learning rate function and a self-adjusting stochastic activation to accelerate the learning process. A form of optimal performance of the ANASA algorithm is proved (under a set of assumptions) via strong convergence theorems and concepts. Experimentally, the new algorithm yields results that are superior to existing associative reinforcement learning methods in terms of accuracy and convergence rate. The rapid convergence of ANASA is demonstrated in a simple learning task, when it is used as a single neural unit, and in mathematical function modeling problems, when it is used to train various multilayered neural networks.

13.
Kernel-Based Reinforcement Learning
Ormoneit, Dirk; Sen, Śaunak. Machine Learning, 2002, 49(2-3): 161-178
We present a kernel-based approach to reinforcement learning that overcomes the stability problems of temporal-difference learning in continuous state-spaces. First, our algorithm converges to a unique solution of an approximate Bellman's equation regardless of its initialization values. Second, the method is consistent in the sense that the resulting policy converges asymptotically to the optimal policy. Parametric value function estimates such as neural networks do not possess this property. Our kernel-based approach also allows us to show that the limiting distribution of the value function estimate is a Gaussian process. This information is useful in studying the bias-variance tradeoff in reinforcement learning. We find that all reinforcement learning approaches to estimating the value function, parametric or non-parametric, are subject to a bias. This bias is typically larger in reinforcement learning than in a comparable regression problem.
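A minimal numpy sketch of the kernel-averaging Bellman backup this line of work uses (the Gaussian kernel, the per-action sample layout, and all constants are assumptions of the sketch, not taken from the paper):

```python
import numpy as np

def kernel_q_iteration(samples, bandwidth=0.5, gamma=0.95, n_iters=100):
    """samples[a] = (S, R, S_next): transitions observed under action a.
    Returns q(s, a) defined by kernel-weighted approximate value iteration."""
    actions = sorted(samples)
    # Q-values at the successor states of each action's samples, per action
    q_next = {a: np.zeros((len(samples[a][0]), len(actions))) for a in actions}

    def weights(S, s):             # normalized Gaussian kernel weights
        w = np.exp(-np.sum((S - s) ** 2, axis=1) / (2 * bandwidth ** 2))
        return w / (w.sum() + 1e-12)

    def q(s, a):                   # kernel-averaged Bellman backup at (s, a)
        S, R, _ = samples[a]
        return weights(S, s) @ (R + gamma * q_next[a].max(axis=1))

    for _ in range(n_iters):       # iterate the approximate Bellman operator
        q_next = {a: np.array([[q(s2, b) for b in actions]
                               for s2 in samples[a][2]]) for a in actions}
    return q
```

Because the backup is an averager (the weights are non-negative and sum to one), the iteration is a contraction and converges to a unique fixed point regardless of initialization, which is the stability property the abstract emphasizes.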

14.
For the finite-horizon optimal control problem of partially unknown nonlinear continuous systems with saturated actuators, an online integral reinforcement learning algorithm based on adaptive dynamic programming (ADP) is designed and its convergence is proved. First, a non-quadratic functional is introduced to handle control saturation. Second, a single network composed of constant weights and time-varying activation functions is designed to approximate the unknown continuous value function, reducing the computational load compared with the traditional two-network structure. The neural-network residual and the terminal error are considered jointly, and the network weights are updated by least squares; convergence of the neural-network-based iterative value function to the optimum is proved. Finally, two simulation examples verify the effectiveness of the algorithm.
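The non-quadratic term is not written out in the abstract; the form commonly used in the ADP literature to encode a saturation bound |u| ≤ λ (quoted here as an assumption, not necessarily the paper's exact choice) is

```latex
W(u) = 2 \int_0^{u} \lambda \tanh^{-1}\!\big(v/\lambda\big)\, R \, dv
```

whose integrand grows without bound as |v| approaches λ; minimizing the resulting Hamiltonian then yields a control that stays strictly inside the bound, u* = -λ tanh((1/(2λ)) R⁻¹ g(x)ᵀ ∇V(x)).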

15.
This article proposes three novel time-varying policy iteration algorithms for the finite-horizon optimal control problem of continuous-time affine nonlinear systems. We first propose a model-based time-varying policy iteration algorithm. The method considers time-varying solutions to the Hamilton–Jacobi–Bellman equation for finite-horizon optimal control. Based on this algorithm, value function approximation is applied to the Bellman equation by establishing neural networks with time-varying weights. A novel update law for the time-varying weights is put forward based on the idea of iterative learning control, which obtains optimal solutions more efficiently than previous works. Considering that system models may be unknown in real applications, we propose a partially model-free time-varying policy iteration algorithm that applies integral reinforcement learning to acquire the time-varying value function. Moreover, analysis of convergence, stability, and optimality is provided for every algorithm. Finally, simulations for different cases are given to verify the convenience and effectiveness of the proposed algorithms.
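For orientation (notation assumed here, not taken from the article): for an affine system ẋ = f(x) + g(x)u with finite-horizon cost ∫ₜᵀ (Q(x) + uᵀRu) dτ + φ(x(T)), the time-varying HJB equation that the time-varying critic weights approximate is

```latex
-\frac{\partial V}{\partial t} = Q(x) + \nabla V^{\!\top} f(x)
  - \tfrac{1}{4}\, \nabla V^{\!\top} g(x) R^{-1} g(x)^{\!\top} \nabla V,
\qquad V(T, x) = \varphi(x)
```

and the explicit dependence of V on t is what makes constant critic weights insufficient and motivates weights that vary with time.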

16.
RAM-based neural networks are designed to be efficiently implemented in hardware. The desire to retain this property influences the training algorithms used, and has led to the use of reinforcement (reward-penalty) learning. An analysis of the reinforcement algorithm applied to RAM-based nodes has shown the ease with which unlearning can occur. An amended algorithm is proposed which demonstrates improved learning performance compared to previously published reinforcement regimes.

17.
Learning classifier systems (LCS) are population-based reinforcement learners that were originally designed to model various cognitive phenomena. This paper presents an explicitly cognitive LCS by using spiking neural networks as classifiers, providing each classifier with a measure of temporal dynamism. We employ a constructivist model of growth of both neurons and synaptic connections, which permits a genetic algorithm to automatically evolve sufficiently complex neural structures. The spiking classifiers are coupled with a temporally sensitive reinforcement learning algorithm, which allows the system to perform temporal state decomposition by appropriately rewarding "macro-actions" created by chaining together multiple atomic actions. The combination of temporal reinforcement learning and neural information processing is shown to outperform benchmark neural classifier systems and to successfully solve a robotic navigation task.

18.
This paper proposes model-free deep inverse reinforcement learning to find nonlinear reward function structures. We formulate inverse reinforcement learning as a problem of density ratio estimation and show that, under the framework of linearly solvable Markov decision processes, the log of the ratio between the optimal state transition and a baseline one is given by part of the reward and the difference of the value functions. The logarithm of the density ratio is efficiently estimated by binomial logistic regression, whose classifier is constructed from the reward and the state value function. The classifier tries to discriminate between samples drawn from the optimal state transition probability and those drawn from the baseline one. The estimated state value function is then used to initialize part of the deep neural network used for forward reinforcement learning. The proposed deep forward and inverse reinforcement learning is applied to two benchmark games: Atari 2600 and Reversi. Simulation results show that our method reaches the best performance substantially faster than the standard combination of forward and inverse reinforcement learning, as well as behavior cloning.
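A minimal PyTorch sketch of the density-ratio idea (the network shapes and the exact logit parameterization are assumptions of this sketch, not the paper's): a binary classifier whose logit is r(s) + V(s') − V(s) is fit by logistic regression to separate optimal transitions (label 1) from baseline transitions (label 0), so after training the logit approximates the log density ratio and the r and V heads can be read off.

```python
import torch
import torch.nn as nn

class RatioClassifier(nn.Module):
    """Logit = r(s) + V(s_next) - V(s); its sigmoid discriminates optimal
    transitions (label 1) from baseline transitions (label 0)."""
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        def head():
            return nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))
        self.reward, self.value = head(), head()

    def forward(self, s, s_next):
        return self.reward(s) + self.value(s_next) - self.value(s)

def train_step(model, opt, s_opt, s_opt_next, s_base, s_base_next):
    # binomial logistic regression over the two transition sets
    logits = torch.cat([model(s_opt, s_opt_next), model(s_base, s_base_next)])
    labels = torch.cat([torch.ones(len(s_opt), 1), torch.zeros(len(s_base), 1)])
    loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```

The trained value head is then what the abstract describes being reused to initialize part of the forward-RL network.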
