Similar Articles
A total of 19 similar articles were found (search time: 171 ms).
1.
A learning controller design method for two-wheel-driven mobile robots*   Cited by: 1 (self-citations: 0, others: 1)
A reinforcement-learning-based path-following control method is proposed for two-wheel-driven mobile robots. The optimal design of the robot's motion controller is modeled as a Markov decision process, and the kernel-based least squares policy iteration (KLSPI) algorithm is used to optimize the controller parameters through self-learning. Unlike traditional tabular and neural-network-based reinforcement learning methods, KLSPI applies kernel methods in policy evaluation for feature selection and value function approximation, which improves generalization performance and learning efficiency. Simulation results show that the method obtains an optimized path-following control policy within a small number of iterations, which facilitates its adoption in practical applications.
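A minimal sketch of the least squares policy iteration loop that KLSPI builds on, assuming a generic feature map `phi` and pre-collected transition samples; all names here are illustrative, not taken from the paper:

```python
import numpy as np

# Minimal LSPI sketch (illustrative only): LSTD-Q policy evaluation
# followed by greedy policy improvement over a finite action set.

def lstdq(samples, phi, policy, gamma=0.95, reg=1e-6):
    """Estimate Q-function weights w so that Q(s, a) ~ phi(s, a) @ w."""
    k = phi(samples[0][0], samples[0][1]).size
    A = reg * np.eye(k)
    b = np.zeros(k)
    for s, a, r, s_next in samples:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))   # action chosen by the current policy
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A, b)

def lspi(samples, phi, actions, n_iter=20, gamma=0.95):
    w = np.zeros(phi(samples[0][0], samples[0][1]).size)
    for _ in range(n_iter):
        policy = lambda s, w=w: max(actions, key=lambda a: phi(s, a) @ w)
        w = lstdq(samples, phi, policy, gamma)   # policy evaluation
    return w                                      # weights of the final greedy policy
```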

2.
周鑫  刘全  傅启明  肖飞 《计算机科学》2014,41(9):232-238
Policy iteration is a reinforcement learning method that iteratively evaluates and improves a control policy. Least-squares policy evaluation can extract more useful information from empirical data and thus improve data efficiency. To address the problem that online least-squares policy iteration makes insufficient use of sample data, discarding each sample after a single use, a batch least-squares policy iteration algorithm (BLSPI) is proposed and its convergence is proved theoretically. BLSPI combines batch updating with online least-squares policy iteration: it stores the samples generated online and reuses them repeatedly with the least-squares method to update the control policy. Applying BLSPI to an inverted-pendulum benchmark, the experimental results show that the algorithm can effectively exploit previous experience, improve sample utilization, and accelerate convergence.
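A hedged sketch of the sample-reuse idea behind BLSPI, assuming the `lstdq` and `phi` helpers from the LSPI sketch under entry 1; the buffer and batching details are illustrative, not taken from the paper:

```python
import random

# Illustrative sample-reuse loop: newly collected transitions are stored
# and the LSTD-Q solve is repeated over the whole (or a sampled) batch,
# instead of discarding each transition after one online update.

class ReplayLSPI:
    def __init__(self, phi, actions, gamma=0.95):
        self.phi, self.actions, self.gamma = phi, actions, gamma
        self.buffer = []          # transitions (s, a, r, s_next) kept for reuse
        self.w = None

    def policy(self, s):
        if self.w is None:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.phi(s, a) @ self.w)

    def add_and_update(self, transition, batch_size=500):
        self.buffer.append(transition)
        batch = random.sample(self.buffer, min(batch_size, len(self.buffer)))
        self.w = lstdq(batch, self.phi, self.policy, self.gamma)
```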

3.
This work studies the performance evaluation of several typical reinforcement learning algorithms, including Q-learning, least squares policy iteration (LSPI), and kernel-based least squares policy iteration (KLSPI), focusing on how the smoothness of the value function of a Markov decision problem (MDP) affects algorithm performance. The algorithms are tested and compared on a combinatorial optimization problem with a non-smooth value function, the traveling salesman problem (TSP), and on the Mountain-Car control problem, whose value function is smooth. The characteristics of the three algorithms on the different problem types are analyzed, and the experimental comparison verifies that approximate policy iteration algorithms, KLSPI in particular, when solving sequential decision problems with smooth value functions, … [abstract truncated in the source]
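As a point of reference for the comparison described above, a minimal tabular Q-learning update in its generic textbook form (not code from the cited study; the environment interface is an assumption):

```python
import random
from collections import defaultdict

# Textbook tabular Q-learning: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.95, eps=0.1):
    Q = defaultdict(float)                       # Q[(state, action)]
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < eps:            # epsilon-greedy exploration
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a: Q[(s, a)])
            s_next, r, done = env.step(a)        # assumed environment interface
            target = r + (0 if done else gamma * max(Q[(s_next, b)] for b in actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```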

4.
许洋  秦小林  刘佳  张力戈 《计算机应用》2020,40(5):1515-1521
To address the problem that formation-shape constraints cause some narrow passages to be ignored in cooperative path planning for multiple unmanned aerial vehicles (UAVs), a fast particle swarm optimization method based on adaptive distributed model predictive control (ADMPC-FPSO) is proposed. The method constructs virtual formation guidance points with a formation strategy that combines the leader-follower and virtual-structure approaches, so as to accomplish adaptive cooperative formation control. Following the idea of model predictive control, combined with distributed control, cooperative path planning is cast as a receding-horizon online optimization problem whose cost function consists of performance indices such as minimum distance. With a designed evaluation-function criterion, the problem is solved by a fast particle swarm optimization algorithm with variable weights. Simulation results show that the proposed algorithm achieves cooperative multi-UAV path planning effectively, adapts the formation quickly as the environment changes, and incurs a lower cost than traditional formation strategies.
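A hedged sketch of a particle swarm optimizer with a linearly decreasing inertia weight, the generic building block behind a "fast PSO with variable weights"; the cost function and bounds are placeholders, not the paper's planning objective:

```python
import numpy as np

# Generic PSO with a linearly decreasing inertia weight (illustrative only).
def pso(cost, dim, bounds, n_particles=30, n_iter=100, c1=2.0, c2=2.0,
        w_start=0.9, w_end=0.4):
    lo, hi = bounds
    x = np.random.uniform(lo, hi, (n_particles, dim))     # positions
    v = np.zeros((n_particles, dim))                       # velocities
    pbest, pbest_cost = x.copy(), np.array([cost(p) for p in x])
    gbest = pbest[np.argmin(pbest_cost)].copy()
    for t in range(n_iter):
        w = w_start - (w_start - w_end) * t / n_iter       # variable inertia weight
        r1, r2 = np.random.rand(n_particles, dim), np.random.rand(n_particles, dim)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = np.clip(x + v, lo, hi)
        costs = np.array([cost(p) for p in x])
        improved = costs < pbest_cost
        pbest[improved], pbest_cost[improved] = x[improved], costs[improved]
        gbest = pbest[np.argmin(pbest_cost)].copy()
    return gbest, pbest_cost.min()

# Example: minimize a simple quadratic cost standing in for the planning objective.
best, best_cost = pso(lambda p: float(np.sum(p ** 2)), dim=4, bounds=(-5.0, 5.0))
```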

5.
This work studies uncalibrated visual servoing of robots based on intelligent algorithms and proposes a new uncalibrated visual immune control method based on least squares support vector regression (LSSVR). LSSVR is used to learn the complex nonlinear relationship between changes in the robot's pose and the observed changes in image features, with the LSSVR parameters optimized by an adaptive immune algorithm combined with 5-fold cross-validation; on this basis, a visual controller is designed using immune control principles. Experiments on spatial 4-DOF visual positioning with a six-degree-of-freedom industrial robot demonstrate the effectiveness of the method.
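A minimal sketch of least squares support vector regression with an RBF kernel, the regression component the method builds on; the hyperparameters and toy data below are placeholders (the paper tunes its parameters with an immune algorithm plus 5-fold cross-validation):

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    """RBF kernel matrix between row-wise sample sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def lssvr_fit(X, y, gamma=10.0, sigma=1.0):
    """Solve the LS-SVR linear system [[0, 1^T], [1, K + I/gamma]] [b; a] = [0; y]."""
    n = len(y)
    K = rbf(X, X, sigma)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], y))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]            # bias b, coefficients alpha

def lssvr_predict(Xtrain, b, alpha, Xnew, sigma=1.0):
    return rbf(Xnew, Xtrain, sigma) @ alpha + b

# Toy usage: learn a 1-D mapping from feature change to pose change (placeholder data).
X = np.linspace(-1, 1, 50).reshape(-1, 1)
y = np.sin(3 * X[:, 0])
b, alpha = lssvr_fit(X, y)
pred = lssvr_predict(X, b, alpha, X)
```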

6.
Formation switching is an important topic in multi-robot formation research. To achieve distributed formation switching when no correspondence between robot IDs and target positions is specified in advance, this paper builds a new bait-predator system inspired by the foraging behavior of bird flocks, fish schools, and other swarming organisms, and applies it to the formation-switching control of distributed multi-robot formations. Each robot in the formation is treated as a predator and each target position as bait; switching between formations is realized through the interaction between predators and bait, and concrete mathematical models are established for the various influences acting on the predators. The method requires no pre-planned motion path for each robot during formation switching and can be applied in highly distributed formations. Simulation experiments show that the control strategy is well suited to formation switching of distributed multi-robot formations and provides a new approach to the problem.

7.
The two main components of the NuBot robot, the omnidirectional vision and omnidirectional motion systems, are briefly introduced, together with a kinematic analysis. Based on this robot platform, two tracking algorithms, D-A control and D-D control, are proposed. Distributed control of a multi-robot formation is achieved through relative localization between robots and local communication, while the algorithms also allow independent control of each robot's heading. For formation obstacle avoidance under different conditions, two methods, formation deformation and formation transformation, are proposed. Simulations and experiments on real robots show that the D-A control method achieves smooth formation transformation, and that the formation-deformation method guarantees successful obstacle avoidance while preserving the original formation as much as possible.

8.
Research on an uncalibrated visual servo control technique   Cited by: 3 (self-citations: 0, others: 3)
赵杰  李牧  李戈  闫继宏 《控制与决策》2006,21(9):1015-1019
In visual servo control, the camera and the robot's kinematic model cannot be calibrated precisely, while current uncalibrated visual servoing techniques either handle only static targets or, when handling moving targets, cannot escape the effect of large deviations. To address this, a dynamic uncalibrated visual servo control method is proposed: the robot is controlled to track a moving target based on nonlinear variance minimization, the image Jacobian matrix is estimated with a dynamic quasi-Newton method, iterative least squares is used to improve system stability, and an uncalibrated control strategy for large-deviation conditions is presented. Simulation experiments confirm the correctness and effectiveness of the method.
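A hedged sketch of the kind of quasi-Newton (Broyden-style) update commonly used to estimate an image Jacobian online in uncalibrated visual servoing; the gain and array shapes are placeholders, and this is not necessarily the paper's exact formulation:

```python
import numpy as np

def broyden_update(J, dq, df, beta=1.0):
    """Rank-one (Broyden) correction of the image Jacobian estimate J,
    given a joint displacement dq and the observed feature change df."""
    dq = dq.reshape(-1, 1)
    df = df.reshape(-1, 1)
    denom = float(dq.T @ dq)
    if denom < 1e-12:                 # ignore negligible motions
        return J
    return J + beta * (df - J @ dq) @ dq.T / denom

def servo_step(J, features, target, gain=0.5):
    """One resolved-rate step: move the joints to reduce the feature error."""
    error = target - features
    dq = gain * np.linalg.pinv(J) @ error
    return dq
```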

9.
闫敬  徐龙  曹文强  杨睍  罗小元 《控制与决策》2023,38(5):1457-1463
Considering unknown underwater channels and uncertain model parameters, a formation control algorithm for multiple underwater vehicles based on deep reinforcement learning is proposed. First, a least-squares estimator based on sampled environmental data is proposed to predict the unknown channel in a fading environment. Second, from the signal-to-noise ratio (SNR) produced by the channel-prediction estimator, a joint optimization problem of communication effectiveness and formation stability is formulated, and a formation control algorithm based on the deep deterministic policy gradient (DDPG) deep reinforcement learning algorithm is given. Finally, simulations and experiments verify the effectiveness of the proposed algorithm; according to the simulation results, compared with direct formation control, the proposed algorithm improves communication performance by 13.5% when communication effectiveness is taken into account.
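A minimal sketch of a least-squares fit of a log-distance path-loss model from sampled (distance, received power) pairs, one common way to predict SNR in a fading channel; the model form, parameters, and values are assumptions for illustration, not the paper's estimator:

```python
import numpy as np

# Fit PL(d) = PL0 + 10 * n * log10(d / d0) by ordinary least squares,
# then predict SNR from transmit power and noise floor (all values illustrative).
def fit_path_loss(distances, rx_powers_dbm, tx_power_dbm=10.0, d0=1.0):
    x = 10.0 * np.log10(np.asarray(distances) / d0)
    losses = tx_power_dbm - np.asarray(rx_powers_dbm)       # measured path loss in dB
    A = np.column_stack([np.ones_like(x), x])
    (pl0, n), *_ = np.linalg.lstsq(A, losses, rcond=None)   # least-squares estimate
    return pl0, n

def predict_snr(distance, pl0, n, tx_power_dbm=10.0, noise_dbm=-90.0, d0=1.0):
    loss = pl0 + 10.0 * n * np.log10(distance / d0)
    return tx_power_dbm - loss - noise_dbm                  # SNR in dB
```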

10.
This work studies distributed event-triggered formation control for multi-agent systems with a directed communication topology, taking two-wheel differential-drive robots as the controlled plants. First, a kinematic model of the wheeled robot is established and converted into a linear double-integrator model by dynamic feedback linearization. Second, a distributed formation controller is designed according to the communication topology. Then, based on the Lyapunov stability theorem, an event trigger is designed under the stability condition, thereby achieving distributed event-triggered formation control, … [abstract truncated in the source]
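A hedged sketch of the differential-drive kinematics such controllers start from; the wheel parameters and Euler integration step are illustrative assumptions:

```python
import math

# Differential-drive kinematics: wheel speeds -> body velocities -> pose update.
def diff_drive_step(pose, wl, wr, wheel_radius=0.05, track=0.30, dt=0.01):
    """pose = (x, y, theta); wl, wr are left/right wheel angular speeds [rad/s]."""
    v = wheel_radius * (wr + wl) / 2.0          # forward speed
    w = wheel_radius * (wr - wl) / track        # yaw rate
    x, y, theta = pose
    x += v * math.cos(theta) * dt               # simple Euler integration
    y += v * math.sin(theta) * dt
    theta += w * dt
    return (x, y, theta)
```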

11.
Kernel-based least squares policy iteration for reinforcement learning.   Cited by: 4 (self-citations: 0, others: 0)
In this paper, we present a kernel-based least squares policy iteration (KLSPI) algorithm for reinforcement learning (RL) in large or continuous state spaces, which can be used to realize adaptive feedback control of uncertain dynamic systems. By using KLSPI, near-optimal control policies can be obtained without much a priori knowledge of the dynamic models of control plants. In KLSPI, Mercer kernels are used in the policy evaluation of a policy iteration process, where a new kernel-based least squares temporal-difference algorithm called KLSTD-Q is proposed for efficient policy evaluation. To keep the sparsity and improve the generalization ability of KLSTD-Q solutions, a kernel sparsification procedure based on approximate linear dependency (ALD) is performed. Compared to previous work on approximate RL methods, KLSPI makes two advances that eliminate the main difficulties of existing results. One is the better convergence and (near) optimality guarantee obtained by using the KLSTD-Q algorithm for policy evaluation with high precision. The other is automatic feature selection using the ALD-based kernel sparsification. Therefore, the KLSPI algorithm provides a general RL method with generalization performance and convergence guarantees for large-scale Markov decision problems (MDPs). Experimental results on a typical RL task for a stochastic chain problem demonstrate that KLSPI can consistently achieve better learning efficiency and policy quality than the previous least squares policy iteration (LSPI) algorithm. Furthermore, the KLSPI method was also evaluated on two nonlinear feedback control problems, including a ship heading control problem and the swing-up control of a double-link underactuated pendulum called the acrobot. Simulation results illustrate that the proposed method can optimize controller performance using little a priori information about uncertain dynamic systems. It is also demonstrated that KLSPI can be applied to online learning control by incorporating an initial controller to ensure online performance.
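A hedged sketch of approximate linear dependency (ALD) based dictionary construction, the kernel sparsification step described above; the kernel, threshold, and toy data are placeholders:

```python
import numpy as np

def gauss_kernel(a, b, sigma=1.0):
    return float(np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2)))

def ald_dictionary(samples, nu=0.1, sigma=1.0):
    """Build a sparse dictionary: keep a sample only if it is not (approximately)
    linearly dependent on the kernel features of the current dictionary."""
    dictionary = [samples[0]]
    for x in samples[1:]:
        K = np.array([[gauss_kernel(a, b, sigma) for b in dictionary] for a in dictionary])
        k = np.array([gauss_kernel(d, x, sigma) for d in dictionary])
        c = np.linalg.solve(K + 1e-9 * np.eye(len(dictionary)), k)
        delta = gauss_kernel(x, x, sigma) - k @ c      # ALD test statistic
        if delta > nu:
            dictionary.append(x)
    return dictionary

# Toy usage: many nearby 2-D points collapse to a handful of dictionary atoms.
pts = [np.random.randn(2) * 0.1 for _ in range(200)]
print(len(ald_dictionary(pts, nu=0.01)))
```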

12.
Policy iteration reinforcement learning based on geodesic Gaussian bases over a state-action graph   Cited by: 3 (self-citations: 2, others: 1)
In policy iteration reinforcement learning, the construction of basis functions is an important factor affecting the approximation accuracy of the action-value function. To provide suitable basis functions for action-value approximation, a policy iteration reinforcement learning method based on geodesic Gaussian bases over a state-action graph is proposed. First, a graph-theoretic state-action description of the Markov decision process is built with an off-policy method; then, geodesic Gaussian kernel functions are defined on the state-action graph, and a kernel sparsification method based on approximate linear dependency is used to automatically select the geodesic Gaussian… [abstract truncated in the source]

13.
This article proposes three novel time-varying policy iteration algorithms for the finite-horizon optimal control problem of continuous-time affine nonlinear systems. We first propose a model-based time-varying policy iteration algorithm. The method considers time-varying solutions to the Hamilton–Jacobi–Bellman equation for finite-horizon optimal control. Based on this algorithm, value function approximation is applied to the Bellman equation by establishing neural networks with time-varying weights. A novel update law for the time-varying weights is put forward based on the idea of iterative learning control, which obtains optimal solutions more efficiently than previous works. Considering that system models may be unknown in real applications, we propose a partially model-free time-varying policy iteration algorithm that applies integral reinforcement learning to acquire the time-varying value function. Moreover, analysis of convergence, stability, and optimality is provided for every algorithm. Finally, simulations for different cases are given to verify the convenience and effectiveness of the proposed algorithms.
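As a simple, hedged point of reference for finite-horizon optimal control with time-varying solutions, the discrete-time LQR backward Riccati recursion is sketched below; this is a deliberately simplified linear analogue, not the paper's continuous-time nonlinear algorithm, and the system matrices are illustrative:

```python
import numpy as np

def finite_horizon_lqr(A, B, Q, R, Qf, N):
    """Backward Riccati sweep: returns the time-varying gains K_0..K_{N-1}."""
    P = Qf
    gains = []
    for _ in range(N):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # time-varying feedback gain
        P = Q + A.T @ P @ A - A.T @ P @ B @ K
        gains.append(K)
    return list(reversed(gains))        # K_k for k = 0..N-1, with u_k = -K_k x_k

# Toy double-integrator example (illustrative values).
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
gains = finite_horizon_lqr(A, B, np.eye(2), np.array([[0.1]]), 10 * np.eye(2), N=50)
```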

14.
Policy iteration, which evaluates and improves the control policy iteratively, is a reinforcement learning method. Policy evaluation with the least-squares method can draw more useful information from the empirical data and therefore improve data validity. However, most existing online least-squares policy iteration methods use each sample only once, resulting in a low utilization rate. With the goal of improving utilization efficiency, we propose an experience replay for least-squares policy iteration (ERLSPI) method and prove its convergence. ERLSPI combines online least-squares policy iteration with experience replay: it stores the samples generated online and reuses them with the least-squares method to update the control policy. We apply ERLSPI to the inverted pendulum system, a typical benchmark test. The experimental results show that the method can effectively take advantage of previous experience and knowledge, improve the empirical utilization efficiency, and accelerate the convergence speed.

15.
季挺  张华 《控制与决策》2017,32(12):2153-2161
To address the problems that current approximate policy iteration reinforcement learning algorithms generally require heavy computation and cannot construct their basis functions fully automatically, a nonparametric approximate generalized policy iteration reinforcement learning algorithm based on state clustering (NPAGPI-SC) is proposed. The algorithm collects samples with a two-stage random sampling process, computes the initial parameters of the approximator with a trial-and-error process and an estimation method aiming at full sample coverage, adaptively adjusts the approximator during learning using the delta rule and the nearest-neighbor idea, and selects the action to execute with a greedy policy. Simulation results on balancing control of a single inverted pendulum verify the effectiveness and robustness of the proposed algorithm.

16.
To address the high computational complexity and slow convergence of online approximate policy iteration reinforcement learning, the CMAC structure is introduced as the value function approximator and a CMAC-based nonparametric approximate policy iteration reinforcement learning (NPAPI-CMAC) algorithm is proposed. The algorithm determines the CMAC generalization parameter by constructing a sample-collection process, determines the CMAC state partitioning through initial and extended partitions, and defines the reinforcement learning rate from sample-count sets built with a quantized coding structure, thereby achieving fully automatic construction of the reinforcement learning structure and parameters. In addition, the algorithm adaptively adjusts the reinforcement learning parameters during learning with the delta rule and the nearest-neighbor idea, and uses a greedy policy to select from the results produced by an action voter. Simulation results on balancing control of a single inverted pendulum verify the algorithm's effectiveness, robustness, and fast convergence.
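A minimal sketch of a CMAC / tile-coding function approximator with a delta-rule weight update, the generic approximator the algorithm builds on; the number of tilings, resolution, and learning rate are illustrative assumptions:

```python
import numpy as np

class CMAC:
    """Tile-coding (CMAC) approximator over a 1-D input in [lo, hi]."""
    def __init__(self, lo, hi, n_tilings=8, n_tiles=16, alpha=0.1):
        self.lo, self.hi = lo, hi
        self.n_tilings, self.n_tiles, self.alpha = n_tilings, n_tiles, alpha
        self.w = np.zeros((n_tilings, n_tiles + 1))
        self.width = (hi - lo) / n_tiles

    def _indices(self, x):
        # Each tiling is offset by a fraction of a tile width.
        for t in range(self.n_tilings):
            offset = t * self.width / self.n_tilings
            idx = int((x - self.lo + offset) / self.width)
            yield t, min(max(idx, 0), self.n_tiles)

    def predict(self, x):
        return sum(self.w[t, i] for t, i in self._indices(x))

    def update(self, x, target):
        # Delta rule: distribute the prediction error over the active tiles.
        err = target - self.predict(x)
        for t, i in self._indices(x):
            self.w[t, i] += self.alpha * err / self.n_tilings

# Toy usage: approximate sin(x) on [0, 2*pi].
cmac = CMAC(0.0, 2 * np.pi)
for _ in range(2000):
    x = np.random.uniform(0.0, 2 * np.pi)
    cmac.update(x, np.sin(x))
```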

17.
Several researchers have recently investigated the connection between reinforcement learning and classification. We are motivated by proposals of approximate policy iteration schemes without value functions, which focus on policy representation using classifiers and address policy learning as a supervised learning problem. This paper proposes variants of an improved policy iteration scheme which addresses the core sampling problem in evaluating a policy through simulation as a multi-armed bandit machine. The resulting algorithm offers performance comparable to that of the previous algorithm, achieved, however, with significantly less computational effort. An order-of-magnitude improvement is demonstrated experimentally in two standard reinforcement learning domains: inverted pendulum and mountain-car.
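A hedged sketch of treating rollout allocation as a multi-armed bandit: a UCB1 rule decides which candidate action to simulate next when estimating the best action in a sampled state; the rollout function, budget, and toy arms are placeholders, not the paper's exact scheme:

```python
import math
import random

def ucb_best_action(rollout, actions, budget=200, c=1.4):
    """Spend a fixed simulation budget across candidate actions with UCB1
    and return the action with the highest mean rollout return."""
    counts = {a: 0 for a in actions}
    sums = {a: 0.0 for a in actions}
    for a in actions:                      # pull each arm once
        sums[a] += rollout(a); counts[a] += 1
    for t in range(len(actions), budget):
        def ucb(a):
            return sums[a] / counts[a] + c * math.sqrt(math.log(t + 1) / counts[a])
        a = max(actions, key=ucb)
        sums[a] += rollout(a); counts[a] += 1
    return max(actions, key=lambda a: sums[a] / counts[a])

# Toy usage: noisy arms whose true means differ (placeholder for policy rollouts).
means = {0: 0.2, 1: 0.5, 2: 0.4}
best = ucb_best_action(lambda a: random.gauss(means[a], 1.0), actions=[0, 1, 2])
```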

18.
To overcome the limitations of symmetric full-state constraints and frequent control policy updates, while optimizing an infinite-horizon cost function, an event-triggered integral reinforcement learning controller design method with state constraints is proposed for a class of affine nonlinear continuous-time systems with partially unknown dynamics. The method is a data-based online policy iteration method. A system transformation is introduced to convert the system with full-state constraints into an unconstrained one. Based on an event-triggering mechanism and the integral reinforcement learning algorithm, by alternately performing system transformation, policy evaluation, and policy improvement, the cost function and the control policy converge to their optimal values while the full-state constraints are satisfied, and the update frequency of the control policy is reduced. Moreover, the stability of the system and of the critic neural network weight errors is rigorously analyzed by constructing Lyapunov functions. Simulation on a single-link robotic arm further illustrates the feasibility of the algorithm.

19.
This paper studies the cooperative control problem for a class of multiagent dynamical systems with partially unknown nonlinear system dynamics. In particular, the control objective is to solve the state consensus problem for multiagent systems based on the minimisation of certain cost functions for individual agents. Under the assumption that there exist admissible cooperative controls for such class of multiagent systems, the formulated problem is solved through finding the optimal cooperative control using the approximate dynamic programming and reinforcement learning approach. With the aid of neural network parameterisation and online adaptive learning, our method renders a practically implementable approximately adaptive neural cooperative control for multiagent systems. Specifically, based on the Bellman's principle of optimality, the Hamilton–Jacobi–Bellman (HJB) equation for multiagent systems is first derived. We then propose an approximately adaptive policy iteration algorithm for multiagent cooperative control based on neural network approximation of the value functions. The convergence of the proposed algorithm is rigorously proved using the contraction mapping method. The simulation results are included to validate the effectiveness of the proposed algorithm.
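A minimal sketch of the basic state-consensus protocol that such optimal cooperative controllers refine, i.e., each agent moves toward its neighbors according to the graph Laplacian; the adjacency matrix, step size, and values are illustrative assumptions:

```python
import numpy as np

# Discrete-time simulation of x_i <- x_i - eps * sum_j a_ij * (x_i - x_j).
def simulate_consensus(x0, adjacency, eps=0.05, steps=200):
    x = np.array(x0, dtype=float)
    A = np.asarray(adjacency, dtype=float)
    L = np.diag(A.sum(axis=1)) - A              # graph Laplacian
    for _ in range(steps):
        x = x - eps * L @ x
    return x

# Toy usage: four agents on a ring graph converge toward a common value.
ring = [[0, 1, 0, 1],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [1, 0, 1, 0]]
print(simulate_consensus([1.0, 3.0, -2.0, 0.5], ring))
```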

