Similar literature
20 similar records retrieved (search time: 171 ms)
1.
Adaptive iterative learning control for nonlinear parameterized systems   Cited by: 3 (self-citations: 1, citations by others: 2)
This paper studies the learning control problem for a class of nonlinear parameterized systems with unknown time-varying parameters. Using the parameter separation technique and the idea of signal replacement, the system equations are rearranged so that all time-varying parameters are merged into a single unknown time-varying parameter, which is then estimated by an iterative adaptive method. An adaptive iterative learning control scheme is designed such that the integral of the squared tracking error over a finite interval converges asymptotically to zero. By constructing a Lyapunov-like function, a sufficient condition is given for convergence of the tracking error and boundedness of all closed-loop signals. Simulation results verify the effectiveness of the method.
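As a rough illustration of the convergence property stated above, the quantity driven to zero is the integral of the squared tracking error over the finite interval; the symbols below (tracking error e_i at iteration i, interval [0, T]) are assumed for illustration and are not taken from the paper.

```latex
% Asymptotic convergence of the squared tracking error in the iteration domain
% (notation assumed: e_i is the tracking error at iteration i, [0,T] the finite interval):
\lim_{i \to \infty} \int_{0}^{T} e_i^{2}(t)\,\mathrm{d}t = 0
```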

2.
To address problems that intelligent mobile robots currently face when learning in unknown environments, such as poor learning initiative, poor real-time performance, and the inability to accumulate knowledge and experience online, this paper, inspired by intrinsic motivation in psychology, proposes an intrinsically motivated online autonomous learning method for mobile robots in unknown environments, which alleviates these problems to some extent. Within a Q-learning framework, the reward mechanism is replaced by a psychologically inspired intrinsic motivation signal, improving the robot's learning initiative in unknown environments; meanwhile, an incremental self-organizing neural network replaces the lookup table of classical Q-learning to realize the mapping from the input space to the output space, enabling the robot to learn the unknown environment incrementally online. Experimental results show that the intrinsically motivated approach improves the robot's learning initiative in unknown environments and noticeably increases its level of intelligence.
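A minimal sketch of the core idea, replacing the external reward in tabular Q-learning with an intrinsic-motivation signal. The count-based novelty bonus and all names below are illustrative assumptions; the paper itself uses a psychologically inspired motivation signal and an incremental self-organizing network in place of the lookup table.

```python
import numpy as np
from collections import defaultdict

class IntrinsicQLearner:
    """Hypothetical sketch: Q-learning whose reward is an intrinsic novelty signal."""
    def __init__(self, n_actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.q = defaultdict(lambda: np.zeros(n_actions))  # table kept here for brevity
        self.visits = defaultdict(int)
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.n_actions = n_actions

    def intrinsic_reward(self, state):
        # Curiosity-style signal: rarely visited states are more "interesting".
        self.visits[state] += 1
        return 1.0 / np.sqrt(self.visits[state])

    def act(self, state):
        if np.random.rand() < self.epsilon:
            return np.random.randint(self.n_actions)
        return int(np.argmax(self.q[state]))

    def update(self, s, a, s_next):
        # Standard TD(0) backup, but driven by the intrinsic signal instead of an external reward.
        r_int = self.intrinsic_reward(s_next)
        td_target = r_int + self.gamma * np.max(self.q[s_next])
        self.q[s][a] += self.alpha * (td_target - self.q[s][a])
```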

3.
For a class of nonlinear parameterized systems with unknown control directions and time delays, this paper designs an adaptive iterative learning control algorithm. In the controller design, the parameter separation technique and the idea of signal replacement are used to handle the delay terms, and the Nussbaum gain technique is used to deal with the unknown control directions. To estimate the unknown time-varying and time-invariant parameters of the system, difference-type and differential-type parameter learning laws are designed, respectively. A Lyapunov-Krasovskii composite energy function is then constructed to give conditions for asymptotic convergence of the tracking error and boundedness of all closed-loop signals. Finally, a simulation example illustrates the effectiveness of the controller design.

4.
For the multi-agent multi-target cooperation problem in dynamic unknown environments, a functional reward function is designed and the reinforcement learning algorithm is improved so that multiple agents can reach all target points simultaneously. The agents interact with the environment, repeatedly cycling through "exploration-learning-decision", accumulating experience and optimizing their policies. Without targets being assigned in advance, the agents cooperate in their decision making to avoid both static and dynamic obstacles and reach all target points at the same time. Simulation results show that, compared with existing multi-agent cooperation methods, the algorithm improves learning speed by about 42.86% on average; the agents also obtain higher rewards, make decisions and assign targets autonomously, and achieve the goal of reaching all target points simultaneously.
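The abstract does not give the concrete form of the functional reward; the sketch below is a hypothetical example of the kind of shaped reward being described, rewarding progress toward the nearest remaining target and penalizing collisions. All names and constants are assumptions for illustration only.

```python
import math

def shaped_reward(pos, prev_pos, targets, collided, reach_radius=0.1):
    """Hypothetical functional reward: progress toward the nearest remaining
    target, a bonus for reaching it, and a penalty for collisions."""
    if collided:
        return -1.0
    d_prev = min(math.dist(prev_pos, t) for t in targets)
    d_now = min(math.dist(pos, t) for t in targets)
    reward = d_prev - d_now            # positive when the agent moves closer
    if d_now < reach_radius:
        reward += 1.0                  # bonus for arriving at a target
    return reward
```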

5.
Existing deep-reinforcement-learning-based trajectory planning methods for robotic manipulators suffer from low learning efficiency in unknown environments and poor robustness of the planned policy. To address these problems, a manipulator trajectory planning method named A-DPPO, based on a novel orientation reward function, is proposed. The new reward function is designed from the relative direction and relative position, reducing ineffective exploration and improving learning efficiency. Distributed proximal policy optimization (DPPO) is applied to manipulator trajectory planning for the first time, improving the robustness of the planning policy. Experiments show that, compared with existing methods, A-DPPO effectively improves both learning efficiency and the robustness of the planning policy.
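The exact A-DPPO reward is not specified in the abstract; the sketch below is a hypothetical orientation-style reward built from relative position (distance to the goal) and relative direction (alignment of the end-effector's motion with the goal direction), with all names and weights assumed.

```python
import numpy as np

def orientation_reward(ee_pos, ee_vel, goal, w_dist=1.0, w_align=0.5):
    """Hypothetical orientation reward: penalize distance to the goal and
    reward velocity that points toward the goal."""
    to_goal = goal - ee_pos
    dist = np.linalg.norm(to_goal)
    speed = np.linalg.norm(ee_vel)
    if dist < 1e-6 or speed < 1e-6:
        align = 0.0
    else:
        align = float(np.dot(ee_vel, to_goal) / (speed * dist))  # cosine of the heading error
    return -w_dist * dist + w_align * align
```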

6.
For the output tracking control problem of a class of periodic nonlinear time-delay systems with unknown parameters, a periodic adaptive iterative learning tracking control algorithm is designed. The method uses the idea of signal replacement to reorganize the system and, under the assumption that the periods of the unknown time-varying parameters and of the reference output have a known least common multiple, merges the delay terms and the other uncertain time-varying terms into a new periodic auxiliary time-varying parameter, which is then estimated by a periodic adaptive algorithm. By constructing a Lyapunov-Krasovskii-type composite energy function, the convergence of the system is analyzed, and it is proved that after repeated iterative learning all closed-loop signals are bounded and the output tracking error converges. Finally, a numerical example is constructed for simulation verification. Theoretical analysis and simulation results show that the algorithm is simple and effective and achieves good control performance for the tracking problem of nonlinear time-delay systems.

7.
Building on a stochastic-game model of the spectrum auction mechanism, a spectrum management algorithm based on value-decomposition multi-agent cooperation is presented. The algorithm requires no state transition probabilities; it accounts for cooperation among secondary users by decomposing the team reward into per-user value functions and then backpropagating the error to each secondary user's value function. Decomposing the team reward avoids spurious reward signals and improves learning efficiency.
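A minimal sketch of the value-decomposition idea described above, assuming a VDN-style additive decomposition (the network sizes, names, and PyTorch framing are illustrative, not the paper's implementation): the joint Q-value is the sum of per-user Q-values, so a single TD error computed from the team reward backpropagates into every secondary user's value function.

```python
import torch
import torch.nn as nn

class PerUserQ(nn.Module):
    """Q-network of one secondary user: local observation in, Q-value per action out."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))

    def forward(self, obs):
        return self.net(obs)

def vdn_loss(q_nets, obs, actions, team_reward, next_obs, gamma=0.99):
    """Joint Q = sum of the chosen per-user Q-values; one TD error on the
    shared team reward is backpropagated into every user's network."""
    q_joint = sum(net(o).gather(1, a.unsqueeze(1)).squeeze(1)
                  for net, o, a in zip(q_nets, obs, actions))
    with torch.no_grad():
        q_next = sum(net(o2).max(dim=1).values for net, o2 in zip(q_nets, next_obs))
    target = team_reward + gamma * q_next
    return nn.functional.mse_loss(q_joint, target)
```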

8.
Predictive control of nonlinear systems based on wavelet neural networks   Cited by: 1 (self-citations: 0, citations by others: 1)
A one-step-ahead predictive control algorithm for unknown nonlinear systems based on a wavelet basis-function neural network is proposed. The method uses a wavelet network to learn the nonlinear system and employs the wavelet neural network model as the system's prediction model; the control signal is obtained directly by minimizing the deviation between the desired output and the predicted output. Simulation of a nonlinear system demonstrates the effectiveness of the method.
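A rough sketch of the one-step-ahead control law described above, with notation assumed rather than taken from the paper: the trained wavelet network serves as the prediction model, and the control input is chosen to minimize the squared deviation between the desired and predicted outputs.

```latex
% One-step-ahead predictive control objective (notation assumed):
% y_r is the desired output, \hat{f} the wavelet-network prediction model.
u(k) = \arg\min_{u}\; \bigl[\, y_r(k+1) - \hat{f}\bigl(y(k), \dots, y(k-n), u\bigr) \bigr]^{2}
```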

9.
Learning robot manipulation skills with deep reinforcement learning has become a research hotspot, but learning efficiency is low because such tasks provide only sparse rewards. This paper proposes a hindsight experience replay method with dual experience buffers and adaptive soft updates based on meta-learning (DAS-HER) and applies it to robot manipulation skill learning under sparse rewards. First, a simplified value function that improves algorithmic efficiency is derived from the soft-update hindsight experience replay algorithm, and an adaptive temperature adjustment strategy is added to dynamically tune the temperature parameter for different task environments. Second, drawing on meta-learning, the experience replay is partitioned and the ratio of real sampled data to constructed virtual data is adjusted dynamically during training, yielding the DAS-HER method. Then, DAS-HER is applied to robot manipulation skill learning, and a general framework for learning manipulation skills in sparse-reward environments is built. Finally, comparative experiments on eight tasks in the Fetch and Hand environments under MuJoCo show that the proposed algorithm outperforms the other algorithms in both training efficiency and success rate.
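The method builds on hindsight experience replay; a minimal sketch of the underlying hindsight relabeling step is given below. The "future" goal-sampling strategy and all names are assumptions, and the DAS-HER additions (dual buffers, adaptive temperature, meta-learned real/virtual data ratio) are not reproduced here.

```python
import random

def her_relabel(episode, reward_fn, k=4):
    """Hindsight relabeling ('future' strategy): for each transition, also store
    copies whose goal is an achieved goal from later in the same episode.
    episode: list of dicts with keys obs, action, achieved_goal, goal."""
    relabeled = []
    for t, tr in enumerate(episode):
        relabeled.append(tr)
        future = episode[t:]
        for _ in range(k):
            new_goal = random.choice(future)["achieved_goal"]
            r = reward_fn(tr["achieved_goal"], new_goal)  # e.g. 0 if reached, else -1
            relabeled.append({**tr, "goal": new_goal, "reward": r})
    return relabeled
```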

10.
Existing intrinsic rewards gradually vanish as the agent explores the environment, so the agent can no longer use the intrinsic reward signal to guide its search for an optimal policy. To solve this problem, an intrinsic-reward-based method for skill acquisition and composition is proposed. The method first identifies positive states while the agent interacts with the environment and selects subgoals from them; it then discovers skills from trajectories that run from the initial state to a subgoal and from a subgoal to a terminal state, and composes skills that contain one or more subgoals; finally, each skill is evaluated by the distance from the initial state to its subgoal and by the cumulative reward from the initial state to that subgoal. The method achieves high average rewards in MuJoCo environments, including when the extrinsic reward is delayed, showing that the proposed subgoals and skills effectively solve the problem of the agent being unable to learn an optimal policy from the intrinsic reward signal once it has vanished.

11.
Design of heuristic reward functions in reinforcement learning algorithms and convergence analysis   Cited by: 3 (self-citations: 0, citations by others: 3)
(Key Laboratory of Robotics, Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang 110016)

12.
Reinforcement learning (RL) can provide a basic framework for autonomous robots to learn to control and maximize future cumulative rewards in complex environments. To achieve high performance, RL controllers must consider the complex external dynamics for movements and the task (reward function) and optimize control commands. For example, a robot playing tennis or squash needs to cope with the different dynamics of a tennis or squash racket and with such dynamic environmental factors as the wind. In addition, this robot has to tailor its tactics simultaneously to the rules of either game. This double complexity of the external dynamics and reward function sometimes becomes even harder when both the multiple dynamics and the multiple reward functions switch implicitly, as in a real (multi-agent) game of tennis, where one player cannot observe the intention of her opponents or her partner. The robot must consider its opponent's and its partner's unobservable behavioral goals (reward functions). In this article, we address how an RL agent should be designed to handle such double complexity of dynamics and reward. We have previously proposed modular selection and identification for control (MOSAIC) to cope with nonstationary dynamics, where appropriate controllers are selected and learned among many candidates based on the error of each controller's paired dynamics predictor: the forward model. Here we extend this framework to RL and propose the MOSAIC-MR architecture. It resembles MOSAIC in spirit and selects and learns an appropriate RL controller based on the RL controller's TD error, using the errors of the dynamics (forward-model) and reward predictors. Furthermore, unlike other MOSAIC variants for RL, RL controllers are not paired a priori with fixed predictors of dynamics and rewards. The simulation results demonstrate that MOSAIC-MR outperforms its counterparts because of this flexible association ability among RL controllers, forward models, and reward predictors.

13.
The integration of reinforcement learning (RL) and imitation learning (IL) is an important problem that has long been studied in the field of intelligent robotics. RL optimizes policies to maximize the cumulative reward, whereas IL attempts to extract general knowledge about the trajectories demonstrated by experts, i.e., demonstrators. Because each has its own drawbacks, many methods that combine the two and compensate for each other's drawbacks have been explored. However, many of these methods are heuristic and lack a solid theoretical basis. This paper presents a new theory for integrating RL and IL by extending the probabilistic graphical model (PGM) framework for RL known as control as inference. We develop a new PGM for RL with multiple types of rewards, called the probabilistic graphical model for Markov decision processes with multiple optimality emissions (pMDP-MO). Furthermore, we demonstrate that the integrated learning method of RL and IL can be formulated as probabilistic inference of policies on the pMDP-MO by considering the discriminator of generative adversarial imitation learning (GAIL) as an additional optimality emission. We adapt GAIL and a task-achievement reward to the proposed framework, achieving significantly better performance than policies trained with baseline methods.

14.
This paper proposes an algorithm to deal with continuous state/action spaces in reinforcement learning (RL) problems. Extensive studies have addressed continuous-state RL problems, but RL with continuous action spaces still requires more research. Because RL problems are non-stationary, very large, and continuous in nature, the proposed algorithm uses two growing self-organizing maps (GSOM) to elegantly approximate the state/action space through the addition and deletion of neurons. GSOM has been shown to outperform the standard SOM in topology preservation, quantization error reduction, and approximation of non-stationary distributions. The novel algorithm proposed in this paper attempts to simultaneously find the best representation of the state space, an accurate estimation of the Q-values, and an appropriate representation of the highly rewarded regions of the action space. Experimental results on delayed-reward, non-stationary, and large-scale problems demonstrate very satisfactory performance of the proposed algorithm.

15.
Reinforcement learning is an artificial intelligence approach with clear computational logic and easily extensible models; with little or even no prior information, it can tune policy performance by interacting with the environment and maximizing a value function, effectively reducing the complexity introduced by physical models. Policy-gradient-based reinforcement learning has already been applied successfully to intelligent image recognition, robot control, and path planning for autonomous driving. However, because reinforcement learning depends heavily on sampling, its training requires a large number of samples to converge, and the accuracy of its decisions is severely affected by slight disturbances that do not match the simulation environment. In particular, when reinforcement learning is applied to control, convergence cannot be guaranteed and stability is difficult to prove, so the method needs to be improved. Since swarm intelligence algorithms solve complex problems through group cooperation and feature strong self-organization and stability, using them to optimize reinforcement learning is an effective way to improve the stability of reinforcement learning models. This work combines the pigeon-inspired optimization algorithm from swarm intelligence with policy-gradient reinforcement learning: to address the possibility that the iterative solution of the policy gradient fails to converge, a pigeon-inspired reinforcement learning algorithm is proposed that solves the policy gradient with the goal of maximizing future reward and combines the fitness function of the pigeon-inspired algorithm with reinforcement learning to evaluate policy quality, avoiding infinite loops in the solution process and improving the stability of the reinforcement learning algorithm. Simulation verification is carried out on a nonlinear two-wheeled inverted-pendulum robot control system; the experimental results sho…

16.
In this paper, we address the problem of suboptimal behavior during online partially observable Markov decision process (POMDP) planning caused by time constraints on planning. Taking inspiration from the related field of reinforcement learning (RL), our solution is to shape the agent's reward function so as to lead the agent to large future rewards without having to spend as much time explicitly estimating cumulative future rewards, enabling the agent to save time, improve the breadth of its planning, and build higher-quality plans. Specifically, we extend potential-based reward shaping (PBRS) from RL to online POMDP planning. In our extension, information about belief states is added to the function optimized by the agent during planning. This information provides hints of where the agent might find high future rewards beyond its planning horizon, and thus achieve greater cumulative rewards. We develop novel potential functions measuring information useful to agent metareasoning in POMDPs (reflecting on agent knowledge and/or histories of experience with the environment), theoretically prove several important properties and benefits of using PBRS for online POMDP planning, and empirically demonstrate these results on a range of classic benchmark POMDP planning problems.
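For reference, the standard potential-based shaping term from RL that the paper extends to online POMDP planning has the following form; writing the potential Φ over beliefs b, b' reflects the belief-state extension described above (notation assumed).

```latex
% Potential-based reward shaping, written over belief states b, b':
F(b, a, b') = \gamma\,\Phi(b') - \Phi(b), \qquad
\tilde{R}(b, a, b') = R(b, a, b') + F(b, a, b')
```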

17.
Many image segmentation solutions are problem-specific. In medical images, the objects of interest have very similar grey levels and textures, so medical image segmentation still needs improvement despite decades of research. We design a self-learning framework to extract several objects of interest simultaneously from Computed Tomography (CT) images. Our segmentation method has a learning phase based on a reinforcement learning (RL) system. Each RL agent works on a particular sub-image of an input image to find a suitable value for each object in it. The RL system is defined by states, actions, and rewards. We define a set of actions for each state in the sub-image, and a reward function computes the reward for each action of the RL agent. Finally, the valuable information gained from discovering all states of the objects of interest is stored in a Q-matrix, and the result can be applied to the segmentation of similar images. Experimental results on cranial CT images demonstrate a segmentation accuracy above 95%.

18.
The ability to analyze the effectiveness of agent reward structures is critical to the successful design of multiagent learning algorithms. Though final system performance is the best indicator of the suitability of a given reward structure, it is often preferable to analyze the reward properties that lead to good system behavior (i.e., properties promoting coordination among the agents and providing agents with strong signal-to-noise ratios). This step is particularly helpful in continuous, dynamic, stochastic domains ill-suited to the simple table backup schemes commonly used in TD(λ)/Q-learning, where the effectiveness of the reward structure is difficult to distinguish from the effectiveness of the chosen learning algorithm. In this paper, we present a new reward evaluation method that provides a visualization of the tradeoff between the level of coordination among the agents and the difficulty of the learning problem each agent faces. This method is independent of the learning algorithm and is only a function of the problem domain and the agents' reward structure. We use this reward property visualization method to determine an effective reward without performing extensive simulations. We then test this method in both a static and a dynamic multi-rover learning domain where the agents have continuous state spaces and take noisy actions (e.g., the agents' movement decisions are not always carried out properly). Our results show that in the more difficult dynamic domain, the reward efficiency visualization method provides a two-order-of-magnitude speedup in selecting good rewards, compared to running a full simulation. In addition, this method facilitates the design and analysis of new rewards tailored to the observational limitations of the domain, providing rewards that combine the best properties of traditional rewards.

19.
Multi-user resource allocation in cognitive radio networks requires extensive exchange of channel and power strategy information, which occupies and consumes large amounts of system resources. To address this, user strategies are studied with a non-cooperative game model, and a joint channel selection and power control algorithm based on multi-user Q-learning is proposed. During self-learning, users adopt a common strategy and perform Q-learning by observing only their own rewards, gradually converging to the optimal set of channel and power allocations. Simulation results show that the algorithm converges to a Nash equilibrium with high probability, and the overall reward that users obtain through channel selection is very close to the maximum overall reward.

20.
Reinforcement learning is an effective way to improve the efficiency with which robots complete tasks. Popular learning methods generally use the cumulative discounted return, but in some respects the average return is better suited to multi-robot cooperation. The cumulative discounted return improves performance at the level of individual robot actions but does not yield good cooperation at the multi-robot task level, whereas the average-return approach can change this. This paper applies Monte Carlo learning based on the average return to multi-robot cooperation and obtains good learning results; experiments on real robots show that the average-return method outperforms the cumulative discounted return method.
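For clarity, the two optimization criteria contrasted above can be written as follows (standard textbook definitions, not taken from the paper):

```latex
% Cumulative discounted return vs. average reward (standard definitions):
G_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1},
\qquad
\rho^{\pi} = \lim_{N \to \infty} \frac{1}{N} \sum_{t=1}^{N} \mathbb{E}_{\pi}\left[ r_t \right]
```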
