期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

孟江华朱纪洪孙增圻《计算机工程与应用》2006,42(33):34-37

Option的自动发现与生成是递阶再励学习的难点之一,论文提出探索密度检测(ED)法,通过检测状态空间中的探索密度来发现并构建Option。和现有的方法相比具有和任务无关、不需要先验知识等优点;能很好地工作于完全未知的环境中;并且构造出的Option,在同一环境下不同任务间可以直接共享。相似文献

2.

竞争式Takagi-Sugeno模糊再励学习 总被引：4，自引：0，他引：4

晏雄伟邓志东孙增圻《自动化学报》2002,28(6):873-880

针对连续空间的复杂学习任务,提出了一种竞争式Takagi-Sugeno模糊再励学习网络 (CTSFRLN),该网络结构集成了Takagi-Sugeno模糊推理系统和基于动作的评价值函数的再励学习方法.文中相应提出了两种学习算法,即竞争式Takagi-Sugeno模糊Q-学习算法和竞争式Takagi-Sugeno模糊优胜学习算法,其把CTSFRLN训练成为一种所谓的Takagi-Sugeno模糊变结构控制器.以二级倒立摆控制系统为例,仿真研究表明所提出的学习算法在性能上优于其它的再励学习算法. 相似文献

3.

基于再励学习的多移动机器人协调避障路径规划方法 总被引：1，自引：0，他引：1

张芳颜国正林良明《计算机工程与应用》2003,39(3):80-83

随着多移动机器人协调系统的应用向未知环境发展,一些依赖于环境模型的路径规划方法不再适用。而利用再励学习与环境直接交互,不需要先验知识和样本数据的特点,该文将再励学习应用于多机器人协调系统中,提出了基于再励学习的避障路径规划方法,并将再励函数设计为基于行为分解的无模型非均匀结构。计算机仿真实验结果表明该方法有效,并有较好的鲁棒性,新的再励函数结构使得学习速度得以提高。相似文献

4.

基于Takagi-Sugeno的再励学习模糊神经网络控制

马力佳高岩《微计算机信息》2006,22(16):7-9

提出一种模糊神经网络的自适应控制方案。针对连续空间的复杂学习任务,提出了一种竞争式Takagi-Sugeno模糊再励学习网络,该网络结构集成了Takagi-Sugeno模糊推理系统和基于动作的评价值函数的再励学习方法。相应地,提出了一种优化学习算法,其把竞争式Takagi-Sugeno模糊再励学习网络训练成为一种所谓的Takagi-Sugeno模糊变结构控制器。以一级倒立摆控制系统为例,仿真研究表明所提出的学习算法在性能上优于其它的再励学习算法。相似文献

5.

基于Takagi—Sugeno的再励学习模糊神经网络控制

马力佳高岩《微计算机信息》2006,(6S):7-9

提出一种模糊神经网络的自适应控制方案。针对连续空间的复杂学习任务，提出了一种竞争式Takagi—Sugeno模糊再励学习网络，该网络结构集成了Takagi-Sugeno模糊推理系统和基于动作的评价值函数的再励学习方法。相应地，提出了一种优化学习算法，其把竞争式Takagi-Sugeno模糊再励学习网络训练成为一种所谓的Takagi-Sugeno模糊变结构控制器。以一级倒立摆控制系统为例．仿真研究表明所提出的学习算法在性能上优于其它的再励学习算法。相似文献

6.

一种基于特特征向量提取的FMDP模型求解方法

张双民石纯一《软件学报》2005,16(5):733-743

在诸如机器人足球赛等典型的可分解马尔可夫决策过程(factored Markov decision process,简称FMDP)模型中,不同状态属性在不同的状态下,对于状态评估的影响程度是不同的,其中存在若干关键状态属性,能够唯一或近似判断当前状态的好坏.为了解决FMDP模型中普遍存在的"维数灾"问题,在效用函数非线性的情况下,通过对状态特征向量的提取近似状态效用函数,同时根据对FMDP模型的认知程度,从线性规划和再励学习两种求解角度分别进行约束不等式组的化简和状态效用函数的高维移植,从而达到降低计算复杂度,加快联合策略生成速度的目的.以机器人足球赛任意球战术配合为背景进行实验来验证基于状态特征向量的再励学习算法的有效性和学习结果的可移植性.与传统再励学习算法相比,基于状态特征向量的再励学习算法能够极大地加快策略的学习速度.但更重要的是,还可以将学习到的状态效用函数方便地移植到更高维的FMDP模型中,从而直接计算出联合策略而不需要重新进行学习. 相似文献

7.

一种基于特征向量提取的FMDP模型求解方法 总被引：1，自引：0，他引：1

下载免费PDF全文

张双民石纯一《软件学报》2005,16(5):733-743

在诸如机器人足球赛等典型的可分解马尔可夫决策过程(factored Markov decision process，简称FMDP)模型中，不同状态属性在不同的状态下，对于状态评估的影响程度是不同的，其中存在若干关键状态属性，能够唯一或近似判断当前状态的好坏．为了解决FMDP模型中普遍存在的“维数灾”问题，在效用函数非线性的情况下，通过对状态特征向量的提取近似状态效用函数，同时根据对FMDP模型的认知程度，从线性规划和再励学习两种求解角度分别进行约束不等式组的化简和状态效用函数的高维移植，从而达到降低计算复杂度，加快联合策略生成速度的目的、以机器人足球赛任意球战术配合为背景进行实验来验证基于状态特征向量的再励学习算法的有效性和学习结果的可移植性．与传统再励学习算法相比，基于状态特征向量的再励学习算法能够极大地加快策略的学习速度．但更重要的是，还可以将学习到的状态效用函数方便地移植到更高维的FMDP模型中，从而直接计算出联合策略而不需要重新进行学习．相似文献

8.

基于再励学习的主动队列管理算法 总被引：6，自引：0，他引：6

下载免费PDF全文

张雁冰杭大明马正新曹志刚《软件学报》2004,15(7):1090-1098

从最优决策的角度出发,将人工智能中的再励学习方法引入主动队列管理的研究中,提出了一种基于再励学习的主动队列管理算法RLGD(reinforcement learning gradient-descent).RLGD以速率匹配和队列稳定为优化目标,根据网络状态自适应地调节更新步长,使得队列长度能够很快收敛到目标值,并且抖动很小.此外,RLGD不需要知道源端的速率调整算法,因而具有很好的可扩展性.通过不同网络环境下的仿真显示,RLGD与REM,PI等AQM算法相比,具有更好的性能和鲁棒性. 相似文献

9.

再励学习——原理,算法及其在智能控制中的应用 总被引：20，自引：0，他引：20

阎平凡《信息与控制》1996,25(1):28-34

综述了再励学习的原理，主要算法，基于神经网络的实现及其在智能控制中的作用，探讨了应进一步研究的问题。相似文献

10.

基于探索密度的Option子目标发现算法

孟江华朱纪洪孙增圻《模式识别与人工智能》2007,20(2)

提出状态探索密度的概念,通过检测状态对智能体探索环境能力的影响来发现学习的子目标并构建对应的Option.用该算法创建Option的再励学习算法能有效提高学习速度.算法具有和任务无关、不需要先验知识等优点,构造出的Option在同一环境下不同任务间可以直接共享. 相似文献

11.

基于强化学习的多技能项目调度算法

胡振涛崔南方胡雪君雷晓琪《控制理论与应用》2024,41(3):502-511

多技能项目调度存在组合爆炸的现象, 其问题复杂度远超传统的单技能项目调度, 启发式算法和元启发式算法在求解多技能项目调度问题时也各有缺陷. 为此, 根据项目调度的特点和强化学习的算法逻辑, 本文设计了基于强化学习的多技能项目调度算法. 首先, 将多技能项目调度过程建模为符合马尔科夫性质的序贯决策过程, 并依据决策过程设计了双智能体机制. 而后, 通过状态整合和行动分解, 降低了价值函数的学习难度. 最后, 为进一步提高算法性能, 针对资源的多技能特性, 设计了技能归并法, 显著降低了资源分配算法的时间复杂度. 与启发式算法的对比实验显示, 本文所设计的强化学习算法求解性能更高, 与元启发式算法的对比实验表明, 该算法稳定性更强, 且求解速度更快. 相似文献

12.

Optimal Synchronization Control of Heterogeneous Asymmetric Input-Constrained Unknown Nonlinear MASs via Reinforcement Learning

下载免费PDF全文

Lina Xia Qing Li Ruizhuo Song Hamidreza Modares 《IEEE/CAA Journal of Automatica Sinica》2022,9(3):520-532

The asymmetric input-constrained optimal synchronization problem of heterogeneous unknown nonlinear multiagent systems(MASs)is considered in the paper.Intuitively,a state-space transformation is performed such that satisfaction of symmetric input constraints for the transformed system guarantees satisfaction of asymmetric input constraints for the original system.Then,considering that the leader’s information is not available to every follower,a novel distributed observer is designed to estimate the leader’s state using only exchange of information among neighboring followers.After that,a network of augmented systems is constructed by combining observers and followers dynamics.A nonquadratic cost function is then leveraged for each augmented system(agent)for which its optimization satisfies input constraints and its corresponding constrained Hamilton-Jacobi-Bellman(HJB)equation is solved in a data-based fashion.More specifically,a data-based off-policy reinforcement learning(RL)algorithm is presented to learn the solution to the constrained HJB equation without requiring the complete knowledge of the agents’dynamics.Convergence of the improved RL algorithm to the solution to the constrained HJB equation is also demonstrated.Finally,the correctness and validity of the theoretical results are demonstrated by a simulation example. 相似文献

13.

双轮驱动移动机器人的学习控制器设计方法* 总被引：1，自引：0，他引：1

张洪宇徐昕张鹏程刘春明宋金泽《计算机应用研究》2009,26(6):2310-2313

提出一种基于增强学习的双轮驱动移动机器人路径跟随控制方法,通过将机器人运动控制器的优化设计问题建模为Markov决策过程,采用基于核的最小二乘策略迭代算法(KLSPI)实现控制器参数的自学习优化。与传统表格型和基于神经网络的增强学习方法不同,KLSPI算法在策略评价中应用核方法进行特征选择和值函数逼近,从而提高了泛化性能和学习效率。仿真结果表明,该方法通过较少次数的迭代就可以获得优化的路径跟随控制策略,有利于在实际应用中的推广。相似文献

14.

Conflict-Aware Safe Reinforcement Learning: A Meta-Cognitive Learning Framework

下载免费PDF全文

Majid Mazouchi Subramanya Nageshrao Hamidreza Modares 《IEEE/CAA Journal of Automatica Sinica》2022,9(3):466-481

In this paper,a data-driven conflict-aware safe reinforcement learning(CAS-RL)algorithm is presented for control of autonomous systems.Existing safe RL results with predefined performance functions and safe sets can only provide safety and performance guarantees for a single environment or circumstance.By contrast,the presented CAS-RL algorithm provides safety and performance guarantees across a variety of circumstances that the system might encounter.This is achieved by utilizing a bilevel learning control architecture:A higher metacognitive layer leverages a data-driven receding-horizon attentional controller(RHAC)to adapt relative attention to different system’s safety and performance requirements,and,a lower-layer RL controller designs control actuation signals for the system.The presented RHAC makes its meta decisions based on the reaction curve of the lower-layer RL controller using a metamodel or knowledge.More specifically,it leverages a prediction meta-model(PMM)which spans the space of all future meta trajectories using a given finite number of past meta trajectories.RHAC will adapt the system’s aspiration towards performance metrics(e.g.,performance weights)as well as safety boundaries to resolve conflicts that arise as mission scenarios develop.This will guarantee safety and feasibility(i.e.,performance boundness)of the lower-layer RL-based control solution.It is shown that the interplay between the RHAC and the lower-layer RL controller is a bilevel optimization problem for which the leader(RHAC)operates at a lower rate than the follower(RL-based controller)and its solution guarantees feasibility and safety of the control solution.The effectiveness of the proposed framework is verified through a simulation example. 相似文献

15.

强化学习在足球机器人基本动作学习中的应用 总被引：1，自引：0，他引：1

段勇杨淮清崔宝侠徐心和《机器人》2008,30(5):1

主要研究了强化学习算法及其在机器人足球比赛技术动作学习问题中的应用．强化学习的状态空间和动作空间过大或变量连续,往往导致学习的速度过慢甚至难于收敛．针对这一问题,提出了基于T-S 模型模糊神经网络的强化学习方法,能够有效地实现强化学习状态空间到动作空间的映射．此外,使用提出的强化学习方法设计了足球机器人的技术动作,研究了在不需要专家知识和环境模型情况下机器人的行为学习问题．最后,通过实验证明了所研究方法的有效性,其能够满足机器人足球比赛的需要．相似文献

16.

Price-Based Residential Demand Response Management in Smart Grids: A Reinforcement Learning-Based Approach

下载免费PDF全文

Yanni Wan Jiahu Qin Xinghuo Yu Tao Yang Yu Kang 《IEEE/CAA Journal of Automatica Sinica》2022,9(1):123-134

This paper studies price-based residential demand response management (PB-RDRM) in smart grids, in which non-dispatchable and dispatchable loads (including general loads and plug-in electric vehicles (PEVs)) are both involved. The PB-RDRM is composed of a bi-level optimization problem, in which the upper-level dynamic retail pricing problem aims to maximize the profit of a utility company (UC) by selecting optimal retail prices (RPs), while the lower-level demand response (DR) problem expects to minimize the comprehensive cost of loads by coordinating their energy consumption behavior. The challenges here are mainly two-fold: 1) the uncertainty of energy consumption and RPs; 2) the flexible PEVs’ temporally coupled constraints, which make it impossible to directly develop a model-based optimization algorithm to solve the PB-RDRM. To address these challenges, we first model the dynamic retail pricing problem as a Markovian decision process (MDP), and then employ a model-free reinforcement learning (RL) algorithm to learn the optimal dynamic RPs of UC according to the loads’ responses. Our proposed RL-based DR algorithm is benchmarked against two model-based optimization approaches (i.e., distributed dual decomposition-based (DDB) method and distributed primal-dual interior (PDI)-based method), which require exact load and electricity price models. The comparison results show that, compared with the benchmark solutions, our proposed algorithm can not only adaptively decide the RPs through on-line learning processes, but also achieve larger social welfare within an unknown electricity market environment. 相似文献

17.

多步截断优先扫描强化学习算法

李春贵《计算机工程》2005,31(11):13-15

研究了优先扫描的强化学习方法,通过定义新的迹,把多步截断即时差分学习用于集成规划的优先扫描强化学习,用多步截断即时差分来定义扫描优先权,提出一种改进的优先扫描强化学习算法并进行仿真实验,实验结果表明,新算法的学习效率有明显的提高。相似文献

18.

强化学习算法中启发式回报函数的设计及其收敛性分析 总被引：3，自引：0，他引：3

魏英姿赵明扬《计算机科学》2005,32(3):190-193

(中国科学院沈阳自动化所机器人学重点实验室沈阳110016) 相似文献

19.

Synergizing reinforcement learning and game theory—A new direction for control

Rajneesh Sharma M. Gopal 《Applied Soft Computing》2010,10(3):675-688

Reinforcement learning (RL) has now evolved as a major technique for adaptive optimal control of nonlinear systems. However, majority of the RL algorithms proposed so far impose a strong constraint on the structure of environment dynamics by assuming that it operates as a Markov decision process (MDP). An MDP framework envisages a single agent operating in a stationary environment thereby limiting the scope of application of RL to control problems. Recently, a new direction of research has focused on proposing Markov games as an alternative system model to enhance the generality and robustness of the RL based approaches. This paper aims to present this new direction that seeks to synergize broad areas of RL and Game theory, as an interesting and challenging avenue for designing intelligent and reliable controllers. First, we briefly review some representative RL algorithms for the sake of completeness and then describe the recent direction that seeks to integrate RL and game theory. Finally, open issues are identified and future research directions outlined. 相似文献

20.

Offline reinforcement learning with task hierarchies

Devin Schwab Soumya Ray 《Machine Learning》2017,106(9-10):1569-1598

In this work, we build upon the observation that offline reinforcement learning (RL) is synergistic with task hierarchies that decompose large Markov decision processes (MDPs). Task hierarchies can allow more efficient sample collection from large MDPs, while offline algorithms can learn better policies than the so-called “recursively optimal” or even hierarchically optimal policies learned by standard hierarchical RL algorithms. To enable this synergy, we study sample collection strategies for offline RL that are consistent with a provided task hierarchy while still providing good exploration of the state-action space. We show that naïve extensions of uniform random sampling do not work well in this case and design a strategy that has provably good convergence properties. We also augment the initial set of samples using additional information from the task hierarchy, such as state abstraction. We use the augmented set of samples to learn a policy offline. Given a capable offline RL algorithm, this policy is then guaranteed to have a value greater than or equal to the value of the hierarchically optimal policy. We evaluate our approach on several domains and show that samples generated using a task hierarchy with a suitable strategy allow significantly more sample-efficient convergence than standard offline RL. Further, our approach also shows more sample-efficient convergence to policies with value greater than or equal to hierarchically optimal policies found through an online hierarchical RL approach. 相似文献