期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

郭方洪何通吴祥董辉刘冰《控制理论与应用》2022,39(10):1881-1889

随着海量新能源接入到微电网中, 微电网系统模型的参数空间成倍增长, 其能量优化调度的计算难度不断上升. 同时, 新能源电源出力的不确定性也给微电网的优化调度带来巨大挑战. 针对上述问题, 本文提出了一种基于分布式深度强化学习的微电网实时优化调度策略. 首先, 在分布式的架构下, 将主电网和每个分布式电源看作独立智能体. 其次, 各智能体拥有一个本地学习模型, 并根据本地数据分别建立状态和动作空间, 设计一个包含发电成本、交易电价、电源使用寿命等多目标优化的奖励函数及其约束条件. 最后, 各智能体通过与环境交互来寻求本地最优策略, 同时智能体之间相互学习价值网络参数, 优化本地动作选择, 最终实现最小化微电网系统运行成本的目标. 仿真结果表明, 与深度确定性策略梯度算法(Deep Deterministic Policy Gradient, DDPG)相比, 本方法在保证系统稳定以及求解精度的前提下, 训练速度提高了17.6%, 成本函数值降低了67%, 实现了微电网实时优化调度. 相似文献

2.

Minimizing mean weighted tardiness in unrelated parallel machine scheduling with reinforcement learning

Zhicong Zhang Li Zheng Na Li Weiping Wang Shouyan Zhong Kaishun Hu 《Computers & Operations Research》2012,39(7):1315-1324

We address an unrelated parallel machine scheduling problem with R-learning, an average-reward reinforcement learning (RL) method. Different types of jobs dynamically arrive in independent Poisson processes. Thus the arrival time and the due date of each job are stochastic. We convert the scheduling problems into RL problems by constructing elaborate state features, actions, and the reward function. The state features and actions are defined fully utilizing prior domain knowledge. Minimizing the reward per decision time step is equivalent to minimizing the schedule objective, i.e. mean weighted tardiness. We apply an on-line R-learning algorithm with function approximation to solve the RL problems. Computational experiments demonstrate that R-learning learns an optimal or near-optimal policy in a dynamic environment from experience and outperforms four effective heuristic priority rules (i.e. WSPT, WMDD, ATC and WCOVERT) in all test problems. 相似文献

3.

强化学习算法中启发式回报函数的设计及其收敛性分析 总被引：3，自引：0，他引：3

魏英姿赵明扬《计算机科学》2005,32(3):190-193

(中国科学院沈阳自动化所机器人学重点实验室沈阳110016) 相似文献

4.

Transfer in variable-reward hierarchical reinforcement learning 总被引：2，自引：1，他引：1

Neville Mehta Sriraam Natarajan Prasad Tadepalli Alan Fern 《Machine Learning》2008,73(3):289-312

Transfer learning seeks to leverage previously learned tasks to achieve faster learning in a new task. In this paper, we consider transfer learning in the context of related but distinct Reinforcement Learning (RL) problems. In particular, our RL problems are derived from Semi-Markov Decision Processes (SMDPs) that share the same transition dynamics but have different reward functions that are linear in a set of reward features. We formally define the transfer learning problem in the context of RL as learning an efficient algorithm to solve any SMDP drawn from a fixed distribution after experiencing a finite number of them. Furthermore, we introduce an online algorithm to solve this problem, Variable-Reward Reinforcement Learning (VRRL), that compactly stores the optimal value functions for several SMDPs, and uses them to optimally initialize the value function for a new SMDP. We generalize our method to a hierarchical RL setting where the different SMDPs share the same task hierarchy. Our experimental results in a simplified real-time strategy domain show that significant transfer learning occurs in both flat and hierarchical settings. Transfer is especially effective in the hierarchical setting where the overall value functions are decomposed into subtask value functions which are more widely amenable to transfer across different SMDPs. 相似文献

5.

基于深度强化学习求解作业车间机器与 AGV联合调度问题

下载免费PDF全文

孙爱红雷琦宋豫川杨云帆《控制与决策》2024,39(1):253-262

针对作业车间中自动引导运输车(automated guided vehicle, AGV)与机器联合调度问题,以完工时间最小化为目标,提出一种基于卷积神经网络和深度强化学习的集成算法框架.首先,对含AGV的作业车间调度析取图进行分析,将问题转化为一个序列决策问题,并将其表述为马尔可夫决策过程.接着,针对问题的求解特点,设计一种基于析取图的空间状态与5个直接状态特征;在动作空间的设置上,设计包含工序选择和AGV指派的二维动作空间;根据作业车间中加工时间与有效运输时间为定值这一特点,构造奖励函数来引导智能体进行学习.最后,设计针对二维动作空间的2D-PPO算法进行训练和学习,以快速响应AGV与机器的联合调度决策.通过实例验证,基于2D-PPO算法的调度算法具有较好的学习性能和可扩展性效果. 相似文献

6.

一种基于强化学习的作业车间动态调度方法

魏英姿赵明扬《自动化学报》2005,31(5):765-771

Production scheduling is critical to manufacturing system. Dispatching rules are usually applied dynamically to schedule the job in a dynamic job-shop. Existing scheduling approaches sel- dom address machine selection in the scheduling process. Composite rules, considering both machine selection and job selection, are proposed in this paper. The dynamic system is trained to enhance its learning and adaptive capability by a reinforcement learning (RL) algorithm. We define the conception of pressure to describe the system feature. Designing a reward function should be guided by the scheduling goal to accurately record the learning progress. Competitive results with the RL-based approach show that it can be used as real-time scheduling technology. 相似文献

7.

A Reinforcement Learning-based Approach to Dynamic Job-shop Scheduling

WEI Ying-Zi~ 《自动化学报》2005,(5)

Production scheduling is critical to manufacturing system.Dispatching rules are usually applied dynamically to schedule (?)he job in a dynamic job-shop.Existing scheduling approaches sel- dom address machine selection in the scheduling process.Composite rules,considering both machine selection and job selection,are proposed in this paper.The dynamic system is trained to enhance its learning and adaptive capability by a reinforcement learning(RL)algorithm.We define the concep- tion of pressure to describe the system feature.Designing a reward function should be guided by the scheduling goal to accurately record the learning progress.Competitive results with the RL-based approach show that it can be used as real-time scheduling technology. 相似文献

8.

基于多智能体强化学习的多AGV路径规划方法

刘辉肖克王京擘《自动化与仪表》2020,(2):84-89

AGV(automated guided vehicle)路径规划问题已成为货物运输、快递分拣等领域中一项关键技术问题。由于在此类场景中需要较多的AGV合作完成,传统的规划模型难以协调多AGV之间的相互作用,采用分而治之的思想或许能获得系统的最优性能。基于此,该文提出一种最大回报频率的多智能体独立强化学习MRF(maximum reward frequency)Q-learning算法,对任务调度和路径规划同时进行优化。在学习阶段AGV不需要知道其他AGV的动作,减轻了联合动作引起的维数灾问题。采用Boltzmann与ε-greedy结合策略,避免收敛到较差路径,另外算法提出采用获得全局最大累积回报的频率作用于Q值更新公式,最大化多AGV的全局累积回报。仿真实验表明,该算法能够收敛到最优解,以最短的时间步长完成路径规划任务。相似文献

9.

On Average Versus Discounted Reward Temporal-Difference Learning

Tsitsiklis John N. Van Roy Benjamin 《Machine Learning》2002,49(2-3):179-191

We provide an analytical comparison between discounted and average reward temporal-difference (TD) learning with linearly parameterized approximations. We first consider the asymptotic behavior of the two algorithms. We show that as the discount factor approaches 1, the value function produced by discounted TD approaches the differential value function generated by average reward TD. We further argue that if the constant function—which is typically used as one of the basis functions in discounted TD—is appropriately scaled, the transient behaviors of the two algorithms are also similar. Our analysis suggests that the computational advantages of average reward TD that have been observed in some prior empirical work may have been caused by inappropriate basis function scaling rather than fundamental differences in problem formulations or algorithms. 相似文献

10.

分层强化学习综述

下载免费PDF全文

周文吉俞扬《智能系统学报》2017,12(5):590-594

强化学习（reinforcement learning）是机器学习和人工智能领域的重要分支,近年来受到社会各界和企业的广泛关注。强化学习算法要解决的主要问题是,智能体如何直接与环境进行交互来学习策略。但是当状态空间维度增加时,传统的强化学习方法往往面临着维度灾难,难以取得好的学习效果。分层强化学习（hierarchical reinforcement learning）致力于将一个复杂的强化学习问题分解成几个子问题并分别解决,可以取得比直接解决整个问题更好的效果。分层强化学习是解决大规模强化学习问题的潜在途径,然而其受到的关注不高。本文将介绍和回顾分层强化学习的几大类方法。相似文献

11.

深度强化学习算法求解作业车间调度问题

下载免费PDF全文

李宝帅叶春明《计算机工程与应用》2021,57(23):248-254

由于传统车间调度方法实时响应能力有限,难以在复杂调度环境中取得良好效果,提出一种基于深度Q网络的深度强化学习算法。该方法结合了深度神经网络的学习能力与强化学习的决策能力,将车间调度问题视作序列决策问题,用深度神经网络拟合价值函数,将调度状态表示为矩阵形式进行输入,使用多个调度规则作为动作空间,并设置基于机器利用率的奖励函数,不断与环境交互,获得每个决策点的最佳调度规则。通过与智能优化算法、调度规则在标准问题集上的测试对比证明了算法有效性。相似文献

12.

基于强化学习的多技能项目调度算法

胡振涛崔南方胡雪君雷晓琪《控制理论与应用》2024,41(3):502-511

多技能项目调度存在组合爆炸的现象, 其问题复杂度远超传统的单技能项目调度, 启发式算法和元启发式算法在求解多技能项目调度问题时也各有缺陷. 为此, 根据项目调度的特点和强化学习的算法逻辑, 本文设计了基于强化学习的多技能项目调度算法. 首先, 将多技能项目调度过程建模为符合马尔科夫性质的序贯决策过程, 并依据决策过程设计了双智能体机制. 而后, 通过状态整合和行动分解, 降低了价值函数的学习难度. 最后, 为进一步提高算法性能, 针对资源的多技能特性, 设计了技能归并法, 显著降低了资源分配算法的时间复杂度. 与启发式算法的对比实验显示, 本文所设计的强化学习算法求解性能更高, 与元启发式算法的对比实验表明, 该算法稳定性更强, 且求解速度更快. 相似文献

13.

A Reinforcement Learning Algorithm Based on Policy Iteration for Average Reward: Empirical Results with Yield Management and Convergence Analysis 总被引：2，自引：0，他引：2

Abhijit Gosavi 《Machine Learning》2004,55(1):5-29

We present a Reinforcement Learning (RL) algorithm based on policy iteration for solving average reward Markov and semi-Markov decision problems. In the literature on discounted reward RL, algorithms based on policy iteration and actor-critic algorithms have appeared. Our algorithm is an asynchronous, model-free algorithm (which can be used on large-scale problems) that hinges on the idea of computing the value function of a given policy and searching over policy space. In the applied operations research community, RL has been used to derive good solutions to problems previously considered intractable. Hence in this paper, we have tested the proposed algorithm on a commercially significant case study related to a real-world problem from the airline industry. It focuses on yield management, which has been hailed as the key factor for generating profits in the airline industry. In the experiments conducted, we use our algorithm with a nearest-neighbor approach to tackle a large state space. We also present a convergence analysis of the algorithm via an ordinary differential equation method. 相似文献

14.

Semi-Markov adaptive critic heuristics with application to airline revenue management

Ketaki KULKARNI Abhijit GOSAVI Susan MURRAY Katie GRANTHAM 《控制理论与应用(英文版)》2011,9(3):421-430

The adaptive critic heuristic has been a popular algorithm in reinforcement learning(RL) and approximate dynamic programming(ADP) alike.It is one of the first RL and ADP algorithms.RL and ADP algorithms are particularly useful for solving Markov decision processes(MDPs) that suffer from the curses of dimensionality and modeling.Many real-world problems,however,tend to be semi-Markov decision processes(SMDPs) in which the time spent in each transition of the underlying Markov chains is itself a random variable.Unfortunately for the average reward case,unlike the discounted reward case,the MDP does not have an easy extension to the SMDP.Examples of SMDPs can be found in the area of supply chain management,maintenance management,and airline revenue management.In this paper,we propose an adaptive critic heuristic for the SMDP under the long-run average reward criterion.We present the convergence analysis of the algorithm which shows that under certain mild conditions,which can be ensured within a simulator,the algorithm converges to an optimal solution with probability 1.We test the algorithm extensively on a problem of airline revenue management in which the manager has to set prices for airline tickets over the booking horizon.The problem has a large scale,suffering from the curse of dimensionality,and hence it is difficult to solve it via classical methods of dynamic programming.Our numerical results are encouraging and show that the algorithm outperforms an existing heuristic used widely in the airline industry. 相似文献

15.

基于平均奖赏强化学习算法的零阶分类元系统

臧兆祥李昭王俊英但志平《计算机工程与应用》2016,52(21):14-20

零阶学习分类元系统ZCS（Zeroth-level Classifier System）作为一种基于遗传的机器学习技术（Genetics-Based Machine Learning）,在解决多步学习问题上,已展现出应用价值。然而标准的ZCS系统采用折扣奖赏强化学习技术,难于适应更为广泛的应用领域。基于ZCS的现有框架,提出了一种采用平均奖赏强化学习技术（R-学习算法）的分类元系统,将ZCS中的折扣奖赏强化学习方法替换为R-学习算法,从而使ZCS一方面可应用于需要优化平均奖赏的问题领域,另一方面则可求解规模较大、需要动作长链支持的多步学习问题。实验显示,在多步学习问题中,该系统可给出满意解,且在维持动作长链,以及克服过泛化问题方面,具有更优的特性。相似文献

16.

P2P系统中的一种信任关系管理协议

下载免费PDF全文

林怀清李之棠黄庆凤《计算机工程》2007,33(18):20-21,2

信任关系管理是Peer-to-Peer信任模型的重要部分,在分布式环境中,如何安全存放和访问信任值是一个难以解决的问题。本协议采用可验证的、无可信中心的(k, n)门限密码系统产生系统的公/私密钥,征集k个管理者为系统中的用户生成证书,管理协议为用户提供信任值的匿名存储和访问服务。分析显示协议能极好地抵御各种攻击。相似文献

17.

平均报酬模型强化学习理论、算法及应用

下载免费PDF全文

黄炳强曹广益李建华《计算机工程》2007,33(18):18-19,3

折扣报酬模型强化学习是目前强化学习研究的主流,但折扣因子的选取使得近期期望报酬的影响大于远期期望报酬的影响,而有时候较大远期期望报酬的策略有可能是最优的,因此比较合理的方法是采用平均报酬模型强化学习。该文介绍了平均报酬模型强化学习的两个主要算法以及主要应用。相似文献

18.

一种基于自生成样本学习的奖赏塑形方法

钱煜俞扬周志华《软件学报》2013,24(11):2667-2675

强化学习通过从以往的决策反馈中学习,使Agent 做出正确的短期决策,以最大化其获得的累积奖赏值.以往研究发现,奖赏塑形方法通过提供简单、易学的奖赏替代函数(即奖赏塑性函数)来替换真实的环境奖赏,能够有效地提高强化学习性能.然而奖赏塑形函数通常是在领域知识或者最优策略示例的基础上建立的,均需要专家参与,代价高昂.研究是否可以在强化学习过程中自动地学习有效的奖赏塑形函数.通常,强化学习算法在学习过程中会采集大量样本.这些样本虽然有很多是失败的尝试,但对构造奖赏塑形函数可能提供有用信息.提出了针对奖赏塑形的新型最优策略不变条件,并在此基础上提出了RFPotential 方法,从自生成样本中学习奖赏塑形.在多个强化学习算法和问题上进行了实验,其结果表明,该方法可以加速强化学习过程. 相似文献

19.

在加强型学习系统中用伪熵进行不确定性估计

张平斯特凡·卡纽《控制理论与应用》1998,15(1):100-104

加强型学习系统是一种与没有约束的，未知的环境相互作用的系统，学习系统的目标在大最大可能地获取累积奖励信号，这个奖励信号在有限，未知的生命周期由系统所处的环境中得到，对于一个加强型学习系统，困难之一在于奖励信号非常稀疏，尤其是对于只有时延信号的系统，已有的加强型学习方法以价值函数的形式贮存奖励信号，例如著名的Ｑ－学习。本文提出了一个基于状态的不生估计模型的方法，这个算法对有利用存贮于价值函数中的奖励相似文献

20.

改进深度强化学习的室内移动机器人路径规划

下载免费PDF全文

成怡郝密密《计算机工程与应用》2021,57(21):256-262

为了解决传统深度强化学习在室内未知环境下移动机器人路径规划中存在探索能力差和环境状态空间奖励稀疏的问题,提出了一种基于深度图像信息的改进深度强化学习算法。利用Kinect视觉传感器直接获取的深度图像信息和目标位置信息作为网络的输入,以机器人的线速度和角速度作为下一步动作指令的输出。设计了改进的奖惩函数,提高了算法的奖励值,优化了状态空间,在一定程度上缓解了奖励稀疏的问题。仿真结果表明,改进算法提高了机器人的探索能力,优化了路径轨迹,使机器人有效地避开了障碍物,规划出更短的路径,简单环境下比DQN算法的平均路径长度缩短了21.4%,复杂环境下平均路径长度缩短了11.3%。相似文献