期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

On Average Versus Discounted Reward Temporal-Difference Learning

Tsitsiklis John N. Van Roy Benjamin 《Machine Learning》2002,49(2-3):179-191

We provide an analytical comparison between discounted and average reward temporal-difference (TD) learning with linearly parameterized approximations. We first consider the asymptotic behavior of the two algorithms. We show that as the discount factor approaches 1, the value function produced by discounted TD approaches the differential value function generated by average reward TD. We further argue that if the constant function—which is typically used as one of the basis functions in discounted TD—is appropriately scaled, the transient behaviors of the two algorithms are also similar. Our analysis suggests that the computational advantages of average reward TD that have been observed in some prior empirical work may have been caused by inappropriate basis function scaling rather than fundamental differences in problem formulations or algorithms. 相似文献

2.

融合序列模式评分的策略梯度推荐算法

官蕊丁家满《计算机应用与软件》2022,39(3):223-228

推荐算法在一定程度上解决了信息过载问题,但传统推荐模型在挖掘数据特性方面有待改进。为此,结合强化学习方法提出一种融合序列模式评分的策略梯度推荐算法。将推荐过程建模为马尔可夫决策过程;分析推荐基础数据特性模式,设计以序列模式评分为奖励的反馈函数,在算法的每一次迭代过程中学习;通过对累积奖励设计标准化操作来降低策略梯度的方差。将该方法应用到电影推荐中进行验证,结果表明所提方法具有较好的推荐准确性。相似文献

3.

Integral temporal difference learning for continuous-time linear quadratic regulations

Chun Tae Yoon Lee Jae Young Park Jin Bae Choi Yoon Ho 《International Journal of Control, Automation and Systems》2017,15(1):226-238

In this paper, we propose a temporal difference (TD) learning method, called integral TD learning that efficiently finds solutions to continuous-time (CT) linear quadratic regulation (LQR) problems in an online fashion where system matrix A is unknown. The idea originates from a computational reinforcement learning method known as TD(0), which is the simplest TD method in a finite Markov decision process. For the proposed integral TD method, we mathematically analyze the positive definiteness of the updated value functions, monotone convergence conditions, and stability properties concerning the locations of the closed-loop poles in terms of the learning rate and the discount factor. The proposed method includes the existing value iteration method for CT LQR problems as a special case. Finally, numerical simulations are carried out to verify the effectiveness of the proposed method and further investigate the aforementioned mathematical properties.

相似文献

4.

多步积累奖励的双重时序Q网络算法

朱威谯先锋陈艺楷何德峰《控制理论与应用》2022,39(2):222-230

车辆行驶控制决策是无人驾驶的核心技术,现有基于深度强化学习的无人驾驶控制决策算法存在处理数据效率低、无法有效提取状态间时序特征等问题.因此本文提出了一种基于多步积累奖励的双重时序Q网络算法.首先,设计了一种多步积累奖励方法,该方法对未来多步即时奖励的累加和进行均值化,与当前即时奖励共同作用于智能体的控制策略,并在奖励函... 相似文献

5.

Structured prediction with reinforcement learning

Francis Maes Ludovic Denoyer Patrick Gallinari 《Machine Learning》2009,77(2-3):271-301

We formalize the problem of Structured Prediction as a Reinforcement Learning task. We first define a Structured Prediction Markov Decision Process (SP-MDP), an instantiation of Markov Decision Processes for Structured Prediction and show that learning an optimal policy for this SP-MDP is equivalent to minimizing the empirical loss. This link between the supervised learning formulation of structured prediction and reinforcement learning (RL) allows us to use approximate RL methods for learning the policy. The proposed model makes weak assumptions both on the nature of the Structured Prediction problem and on the supervision process. It does not make any assumption on the decomposition of loss functions, on data encoding, or on the availability of optimal policies for training. It then allows us to cope with a large range of structured prediction problems. Besides, it scales well and can be used for solving both complex and large-scale real-world problems. We describe two series of experiments. The first one provides an analysis of RL on classical sequence prediction benchmarks and compares our approach with state-of-the-art SP algorithms. The second one introduces a tree transformation problem where most previous models fail. This is a complex instance of the general labeled tree mapping problem. We show that RL exploration is effective and leads to successful results on this challenging task. This is a clear confirmation that RL could be used for large size and complex structured prediction problems. 相似文献

6.

用于视频异常检测的时序多尺度自编码器

下载免费PDF全文

吕浩易鹏飞刘瑞周东生张强魏小鹏《图学学报》2022,43(2):223-229

视频异常检测是指识别不符合预期行为的事件.当前许多方法利用重构误差来检测异常,由于深度神经网络的强大能力可能会重构出异常行为,这与异常行为重构误差较大的假设不符.而利用预测未来帧的方法进行异常检测取得了很好的效果,但这些方法大多未考虑正常样本的多样性,或不能建立视频连续帧之间的关联.为了解决该问题,提出了一种时序多尺度... 相似文献

7.

An adaptive visual attentive tracker for human communicational behaviors using HMM-based TD learning with new State distinction capability

Ho M.A.T. Yamada Y. Umetani Y. 《Robotics, IEEE Transactions on》2005,21(3):497-504

To develop a nonverbal communication channel between an operator and a system, we built a tracking system called the Adaptive Visual Attentive Tracker (AVAT) to track and zoom in to the operator's behavioral sequence which represents his/her intention. In our system, hidden Markov models (HMMs) first roughly model the gesture pattern. Then, the state transition probabilities in HMMs are used to assign as the rewards in temporal difference (TD) learning. Later, the TD learning method is utilized to adjust the action model of the tracker for its situated behaviors in the tracking task. Identification of the hand sign gesture context through wavelet analysis autonomously provides a reward value for optimizing AVAT's action patterns. Experimental results of tracking the operator's hand sign action sequences during her natural walking motion with higher accuracy are shown which demonstrate the effectiveness of the proposed HMM-based TD learning algorithm of AVAT. During TD learning experiments, the exploring randomly chosen actions sometimes exceed the predefined state area, and thus involuntarily enlarge the domain of states. We describe a method utilizing HMMs with continuous observation distribution to detect whether the state would be split to make a new state. The generation of new states brings the ability of enlarging the predefined area of states. 相似文献

8.

Temporal difference learning applied to sequential detection

Chengan Guo Kuh A. 《Neural Networks, IEEE Transactions on》1997,8(2):278-287

This paper proposes a novel neural-network method for sequential detection, We first examine the optimal parametric sequential probability ratio test (SPRT) and make a simple equivalent transformation of the SPRT that makes it suitable for neural-network architectures. We then discuss how neural networks can learn the SPRT decision functions from observation data and labels. Conventional supervised learning algorithms have difficulties handling the variable length observation sequences, but a reinforcement learning algorithm, the temporal difference (TD) learning algorithm works ideally in training the neural network. The entire neural network is composed of context units followed by a feedforward neural network. The context units are necessary to store dynamic information that is needed to make good decisions. For an appropriate neural-network architecture, trained with independent and identically distributed (iid) observations by the TD learning algorithm, we show that the neural-network sequential detector can closely approximate the optimal SPRT with similar performance. The neural-network sequential detector has the additional advantage that it is a nonparametric detector that does not require probability density functions. Simulations demonstrated on iid Gaussian data show that the neural network and the SPRT have similar performance. 相似文献

9.

用马尔科夫链对Linux进程行为的异常检测

刘辉蔡利栋《计算机工程》2005,31(12):161-162,180

用马尔科夫链对序列数据进行分析时,其预报准确率对于序列演变的异常与否相当敏感,而Linux进程可由一系列的系统调用序列来表征。据此,该文用马尔科夫链对Linux进程的系统调用序列进行行为模式提取并作异常检测。同时,还考虑了序列的顺序关系,使得模式有了合理的解释。相似文献

10.

基于混合马尔科夫树模型的ICS异常检测算法 总被引：1，自引：0，他引：1

张仁斌吴佩陆阳郭忠义《自动化学报》2020,46(1):127-141

针对工业控制系统中现有异常检测算法在语义攻击检测方面存在的不足,提出一种基于混合马尔科夫树模型的异常检测算法,充分利用工业控制系统的阶段性和周期性特征,构建系统正常运行时的行为模型|混合马尔科夫树.该模型包含合法的状态事件、合法的状态转移、正常的概率分布以及正常的转移时间间隔等4种信息,基于动态自适应的方法增强状态事件的关联度并引入时间间隔信息以实现对复杂语义攻击的检测,语义建模时设计一种剪枝策略以去除模型中的低频事件、低转移事件以及冗余节点,当被检测行为使得模型的以上4种信息产生的偏差超过阈值时,判定该行为异常.最后,基于OMNeT++网络仿真环境构建一个简化的污水处理系统对本文算法进行功能性验证,并利用真实物理测试床的数据集对算法的检测准确度进行性能验证.验证结果表明,本文算法能有效消除人机交互和常规诊断等操作带来的噪声影响,对复杂语义攻击具有较高的检出率,且能识别传统的非语义攻击. 相似文献

11.

Model-Free Deep Inverse Reinforcement Learning by Logistic Regression

Eiji Uchibe 《Neural Processing Letters》2018,47(3):891-905

This paper proposes model-free deep inverse reinforcement learning to find nonlinear reward function structures. We formulate inverse reinforcement learning as a problem of density ratio estimation, and show that the log of the ratio between an optimal state transition and a baseline one is given by a part of reward and the difference of the value functions under the framework of linearly solvable Markov decision processes. The logarithm of density ratio is efficiently calculated by binomial logistic regression, of which the classifier is constructed by the reward and state value function. The classifier tries to discriminate between samples drawn from the optimal state transition probability and those from the baseline one. Then, the estimated state value function is used to initialize the part of the deep neural networks for forward reinforcement learning. The proposed deep forward and inverse reinforcement learning is applied into two benchmark games: Atari 2600 and Reversi. Simulation results show that our method reaches the best performance substantially faster than the standard combination of forward and inverse reinforcement learning as well as behavior cloning. 相似文献

12.

基于空间尺度粗粒化的异常检测方法

富坤刘琪禚佳明李佳宁郭云朋《计算机应用研究》2022,39(7)

目前,大部分基于链路预测对社会网络进行异常检测的研究中,缺乏对异常节点演化影响的分析,且受社会网络规模以及复杂度的限制,检测效率普遍不高。针对上述问题,提出了一种基于空间尺度粗粒化和异常节点加权机制的异常检测方法。首先利用凝聚型社区发现算法Louvain对社会网络进行粗粒化得到简化网络,然后在简化网络的演化过程中识别有异常演化行为的节点,并将其异常演化过程量化,引入异常节点加权机制到链路预测方法中进行异常检测。在真实社会网络数据集VAST、Email-EU（dept1和dept2）以及Enron上,与基于LinkEvent的不同调整策略算法和NESO_ED方法进行对比。结果表明,该方法可以兼顾异常检测的稳定性和敏感性,能够更合理地描述网络演化过程,得到更好的异常检测效果。相似文献

13.

增强协作多智能体强化学习中的全局信用分配机制

姚兴虎宋光鑫《计算技术与自动化》2021,40(1):149-154

针对协作多智能体强化学习中的全局信用分配机制很难捕捉智能体之间的复杂协作关系及无法有效地处理非马尔可夫奖励信号的问题,提出了一种增强的协作多智能体强化学习中的全局信用分配机制。首先,设计了一种新的基于奖励高速路连接的全局信用分配结构,使得智能体在决策时能够考虑其所分得的局部奖励信号与团队的全局奖励信号;其次,通过融合多步奖励信号提出了一种能够适应非马尔可夫奖励的值函数估计方法。在星际争霸微操作实验平台上的多个复杂场景下的实验结果表明:所提方法不仅能够取得先进的性能,同时还能大大提高样本的利用率。相似文献

14.

基于深度强化学习求解作业车间机器与 AGV联合调度问题

孙爱红雷琦宋豫川杨云帆《控制与决策》2024,39(1):253-262

针对作业车间中自动引导运输车(automated guided vehicle, AGV)与机器联合调度问题,以完工时间最小化为目标,提出一种基于卷积神经网络和深度强化学习的集成算法框架.首先,对含AGV的作业车间调度析取图进行分析,将问题转化为一个序列决策问题,并将其表述为马尔可夫决策过程.接着,针对问题的求解特点,设计一种基于析取图的空间状态与5个直接状态特征;在动作空间的设置上,设计包含工序选择和AGV指派的二维动作空间;根据作业车间中加工时间与有效运输时间为定值这一特点,构造奖励函数来引导智能体进行学习.最后,设计针对二维动作空间的2D-PPO算法进行训练和学习,以快速响应AGV与机器的联合调度决策.通过实例验证,基于2D-PPO算法的调度算法具有较好的学习性能和可扩展性效果. 相似文献

15.

一类非线性动态系统基于强化学习的最优控制制

陈学松刘富春《控制与决策》2013,28(12):1889-1893

提出一类非线性不确定动态系统基于强化学习的最优控制方法. 该方法利用欧拉强化学习算法估计对象的未知非线性函数, 给出了强化学习中回报函数和策略函数迭代的在线学习规则. 通过采用向前欧拉差分迭代公式对学习过程中的时序误差进行离散化, 实现了对值函数的估计和控制策略的改进. 基于值函数的梯度值和时序误差指标值, 给出了该算法的步骤和误差估计定理. 小车爬山问题的仿真结果表明了所提出方法的有效性.

相似文献

16.

自适应序列生成的建筑能耗预测

王悦陈建平傅启明吴宏杰陆悠《计算机系统应用》2021,30(11):155-163

提出一种基于强化学习的生成对抗网络(Reinforcement learning-based Generative Adversarial Networks,Re-GAN)能耗预测方法.该算法将强化学习与生成对抗网络相结合,将GAN(Generative Adversarial Nets)中的生成器以及判别器分别构建为强化学习中Agent(生成器)以及奖赏函数.在训练过程中,将当前的真实能耗序列作为Agent的输入状态,构建一组固定长度的生成序列,结合判别器及蒙特卡洛搜索方法进一步构建当前序列的奖赏函数,并以此作为真实样本序列后续第一个能耗值的奖赏.在此基础之上,构建关于奖赏的目标函数,并求解最优参数.最后使用所提算法对唐宁街综合大楼公开的建筑能耗数据进行预测试验,实验结果表明,所提算法比多层感知机、门控循环神经网络和卷积神经网络具有更高的预测精度. 相似文献

17.

基于DAE和GRU组合的流量异常检测方法

下载免费PDF全文

尹梓诺马海龙胡涛《信息安全学报》2023,8(2):11-27

流量异常检测能够有效识别网络流量数据中的攻击行为,是一种重要的网络安全防护手段。近年来,深度学习在流量异常检测领域得到了广泛应用,现有的深度学习模型进行流量异常检测存在两个问题:一是数据受噪声影响导致检测鲁棒性差、准确率低;二是数据特征维度高以及模型参数多导致训练和检测速度慢。为了在降低流量数据噪声影响的基础上提高检测速度和准确性,本文提出了一种基于去噪自编码器(Denoising Auto Encoder,DAE)和门控循环单元(Gated Recurrent Unit,GRU)组合的流量异常检测方法。首先设计了基于DAE的流量特征提取算法,采用小批量梯度下降算法对DAE进行训练,通过最小化含噪声数据的重构向量与原始输入向量间的差异,有效提取具有较强鲁棒性的流量特征,降低特征维度。然后设计了基于GRU的异常检测算法,利用提取的低维流量特征数据训练GRU,从而构建异常流量分类器,实现对攻击流量的准确检测。最后在NSL-KDD、UNSW-NB15、CICIDS2017数据集上的实验结果表明:与其他的机器学习、深度学习方法相比,本文所提方法的检测准确率最大提升了18.71%。同时,本文方法可以实现较高的精确率、召回率和检测效率,同时具有较低的误报率。在面对数据受到噪声破坏时,具有较强的检测鲁棒性。相似文献

18.

Sensor control for multi-object state-space estimation using random finite sets

Branko Ristic Author Vitae Ba-Ngu Vo^{Author Vitae} 《Automatica》2010,46(11):1812-1818

The problem addressed in this paper is information theoretic sensor control for recursive Bayesian multi-object state-space estimation using random finite sets. The proposed algorithm is formulated in the framework of partially observed Markov decision processes where the reward function associated with different sensor actions is computed via the Rényi or alpha divergence between the multi-object prior and the multi-object posterior densities. The proposed algorithm in implemented via the sequential Monte Carlo method. The paper then presents a case study where the problem is to localise an unknown number of sources using a controllable moving sensor which provides range-only detections. Four sensor control reward functions are compared in the study and the proposed scheme is found to perform the best. 相似文献

19.

基于系统调用和齐次Markov链模型的程序行为异常检测 总被引：7，自引：0，他引：7

田新广高立志孙春来张尔扬《计算机研究与发展》2007,44(9):1538-1544

异常检测是目前入侵检测领域研究的热点内容.提出一种新的基于系统调用和Markov链模型的程序行为异常检测方法,该方法利用一阶齐次Markov链对主机系统中特权程序的正常行为进行建模,将Markov链的状态同特权程序运行时所产生的系统调用联系在一起,并引入一个附加状态;Markov链参数的计算中采用了各态历经性假设;在检测阶段,基于状态序列的出现概率对特权程序当前行为的异常程度进行分析,并根据Markov链状态的实际含义和程序行为的特点,提供了两种可选的判决方案.同现有的基于隐Markov模型和基于人工免疫原理的检测方法相比,提出的方法兼顾了计算成本和检测准确度,特别适用于在线检测.该方法已应用于实际入侵检测系统,并表现出良好的检测性能. 相似文献

20.

基于模糊数据挖掘与遗传算法的异常检测方法 总被引：4，自引：0，他引：4

孙东黄天戍秦丙栓朱天清《计算机应用》2006,26(1):210-0212

建立合适的隶属度函数是入侵检测中应用模糊数据挖掘所面临的一个难点。针对这一问题，提出了在异常检测中运用遗传算法对隶属度函数的参数进行优化的方法。将隶属度函数的参数组合成有序的参数集并编码为遗传个体，在个体的遗传进化中嵌入模糊数据挖掘，可以搜索到最佳的参数集。采用这一参数集，能够在实时检测中最大限度地将系统正常状态与异常状态区分开来，提高异常检测的准确性。最后，对网络流量的异常检测实验验证了这一方法的可行性。相似文献