Similar Documents
20 similar documents found (search time: 26 ms).
1.
A Multi-Agent Reinforcement Learning Model and Algorithm Based on Markov Games   (Total citations: 19, self-citations: 0, citations by others: 19)
In an MDP, a single agent can find the optimal solution to a problem through reinforcement learning, but the MDP model no longer applies in multi-agent systems, and the minimax-Q algorithm can only handle MAS learning problems modeled as zero-sum games. This paper adopts non-zero-sum Markov games as the learning framework for multi-agent systems and proposes a metagame reinforcement learning model together with a metagame Q algorithm. It is proved theoretically that the metagame Q algorithm converges to the metagame optimal solution of the non-zero-sum Markov game.
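For context, the zero-sum baseline mentioned above (minimax-Q) backs up a joint-action value with a maximin state value. A minimal Python sketch follows; for brevity it uses the pure-strategy maximin, a simplification of the mixed-strategy maximin that the standard algorithm obtains by solving a small linear program, and all names are illustrative rather than taken from the cited paper.

```python
import numpy as np

def minimax_q_update(Q, s, a, o, r, s_next, alpha=0.1, gamma=0.95):
    """One minimax-Q update for a two-player zero-sum Markov game.

    Q[s] is an (n_actions x n_opponent_actions) value matrix for state s.
    The next-state value below is the pure-strategy maximin
    max_a min_o Q[s_next][a, o]; the full algorithm would use the
    mixed-strategy maximin from a linear program instead.
    """
    v_next = np.max(np.min(Q[s_next], axis=1))  # pure-strategy maximin value
    Q[s][a, o] += alpha * (r + gamma * v_next - Q[s][a, o])
    return Q
```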

2.
A reinforcement learning agent solves decision problems by learning an optimal policy that maps states to actions; reinforcement learning is the method by which an agent improves its behavior through trial-and-error interaction with its environment. The Markov decision process (MDP) model is the general framework for reinforcement learning problems, and dynamic programming provides policy-dependent value-function learning algorithms for agents in Markovian environments. During learning, however, the agent must store the entire value function, and this memory requirement becomes enormous as the state space grows. This paper proposes a forgetting algorithm for reinforcement learning based on dynamic programming: by introducing basic principles of forgetting from memory psychology into value-function learning, it derives an effective dynamic-programming method for reinforcement learning problems, the Forget-DP algorithm.
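The abstract does not give Forget-DP's update rule; the dynamic-programming backbone it builds on is ordinary value iteration, sketched below in Python with illustrative names (the forgetting mechanism itself is not shown).

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    """Standard value iteration over a finite MDP.

    P[a] is an (S x S) transition matrix for action a, R an (S x A)
    reward table; returns the value function and a greedy policy.
    """
    S, A = R.shape
    V = np.zeros(S)
    while True:
        Q = np.array([R[:, a] + gamma * P[a] @ V for a in range(A)]).T  # (S x A)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new
```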

3.
The paper achieves two outcomes. First, it summarizes previous work on concurrent Markov decision processes (CMDPs), currently demonstrated on multi-agent foraging problems. When using CMDPs, each agent models the environment using two Markov decision processes (MDPs). The two MDPs characterize a multi-agent foraging problem by modeling, for each agent, both a single-agent foraging problem and a multi-agent task allocation problem. Second, the paper studies the effects of state uncertainty on a heterogeneous robot team that utilizes the aforementioned CMDP modelling approach. Furthermore, the paper presents a method to maintain performance despite state uncertainty. The resulting robust concurrent individual and social learning (RCISL) mechanism leads to enhanced team learning behaviour despite state uncertainty. The paper analyzes the performance of the concurrent individual and social learning mechanism with and without a particle filter for a heterogeneous foraging scenario. The RCISL mechanism confers statistically significant performance improvements over the CISL mechanism.

4.
Partially observable Markov decision processes (POMDPs) provide a mathematical framework for agent planning in stochastic, partially observable environments. The classic Bayesian optimal solution can be obtained by transforming the problem into a Markov decision process (MDP) over belief states. However, because the belief state space is continuous and multi-dimensional, the problem is highly intractable. Many practical heuristic-based methods have been proposed, but most of them require a complete POMDP model of the environment, which is not always available. This article introduces a modified memory-based reinforcement learning algorithm, called modified U-Tree, that is capable of learning from raw sensor experiences with minimal prior knowledge. The article describes an enhancement of the original U-Tree's state generation process that makes the generated model more compact, and also proposes a modification of the statistical test for reward estimation, which allows the algorithm to be benchmarked against some traditional model-based algorithms on a set of well-known POMDP problems.
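The belief-state transformation mentioned above is a Bayes filter over the hidden state; a minimal sketch is given below, with array layouts that are illustrative rather than taken from the cited paper.

```python
import numpy as np

def belief_update(b, a, o, T, Z):
    """Bayesian belief update that turns a POMDP into a belief-state MDP.

    b: current belief over states (length-S vector summing to 1)
    T[a]: (S x S) transition matrix for action a
    Z[a]: (S x O) observation matrix, Z[a][s_next, o] = P(o | s_next, a)
    """
    predicted = b @ T[a]                  # P(s' | b, a)
    unnormalized = predicted * Z[a][:, o] # weight by observation likelihood
    return unnormalized / unnormalized.sum()
```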

5.
Reinforcement learning (RL) has evolved into a major technique for adaptive optimal control of nonlinear systems. However, the majority of RL algorithms proposed so far impose a strong constraint on the structure of the environment dynamics by assuming that it operates as a Markov decision process (MDP). An MDP framework envisages a single agent operating in a stationary environment, thereby limiting the scope of application of RL to control problems. Recently, a new direction of research has focused on proposing Markov games as an alternative system model to enhance the generality and robustness of RL-based approaches. This paper presents this new direction, which seeks to synergize the broad areas of RL and game theory, as an interesting and challenging avenue for designing intelligent and reliable controllers. First, we briefly review some representative RL algorithms for the sake of completeness and then describe the recent direction that seeks to integrate RL and game theory. Finally, open issues are identified and future research directions outlined.

6.
Markov decision processes (MDPs) are widely used in problems whose solutions can be represented as a sequence of actions. Many papers demonstrate successful MDP use in model problems, robotic control problems, planning problems, etc. Economic problems likewise have the property of multistep motion towards a goal. This paper is dedicated to applying MDPs to the problem of pricing policy management. The problem of dynamic pricing is stated in terms of an MDP. Additional attention is paid to the method of constructing an MDP model based on data mining. Using sales data from an actual industrial plant, the construction of an MDP model that includes the search for and generalization of regularities is demonstrated.

7.
A POMDP Model for Spoken Dialogue Systems and Its Solution   (Total citations: 3, self-citations: 0, citations by others: 3)
Many spoken dialogue systems have entered practical use, but a good dialogue management model has long been lacking. Treating dialogue management as a stochastic optimization problem and modeling it with a Markov decision process (MDP) is a recent direction, but the uncertainty of dialogue states prevents an MDP from reflecting the dialogue model well. This paper proposes a new spoken dialogue system model based on the partially observable MDP (POMDP), using partial observability to handle the uncertainty. Because exact solution algorithms are of limited use, the applicability of a number of heuristic approximation algorithms to this model is examined and some of them are improved; for the grid-based approximation algorithm, for example, two simulation-point-based grid selection methods are proposed.

8.
李保罗  蔡明钰  阚震 《控制与决策》2023,38(7):1835-1844
To meet the needs of robots executing complex tasks in dynamic, uncertain environments, a linear temporal logic (LTL) guided model-free safe reinforcement learning algorithm is proposed that maximizes the probability of task completion while guaranteeing safety during learning. First, taking the uncertainties in the environment into account, a Markov decision process (MDP) is constructed; the agent's complex task is specified in LTL and converted into a transition-based limit-deterministic generalized Büchi automaton (tLDGBA) with multiple accepting sets, and an accepting frontier function is used to build a constrained tLDGBA (ctLDGBA) that records the accepting sets still to be visited. Next, a product MDP is constructed, over which reinforcement learning searches for the optimal policy. Finally, a safety game is built from the LTL safety specification and the MDP's observation function, and a shield mechanism designed from this game guarantees the system's safety during learning. A rigorous analysis proves that the proposed algorithm obtains an optimal policy maximizing the probability of completing the LTL task. Simulation results verify the effectiveness of the LTL-guided safe reinforcement learning algorithm.
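Conceptually, the shield mechanism described above sits between the learner and the system: the learner proposes an action, and the shield overrides it whenever the safety game marks it unsafe in the current state. The following hypothetical sketch illustrates the idea only; the function names and the source of the safe-action set are assumptions, not the paper's interface.

```python
def shielded_action(agent_policy, safe_actions, state):
    """Apply a safety shield to the learner's proposed action.

    agent_policy(state): the RL agent's proposed action.
    safe_actions(state): non-empty set of actions the safety game
    certifies as safe in this state (assumed precomputed).
    """
    proposed = agent_policy(state)
    allowed = safe_actions(state)
    return proposed if proposed in allowed else next(iter(allowed))
```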

9.
A Learning Controller Design Method for Two-Wheel-Driven Mobile Robots   (Total citations: 1, self-citations: 0, citations by others: 1)
A path-following control method for two-wheel-driven mobile robots based on reinforcement learning is proposed. The optimal design of the robot's motion controller is modeled as a Markov decision process, and the kernel-based least-squares policy iteration (KLSPI) algorithm is used to optimize the controller parameters through self-learning. Unlike traditional tabular and neural-network-based reinforcement learning methods, KLSPI applies kernel methods for feature selection and value-function approximation in policy evaluation, improving generalization performance and learning efficiency. Simulation results show that the method obtains an optimized path-following control policy within relatively few iterations, which favors its adoption in practical applications.
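KLSPI's policy-evaluation step is a kernelized variant of LSTD-Q inside least-squares policy iteration. The abstract does not detail the kernel dictionary construction, so the sketch below shows a generic (non-kernel) LSTD-Q step with illustrative names.

```python
import numpy as np

def lstdq(samples, phi, policy, gamma=0.95, reg=1e-3):
    """One LSTD-Q policy-evaluation step, as used inside LSPI.

    samples: list of (s, a, r, s_next) transitions.
    phi(s, a): feature vector for a state-action pair.
    policy(s): action chosen by the policy being evaluated.
    Returns the weight vector of the approximate Q function.
    """
    k = phi(*samples[0][:2]).shape[0]
    A = reg * np.eye(k)          # regularized to keep the system solvable
    b = np.zeros(k)
    for s, a, r, s_next in samples:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A, b)
```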

10.
Markov decision processes (MDPs) and their variants are widely studied in the theory of control for stochastic discrete-event systems driven by Markov chains. Much of the literature focuses on the risk-neutral criterion, in which the expected rewards, either average or discounted, are maximized. Some literature on MDPs does take risk into account; much of it addresses the exponential utility (EU) function and mechanisms to penalize different forms of variance of the rewards. EU functions have some numerical deficiencies, while variance measures variability both above and below the mean rewards; the variability above mean rewards is usually beneficial and should not be penalized or avoided. As such, risk metrics that account for pre-specified targets (thresholds) for rewards have been considered in the literature, where the goal is to penalize the risk of revenues falling below those targets. Existing work on MDPs that takes targets into account seeks to minimize risks of this nature. Minimizing risk can lead to poor solutions where the risk is zero or near zero but the average rewards are also rather low. Hence, in this paper we study a risk-averse criterion, in particular the so-called downside risk, which equals the probability of the revenues falling below a given target; in contrast to minimizing such risk, we only reduce it at the cost of slightly lowered average rewards. A solution where the risk is low and the average reward is quite high, although not at its maximum attainable value, is very attractive in practice. To be more specific, in our formulation the objective function is the expected value of the rewards minus a scalar times the downside risk. In this setting, we analyze the infinite-horizon MDP, the finite-horizon MDP, and the infinite-horizon semi-MDP (SMDP). We develop dynamic programming and reinforcement learning algorithms for the finite and infinite horizon. The algorithms are tested in numerical studies and show encouraging performance.
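The risk-adjusted objective stated in the abstract is easy to estimate from simulated episode revenues; the sketch below assumes Monte Carlo estimation and uses illustrative names.

```python
import numpy as np

def downside_risk_objective(episode_revenues, target, risk_weight):
    """Estimate: expected revenue minus (risk_weight x downside risk),
    where the downside risk is the probability that revenue falls
    below the given target."""
    revenues = np.asarray(episode_revenues, dtype=float)
    expected = revenues.mean()
    downside_risk = (revenues < target).mean()
    return expected - risk_weight * downside_risk
```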

11.
For the optimal production control of unreliable stochastic production systems with diffusion terms, a numerical method is used to solve the mode-coupled nonlinear partial differential HJB equation satisfied by the optimal control. First, a Markov chain is constructed to approximate the evolution of the production system's state; based on the local consistency principle, the continuous-time stochastic control problem is converted into a discrete-time Markov decision process, and value iteration and policy iteration algorithms are then used to compute the optimal control numerically. Simulation results at the end of the paper verify the correctness and effectiveness of the method.

12.
This article addresses reinforcement learning problems based on factored Markov decision processes (MDPs) in which the agent must choose among a set of candidate abstractions, each built up from a different combination of state components. We present and evaluate a new approach that can perform effective abstraction selection and is more resource-efficient and/or more general than existing approaches. The core of the approach is to make selection of an abstraction part of the learning agent's decision-making process by augmenting the agent's action space with internal actions that select the abstraction it uses. We prove that under certain conditions this approach results in a derived MDP whose solution yields both the optimal abstraction for the original MDP and the optimal policy under that abstraction. We examine our approach in three domains of increasing complexity: contextual bandit problems, episodic MDPs, and general MDPs with context-specific structure. © 2013 Wiley Periodicals, Inc.

13.
The goal of dialogue management in a spoken dialogue system is to take actions based on observations and inferred beliefs. To ensure that the actions optimize the performance or robustness of the system, researchers have turned to reinforcement learning methods to learn policies for action selection. To derive an optimal policy from data, the dynamics of the system is often represented as a Markov Decision Process (MDP), which assumes that the state of the dialogue depends only on the previous state and action. In this article, we investigate whether constraining the state space by the Markov assumption, especially when the structure of the state space may be unknown, truly affords the highest reward. In simulation experiments conducted in the context of a dialogue system for interacting with a speech-enabled web browser, models under the Markov assumption did not perform as well as an alternative model which classifies the total reward with accumulating features. We discuss the implications of the study as well as its limitations.

14.
The state of research on deicing robots for power transmission lines is reviewed. In view of their harsh working environment and the many uncertain factors involved, a behavior controller design method based on Markov decision processes is proposed. The method first defines a Markov model for the transmission-line deicing robot, then gives the corresponding optimal policy search strategy, and also adds a probability adjustment mechanism to the controller so that execution results can be fed back. To handle unexpected situations during the robot's operation, the method also introduces manual assistance and …

15.
To further improve the recommendation quality of personalized recommender systems, a new collaborative filtering recommendation algorithm is proposed by optimizing the SVDPP algorithm with reinforcement learning. Considering the temporal effect of user ratings, the recommendation problem is converted into a Markov decision process. On this basis, Q-learning is used to build a rating optimization model that fuses timestamp information, and missing values are predicted by rounding predicted ratings and by boundary completion to alleviate data sparsity. Experimental results show that the algorithm reduces the root mean square error by 0.0056 compared with SVDPP, indicating that fusing timestamps and applying reinforcement learning to optimize recommendation performance is feasible.
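The Q-learning component referenced above uses the standard tabular update; a minimal sketch follows. The state/action encoding that fuses timestamp information is not specified in the abstract, so the dictionary keys here are purely illustrative.

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Standard tabular Q-learning update on a dict keyed by (state, action)."""
    old = Q.get((s, a), 0.0)
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = old + alpha * (r + gamma * best_next - old)
    return Q
```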

16.
Solving Optimal Control of Probabilistic Boolean Networks Using Model Checking   (Total citations: 1, self-citations: 1, citations by others: 0)
郭宗豪  魏欧 《计算机科学》2017,44(5):193-198, 231
Systems biology aims to build realistic, computable models of complex biological systems so that their evolution can be understood from a systems perspective. An important topic in systems biology is developing a control theory for gene regulatory networks through external interventions, as a basis for future gene therapy techniques. Boolean networks and their extension, probabilistic Boolean networks, are widely used to model gene regulatory networks. In the study of control problems, the state transitions of a probabilistic Boolean control network essentially form a discrete-time Markov decision process with a finite state space. Based on the theory of Markov decision processes, probabilistic model checking is used to solve the finite-horizon and infinite-horizon optimal control problems for such networks. For context-sensitive probabilistic Boolean control networks with random perturbations, the probabilistic model checker PRISM is used for formal modeling; the two classes of optimal control problems are then expressed as temporal logic formulas, and optimal solutions are found by model checking. Experimental results show that the proposed method is effective for the analysis and optimal control of biological networks.

17.
陈鑫  魏海军  吴敏  曹卫华 《自动化学报》2013,39(12):2021-2031
Improving adaptability, achieving generalization over continuous spaces, and reducing dimensionality are key to applying multi-agent reinforcement learning (MARL) in continuous systems. To address these needs, this paper proposes a model-based tracking learning mechanism and algorithm for agents in continuous multi-agent system (MAS) environments (MAS MBRL-CPT). Taking the learning agent's adaptation to its companions' policies as the starting point, an individual expected immediate reward is defined that incorporates the agent's observation of companion policies into the effect of interaction with the environment, and stochastic approximation is used to learn this individual expected immediate reward online. A dimension-reduced Q function is defined, which lowers the dimensionality of the learning space while establishing the Markov decision process (MDP) for agent tracking learning in the MAS environment. With a state-transition probability model built by Gaussian regression, the Q-value function over a generalized sample set is solved online by dynamic programming, and Gaussian regression over the discrete-sample Q function is used to build generalized models of the value function and the policy. Simulation experiments on a continuous-space multi-cart-pole control system show that MAS MBRL-CPT enables the learning agent to learn adaptive cooperative policies when the system dynamics and companion policies are unknown, with high learning efficiency and strong generalization ability.

18.
A mathematical model is established for a CDMA downlink multi-user system that minimizes the base station's average power subject to user delay constraints. The optimization problem is converted into an unconstrained Markov decision process and solved optimally by dynamic programming, and the power-delay surface for two users is proved to be convex. Simulation results show that, in a CDMA downlink multi-user system, increasing delay saves power and that average delay and power retain a convex relationship.

19.
In a distributed system, broadcasting is an efficient way to dispense data in certain highly dynamic environments. While there are several well-known on-line broadcast scheduling strategies that minimize wait time, there has been little research that considers on-demand broadcasting with timing constraints. One application which could benefit from a strategy for on-demand broadcast with timing constraints is a real-time database system. Scheduling strategies are needed in real-time databases that identify which data item to broadcast next in order to minimize missed deadlines. The scheduling decisions required in a real-time broadcast system allow the system to be modeled as a Markov Decision Process (MDP). In this paper, we analyze the MDP model and determine that finding an optimal solution is a hard problem in PSPACE. We propose a scheduling approach, called Aggregated Critical Requests (ACR), which is based on the MDP formulation and present two algorithms based on this approach. ACR is designed for timely delivery of data to clients in order to maximize the reward by minimizing the deadlines missed. Results from trace-driven experiments indicate the ACR approach provides a flexible strategy that can outperform existing strategies under a variety of factors.

20.
DeGroot learning is a model of opinion diffusion and formation in a social network. We examine the behaviour of the DeGroot learning model when external strategic players that aim to influence the opinion formation process are introduced. More specifically, we consider the case of a single decision maker and that of two competing players, each with a fixed number of possible influence actions. In the former case, the DeGroot model takes the form of a Markov Decision Process (MDP), while in the latter case it takes the form of a Stochastic Game (SG). These models are solved using probabilistic model checking techniques, as well as other solution techniques beyond model checking. The viability of our analysis is attested on a well-known social network, Zachary's karate club. Finally, evaluating influence in a social network jointly with the decision maker's cost is also supported, encoded as a multi-objective model checking problem.
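The underlying DeGroot update is a single matrix-vector product per round; a minimal sketch follows (illustrative only, with the strategic influence actions studied in the paper left out).

```python
import numpy as np

def degroot_step(opinions, W):
    """One round of DeGroot learning: x(t+1) = W x(t), where W is a
    row-stochastic trust matrix and each agent's next opinion is the
    weighted average of its neighbours' current opinions."""
    return W @ np.asarray(opinions, dtype=float)
```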
