Similar documents
20 similar documents found (search time: 31 ms).
1.
Artificial Intelligence, 2007, 171(8-9): 453-490
This study extends the framework of partially observable Markov decision processes (POMDPs) to allow their parameters, i.e., the probability values in the state transition functions and the observation functions, to be imprecisely specified. It is shown that this extension can reduce the computational costs associated with the solution of these problems. First, the new framework, POMDPs with imprecise parameters (POMDPIPs), is formulated. We consider (1) the interval case, in which each parameter is imprecisely specified by an interval that indicates possible values of the parameter, and (2) the point-set case, in which each probability distribution is imprecisely specified by a set of possible distributions. Second, a new optimality criterion for POMDPIPs is introduced. As in POMDPs, the criterion is to regard a policy, i.e., an action-selection rule, as optimal if it maximizes the expected total reward. The expected total reward, however, cannot be calculated precisely in POMDPIPs, because of the parameter imprecision. Instead, we estimate the total reward by adopting arbitrary second-order beliefs, i.e., beliefs in the imprecisely specified state transition functions and observation functions. Although there are many possible choices for these second-order beliefs, we regard a policy as optimal as long as there is at least one of such choices with which the policy maximizes the total reward. Thus there can be multiple optimal policies for a POMDPIP. We regard these policies as equally optimal, and aim at obtaining one of them. By appropriately choosing which second-order beliefs to use in estimating the total reward, computational costs incurred in obtaining such an optimal policy can be reduced significantly. We provide an exact solution algorithm for POMDPIPs that does this efficiently. Third, the performance of such an optimal policy, as well as the computational complexity of the algorithm, are analyzed theoretically. Last, empirical studies show that our algorithm quickly obtains satisfactory policies to many POMDPIPs.  相似文献   
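For illustration, the optimality criterion described in this abstract can be stated schematically as follows; the notation (set of admissible second-order beliefs $M$, initial belief $b_0$, discount factor $\gamma$) is assumed here for readability and is not taken from the paper:

$$\pi^{*}\ \text{is optimal} \iff \exists\,\mu\in M:\ V_{\mu}^{\pi^{*}}(b_{0})=\max_{\pi}V_{\mu}^{\pi}(b_{0}), \qquad V_{\mu}^{\pi}(b_{0})=\mathbb{E}_{\mu,\pi}\!\Big[\sum_{t\ge 0}\gamma^{t}r_{t}\,\Big|\,b_{0}\Big],$$

where $M$ collects the state transition and observation functions consistent with the interval or point-set constraints. Because only one witnessing second-order belief $\mu$ is required, the solver is free to work with whichever choice is cheapest computationally, which is the source of the cost reduction the abstract describes.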

2.
For autonomous excavator operation, a time-optimal trajectory planning method based on reinforcement learning is proposed. First, a simulation environment is built to generate data: the joint angles and angular velocities of the boom, stick, and bucket are taken as the state observations and the angular accelerations of the joints as the actions, and the learning algorithm interacts with the simulation through these state observations. Then, a reward function based on whether the boom, stick, and bucket joint motions exceed their allowed ranges, the total task-completion time, and the relative distance to the target is designed to train the policy-network parameters. Finally, an improved proximal policy optimization (PPO) algorithm is used to achieve time-optimal trajectory planning for the excavator. Comparisons with reinforcement learning algorithms for different continuous action spaces show that the proposed algorithm is more efficient, converges faster, and produces smoother working trajectories, effectively preventing large impacts on the joints and helping the excavator operate efficiently and smoothly.
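A minimal sketch of the kind of per-step reward described above (joint-limit penalty, time penalty, and distance-to-target term). All weights, joint limits, and names are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Assumed joint limits for boom, stick, bucket (rad); illustrative only.
JOINT_LOW = np.array([-0.5, -2.0, -2.5])
JOINT_HIGH = np.array([1.2, 0.5, 0.5])

def step_reward(joint_angles, target_dist, dt, w_limit=10.0, w_time=1.0, w_dist=5.0):
    """Per-step reward for time-optimal trajectory planning (illustrative sketch)."""
    out_of_range = np.any(joint_angles < JOINT_LOW) or np.any(joint_angles > JOINT_HIGH)
    r = 0.0
    if out_of_range:
        r -= w_limit            # penalize leaving the allowed joint range
    r -= w_time * dt            # penalize elapsed time (encourages fast motion)
    r -= w_dist * target_dist   # penalize remaining distance to the target
    return r
```

The relative weights trade off speed against smoothness; the abstract's comparison suggests the improved PPO variant handles this trade-off with less joint impact than the baselines.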

3.
This paper studies a job search problem on a partially observable Markov chain, which can be considered an extension of the job search in a dynamic economy in [1]. The problem is formulated so that the state changes according to a partially observable Markov chain, i.e., the current state cannot be observed, but some information about the present state is available. All information about the unobservable state is summarized by probability distributions on the state space, and Bayes' theorem is employed as the learning procedure. Total positivity of order two, or simply TP2, is a fundamental property for investigating sequential decision problems, and it also plays an important role in the Bayesian learning procedure for a partially observable Markov process. Using this property, we examine relationships among the prior information, the posterior information, and the optimal policy. We also examine the probabilities of moving into each state after some additional transitions under the optimal policy. As a stock-market example, suppose the states correspond to the business condition of a company and one state designates default; the problem is then when to sell the stock before bankruptcy, and the probability of bankruptcy is also examined.
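As context for the Bayesian learning procedure mentioned above, the standard belief update for a partially observable Markov chain has the following form (notation assumed here: prior $\mu$ over states, transition matrix $P=(P_{ij})$, and $q_{j}(y)$ the probability of observing signal $y$ in state $j$):

$$\mu'(j)\;=\;\frac{\sum_{i}\mu(i)\,P_{ij}\,q_{j}(y)}{\sum_{k}\sum_{i}\mu(i)\,P_{ik}\,q_{k}(y)}.$$

Roughly, the TP2 property ensures that a likelihood-ratio ordering of priors is preserved by this update, which is what underlies the monotone structure of the optimal stopping rule.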

4.
This article demonstrates that Q-learning can be accelerated by appropriately specifying initial Q-values using a dynamic wave expansion neural network. In our method, the neural network has the same topography as the robot's workspace: each neuron corresponds to a certain discrete state. Every neuron of the network reaches an equilibrium state according to the initial environment information; when the network is stable, the activity of a given neuron denotes the maximum cumulative reward obtained by following the optimal policy from the corresponding state. The initial Q-values are then defined as the immediate reward plus the maximum cumulative reward obtained by following the optimal policy from the succeeding state. In this way, we create a mapping between the known environment information and the initial values of the Q-table based on the neural network. Prior knowledge can thus be incorporated into the learning system, giving robots a better learning foundation. Results of experiments on a grid-world problem show that neural-network-based Q-learning enables a robot to acquire an optimal policy with better learning performance than conventional Q-learning and potential-field-based Q-learning.
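A hedged sketch of the initialization idea described above: assuming the stabilized network activity V_hat[s] already encodes the maximum cumulative reward from state s, the Q-table is seeded with one-step lookahead values. The helpers `reward`, `next_state`, and the state/action index sets are hypothetical, not the paper's API.

```python
import numpy as np

def init_q_table(states, actions, V_hat, reward, next_state, gamma=0.95):
    """Seed Q(s, a) = r(s, a) + gamma * V_hat(s') before Q-learning starts."""
    Q = np.zeros((len(states), len(actions)))
    for s in states:
        for a in actions:
            s_next = next_state(s, a)                # deterministic grid-world move (assumed)
            Q[s, a] = reward(s, a) + gamma * V_hat[s_next]
    return Q
```

Starting Q-learning from this table, rather than from zeros, is what lets the prior environment knowledge shorten the exploration phase.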

5.
In partially observable Markov decision processes (POMDPs) whose model is unknown, the agent cannot directly access the true environment state, and this perceptual uncertainty makes learning an optimal policy challenging. To address this, a deep double Q-network reinforcement learning algorithm incorporating contrastive predictive coding representations is proposed: belief states are modeled explicitly to obtain compact, efficient history encodings for policy optimization. To improve data efficiency, the notion of a belief replay buffer is introduced, which stores belief-transition pairs directly rather than observation and action sequences, reducing memory usage. In addition, a staged training strategy is designed to decouple representation learning from policy learning and improve training stability. POMDP navigation tasks are designed in the Gym-MiniGrid environment, and experimental results show that the proposed algorithm captures state-related semantic information and thus achieves stable, efficient policy learning under POMDPs.
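A minimal sketch of the belief replay buffer idea described above: transitions are stored as (belief, action, reward, next_belief, done) tuples rather than raw observation/action histories. The structure and names are assumptions for illustration, not the paper's implementation.

```python
import random
from collections import deque

class BeliefReplayBuffer:
    """Stores belief-transition pairs directly to reduce memory usage."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, belief, action, reward, next_belief, done):
        self.buffer.append((belief, action, reward, next_belief, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        # Returns (beliefs, actions, rewards, next_beliefs, dones) as lists.
        return tuple(map(list, zip(*batch)))
```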

6.
Reinforcement learning is a learning scheme for finding the optimal policy to control a system, based on a scalar signal representing a reward or a punishment. If the observation of the system by the controller is sufficiently rich to represent the internal state of the system, the controller can achieve the optimal policy simply by learning reactive behavior. However, if the state of the controlled system cannot be assessed completely using current sensory observations, the controller must learn a dynamic behavior to achieve the optimal policy. In this paper, we propose a dynamic controller scheme which utilizes memory to uncover hidden states by using information about past system outputs, and makes control decisions using memory. This scheme integrates Q-learning, as proposed by Watkins, and recurrent neural networks of several types. It performs favorably in simulations which involve a task with hidden states. This work was presented, in part, at the International Symposium on Artificial Life and Robotics, Oita, Japan, February 18–20, 1996  相似文献   
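A minimal sketch of the general idea described above, combining Q-learning with a recurrent network so that memory over past observations stands in for the hidden state. The sizes, layer choice, and class name are assumptions for illustration and do not reproduce the paper's specific architectures.

```python
import torch
import torch.nn as nn

class RecurrentQNetwork(nn.Module):
    """Q-network whose LSTM state acts as memory over past observations."""
    def __init__(self, obs_dim, n_actions, hidden_dim=64):
        super().__init__()
        self.rnn = nn.LSTM(obs_dim, hidden_dim, batch_first=True)
        self.q_head = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs_seq, hidden=None):
        # obs_seq: (batch, time, obs_dim); the recurrent state carries the memory.
        out, hidden = self.rnn(obs_seq, hidden)
        q_values = self.q_head(out)          # one Q-value per action at each step
        return q_values, hidden
```

Control decisions are then made from the Q-values conditioned on the recurrent state, rather than on the current observation alone.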

7.
A sampling-based approximate algorithm for POMDPs
The partially observable Markov decision process (POMDP) is a problem model describing a robot's action selection in dynamic, uncertain environments. For POMDP models with sparse transition matrices, this paper proposes a fast approximate solution algorithm. The algorithm first samples the belief space using the policy produced by the QMDP algorithm, then rapidly generates the POMDP value function via point-based iteration, yielding an approximately optimal action-selection policy. On the same POMDP test models, the policy produced by this algorithm achieves returns comparable to those of policies produced by other approximate algorithms, but the algorithm is faster, and the set of vectors representing its policy is smaller than those produced by existing approximate algorithms. It is therefore better suited than these approximate algorithms to solving large-scale POMDP models with sparse state-transition matrices.
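For context, the standard QMDP heuristic used here to drive belief-space sampling scores each action by weighting the fully observable MDP values with the current belief (notation assumed here):

$$\hat{Q}(b,a)\;=\;\sum_{s\in S}b(s)\,Q_{\text{MDP}}(s,a),\qquad \pi_{\text{QMDP}}(b)\;=\;\arg\max_{a}\hat{Q}(b,a),$$

where $Q_{\text{MDP}}$ is computed for the underlying MDP as if the state were observable. The beliefs visited while executing $\pi_{\text{QMDP}}$ form the sample set on which the point-based value iteration is run.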

8.
In this paper, the optimal maintenance policy for a multi-state system with no observation is considered. Different from most existing works, only a limited number of imperfect preventive maintenance actions can be performed between two successive replacements. It is assumed that the system's deterioration state cannot be observed during operation except after each replacement, and that it evolves as a discrete-time Markov chain with a finite state space. After choosing the information state as the state variable, the problem is formulated as a Markov decision process over the infinite time horizon with the objective of minimising the total expected cost per unit time. To increase computational efficiency, several key structural properties are developed. The existence of an optimal threshold-type maintenance policy is proved and the monotonicity of the threshold is obtained. Finally, a numerical example is given to illustrate the optimal policy.

9.
This paper considers the problems of fault diagnosis and supervisory control in discrete event systems in the context of a new observation paradigm. For events that are considered observable, a cost is incurred each time a sensor is activated in an attempt to make an event observation. In such a situation the best strategy is to perform an "active acquisition" of information, i.e., to choose which sensors need to be activated based on the information state generated from the previous readings of the system. Depending on the sample path executed by the system, different sensors may be turned on or off at different stages of the process. We consider the active acquisition of information problem for both logical and stochastic discrete event systems. We consider three classes of increasing complexity: first, acyclic systems where events are synchronized to clock ticks; second, acyclic untimed systems; and last, general cyclic automata. For each of these cases we define a notion of information state for the problem, determine conditions for the existence of an optimal policy, and construct a dynamic program to find an optimal policy where one exists. For large systems, a limited lookahead algorithm is proposed for computational savings.

10.
This paper studies distributed policy evaluation in multi-agent reinforcement learning. In the cooperative setting, each agent obtains only a local reward, while all agents share a common environmental state. To optimize the global return, defined as the sum of the local returns, the agents exchange information with their neighbors over a communication network. The mean squared projected Bellman error minimization problem is reformulated as a constrained convex optimization problem with a consensus constraint, and a distributed alternating direction method of multipliers (ADMM) algorithm is proposed to solve it. Furthermore, an inexact step for ADMM is used to achieve efficient computation at each iteration. The convergence of the proposed algorithm is established.
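As a schematic for the reformulation described above (notation assumed here, with $f_i$ standing for agent $i$'s local projected Bellman error term), the consensus-constrained problem and the standard global-consensus ADMM template read

$$\min_{\theta_{1},\dots,\theta_{N},\,z}\ \sum_{i=1}^{N}f_{i}(\theta_{i})\quad\text{s.t.}\quad \theta_{i}=z,\ i=1,\dots,N,$$

$$\theta_{i}^{k+1}=\arg\min_{\theta_{i}}\Big\{f_{i}(\theta_{i})+\tfrac{\rho}{2}\lVert\theta_{i}-z^{k}+u_{i}^{k}\rVert^{2}\Big\},\qquad z^{k+1}=\tfrac{1}{N}\sum_{i}\bigl(\theta_{i}^{k+1}+u_{i}^{k}\bigr),\qquad u_{i}^{k+1}=u_{i}^{k}+\theta_{i}^{k+1}-z^{k+1}.$$

This is the simplest global-consensus form; the paper's algorithm enforces consensus in a decentralized way over the communication network. An "inexact" step typically replaces the exact $\theta_i$-subproblem solve with an approximate one (for example, a single linearized or gradient step), which is what keeps each iteration cheap.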

11.
We describe a heuristic control policy for a general finite-horizon stochastic control problem, which can be used when the current process disturbance is not conditionally independent of the previous disturbances, given the current state. At each time step, we approximate the distribution of future disturbances (conditioned on what has been observed) by a product distribution with the same marginals. We then carry out dynamic programming (DP), using this modified future disturbance distribution, to find an optimal policy, and in particular, the optimal current action. We then execute only the optimal current action. At the next step, we update the conditional distribution, and repeat the process, this time with a horizon reduced by one step. (This explains the name 'shrinking-horizon dynamic programming'.) We explain how the method can be thought of as an extension of model predictive control, and illustrate our method on two variations on a revenue management problem.
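A schematic sketch of the control loop described above. The callables `conditional_marginals`, `solve_dp`, and `step` are hypothetical placeholders supplied by the user; only the loop structure (re-condition, re-solve, apply the first action, shrink the horizon) reflects the method.

```python
def shrinking_horizon_control(x0, T, conditional_marginals, solve_dp, step):
    """Run shrinking-horizon DP for T steps starting from state x0 (illustrative)."""
    x, history = x0, []
    for t in range(T):
        # Approximate the joint law of future disturbances by the product of its
        # marginals, conditioned on everything observed so far.
        marginals = conditional_marginals(history, horizon=T - t)
        policy = solve_dp(x, marginals)        # DP over the remaining T - t steps
        u = policy(x)                          # execute only the current optimal action
        x, w = step(x, u)                      # observe next state and disturbance
        history.append(w)
    return x
```

As in model predictive control, only the first action of each solved plan is applied before the problem is re-solved with updated information.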

12.
Admission control of hospitalization considering patient gender is an interesting issue in the study of hospital bed management. This paper addresses the decision of whether an arriving patient should immediately be admitted to a same-gender room or rejected. Whether a patient is admitted depends on several conditions, such as his/her health condition, gender, the availability of beds, the length of stay, and the reward of hospitalization. Focusing on the key factor, patient gender, this paper sets up an infinite-horizon total discounted reward Markov decision process model with the purpose of maximizing the total expected reward for the hospital, which leads to an optimal dynamic policy. The structural properties of the optimal policy are then analyzed, and a value iteration algorithm is proposed to find the optimal policy. Finally, numerical experiments are used to discuss how the optimal dynamic policy depends on some key parameters of the system, and the performance of the optimal policy is discussed through comparison with three other policies by simulating different scenarios.
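A generic value-iteration sketch for a discounted MDP, of the kind used here to compute the optimal admission policy. The inputs `P` (one transition matrix per action) and `R` (one expected-reward vector per action) are illustrative placeholders, not the paper's model.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    """P: {action: (n, n) matrix}, R: {action: (n,) vector}. Returns values and greedy actions."""
    actions = list(P)
    n_states = P[actions[0]].shape[0]
    V = np.zeros(n_states)
    while True:
        Q = np.stack([R[a] + gamma * P[a] @ V for a in actions])  # one row per action
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            greedy = [actions[i] for i in Q.argmax(axis=0)]       # greedy action per state
            return V_new, greedy
        V = V_new
```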

13.
Fuzzy control of arrivals to tandem queues with two stations
We consider two queues in tandem, each of which has its own input of arriving customers, which, in turn, may either be accepted or rejected. Suppose that the system receives a fixed reward for each accepted customer and pays a holding cost per customer per unit time in the system. The objective is to dynamically determine the optimal admission policies based on the state of the system so as to maximize the average profit (reward minus cost) over an infinite horizon. This model finds applications in flow control in communication systems, industrial job shops, and traffic-flow systems. In this paper, a novel approach is presented using fuzzy control. This approach explicitly finds the optimal policy, which up until now has defied analytical solutions  相似文献   

14.
We generalize and build on the PAC learning framework for Markov decision processes developed in Jain and Varaiya (2006). We consider the reward function to depend on both the state and the action; both the state and action spaces can potentially be countably infinite. We obtain an estimate for the value function of a Markov decision process, which assigns to each policy its expected discounted reward. This expected reward can be estimated as the empirical average of the reward over many independent simulation runs. We derive bounds on the number of runs needed for the empirical average to converge to the expected reward uniformly over a class of policies, in terms of the VC or pseudo-dimension of the policy class. We then propose a framework to obtain an ε-optimal policy from simulation, and provide the sample complexity of this approach.
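Schematically, with notation assumed here rather than taken from the paper, the simulation estimate and the uniform-convergence guarantee have the following shape:

$$\hat{V}^{\pi}\;=\;\frac{1}{n}\sum_{i=1}^{n}\sum_{t\ge 0}\gamma^{t}r_{t}^{(i)},\qquad \Pr\!\Big(\sup_{\pi\in\Pi}\big|\hat{V}^{\pi}-V^{\pi}\big|\le\epsilon\Big)\ \ge\ 1-\delta,$$

where the number of independent runs $n$ required is bounded in terms of $\epsilon$, $\delta$, and the VC or pseudo-dimension of the policy class $\Pi$; maximizing the empirical estimate over $\Pi$ then yields a near-optimal policy.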

15.
In this paper, we use the event-based optimization framework to study the admission control problem in an open Jackson network. Externally arriving customers are controlled by an admission controller, which makes a decision only at the epoch when a customer-arrival event happens. Every customer in the network obtains a fixed reward for each service completion and pays a cost at a fixed rate during its sojourn time (including waiting time and service time). Complete information about the system state is not available to the controller, which can observe only the total number of customers in the network. Based on the closed-form solution of Jackson networks, we prove that the system performance is monotonic with respect to the admission probabilities and that the optimal control policy has a threshold form: when the total number of customers is smaller than a threshold, all arriving customers are admitted; otherwise, all are rejected. Sufficient and necessary conditions for the optimal policy are also derived. Furthermore, we develop an iterative algorithm to find the optimal policy. The algorithm can be executed based on a single sample path, which makes it implementable online. Simulation experiments are conducted to demonstrate the effectiveness of our approach.
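A minimal illustration of the threshold-form policy proved above: an arriving customer is admitted only while the total number of customers in the network is below a threshold. The threshold value itself comes from the optimization and is treated here as a given parameter.

```python
def admit(total_customers: int, threshold: int) -> bool:
    """Threshold admission rule: admit while the network holds fewer than `threshold` customers."""
    return total_customers < threshold
```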

16.
钱煜  俞扬  周志华 《软件学报》2013,24(11):2667-2675
Reinforcement learning enables an agent to make correct short-term decisions by learning from feedback on past decisions, so as to maximize its cumulative reward. Previous studies have found that reward shaping, which replaces the true environmental reward with a simple, easy-to-learn surrogate reward function (the reward-shaping function), can effectively improve reinforcement learning performance. However, reward-shaping functions are usually constructed from domain knowledge or demonstrations of an optimal policy, both of which require expert involvement and are costly. This work studies whether an effective reward-shaping function can be learned automatically during reinforcement learning. Reinforcement learning algorithms typically collect a large number of samples during learning; although many of these samples are failed attempts, they may still provide useful information for constructing a reward-shaping function. A new optimal-policy invariance condition for reward shaping is proposed, and on this basis the RFPotential method is proposed to learn reward shaping from self-generated samples. Experiments on multiple reinforcement learning algorithms and problems show that the method can accelerate the reinforcement learning process.
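For context, the classical potential-based condition under which shaping leaves the optimal policy unchanged (in the form of Ng et al., stated here in assumed notation; the paper proposes a new invariance condition and learns the shaping function from self-generated samples) is

$$r'(s,a,s')\;=\;r(s,a,s')+F(s,a,s'),\qquad F(s,a,s')\;=\;\gamma\,\Phi(s')-\Phi(s),$$

for an arbitrary potential function $\Phi$ over states. Shaping rewards of this form shift the value of every policy by the same state-dependent offset, so the ranking of policies, and hence the optimal policy, is preserved.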

17.
We consider a group of several non-Bayesian agents that can fully coordinate their activities and share their past experience in order to obtain a joint goal in the face of uncertainty. The reward obtained by each agent is a function of the environment state but not of the actions taken by the other agents in the group. The environment state (controlled by Nature) may change arbitrarily, and the reward function is initially unknown. Two basic feedback structures are considered. In the perfect monitoring case, the agents are able to observe the previous environment state as part of their feedback, while in the imperfect monitoring case, all that is available to the agents are the rewards obtained. Both settings refer to partially observable processes, where the current environment state is unknown. Our study uses the competitive ratio criterion. It is shown that, in the imperfect monitoring case, there exists an efficient stochastic policy ensuring that the competitive ratio is obtained for all agents at almost all stages with arbitrarily high probability, where efficiency is measured in terms of the rate of convergence. It is also shown that if the agents are restricted to deterministic policies, then no such policy exists, even in the perfect monitoring case.

18.
The sensitivity-based optimization of Markov systems has become an increasingly important area. From the perspective of performance sensitivity analysis, policy-iteration algorithms and gradient estimation methods can be directly obtained for Markov decision processes (MDPs). In this correspondence, the sensitivity-based optimization is extended to average reward partially observable MDPs (POMDPs). We derive the performance-difference and performance-derivative formulas of POMDPs. On the basis of the performance-derivative formula, we present a new method to estimate the performance gradients. From the performance-difference formula, we obtain a sufficient optimality condition without the discounted reward formulation. We also propose a policy-iteration algorithm to obtain a nearly optimal finite-state-controller policy.   相似文献   
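For reference, the performance-difference formula for fully observed average-reward MDPs, which this correspondence extends to POMDPs, reads (in the usual sensitivity-analysis notation, assumed here):

$$\eta'-\eta\;=\;\pi'\bigl[(P'-P)\,g+(f'-f)\bigr],$$

where $\eta,\eta'$ are the average rewards of the two policies, $P,P'$ and $f,f'$ their transition matrices and reward vectors, $\pi'$ the steady-state distribution under the new policy, and $g$ the performance potential of the original one. Comparing the bracketed term state by state gives policy iteration, and differentiating along a path between the two policies gives gradient estimates, which is the pattern the abstract describes for the POMDP case.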

19.
We address an unrelated parallel machine scheduling problem with R-learning, an average-reward reinforcement learning (RL) method. Different types of jobs arrive dynamically according to independent Poisson processes, so the arrival time and the due date of each job are stochastic. We convert the scheduling problem into an RL problem by constructing elaborate state features, actions, and a reward function; the state features and actions are defined by fully utilizing prior domain knowledge. Minimizing the reward per decision time step is equivalent to minimizing the schedule objective, i.e., mean weighted tardiness. We apply an on-line R-learning algorithm with function approximation to solve the RL problem. Computational experiments demonstrate that R-learning learns an optimal or near-optimal policy in a dynamic environment from experience and outperforms four effective heuristic priority rules (WSPT, WMDD, ATC and WCOVERT) in all test problems.
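For context, the standard tabular R-learning update that the online, function-approximation variant used here builds on has the form (with step sizes $\alpha,\beta$ and average-reward estimate $\rho$; this is the textbook rule, not the paper's exact variant):

$$Q(s,a)\ \leftarrow\ Q(s,a)+\alpha\bigl[r-\rho+\max_{a'}Q(s',a')-Q(s,a)\bigr],$$
$$\rho\ \leftarrow\ \rho+\beta\bigl[r-\rho+\max_{a'}Q(s',a')-\max_{a}Q(s,a)\bigr]\quad\text{(when the greedy action was taken)}.$$

In the paper's setting the table is replaced by function approximation over the constructed state features, and the per-step reward encodes the weighted tardiness objective.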

20.
Partially observable Markov decision processes (POMDPs) provide a mathematical framework for agent planning in stochastic and partially observable environments. The classic Bayesian optimal solution can be obtained by transforming the problem into a Markov decision process (MDP) using belief states. However, because the belief-state space is continuous and multi-dimensional, the problem is highly intractable. Many practical heuristic-based methods have been proposed, but most of them require a complete POMDP model of the environment, which is not always practical. This article introduces a modified memory-based reinforcement learning algorithm, called modified U-Tree, that is capable of learning from raw sensor experiences with minimum prior knowledge. The article describes an enhancement of the original U-Tree's state generation process that makes the generated model more compact, and also proposes a modification of the statistical test for reward estimation, which allows the algorithm to be benchmarked against some traditional model-based algorithms on a set of well-known POMDP problems.

