Similar Documents
20 similar documents retrieved.
1.
A Reinforcement Learning Scheme for a Partially-Observable Multi-Agent Game
We formulate an automatic strategy acquisition problem for the multi-agent card game Hearts as a reinforcement learning problem. The problem can approximately be dealt with in the framework of a partially observable Markov decision process (POMDP) for a single-agent system. Hearts is an example of imperfect information games, which are more difficult to deal with than perfect information games. A POMDP is a decision problem that includes a process for estimating unobservable state variables. By regarding missing information as unobservable state variables, an imperfect information game can be formulated as a POMDP. However, the game of Hearts is a realistic problem that has a huge number of possible states, even when it is approximated as a single-agent system. Therefore, further approximation is necessary to make the strategy acquisition problem tractable. This article presents an approximation method based on estimating unobservable state variables and predicting the actions of the other agents. Simulation results show that our reinforcement learning method is applicable to such a difficult multi-agent problem.
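The key step in treating missing information (here, the cards the learner cannot see) as unobservable state is the standard POMDP belief update. A minimal sketch of that update for a discrete model follows; the transition and observation arrays are illustrative placeholders, not the authors' actual Hearts model.

```python
import numpy as np

def belief_update(belief, action, observation, T, O):
    """One Bayes filter step: b'(s') is proportional to O(o | s', a) * sum_s T(s' | s, a) * b(s).

    belief : (S,)        current distribution over hidden states
    T      : (A, S, S)   transition probabilities, T[a, s, s']
    O      : (A, S, Obs) observation probabilities, O[a, s', o]
    """
    predicted = belief @ T[action]              # prediction: sum_s T(s'|s,a) b(s)
    unnormalized = O[action, :, observation] * predicted
    total = unnormalized.sum()
    if total == 0.0:                            # observation impossible under the model
        raise ValueError("observation has zero likelihood under the current belief")
    return unnormalized / total
```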

2.
Reinforcement learning (RL) is an area of machine learning concerned with how an agent learns to make decisions sequentially in order to optimize a particular performance measure. To achieve such a goal, the agent has to choose between 1) exploiting previously acquired knowledge, which might end up at a local optimum, or 2) exploring to gather new knowledge that is expected to improve current performance. Among RL algorithms, Bayesian model-based RL (BRL) is well known for being able to trade off exploitation and exploration optimally via belief planning, i.e., by solving a partially observable Markov decision process (POMDP). However, solving that POMDP often suffers from the curse of dimensionality and the curse of history. In this paper, we make two major contributions: 1) a framework that integrates temporal abstraction into BRL, resulting in a hierarchical POMDP formulation that can be solved online with a hierarchical sample-based planning solver; and 2) a subgoal discovery method for hierarchical BRL that automatically discovers useful macro actions to accelerate learning. In the experiments, we demonstrate that the proposed approach scales up to much larger problems and that the agent is able to discover useful subgoals for speeding up Bayesian reinforcement learning.
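For intuition, the Bayesian model-based setting can be pictured as maintaining a posterior over the unknown dynamics and planning against models sampled from it. The sketch below keeps Dirichlet counts over transitions and draws candidate models from the posterior for a sample-based planner; the class and method names are assumptions, not the paper's hierarchical solver.

```python
import numpy as np

class DirichletTransitionBelief:
    """Posterior over an unknown transition model, one Dirichlet per (state, action) pair."""

    def __init__(self, n_states, n_actions, prior=1.0):
        self.counts = np.full((n_states, n_actions, n_states), prior)

    def update(self, s, a, s_next):
        self.counts[s, a, s_next] += 1.0        # Bayesian update from one observed transition

    def sample_model(self, rng=None):
        rng = rng or np.random.default_rng()
        n_states, n_actions, _ = self.counts.shape
        # One complete transition model drawn from the posterior (a planning "root sample")
        return np.array([[rng.dirichlet(self.counts[s, a])
                          for a in range(n_actions)]
                         for s in range(n_states)])
```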

3.
Partially observable Markov decision processes (POMDPs) provide a mathematical framework for agent planning in stochastic and partially observable environments. The classic Bayesian optimal solution can be obtained by transforming the problem into a Markov decision process (MDP) over belief states. However, because the belief space is continuous and multi-dimensional, the problem is highly intractable. Many practical heuristic-based methods have been proposed, but most require a complete POMDP model of the environment, which is not always available. This article introduces a memory-based reinforcement learning algorithm, modified U-Tree, that is capable of learning from raw sensor experiences with minimal prior knowledge. The article describes an enhancement of the original U-Tree's state generation process that makes the generated model more compact, and also proposes a modification of the statistical test used for reward estimation, which allows the algorithm to be benchmarked against traditional model-based algorithms on a set of well-known POMDP problems.
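A rough sketch of the U-Tree-style splitting idea: a leaf of the state tree is split on a candidate history feature when the future-reward samples it separates differ significantly under a statistical test. The Kolmogorov-Smirnov test used here is only one concrete choice, not necessarily the modified test proposed in the article.

```python
from scipy.stats import ks_2samp

def should_split(instances, feature, alpha=0.05):
    """instances: (history, discounted_future_reward) pairs currently stored in one leaf;
    feature: a predicate on histories that is a candidate distinction for the split."""
    group_a = [reward for history, reward in instances if feature(history)]
    group_b = [reward for history, reward in instances if not feature(history)]
    if len(group_a) < 2 or len(group_b) < 2:
        return False                            # too little evidence to justify a split
    _, p_value = ks_2samp(group_a, group_b)
    return p_value < alpha                      # reward distributions differ -> split the leaf
```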

4.
Fujita H, Ishii S. Neural Computation, 2007, 19(11): 3051-3087
Games constitute a challenging domain of reinforcement learning (RL) for acquiring strategies because many of them include multiple players and many unobservable variables in a large state space. The difficulty of solving such realistic multiagent problems with partial observability arises mainly from the fact that the computational cost for the estimation and prediction in the whole state space, including unobservable variables, is too heavy. To overcome this intractability and enable an agent to learn in an unknown environment, an effective approximation method is required with explicit learning of the environmental model. We present a model-based RL scheme for large-scale multiagent problems with partial observability and apply it to a card game, hearts. This game is a well-defined example of an imperfect information game and can be approximately formulated as a partially observable Markov decision process (POMDP) for a single learning agent. To reduce the computational cost, we use a sampling technique in which the heavy integration required for the estimation and prediction can be approximated by a plausible number of samples. Computer simulation results show that our method is effective in solving such a difficult, partially observable multiagent problem.
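The sampling technique mentioned above can be read as a particle-style approximation of the belief: the integral over unobservable variables is replaced by a finite set of sampled states. Below is a hedged sketch of one such update step; sample_transition and observation_likelihood stand in for the learned, game-specific environment model.

```python
import random

def particle_belief_update(particles, action, observation,
                           sample_transition, observation_likelihood,
                           n_particles, rng=random):
    # Propagate each sampled hidden state through the (learned) environment model
    proposed = [sample_transition(p, action) for p in particles]
    # Weight each sample by how well it explains the new observation, then resample
    weights = [observation_likelihood(observation, p, action) for p in proposed]
    if sum(weights) == 0:
        return rng.choices(proposed, k=n_particles)          # degenerate case: keep the prediction
    return rng.choices(proposed, weights=weights, k=n_particles)
```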

5.
In a partially observable Markov decision process (POMDP) whose model is unknown, the agent cannot directly access the true state of the environment, and perceptual uncertainty makes learning an optimal policy challenging. To address this, we propose a deep double Q-network reinforcement learning algorithm that incorporates contrastive predictive coding representations: the belief state is modeled explicitly to obtain a compact and efficient encoding of the history for policy optimization. To improve data efficiency, we introduce the concept of a belief replay buffer, which stores belief transition pairs directly instead of observation and action sequences, reducing memory usage. In addition, a staged training strategy is designed to decouple representation learning from policy learning and improve training stability. POMDP navigation tasks were designed on the Gym-MiniGrid environment, and experimental results show that the proposed algorithm captures state-relevant semantic information and thus achieves stable and efficient policy learning under partial observability.
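The belief replay buffer described above can be pictured as an ordinary experience replay that stores belief transitions rather than raw observation and action sequences. A minimal sketch under that reading follows; the class and field names are assumptions, not the paper's implementation.

```python
import random
from collections import deque

class BeliefReplayBuffer:
    """Stores (belief, action, reward, next_belief, done) tuples produced by the encoder."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, belief, action, reward, next_belief, done):
        self.buffer.append((belief, action, reward, next_belief, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        beliefs, actions, rewards, next_beliefs, dones = zip(*batch)
        return beliefs, actions, rewards, next_beliefs, dones
```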

6.
In this paper, we first discuss the meaning of physical embodiment and the complexity of the environment in the context of multi-agent learning. We then propose a vision-based reinforcement learning method that acquires cooperative behaviors in a dynamic environment. We use the robot soccer game initiated by RoboCup (Kitano et al., 1997) to illustrate the effectiveness of our method. Each agent works with other team members to achieve a common goal against opponents. Our method estimates the relationships between a learner's behaviors and those of other agents in the environment through interactions (observations and actions) using a technique from system identification. In order to identify the model of each agent, Akaike's Information Criterion is applied to the results of Canonical Variate Analysis to clarify the relationship between the observed data in terms of actions and future observations. Next, reinforcement learning based on the estimated state vectors is performed to obtain the optimal behavior policy. The proposed method is applied to a soccer playing situation. The method successfully models a rolling ball and other moving agents and acquires the learner's behaviors. Computer simulations and real experiments are shown and a discussion is given.
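As a rough illustration of the model-identification step, the sketch below selects among candidate model orders by Akaike's Information Criterion, AIC = 2k - 2 ln L. The fit_model() callable stands in for the Canonical Variate Analysis fit and is an assumption, not the authors' code.

```python
import math

def select_model_order(candidate_orders, fit_model):
    """fit_model(order) is assumed to return (maximized log-likelihood, number of parameters)."""
    best_order, best_aic = None, math.inf
    for order in candidate_orders:
        log_likelihood, n_params = fit_model(order)
        aic = 2 * n_params - 2 * log_likelihood  # penalize model complexity against fit quality
        if aic < best_aic:
            best_order, best_aic = order, aic
    return best_order, best_aic
```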

7.
In this paper, we address the problem of suboptimal behavior during online partially observable Markov decision process (POMDP) planning caused by time constraints on planning. Taking inspiration from the related field of reinforcement learning (RL), our solution is to shape the agent's reward function in order to lead the agent to large future rewards without having to spend as much time explicitly estimating cumulative future rewards, enabling the agent to use the time saved to improve the breadth of its planning and build higher-quality plans. Specifically, we extend potential-based reward shaping (PBRS) from RL to online POMDP planning. In our extension, information about belief states is added to the function optimized by the agent during planning. This information provides hints of where the agent might find high future rewards beyond its planning horizon, and thus achieve greater cumulative rewards. We develop novel potential functions measuring information useful to agent metareasoning in POMDPs (reflecting on agent knowledge and/or histories of experience with the environment), theoretically prove several important properties and benefits of using PBRS for online POMDP planning, and empirically demonstrate these results in a range of classic benchmark POMDP planning problems.
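The core of potential-based reward shaping carries over to belief states directly: the planner optimizes r + F(b, b'), where F(b, b') = gamma * Phi(b') - Phi(b), which leaves the optimal policy unchanged. A minimal sketch follows; the negative-entropy potential is only one illustrative choice of Phi, not necessarily one of the paper's potential functions.

```python
import math

def negative_entropy_potential(belief):
    # More certain beliefs get higher potential, hinting toward informative branches
    return sum(p * math.log(p) for p in belief if p > 0.0)

def shaped_reward(reward, belief, next_belief, gamma,
                  potential=negative_entropy_potential):
    # F(b, b') = gamma * Phi(b') - Phi(b); adding F preserves the optimal policy
    return reward + gamma * potential(next_belief) - potential(belief)
```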

8.
In some video games, humans and computer programs can play together, each one controlling a virtual humanoid. These computer programs usually aim at replacing missing human players; however, they partially miss their goal, as they can easily be spotted by players as being artificial. Our objective is to find a method for creating programs whose behaviors cannot be told apart from players' when observed playing the game. We call this kind of behavior a believable behavior. To achieve this goal, we choose models using Markov chains to generate the behaviors by imitation. Such models use probability distributions to decide which decision to choose depending on the perceptions of the virtual humanoid; actions are then chosen depending on the perceptions and the decision. We propose a new model, called Chameleon, to enhance expressiveness, together with an associated imitation learning algorithm. We first organize the sensors and motors by semantic refinement and add a focus mechanism in order to improve believability. Then, we integrate an algorithm that learns the topology of the environment and tries to best represent the players' use of the environment. Finally, we propose an algorithm to learn the parameters of the decision model.
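A stripped-down sketch of the Markov-chain imitation idea: estimate how often observed players move from one decision to another, then sample the next decision from those empirical probabilities. The perception conditioning, semantic refinement, and focus mechanism of Chameleon are omitted here.

```python
import random
from collections import Counter, defaultdict

def fit_decision_chain(traces):
    """traces: sequences of decisions observed while human players play the game."""
    counts = defaultdict(Counter)
    for trace in traces:
        for current, nxt in zip(trace, trace[1:]):
            counts[current][nxt] += 1           # empirical transition counts
    return counts

def sample_next_decision(counts, current):
    options = counts[current]
    if not options:
        raise KeyError(f"no observed successors for decision {current!r}")
    decisions = list(options)
    weights = [options[d] for d in decisions]
    return random.choices(decisions, weights=weights, k=1)[0]
```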

9.
One of the advantages of immune-based approaches is the use of permanent memory cells. These memory cells make it unnecessary to relearn strategies that have already been played, and consequently speed up the decision-making process. In the method proposed in this article, memory cells represent the actions with the best local payoff for the current state of the game and are generated during the learning process. These cells help the decision-making system choose better actions by taking the previous and future states of the game into account. The decision-making system used in this method is based on a Mamdani fuzzy inference system (FIS). The FIS proposes the best action for the current state of the board by drawing on the data stored in the memory cells. Experiments show that the immune-based fuzzy agent introduced here outperforms previous methods. The new method shows notable resistance when confronting a player that uses the complete game tree, and it can suggest an action for each state of the game while requiring fewer generations than other evolution-based methods.

10.
The creation of intelligent video game controllers has recently become one of the greatest challenges in game artificial intelligence research, and it is arguably one of the fastest-growing areas in game design and development. The learning process, a very important feature of intelligent methods, is what allows an intelligent game controller to determine and control game objects' behaviors or actions autonomously. Our approach is to use a more efficient learning model, in the form of artificial neural networks, for training the controllers. We propose a Hill-Climbing Neural Network (HillClimbNet) that controls the movement of the Ms. Pac-man agent as it travels around the maze, gobbles all of the pills and escapes from the ghosts. HillClimbNet combines the hill-climbing strategy with a simple, feed-forward artificial neural network architecture. The aim of this study is to analyze the performance of various activation functions for generating neural-based controllers to play a video game. Each non-linear activation function is applied identically to all the nodes in the network, namely log-sigmoid, logarithmic, hyperbolic tangent-sigmoid and Gaussian. In general, the results show that the best configuration is achieved using log-sigmoid, while Gaussian is the worst activation function.
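A hedged sketch of the hill-climbing plus feed-forward combination: perturb the network weights and keep a perturbation only if the evaluated game score improves. The network shape, the exact form of the logarithmic activation, and the evaluate() callable are assumptions, not the HillClimbNet code.

```python
import numpy as np

ACTIVATIONS = {
    "log-sigmoid": lambda x: 1.0 / (1.0 + np.exp(-x)),
    "logarithmic": lambda x: np.sign(x) * np.log1p(np.abs(x)),  # assumed form of the log activation
    "tanh-sigmoid": np.tanh,
    "gaussian": lambda x: np.exp(-x * x),
}

def forward(weights, inputs, activation):
    w_hidden, w_out = weights
    hidden = ACTIVATIONS[activation](inputs @ w_hidden)
    return hidden @ w_out                        # raw movement scores for the agent

def hill_climb(weights, evaluate, activation, steps=1000, noise=0.05, rng=None):
    """evaluate(weights, activation) is assumed to return a game score for the controller."""
    rng = rng or np.random.default_rng()
    best_score = evaluate(weights, activation)
    for _ in range(steps):
        candidate = [w + noise * rng.standard_normal(w.shape) for w in weights]
        score = evaluate(candidate, activation)
        if score > best_score:                   # greedy: keep only improving perturbations
            weights, best_score = candidate, score
    return weights, best_score
```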

11.
Recent studies exploring the effects of dynamic visualizations on learning compared with static visualizations have yielded mixed results. Procedural motor learning is one of the few fields in which dynamic representations have been shown to be effective. Many of the studies have suggested that this advantage is mainly due to the activation of the "mirror-neuron system." This study explores this explanation in the physical education domain and analyses the effects of instructional media (video vs. photographs), showing tactical actions in basketball, on learning outcomes (i.e., game understanding and game performance), cognitive load (i.e., mental effort invested and estimated difficulty), and attitudes (i.e., attention, enjoyment, engagement, and challenge) in secondary school students. For all of the indicators, the results show that learning from video was more effective than learning from photographs. These findings have implications for the effective design of instructional media and provide confirmation of the superiority of video for teaching tactical actions involving the entire body.

12.
It is important to develop an understanding of children’s engagement and choices in learning experiences outside of school as this has implications for their development and orientations to other learning environments. This mixed-methods study examines relationships between the genres of video games children choose to play and the learning strategies they employ to improve at these games. It also explores students’ motivations for playing the games they choose to play. One hundred eighteen fourth- and fifth-grade students participated in this study. Qualitative analyses of student responses resulted in a model for classifying motivation for game choices. Children primarily cite reasons that can be classified as psychological or cognitive reasons for choosing to play certain video games, and are motivated by the challenge and thinking required in the games. Analyses using Chi-square tests of association demonstrated significant relationships between video game genre and learning strategy used for two of the six learning strategies (p < .05). Children playing action games are more likely to use repetition to learn the game and children playing adventure games are more likely to use their imaginations to take on the role of the character in the game and think the way the character would to make decisions in the game. There were also several gender differences in learning preferences.

13.
Shihao Ji. Pattern Recognition, 2007, 40(5): 1474-1485
There are many sensing challenges for which one must balance the effectiveness of a given measurement with the associated sensing cost. For example, when performing a diagnosis, a doctor must balance the cost and benefit of a given test (measurement), and the decision to stop sensing (stop performing tests) must account for the risk to the patient and doctor (malpractice) of a given diagnosis based on the observed data. This motivates a cost-sensitive classification problem in which the features (sensing results) are not given a priori; the algorithm determines which features to acquire next, as well as when to stop sensing and make a classification decision based on previous observations (accounting for the costs of various types of errors, as well as the rewards of being correct). We formally define the cost-sensitive classification problem and solve it via a partially observable Markov decision process (POMDP). While the POMDP constitutes an intuitively appealing formulation, the intrinsic properties of classification tasks resist its direct application to this problem. We circumvent the difficulties of the POMDP via a myopic approach, with an adaptive stopping criterion linked to the standard POMDP. The myopic algorithm is computationally feasible, easily handles continuous features, and seamlessly avoids repeated actions. Experiments with several benchmark data sets show that the proposed method yields state-of-the-art performance, and, importantly, our method uses only a small fraction of the features that are generally used in competitive approaches.
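The myopic loop can be summarized as: at each step acquire the unobserved feature whose expected benefit most exceeds its cost, and stop (classify) as soon as no remaining feature is worth its cost. The sketch below is a generic rendering of that loop; expected_gain(), acquire(), and classify() stand in for the paper's model-based quantities.

```python
def myopic_sensing(candidate_features, costs, expected_gain, acquire, classify):
    """Greedy feature acquisition with an adaptive stopping criterion."""
    observed = {}
    remaining = set(candidate_features)
    while remaining:
        # Net value of acquiring each still-unobserved feature, given what we have seen so far
        scored = {f: expected_gain(f, observed) - costs[f] for f in remaining}
        best = max(scored, key=scored.get)
        if scored[best] <= 0.0:                 # no feature is worth its cost: stop sensing
            break
        observed[best] = acquire(best)          # actually take the measurement
        remaining.discard(best)                 # a feature is never acquired twice
    return classify(observed)
```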

14.
This paper presents a real-time vision-based system to assist a person with dementia in washing their hands. The system uses only video inputs, and assistance is given as either verbal or visual prompts, or through the enlistment of a human caregiver’s help. The system combines a Bayesian sequential estimation framework for tracking hands and towel, with a decision-theoretic framework for computing policies of action. The decision making system is a partially observable Markov decision process, or POMDP. Decision policies dictating system actions are computed in the POMDP using a point-based approximate solution technique. The tracking and decision making systems are coupled using a heuristic method for temporally segmenting the input video stream based on the continuity of the belief state. A key element of the system is the ability to estimate and adapt to user psychological states, such as awareness and responsiveness. We evaluate the system in three ways. First, we evaluate the hand-tracking system by comparing its outputs to manual annotations and to a simple hand-detection method. Second, we test the POMDP solution methods in simulation, and show that our policies have higher expected return than five other heuristic methods. Third, we report results from a ten-week trial with seven persons with moderate-to-severe dementia in a long-term care facility in Toronto, Canada. The subjects washed their hands once a day, with assistance given by our automated system, or by a human caregiver, in alternating two-week periods. We give two detailed case study analyses of the system working during trials, and then show agreement between the system and independent human raters of the same trials.

15.
In active perception tasks, an agent aims to select sensory actions that reduce its uncertainty about one or more hidden variables. For example, a mobile robot takes sensory actions to efficiently navigate in a new environment. While partially observable Markov decision processes (POMDPs) provide a natural model for such problems, reward functions that directly penalize uncertainty in the agent’s belief can remove the piecewise-linear and convex (PWLC) property of the value function required by most POMDP planners. Furthermore, as the number of sensors available to the agent grows, the computational cost of POMDP planning grows exponentially with it, making POMDP planning infeasible with traditional methods. In this article, we address a twofold challenge of modeling and planning for active perception tasks. We analyze ρPOMDP and POMDP-IR, two frameworks for modeling active perception tasks, that restore the PWLC property of the value function. We show the mathematical equivalence of these two frameworks by showing that given a ρPOMDP along with a policy, they can be reduced to a POMDP-IR and an equivalent policy (and vice-versa). We prove that the value function for the given ρPOMDP (and the given policy) and the reduced POMDP-IR (and the reduced policy) is the same. To efficiently plan for active perception tasks, we identify and exploit the independence properties of POMDP-IR to reduce the computational cost of solving POMDP-IR (and ρPOMDP). We propose greedy point-based value iteration (PBVI), a new POMDP planning method that uses greedy maximization to greatly improve scalability in the action space of an active perception POMDP. Furthermore, we show that, under certain conditions, including submodularity, the value function computed using greedy PBVI is guaranteed to have bounded error with respect to the optimal value function. We establish the conditions under which the value function of an active perception POMDP is guaranteed to be submodular. Finally, we present a detailed empirical analysis on a dataset collected from a multi-camera tracking system employed in a shopping mall. Our method achieves similar performance to existing methods but at a fraction of the computational cost, leading to better scalability for solving active perception tasks.
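The greedy-maximization step can be illustrated with a simple subset-selection loop: when an action is a set of sensors and the objective is (approximately) submodular, the set is built one sensor at a time by marginal gain. The value() callable below is an assumed evaluation of a sensor subset at the current belief, not the PBVI backup itself.

```python
def greedy_sensor_subset(sensors, budget, value):
    """Pick up to `budget` sensors by greedy marginal gain of an assumed set-valued objective."""
    selected = set()
    for _ in range(budget):
        gains = {s: value(selected | {s}) - value(selected)
                 for s in sensors if s not in selected}
        if not gains:
            break
        best = max(gains, key=gains.get)
        if gains[best] <= 0.0:                  # no remaining sensor adds value: stop early
            break
        selected.add(best)
    return selected
```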

16.
We consider an autonomous agent facing a stochastic, partially observable, multiagent environment. In order to compute an optimal plan, the agent must accurately predict the actions of the other agents, since they influence the state of the environment and ultimately the agent’s utility. To do so, we propose a special case of interactive partially observable Markov decision process, in which the agent does not explicitly model the other agents’ beliefs and preferences, and instead represents them as stochastic processes implemented by probabilistic deterministic finite state controllers (PDFCs). The agent maintains a probability distribution over the PDFC models of the other agents, and updates this belief using Bayesian inference. Since the number of nodes of these PDFCs is unknown and unbounded, the agent places a Bayesian nonparametric prior distribution over the infinite-dimensional set of PDFCs. This allows the size of the learned models to adapt to the complexity of the observed behavior. Deriving the posterior distribution is in this case too complex to be amenable to analytical computation; therefore, we provide a Markov chain Monte Carlo algorithm that approximates the posterior beliefs over the other agents’ PDFCs, given a sequence of (possibly imperfect) observations about their behavior. Experimental results show that the learned models converge behaviorally to the true ones. We consider two settings, one in which the agent first learns, then interacts with other agents, and one in which learning and planning are interleaved. We show that the agent’s performance increases as a result of learning in both situations. Moreover, we analyze the dynamics that ensue when two agents are simultaneously learning about each other while interacting, showing in an example environment that coordination emerges naturally from our approach. Furthermore, we demonstrate how an agent can exploit the learned models to perform indirect inference over the state of the environment via the modeled agent’s actions.

17.
This paper presents a framework for automatically learning the rules of a simple card game using data from a vision system observing the game being played. Incremental learning of object and protocol models from video, for use by an artificial cognitive agent, is presented. iLearn, a novel algorithm for inducing univariate decision trees for symbolic datasets, is introduced. iLearn builds the decision tree incrementally, allowing automatic learning of the rules of the game.

18.
In this paper, we propose and examine adaptive learning procedures for supporting a group of decision-makers with a common set of strategies and preferences who face uncertain behaviors of “nature.” First, we describe the decision situation as a hypergame situation, where each decision-maker is explicitly assumed to have misperceptions about the nature's set of strategies and preferences. Then, we propose three learning procedures about the nature, each of which consists of several activities. One of the activities is to choose “rational” actions based on the current perceptions and rationality adopted by the decision-makers, while the other activities are represented by the elements of a genetic algorithm (GA) that improve the current perceptions. The three learning procedures differ from each other with respect to at least one of the activities of fitness evaluation, modified crossover, and action choice, though they use the same definition for the other GA elements. Finally, by examining the simulation results, we point out that how preference- and strategy-oriented information is employed is critical to obtaining good performance in clarifying the nature's set of strategies and the outcomes most preferred by the nature.

19.
A Continual Planning System for Dynamic and Uncertain Environments
李响, 陈小平. 《计算机学报》 (Chinese Journal of Computers), 2005, 28(7): 1163-1170
Planning is an important direction in artificial intelligence research with extremely broad applications. In recent years, the research focus has shifted to planning problems in dynamic and uncertain environments. This paper combines the advantages of partially observable Markov decision processes (POMDPs) and the Procedural Reasoning System (PRS) and proposes POMDPRS, a continual planning system with more comprehensive adaptability to dynamic and uncertain environments. The system uses the continual planning mechanism of PRS to interleave planning and execution, which under certain conditions improves the efficiency of POMDP decision making in dynamic environments; in addition, it replaces the first-order logic belief representation and plan selection mechanism of PRS with the POMDP's probability-distribution belief model and the maximum expected utility principle, greatly strengthening its ability to handle environmental uncertainty.
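An illustrative reading of the interleaved planning and execution loop: keep a probabilistic belief, select the plan with maximum expected utility under that belief, execute one step, update the belief, and replan. All names below are placeholders, and plans are assumed to expose a next_action(belief) method; this is a sketch, not the POMDPRS implementation.

```python
def continual_planning_loop(belief, plans, expected_utility, execute,
                            belief_update, goal_reached, max_steps=100):
    """Interleave plan selection, single-step execution, and belief updating."""
    for _ in range(max_steps):
        if goal_reached(belief):
            break
        # Maximum-expected-utility plan selection replaces logic-based plan choice
        plan = max(plans, key=lambda p: expected_utility(p, belief))
        action = plan.next_action(belief)
        observation = execute(action)            # act in the environment, observe the result
        belief = belief_update(belief, action, observation)
    return belief
```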

20.
A reinforcement learning agent solves decision problems by learning an optimal policy that maps states to actions. Reinforcement learning methods let the agent improve its own behavior through trial-and-error interaction with the environment. The Markov decision process (MDP) model is the general framework for solving reinforcement learning problems, and dynamic programming provides policy-dependent value-function learning algorithms for an agent in a Markovian environment. However, because the agent must memorize the entire value function during learning, the required memory becomes enormous as the state space grows. This article proposes a forgetting algorithm for reinforcement learning based on dynamic programming: by introducing basic principles of forgetting from memory psychology into value-function learning, it derives an improved class of dynamic-programming methods for reinforcement learning problems, namely the Forget-DP algorithm.
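One loose way to picture such a forgetting mechanism is a value table whose entries decay when not refreshed by visits and are dropped once they fade back to the default, so the full value function is never stored. The decay rule below is an assumption for illustration, not the Forget-DP algorithm itself.

```python
class ForgettingValueTable:
    """Sparse value table in which unvisited entries fade and are eventually forgotten."""

    def __init__(self, decay=0.99, prune_below=1e-3, default=0.0):
        self.values = {}
        self.decay = decay
        self.prune_below = prune_below
        self.default = default

    def get(self, state):
        return self.values.get(state, self.default)

    def update(self, state, new_value):
        self.values[state] = new_value           # rehearsal: visited states are refreshed

    def forget_step(self):
        # Entries drift toward the default; near-default entries are dropped entirely
        for state in list(self.values):
            self.values[state] *= self.decay
            if abs(self.values[state] - self.default) < self.prune_below:
                del self.values[state]
```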
