Related Articles
20 related articles found.
1.
We consider the problem of learning in repeated general-sum matrix games when a learning algorithm can observe the actions but not the payoffs of its associates. Because of the non-stationarity of the environment caused by learning associates in these games, most state-of-the-art algorithms perform poorly in some important repeated games owing to an inability to make profitable compromises. To make these compromises, an agent must effectively balance competing objectives, including bounding losses, playing optimally with respect to current beliefs, and taking calculated, but profitable, risks. In this paper, we present, discuss, and analyze M-Qubed, a reinforcement learning algorithm designed to overcome these deficiencies by encoding and balancing best-response, cautious, and optimistic learning biases. We show that M-Qubed learns to make profitable compromises across a wide range of repeated matrix games played with many kinds of learners. Specifically, we prove that M-Qubed's average payoffs meet or exceed its maximin value in the limit. Additionally, we show that, in two-player games, M-Qubed's average payoffs approach the value of the Nash bargaining solution in self-play. Furthermore, it performs very well when associating with other learners, as evidenced by its robust behavior in round-robin and evolutionary tournaments of two-player games. These results demonstrate that an agent can learn to make good compromises, and hence receive high payoffs, in repeated games by effectively encoding and balancing best-response, cautious, and optimistic learning biases.
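The general idea the abstract describes, bounding losses relative to the maximin (security) value while otherwise playing a best response to learned estimates, can be illustrated with a minimal sketch. This is not the actual M-Qubed update rule; the stateless Q-values, pure-strategy maximin, loss tolerance, and learning rate below are illustrative assumptions.

```python
import numpy as np

class LossBoundedQLearner:
    """Toy repeated-game learner: plays a greedy best response to its Q-value
    estimates, but falls back to its (pure-strategy) maximin action whenever
    the accumulated shortfall versus the maximin value grows too large.
    Illustrative only; not the actual M-Qubed algorithm."""

    def __init__(self, payoff_matrix, alpha=0.1, loss_tolerance=10.0):
        self.payoffs = np.asarray(payoff_matrix)      # my payoff for (my action, their action)
        self.q = np.zeros(self.payoffs.shape[0])      # one Q-value per own action (stateless toy)
        self.alpha = alpha
        self.loss_tolerance = loss_tolerance
        self.maximin_value = self.payoffs.min(axis=1).max()       # pure-strategy security level
        self.maximin_action = int(self.payoffs.min(axis=1).argmax())
        self.shortfall = 0.0                          # accumulated (maximin value - actual payoff)

    def choose_action(self):
        # Cautious bias: if losses relative to the security level pile up, play safe.
        if self.shortfall > self.loss_tolerance:
            return self.maximin_action
        # Best-response bias: otherwise act greedily on current estimates.
        return int(self.q.argmax())

    def update(self, my_action, reward):
        self.q[my_action] += self.alpha * (reward - self.q[my_action])
        self.shortfall = max(0.0, self.shortfall + self.maximin_value - reward)
```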

2.
In this paper we introduce a new multi-agent reinforcement learning algorithm, called exploring selfish reinforcement learning (ESRL). ESRL allows agents to reach optimal solutions in repeated non-zero-sum games with stochastic rewards by using coordinated exploration. First, two ESRL algorithms, for common-interest and conflicting-interest games respectively, are presented. Both ESRL algorithms are based on the same idea: an agent explores by temporarily excluding some of the local actions from its private action space, giving the team of agents the opportunity to look for better solutions in a reduced joint action space. In a later stage these two algorithms are combined into one generic algorithm which does not assume that the type of the game is known in advance. ESRL is able to find the Pareto-optimal solution in common-interest games without communication. In conflicting-interest games ESRL needs only limited communication to learn a fair periodical policy, resulting in a good overall policy. Importantly, ESRL agents are independent in the sense that they base their decisions only on their own action choices and rewards, they are flexible in learning different solution concepts, and they can handle stochastic and possibly delayed rewards as well as asynchronous action selection. A real-life experiment, adaptive load-balancing of parallel applications, is included.
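The coordinated-exploration mechanism described, temporarily removing actions from a private action space so the team explores a reduced joint action space, can be sketched as follows. The phase structure, value estimates, and restoration rule are illustrative assumptions, not the actual ESRL bookkeeping.

```python
import random

class ExclusionExplorer:
    """Toy illustration of exploration by action exclusion: the agent
    periodically removes its currently best-looking action from its private
    action space, forcing the team to try other joint actions.
    Illustrative only; the real ESRL phases and convergence logic differ."""

    def __init__(self, n_actions, epsilon=0.1):
        self.n_actions = n_actions
        self.values = [0.0] * n_actions
        self.counts = [0] * n_actions
        self.available = set(range(n_actions))   # private (possibly reduced) action space
        self.epsilon = epsilon

    def choose_action(self):
        if random.random() < self.epsilon:
            return random.choice(sorted(self.available))
        return max(self.available, key=lambda a: self.values[a])

    def update(self, action, reward):
        # Running-average estimate of each local action's value.
        self.counts[action] += 1
        self.values[action] += (reward - self.values[action]) / self.counts[action]

    def exclude_best_action(self):
        # Start of an exploration phase: hide the best-looking action for a while.
        if len(self.available) > 1:
            best = max(self.available, key=lambda a: self.values[a])
            self.available.discard(best)

    def restore_actions(self):
        # End of the exploration phase: restore the full private action space.
        self.available = set(range(self.n_actions))
```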

3.
A class of linear-quadratic Stackelberg games with many leaders and many followers is considered. For this game, a proportionality relation is assumed between some of the weighting matrices in the leaders' cost functions. With this assumption, it is shown that the matrix characterizing the set of necessary conditions to be satisfied by an open-loop Stackelberg strategy has a special spectrum. This property is then used to solve the two-point boundary-value problem (TPBVP) associated with the game by an eigenvector method.

4.
5.
This article presents a novel actor-critic-barrier structure for multiplayer safety-critical systems. Non-zero-sum (NZS) games with full-state constraints are first transformed into unconstrained NZS games using a barrier function. The barrier function is capable of dealing with both symmetric and asymmetric constraints on the state. It is shown that the Nash equilibrium of the unconstrained NZS game stabilizes the original multiplayer system. The barrier function is combined with an actor-critic structure to learn the Nash equilibrium solution in an online fashion. It is shown that integrating the barrier function with the actor-critic structure guarantees that the constraints will not be violated during learning. Boundedness and stability of the closed-loop signals are analyzed. The efficacy of the presented approach is finally demonstrated using a simulation example.

6.
Based on bounded rationality, the symmetric Nash equilibrium of N-person cooperative games is analyzed, and evolutionary game theory is introduced to study the players' evolutionarily stable strategies, yielding the equilibrium points under different strategy choices. Replicator dynamics from evolutionary biology is then applied to study the replicator-dynamic stable sets in both discrete and continuous time. Finally, an example illustrates the effectiveness of this method for equilibrium selection in games.
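For reference, the standard replicator dynamics referred to in the abstract take the following forms, where A is the payoff matrix and x the population state; the specific variants analyzed in the paper may differ in detail.

```latex
% Continuous-time replicator dynamics
\dot{x}_i = x_i\left[(Ax)_i - x^{\top} A x\right]

% Discrete-time replicator dynamics
x_i(t+1) = x_i(t)\,\frac{(Ax(t))_i}{x(t)^{\top} A x(t)}
```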

7.
Autonomous Robots - Dynamic games are an effective paradigm for dealing with the control of multiple interacting actors. This paper introduces augmented Lagrangian GAME-theoretic solver (ALGAMES),...  相似文献   

8.
Constrained clustering methods (which usually use must-link and/or cannot-link constraints) have received much attention in the last decade. Recently, kernel adaptation or kernel learning has been considered a powerful approach for constrained clustering. However, these methods usually either allow only special forms of kernels or learn non-parametric kernel matrices and scale very poorly. Therefore, they either learn a metric that has low flexibility or are applicable only to small data sets due to their high computational complexity. In this paper, we propose a more efficient non-linear metric learning method that learns a low-rank kernel matrix from must-link and cannot-link constraints and the topological structure of the data. We formulate the proposed method as a trace-ratio optimization problem and learn appropriate distance metrics by finding optimal low-rank kernel matrices. We solve the proposed optimization problem far more efficiently than SDP solvers can. Additionally, we show that spectral clustering methods can be considered a special form of low-rank kernel learning. Extensive experiments demonstrate the superiority of the proposed method compared to recently introduced kernel learning methods.
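As background, a trace-ratio optimization problem over a low-rank factor generically has the following form; the matrices S_c and S_p here are placeholders standing in for whatever encodes the constraint and topology information, and are not the specific matrices constructed in the paper.

```latex
% Generic trace-ratio problem over a low-rank factor W (kernel K = W W^{\top}):
\max_{W^{\top} W = I}\;
\frac{\operatorname{tr}\!\left(W^{\top} S_c\, W\right)}
     {\operatorname{tr}\!\left(W^{\top} S_p\, W\right)}
```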

9.
Fujita H, Ishii S. Neural Computation, 2007, 19(11): 3051-3087.
Games constitute a challenging domain of reinforcement learning (RL) for acquiring strategies because many of them include multiple players and many unobservable variables in a large state space. The difficulty of solving such realistic multiagent problems with partial observability arises mainly from the fact that the computational cost for the estimation and prediction in the whole state space, including unobservable variables, is too heavy. To overcome this intractability and enable an agent to learn in an unknown environment, an effective approximation method is required with explicit learning of the environmental model. We present a model-based RL scheme for large-scale multiagent problems with partial observability and apply it to a card game, hearts. This game is a well-defined example of an imperfect information game and can be approximately formulated as a partially observable Markov decision process (POMDP) for a single learning agent. To reduce the computational cost, we use a sampling technique in which the heavy integration required for the estimation and prediction can be approximated by a plausible number of samples. Computer simulation results show that our method is effective in solving such a difficult, partially observable multiagent problem.
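The sampling idea described, replacing heavy integration over unobserved variables with a modest number of sampled completions, can be sketched as follows. The callables (sampler and environment model) are placeholders, not the hearts-specific formulation from the paper.

```python
def estimate_action_value(action, observation, sample_hidden_state, model, n_samples=100):
    """Monte Carlo approximation of an action's expected value under partial
    observability: instead of integrating over all hidden states consistent
    with the observation, average over sampled completions.
    `sample_hidden_state` and `model` are placeholders for the learned
    environment model."""
    total = 0.0
    for _ in range(n_samples):
        hidden = sample_hidden_state(observation)          # sample the unobserved variables
        total += model.expected_return(observation, hidden, action)
    return total / n_samples

def choose_action(legal_actions, observation, sample_hidden_state, model):
    # Pick the action with the highest sampled estimate.
    return max(legal_actions,
               key=lambda a: estimate_action_value(a, observation, sample_hidden_state, model))
```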

10.
Automatica, 2014, 50(12): 3038-3053.
This paper introduces a new class of multi-agent discrete-time dynamic games, known in the literature as dynamic graphical games. In these games the interactions between the agents are prescribed by a communication graph structure, so a local performance index is defined for each agent that depends only on the information locally available to it. This structure also requires a new notion of Nash equilibrium; it is proved that this notion holds if all agents are in Nash equilibrium and the graph is strongly connected. Nash equilibrium policies and best-response policies are given in terms of the solutions to the discrete-time coupled Hamilton–Jacobi equations. A novel reinforcement learning value iteration algorithm is given to solve the dynamic graphical games in an online manner, along with its proof of convergence. The policies of the agents form a Nash equilibrium when all the agents in the neighborhood update their policies, and a best-response outcome when the agents in the neighborhood are kept constant. The paper brings together discrete Hamiltonian mechanics, distributed multi-agent control, optimal control theory, and game theory to formulate and solve these multi-agent dynamic graphical games. A simulation example shows the effectiveness of the proposed approach in a leader-synchronization case along with optimality guarantees.

11.
In this paper we first derive a necessary and sufficient condition for a stationary strategy to be the Nash equilibrium of a discounted constrained stochastic game under certain assumptions. In the process we also develop a nonlinear (non-convex) optimization problem for the discounted constrained stochastic game. We use the linear best-response functions of every player and the complementary slackness theorem for linear programs to derive both the optimization problem and the equivalent condition. We then extend this result to average-reward constrained stochastic games. Finally, we present a heuristic algorithm motivated by our necessary and sufficient conditions for a discounted-cost constrained stochastic game. We numerically observe the convergence of this algorithm to a Nash equilibrium.

12.
13.

This paper suggests a new approach to repeated Stackelberg security games (SSGs) based on manipulation. Manipulation is a strategy interpreted through Machiavellianism social behavior theory, which rests on three main concepts: view, tactics, and immorality. The world is conceptualized in terms of manipulators and the manipulated (view). Players employ Machiavelli's tactics and Machiavellian intelligence in order to manipulate attacker/defender situations. Immorality plays a fundamental role in these games: defenders need not be bound by conventional morality in order to achieve their goals. We consider a security game model involving manipulating defenders and manipulated attackers engaged cooperatively in a Nash game and at the same time restricted by a Stackelberg game. The resulting game is a non-cooperative bargaining game, with cooperation represented by the Nash bargaining solution. We propose an analytical formula for solving the manipulation game, which arises as the maximum of the quotient of two Nash products. The roles of the players in the Stackelberg security game are determined by the players' weights in the Nash bargaining approach. We consider only a subgame perfect equilibrium in which the solution of the manipulation game is a Strong Stackelberg Equilibrium (SSE). We employ a reinforcement learning (RL) approach to implement immorality. A numerical example, developing a strategic schedule for the efficient use of patrolling resources in a smart city, is handled using a class of homogeneous, ergodic, controllable, and finite Markov chains to show the usefulness of the method for security resource allocation.
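For background, the weighted Nash bargaining solution referenced in the abstract maximizes a weighted product of the players' gains over their disagreement payoffs; the specific quotient-of-two-Nash-products objective used for the manipulation game is not reproduced here.

```latex
% Weighted Nash bargaining product (background only; d_i are disagreement
% payoffs, w_i the players' weights, \mathcal{U} the feasible payoff set):
\max_{u \in \mathcal{U}} \;\prod_{i} \left(u_i - d_i\right)^{w_i}
\quad \text{subject to}\quad u_i \ge d_i \;\; \forall i
```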


14.
This paper introduces a model-free reinforcement learning technique that is used to solve a class of dynamic games known as dynamic graphical games. The graphical game results from multi-agent dynamical systems where pinning control is used to make all the agents synchronize to the state of a command generator or leader agent. Novel coupled Bellman equations and Hamiltonian functions are developed for the dynamic graphical games. Hamiltonian mechanics is used to derive the necessary conditions for optimality. The solution for the dynamic graphical game, and hence its Nash equilibrium, is given in terms of the solution to a set of coupled Hamilton-Jacobi-Bellman equations developed herein. An online model-free policy iteration algorithm is developed to learn the Nash solution for the dynamic graphical game. This algorithm does not require any knowledge of the agents' dynamics. A proof of convergence for this multi-agent learning algorithm is given under mild assumptions about the inter-connectivity properties of the graph. A gradient descent technique with critic network structures is used to implement the policy iteration algorithm and solve the graphical game online in real time.
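The model-free policy iteration described alternates policy evaluation and policy improvement using observed data rather than knowledge of the agents' dynamics. A generic single-agent sketch of that loop is shown below; the evaluation, improvement, and distance functions are placeholders, and the coupled multi-agent equations from the paper are not reproduced.

```python
def policy_iteration(evaluate_policy, improve_policy, value_distance,
                     initial_policy, tol=1e-6, max_iters=100):
    """Generic policy-iteration loop: repeatedly evaluate the current policy
    from collected data and improve it greedily until the value estimate
    stops changing. All callables are placeholders for the data-driven
    (model-free) steps described in the abstract."""
    policy = initial_policy
    previous_value = None
    for _ in range(max_iters):
        value = evaluate_policy(policy)      # policy evaluation from observed trajectories
        policy = improve_policy(value)       # greedy improvement w.r.t. the evaluated value
        if previous_value is not None and value_distance(value, previous_value) < tol:
            break
        previous_value = value
    return policy
```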

15.
A widely accepted rational behavior for non-cooperative players is based on the notion of Nash equilibrium. Although the existence of a Nash equilibrium is guaranteed in the mixed framework (i.e., when players select their actions in a randomized manner), in many real-world applications the existence of “any” equilibrium is not enough. Rather, it is often desirable to single out equilibria satisfying some additional requirements (for instance, to guarantee a minimum payoff to certain players), which we call constrained Nash equilibria. In this paper, a formal framework for specifying these kinds of requirements is introduced and investigated in the context of graphical games, where a player p may be directly interested in only some of the other players, called the neighbors of p. This setting is very useful for modeling large population games, where typically each player does not directly depend on all the players, and representing her utility function explicitly is either inconvenient or infeasible. Based on this framework, the complexity of deciding the existence and of computing constrained equilibria is then investigated, with attention to how the intrinsic difficulty of these tasks is affected by the requirements prescribed at the equilibrium and by the structure of players' interactions. The analysis is carried out for the setting of mixed strategies as well as for the setting of pure strategies, i.e., when players are forced to deterministically choose the action to perform. In particular, for this latter case, restrictions on players' interactions and on constraints are identified that make the computation of Nash equilibria an easy problem, for which polynomial and highly parallelizable algorithms are presented.

16.
We consider a class of games with real-valued strategies and payoff information available only in the form of data from a given sample of strategy profiles. Solving such games with respect to the underlying strategy space requires generalizing from the data to a complete payoff-function representation. We address payoff-function learning as a standard regression problem, with provision for capturing known structure (e.g., symmetry) in the multiagent environment. To measure learning performance, we consider the relative utility of prescribed strategies, rather than the accuracy of payoff functions per se. We demonstrate our approach and evaluate its effectiveness on two examples: a two-player version of the first-price sealed-bid auction (with known analytical form), and a five-player market-based scheduling game (with no known solution). Additionally, we explore the efficacy of using relative utility of strategies as a target of supervised learning and as a learning model selector. Our experiments demonstrate its effectiveness in the former case, though not in the latter.
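The payoff-function-learning step described reduces to fitting a regressor from sampled strategy profiles to observed payoffs and then reasoning about strategies with the fitted model. A minimal least-squares sketch is given below; the quadratic feature map, two-player setup, and candidate-strategy grid are illustrative assumptions, not the structured regressors or auction/scheduling games from the paper.

```python
import numpy as np

def fit_payoff_model(profiles, payoffs):
    """Fit a simple quadratic payoff model from sampled strategy profiles
    (rows) and observed payoffs via least squares. The feature map is an
    illustrative assumption."""
    X = np.asarray(profiles, dtype=float)
    features = np.hstack([np.ones((X.shape[0], 1)), X, X ** 2])
    coef, *_ = np.linalg.lstsq(features, np.asarray(payoffs, dtype=float), rcond=None)
    return coef

def predicted_payoff(coef, profile):
    x = np.asarray(profile, dtype=float)
    features = np.concatenate([[1.0], x, x ** 2])
    return float(features @ coef)

def best_response(coef, opponent_strategy, candidate_strategies):
    """Evaluate candidate own strategies against a fixed opponent strategy
    using the learned payoff function (two-player, real-valued strategies)."""
    return max(candidate_strategies,
               key=lambda s: predicted_payoff(coef, [s, opponent_strategy]))
```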

17.
Recently, a growing number of scientific applications have been migrated to the cloud. To deal with the problems this brings, more and more researchers have begun to consider multiple optimization goals in workflow scheduling. However, previous works ignore some details which are challenging but essential. Most existing multi-objective workflow scheduling algorithms overlook weight selection, which may degrade the quality of the solutions. Besides, we find that the well-known partial critical path (PCP) strategy, which has been widely used to meet deadline constraints, cannot accurately reflect the situation at each time step. Workflow scheduling is an NP-hard problem, so self-optimizing algorithms are well suited to it. In this paper, the aim is to solve a workflow scheduling problem with a deadline constraint. We design a deadline-constrained scientific workflow scheduling algorithm based on multi-objective reinforcement learning (RL), called DCMORL. DCMORL uses the Chebyshev scalarization function to scalarize its Q-values; this method is effective for choosing weights for the objectives. We propose an improved version of the PCP strategy called MPCP. The sub-deadlines in MPCP are updated regularly during the scheduling phase, so they accurately reflect the situation at each time step. The optimization objectives in this paper are to minimize the execution cost and energy consumption within a given deadline. Finally, we use four scientific workflows to compare DCMORL with several representative scheduling algorithms. The results indicate that DCMORL outperforms these algorithms. To the best of our knowledge, this is the first time RL has been applied to a deadline-constrained workflow scheduling problem.
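The Chebyshev scalarization used to combine per-objective Q-values can be sketched as follows: the scalarized score is the largest weighted deviation from a utopia (reference) point, and action selection minimizes it. The weights, utopia point, and sign conventions below are illustrative assumptions rather than DCMORL's exact settings.

```python
import numpy as np

def chebyshev_scalarize(q_values, weights, utopia_point):
    """Chebyshev scalarization of a vector of per-objective Q-values:
    the weighted largest deviation from a (slightly optimistic) utopia point.
    Smaller is better, so action selection minimizes this quantity."""
    q = np.asarray(q_values, dtype=float)
    w = np.asarray(weights, dtype=float)
    z = np.asarray(utopia_point, dtype=float)
    return float(np.max(w * np.abs(q - z)))

def select_action(per_action_q_values, weights, utopia_point):
    # per_action_q_values: one per-objective Q-vector for each candidate action.
    scores = [chebyshev_scalarize(q, weights, utopia_point) for q in per_action_q_values]
    return int(np.argmin(scores))
```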

18.
The resource-constrained project scheduling problem (RCPSP) is encountered in many fields, including manufacturing, supply chains, and construction. Nowadays, with a rapidly changing external environment and the emergence of new models such as smart manufacturing, it is increasingly necessary to study the RCPSP under resource disruptions. A framework based on reinforcement learning (RL) and graph neural networks (GNN) is proposed to solve the RCPSP and, on this basis, the RCPSP with resource disruptions (RCPSP-RD). The scheduling process is formulated as a sequential decision-making problem. Based on that, Markov decision process (MDP) models are developed for RL to learn scheduling policies. A GNN-based structure is proposed to extract features from problems and map them to action probability distributions via a policy network. To optimize the scheduling policy, proximal policy optimization (PPO) is applied to train the model end-to-end. Computational results on benchmark instances show that the RL-GNN algorithm achieves competitive performance compared with some widely used methods.

19.
This paper investigates the maintenance problem for a flow-line system consisting of two series machines with a finite intermediate buffer. Both machines deteriorate independently as they operate, resulting in multiple yield levels. Resource-constrained imperfect preventive maintenance actions may bring a machine back to a better state. The problem is modeled as a semi-Markov decision process. A distributed multi-agent reinforcement learning algorithm is proposed to solve the problem and to obtain the control-limit maintenance policy for each machine associated with the observed state, represented by yield level and buffer level. An asynchronous updating rule is used in the learning process since the state transitions of the two machines are not synchronous. An experimental study is conducted to evaluate the efficiency of the proposed algorithm.

20.
We consider the problem of learning to predict as well as the best in a group of experts making continuous predictions. We assume the learning algorithm has prior knowledge of the maximum number of mistakes of the best expert. We propose a new master strategy that achieves the best known performance for on-line learning with continuous experts in the mistake-bounded model. Our ideas are based on drifting games, a generalization of boosting and on-line learning algorithms. We prove new lower bounds based on the drifting-games framework which, though not as tight as previous bounds, have simpler proofs and do not require an enormous number of experts. We also extend previous lower bounds to show that our upper bounds are exactly tight for sufficiently many experts. A surprising consequence of our work is that continuous experts are only as powerful as experts making binary predictions or no prediction in each round.
