Similar Documents
20 similar documents found.
1.
In its most basic form, bandit theory is concerned with the design problem of sequentially choosing members from a given collection of random variables so that the regret, i.e., R_n = Σ_j (μ* − μ_j) E T_n(j), grows as slowly as possible with increasing n. Here μ_j is the expected value of the bandit arm (i.e., random variable) indexed by j, T_n(j) is the number of times arm j has been selected in the first n decision stages, and μ* = sup_j μ_j. The present paper contributes to the theory by considering the situation in which observations are dependent. To begin with, the dependency is presumed to depend only on past observations of the same arm, but later, we allow that it may be with respect to the entire past and that the set of arms is infinite. This brings queues and, more generally, controlled Markov processes into our purview. Thus our “black-box” methodology is suitable for the case when the only observables are cost values and, in particular, the probability structure and loss function are unknown to the designer. The conclusion of the analysis is that under lenient conditions, using algorithms prescribed herein, risk growth is commensurate with that in the simplest i.i.d. cases. Our methods represent an alternative to stochastic-approximation/perturbation-analysis ideas for tuning queues.
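For concreteness, the sketch below (not the paper's dependent-observation algorithm, and with assumed arm means, horizon, and exploration constant) runs a standard UCB1-style rule on independent Gaussian arms and evaluates the regret with realized counts in place of E T_n(j):

```python
import numpy as np

def run_ucb(mu, n, seed=0):
    """UCB1-style simulation on Gaussian arms; returns pull counts T_n(j)."""
    rng = np.random.default_rng(seed)
    k = len(mu)
    counts = np.ones(k, dtype=int)
    means = np.array([rng.normal(m, 1.0) for m in mu])     # pull each arm once
    for t in range(k, n):
        ucb = means + np.sqrt(2.0 * np.log(t + 1) / counts)  # mean + exploration bonus
        j = int(np.argmax(ucb))
        x = rng.normal(mu[j], 1.0)
        counts[j] += 1
        means[j] += (x - means[j]) / counts[j]
    return counts

mu = [0.2, 0.5, 0.9]                                       # assumed arm means
T = run_ucb(mu, n=10_000)
regret = sum((max(mu) - m) * t for m, t in zip(mu, T))     # Σ_j (μ* − μ_j) T_n(j)
print(T, regret)
```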

2.
Truong, Quoc-Tuan; Lauw, Hady W. Machine Learning, 2021, 110(8): 2085–2105.
Recommendations are prevalent in Web applications (e.g., search ranking, item recommendation, advertisement placement). Learning from bandit feedback is challenging due to the...

3.
The two-armed bandit problem is a classical optimization problem where a decision maker sequentially pulls one of two arms attached to a gambling machine, with each pull resulting in a random reward. The reward distributions are unknown, and thus, one must balance between exploiting existing knowledge about the arms, and obtaining new information. Bandit problems are particularly fascinating because a large class of real-world problems, including routing, Quality of Service (QoS) control, game playing, and resource allocation, can be solved in a decentralized manner when modeled as a system of interacting gambling machines. Although computationally intractable in many cases, Bayesian methods provide a standard for optimal decision making. This paper proposes a novel scheme for decentralized decision making based on the Goore Game in which each decision maker is inherently Bayesian in nature, yet avoids computational intractability by relying simply on updating the hyperparameters of sibling conjugate priors, and on random sampling from these posteriors. We further report theoretical results on the variance of the random rewards experienced by each individual decision maker. Based on these theoretical results, each decision maker is able to accelerate its own learning by taking advantage of the increasingly more reliable feedback that is obtained as exploration gradually turns into exploitation in bandit-problem-based learning. Extensive experiments, involving QoS control in simulated wireless sensor networks, demonstrate that the accelerated learning allows us to combine the benefits of conservative learning, which is high accuracy, with the benefits of hurried learning, which is fast convergence. In this manner, our scheme outperforms recently proposed Goore Game solution schemes, where one has to trade off accuracy with speed. As an additional benefit, performance also becomes more stable. We thus believe that our methodology opens avenues for improved performance in a number of applications of bandit-based decentralized decision making.
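As a rough illustration of the "conjugate hyperparameter update plus posterior sampling" idea, here is a generic Beta–Bernoulli sketch (not the paper's full Goore Game scheme; the arm probabilities and horizon are assumptions):

```python
import random

class BayesianArm:
    """Beta-Bernoulli conjugate pair: the posterior update is just two counters."""
    def __init__(self):
        self.alpha, self.beta = 1.0, 1.0          # uniform prior hyperparameters

    def sample(self):
        return random.betavariate(self.alpha, self.beta)   # draw from the posterior

    def update(self, reward):
        # reward in {0, 1}; conjugacy keeps the posterior a Beta distribution
        self.alpha += reward
        self.beta += 1 - reward

random.seed(0)
arms = [BayesianArm(), BayesianArm()]
true_p = [0.4, 0.7]                               # assumed, unknown to the learner
for _ in range(1000):
    j = max(range(2), key=lambda i: arms[i].sample())   # arm with best posterior draw
    arms[j].update(1 if random.random() < true_p[j] else 0)
print([(a.alpha, a.beta) for a in arms])
```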

4.
Nash has described optimal strategies for a class of generalized bandit problems and his results have been used to analyse models in research planning and in stochastic scheduling. A general approach to the evaluation of strategies for such problems is described.

5.
We prove a lemma on the optimal value function for the multiarmed bandit problem which provides a simple direct proof of optimality of writeoff policies. This, in turn, leads to a new proof of optimality of the index rule.

6.
Many stochastic resource allocation problems may be formulated as families of alternative bandit processes. One example is the classical one-armed bandit problem recently studied by Kumar and Seidman. Optimal strategies for such problems are known to be determined by a collection of dynamic allocation indexes (DAIs). The aim of this note is to bring this important result to the attention of control theorists and to give a new proof of it. Applications and some related work are also discussed.

7.
We present a new multiclass algorithm in the bandit framework, where after making a prediction, the learning algorithm receives only partial feedback, i.e., a single bit indicating whether the predicted label is correct or not, rather than the true label. Our algorithm is based on the second-order Perceptron, and uses upper-confidence bounds to trade off exploration and exploitation, instead of random sampling as performed by most current algorithms. We analyze this algorithm in a partial adversarial setting, where instances are chosen adversarially, while the labels are chosen according to a linear probabilistic model which is also chosen adversarially. We show a regret of $\mathcal{O}(\sqrt{T}\log T)$, which improves over the current best bounds of $\mathcal{O}(T^{2/3})$ in the fully adversarial setting. We evaluate our algorithm on nine real-world text classification problems and on four vowel recognition tasks, often obtaining state-of-the-art results, even compared with non-bandit online algorithms, especially when label noise is introduced.
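A hedged sketch of the general idea follows: per-class linear estimators with an upper-confidence exploration bonus, where only the predicted class is updated from the single correct/incorrect bit. It is a LinUCB-flavoured stand-in, not the paper's second-order Perceptron algorithm, and the data and the alpha/reg parameters are illustrative assumptions.

```python
import numpy as np

def bandit_multiclass_ucb(X, y, n_classes, alpha=1.0, reg=1.0):
    """Per-class ridge estimators plus a confidence bonus; bandit (one-bit) feedback."""
    d = X.shape[1]
    A = [reg * np.eye(d) for _ in range(n_classes)]    # per-class covariance matrices
    b = [np.zeros(d) for _ in range(n_classes)]
    mistakes = 0
    for x, label in zip(X, y):
        scores = []
        for k in range(n_classes):
            A_inv = np.linalg.inv(A[k])
            scores.append(x @ (A_inv @ b[k]) + alpha * np.sqrt(x @ A_inv @ x))
        pred = int(np.argmax(scores))
        correct = pred == label                         # the only feedback observed
        A[pred] += np.outer(x, x)
        b[pred] += (1.0 if correct else -1.0) * x
        mistakes += 0 if correct else 1
    return mistakes

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] > 0).astype(int)                           # toy two-class labels
print(bandit_multiclass_ucb(X, y, n_classes=2))
```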

8.
We consider application of the two-armed bandit problem to processing a large number N of data where two alternative processing methods can be used. We propose a strategy which at the first stages, whose number is at most r − 1, compares the methods, and at the final stage applies only the best one obtained from the comparison. We find asymptotically optimal parameters of the strategy and observe that the minimax risk is of the order of N^α, where α = 2^{r−1}/(2^r − 1). Under parallel processing, the total operation time is determined by the number r of stages but not by the number N of data.
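Concretely, the exponent shrinks from 1 toward 1/2 as the number of stages r grows:

$$\alpha = \frac{2^{r-1}}{2^{r}-1}:\qquad r=1 \Rightarrow \alpha = 1,\quad r=2 \Rightarrow \alpha = \tfrac{2}{3},\quad r=3 \Rightarrow \alpha = \tfrac{4}{7},\quad \alpha \to \tfrac{1}{2}\ \text{as}\ r \to \infty.$$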

9.
There are N independent machines. Machine i is described by a sequence {X^{i}(s), F^{i}(s)}, where X^{i}(s) is the immediate reward and F^{i}(s) is the information available before machine i is operated for the sth time. At each time one operates exactly one machine; idle machines remain frozen. The problem is to schedule the operation of the machines so as to maximize the expected total discounted sequence of rewards. An elementary proof shows that to each machine is associated an index, and the optimal policy operates the machine with the largest current index. When the machines are completely observed Markov chains, this coincides with the well-known Gittins index rule, and new algorithms are given for calculating the index. A reformulation of the bandit problem yields the tax problem, which includes, as a special case, Klimov's waiting time problem. Using the concept of superprocess, an index rule is derived for the case where new machines arrive randomly. Finally, continuous-time versions of these problems are considered for both preemptive and nonpreemptive disciplines.
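To illustrate the index rule itself, the sketch below schedules two small Markov-chain machines by always operating the one with the largest current index. The index table here is a made-up placeholder (as are the transition matrices and rewards); a real implementation would compute Gittins indices with one of the algorithms the paper describes.

```python
import random

P = {  # per-machine transition matrices over states {0, 1}
    0: [[0.7, 0.3], [0.4, 0.6]],
    1: [[0.5, 0.5], [0.2, 0.8]],
}
reward = {0: [1.0, 3.0], 1: [0.5, 4.0]}   # immediate reward X^i(s) by state
index  = {0: [1.8, 3.1], 1: [1.1, 3.9]}   # assumed index values (illustrative only)

random.seed(0)
state = {0: 0, 1: 0}
total, discount, beta = 0.0, 1.0, 0.9
for t in range(50):
    m = max(state, key=lambda i: index[i][state[i]])   # largest current index
    total += discount * reward[m][state[m]]            # only machine m is operated
    state[m] = random.choices([0, 1], weights=P[m][state[m]])[0]
    discount *= beta                                   # other machines stay frozen
print(total)
```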

10.
We obtain minimax lower bounds on the regret for the classical two-armed bandit problem. We provide a finite-sample minimax version of the well-known log n asymptotic lower bound of Lai and Robbins (1985). The finite-time lower bound allows us to derive conditions for the amount of time necessary to make any significant gain over a random guessing strategy. These bounds depend on the class of possible distributions of the rewards associated with the arms. For example, in contrast to the log n asymptotic results on the regret, we show that the minimax regret is achieved by mere random guessing under fairly mild conditions on the set of allowable configurations of the two arms. That is, we show that for every allocation rule and for every n, there is a configuration such that the regret at time n is at least 1−ϵ times the regret of random guessing, where ϵ is any small positive constant.
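For a concrete reading of the bound: uniform random guessing over two arms pulls each arm n/2 times in expectation, so its regret is

$$R_n^{\mathrm{rand}} = \sum_{j=1}^{2} (\mu^{*} - \mu_j)\,E\,T_n(j) = \frac{n}{2}\,|\mu_1 - \mu_2|,$$

and the result says that for every allocation rule and every n, some two-arm configuration forces regret of at least (1 − ϵ) times this quantity.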

11.
We study a recommendation system problem, in which the system must be able to cover as many users’ preferences as possible while these preferences change over time. This problem can be formulated as a variation of the maximum coverage problem; specifically, we introduce a novel problem of Online k-Hitting Set, where the number of sets and elements within the sets can change dynamically. When the number of distinctive elements is large, an exhaustive search for even a fixed number of elements is known to be computationally expensive. Even the static problem is known to be NP-hard (Hochba, ACM SIGACT News 28(2):40–52, 1997) and many known algorithms tend to have exponential growth in complexity. We propose a novel graph-based UCB1 algorithm that effectively minimizes the number of elements to consider, thereby reducing the search space greatly. The algorithm utilizes a new rewarding scheme to choose items that satisfy more users by balancing coverage and diversity as it constructs a relational graph between items to recommend. Experiments show that the new graph-based algorithm performs better than existing techniques such as Ranked Bandit (Radlinski et al. 2008) and Independent Bandits (Kohli et al. 2013) in terms of satisfying diverse types of users while minimizing computational complexity.

12.
Hüyük, Alihan; Tekin, Cem. Machine Learning, 2021, 110(6): 1233–1266.
We consider multi-objective multi-armed bandit with (i) lexicographically ordered and (ii) satisficing objectives. In the first problem, the goal is to select arms that are...

13.
14.
We propose a method that learns to allocate computation time to a given set of algorithms, of unknown performance, with the aim of solving a given sequence of problem instances in a minimum time. Analogous meta-learning techniques are typically based on models of algorithm performance, learned during a separate offline training sequence, which can be prohibitively expensive. We adopt instead an online approach, named GAMBLETA, in which algorithm performance models are iteratively updated, and used to guide allocation on a sequence of problem instances. GAMBLETA is a general method for selecting among two or more alternative algorithm portfolios. Each portfolio has its own way of allocating computation time to the available algorithms, possibly based on performance models, in which case its performance is expected to improve over time, as more runtime data becomes available. The resulting exploration-exploitation trade-off is represented as a bandit problem. In our previous work, the algorithms corresponded to the arms of the bandit, and allocations evaluated by the different portfolios were mixed, using a solver for the bandit problem with expert advice, but this required the setting of an arbitrary bound on algorithm runtimes, invalidating the optimal regret of the solver. In this paper, we propose a simpler version of GAMBLETA, in which the allocators correspond to the arms, such that a single portfolio is selected for each instance. The selection is represented as a bandit problem with partial information, and an unknown bound on losses. We devise a solver for this game, proving a bound on its expected regret. We present experiments based on results from several solver competitions, in various domains, comparing GAMBLETA with another online method.
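For orientation only, here is a generic Exp3-style adversarial bandit with partial information, the textbook form with losses assumed to lie in [0, 1]; it is not GAMBLETA's solver for unknown loss bounds, and the loss function and learning rate below are assumptions.

```python
import math, random

def exp3(n_arms, rounds, loss_fn, eta=0.1):
    """Exp3 for losses: only the chosen arm's loss is observed each round."""
    w = [0.0] * n_arms                              # log-weights
    for t in range(rounds):
        m = max(w)
        probs = [math.exp(x - m) for x in w]
        s = sum(probs)
        probs = [p / s for p in probs]
        arm = random.choices(range(n_arms), weights=probs)[0]
        loss = loss_fn(t, arm)                      # partial information
        w[arm] -= eta * loss / probs[arm]           # importance-weighted update
    return w

random.seed(0)
print(exp3(3, 2000, lambda t, a: random.random() * (0.3 if a == 1 else 0.8)))
```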

15.
This paper deals with the optimal stopping problem for multiarmed bandit processes. Under the assumption of independence of arms, we show that optimal strategies and stopping times are expressed by the dynamic allocation indices for each arm. This paper reduces the problem to several independent one-parameter optimal stopping problems. On the basis of these results, we characterize optimal strategies and stopping times. Moreover, the results are extended to the case allowing time constraints. In the case where the arms' states evolve according to Markov chains with finite state spaces, linear programming calculation of optimal strategies and stopping times is discussed.

16.
Applied Intelligence - The main aim of multimodal optimization problems (MMOPs) is to find and deal with multiple optimal solutions using an objective function. MMOPs perform the exploration and...

17.
In the multiarmed bandit problem the dilemma between exploration and exploitation in reinforcement learning is expressed as a model of a gambler playing a slot machine with multiple arms. A policy chooses an arm in each round so as to minimize the number of times that arms with suboptimal expected rewards are pulled. We propose the minimum empirical divergence (MED) policy and derive an upper bound on the finite-time regret which meets the asymptotic bound for the case of finite support models. In a setting similar to ours, Burnetas and Katehakis have already proposed an asymptotically optimal policy. However, we do not assume any knowledge of the support except for its upper and lower bounds. Furthermore, the criterion for choosing an arm, minimum empirical divergence, can be computed easily by a convex optimization technique. We confirm by simulations that the MED policy demonstrates good performance in finite time in comparison to other currently popular policies.
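A rough sketch of an MED-style selection step, simplified to Bernoulli rewards, is given below; the actual policy handles general bounded-support models via convex optimization, and the counts and empirical means here are assumed inputs.

```python
import math, random

def bern_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1 - eps); q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def med_select(counts, means):
    """Draw an arm with probability proportional to exp(-T_i * D_i), where D_i is
    the empirical divergence to the best empirical mean (0 for the best arm)."""
    best = max(means)
    weights = [math.exp(-n * bern_kl(m, best)) for n, m in zip(counts, means)]
    return random.choices(range(len(counts)), weights=weights)[0]

random.seed(0)
print(med_select(counts=[40, 25, 10], means=[0.6, 0.7, 0.3]))
```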

18.
The K-armed bandit problem is a well-known formalization of the exploration versus exploitation dilemma. In this learning problem, a player is confronted with a gambling machine with K arms where each arm is associated with an unknown gain distribution. The goal of the player is to maximize the sum of the rewards. Several approaches have been proposed in the literature to deal with the K-armed bandit problem. This paper first introduces the concept of “expected reward of greedy actions”, which is based on the notion of probability of correct selection (PCS), well known in the simulation literature. This concept is then used in an original semi-uniform algorithm which relies on the dynamic programming framework and on estimation techniques to optimally balance exploration and exploitation. Experiments with a set of simulated and realistic bandit problems show that the new DP-greedy algorithm is competitive with state-of-the-art semi-uniform techniques.
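As a point of reference, the canonical semi-uniform scheme (ε-greedy) looks like the sketch below; it is shown only to illustrate the family and is not the proposed DP-greedy algorithm, and the arm probabilities, horizon, and ε are illustrative assumptions.

```python
import random

def epsilon_greedy(true_p, rounds=5000, eps=0.1, seed=0):
    """Semi-uniform baseline: explore uniformly with probability eps, else exploit."""
    random.seed(seed)
    k = len(true_p)
    counts, means = [0] * k, [0.0] * k
    total = 0.0
    for _ in range(rounds):
        if random.random() < eps:
            j = random.randrange(k)                         # uniform exploration
        else:
            j = max(range(k), key=lambda i: means[i])       # greedy exploitation
        r = 1.0 if random.random() < true_p[j] else 0.0
        counts[j] += 1
        means[j] += (r - means[j]) / counts[j]
        total += r
    return total

print(epsilon_greedy([0.3, 0.5, 0.8]))
```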

19.
This paper reports on the development of a multi-agent approach to long-term information collection in networks of energy harvesting wireless sensors. In particular, we focus on developing energy management and data routing policies that adapt their behaviour according to the energy that is harvested, in order to maximise the amount of information collected given the available energy budget. In so doing, we introduce a new energy management technique, based on multi-armed bandit learning, that allows each agent to adaptively allocate its energy budget across the tasks of data sampling, receiving and transmitting. By using this approach, each agent can learn the optimal energy budget settings that give it efficient information collection in the long run. Then, we propose two novel decentralised multi-hop algorithms for data routing. The first provably maximises the information throughput in the network, but can sometimes involve high communication cost. The second algorithm provides near-optimal performance, but with reduced computational and communication costs. Finally, we demonstrate that, by using our approaches for energy management and routing, we can achieve a 120% improvement in long-term information collection against state-of-the-art benchmarks.

20.
The problem of rational behavior in a stochastic environment, also known as the two-armed bandit problem, is considered in the robust (minimax) setting. A parallel strategy is proposed leading to control which is arbitrarily close to the optimal one for environments with gains having Gaussian cumulative distribution functions with unit variance. An invariant recursive equation is obtained for computing the minimax strategy and risk, which are to be found as Bayesian ones associated with the worst-case a priori distribution. As a result, the well-known estimates of the minimax risk due to Vogel can be improved. Numerical experiments show that the strategy is efficient in environments with non-Gaussian distributions, e.g., binary ones.
