# An asymptotically optimal policy for finite support models in the multiarmed bandit problem

```bibtex
@article{Honda2011AnAO,
  title   = {An asymptotically optimal policy for finite support models in the multiarmed bandit problem},
  author  = {Junya Honda and Akimichi Takemura},
  journal = {Machine Learning},
  year    = {2011},
  volume  = {85},
  pages   = {361--391}
}
```

In the multiarmed bandit problem the dilemma between exploration and exploitation in reinforcement learning is expressed as a model of a gambler playing a slot machine with multiple arms. A policy chooses an arm in each round so as to minimize the number of times that arms with suboptimal expected rewards are pulled. We propose the minimum empirical divergence (MED) policy and derive an upper bound on the finite-time regret which meets the asymptotic bound for the case of finite support models…
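The MED policy is randomized: each arm is drawn with probability proportional to exp(−nᵢDᵢ), where Dᵢ is the minimum empirical divergence of arm i's empirical distribution from distributions whose mean matches the best empirical mean. As a hedged illustration only (not the paper's general finite-support implementation, which requires a convex optimization over the support), the following sketch covers the Bernoulli special case, where Dᵢ reduces to the binary KL divergence:

```python
import math
import random

def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def med_choose(counts, means):
    """MED-style draw for Bernoulli rewards: pick arm i with probability
    proportional to exp(-n_i * D_i), where D_i is the binary KL divergence
    from arm i's empirical mean to the best empirical mean."""
    mu_star = max(means)
    weights = [math.exp(-n * bernoulli_kl(m, mu_star))
               for n, m in zip(counts, means)]
    total = sum(weights)
    r = random.random() * total
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if r <= acc:
            return i
    return len(weights) - 1
```

The empirically best arm always has Dᵢ = 0 (weight 1), while a suboptimal arm's weight decays exponentially in its pull count, which is what yields the logarithmic regret growth.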

#### 106 Citations

An Asymptotically Optimal Bandit Algorithm for Bounded Support Models

- Mathematics, Computer Science
- COLT
- 2010

The Deterministic Minimum Empirical Divergence (DMED) policy is proposed, and it is proved that DMED achieves the asymptotic bound; the index used in DMED for choosing an arm can be computed easily by a convex optimization technique.

Non-asymptotic analysis of a new bandit algorithm for semi-bounded rewards

- Mathematics, Computer Science
- J. Mach. Learn. Res.
- 2015

This paper modifies this policy and derives a finite-time regret bound for the new policy, Indexed Minimum Empirical Divergence (IMED), by refining large deviation probabilities to a simple non-asymptotic form, and shows that IMED substantially improves on DMED and performs competitively with other state-of-the-art policies.

Finite-time Regret Bound of a Bandit Algorithm for the Semi-bounded Support Model

- Mathematics
- 2012

In this paper we consider stochastic multiarmed bandit problems. Recently a policy, DMED, was proposed and proved to achieve the asymptotic bound for the model in which each reward distribution is…

Minimal-Exploration Allocation Policies with Almost Sure, Arbitrarily Slow Growing Regret

- 2015

This paper considers the problem of sampling with stochastic multi-armed bandits, when each arm generates i.i.d. rewards that follow an unknown distribution. Typical sampling policies in the…

Kullback–Leibler upper confidence bounds for optimal sequential allocation

- Mathematics
- 2013

We consider optimal sequential allocation in the context of the so-called stochastic multi-armed bandit model. We describe a generic index policy, in the sense of Gittins (1979), based on upper…
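The KL-UCB index family described in this reference replaces Hoeffding-style confidence radii with a Kullback–Leibler confidence level: the index of arm i is the largest mean q such that nᵢ·kl(μ̂ᵢ, q) stays below an exploration level. As a hedged sketch for Bernoulli rewards (using log t as the exploration level is a simplification of the paper's log t + c log log t), the index can be computed by bisection:

```python
import math

def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def klucb_index(mean_i, n_i, t, tol=1e-6):
    """Largest q >= mean_i with n_i * kl(mean_i, q) <= log(t),
    found by bisection (kl(p, q) is increasing in q on [p, 1])."""
    level = math.log(max(t, 2)) / n_i
    lo, hi = mean_i, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if bernoulli_kl(mean_i, mid) <= level:
            lo = mid
        else:
            hi = mid
    return lo
```

As with UCB-style indices, the confidence bonus shrinks as an arm accumulates pulls, so the index of a well-sampled arm converges to its empirical mean.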

ASYMPTOTICALLY OPTIMAL MULTI-ARMED BANDIT POLICIES UNDER A COST CONSTRAINT

- Mathematics
- Probability in the Engineering and Informational Sciences
- 2016

We consider the multi-armed bandit problem under a cost constraint. Successive samples from each population are i.i.d. with unknown distribution and each sample incurs a known population-dependent…

Optimal Data Driven Resource Allocation under Multi-Armed Bandit Observations

- Mathematics, Computer Science
- 2018

This paper introduces the first asymptotically optimal strategy for a multi-armed bandit (MAB) model under side constraints and provides the explicit form of such policies for the case in which the unknown distributions are Normal with unknown means and known variances.

EXPLORATION–EXPLOITATION POLICIES WITH ALMOST SURE, ARBITRARILY SLOW GROWING ASYMPTOTIC REGRET

- Mathematics
- Probability in the Engineering and Informational Sciences
- 2020

The purpose of this paper is to provide further understanding into the structure of the sequential allocation (“stochastic multi-armed bandit”) problem by establishing probability one finite horizon…

Sub-sampling for Multi-armed Bandits

- Computer Science, Mathematics
- ECML/PKDD
- 2014

A novel algorithm based on sub-sampling is introduced; it demonstrates excellent empirical performance against state-of-the-art algorithms, including Thompson sampling and KL-UCB.

Asymptotic Behavior of Minimal-Exploration Allocation Policies: Almost Sure, Arbitrarily Slow Growing Regret

- Mathematics, Computer Science
- ArXiv
- 2015

The purpose of this paper is to provide further understanding into the structure of the sequential allocation problem by establishing probability one finite horizon bounds and convergence rates for the sample (or "pseudo") regret associated with two simple classes of allocation policies.

#### References

SHOWING 1-10 OF 39 REFERENCES

An Asymptotically Optimal Bandit Algorithm for Bounded Support Models

- Mathematics, Computer Science
- COLT
- 2010

The Deterministic Minimum Empirical Divergence (DMED) policy is proposed, and it is proved that DMED achieves the asymptotic bound; the index used in DMED for choosing an arm can be computed easily by a convex optimization technique.

Finite-time Analysis of the Multiarmed Bandit Problem

- Computer Science
- Machine Learning
- 2004

This work shows that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support.
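The policy analyzed in this reference is widely known as UCB1: pull the arm maximizing the empirical mean plus the confidence radius √(2 ln n / nᵢ). A minimal sketch (the function name is illustrative):

```python
import math

def ucb1_choose(counts, means):
    """Return the arm maximizing the UCB1 index:
    empirical mean + sqrt(2 * ln(total pulls) / pulls of arm i)."""
    # Initialization: play any arm that has never been pulled.
    for i, c in enumerate(counts):
        if c == 0:
            return i
    n = sum(counts)
    indices = [m + math.sqrt(2.0 * math.log(n) / c)
               for m, c in zip(means, counts)]
    return max(range(len(indices)), key=indices.__getitem__)
```

The bonus term shrinks as an arm is pulled, so rarely pulled arms are revisited until their confidence intervals fall below the best arm's mean, which gives the uniform-in-time logarithmic regret claimed above.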

The Continuum-Armed Bandit Problem

- Mathematics
- 1995

In this paper we consider the multiarmed bandit problem where the arms are chosen from a subset of the real line and the mean rewards are assumed to be a continuous function of the arms. The problem…

SAMPLE MEAN BASED INDEX POLICIES WITH O(log n) REGRET FOR THE MULTI-ARMED BANDIT PROBLEM

- Mathematics
- 1995

We consider a non-Bayesian infinite horizon version of the multi-armed bandit problem with the objective of designing simple policies whose regret increases slowly with time. In their seminal work on…

The Nonstochastic Multiarmed Bandit Problem

- Mathematics, Computer Science
- SIAM J. Comput.
- 2002

A solution is given to the bandit problem in which an adversary, rather than a well-behaved stochastic process, has complete control over the payoffs.

PAC Bounds for Multi-armed Bandit and Markov Decision Processes

- Computer Science
- COLT
- 2002

The bandit problem is revisited and considered under the PAC model, and it is shown that given n arms, it suffices to pull the arms O((n/ε²) log(1/δ)) times to find an ε-optimal arm with probability at least 1 − δ.

Nearly Tight Bounds for the Continuum-Armed Bandit Problem

- Computer Science, Mathematics
- NIPS
- 2004

This work considers the case when the set of strategies is a subset of ℝ^d and the cost functions are continuous, and improves on the best-known upper and lower bounds, closing the gap to a sublogarithmic factor.

Multi-armed Bandit Algorithms and Empirical Evaluation

- Computer Science
- ECML
- 2005

A new algorithm, Poker (Price Of Knowledge and Estimated Reward), is described and analyzed; its performance compares favorably to that of other existing algorithms in several experiments, and it often proves hard to beat.

Optimal Adaptive Policies for Sequential Allocation Problems

- Mathematics
- 1996

Consider the problem of sequential sampling from m statistical populations to maximize the expected sum of outcomes in the long run. Under suitable assumptions on the unknown parameters, it is…

The Multi-Armed Bandit Problem: Decomposition and Computation

- Mathematics, Computer Science
- Math. Oper. Res.
- 1987

It is shown that an approximate largest-index rule yields an approximately optimal policy for the N-project problem, and more efficient methods are given for computing the indices on-line and/or for sparse transition matrices in large state spaces than have been suggested heretofore.