# The non-stationary stochastic multi-armed bandit problem

## Abstract

We consider a variant of the stochastic multi-armed bandit with *K* arms where the rewards are not assumed to be identically distributed, but are generated by a non-stationary stochastic process. We first study the *unique best arm* setting, where there exists one unique best arm. Second, we study the general *switching best arm* setting, where the best arm switches at some unknown steps. For both settings, we target problem-dependent bounds, instead of the more conservative problem-free bounds. We consider two classical problems: (1) identify a best arm with high probability (best arm identification), for which the performance is measured by the sample complexity (the number of samples needed before finding a near-optimal arm). To this end, we naturally extend the definition of sample complexity so that it makes sense in the switching best arm setting, which may be of independent interest. (2) Achieve the smallest cumulative regret (regret minimization), where the regret is measured with respect to the strategy pulling an arm with the best instantaneous mean at each step.

### Keywords

Multi-armed bandit · Non-stationarity · Adversarial bandit

## 1 Introduction

The theoretical framework of the multi-armed bandit problem formalizes the fundamental exploration/exploitation dilemma that appears in decision-making problems facing partial information. At a high level, a set of *K* arms is available to a player. At each turn, she has to choose one arm and receives a reward corresponding to the played arm, without knowing what would have been the received reward had she played another arm. The player faces the dilemma of *exploring*, that is playing an arm whose mean reward is loosely estimated in order to build a better estimate, or *exploiting*, that is playing a seemingly best arm based on current mean estimates in order to maximize her cumulative reward. The accuracy of the player policy at time horizon *T* is typically measured in terms of *sample complexity* or of *regret*. The sample complexity is the number of plays required to find an approximation of the best arm with high probability. In that case, the player can stop playing after identifying this arm. The regret is the difference between the cumulative rewards of the player and the one that could be acquired by a policy assumed to be optimal.

The **stochastic** multi-armed bandit problem assumes the rewards to be generated independently from a stochastic distribution associated with each arm. Stochastic algorithms usually assume these distributions to be constant over time, as in Thompson Sampling (TS) [17], UCB [2] or Successive Elimination (SE) [6]. Under this assumption of *stationarity*, TS and UCB achieve optimal upper bounds on the cumulative regret, with logarithmic dependencies on *T*. The SE algorithm achieves a near-optimal sample complexity.
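As an illustration of the stationary index policies mentioned above, the following minimal Python sketch computes the UCB1 index of [2]; the function names and the two-arm usage are ours, not part of the original algorithmic statement.

```python
import math

def ucb1_index(mean_estimate, pulls, t):
    # Empirical mean plus an exploration bonus that shrinks as the arm
    # is pulled more often; this is the classical UCB1 index of [2].
    return mean_estimate + math.sqrt(2.0 * math.log(t) / pulls)

def choose_arm(mean_estimates, pulls, t):
    # Play the arm maximizing the index (every arm pulled at least once).
    K = len(mean_estimates)
    return max(range(K), key=lambda k: ucb1_index(mean_estimates[k], pulls[k], t))
```

With equal pull counts, the arm with the best empirical mean is chosen; a rarely pulled arm receives a large bonus and gets explored.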

In the **adversarial** multi-armed bandit problem, rewards are chosen by an adversary. This formulation can model any form of non-stationarity. The EXP3 algorithm [3, 14] achieves an optimal regret of \(O(\sqrt{T})\) against an oblivious opponent that chooses rewards before the beginning of the game, with respect to the best policy pulling the same arm during the whole game. This weakness is partially overcome by EXP3.S [3], a variant of EXP3 that forgets the past by adding at each time-step a proportion of the mean gain to every weight, and achieves controlled regret with respect to policies that allow arm switches during the run.
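The multiplicative-weight mechanism of EXP3 described above can be sketched as follows; this is a standard rendering of the update of [3], with our own function names, not the exact pseudocode of the paper.

```python
import math

def exp3_probs(weights, gamma):
    # Mix the exponential weights with a uniform exploration term gamma.
    total = sum(weights)
    K = len(weights)
    return [(1.0 - gamma) * w / total + gamma / K for w in weights]

def exp3_update(weights, probs, arm, reward, gamma):
    # Importance-weighted reward estimate: unbiased because the reward
    # is divided by the probability of having observed it.
    x_hat = reward / probs[arm]
    weights[arm] *= math.exp(gamma * x_hat / len(weights))
    return weights
```

EXP3.S additionally mixes a proportion of the mean weight into every arm at each step, which bounds how small a weight can get and lets the algorithm track switching policies.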

The **switching bandit** problem introduces non-stationarity within the *stochastic* bandit problem by allowing means to change at some time-steps. As mean rewards stay stationary between those changes, this setting is also qualified as *piecewise-stationary*. *Discounted* UCB [13] and *sliding-window* UCB [8] are adaptations of UCB to the switching bandit problem and achieve a regret bound of \(O(\sqrt{M T \log T})\), where \(M-1\) is the number of distribution changes. It is also worth citing Meta-Eve [10], which associates UCB with a mean change detector, resetting the algorithm when a change is detected. While no analysis is provided, it has demonstrated strong empirical performances.

*Stochastic and adversarial* Several variants combining stochastic and adversarial rewards have been proposed by Seldin and Slivkins [15] or Bubeck and Slivkins [5]. For instance, in the setting with *contaminated rewards*, rewards are mainly drawn from stationary distributions, except for a minority of mean rewards chosen in advance by an adversary. In order to guarantee that their proposed algorithm EXP3++ [15] achieves logarithmic guarantees, the adversary is constrained in the sense that it cannot lower the gap between arms by more than a factor 1/2. They also proposed another variant called *adversarial with gap* [15], which assumes the existence of a round after which one arm remains the best. These works are motivated by the desire to create generic algorithms able to perform bandit tasks with various reward types: stationary, adversarial or mainly stationary. However, despite achieving good performances on a wide range of problems, each one needs a specific parameterization (i.e., an instance of EXP3++ parameterized for stationary rewards may not perform well if rewards are chosen by an adversary).

*Our contribution* We consider a generalization of the stationary stochastic, piecewise-stationary and adversarial bandit problems. In this formulation, rewards are drawn from stochastic distributions of arbitrary means defined before the beginning of the game. Our first contribution is for the unique best arm setting. We introduce a deceptively simple variant of the Successive Elimination (SE) algorithm, called Successive Elimination with Randomized Round-Robin (SER3), and we show that the seemingly minor modification—a randomized round-robin procedure—leads to a dramatic improvement of the performance over the original SE algorithm. We identify a notion of gap that generalizes the gap from stochastic bandits to the non-stationary case and derive *gap-dependent* (also known as problem-dependent) sample complexity and regret bounds, instead of the more classical and less informative *problem-free* bounds. We show for instance in Theorem 1 and Corollary 1 that SER3 achieves a non-trivial problem-dependent sample complexity scaling with \({\varDelta }^{-2}\) and a cumulative regret in \(O(K\log (TK/{\varDelta })/{\varDelta })\) after *T* steps, in situations where SE may even suffer from a linear regret, as supported by numerical experiments (see Sect. 5). This result positions, under some assumptions, SER3 as an alternative to EXP3 when the rewards are non-stationary.

Our second contribution is to manage best arm switches during the game. First, we extend the definition of the sample complexity in order to analyze best arm identification algorithms when the best arm switches during the game. SER4 takes advantage of the low regret of SER3 by randomly resetting the reward estimators during the game and then starting a new phase of optimization. Against an optimal policy with \(N-1\) switches of the optimal arm (but arbitrarily many distribution switches), this new algorithm achieves an expected sample complexity of \(O({\varDelta }^{-2}\sqrt{NK\delta ^{-1} \log (K \delta ^{-1})})\), with probability \(1-\delta \), and an expected cumulative regret of \(O({\varDelta }^{-1}{\sqrt{NTK \log (TK)}})\) after *T* time-steps. Our second algorithm for the non-stationary stochastic multi-armed bandit with switches, EXP3.R, is an alternative to the passive approach used in SER4 (the random resets): it takes advantage of the exploration factor of EXP3 to compute unbiased estimations of the mean rewards. Combined with a drift detector, this active approach resets the weights of EXP3 when a change of best arm is detected. We finally show that EXP3.R also obtains competitive problem-dependent regret minimization guarantees in \(O\left( 3NCK\sqrt{TK \log T}\right) \), where *C* depends on \({\varDelta }\).

## 2 Setting

We consider a generalization of the stationary stochastic, piecewise-stationary and adversarial bandit problems where the adversary chooses before the beginning of the game a sequence of *distributions* instead of directly choosing a sequence of rewards. This formulation generalizes the adversarial setting since choosing arbitrarily a reward \(y_k(t)\) is equivalent to drawing this reward from a distribution of mean \(y_k(t)\) and a variance of zero. The stationary stochastic formulation of the bandit problem is a particular case, where the distributions do not change.
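A minimal sketch of this reward model, under the assumption of Bernoulli distributions (any distribution supported on [0, 1] would do): the adversary commits to a full table of means before the game, and a zero-variance table recovers the adversarial setting.

```python
import random

def make_environment(mean_table, seed=0):
    # mean_table[k][t] is mu_k(t), fixed before the game starts.
    # Rewards are Bernoulli draws, so y_k(t) lies in [0, 1] and has
    # expectation mu_k(t).
    rng = random.Random(seed)

    def pull(k, t):
        return 1.0 if rng.random() < mean_table[k][t] else 0.0

    return pull
```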

### 2.1 The problem

The game is played with *K* arms. The reward \(y_{k_t}(t) \in [0,1]\) obtained by the player after playing the arm \(k_t\) is drawn from a distribution of mean \(\mu _{k_t}(t) \in [0,1]\). The instantaneous gap between arms *k* and \(k^\prime \) at time *t* is:

\({\varDelta }_{k,k^\prime }(t) = \mu _k(t) - \mu _{k^\prime }(t),\)

and \(k^*(t)\) denotes an arm of maximal mean at time *t*.

### 2.2 The notion of sample complexity

In the literature [12], the sample complexity of an algorithm is the number of samples needed by this algorithm to find a policy achieving a specific level of performance with high probability. We denote by \(\delta \in (0,1]\) the probability of failure. For instance, for best arm identification in the stochastic stationary bandit (that is, when \(\forall k \forall t, \mu _k(t)=\mu _k(t+1)\) and \(k^*(t)=k^*(t+1)\)), the sample complexity is the number of samples needed to find, with a probability of at least \(1-\delta \), the arm \(k^*\) with the maximum mean reward. Analysis in sample complexity is useful for situations where the knowledge of the optimal arm is needed to make one impactful decision, for example to choose which one of several possible products to manufacture, or for building hierarchical models of contextual bandits in a greedy way [7], reducing the exploration space.

### Definition 1

*(Sample complexity)* Let A be an algorithm. An arm *k* is \(\epsilon \)-optimal if \(\mu _k \ge \mu ^* - \epsilon \), with \(\epsilon \in [0,1]\). The sample complexity of A performing a best arm identification task is the number of observations needed to find an \(\epsilon \)-optimal arm with a probability of at least \(1-\delta \).

The usual notion of sample complexity—the minimal number of observations required to find a near-optimal arm with high probability—is well adapted to the case when there exists a unique best arm during all the game, but makes little sense in the general scenario when the best arm can change. Indeed, after a best arm change, a learning algorithm requires some time-steps before recovering. Thus, we provide in Sect. 4 a meaningful extension of the sample complexity definition to the *switching best arm* scenario. This extended notion of sample complexity now takes into account not only the number of time-steps required by the algorithm to identify a near-optimal arm, but more generally the number of time-steps required before recovering a near-optimal arm after each change.

### 2.3 The notion of regret

When the decision process does not lead to one final decision, minimizing the sample complexity may not be an appropriate goal. Instead, we may want to maximize the cumulative gain obtained throughout the game, which is equivalent to minimizing the difference between the choices of an optimal policy and those of the player. We call this difference the regret. We define the *pseudocumulative regret* as the difference of mean rewards between the arms chosen by the optimal policy and those chosen by the player.

### Definition 2

*(Pseudocumulative regret)* \(\sum _{t=1}^{T} \left( \mu _{k^*(t)}(t) - \mu _{k_t}(t)\right) \)

Usually, in the stochastic bandit setting, the distributions of rewards are stationary and the instantaneous gap \({\varDelta }_{k,k^\prime }(t) = \mu _k(t) -\mu _{k^\prime }(t)\) is the same for all the time-steps.
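Concretely, the pseudocumulative regret compares, at each step, the player's arm with an arm of best instantaneous mean; a small sketch (our own helper, for illustration):

```python
def pseudo_cumulative_regret(mu, played):
    # mu[t][k] is the mean mu_k(t); played[t] is the arm chosen by the
    # player at step t. The optimal policy plays an arm with the best
    # instantaneous mean at each step.
    return sum(max(mu[t]) - mu[t][played[t]] for t in range(len(played)))
```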

There exists a non-reciprocal relation between the minimization of the sample complexity and the minimization of the pseudocumulative regret. For instance, the algorithm UCB has an order-optimal regret, but it does not minimize the sample complexity: UCB continues to play suboptimal arms, though with a decreasing frequency as the number of plays increases. However, an algorithm with an optimal sample complexity, like Median Elimination [6], will also have an optimal pseudocumulative regret (up to some constant factors). More details on the relation between both lower bounds can be found in [4, 9].

Therefore, the algorithms presented in this paper slightly differ according to the quantity to minimize, the regret or the sample complexity. For instance, when the target is the regret minimization, after identifying the best arm, the algorithms continue to sample it, whereas in the case of sample complexity minimization, the algorithms stop the sampling process when the best arm is identified. When best arm switches are considered, algorithms minimizing the sample complexity enter a waiting state after identifying the current best arm and do not sample the sequence for exploitation purposes (sampling the optimal arm still increases the sample complexity). However, they still have to parsimoniously collect samples for each action in order to detect best arm changes and face a new trade-off between the rate of sampling and the time needed to find the new best arm after a switch.

## 3 Non-stationary stochastic multi-armed bandit with unique best arm

In this section, we present algorithm Successive Elimination with Randomized Round-Robin (SER3, see Algorithm 1), a randomized version of Successive Elimination which tackles the best arm identification problem when rewards are non-stationary.

### 3.1 A modified successive elimination algorithm

We elaborate on several notions required to understand the behavior of the algorithm and to relax the constraint of stationarity.

#### 3.1.1 The elimination mechanism

#### 3.1.2 Hoeffding’s inequality

Successive Elimination assumes that the rewards are drawn from stochastic distributions that are identical over time (rewards are identically distributed). However, the Hoeffding inequality used by this algorithm does not require stationarity, only independence. We recall Hoeffding's inequality:

### Lemma 1

*(Hoeffding's inequality* [11]*)* Let \(X_1,\ldots ,X_n\) be independent random variables with \(X_i \in [0,1]\). Then, for all \(\epsilon > 0\):

\({\mathbb {P}}\left( \left| \frac{1}{n}\sum _{i=1}^{n} X_i - \frac{1}{n}\sum _{i=1}^{n} {\mathbb {E}}[X_i] \right| \ge \epsilon \right) \le 2 \exp \left( -2 n \epsilon ^2\right) \)

Thus, we can use this inequality to calculate confidence bounds of empirical means computed with rewards drawn from non-identical distributions.
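For instance, with rewards bounded in [0, 1], inverting the two-sided Hoeffding bound \(2\exp (-2n\epsilon ^2) = \delta \) gives the confidence radius used by elimination algorithms; a small helper (the name is ours):

```python
import math

def hoeffding_radius(n, delta):
    # Radius epsilon such that the empirical mean of n independent
    # rewards in [0, 1] deviates from the average of their (possibly
    # non-identical) expectations by at least epsilon with probability
    # at most delta: 2 * exp(-2 * n * eps^2) = delta.
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))
```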

#### 3.1.3 Randomization of the round-robin

A sequence of mean rewards tricking a deterministic bandit algorithm:

| \(\mu _k(t)\) | \(t=1\) | \(t=2\) | \(t=3\) | \(t=4\) | \(t=5\) | \(t=6\) |
|---|---|---|---|---|---|---|
| \(k=1\) | 0.6 | 1 | 0.6 | 1 | 0.6 | 1 |
| \(k=2\) | 0.4 | 0.8 | 0.4 | 0.8 | 0.4 | 0.8 |

The best arm seems to be \(k=1\) as \(\mu _1(t)\) is greater than \(\mu _2(t)\) at every time-step *t*. However, by sampling the arms with a deterministic policy playing sequentially \(k=1\) and then \(k=2\), after \(t=6\) the algorithm has only sampled rewards from a distribution of mean 0.6 for \(k=1\) and of mean 0.8 for \(k=2\). After enough time following this pattern, an elimination algorithm will eliminate the first arm. Our algorithm SER3 adds a shuffling of the arm set after each round-robin cycle to Successive Elimination and avoids this behavior.
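The trap above is easy to reproduce numerically. In the sketch below (our own simulation, with arms indexed 0 and 1), a deterministic round-robin always observes arm 0 at odd steps and arm 1 at even steps, so the observed means are 0.6 and 0.8, inverting the true ordering; shuffling the order in each cycle restores it.

```python
import random

def mu(k, t):
    # Means from the table: arm 0 alternates 0.6/1.0, arm 1 alternates
    # 0.4/0.8, so arm 0 is better at every single time-step t >= 1.
    base = 0.6 if k == 0 else 0.4
    return base + (0.4 if t % 2 == 0 else 0.0)

def sampled_means(order_fn, cycles):
    # Play one arm per time-step, in the order given by order_fn for
    # each round-robin cycle, and average what each arm shows us.
    seen = {0: [], 1: []}
    t = 1
    for c in range(cycles):
        for k in order_fn(c):
            seen[k].append(mu(k, t))
            t += 1
    return {k: sum(v) / len(v) for k, v in seen.items()}

deterministic = sampled_means(lambda c: [0, 1], 1000)  # always 0 then 1
rng = random.Random(0)
randomized = sampled_means(lambda c: rng.sample([0, 1], 2), 1000)
```

Here `deterministic` reports arm 1 as the better one (observed means 0.8 versus 0.6), while `randomized` recovers the true ordering.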

#### 3.1.4 Uniqueness of the best arm

In the unique best arm setting, we define the **optimal arm** as the arm maximizing the cumulative mean reward:

\(k^* = \mathop {\arg \max }_{k \in [K]} \sum _{t=1}^{T} \mu _k(t).\)

As an efficient algorithm will find the best arm before the end of the run, we use Assumption 1 to ensure its uniqueness at every time-step. First, we define some notations. A run of SER3 is a succession of round-robin cycles. The set \([\tau ] = \{(t_1,|S_1|),\ldots ,(t_\tau , |S_\tau |)\}\) is a realization of SER3, where \(t_i\) is the time-step when the \(i{\text {th}}\) round-robin cycle, of size \(|S_i|\), starts (\(t_i = 1 + \sum _{j=1}^{i-1} |S_j|\)). As arms are only eliminated, \(|S_i| \ge |S_{i+1}|\). We denote by \({\mathbb {T}}(\tau )\) the set containing all possible realizations of \(\tau \) round-robin steps. We can now introduce Assumption 1, which ensures the best arm is the same at every time-step.

### Assumption 1

*(Positive mean gap)* For any \(k \in [K]-\{k^*\}\) and any \([\tau ] \in {\mathbb {T}}(\tau )\) with \(\tau \ge \tau _{\min }\), we have:

\({\varDelta }^*_k\left( [\tau ]\right) = \frac{1}{\tau } \sum _{i=1}^{\tau } \frac{1}{|S_i|} \sum _{t=t_i}^{t_i+|S_i|-1} \left( \mu _{k^*}(t) - \mu _k(t)\right) > 0.\)

This assumption can be related to the setting of *moderately contaminated rewards*, i.e., the adversary does not lower the averaged gap too much. Another analogy can be made with the *adversarial with gap* setting [15], \(\tau _{\min }\) representing the time needed for the optimal arm to accumulate enough rewards and distinguish itself from the suboptimal arms.

Figure 1a illustrates Assumption 1. In this example, the mean of the optimal arm \(k^*\) is lower than that of the second arm on the time-steps \(t \in \{5,6,7\}\). Thus, even if the instantaneous gap is negative during these time-steps, the mean gap \({\varDelta }^*_k\left( [\tau ]\right) \) stays positive. The parameter \(\tau _{\min }\) protects the algorithm from local noise at the initialization of the algorithm. In order to ease the reading of the results in the next sections, we here assume \(\tau _{\min } = \log \frac{K}{\delta }\).
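To make the quantity in Assumption 1 concrete, the sketch below averages the instantaneous gaps over a realization \([\tau ]\) (the exact normalization of the paper's definition is not reproduced here; this uniform average is an illustrative assumption):

```python
def mean_gap(mu_star, mu_k, realization):
    # realization is a list of (t_i, size_i) pairs: the i-th round-robin
    # cycle starts at t_i and covers size_i consecutive time-steps.
    # mu_star(t) and mu_k(t) give the means of k* and k at step t.
    gaps = []
    for t_i, size in realization:
        for t in range(t_i, t_i + size):
            gaps.append(mu_star(t) - mu_k(t))
    return sum(gaps) / len(gaps)
```

On an example in the spirit of Fig. 1a, the instantaneous gap is negative on \(t \in \{5,6,7\}\) but the average stays positive, so the assumption holds.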

Assumption 1 can be seen as a sanity-check assumption ensuring that the best arm identification problem indeed makes sense. In Sect. 4, we consider the more general switching bandit problem. In this case, Assumption 1 may not be verified (see Fig. 1b) and is naturally extended by dividing the game in segments wherein Assumption 1 is satisfied (Figs. 2, 3 and 4).

### 3.2 Analysis

All theoretical results are provided for \(\epsilon = 0\) and therefore accept only \(k^*\) as the optimal arm.

### Theorem 1

The proof is given in “Proof of Theorem 1 and Theorem 2” of Appendix 2.

Guarantees on the sample complexity can be transposed into guarantees on the pseudocumulative regret. In that case, when only one arm remains in the set, the player continues to play this last arm until the end of the game.

### Corollary 1

The proof is given in “Proof of Corollary 1” of Appendix 2.

In particular, when Assumption 1 holds for every *t* and all \([\tau ]\), SER3 can face a *near adversarial* sequence of rewards while achieving a gap-dependent logarithmic pseudocumulative regret.

### Remark

These logarithmic guarantees result from Assumption 1, which allows stopping the exploration of eliminated arms. They do not contradict the lower bound in \({\varOmega }(\sqrt{T})\) for non-stationary bandits [8], as that bound is due to the cost of the constant exploration needed when the best arm changes.

### 3.3 Non-stationary stochastic multi-armed bandit with budget

We study the case when the sequence from which the rewards are drawn does not satisfy Assumption 1.

The sequence of mean rewards is built by the adversary in two steps. First, the adversary chooses the mean rewards \(\mu _k(1),\ldots ,\mu _k(T)\) associated with each arm in such a way that Assumption 1 is satisfied. The adversary can then apply a malus \(b_k(t) \in [0, \mu _k(t)]\) to each mean reward to obtain the final sequence. The mean reward of the arm *k* at time *t* is \(\mu _k(t) - b_k(t)\). The budget spent by the adversary for the arm *k* is \(B_k = \sum _{t=1}^{T} b_k(t)\). We denote by \(B \ge \max _k B_k\) an upper bound on the budget of the adversary.
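The adversary's budget is a simple aggregate of the maluses; a sketch (our own helper names):

```python
def adversary_budget(malus):
    # malus[k][t] = b_k(t), the malus applied to arm k at step t.
    # B_k is the total malus spent on arm k; any B >= max_k B_k is a
    # valid upper bound on the adversary's budget.
    per_arm = [sum(row) for row in malus]
    return per_arm, max(per_arm)
```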

We assume the budget *B* is known. To handle it, the elimination condition (Inequality (3) in Algorithm 1) is replaced by the following:

### Theorem 2

The proof is given in “Proof of Theorem 1 and Theorem 2” of Appendix 2.

## 4 Non-stationary stochastic multi-armed bandit with best arm switches

The switching bandit problem has been proposed by Garivier et al. [8] and assumes means to be stationary between switches. In particular, the algorithm SW–UCB is built on this assumption and is a modification of UCB using only the rewards obtained inside a sliding window. In our setting, we allow mean rewards to change at every time-step and consider that a best arm switch occurs when the arm with the highest mean changes. This setting provides an alternative to the adversarial bandit with budget when *B* is very high or unknown.

The **optimal policy** is the sequence of couples (optimal arm, time when the switch occurred):

### 4.1 Successive Elimination with Randomized Round-Robin and Resets (SER4)

Definition 1 of the sample complexity is not adapted to the switching bandit problem. Indeed, this definition measures the number of observations needed by an algorithm to find one unique best arm. When the best arm changes during the game, this definition is too limiting. In Sect. 4.1.1, we introduce a generalization of the sample complexity to the case of switching policies.

#### 4.1.1 The sample complexity of the best arm identification problem with switches

A cost is added to the usual sample complexity. This cost is equal to the number of iterations after a switch during which the player does not know the optimal arm and does not sample.

### Definition 3

*(Sample complexity with switches)* Let A be an algorithm. The sample complexity of A performing a best arm identification task for a segmentation \(\{T_n\}_{n=1..N}\) of [1 : *T*], with \(T_1=1< T_2< \dots< T_N<T\), is:

\(\sum _{n=1}^{N} \sum _{t=T_n}^{T_{n+1}-1} \left( s(t) + (1-s(t))\, \mathbb {1}\left[ k_t \ne k^*_n\right] \right) ,\)

where *s*(*t*) is a binary variable equal to 1 if and only if the time-step *t* is used by the sampling process of A, \(k_t\) is the arm identified as optimal by A at time *t*, \(k^*_n\) is the optimal arm over the segment *n*, and \(T_{N+1}=T+1\).

- \(s(t)=1\) if the algorithm is sampling an arm during the time-step *t*. In the case of SER4, \(s(t)=1\) when \(|S_\tau | \ne 1\), and the sample complexity increases by one.
- \(s(t)=0\) if the algorithm submits an arm as the optimal one during the time-step *t*. In the case of SER4, \(s(t)=0\) when \(|S_\tau | = 1\). The sample complexity increases by one if \(k_t\ne k^*(t)\).
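Operationally, the two cases above can be folded into one counter; the sketch below (our own rendering of this accounting) charges every sampling step, and charges a non-sampling step only when the submitted arm is not the optimal arm of the current segment:

```python
def sample_complexity_with_switches(s, submitted, optimal):
    # s[t] = 1 if step t is used by the sampling process, 0 if the
    # algorithm submits an arm instead; submitted[t] is the arm
    # identified as optimal at step t, optimal[t] the true optimal arm.
    cost = 0
    for t in range(len(s)):
        if s[t] == 1:
            cost += 1          # every sampling step counts
        elif submitted[t] != optimal[t]:
            cost += 1          # a wrong submission after a switch counts
    return cost
```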

#### 4.1.2 Algorithm

#### 4.1.3 Analysis

We now provide the performance guarantees of SER4 algorithm, both in terms of sample complexity and of pseudocumulative regret.

The following results hold in expectation and with high probability. The expectations are taken with respect to the randomization of the resets. The sample complexity and the pseudocumulative regret achieved by the algorithm between resets (given by the analysis of SER3) still hold with high probability.

### Theorem 3

The proof is given in “Proof of Theorem 3” of Appendix 2.

We tune \(\varphi \) in order to minimize the sample complexity.

### Corollary 2

### Remark 2

Transposing Theorem 3 to the case where \(\epsilon \in [\frac{1}{KT},1]\) is straightforward. This allows tuning the bound by setting \(\varphi = \epsilon \sqrt{( N \delta )/(K \log (TK))}\).

This result can also be transposed into a bound on the expected cumulative regret. We consider that the algorithm continues to play the last arm of the set until a reset occurs.

### Corollary 3

The proof is given in “Proof of Corollary 3” of Appendix 2.

### Remark 3

A similar dependency in \(\sqrt{T}{\varDelta }^{-1}\) appears also in SW–UCB (see Theorem 1 in [8]) and is standard in this type of results.

### 4.2 EXP3 with resets

#### 4.2.1 EXP3 algorithm

The EXP3 algorithm (see Algorithm 3) minimizes the regret against the best arm using an unbiased estimation of the cumulative reward at time *t* to compute the choice probabilities of each action. While this policy can be viewed as optimal in a truly adversarial setting, in many practical cases the non-stationarity within a time period exists but is weak, and is only noticeable between different periods. If an arm performs well over a long time period but is extremely bad on the next period, the EXP3 algorithm can need a number of trials equal to the first period's length before switching its most played arm.

#### 4.2.2 The detection test

The parameters *H* and \(\delta \) define the minimal number of \(\gamma \)-observations per arm needed to call a test of accuracy \(\epsilon \) with a probability \(1-\delta \). They will be fixed in the analysis (see Corollary 4), and the correctness of the test is proven in Lemma 2. We denote by \({\bar{\mu }}^k(I)\) the empirical mean of the rewards acquired from the arm *k* on the interval *I* using only \(\gamma \)-observations, and by \({\varGamma }_{\text {min}}(I)\) the smallest number of \(\gamma \)-observations among the actions on the interval *I*. The detector is called only when \({\varGamma }_{\text {min}}(I) \ge \frac{\gamma H}{K}\). The detector raises an alert when the action \(k_{\max }\) with the highest empirical mean \({\bar{\mu }}^{k}(I-1)\) on the interval \(I-1\) is eliminated by another one on the current interval.
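A sketch of this test logic, under assumed notations: we compare empirical means computed from \(\gamma \)-observations on two consecutive intervals, and use a \(2\epsilon \) margin as a stand-in for the paper's exact confidence condition (which we do not reproduce here):

```python
def drift_alert(means_prev, means_cur, epsilon):
    # k_max: the empirically best arm on the previous interval.
    k_max = max(range(len(means_prev)), key=lambda k: means_prev[k])
    # Alert if some other arm beats k_max by more than 2 * epsilon on
    # the current interval, i.e., k_max is "eliminated".
    return any(means_cur[k] - means_cur[k_max] > 2.0 * epsilon
               for k in range(len(means_cur)) if k != k_max)
```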

#### 4.2.3 EXP3.R algorithm

#### 4.2.4 Analysis

In this section we analyze the drift detector, and then, we bound the expected regret of EXP3.R algorithm.

### Assumption 2

*(Accuracy of the drift detector)* During each of the segments *S* where \(k^*_S\) is the optimal arm, the gap between \(k^*_S\) and any other arm is at least \(4\epsilon \).

Lemma 2 guarantees that when Assumption 2 holds and the interval *I* is included in the segment *S*, then, with high probability, the test raises a detection if and only if the optimal action \(k^*_S\) eliminates a suboptimal action.

### Lemma 2

The proof is given in “Proof of Lemma 2” of Appendix 2.

Theorem 4 bounds the expected cumulative regret of EXP3.R.

### Theorem 4

The proof is given in “Proof of Theorem 4” of Appendix 2.

In Corollary 4 we optimize parameters of the bound obtained in Theorem 4.

### Corollary 4

The proof is given in “Proof of Corollary 4” of Appendix 2.

The precision \(\epsilon \) of the test is parameterized by the constant *C*. As *T* increases, \(\sqrt{\frac{\log {\sqrt{\frac{KT}{\log T}}}}{\log T} \sqrt{\frac{K}{\log K}}}\) tends toward a constant.

## 5 Numerical experiments

We compare our algorithms with the state of the art. For each problem, \(K=20\) and \(T=10^7\). The instantaneous gap between the optimal arm and the others is constant, \({\varDelta }= 0.05\), i.e., the mean of the optimal arm is \(\mu ^*(t) = \mu (t)+ {\varDelta }\). During all experiments, the probabilities of failure of Successive Elimination (SE), SER3 and SER4 are set to \(\delta = 0.05\). The constant exploration of all algorithms of the EXP3 family is set to \(\gamma = 0.05\). Results are averaged over 50 runs. On problems 1 and 2 (Figs. 2 and 3), variances are low (on the order of \(10^3\)) and thus not shown. On problem 3 (Fig. 4), variances are plotted as the gray areas under the curves.

### 5.1 Problem 1: sinusoidal means

The index of the optimal arm \(k^*\) is drawn before the game and does not change. The mean of all suboptimal arms is \(\mu (t) = \cos (2\pi t / K)/5 + 0.5\).

This problem challenges SER3 against SE, UCB and EXP3. SER3 achieves a low cumulative regret, successfully eliminating suboptimal arms at the beginning of the run. In contrast, SE is tricked by the periodicity of the sinusoidal means and eliminates the optimal arm. The deterministic policy of UCB is not adapted to the non-stationarity of the rewards, and thus the algorithm suffers from a high regret. The unbiased estimators of EXP3 enable the algorithm to quickly converge on the best arm. However, EXP3 suffers from a linear regret due to its constant exploration until the end of the game.

### 5.2 Problem 2: decreasing means with positive gap

The optimal arm \(k^*\) does not change during the game. The mean of all suboptimal arms is \(\mu (t) = 0.95 - \min (0.45 , 10^{-7} t )\).

On this problem, SER3 is challenged against SE, UCB and EXP3. SER3 achieves a low cumulative regret, successfully eliminating suboptimal arms at the beginning of the run. Contrary to problem 1, mean rewards evolve slowly, and Successive Elimination (SE) achieves the same level of performance as SER3. Similarly to problem 1, UCB achieves a high cumulative regret. The cumulative regret of EXP3 is low at the end of the game but would still increase linearly with time.

### 5.3 Problem 3: decreasing means with arm switches

At every turn, the optimal arm \(k^*(t)\) changes with a probability of \(10^{-6}\). In expectation, there are 10 switches per run. The mean of all suboptimal arms is \(\mu (t) = 0.95 - \min (0.45 , 10^{-7} (t \bmod 10^6))\).

On problem 3, SER4 is challenged against SW–UCB, EXP3.S, EXP3.R and Meta-Eve. The probability of reset of SER4 is \(\varphi = 5^{-5}\). The size of the window of SW–UCB is \(10^5\). The history considered by EXP3.R is \(H=4\,\times \,10^5\), and the regularization parameter of EXP3.S is \(\alpha = 10^{-5}\).

SER4 obtains the lowest cumulative regret, confirming the ability of the random-reset approach to overcome switches of the best arm. SW–UCB suffers from the same issues as UCB in the previous problems and obtains a very high regret. Constant changes of means cause Meta-Eve to reset very frequently, yet it obtains a lower regret than SW–UCB. EXP3.S and EXP3.R both achieve low regrets, but EXP3.R suffers from the large history size needed to detect switches with a gap of \({\varDelta }\). We can notice that the randomization of the resets in SER4, while allowing the best performances on this problem, involves a higher variance. Indeed, on some runs, a reset may occur late after a best arm switch, whereas the use of windows or regularization parameters is more consistent.

## 6 Conclusion

We proposed a new formulation of the multi-armed bandit problem that generalizes the stationary stochastic, piecewise-stationary and adversarial bandit problems. This formulation allows managing difficult cases, where the mean rewards and/or the best arm may change at each turn of the game. We studied the benefit of *random shuffling* in the design of sequential elimination bandit algorithms. We showed that the use of *random shuffling* extends their range of application to a new class of best arm identification problems involving non-stationary distributions, while achieving the same level of guarantees as SE with stationary distributions. We introduced SER3 and extended it to the switching bandit problem with SER4 by adding a probability of restarting the best arm identification task. We extended the definition of the sample complexity to include switching policies. To the best of our knowledge, we proved the first sample complexity-based upper bound for the best arm identification problem with arm switches. The upper bound on the cumulative regret of SER4 depends only on the number \(N-1\) of arm switches, as opposed to the number of distribution changes \(M-1\) in SW–UCB (\(M\ge N\) can be of order *T* in our setting). The algorithm EXP3.R also achieves a competitive regret bound. The adversarial nature of EXP3 makes it robust to non-stationarity, and the detection test accelerates the switch when the optimal arm changes, while allowing convergence of the bandit algorithm during periods where the best arm does not change.

## Notes

### Acknowledgements

This work was supported by Team TAO (CNRS - Inria Saclay, Île de France - LRI), Team Profiling and Data-mining (Orange Labs).

### References

- 1. Allesiardo, R., Féraud, R.: EXP3 with drift detection for the switching bandit problem. In: 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA 2015)
- 2. Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analysis of the multiarmed bandit problem. Mach. Learn. **47**(2–3), 235–256 (2002a)
- 3. Auer, P., Cesa-Bianchi, N., Freund, Y., Schapire, R.E.: The nonstochastic multiarmed bandit problem. SIAM J. Comput. **32**(1), 48–77 (2002b)
- 4. Bubeck, S., Cesa-Bianchi, N.: Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning (2012). http://dblp.uni-trier.de/rec/bibtex/journals/ftml/BubeckC12
- 5. Bubeck, S., Slivkins, A.: The best of both worlds: stochastic and adversarial bandits. In: COLT 2012—the 25th Annual Conference on Learning Theory, pp. 42.1–42.23. Edinburgh, Scotland, 25–27 June 2012
- 6. Even-Dar, E., Mannor, S., Mansour, Y.: Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. J. Mach. Learn. Res. **7**, 1079–1105 (2006)
- 7. Féraud, R., Allesiardo, R., Urvoy, T., Clérot, F.: Random forest for the contextual bandit problem. In: AISTATS (2016)
- 8. Garivier, A., Moulines, E.: On upper-confidence bound policies for non-stationary bandit problems. In: Algorithmic Learning Theory, pp. 174–188 (2011). http://dblp.uni-trier.de/rec/bibtex/conf/alt/GarivierM11
- 9. Garivier, A., Kaufmann, E., Lattimore, T.: On explore-then-commit strategies. In: NIPS, vol. 30 (2016). http://dblp.uni-trier.de/rec/bibtex/conf/nips/GarivierLK16
- 10. Hartland, C., Baskiotis, N., Gelly, S., Teytaud, O., Sebag, M.: Multi-armed bandit, dynamic environments and meta-bandits. In: Online Trading of Exploration and Exploitation Workshop, NIPS (2006)
- 11. Hoeffding, W.: Probability inequalities for sums of bounded random variables. J. Am. Stat. Assoc. **58**(301), 13–30 (1963)
- 12. Kaufmann, E., Cappé, O., Garivier, A.: On the complexity of best arm identification in multi-armed bandit models. J. Mach. Learn. Res. **17**(1), 1–42 (2016)
- 13. Kocsis, L., Szepesvári, C.: Discounted UCB. In: 2nd PASCAL Challenges Workshop (2006)
- 14. Neu, G.: Explore no more: improved high-probability regret bounds for non-stochastic bandits. In: NIPS, pp. 3168–3176 (2015)
- 15. Seldin, Y., Slivkins, A.: One practical algorithm for both stochastic and adversarial bandits. In: 31st International Conference on Machine Learning (ICML) (2014)
- 16. Serfling, R.J.: Probability inequalities for the sum in sampling without replacement. Ann. Stat. **2**(1), 39–48 (1974)
- 17. Thompson, W.R.: On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika **25**, 285–294 (1933)
- 18. Yu, J.Y., Mannor, S.: Piecewise-stationary bandit problems with side observations. In: Proceedings of the 26th Annual International Conference on Machine Learning, ICML (2009)