Abstract
We present algorithms for solving multi-armed and linear contextual bandit tasks in the face of adversarial corruptions in the arm responses. Traditional algorithms for solving these problems assume that nothing but mild, e.g., i.i.d. sub-Gaussian, noise disrupts an otherwise clean estimate of the utility of the arm. This assumption and the resulting approaches can fail catastrophically if there is an observant adversary that corrupts even a small fraction of the responses generated when arms are pulled. To rectify this, we propose algorithms that use recent advances in robust statistical estimation to perform arm selection in polynomial time. Our algorithms are easy to implement and vastly outperform several existing UCB- and EXP-style algorithms for stochastic and adversarial multi-armed and linear contextual bandit problems in a wide variety of experimental settings. Our algorithms enjoy minimax-optimal regret bounds and can tolerate an adversary that is allowed to corrupt up to a universally constant fraction of the arms pulled by the algorithm.
Introduction
The recent years have witnessed a surge in the applications of online learning, especially explore-exploit techniques such as multi-armed bandits and linear contextual bandits, to online recommendation (Li et al. 2010), online advertising (Chakrabarti et al. 2008), web analytics (Tang et al. 2013), crowdsourcing (Padmanabhan et al. 2016), and even mobile health (Tewari and Murphy 2017). The result has been a diverse and rich literature, accompanied by a deep understanding of how these algorithms work on large-scale data. However, applying these techniques to real-world data throws up several unforeseen challenges, such as those of scale and data quality. In particular, when working with consumer/user data, it is inadvisable to assume that clean theoretical models of the data hold beyond a point. Some concrete examples are outlined below.
Click fraud via malware: malware present on user systems can be used to effectively sabotage an advertisement campaign run by a competitor by suppressing clicks on the ads pertaining to that campaign, causing a typical online advertising platform to reject those ads from consideration.
Fake reviews and ratings via automated bots: automated bots can alternatively be used to artificially boost products by posting fake reviews or simulating clicks on a compromised website, which can cause recommendation platforms to get tricked into giving those products more visibility.
Transient sociopolitical effects: for companies that employ celebrity brand ambassadors, actions taken by those ambassadors in their personal lives can often adversely affect brand popularity (Times, 2015) and cause a large number of users to post negative reviews or downgrade their ratings in a short period. This can adversely affect the functioning of recommendation systems, as well as the experience of users unconcerned with the event, in the short term.
Outlier behavior: not all data corruption need be malicious or even intended, but may nevertheless adversely affect the functioning of the decision making systems operating on that data. For example, in mobile health applications, temporary issues with the mobile device or mobile connectivity may cause the algorithm to conclude that a patient has become unresponsive and then target that patient more aggressively, which may adversely affect patient cooperation.
Multi-armed and linear contextual bandit algorithms are two of the most popular techniques in recommendation and advertising settings. If executed in the above settings with data corruption, these bandit algorithms will encounter corrupted arm rewards/responses and their performance may degrade.
Now, note that in all the settings mentioned above, the corruptions/aberrations to the data are sparse, and sometimes even transient. For example, it is reasonable to assume that only a fraction of clicks can be suppressed by malware or be synthesized by bots. Even in the mobile health and brand-ambassador examples, the effects of data corruption are transient, hence sparse when viewed as a fraction of long-term data. Thus, a direct solution to the problems mentioned above would be to make these bandit algorithms robust to sparse corruptions in arm responses.
The recent years have indeed seen a resurgence of interest in developing algorithms that are resilient to data corruption. We will review these shortly. These contemporary lines of work trace their origin at least half a century back to the area of robust statistics (Huber 1964; Tukey 1960; Maronna et al. 2006). However, recent works have focused more on developing robust algorithms that are scalable and efficient, whereas classical works usually paid scant attention to scalability.
In our work, we develop online learning algorithms for two settings, namely multi-armed and linear contextual bandit problems, that are tolerant to sparse corruptions in the arm responses that they receive. Our algorithms enjoy minimax-optimal regret bounds in the face of fully adaptive adversaries, and vastly outperform several existing approaches to both stochastic and adversarial multi-armed and linear contextual bandit problems in experiments. We believe our results come at an opportune moment, at a time when scalable robust algorithms, as well as online algorithms, are being actively investigated.
Organization
We address two bandit settings and present a total of three new algorithms. In Sect. 2, we give a brief overview of bandit literature, and discuss related work from three areas: adversarial bandits, robust algorithms, and heavy-tailed bandits. In Sect. 3, we introduce the notation used in the rest of the paper.
In Sect. 4, we discuss the multi-armed bandit (MAB) setting that is popular when the set of actions is small and fixed, e.g., in web analytics and mobile health. We introduce two algorithms, rUCB-MAB and rUCB-Tune, for this setting.
In Sect. 5, we discuss linear contextual bandits, a more general setting which allows arms to be parametrized, as well as the set of available arms to change from time step to time step. This is most applicable in online advertising and recommendation settings where the set of available ads/products may change across time. We introduce rUCB-Lin for this setting.
In Sect. 6, we perform extensive experimentation, comparing our proposed algorithms against stochastic bandit algorithms such as UCB, KL-UCB, UCB-V and many others, adversarial bandit algorithms such as EXP3 and SAO, and algorithms for heavy-tailed bandits from Medina and Yang (2016). We conclude with an overview of interesting directions for future work in Sect. 7.
Related works and our contributions
Literature on bandits is too vast to be surveyed here. Starting with the early work of Auer et al. (2002a) on multi-armed bandits (MAB), the field has seen progress in linear bandits (Abbasi-Yadkori et al. 2011), contextual bandits (Chu et al. 2011), as well as applications to recommendation (Li et al. 2010), advertising (Chakrabarti et al. 2008), web analytics (Tang et al. 2013), crowdsourcing (Padmanabhan et al. 2016), and mobile health (Tewari and Murphy 2017).
The three lines of work that relate most closely to ours are (1) those on adversarial bandits where arm rewards/responses need not be stochastic at all, (2) those on developing corruption-resilient learning and estimation algorithms, and (3) those on bandits that suffer heavy-tailed, albeit still stochastic and non-adversarial, noise (since these algorithms are also sometimes referred to as “robust”). We review all three lines of work below and clarify our contributions in context.
Adversarial bandits
Given the presence of an adversary in our setting, it is tempting to utilize algorithms designed to work with non-stochastic arm reward assignments. There does exist a large body of work on EXP-style algorithms starting with Auer et al. (2002b), namely EXP3 for multi-armed bandits and EXP4 for linear contextual bandits, as well as variants such as EXP3++ (Seldin and Slivkins 2014) and SAO (Bubeck and Slivkins 2012), that are indeed able to offer sublinear regret even if all (not just a fraction of) arm responses are chosen by an adversary.
This in itself is too pessimistic a view given that we have observed in Sect. 1 that in real-life settings, it is reasonable to expect only a fraction of the arm responses to be corrupted. Moreover, their attractive regret bounds notwithstanding, there is a price to pay for using EXP-style algorithms. Indeed, most recent works on adversarial bandits (Bubeck and Slivkins 2012; Lykouris et al. 2018; Seldin and Slivkins 2014) focus only on multi-armed bandits and not linear contextual bandits. This is possibly because EXP-style algorithms (such as EXP4) rapidly become infeasible to execute in practice for linear contextual bandits.
In contrast, we propose rUCB-Lin, a practical and efficient algorithm for linear contextual bandits that can tolerate adversarial corruptions. Moreover, we also experimentally compare to EXP3 and SAO in the MAB setting and show that our proposed algorithms rUCB-MAB and rUCB-Tune outperform them. We also note that from a theoretical standpoint, the regret bounds offered by EXP-style algorithms do not compare directly to the pseudo-regret style bounds prevalent for stochastic bandits that we provide for our algorithms.
The recent work of Lykouris et al. (2018) deserves special mention since it considers a problem setting similar to ours wherein the adversarial corruption is not rampant. Our work is independent and indeed, our algorithms and analyses differ significantly from those of Lykouris et al. Their work considers only multi-armed bandits whereas we consider multi-armed bandits as well as the more challenging case of linear contextual bandits. Indeed, arm elimination, the strategy adopted by Lykouris et al., cannot be reliably applied in contextual settings where the set of available “arms” may change arbitrarily from time step to time step. Moreover, in experiments, we find that rUCB-MAB and rUCB-Tune beat strategies such as SAO that also use a form of arm elimination.
From a theoretical standpoint, Lykouris et al. do not explicitly model the fraction of arm responses that are corrupted but instead consider the total amount of corruption introduced by the adversary during the entire online process, say \({\mathcal {C}}_\text {tot}\). Their regret bounds are of the form \({\mathcal {C}}_\text {tot}\cdot K\cdot \log ^2(KT)\cdot \sum _{i \ne i^*}\frac{1}{\varDelta _i}\) where K is the number of arms, \(i^*\) is the optimal arm, \(\varDelta _i\) is the suboptimality of arm i and T is the time horizon. Since we can have \({\mathcal {C}}_\text {tot} = \varOmega \left( {{T}}\right) \) if a constant fraction of responses are corrupted, it is undesirable for the regret bound to contain \({\mathcal {C}}_\text {tot}\) and the number of arms K as multiplicative factors.
In contrast, we explicitly model the fraction \(\eta \) of arm responses that are corrupted and offer regret bounds of the form (see Theorem 2) \(\bar{R}_T({\textsc {rUCB-MAB}}) \le \sum _{i \ne i^*}\frac{\log T}{\varDelta _i} + \eta \cdot B\cdot T\) where B is an upper bound on the corruption magnitudes. Note that the term \(\eta \cdot B\cdot T\) plays the same role as \({\mathcal {C}}_\text {tot}\) does for Lykouris et al. Also notice that in our bound, this term is completely independent of the number of arms and that our bound is additive in this term, not multiplicative.
The best of both worlds?
Given the wide gap between settings with stochastic arm responses and those with adversarial responses, there has been interest in developing algorithms that can seamlessly address both: offer a superior \(\log T\) regret bound if all arm responses are stochastic and regress to a more conservative \(\sqrt{T}\) bound if arm responses are adversarial. Existing works achieve this either by starting out optimistically assuming a stochastic setting and then switching to EXP-style policies upon detecting signs of adversarial behavior, e.g., SAO (Bubeck and Slivkins 2012), or else by carefully tuning EXP-style policies so as to offer \(\log T\) regret if arm responses are completely stochastic, e.g., EXP3++ (Seldin and Slivkins 2014).
Experimentally, we compare to both SAO and EXP3 and find that rUCB-MAB and rUCB-Tune outperform both. From a theoretical standpoint, we too can provide “best-of-both-worlds” style guarantees for rUCB-MAB and rUCB-Lin (see Theorems 2, 7). This is because our bounds for both multi-armed and linear contextual bandits gracefully upgrade to minimax-optimal bounds for stochastic bandits as the corruption rate \(\eta \) goes to zero; \(\eta = 0\) is the case when there is no malicious adversary and all rewards are truly stochastic. Thus, we are indeed able to recover the “best of the stochastic world”.
Moreover, we offer minimax-optimal regret bounds even if a bounded fraction of the arm responses are corrupted, thus offering the “best of the adversarial world” too. Our bounds cannot handle a totally rampant adversary that, for example, corrupts all the rewards, i.e., when \(\eta \rightarrow 1\). This is because our algorithms are robust versions of UCB, whereas “best-of-both-worlds” style results typically choose EXP3 as the base algorithm, a choice that has drawbacks as discussed earlier.
Robust learning and estimation algorithms
Robust algorithms have recently attracted a lot of attention in several areas of machine learning, signal processing, and algorithm design. Some prominent applications for which robust algorithms have been investigated are statistical estimation (Diakonikolas et al. 2018; Lai et al. 2016), optimization (Charikar et al. 2017), principal component analysis (Candès et al. 2009), regression (Bhatia et al. 2015; Chen et al. 2013; Nguyen and Tran 2013) and classification (Feng et al. 2014).
Our algorithms make novel use of recent advances in robust estimation techniques, viz., moment estimation (Lai et al. 2016) and linear regression (Bhatia et al. 2015). However, these adaptations are not immediate or trivial, especially for linear bandit settings where the proof progression of OFUL-style analyses has to be adapted in a novel way to accommodate the complex estimation steps carried out by robust linear regression algorithms.
Heavy-tailed bandits
There has been recent interest in developing bandit algorithms where the arm responses are samples from heavy-tailed distributions, such as the works of Bubeck et al. (2013), Medina and Yang (2016), Padmanabhan et al. (2016). A point of confusion may arise here since these algorithms are also sometimes referred to as “robust” algorithms. However, crucial differences exist in our problem setting that make these results inapplicable directly.
We note that in heavy-tailed settings, arm responses are still generated from a static distribution. However, in our problem setting, there is an adaptive adversary which need not follow any pre-declared distribution, heavy-tailed or otherwise, when introducing corruptions. For example, our experiments consider an adversary that flips the sign of the response of an arm to make that arm seem unnaturally good or bad. Heavy-tailed distributions cannot model such a sentient and malicious adversary and as such, existing analyses do not apply.
Thus, works on heavy-tailed bandits do not apply in our setting. We nevertheless experimentally compare to these algorithms and show that our proposed algorithm rUCB-Lin outperforms them. Moreover, our algorithms tolerate as much as a constant fraction of corrupted responses, e.g., \(\eta \cdot n\) out of a total of n responses for some constant \(\eta > 0\), whereas in heavy-tailed analyses, due to assumptions made on the arm distributions, often only a logarithmic number of the total responses, e.g., \(\log n\), come from “the tail”, a fact often exploited by these analyses.
Another work of interest is that of Gajane et al. (2018), which considers privacy-preserving bandit algorithms. To achieve privacy preservation, the algorithm transforms the arm responses using a known and invertible stochastic corruption process. However, there is no external malicious adversary in this process and the reward transformations are indeed known to the algorithm.
Notation
We will denote vectors using boldface lower case Latin or Greek letters, e.g., \({\mathbf {x}},{\mathbf {y}},{\mathbf {z}}\) and \(\varvec{\alpha },\varvec{\beta },\varvec{\gamma }\). The ith component of a vector \({\mathbf {x}}\) will be denoted as \({\mathbf {x}}_i\). Upper case Latin letters will be used to denote random variables and matrices, e.g., A, X, I.
[n] will denote the set of natural numbers \(\left\{ {1,2,\ldots ,n}\right\} \). We will use the shorthand \(\left\{ {v_i}\right\} _S\) to denote the set \(\left\{ {v_i: i \in S}\right\} \). In particular \(\left\{ {v_i}\right\} _{[n]}\) will denote the set \(\left\{ {v_1,\ldots ,v_n}\right\} \). \({\mathbb {I}}\left\{ {{\cdot }}\right\} \) will denote the indicator operator signaling the occurrence of an event, i.e., \({\mathbb {I}}\left\{ {{E}}\right\} = 1\) if event E takes place and \({\mathbb {I}}\left\{ {{E}}\right\} = 0\) otherwise. The expectation of a random variable X will be denoted by \({\mathbb {E}}\left[ {{X}}\right] \).
Given a matrix \(X \in {\mathbb {R}}^{d \times n}\) and any set \(S \subset [n]\), we let \(X_S := \left[ {{\mathbf {x}}_i}\right] _{i \in S} \in {\mathbb {R}}^{d \times \left| {S} \right| }\) denote the matrix whose columns correspond to entries in the set S. Also, for any vector \({\mathbf {v}}\in {\mathbb {R}}^n\) we use the notation \({\mathbf {v}}_S\) to denote the \(\left| {S} \right| \)-dimensional vector consisting of those components that are in S. We use the notation \(\lambda _{\min }(M)\) and \(\lambda _{\max }(M)\) to denote, respectively, the smallest and largest eigenvalues of a square symmetric matrix M.
Robust multi-armed bandits
In this section, we discuss the classical multi-armed bandit problem, introduce various adversary models and present the rUCB-MAB and rUCB-Tune algorithms.
Problem setting
The K-armed bandit problem is characterized by an ensemble of K distributions \(\varvec{\nu }= \left\{ {\nu _1,\ldots ,\nu _K}\right\} \) over reals, one corresponding to each arm, with corresponding means \(\varvec{\mu }= \left\{ {\mu _1,\ldots ,\mu _K}\right\} \in {\mathbb {R}}^K\). At each time step, the player selects and pulls an arm \(I_t \in [K]\) guided by some arm-selection strategy \(\pi \). In response, a reward \(r_t \in {\mathbb {R}}\) is generated (see below for details). Let \({\mathcal {H}}^t = \left\{ {I_1,r_1,\ldots ,I_{t-1},r_{t-1},I_t}\right\} \) denote the past history of the plays, \(i^*\in \arg \max _{i \in [K]} \mu _i\) denote an arm with the highest expected reward, \(\mu ^*= \mu _{i^*}\) denote the highest expected reward, \(\varDelta _i = \mu ^*- \mu _i\) denote the suboptimality of arm i, and \(\varDelta _{\min } := \min _{\varDelta _i > 0}\varDelta _i\) denote the suboptimality of the closest competitor to the best arm(s).
Adversary model
In the stochastic setting, after the player pulls the arm \(I_t\) at time t, the reward is generated (conditioned on \({\mathcal {H}}^t\)) from the distribution \(\nu _{I_t}\) so that \({\mathbb {E}}\left[ {r_t \mid {\mathcal {H}}^t}\right] = \mu _{I_t}\). Thus, in this “clean” setting, the reward obtained for an arm is always an unbiased estimate of its mean reward. Previous works such as those of Bubeck et al. (2013), Medina and Yang (2016) have studied settings where the distributions \(\nu _i\) are heavy-tailed. However, we are more interested in cases where occasionally, the reward that is generated for the played arm is not the one received by the player at all, for applications to click fraud and other settings.
Several adversary models are prevalent in the literature. To present the essential aspects of our methods, we choose a simple stochastic adversary model for the first discussion. We will consider a much more powerful fully adaptive adversary in the next section on linear contextual bandits. We note that although algorithms for heavy-tailed bandits can handle stochastic adversaries, we will be able to handle polynomially many corruptions and, as we point out later, we can modify our algorithms to handle adaptive adversaries in this setting as well.
Let \(\eta \) denote the corruption rate. A stochastic adversary closely follows the progress of the arm pulls and reward generation. At each time step t, after the algorithm has decided to pull an arm \(I_t\), the adversary first decides whether to corrupt this arm pull or not by performing a Bernoulli trial \(z_t \in \left\{ {0,1}\right\} \) with bias \(\eta \), i.e., if \({\mathcal {H}}^t = \left\{ {I_1,z_1,r_1,\ldots ,I_{t-1},z_{t-1},r_{t-1},I_t}\right\} \), then \({\mathbb {E}}\left[ {z_t \mid {\mathcal {H}}^t}\right] = \eta \). Then it generates a corruption \(\zeta _t\) arbitrarily but independent of \({\mathcal {H}}^t\). After this, the “clean reward” is generated in the classical manner satisfying \({\mathbb {E}}\left[ {r^*_t \mid {\mathcal {H}}^t}\right] = \mu _{I_t}\) and the reward received by the player is calculated as \(r_t = (1-z_t)\cdot r^*_t + z_t\cdot \zeta _t\).
Let B denote the largest magnitude of any corruption, i.e., \(\left| {\zeta _t} \right| \le B\). This bound B need not be known to the learner. Note that we allow the adversary to generate the corruption completely arbitrarily, and to do so after it is known which arm will be pulled. This allows the adversary to give different corruptions if it knows that the best arm \(I_t = i^*\) is being played than if a non-best arm is being played. We will later study more powerful adversarial models where the adversary can choose to corrupt the arm pull and even decide the corruption after the clean reward \(r^*_t\) has been generated, and even in a manner dependent on the complete history \({\mathcal {H}}^t\).
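As a concrete illustration, the reward-generation process of the stochastic adversary can be simulated as follows (a minimal sketch; the specific corruption value \(\zeta _t = -B\) is an illustrative choice of ours, not prescribed by the model, which allows \(\zeta _t\) to be arbitrary subject to \(\left| \zeta _t \right| \le B\)):

```python
import random

def pull_arm(mu, eta, B, rng):
    """One round of the stochastic adversary model: with probability eta
    the clean Gaussian reward is replaced by a corruption of magnitude
    at most B (here, the fixed value -B for illustration)."""
    clean = rng.gauss(mu, 1.0)          # clean reward r*_t with E[r*_t] = mu
    z = 1 if rng.random() < eta else 0  # Bernoulli corruption flag z_t, E[z_t] = eta
    zeta = -B if z else 0.0             # corruption zeta_t, |zeta_t| <= B
    return (1 - z) * clean + z * zeta   # r_t = (1 - z_t) r*_t + z_t zeta_t

rng = random.Random(0)
rewards = [pull_arm(mu=0.5, eta=0.1, B=5.0, rng=rng) for _ in range(10000)]
```

Note that the adversary here decides the corruption independently of the history, matching the stochastic model; the adaptive model discussed later would let `zeta` depend on the full history and on the realized clean reward.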
Notions of regret
In classical bandit learning, the goal of the algorithm is to minimize regret or, alternatively, maximize the cumulative reward \(\sum _{t=1}^Tr_t\) accumulated over the entire play of T rounds. However, in our corrupted setting, this may not be the most appropriate. To address this, we consider two notions of regret.
The first notion, which we simply refer to as Regret in this paper, captures how the expected cumulative reward actually received by the algorithm compares to the expected cumulative reward that it could have gotten had it only played the best arm again and again and had there been no adversary to corrupt those fictional arm pulls. We define this notion for an algorithm over a sequence of T plays as \(\bar{R}_T = T\cdot \mu ^*- {\mathbb {E}}\left[ {\sum _{t=1}^T r_t}\right] \).
However, one may complain that this notion of regret is unfair since it pits uncorrupted rewards of the best arm against the corrupted rewards of the arms that are played. To address this concern, we also look at the notion of Uncorrupted Regret, which is a fairer comparison since it compares the expected uncorrupted rewards of the arms played with those of the best arm: \(T\cdot \mu ^*- {\mathbb {E}}\left[ {\sum _{t=1}^T \mu _{I_t}}\right] \).
We note that this notion exactly corresponds to the popular notion of pseudo-regret, which looks at the expected performance of a single best arm in hindsight.
A minimax regret lower bound
The presence of an adversary (even a stochastic one) can make life difficult for a player. Indeed, consider a setting where \(\mu ^*>0\) and we have an adversary that, whenever allowed to, corrupts the reward to a default value of \(r_t = 0\). For this simple setting, even for the optimal policy that always plays \(I_t \equiv i^*\), the expected regret is still \(\bar{R}_T = \eta \mu ^*\cdot T\). The following result demonstrates this crisply by establishing a minimax regret lower bound for the stochastic adversary model.
Theorem 1
Let \(K > 1\) and \(T \ge K-1\). Then for any policy \(\pi \), and any constant \(c \in (0,1)\), there exists an MAB instance characterized by K distributions \(\varvec{\nu }= \left\{ {\nu _1,\ldots ,\nu _K}\right\} \), all of which are Gaussian with unit variance and means that lie in the interval [0, 1], i.e., \(\nu _i = {\mathcal {N}}(\mu _i,1)\) where \(\mu _i \in [0,1]\), and a stochastic adversary with corruption rate \(\eta \) such that
rUCB-MAB: a minimax-optimal robust algorithm for MAB
For any arm \(i \in [K]\), let \(I_i(t) := \left\{ {\tau < t: I_\tau = i}\right\} \) denote the set of past time steps when arm i was pulled, let \(T_i(t) := \left| {I_i(t)} \right| \) denote the number of times the arm was pulled in the past, let \(R_i(t) := \left\{ {r_\tau : \tau \in I_i(t)}\right\} \) denote the (possibly corrupted) rewards that were received by this arm so far, and let \(\tilde{\mu }_{i,t} := \text {median}(R_i(t))\) denote the median of these rewards.
The rUCB-MAB algorithm, described in Algorithm 3, builds upon the classic UCB algorithm by Auer et al. (2002a). At every step it computes an upper confidence estimate of the mean of every arm \(i \in [K]\) and pulls the arm with the highest estimate. However, it makes two crucial changes to the classical estimate.
Whereas UCB uses the mean and a simple agnostic variance term to construct its upper confidence bound, rUCB-MAB uses the median and a variance-aware estimate (notice the use of a variance upper bound \(\sigma _0\) in the algorithm) to construct its upper confidence bound. This helps overcome the confounding effects of the adversarial rewards that may be present in the sets \(R_i(t)\). We show that rUCB-MAB enjoys the following regret bound for Gaussian reward distributions.
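To make the arm-selection rule concrete, the following sketch implements a median-based UCB loop under the stochastic adversary model (the confidence-width constant and the corruption value are illustrative choices of ours, not the exact quantities used by rUCB-MAB):

```python
import math
import random
import statistics

def median_ucb(arm_means, sigma0, eta, T, rng, B=5.0):
    """Sketch of a median-based, variance-aware UCB rule: each arm's mean
    is estimated by the median of its observed (possibly corrupted)
    rewards, which tolerates a small corrupted fraction, and a width
    built from the variance upper bound sigma0 is added."""
    K = len(arm_means)
    rewards = [[] for _ in range(K)]    # R_i(t): observed rewards per arm
    pulls = [0] * K                     # T_i(t): pull counts per arm
    for t in range(1, T + 1):
        if t <= K:                      # initialization: play each arm once
            i = t - 1
        else:
            def ucb(i):
                width = sigma0 * math.sqrt(2.0 * math.log(t) / pulls[i])
                return statistics.median(rewards[i]) + width
            i = max(range(K), key=ucb)  # pull arm with highest estimate
        r = rng.gauss(arm_means[i], sigma0)
        if rng.random() < eta:          # stochastic adversary corrupts
            r = -B
        rewards[i].append(r)
        pulls[i] += 1
    return pulls

rng = random.Random(0)
pulls = median_ucb([0.9, 0.5, 0.1], sigma0=0.5, eta=0.05, T=2000, rng=rng)
```

Because the median of a sample shifts only slightly when a small fraction of it is replaced by outliers, the ranking of the arms' estimates survives the corruptions, and the best arm accumulates the vast majority of pulls.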
Theorem 2
When executed on a collection of K arms with Gaussian reward distributions \(\nu _i \equiv {\mathcal {N}}(\mu _i,\sigma _i)\) with \(\sigma _i \le \sigma _0\) and a stochastic adversary with a corruption rate \(\eta \le \frac{\varDelta _{\min }}{4e\sigma _0}\) and \(\left| {\zeta _t} \right| \le B\), the rUCB-MAB algorithm ensures a gap-dependent regret bound
as well as a gap-agnostic regret bound
for constants \(C,C'\) clarified in the proof. Moreover, in the stochastic setting with no adversary, i.e., \(\eta = 0\), we recover the following regret bounds
We note that for \(\eta = 0\) we indeed recover minimax-optimal regret bounds for stochastic bandits. Also note that if \(\eta = \varOmega (1)\), Theorem 1 rules out sublinear regret bounds for any algorithm and hence the linear regret offered by Theorem 2 is no surprise. However, it is also important to note that for small values of \(\eta \), such as \(\eta \approx \frac{1}{T^a}\) for \(a > 0\), which still allow as many as \(T^{1-a}\) samples to be corrupted, rUCB-MAB actually achieves sublinear regret \(T^{\max \left\{ {0.5,1-a}\right\} }\).
However, below we establish a much stronger, sublinear uncorrupted regret guarantee for rUCB-MAB. This shows that rUCB-MAB is able to identify the best arm after sublinearly many pulls and incurs vanishing regret thereafter.
Theorem 3
When executed on a collection of K arms with Gaussian reward distributions \(\nu _i \equiv {\mathcal {N}}(\mu _i,\sigma _i)\) with \(\sigma _i \le \sigma _0\) and a stochastic adversary with a corruption rate \(\eta \le \frac{\varDelta _{\min }}{4e\sigma _0}\), the rUCB-MAB algorithm ensures an uncorrupted regret bound
Improving the upper bound on \(\eta \): Theorem 2 requires the corruption rate to be bounded as \(\eta \le \frac{\varDelta _{\min }}{4e\sigma _0}\), which may be very stringent if \(\varDelta _{\min } = \min _{\varDelta _i > 0} \varDelta _i\) is very small. Although the need to assume such bounds on the corruption rate is very common in the robust learning and robust statistics literature (Bhatia et al. 2015; Diakonikolas et al. 2018) and represents the breakdown point of the algorithm, we can improve this upper bound on \(\eta \) to a problem-independent, universal constant.
To do so, a standard sieve is applied by separating arms that satisfy \(\varDelta _i > 4e\eta \sigma _0\) (for which Theorem 2 itself applies) from those that do not (for which \(\varDelta _i \le 4e\eta \sigma _0\)). The total regret due to the second set of arms cannot exceed \(4e\eta \sigma _0T\). Bounding the regret separately for these arms gives us the following regret bound, which puts a much milder requirement on \(\eta \).
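In symbols, the decomposition underlying this sieve reads (a sketch in our notation):

```latex
\sum_{i \ne i^*} \varDelta_i\,{\mathbb{E}}\left[T_i(T)\right]
  = \sum_{i\,:\,\varDelta_i > 4e\eta\sigma_0} \varDelta_i\,{\mathbb{E}}\left[T_i(T)\right]
  + \sum_{i\,:\,\varDelta_i \le 4e\eta\sigma_0} \varDelta_i\,{\mathbb{E}}\left[T_i(T)\right]
  \le \sum_{i\,:\,\varDelta_i > 4e\eta\sigma_0} \varDelta_i\,{\mathbb{E}}\left[T_i(T)\right]
  + 4e\eta\sigma_0\,T,
```

where the first sum is controlled by the argument of Theorem 2 and the second uses \(\sum _i T_i(T) \le T\) together with the fact that every arm in it has suboptimality at most \(4e\eta \sigma _0\).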
Corollary 1
If initialized with \(\sigma _0 = \max _i\sigma _i\), with the corruption rate satisfying \(\eta \le 1/4\), rUCB-MAB incurs a regret
We note that the constraint \(\eta \le 1/4\) involves a universal constant and is required to satisfy the requirements for the results of Lai et al. (2016) to hold. Note that even this new regret bound becomes sublinear if \(\eta = o(1)\), such as \(\eta = 1/\sqrt{T}\). We note that all the above results can be extended to several useful non-Gaussian, and indeed heavy-tailed, distributions including those studied by Bubeck et al. (2013). This is because Lai et al. (2016, Theorem 1.3) show that the median estimator, with some modifications, is able to recover the mean faithfully for general distributions with bounded fourth moments.
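The robustness of the median that underlies these results is easy to verify empirically (a quick sketch of ours; the corruption value, rate, and sample size are illustrative, and this is not the full estimator of Lai et al.):

```python
import random
import statistics

# Corrupt an eta-fraction of Gaussian samples with large outliers and
# compare the sample mean against the sample median as estimators of mu.
rng = random.Random(0)
mu, sigma, n, eta, B = 1.0, 1.0, 20000, 0.1, 100.0
samples = [rng.gauss(mu, sigma) for _ in range(n)]
corrupted = [(-B if rng.random() < eta else x) for x in samples]

mean_err = abs(statistics.fmean(corrupted) - mu)     # dragged far from mu
median_err = abs(statistics.median(corrupted) - mu)  # stays close to mu
```

The mean is shifted by roughly \(\eta B\) by the outliers, while the median moves only by a small amount proportional to \(\eta \sigma \), which is the behavior the regret analysis relies on.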
rUCB-Tune: robust tuned MABs
The rUCB-MAB algorithm assumes access to a uniform bound on the variances of the different arms. In their early work itself, Auer et al. (2002a) noticed that performing variance estimation can greatly boost the accuracy of the estimation procedure. This intuition was taken up by Audibert et al. (2007), who developed algorithms that automatically tune to the variance of the arms. We present one such “tuned” algorithm for the MAB setting with adversarial corruptions.
The robust estimates are not as straightforward in this case, as most variance estimates available in the literature are relative estimates, whereas the UCB framework works primarily with estimates that incur bounded additive error. To handle this, we propose a novel variance upper confidence bound algorithm rVUCB based on a robust variance estimation technique by Lai et al. (2016).
The rVUCB estimator turns out to be crucial for the regret bound to be established. For the sake of simplicity, we present the regret bound for Gaussian reward distributions but remind the reader that these results readily extend to several interesting families of non-Gaussian and heavy-tailed distributions with minor changes to the procedure. This is because the underlying result of Lai et al. (2016, Theorem 1.3) can be adapted to show that median-based mean and variance estimation techniques work for non-Gaussian, heavy-tailed distributions too.
Theorem 4
When executed on a collection of K arms with Gaussian reward distributions \(\nu _i \equiv {\mathcal {N}}(\mu _i,\sigma _i)\) and a stochastic adversary with a corruption rate \(\eta \le \frac{\varDelta _{\min }}{4e\sigma _i}\), the rUCB-Tune algorithm, when executed with a setting \(\eta _0 \ge \eta \), ensures a regret bound
for a constant C clarified in the proof.
Note that rUCB-Tune requires an estimate of an upper bound \(\eta _0\) on the corruption rate in order to operate. This can be done in practice via an (online) grid search. In our experiments, we did not find rUCB-Tune to be sensitive to an imprecise setting of \(\eta _0\). As before, we can introduce two improvements: show a truly sublinear uncorrupted regret bound for the rUCB-Tune algorithm, and remove the constraint \(\eta \le \frac{\varDelta _{\min }}{4e\sigma _i}\) on the corruption rate here as well.
Theorem 5
When executed on a collection of K arms with Gaussian reward distributions \(\nu _i \equiv {\mathcal {N}}(\mu _i,\sigma _i)\) and a stochastic adversary with a corruption rate \(\eta \le \frac{\varDelta _{\min }}{4e\sigma _i}\), the rUCB-Tune algorithm, when executed with a setting \(\eta _0 \ge \eta \), ensures an uncorrupted regret bound
Corollary 2
When executed on a collection of K arms with Gaussian reward distributions \(\nu _i \equiv {\mathcal {N}}(\mu _i,\sigma _i)\) and a stochastic adversary with a corruption rate \(\eta \le 1/4\), the rUCB-Tune algorithm, when executed with a setting \(\eta _0 \ge \eta \), ensures a regret bound
where \(\sigma _{\max } = \max _i \sigma _i\). Note that rUCBTune does not require knowledge of \(\sigma _{\max }\).
Before concluding, we note that rUCBMAB and rUCBTune can be made robust against stronger, adaptive adversaries that can decide their corruptions based on the entire history of the play rather than independently of it, by replacing the simple median-based estimators with the more involved, convex optimization-based estimators of Diakonikolas et al. (2016, 2018). However, these algorithms, as well as their analyses, are much more intricate, and we defer them to future work.
Robust linear contextual bandits
In this section, we discuss the linear contextual bandit problem under a much stronger adversary model and present the rUCBLin algorithm.
Problem setting
The stochastic linear contextual bandit framework (AbbasiYadkori et al. 2011; Li et al. 2010) extends the multi-armed setting to one where every arm \({\mathbf {a}}\) is parametrized by a vector \({\mathbf {a}}\in {\mathbb {R}}^d\) (abusing notation). However, the set of all arms is potentially infinite, and moreover, not all arms may be available at every time step.
At each time step t, the player receives a set of \(n_t\) arms (called contexts) \(A_t = \left\{ {{\mathbf {x}}^{t,1},\ldots ,{\mathbf {x}}^{t,n_t}}\right\} \subset {\mathbb {R}}^d\). These are the only arms that can be pulled in this round. A good example from the advertising world is a limited number of items that are available for display at the moment the user arrives at the website. Items that are not available cannot be displayed to the user at that time instant. The set, as well as the number \(n_t\), of available contexts can vary from time step to time step. The player selects and pulls an arm \(\hat{\mathbf {x}}^t \in A_t\) as per its arm selection policy. In response, a reward \(r_t\) is generated. Let \({\mathcal {H}}^t = \left\{ {A_1,\hat{\mathbf {x}}^1,r_1,\ldots ,A_{t-1},\hat{\mathbf {x}}^{t-1},r_{t-1},A_t,\hat{\mathbf {x}}^t}\right\} \).
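The interaction protocol above can be sketched as follows; this is a minimal simulation with a placeholder random policy, and all names and the noise scale are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 5, 100
w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)                   # ||w*||_2 <= 1

history = []
for t in range(T):
    n_t = int(rng.integers(2, 10))                 # number of contexts varies per round
    A_t = rng.normal(size=(n_t, d))
    A_t /= np.linalg.norm(A_t, axis=1, keepdims=True)  # ||x||_2 <= 1 for every context
    x_hat = A_t[rng.integers(n_t)]                 # placeholder policy: uniform choice
    r_t = w_star @ x_hat + rng.normal(scale=0.1)   # clean reward + centered noise
    history.append((A_t, x_hat, r_t))              # H^t accumulates (A_tau, x_hat^tau, r_tau)

assert len(history) == T
```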
Adversary model
In the stochastic linear bandit setting, the reward is generated using a model vector \({\mathbf {w}}^*\in {\mathbb {R}}^d\) (that is unknown to the algorithm) as follows: \(r_t = \left\langle {{\mathbf {w}}^*},{\hat{{\mathbf {x}}}^t}\right\rangle + \epsilon _t\), where \(\epsilon _t\) is a noise value that is typically assumed to be (conditionally) centered and \(\sigma \)-sub-Gaussian, i.e., \({\mathbb {E}}\left[ {\epsilon _t \mid {\mathcal {H}}^t}\right] = 0\) (centering), as well as, for some \(\sigma > 0\) and any \(\lambda > 0\), \({\mathbb {E}}\left[ {\exp (\lambda \epsilon _t) \mid {\mathcal {H}}^t}\right] \le \exp (\lambda ^2\sigma ^2/2)\) (sub-Gaussianity).
Here we consider an adaptive adversary that is able to observe the entire online process: at any time instant t, after observing the history \({\mathcal {H}}^t\) and the “clean” reward value, i.e., \(\left\langle {{\mathbf {w}}^*},{\hat{{\mathbf {x}}}^t}\right\rangle + \epsilon _t\), it is able to add a corruption value \(b_t\) to the reward. For notational uniformity, we will assume that \(b_t = 0\) at time instants where the adversary chooses to do nothing. Thus, the final reward given to the player at every time step is \(r_t = \left\langle {{\mathbf {w}}^*},{\hat{{\mathbf {x}}}^t}\right\rangle + \epsilon _t + b_t\). For the sake of simplicity, we will assume that, for some \(B > 0\), the final (possibly corrupted) reward presented to the player satisfies \(r_t \in [-B,B]\) almost surely.
Note that this is a much more powerful adversary than the stochastic adversary we looked at earlier. This adversary is allowed to look at previous rewards and arm pulls, as well as the currently pulled arm and its clean reward, before deciding whether to corrupt and, if so, by how much. There are no independence restrictions on this adversary. The only constraint we place is that at no point in the online process should the adversary have corrupted more than an \(\eta \) fraction of the observed rewards. Formally, let \(G_t = \left\{ {\tau < t: b_\tau = 0}\right\} \) and \(B_t = \left\{ {\tau < t: b_\tau \ne 0}\right\} \) denote the “good” and “bad” time instances. We insist that \(\left| {B_t} \right| \le \eta \cdot t\) for all t.
Notion of regret
The goal of the algorithm is to maximize the cumulative reward \(\sum _{t=1}^T r_t\) it receives over the time steps. However, this objective is more commonly cast in the form of cumulative pseudo-regret. At time t, let \({\mathbf {x}}^{t,*}= \arg \max _{{\mathbf {x}}\in A_t}\left\langle {{\mathbf {w}}^*},{{\mathbf {x}}}\right\rangle \) be the arm among the available contexts that yields the highest expected (uncorrupted) reward. The cumulative pseudo-regret of a policy \(\pi \) is defined as follows
Note that unlike the MAB case, the best arm here may change across time steps. For the sake of simplicity, we assume that \(\left\| {{\mathbf {w}}^*} \right\| _2 \le 1\), and \(\left\| {{\mathbf {x}}} \right\| _2 \le 1\) almost surely for all \({\mathbf {x}}\in A_t\) for all t. We postpone introducing and analysing a notion of uncorrupted regret, as we did for multi-armed bandits, to future work.
Note that the regret lower bound in Theorem 1 applies to the linear bandit setting as well due to a reduction of the MAB problem to the linear bandit problem (let \(d = K\) where K is the number of arms in the MAB problem, \({\mathbf {w}}^*_i = \mu _i\) and contexts \(A_t \subseteq \left\{ {{\mathbf {e}}_1,\ldots ,{\mathbf {e}}_d}\right\} \) where \({\mathbf {e}}_i\) are canonical vectors). Thus, any policy for linear bandits under an adversary must incur regret at least \(\varOmega \left( {{\eta \cdot T}}\right) \) which rules out sublinear regret bounds for robust linear bandits if \(\eta = \varOmega \left( {{1}}\right) \).
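The reduction sketched in parentheses above is mechanical; a minimal illustration (function name is ours):

```python
import numpy as np

def mab_to_linear(mu):
    """Embed a K-armed MAB instance as a linear bandit: set d = K, let the
    model vector satisfy w*_i = mu_i, and restrict the context sets to
    subsets of the canonical basis vectors e_1, ..., e_K."""
    w_star = np.asarray(mu, dtype=float)   # model vector w* with w*_i = mu_i
    contexts = np.eye(len(mu))             # rows are e_1, ..., e_K
    return w_star, contexts

mu = [0.3, 0.9, 0.5]
w_star, A = mab_to_linear(mu)
# pulling MAB arm i corresponds to pulling context e_i: <w*, e_i> = mu_i
assert all(abs(w_star @ A[i] - mu[i]) < 1e-12 for i in range(len(mu)))
```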
rUCBLin: a robust algorithm for linear contextual bandits
We use the notation \(\left\| {{\mathbf {x}}} \right\| _M = \sqrt{{\mathbf {x}}^\top M{\mathbf {x}}}\) for a vector \({\mathbf {x}}\in {\mathbb {R}}^d\) and a matrix \(M \in {\mathbb {R}}^{d \times d}\). The rUCBLin algorithm is described in Algorithm 5 and builds upon the OFUL algorithm (AbbasiYadkori et al. 2011) for linear contextual bandits. At every step, the algorithm computes an estimate \({\mathbf {w}}^t\) of the true model vector \({\mathbf {w}}^*\), as well as creates a confidence set to explicate the region of uncertainty. At prediction time, it uses the Optimism in the Face of Uncertainty principle to select an arm to pull.
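The optimistic selection step can be sketched as follows. This is a generic OFUL-style rule, not the exact rUCBLin procedure; the confidence radius \(\beta_t\) and the (pruned) Gram matrix \(M_t\) here are placeholders:

```python
import numpy as np

def select_arm(A_t, w_t, M_t, beta_t):
    """OFUL-style optimism: pick the available context x maximizing
    <w_t, x> + beta_t * ||x||_{M_t^{-1}}, i.e., the highest upper
    confidence bound on the expected reward."""
    M_inv = np.linalg.inv(M_t)
    scores = [w_t @ x + beta_t * np.sqrt(x @ M_inv @ x) for x in A_t]
    return int(np.argmax(scores))

rng = np.random.default_rng(3)
d = 4
A_t = rng.normal(size=(6, d))          # six available contexts this round
w_t = rng.normal(size=d)               # current model estimate
M_t = np.eye(d) + A_t.T @ A_t          # regularized Gram matrix (illustrative)
i = select_arm(A_t, w_t, M_t, beta_t=1.0)
assert 0 <= i < 6
```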
However, unlike OFUL that uses a simple ridge regression estimator for \({\mathbf {w}}^t\) and a direct ellipsoidal confidence set constructed using all arms pulled so far, rUCBLin needs to do a much more refined job. Neither can it use a simple estimator due to the adaptive adversarial corruptions, nor can it use all arms pulled so far in its confidence ball creation. We describe how to overcome these challenges below.
For model estimation, we chose the Torrent algorithm of Bhatia et al. (2015). Even though there are several approaches to robust regression (Chen et al. 2013; Nguyen and Tran 2013), we chose Torrent since it is simple to implement yet offers guarantees against an adaptive adversary. This method requires a technical condition called subset regularity to be satisfied which we will address shortly.
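A minimal sketch of the hard-thresholding loop at the heart of Torrent (the fully-corrective TORRENT-FC variant of Bhatia et al. 2015); the function name, iteration count, and demo constants are ours:

```python
import numpy as np

def torrent_fc(X, y, beta, iters=20):
    """Alternate (i) a least-squares fit on an active set with (ii) re-selecting
    the (1 - beta) fraction of points having the smallest residuals."""
    n = X.shape[0]
    keep = int(np.ceil((1 - beta) * n))
    S = np.arange(n)                       # start with all points active
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w, *_ = np.linalg.lstsq(X[S], y[S], rcond=None)
        resid = np.abs(y - X @ w)
        S = np.argsort(resid)[:keep]       # hard threshold: keep smallest residuals
    return w, S

# demo: 10% of responses grossly corrupted
rng = np.random.default_rng(1)
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.01 * rng.normal(size=n)
bad = rng.choice(n, size=20, replace=False)
y[bad] += 10.0                             # adversarial corruptions
w_hat, _ = torrent_fc(X, y, beta=0.15)
assert np.linalg.norm(w_hat - w_true) < 0.1
```

The corrupted points incur residuals far larger than the clean ones after even a rough fit, so the thresholding step rapidly expels them from the active set.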
Given the model estimate, rUCBLin performs a pruning step and constructs a confidence set, which, as we shall see, has a noise removal effect. It lets in previously pulled arms whose rewards were not corrupted but stops those which experienced severe corruptions. We note that step 8 in Algorithm 6, although inexpensive, was not found to greatly affect the performance of the algorithm. However, including this step makes our analysis much more convenient.
rUCBLin is extremely simple to implement and scales to large problems with ease. Extensions of rUCBLin to high dimensional settings where the model \({\mathbf {w}}^*\) is sparse are possible by using highdimensional variants of Torrent. However, we postpone these to future work. Before presenting the regret analysis, we first address the subset regularity condition required by Torrent.
Data hardness
Given the powerful adaptive adversary model in our setting, it would not be possible to make much headway unless we have some niceness in the problem structure given to us. More specifically, if the set of arms \(A_t\) that are supplied to us at each step is skewed (for instance, if they are chosen by the adversary as well), then we cannot hope to do much. To prevent this, we require the set of contexts to satisfy some regularity conditions. We note that there exist past works in linear bandit settings, such as those of Gentile et al. (2014, 2017), that do place restrictions on the context sets. The following notion of subset regularity succinctly captures the notion of a well-conditioned set of arms being presented during the course of the play. In the following, for \(n>0, \gamma \in (0,1]\), let \({\mathcal {S}}_\gamma = \left\{ {S \subset [n]: \left| S \right| = \gamma \cdot n}\right\} \) denote the set of all subsets of [n] of size \(\gamma \cdot n\).
Definition 1
(SSC and SSS properties Bhatia et al. 2015) A matrix \(X \in {\mathbb {R}}^{d\times n}\) satisfies the Subset Strong Convexity Property (resp. Subset Strong Smoothness Property) at level \(\gamma \) with strong convexity constant \(\lambda \) (resp. strong smoothness constant \(\varLambda \)) if we have:
Definition 2
(Subset regularity) A sequence of context sets \(A_1,A_2,\ldots ,A_T\) satisfies the \((\eta ,\left\{ {\lambda _t}\right\} _{[T]},\left\{ {\varLambda _t}\right\} _{[T]},T_0)\) subset regularity property if for some \(T_0 > 0\), for every \(t \ge T_0\), and every possible choice of \({\mathbf {x}}^\tau \in A_\tau \) for \(\tau = 1,\ldots ,t\), the matrix \([{\mathbf {x}}^1 {\mathbf {x}}^2 \ldots {\mathbf {x}}^t] \in {\mathbb {R}}^{d \times t}\) satisfies the SSC and SSS properties at level \(\eta \) with constants \(\lambda _t\) and \(\varLambda _t\) respectively.
Note that the \((1-\eta ,\left\{ {\lambda _t}\right\} _{[T]},\left\{ {\varLambda _t}\right\} _{[T]},T_0)\) subset regularity property helps ensure that after enough iterations, i.e., \(T_0\), have passed, at every time step \(t \ge T_0\), no matter which arms we have chosen so far, and no matter which of those arms have had their responses corrupted by the adversary (so long as only an \(\eta \) fraction of the total number of arms pulled so far have been corrupted), the matrix of arm vectors whose responses were not corrupted has bounded eigenvalues. Such a property is immensely helpful in performing robust regression in the face of an adaptive adversary. As Bhatia et al. comment, such a condition is in some sense necessary if there is no restriction on which arms the adversary may corrupt. Recall that the stochastic adversary in the previous section had less power, as the arms to corrupt were decided on the basis of a Bernoulli trial.
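For intuition, the SSC/SSS constants of Definition 1 can be computed by brute force for tiny matrices; this is purely illustrative (real instances are far too large for enumeration), and the normalization follows our reading of Bhatia et al. (2015):

```python
import itertools
import numpy as np

def ssc_sss(X, gamma):
    """Brute-force the subset strong convexity (lambda) and smoothness (Lambda)
    constants of X in R^{d x n} at level gamma: the min and max, over all
    subsets S of size gamma * n, of the extreme eigenvalues of X_S X_S^T."""
    d, n = X.shape
    k = int(gamma * n)
    lam, Lam = np.inf, -np.inf
    for S in itertools.combinations(range(n), k):
        G = X[:, list(S)] @ X[:, list(S)].T
        eig = np.linalg.eigvalsh(G)        # eigenvalues in ascending order
        lam = min(lam, eig[0])             # smallest eigenvalue over subsets
        Lam = max(Lam, eig[-1])            # largest eigenvalue over subsets
    return lam, Lam

rng = np.random.default_rng(2)
X = rng.normal(size=(3, 8))                # d = 3 arms vectors, n = 8 pulls
lam, Lam = ssc_sss(X, gamma=0.5)
assert lam <= Lam and Lam > 0
```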
Satisfying subset regularity One may worry whether a property such as subset regularity can be satisfied at all. However, it turns out that if the arm sets \(A_t\) are generated i.i.d. (conditioned on the history) from some sub-Gaussian distribution over \({\mathbb {R}}^d\), then the property is satisfied with high probability for a value \(T_0\) that has only a polylogarithmic dependence on T. To avoid notational clutter, we show this result below for the case when contexts are drawn from the standard multivariate Gaussian distribution, but stress that similar results hold for all sub-Gaussian distributions as well. Indeed, the reader may refer to the work of Bhatia et al. (2015) for proofs of such results in the batch setting, which can be extended to the online setting using the technique used to prove Lemma 1.
Lemma 1
For any \(\eta > 0\), and each round t, suppose the context vectors \(A_t = \left\{ {{\mathbf {x}}^{t,1}, \ldots , {\mathbf {x}}^{t,n_t}}\right\} \) are generated i.i.d. (conditioned on \(n_t\) and past history \({\mathcal {H}}^t\)) from the standard multivariate normal distribution \({\mathcal {N}}(\mathbf {0},I_{d\times d})\). Let \(n_t = {\mathcal O}\left( {{1}}\right) \) for all t. Then with probability at least \(1-\delta \), the sequence \(A_1,A_2,\ldots ,A_T\) satisfies the \((\eta ,\left\{ {\lambda _t}\right\} _{[T]},\left\{ {\varLambda _t}\right\} _{[T]},T_0)\) subset regularity property with \(T_0 \ge {\mathcal O}\left( {{\log ^2\left( {\frac{Td}{\delta }}\right) }}\right) \). Moreover, with the same confidence, we have \(\lambda _t \ge t/4 - {\mathcal O}\left( {{\log (T/\delta ) + \sqrt{T\log (T/\delta )}}}\right) \), as well as \(\varLambda _t \le t/4 + {\mathcal O}\left( {{\log (T/\delta ) + \sqrt{T\log (T/\delta )}}}\right) \).
We are now ready to prove the regret bound for rUCBLin. The proof hinges on a crucial confidence ellipsoid result which does not follow directly from existing works, e.g., that of AbbasiYadkori et al. (2011), since existing works never have to selectively throw away points due to them being corrupted. Since rUCBLin does perform such a pruning step, we have to prove this result afresh.
Theorem 6
For any \(\delta , \eta > 0\), if the sequence of context sets is generated such that it satisfies the two subset regularity properties \((\eta ,\left\{ {\lambda _t}\right\} _{[T]},\left\{ {\varLambda _t}\right\} _{[T]},T_0)\) and \((1-\eta ,\left\{ {\tilde{\lambda }_t}\right\} _{[T]},\left\{ {\tilde{\varLambda }_t}\right\} _{[T]},T_0)\) with \(\frac{\varLambda _t}{\tilde{\lambda }_t} \le \frac{1}{16}\) for all \(t \ge T_0\), then for all \(t \ge T_0\),
where \(M_t\) is obtained after the pruning step (see Algorithm 5, Steps 6-9).
The above result at first glance seems weaker than that for OFUL by AbbasiYadkori et al. (2011, Theorem 2), which offers a radius logarithmic in the horizon, \(\sqrt{d\log T}\), whereas Theorem 6 offers \(\sqrt{d\log T} + \eta \cdot T\). This is no accident: it simply reflects the fact that even an algorithm with complete knowledge of the model \({\mathbf {w}}^*\) cannot achieve sublinear regret, given the regret lower bound.
Theorem 6 formalizes this. Since corruptions abound, rUCBLin can never decrease the size of its confidence ball for fear of excluding \({\mathbf {w}}^*\). However, notice that for small values of \(\eta \approx 1/\sqrt{T}\), the radius of the ball used in Theorem 6 does shrink to \(\sqrt{d\log T} + \eta \cdot \sqrt{T}\), while still allowing \(\sqrt{T}\) corruptions. We now state a regret bound for rUCBLin.
Theorem 7
If the sequence of context sets is generated (conditionally) such that it satisfies the \((\eta ,\left\{ {\lambda _t}\right\} _{[T]},\left\{ {\varLambda _t}\right\} _{[T]},T_0)\) and \((1-\eta ,\left\{ {\tilde{\lambda }_t}\right\} _{[T]},\left\{ {\tilde{\varLambda }_t}\right\} _{[T]},T_0)\) subset regularity properties such that \(\frac{\varLambda _t}{\tilde{\lambda }_t} \le \frac{1}{16}\) for all \(t \ge T_0\), then rUCBLin ensures
for a constant C clarified in the proof. Moreover, in the stochastic setting with no adversary, i.e., \(\eta = 0\), rUCBLin ensures \({\mathbb {E}}\left[ {{\bar{R}_T(\textsc {rUCBLin})}}\right] \le C\cdot d\sqrt{T\log T}\).
Breakdown point analysis If we generate arms from a standard Gaussian distribution, then \(\frac{\varLambda _t}{\tilde{\lambda }_t} \le \frac{1}{16}\) can be ensured, for instance, when \(\eta < \frac{1}{100}\) (Bhatia et al. 2015). Also note that for small values of \(\eta \), such as \(\eta \approx \frac{1}{T^a}\) for \(a > 0\), which still allow as many as \(T^{1-a}\) samples to be corrupted, rUCBLin actually achieves sublinear regret \(T^{\max \left\{ {0.5,1-a}\right\} }\). We note that we have not attempted to optimize constants such as 1/100 in the above result. In practice, we find rUCBLin able to tolerate up to 10–15% of arm pulls being corrupted.
Experiments
We discuss the experimental design and results for rUCBMAB/rUCBTune and rUCBLin here. The experiments show that these algorithms are robust to corruptions and significantly outperform other UCBstyle algorithms.^{Footnote 1}
Robust multiarmed bandit experiments
We compare the empirical performance of rUCBMAB and rUCBTune against several algorithms for stochastic, adversarial, and “bestofbothworld” bandits.
Data For each arm i, the arm means were sampled as \(\mu _i \sim {\mathcal {U}}(0,1)\) and the arm variances as \(\sigma _i \sim {\mathcal {U}}(0,1)\). The arm rewards were sampled for each arm from \({\mathcal {N}}(\mu _i, \sigma _i)\). Experiments were run with the number of arms set to 100 and 10, and for 1100 and 11,000 iterations respectively.
Adversary The corruptions were generated by conducting Bernoulli trials with bias \(\eta \). If given a chance to corrupt an arm, our adversary offered a zero reward if the selected arm was the best arm and a corrupted reward of \(\frac{s}{\eta }\) if the selected arm was not the best arm. We used \(s=0.04\) to prevent the adversary from rewarding the bad arms too much and hence violating the goodness order of the arms. We note that while other adversary models are indeed possible, we believe the adversary model used here does not unfairly benefit any particular algorithm.
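The stochastic adversary used in these experiments can be sketched as follows (the value s = 0.04 is taken from the text; the function name is ours):

```python
import numpy as np

def corrupt_reward(reward, is_best_arm, eta, s=0.04, rng=np.random.default_rng()):
    """Bernoulli(eta) corruption: when triggered, zero out the best arm's
    reward, or hand a suboptimal arm an inflated reward of s / eta."""
    if rng.random() < eta:                 # adversary gets to corrupt this pull
        return 0.0 if is_best_arm else s / eta
    return reward                          # pull left untouched

# eta = 1 forces the adversary to corrupt every pull, for a deterministic demo
r_best = corrupt_reward(0.7, is_best_arm=True, eta=1.0)
assert r_best == 0.0
r_sub = corrupt_reward(0.7, is_best_arm=False, eta=1.0)
assert abs(r_sub - 0.04) < 1e-12
```

The small value of s keeps the corrupted reward s / eta from inverting the goodness order of the arms, as noted above.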
Algorithms We tested rUCBMAB and rUCBTune against a large number of Upper Confidence Bound algorithms popular in the literature, including KLUCB (Garivier and Cappé 2011), UCB1, UCB2, UCBNormal, UCBTuned (Auer et al. 2002a), and UCBV (Audibert et al. 2009). The last three algorithms estimate the variance of the arms, while UCBNormal is an algorithm specially designed for cases when the reward distributions are normal. We tuned the value of the \(\alpha \) parameter in UCB2 as suggested by Auer et al. (2002a) and found \(\alpha =0.14\) to work well. We also ran tests against the EXP3 and SAO algorithms (Bubeck and Slivkins 2012), which offer regret bounds in adversarial and best-of-both-worlds settings. We set a default value of \(\sigma _0=1\) as the upper bound on standard deviations for rUCBMAB.^{Footnote 2} For EXP3 we tuned the \(\gamma \) value and found it to be optimal at about 0.2. The variant of UCBV used was taken from the original work of Audibert et al. (2007), with the constants and exploration function as suggested by the authors. For finding the median in an online fashion, we used two heaps, which gave us \({\mathcal {O}}(\log n)\) time complexity for finding the median at each time step. This made the algorithm very efficient for extensive experiments.
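The two-heap scheme for online median maintenance mentioned above is a standard construction (not specific to this paper); a minimal sketch:

```python
import heapq

class RunningMedian:
    """Maintain the median of a stream in O(log n) per insertion using a
    max-heap (lo, values negated) for the lower half and a min-heap (hi)
    for the upper half."""
    def __init__(self):
        self.lo, self.hi = [], []   # invariant: len(lo) == len(hi) or len(lo) == len(hi) + 1

    def add(self, x):
        heapq.heappush(self.lo, -x)
        # move the largest element of lo into hi, then rebalance sizes
        heapq.heappush(self.hi, -heapq.heappop(self.lo))
        if len(self.hi) > len(self.lo):
            heapq.heappush(self.lo, -heapq.heappop(self.hi))

    def median(self):
        if len(self.lo) > len(self.hi):
            return -self.lo[0]
        return (-self.lo[0] + self.hi[0]) / 2

rm = RunningMedian()
for x in [5, 1, 9, 2, 7]:
    rm.add(x)
assert rm.median() == 5   # sorted stream is [1, 2, 5, 7, 9]
```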
Evaluation metric We compare the regret \(\bar{R}_T\) and uncorrupted regret \(\bar{R}^*_T\) for all algorithms. All results are averaged over 50 repetitions of the same experiment.
Results The results are shown in Figs. 1 and 2. We observe that while rUCBMAB performs poorly compared to UCB2 and UCBTuned for low values of the error rate, it quickly overtakes them as the error rate increases. On the other hand, rUCBTune enjoys much lower regret than all other algorithms as the number of iterations and the corruption rate increase. However, in the zero-corruption case, its performance is very closely followed by KLUCB. We credit this result to the fact that the exploration term estimates are typically lower for rUCBTune, which reduces performance for such a small number of arms. For the case of uncorrupted regret, the results are similar. As is evident in both graphs, the slope of regret versus iterations (or regret versus the corruption rate) decreases as we plot the uncorrupted rewards.
It is interesting to note that we outperform EXP3 and SAO in this setting, since neither is able to account for the fact that only a fraction of the arm pulls, not all of them, are corrupted by the adversary, and both end up choosing arms as though every pull were corrupted. Variance-estimating algorithms (UCBNormal, UCBTuned, UCBV, rUCBTune) perform better than those that do not estimate variance. Overall, rUCBMAB and rUCBTune work well even for high corruption rates with hundreds of arms, which is a setting of interest.
Robust linear contextual bandit experiments: comparison with LINUCB
We also compare the empirical performance of rUCBLin with LINUCB across error rates, the dimension of the context vectors, and the magnitude of corruption.
Data The true model vector \({\mathbf {w}}^*\in {\mathbb {R}}^d\) was chosen to be a random unit norm vector with \(d = 10\). The arms at each timestep were sampled as \({\mathbf {x}}^{t,i} \sim {\mathcal {N}}(0, I_d)\), and the reward for the selected arm was generated as \(y_i = \left\langle {{\mathbf {w}}^*},{{\mathbf {x}}_i}\right\rangle + \epsilon _i\) where \(\epsilon _i \sim {\mathcal {N}}(0,\sigma ^2)\). All experiments used \(n_t = 50\) arms being generated afresh at each time step, a corruption rate of \(\eta = 0.1\), \(d = 10\), and the scale of the corruptions to be \(c_t=10\), unless stated otherwise. All results reported are averaged over 50 repetitions of the same experiment.
Adversary The corruptions were generated as \(b_t = r^*_t  c_t\cdot \left\langle {{\mathbf {w}}^*},{{\mathbf {x}}^{t,*}}\right\rangle \), where \({\mathbf {x}}^{t,*}\) is the best possible arm and \(c_t\) is the magnitude of corruption. We note that while other adversary models are indeed possible, we believe the adversary model used here does not unfairly benefit any particular algorithm.
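A sketch of this corruption rule, implementing the formula exactly as stated above (the function name and demo values are ours):

```python
import numpy as np

def adversary_corruption(r_star_t, w_star, x_best, c_t=10.0):
    """Corruption rule from the experiments, as stated in the text:
    b_t = r*_t - c_t * <w*, x^{t,*}>, where x^{t,*} is the best available
    arm and c_t scales the corruption magnitude."""
    return r_star_t - c_t * (w_star @ x_best)

rng = np.random.default_rng(4)
d = 10
w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)            # random unit-norm model vector
x_best = rng.normal(size=d)
s = w_star @ x_best                          # clean signal <w*, x^{t,*}>
b_t = adversary_corruption(s, w_star, x_best)
# with c_t = 10, the corruption equals -9 times the clean signal,
# strongly pushing the observed reward in the wrong direction
assert abs(b_t - (-9.0) * s) < 1e-9
```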
Algorithms We compared rUCBLin to LINUCB (AbbasiYadkori et al. 2011) and used the TorrentFC implementation by Bhatia et al. (2015).
Evaluation metric We measured regret \(\bar{R}_T\) and uncorrupted regret \(\bar{R}^*_T\) over 1000 iterations.
Results Figure 3 shows that rUCBLin incurs much lower regret as compared to LINUCB as the corruption rate increases. While LINUCB has a slight edge in the case without corruptions, it quickly starts losing out to rUCBLin when the error rate increases. A more interesting result is in the case of uncorrupted regret. From the graph of uncorrupted regret plotted against time we can see the true gains rUCBLin has over LINUCB. While the LINUCB algorithm continues to incur linearly increasing uncorrupted regret with time, rUCBLin eventually converges to the best model vector. The ability of rUCBLin to retrospectively mark points as corrupted allows it to make increasingly better decisions as the number of iterations increases, since it can identify the correct model vector. LINUCB is not able to determine the correct model vector.
Robust linear bandit experiments: comparison with heavytailed methods
In this section, we compare the empirical performance of rUCBLin with the algorithms for heavy-tailed bandits proposed by Medina and Yang (2016):

crTrunc1 represents the ConfidenceRegion algorithm of Medina and Yang (2016) (Algorithm 1 therein) with the Truncation estimator defined in the paper, and parameter \(\alpha _t = \sqrt{t}\). We found no significant improvement in performance of crTrunc1 even upon carefully tuning the exponent of t in \(\alpha _t\).

crTrunc2 represents our alternate implementation of the same algorithm, which offers better empirical performance. While crTrunc1 has truncation levels that increase with time as \({\mathcal O}\left( {{\sqrt{t}}}\right) \), for crTrunc2 we fix the truncation level to a constant value \(\alpha = 20\), set equal to the largest magnitude any uncorrupted reward could take. This amounts to giving crTrunc2 an unfair advantage by revealing to it the optimal truncation level.

crMoM represents the MiniBatch Confidence Region algorithm of Medina and Yang (2016) (Algorithm 3 therein), which uses the median-of-means estimator defined in the paper. We ran this algorithm with \(\delta = 0.1\) and \(r = 10 \approx T^{1/3}\).
Executing the crMoM algorithm requires a modification to the experimental setup. Recall that our algorithms are presented with a set of available arms (contexts) at each step, and only those arms can be pulled. However, the crMoM algorithm needs to pull the same arm repeatedly in order to take the median of means of the observed pulls. To satisfy this need, we ensured that the context set stayed constant across all time steps, i.e., the same set of arms was available for pulls at all steps, which allowed crMoM repeated pulls of the same arm. Thus, whereas the experimental setup remains the same as Sect. 6.2 for Fig. 4a, b, the change for Fig. 4c, d is that we do not change the set of arms at each time step, with the rest of the experimental setting the same as Sect. 6.2.
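The median-of-means estimator at the heart of crMoM can be sketched generically as follows (the exact batching in Medina and Yang (2016) differs in details; the demo data are ours):

```python
import numpy as np

def median_of_means(samples, r):
    """Split the samples into r mini-batches, average each batch, and return
    the median of the batch means; a few gross outliers can spoil at most a
    few batches, leaving the median of the means intact."""
    samples = np.asarray(samples, dtype=float)
    batches = np.array_split(samples, r)
    return float(np.median([b.mean() for b in batches]))

rng = np.random.default_rng(5)
clean = rng.normal(loc=1.0, scale=0.1, size=97)
samples = np.concatenate([clean, [100.0, 100.0, 100.0]])   # 3 gross outliers
est = median_of_means(rng.permutation(samples), r=10)
# the 3 outliers can corrupt at most 3 of the 10 batch means,
# so the median stays close to the true mean of 1.0
assert abs(est - 1.0) < 0.5
```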
Results In Fig. 4a, b, we observe that rUCBLin maintains its lead. Both crTrunc1 and crTrunc2 are unable to discern the true model vector (as evidenced by their uncorrupted regret \(\bar{R}^*_T\) increasing linearly with time). Figure 4c, d similarly showcases rUCBLin maintaining its lead. However, given enough iterations, crMoM is able to recover the true model vector, despite performing poorly in the cold-start region. This is because crMoM needs to collect repeated pulls of arms in order to discern the true rewards from the corrupted rewards set by the adversary. This leads to poor performance in the beginning, but it does eventually converge to the true model vector.
Discussion and future work
In this work, we presented three algorithms, rUCBMAB, rUCBTune, and rUCBLin, to address the task of corruption-tolerant bandit learning in the multi-armed and linear-contextual settings. All our algorithms are extremely scalable and easy to implement, enjoy crisp and tight regret bounds, and show superior performance to a wide range of competitor methods in experiments.
Using more powerful estimators, e.g., those by Diakonikolas et al. (2016, 2018) within rUCBMAB and rUCBTune should offer stronger results, albeit at the cost of making the algorithms more expensive. Extending the analysis for rUCBMAB to nonGaussian distributions and deriving high probability regret bounds [as Lykouris et al. (2018) do] would be interesting. For rUCBLin, extending the algorithm to highdimensional settings as well as deriving sublinear uncorrupted regret bounds by making additional assumptions on the corruption rate \(\eta \) (as we did in Theorem 3 for rUCBMAB) would be useful.
From an applications standpoint, it is of interest to apply rUCBMAB and rUCBLin to recommendation settings. As our experiments indicate, these algorithms tend to outperform existing methods not only when corruptions abound, but also when there is no adversary present. This may put rUCBLin in an advantageous position wherein it is able to neglect non-adversarial variations in user behavior to capture the core user profile. The applications to settings where we suspect click fraud or other malicious behavior are, of course, immediate.
Notes
Code and datasets for our experiments are available at https://github.com/purushottamkar/rUCB.
This value can be further improved by tuning the parameter.
References
AbbasiYadkori, Y., Pal, D., & Szepesvari, C. (2011). Improved algorithms for linear stochastic bandits. In Proceedings of the 25th annual conference on neural information processing systems (NIPS).
Audibert, J.Y., Munos, R., & Szepesvári, C. (2007). Tuning bandit algorithms in stochastic environments. In Proceedings of the 18th international conference on algorithmic learning theory (ALT).
Audibert, J.Y., Munos, R., & Szepesvári, C. (2009). Explorationexploitation tradeoff using variance estimates in multiarmed bandits. Theoretical Computer Science, 410(19), 1876–1902.
Auer, P., CesaBianchi, N., & Fischer, P. (2002a). Finitetime analysis of the multiarmed bandit problem. Machine Learning, 47, 235–256.
Auer, P., CesaBianchi, N., Freund, Y., & Schapire, R. (2002b). The nonstochastic multiarmed bandit problem. SIAM Journal of Computing, 31(1), 48–77.
Bhatia, K., Jain, P., & Kar, P. (2015). Robust regression via hard thresholding. In Proceedings of the 29th annual conference on neural information processing systems (NIPS).
Bubeck, S., & Slivkins, A. (2012). The best of both worlds: Stochastic and adversarial bandits. In Proceedings of the 25th annual conference on learning theory (COLT).
Bubeck, S., CesaBianchi, N., & Lugosi, G. (2013). Bandits with heavy tail. IEEE Transaction on Information Theory, 59(11), 7711–7717.
Candès, E. J., Li, X., & Wright, J. (2009). Robust principal component analysis? Journal of the ACM, 58(1), 1–37.
Chakrabarti, D., Kumar, R., Radlinski, F., & Upfal, E. (2008). Mortal multiarmed bandits. In Proceedings of the 21st international conference on neural information processing systems (NIPS).
Charikar, M., Steinhardt, J., & Valiant, G. (2017). Learning from untrusted data. In Proceedings of the 49th annual ACM SIGACT symposium on theory of computing (STOC) (pp. 47–60).
Chen, Y., Caramanis, C., & Mannor, S. (2013). Robust sparse regression under adversarial corruption. In Proceedings of the 30th international conference on machine learning (ICML).
Chu, W., Li, L., Reyzin, L., & Schapire, R. (2011). Contextual bandits with linear payoff functions. In Proceedings of the 14th international conference on artificial intelligence and statistics (AISTATS).
Diakonikolas, I., Kamath, G., Kane, D. M., Li, J., Moitra, A., & Stewart, A. (2016). Robust estimators in high dimensions without the computational intractability. In Proceedings of the 57th IEEE annual symposium on foundations of computer science (FOCS).
Diakonikolas, I., Kamath, G., Kane, D. M., Li, J., Moitra, A., & Stewart, A. (2018). Robustly learning a gaussian: Getting optimal error, efficiently. In Proceedings of the twentyninth annual acmsiam symposium on discrete algorithms (SODA) (pp. 2683–2702).
Feng, J., Xu, H., Mannor, S., & Yan, S. (2014). Robust logistic regression and classification. In Proceedings of the 28th annual conference on neural information processing systems (NIPS).
Gajane, P., Urvoy, T., & Kaufmann, E. (2018). Corrupt bandits for preserving local privacy. In Proceedings of the 29th international conference on algorithmic learning theory (ALT).
Garivier, A., & Cappé, O. (2011). The KLUCB algorithm for bounded stochastic bandits and beyond. In Proceedings of the 24th annual conference on learning theory (COLT).
Gentile, C., Li, S., Kar, P., Karatzoglou, A., Zappella, G., & Etrue, E. (2017). On contextdependent clustering of bandits. In Proceedings of the 34th international conference on machine learning (ICML).
Gentile, C., Li, S., & Zappella, G. (2014). Online clustering of bandits. In Proceedings of the 31st international conference on machine learning (ICML).
Huber, P. J. (1964). Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1), 73–101.
Lai, K. A., Rao, A. B., & Vempala, S. (2016). Agnostic estimation of mean and covariance. In Proceedings of the 57th IEEE annual symposium on foundations of computer science (FOCS).
Li, L., Chu, W., Langford, J., & Schapire, R. (2010). A contextualbandit approach to personalized news article recommendation. In Proceedings of the 19th international world wide web conference (WWW).
Lykouris, T., Mirrokni, V., & Leme, R. P. (2018). Stochastic bandits robust to adversarial corruptions. In Proceedings of the 50th annual ACM SIGACT symposium on theory of computing (STOC) (pp. 114–122).
Maronna, R. A., Martin, R. D., & Yohai, V. J. (2006). Robust statistics: Theory and methods. New York: Wiley.
Medina, A. M., & Yang, S. (2016). Noregret algorithms for heavytailed linear bandits. In Proceedings of the 33rd international conference on machine learning (ICML).
Nguyen, N. H., & Tran, T. D. (2013). Exact recoverability from dense corrupted observations via \(\ell _1\)minimization. IEEE Transactions on Information Theory, 59(4), 2017–2035.
Padmanabhan, D., Bhat, S., Garg, D., Shevade, S. K., & Narahari, Y. (2016). A robust UCB scheme for active learning in regression from strategic crowds. In Proceedings of the international joint conference on neural networks (IJCNN).
Seldin, Y., & Slivkins, A. (2014). One practical algorithm for both stochastic and adversarial bandits. In Proceedings of the 31st international conference on machine learning (ICML).
Tang, L., Rosales, R., Singh, A. P., & Agarwal, D. (2013). Automatic Ad format selection via contextual bandits. In Proceedings of the 22nd ACM international conference on information and knowledge management (CIKM).
Tewari, A., & Murphy, S. A. (2017). From Ads to interventions: Contextual bandits in mobile health. In Mobile health (pp. 495–517). New York: Springer.
The Hindustan Times (2015, November 24). #AppWapsi: Snapdeal gets blowback from Aamir Khan controversy. https://www.hindustantimes.com/india/appwapsisnapdealgetsblowbackfromaamirkhancontroversy/storyN3HwOObJ0WMe9vz7GjXFBO.html. Accessed July 15, 2018.
Tsybakov, A. B. (2009). Introduction to nonparametric estimation. New York: Springer.
Tukey, J. W. (1960). A survey of sampling from contaminated distributions. Contributions to Probability and Statistics, 2, 448–485.
Acknowledgements
The authors would like to thank the reviewers and editors for pointing out several relevant works, as well as helping improve the presentation of the paper. S.K. is supported by the National Talent Search Scheme under the National Council of Educational Research and Training (Ref. No. 41/X/2013-NTS). K.K.P. thanks Honda Motor India Pvt. Ltd. for an award under the 2017 YES Award program. P.K. is supported by the Deep Singh and Daljeet Kaur Faculty Fellowship and the Research-I Foundation at IIT Kanpur, and thanks Microsoft Research India and Tower Research for research grants.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Editors: Jesse Davis, Elisa Fromont, Derek Greene, and Bjorn Bringmann.
Appendices
A Proofs from Sect. 4
Proof of Theorem 1
Fix a policy \(\pi \) and let the reward distributions be Gaussians with unit variance \({\mathbf {u}}_i = {\mathcal {N}}(\mu _i,1)\). Let \(\varDelta > 0\) be a constant to be determined later. Given a constant \(c \in (0,1)\), consider two settings, one where the vector of the arm means is \(\varvec{\mu }= \left\{ {c+\varDelta ,c,c,\ldots ,c}\right\} \in {\mathbb {R}}^K\) for the K arms and the other where the arm means are \(\varvec{\mu }' = \varvec{\mu }+ 2\varDelta \cdot {\mathbf {e}}_j\) where \({\mathbf {e}}_j = (0,\ldots ,0,1,0,\ldots ,0)\in {\mathbb {R}}^K\) is the jth canonical vector. The coordinate j will be decided momentarily.
Clearly, in the first setting, the first arm is the best and in the second setting the jth arm is the best. In both settings, the adversary acts simply by assigning a (corrupted) reward of 0 whenever it gets a chance to corrupt an arm pull. Clearly such an adversary is a stochastic adversary.
Let \(T_i(T,\pi )\) denote the number of times the player obeying a policy \(\pi \) pulls the ith arm in a sequence of T trials. Also, for any \(\varvec{\mu }\in {\mathbb {R}}^K\), policy \(\pi \) and \(T > 0\), define \({\mathbb {P}}_{\varvec{\mu },\pi ,\eta ,T}\) to be the distribution induced on the history \({\mathcal {H}}^{T}\) by the action of policy \(\pi \) on the arms with mean rewards as given by the vector \(\varvec{\mu }\) and the adversary described above with corruption rate \(\eta \) (a cleaner construction of the distribution \({\mathbb {P}}_{\varvec{\mu },\pi ,\eta ,T}\) is possible by properly defining filtrations but we avoid that to keep the discussion focused).
Also let \({\mathbb {E}}_{\varvec{\mu },\pi ,\eta ,T}\) denote expectations taken with respect to \({\mathbb {P}}_{\varvec{\mu },\pi ,\eta ,T}\) and let \(\bar{R}_T(\pi ,\varvec{\mu },\eta )\) denote the expected regret with respect to the same. Also define \(j = \mathop{\arg \min }_{i \ne 1}\ {\mathbb {E}}_{\varvec{\mu },\pi ,\eta ,T}\left[ {T_i(T,\pi )}\right] \)
and use this to define \(\varvec{\mu }' = \varvec{\mu }+ 2\varDelta \cdot {\mathbf {e}}_j\). Note that j is the suboptimal arm in the first setting that is least likely to be played by the policy \(\pi \) when interacting with the arms with means \(\varvec{\mu }\) and the adversary. Given the above, it is easy to see that since
we have
We now apply Pinsker's inequality (Tsybakov 2009, Lemma 2.6) to get
where KL stands for the KullbackLeibler divergence. Now, applying straightforward manipulations we can get
Now, using the fact that \(KL({\mathcal {N}}(c,1),{\mathcal {N}}(c+2\varDelta ,1)) = 2\varDelta ^2\), applying an averaging argument to get \({\mathbb {E}}_{\varvec{\mu },\pi ,\eta ,T}[T_j(T,\pi )] \le \frac{T}{K-1}\), setting \(\varDelta = \sqrt{(K-1)/4T}\), and using the sum of the two inequalities in (1) shows that
which, by an application of another averaging argument, tells us that for at least one setting \(\tilde{\varvec{\mu }} \in \left\{ {\varvec{\mu },\varvec{\mu }'}\right\} \), we must have
which finishes the proof. \(\square \)
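For reference, the KL term used above follows from the standard closed form for equal-variance Gaussians; since the two settings differ by \(2\varDelta \) in the mean of arm j,

```latex
KL\big({\mathcal{N}}(\mu_1,\sigma^2)\,\big\|\,{\mathcal{N}}(\mu_2,\sigma^2)\big)
  = \frac{(\mu_1-\mu_2)^2}{2\sigma^2}
\quad\Longrightarrow\quad
KL\big({\mathcal{N}}(c,1)\,\big\|\,{\mathcal{N}}(c+2\varDelta,1)\big)
  = \frac{(2\varDelta)^2}{2} = 2\varDelta^2 .
```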
Proof of Theorem 2
First of all, note that step 4 in Algorithm 3 can be seen as executing the strategy
The only difference between the above expression and the one used by Algorithm 3 is an additive term \(e\eta \sigma _0\), which does not change the output of the \(\arg \max \) operation. We next note that the corruption model considered by Lai et al. (2016) is exactly the stochastic corruption model, and that in the unidimensional case, the AgnosticMean algorithm presented by Lai et al. (2016, Algorithm 3) is simply the median estimator. Given this, at every time step t, Lai et al. (2016, Theorem 1.1) guarantee that with probability at least \(1 - \frac{4}{t^2}\)
Now suppose we have played an arm \(i \ne i^*\) sufficiently many times to ensure \(T_i(t) \ge \frac{16e^2\sigma _0^2\log T}{\varDelta _i^2}\); then we have the following chain of inequalities
where the first and fourth steps follow from (2), the second step follows from the definitions, the third step uses the fact that \(T_i(t)\) is large enough and \(\eta _0 \le \frac{\varDelta _i}{4e\sigma _0}\), and the final step uses the fact that \(\sigma _{i^*} \le \sigma _0\) by construction.
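For concreteness, the guarantee invoked as (2) plausibly takes the following shape. This is our reconstruction from the constants \(16e^2\sigma _0^2\) and \(\varDelta _i/(4e\sigma _0)\) appearing above, not the paper's verbatim statement:

```latex
\left|\,\hat{\mu}_i(t) - \mu_i\,\right|
  \;\le\; e\,\sigma_0\left(\eta + \sqrt{\frac{\log t}{T_i(t)}}\right)
  \qquad\text{with probability at least } 1 - \frac{4}{t^2}.
```

Under such a bound, both terms are at most \(\varDelta _i/4\) once \(T_i(t) \ge 16e^2\sigma _0^2\log T/\varDelta _i^2\) and \(\eta \le \varDelta _i/(4e\sigma _0)\), which is exactly how the chain of inequalities proceeds.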
The above shows that once an arm is pulled sufficiently many times, it will never appear as the highest upper bound estimate in the rUCB-MAB algorithm and hence will never get pulled again. This allows us to bound, using a standard proof technique, the expected number of times each arm is pulled, as follows
where in the first step, we use the fact that initially, each arm gets played once in a round-robin fashion in step 1 of Algorithm 3. We now have the standard regret decomposition \(\bar{R}_T = \sum _{i \ne i^*} \varDelta _i \cdot {\mathbb {E}}\left[ {T_i(T)}\right] \).
Combining with the previous bound on \({\mathbb {E}}\left[ {{T_i(t)}}\right] \) and using \(\eta > 0\) gives us the gap-dependent regret bound
To convert to the gap-agnostic form claimed in Theorem 2, we simply use the Cauchy-Schwarz inequality as follows
which establishes the result. \(\square \)
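The Cauchy-Schwarz step can be written out as follows; this is a sketch that suppresses the additive constant and \(\eta \)-dependent terms of the gap-dependent bound:

```latex
\bar{R}_T
  = \sum_{i \ne i^*} \varDelta_i\,{\mathbb{E}}[T_i(T)]
  \;\le\; \sqrt{\Big(\sum_{i \ne i^*} \varDelta_i^2\,{\mathbb{E}}[T_i(T)]\Big)
                \Big(\sum_{i \ne i^*} {\mathbb{E}}[T_i(T)]\Big)}
  \;\le\; \sqrt{{\mathcal{O}}\big(K\sigma_0^2\log T\big)\cdot T}
  \;=\; {\mathcal{O}}\big(\sigma_0\sqrt{KT\log T}\big),
```

using \(\varDelta _i^2\,{\mathbb {E}}[T_i(T)] \le {\mathcal O}\left( {\sigma _0^2\log T}\right) \) from the arm-pull bound and \(\sum _i {\mathbb {E}}[T_i(T)] \le T\).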
Proof
(Sketch of Theorem 3) Notice that the proof of Theorem 2 shows that once a suboptimal arm is pulled sufficiently many times, it will never appear as the highest upper bound estimate in the rUCB-MAB algorithm and hence will never get pulled again. From here on, the standard analysis applies.
Notice that this result relies on the assumption that the corruption rate is bounded \(\eta \le \frac{\varDelta _{\min }}{4e\sigma _0}\). \(\square \)
Proof of Corollary 1
The proof of Theorem 2 assures us that for arms that satisfy \(\varDelta _i > 4e\sigma _0\eta _0\) we have
The total contribution to the regret due to these arms is already bounded by Theorem 2 as
For arms that do not satisfy the above condition, i.e., for whom we have \(\varDelta _i \le 4e\sigma _0\eta _0\), the above does not apply. However, notice that the total contribution to the regret due to these arms can be at most \(\sum _{i:\, \varDelta _i \le 4e\sigma _0\eta _0} \varDelta _i \cdot T_i(T) \le 4e\sigma _0\eta _0 \cdot T\)
since we must have \(\sum _{i: \varDelta _i \le 4e\sigma _0\eta _0} T_i(T) \le T\). Combining the two results gives us the claimed bound. Notice that no assumptions are made regarding \(\varDelta _{\min }\) in this proof.\(\square \)
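Putting the two contributions together, the combined bound has the following shape (a sketch; constants as in Theorem 2):

```latex
\bar{R}_T
  \;\le\; \underbrace{{\mathcal{O}}\big(\sigma_0\sqrt{KT\log T}\big)}_{\text{arms with } \varDelta_i > 4e\sigma_0\eta_0}
  \;+\; \underbrace{4e\sigma_0\eta_0 \cdot T}_{\text{arms with } \varDelta_i \le 4e\sigma_0\eta_0},
```

since every pull of a small-gap arm incurs instantaneous regret at most \(4e\sigma _0\eta _0\).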
Proof of Theorem 4
In this case, we notice that in the unidimensional case, the CovarianceEstimation algorithm proposed by Lai et al. (2016, Algorithm 4) is simply Step 1 and Step 2 of the rVUCB algorithm (see Algorithm 2). Given this, at every time step t, Lai et al. (2016, Theorem 1.5) guarantee that with probability at least \(1 - \frac{4}{t^2}\)
for some constant D, which establishes, with probability at least \(1 - \frac{4}{t^2}\), that
where \(c = D\left( {\eta ^{1/2} + \left( {\eta + \sqrt{\frac{\log t}{T_i(t)}}}\right) ^{3/4}}\right) \). To avoid a divide-by-zero error, we set a maximum bound of \(2\eta \) on c and assume that \(\eta < 1/2\). This establishes that the algorithm rVUCB does indeed provide a high-confidence upper bound on the variance of the distributions.
After noticing this, the rest of the analysis is routine. Given that an arm \(i \ne i^*\) has been pulled sufficiently many times to ensure that \(T_i(t) \ge \max \left\{ {\frac{16e^2\sigma _i^2(1+p)\log T}{\varDelta _i^2},\frac{\log T}{\eta ^2}}\right\} \), where \(p = D(\sqrt{\eta }+ (2\eta )^{3/4})\), we have the following chain of inequalities
where the first step follows from (2), the second step follows from the definitions, the third step uses the fact that \(T_i(t)\) is large enough and \(\eta _0\) is small enough, and the final step uses (3) and the fact that \(\eta \le \eta _0\) by definition. The above shows that once an arm is pulled sufficiently many times, it will never appear as the highest upper bound estimate in the rUCB-Tune algorithm and hence will never get pulled again. The rest of the proof is routine now. \(\square \)
B Proofs from Sect. 5
Proof
(Sketch of Lemma 1) The proof is similar to that of previous results by Gentile et al. (2014, Lemma 2) and Gentile et al. (2017, Lemma 1). We need only show the result for one specific value of t and one specific subset \(S \subset [t]\) with \(|S| = (1-\eta )\cdot t\). The result then follows from first a union bound over all subsets, as is done by Bhatia et al. (2015), and then a union bound over all \(t \le T\), which imposes an additional logarithmic factor.
For a fixed \({\mathbf {z}}\in {\mathbb {R}}^d\), and any \(t \in [T]\), Gentile et al. (2014, Claim 1) show that
since we have assumed for the sake of simplicity that the arms are being sampled from a standard Gaussian. A similar result holds for general sub-Gaussian distributions too. Now for any subset \(S \subset [t]\), the proof then continues as in the analysis of Gentile et al. (2014, Lemma 2) by using optional skipping and setting up a Freedman-style matrix tail bound to get, as a consequence of the above, the following high-confidence estimate, holding with probability at least \(1-\delta \),
where
Continuing with the union bounds as described above finishes the proof. \(\square \)
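The union bound over subsets in the first step relies on the usual counting estimate; with at most an \(\eta \) fraction of the t rounds corrupted,

```latex
\log \#\big\{ S \subset [t] : |S| = (1-\eta)t \big\}
  = \log \binom{t}{\eta t}
  \;\le\; \eta t \log\frac{e}{\eta},
```

so the per-subset failure probability only needs to be shrunk by a factor exponential in \(\eta t \log (e/\eta )\).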
Proof of Theorem 6
To avoid clutter, we will replace \(\hat{G}_t\) by G in the following. Let \(\varvec{\epsilon }_G\) and \({\mathbf {b}}_G\) denote the noise and corruption values in those time instances so that \({\mathbf {r}}_G = X_G^\top {\mathbf {w}}^*+ \varvec{\epsilon }_G + {\mathbf {b}}_G\). Note that \(M_t = X_GX_G^\top \). We have
Now, following the proof technique of Abbasi-Yadkori et al. (2011) requires us to bound \(\left\| {X_G(\varvec{\epsilon }_G + {\mathbf {b}}_G)} \right\| _{M_t}\). Using the fact that \(M_t = X_GX_G^\top \) gives us
Let \(G_t = \left\{ {\tau \le t: b_\tau = 0}\right\} \) be the set of clean points till time t. Since the results of Bhatia et al. (2015, Theorem 10) ensure that the output of Torrent satisfies \(\left\| {\hat{{\mathbf {w}}}^t - {\mathbf {w}}^*} \right\| _2 \le {\mathcal O}\left( {{\sigma _0}}\right) \), we are assured with probability at least \(1 - \frac{1}{t^2}\) that \(G_t \subseteq \hat{G}_t\). Thus, we get
where the second step follows from the fact that we can canonically define \(\epsilon _\tau = 0\) for the corrupted time instances, i.e., if \(\tau < t\) and \(\tau \notin G_t\), by setting \(b_\tau = b_\tau + \epsilon _\tau \), and the last step uses the fact that \(G_t \subset G\). The quantity \(\varvec{\epsilon }_{G_t}^\top X_{G_t} (X_{G_t}X_{G_t}^\top )^{-1}X_{G_t}^\top \varvec{\epsilon }_{G_t}\) can be bounded by \(\sigma _0\sqrt{d\log T}\) using the self-normalized martingale inequality of Abbasi-Yadkori et al. (2011, Theorem 1), since it involves only the uncorrupted points, to which standard results continue to apply. The second quantity \(\left\| {X_G{\mathbf {b}}_G} \right\| _{M_t}\) can be similarly bounded by using the fact that \(\left\| {{\mathbf {b}}_G} \right\| _0 \le 2\eta \cdot t\), and since \(\left\| {\hat{{\mathbf {w}}}^t - {\mathbf {w}}^*} \right\| _2 \le {\mathcal O}\left( {{\sigma _0}}\right) \) by Bhatia et al. (2015, Theorem 10), any corrupted points \(\tau \) that may have landed in the set \(\hat{G}_t\) must satisfy \(\left| {b_\tau } \right| \le \sigma _0\sqrt{\log T}\). This finishes the proof. Note that the last argument \(\left| {b_\tau } \right| \le \sigma _0\sqrt{\log T}\) reveals that the pruning step is indeed a noise-removal step: it prunes away any arm pull whose reward was excessively corrupted. \(\square \)
Proof of Theorem 7
The proof is mostly routine and follows the proof of a similar result by Abbasi-Yadkori et al. (2011, Theorem 3). Let us define \((\hat{{\mathbf {x}}}^t,\tilde{{\mathbf {w}}}^t) = \underset{{\mathbf {x}}\in A_t}{\arg \max }\ \underset{{\mathbf {w}}\in C_{t-1}}{\arg \max }\ \left\langle {{\mathbf {x}}},{{\mathbf {w}}}\right\rangle \). Then
Now, the SSC properties guarantee \(\lambda _{\min }(M_t) = \varOmega \left( {{t}}\right) \), which gives us \(\left\| {\hat{{\mathbf {x}}}^t} \right\| _{M_t^{-1}} \le {\mathcal O}\left( {{\frac{1}{\sqrt{t}}}}\right) \). This finishes the proof upon using Theorem 6 and simple manipulations. \(\square \)
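The "simple manipulations" amount to summing the per-round confidence widths; with \(\left\| {\hat{{\mathbf {x}}}^t} \right\| _{M_t^{-1}} \le {\mathcal O}\left( {1/\sqrt{t}}\right) \), the instantaneous terms sum to a \(\sqrt{T}\) rate:

```latex
\sum_{t=1}^{T} \frac{1}{\sqrt{t}}
  \;\le\; 1 + \int_{1}^{T} \frac{\mathrm{d}x}{\sqrt{x}}
  \;=\; 2\sqrt{T} - 1
  \;\le\; 2\sqrt{T}.
```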
Cite this article
Kapoor, S., Patel, K.K. & Kar, P. Corruption-tolerant bandit learning. Mach Learn 108, 687–715 (2019). https://doi.org/10.1007/s10994-018-5758-5
Keywords
 Robust learning
 Online learning
 Bandit algorithms