A Proofs from Sect. 4
Proof of Theorem 1
Fix a policy \(\pi \) and let the reward distributions be unit-variance Gaussians \({\mathbf {u}}_i = {\mathcal {N}}(\mu _i,1)\). Let \(\varDelta > 0\) be a constant to be determined later. Given a constant \(c \in (0,1)\), consider two settings: one where the vector of arm means for the K arms is \(\varvec{\mu }= \left( {c+\varDelta ,c,c,\ldots ,c}\right) \in {\mathbb {R}}^K\), and another where the arm means are \(\varvec{\mu }' = \varvec{\mu }+ 2\varDelta \cdot {\mathbf {e}}_j\), where \({\mathbf {e}}_j = (0,\ldots ,0,1,0,\ldots ,0)\in {\mathbb {R}}^K\) is the jth canonical vector. The coordinate j will be specified momentarily.
Clearly, in the first setting, the first arm is the best and in the second setting the jth arm is the best. In both settings, the adversary acts simply by assigning a (corrupted) reward of 0 whenever it gets a chance to corrupt an arm pull. Clearly such an adversary is a stochastic adversary.
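In particular, if the adversary corrupts each pull with probability \(\eta \) (one natural reading of the corruption rate, noted here only to explain the \(c\eta \cdot T\) term appearing below), then the expected observed reward from a pull of arm i is
$$\begin{aligned} (1-\eta )\mu _i + \eta \cdot 0 = \mu _i - \eta \mu _i \le \mu _i - c\eta , \end{aligned}$$
since \(\mu _i \ge c\) in both settings, so the corruption alone costs at least \(c\eta \cdot T\) in expected reward over T rounds.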
Let \(T_i(T,\pi )\) denote the number of times the player obeying a policy \(\pi \) pulls the ith arm in a sequence of T trials. Also, for any \(\varvec{\mu }\in {\mathbb {R}}^K\), policy \(\pi \) and \(T > 0\), define \({\mathbb {P}}_{\varvec{\mu },\pi ,\eta ,T}\) to be the distribution induced on the history \({\mathcal {H}}^{T}\) by the action of policy \(\pi \) on the arms with mean rewards as given by the vector \(\varvec{\mu }\) and the adversary described above with corruption rate \(\eta \) (a cleaner construction of the distribution \({\mathbb {P}}_{\varvec{\mu },\pi ,\eta ,T}\) is possible by properly defining filtrations but we avoid that to keep the discussion focused).
Also let \({\mathbb {E}}_{\varvec{\mu },\pi ,\eta ,T}\) denote expectations taken with respect to \({\mathbb {P}}_{\varvec{\mu },\pi ,\eta ,T}\) and let \(\bar{R}_T(\pi ,\varvec{\mu },\eta )\) denote the expected regret with respect to the same. Also define
$$\begin{aligned} j := \arg \min _{i \ne 1}\ {\mathbb {E}}_{\varvec{\mu },\pi ,\eta ,T}[T_i(T,\pi )], \end{aligned}$$
and use this to define \(\varvec{\mu }' = \varvec{\mu }+ 2\varDelta \cdot {\mathbf {e}}_j\). Note that j is taken to be the suboptimal arm in the first setting least likely to be played by the policy \(\pi \) when interacting with the arms with means \(\varvec{\mu }\) and the adversary. Given the above, since
$$\begin{aligned} \bar{R}_T(\pi ,\varvec{\mu },\eta ) = \varDelta \cdot \sum _{i=2}^K{\mathbb {E}}_{\varvec{\mu },\pi ,\eta ,T}[T_i(T,\pi )] + c\eta \cdot T, \end{aligned}$$
we have (for the first inequality, note that under \(\varvec{\mu }\) every round in which arm 1 is not pulled incurs an instantaneous regret of at least \(\varDelta \); for the second, under \(\varvec{\mu }'\) every pull of arm 1 incurs an instantaneous regret of at least \(\varDelta \), since arm j is better than arm 1 by exactly \(\varDelta \))
$$\begin{aligned} \begin{aligned} \bar{R}_T(\pi ,\varvec{\mu },\eta )&\ge {\mathbb {P}}_{\varvec{\mu },\pi ,\eta ,T}[T_1(T,\pi ) \le T/2]\cdot \frac{T\varDelta }{2} + c\eta \cdot T\\ \bar{R}_T(\pi ,\varvec{\mu }',\eta )&\ge {\mathbb {P}}_{\varvec{\mu }',\pi ,\eta ,T}[T_1(T,\pi ) > T/2]\cdot \frac{T\varDelta }{2} + c\eta \cdot T \end{aligned} \end{aligned}$$
(1)
We now apply Pinsker's inequality (Tsybakov 2009, Lemma 2.6) to get
$$\begin{aligned} {\mathbb {P}}_{\varvec{\mu },\pi ,\eta ,T}\left[ {T_1(T,\pi ) \le \frac{T}{2}}\right] + {\mathbb {P}}_{\varvec{\mu }',\pi ,\eta ,T}\left[ {T_1(T,\pi ) > \frac{T}{2}}\right] \ge \exp \left[ {-KL({\mathbb {P}}_{\varvec{\mu },\pi ,\eta ,T}||{\mathbb {P}}_{\varvec{\mu }',\pi ,\eta ,T})}\right] , \end{aligned}$$
where KL stands for the Kullback–Leibler divergence. Straightforward manipulations now give
$$\begin{aligned} KL({\mathbb {P}}_{\varvec{\mu },\pi ,\eta ,T}||{\mathbb {P}}_{\varvec{\mu }',\pi ,\eta ,T}) = {\mathbb {E}}_{\varvec{\mu },\pi ,\eta ,T}[T_j(T,\pi )]\cdot KL({\mathcal {N}}(\mu _j,1),{\mathcal {N}}(\mu _j',1)). \end{aligned}$$
Now, we use the fact that \(KL({\mathcal {N}}(c,1),{\mathcal {N}}(c+2\varDelta ,1)) = 2\varDelta ^2\), apply an averaging argument (since \(\sum _{i \ne 1}{\mathbb {E}}_{\varvec{\mu },\pi ,\eta ,T}[T_i(T,\pi )] \le T\) and j minimizes this expectation over the \(K-1\) suboptimal arms) to get \({\mathbb {E}}_{\varvec{\mu },\pi ,\eta ,T}[T_j(T,\pi )] \le \frac{T}{K-1}\), and set \(\varDelta = \sqrt{(K-1)/4T}\), so that the divergence above is at most 1/2 and the right-hand side of Pinsker's bound is at least \(e^{-1/2} \ge 8/27\). Summing the two inequalities in (1) then shows that
$$\begin{aligned} \bar{R}_T(\pi ,\varvec{\mu },\eta ) + \bar{R}_T(\pi ,\varvec{\mu }',\eta ) \ge \frac{2}{27}\sqrt{(K-1)T} + 2c\eta \cdot T \end{aligned}$$
which, by an application of another averaging argument, tells us that for at least one setting \(\tilde{\varvec{\mu }} \in \left\{ {\varvec{\mu },\varvec{\mu }'}\right\} \), we must have
$$\begin{aligned} \bar{R}_T(\pi ,\tilde{\varvec{\mu }},\eta ) \ge \frac{1}{27}\sqrt{(K-1)T} + c\eta \cdot T, \end{aligned}$$
which finishes the proof. \(\square \)
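Purely as an illustration of the construction (and not part of the proof), the two environments and the zeroing stochastic adversary can be written out as follows; all names and numerical values below are ours and merely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def pull(arm, means, eta):
    """One pull under the stochastic adversary of the proof: with
    probability eta the observed reward is replaced by 0."""
    reward = rng.normal(means[arm], 1.0)   # clean N(mu_i, 1) reward
    return 0.0 if rng.random() < eta else reward

# The two hard instances from the proof (illustrative values of K, c, Delta, eta).
K, c, Delta, eta = 5, 0.5, 0.1, 0.05
mu = np.full(K, c)
mu[0] += Delta                    # setting 1: the first arm is best
j = 3                             # stand-in for the arg-min coordinate j from the proof
mu_prime = mu.copy()
mu_prime[j] += 2 * Delta          # setting 2: the j-th arm is best
```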
Proof of Theorem 2
First of all, note that step 4 in Algorithm 3 can be seen as executing the strategy
$$\begin{aligned} I_t = \arg \max _{i \in [K]} \tilde{\mu }_{i,t} + \left( {\eta + \sqrt{\frac{\log t}{T_i(t)}}}\right) e\sigma _0 \end{aligned}$$
The only difference between the above expression and the one used by Algorithm 3 is an additive term \(e\eta \sigma _0\), which is the same for every arm and hence does not change the output of the \(\arg \max \) operation. We next note that the corruption model considered by Lai et al. (2016) is exactly the stochastic corruption model, and that in the uni-dimensional case, the AgnosticMean algorithm presented by Lai et al. (2016, Algorithm 3) is simply the median estimator. Given this, at every time step t, Lai et al. (2016, Theorem 1.1) guarantee that with probability at least \(1 - \frac{4}{t^2}\),
$$\begin{aligned} \left| {\mu _i - \tilde{\mu }_{i,t}} \right| \le \left( {\eta + \sqrt{\frac{\log t}{T_i(t)}}}\right) e\sigma _i \end{aligned}$$
(2)
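To make the selection rule above concrete, here is a minimal sketch of the index computation it describes, with the median of the rewards observed so far used as the robust mean estimate (matching the uni-dimensional AgnosticMean remark above). The function and variable names are ours rather than the paper's, and eta and sigma0 are taken to be the known upper bounds used by the algorithm.

```python
import numpy as np

def robust_index(rewards_by_arm, t, eta, sigma0):
    """Pick the arm maximizing: median estimate + (eta + sqrt(log t / T_i)) * e * sigma0."""
    indices = []
    for obs in rewards_by_arm:        # obs: list of rewards observed so far for arm i
        T_i = len(obs)
        mu_tilde = np.median(obs)     # robust (median) estimate of mu_i
        width = (eta + np.sqrt(np.log(t) / T_i)) * np.e * sigma0
        indices.append(mu_tilde + width)
    return int(np.argmax(indices))
```

After the initial round-robin phase, the selected arm would be pulled, its observed reward appended to the corresponding list, and the rule re-applied at the next round.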
Now suppose an arm \(i \ne i^*\) has been played enough times to ensure \(T_i(t) \ge \frac{16e^2\sigma _0^2\log T}{\varDelta _i^2}\); then we have the following chain of inequalities
$$\begin{aligned} \tilde{\mu }_{i,t} + \left( {\eta + \sqrt{\frac{\log t}{T_i(t)}}}\right) e\sigma _0&\le \mu _i + \left( {\eta + \sqrt{\frac{\log t}{T_i(t)}}}\right) e\sigma _0 + \left( {\eta + \sqrt{\frac{\log t}{T_i(t)}}}\right) e\sigma _i\\&= \mu ^*- \varDelta _i + \left( {\eta + \sqrt{\frac{\log t}{T_i(t)}}}\right) e\sigma _0 + \left( {\eta + \sqrt{\frac{\log t}{T_i(t)}}}\right) e\sigma _i\\&\le \mu ^*\\&\le \tilde{\mu }_{i^*,t} + \left( {\eta + \sqrt{\frac{\log t}{T_{i^*}(t)}}}\right) e\sigma _{i^*}\\&\le \tilde{\mu }_{i^*,t} + \left( {\eta + \sqrt{\frac{\log t}{T_{i^*}(t)}}}\right) e\sigma _0 \end{aligned}$$
where the first and fourth steps follow from (2), the second step follows from the definition of \(\varDelta _i\), the third step uses the fact that \(T_i(t)\) is large enough and \(\eta _0 \le \frac{\varDelta _i}{4e\sigma _0}\) (see the computation below), and the final step uses the fact that \(\sigma _{i^*} \le \sigma _0\) by construction.
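For completeness, the computation behind the third step is the following. The condition on \(T_i(t)\) gives \(\sqrt{\log t/T_i(t)} \le \frac{\varDelta _i}{4e\sigma _0}\) for all \(t \le T\), and we also use \(\eta \le \eta _0 \le \frac{\varDelta _i}{4e\sigma _0}\) (making explicit the convention \(\eta \le \eta _0\) used throughout) together with \(\sigma _i \le \sigma _0\), so that
$$\begin{aligned} \left( {\eta + \sqrt{\frac{\log t}{T_i(t)}}}\right) e\sigma _0 + \left( {\eta + \sqrt{\frac{\log t}{T_i(t)}}}\right) e\sigma _i \le \frac{\varDelta _i}{2e\sigma _0}\cdot e\sigma _0 + \frac{\varDelta _i}{2e\sigma _0}\cdot e\sigma _i \le \frac{\varDelta _i}{2} + \frac{\varDelta _i}{2} = \varDelta _i, \end{aligned}$$
and hence the second line of the chain is at most \(\mu ^*- \varDelta _i + \varDelta _i = \mu ^*\).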
The above shows that once an arm is pulled sufficiently many times, it will never appear as the highest upper bound estimate in the rUCB-MAB algorithm and hence will never get pulled again. This allows us to estimate, using a standard proof technique, the expected number of times each arm would be pulled, as follows
$$\begin{aligned} {\mathbb {E}}\left[ {{T_i(t)}}\right]&= 1+ {\mathbb {E}}\left[ {{\sum _{t = K+1}^T {\mathbb {I}}\left\{ {{I_t = i}}\right\} }}\right] \\&= 1 + {\mathbb {E}}\left[ {{\sum _{t = K+1}^T {\mathbb {I}}\left\{ {{I_t = i \wedge T_i(t) \le \frac{16e^2\sigma _0^2\ln T}{\varDelta _i^2}}}\right\} + {\mathbb {I}}\left\{ {{I_t = i \wedge T_i(t)> \frac{16e^2\sigma _0^2\ln T}{\varDelta _i^2}}}\right\} }}\right] \\&\le 1 + \frac{16e^2\sigma _0^2\ln T}{\varDelta _i^2} + \sum _{t = K+1}^T {\mathbb {P}}\left[ {{I_t = i \wedge T_i(t)> \frac{16e^2\sigma _0^2\ln T}{\varDelta _i^2}}}\right] \\&= 1 + \frac{16e^2\sigma _0^2\ln T}{\varDelta _i^2} + \sum _{t = K+1}^T {\mathbb {P}}\left[ {{I_t = i \,|\,T_i(t)> \frac{16e^2\sigma _0^2\ln T}{\varDelta _i^2}}}\right] {\mathbb {P}}\left[ {{T_i(t) > \frac{16e^2\sigma _0^2\ln T}{\varDelta _i^2}}}\right] \\&\le 1 + \frac{16e^2\sigma _0^2\ln T}{\varDelta _i^2} + \sum _{t = K+1}^T \frac{16}{t^2}\\&\le \frac{16e^2\sigma _0^2\ln T}{\varDelta _i^2} + 35, \end{aligned}$$
where the first step uses the fact that each arm is initially played once in a round-robin fashion in step 1 of Algorithm 3, and the last step uses \(\sum _{t \ge 1} \frac{1}{t^2} = \frac{\pi ^2}{6}\), so that \(1 + \sum _{t=K+1}^T\frac{16}{t^2} < 35\). We now have
$$\begin{aligned} {\mathbb {E}}\left[ {{\sum _{t=1}^T r_t}}\right]&= {\mathbb {E}}\left[ {{\sum _{i=1}^K\sum _{t=1}^Tr_t{\mathbb {I}}\left\{ {{I_t = i}}\right\} }}\right] \\&= \sum _{i=1}^K\sum _{t=1}^T{\mathbb {E}}\left[ {{{\mathbb {E}}\left[ {{r_t{\mathbb {I}}\left\{ {{I_t = i}}\right\} \,|\,{\mathcal {H}}^t}}\right] {\mathbb {I}}\left\{ {{I_t = i}}\right\} }}\right] \\&\ge \sum _{i=1}^K\sum _{t=1}^T(1-\eta )\mu _i{\mathbb {E}}\left[ {{{\mathbb {I}}\left\{ {{I_t = i}}\right\} }}\right] - B\eta \cdot T\\&= (1-\eta )\sum _{i=1}^K\mu _i{\mathbb {E}}\left[ {{T_i(t)}}\right] - B\eta \cdot T \end{aligned}$$
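The third step in the display above uses only the per-round effect of the corruption: conditioned on the history and on \(I_t = i\), the observed reward equals the clean reward (with mean \(\mu _i\)) with probability at least \(1-\eta \), and is otherwise a corrupted value which we take to be bounded in magnitude by B (this is where the boundedness parameter B enters). Hence
$$\begin{aligned} {\mathbb {E}}\left[ {{r_t \,|\,{\mathcal {H}}^t, I_t = i}}\right] \ge (1-\eta )\mu _i - \eta B, \end{aligned}$$
and summing over arms and rounds yields the displayed lower bound.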
Combining this with the previous bound on \({\mathbb {E}}\left[ {{T_i(t)}}\right] \) and using \(1-\eta \le 1\) gives us the gap-dependent regret bound
$$\begin{aligned} \bar{R}_T({\textsc {rUCB-MAB}}) \le \sum _{i\ne i^*}\left( {\frac{16e^2\sigma _0^2\ln T}{\varDelta _i} + 35\varDelta _i}\right) + \eta \cdot (\mu ^*+B)T \end{aligned}$$
To convert to the gap-agnostic form claimed in Theorem 2, we simply use the Cauchy-Schwarz inequality as follows
$$\begin{aligned} {\bar{R}_T({\textsc {rUCB-MAB}})}&= (1-\eta )\mu ^*\cdot T - {\mathbb {E}}\left[ {{\sum _{t=1}^T r_t}}\right] + \eta \cdot (\mu ^*+B)T\\&= (1-\eta )\sum _{i=1}^K\varDelta _i{\mathbb {E}}\left[ {{T_i(t)}}\right] + \eta \cdot (\mu ^*+B)T\\&\le (1-\eta )\sqrt{\sum _{i=1}^K\varDelta _i^2{\mathbb {E}}\left[ {{T_i(t)}}\right] }\sqrt{\sum _{i=1}^K{\mathbb {E}}\left[ {{T_i(t)}}\right] } + \eta \cdot (\mu ^*+B)T\\&\le (1-\eta )\sqrt{16e^2\sigma _0^2KT\ln T + 35T\sum _{i=1}^K\varDelta _i^2} + \eta \cdot (\mu ^*+B)T, \end{aligned}$$
which establishes the result, where the last step substitutes the bound on \({\mathbb {E}}\left[ {{T_i(t)}}\right] \) derived above and uses \(\sum _{i=1}^K{\mathbb {E}}\left[ {{T_i(t)}}\right] \le T\). \(\square \)
Proof
(Sketch of Theorem 3) Notice that the proof of Theorem 2 shows that once a suboptimal arm is pulled sufficiently many times, it will never appear as the highest upper bound estimate in the rUCB-MAB algorithm and hence will never get pulled again. Hereon, the standard analysis applies.
$$\begin{aligned} {\mathbb {E}}\left[ {{\sum _{t=1}^T r^*_t}}\right]&= {\mathbb {E}}\left[ {{\sum _{i=1}^K\sum _{t=1}^Tr^*_t{\mathbb {I}}\left\{ {{I_t = i}}\right\} }}\right] \\&= \sum _{i=1}^K\sum _{t=1}^T{\mathbb {E}}\left[ {{{\mathbb {E}}\left[ {{r^*_t{\mathbb {I}}\left\{ {{I_t = i}}\right\} \,|\,{\mathcal {H}}^t}}\right] {\mathbb {I}}\left\{ {{I_t = i}}\right\} }}\right] \\&= \sum _{i=1}^K\sum _{t=1}^T \mu _i{\mathbb {E}}\left[ {{{\mathbb {I}}\left\{ {{I_t = i}}\right\} }}\right] \\&= \sum _{i=1}^K\mu _i{\mathbb {E}}\left[ {{T_i(t)}}\right] \end{aligned}$$
Notice that this result relies on the assumption that the corruption rate is bounded as \(\eta \le \frac{\varDelta _{\min }}{4e\sigma _0}\). \(\square \)
Proof of Corollary 1
The proof of Theorem 2 assures us that for arms that satisfy \(\varDelta _i > 4e\sigma _0\eta _0\) we have
$$\begin{aligned} {\mathbb {E}}\left[ {{T_i(t)}}\right] \le \frac{16e^2\sigma _0^2\ln T}{\varDelta _i^2} + 35 \end{aligned}$$
The total contribution to the regret due to these arms is already bounded by Theorem 2 as
$$\begin{aligned} \sum _{i: \varDelta _i > 4e\sigma _0\eta _0}\varDelta _i\cdot {\mathbb {E}}\left[ {{T_i(t)}}\right] \le C(1-\eta )\sqrt{KT\ln T} + \eta \cdot (\mu ^*+B)T \end{aligned}$$
For arms that do not satisfy the above condition, i.e., those for which \(\varDelta _i \le 4e\sigma _0\eta _0\), the bound above does not apply. However, notice that the total contribution to the regret due to these arms can be at most
$$\begin{aligned} \sum _{i: \varDelta _i \le 4e\sigma _0\eta _0}\varDelta _i\cdot {\mathbb {E}}\left[ {{T_i(t)}}\right] \le 4e\sigma _0\eta _0\sum _{i: \varDelta _i \le 4e\sigma _0\eta _0}{\mathbb {E}}\left[ {{T_i(t)}}\right] \le 4e\sigma _0\eta _0T, \end{aligned}$$
since we must have \(\sum _{i: \varDelta _i \le 4e\sigma _0\eta _0} T_i(T) \le T\). Combining the two results gives us the claimed bound. Notice that no assumptions are made regarding \(\varDelta _{\min }\) in this proof.\(\square \)
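Explicitly, adding the two displayed contributions gives a bound of the form
$$\begin{aligned} \bar{R}_T \le C(1-\eta )\sqrt{KT\ln T} + \eta \cdot (\mu ^*+B)T + 4e\sigma _0\eta _0 T, \end{aligned}$$
with C the constant from Theorem 2.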
Proof of Theorem 4
In this case, we note that in the uni-dimensional setting, the CovarianceEstimation algorithm proposed by Lai et al. (2016, Algorithm 4) is simply Steps 1 and 2 of the rVUCB algorithm (see Algorithm 2). Given this, at every time step t, Lai et al. (2016, Theorem 1.5) guarantee that with probability at least \(1 - \frac{4}{t^2}\)
$$\begin{aligned} \left| {\sigma _i - \tilde{\sigma }_{i,t}} \right| \le D\left( {\eta ^{1/2} + \left( {\eta + \sqrt{\frac{\log t}{T_i(t)}}}\right) ^{3/4}}\right) \sigma _i, \end{aligned}$$
(3)
for some constant D, which establishes, with probability at least \(1 - \frac{4}{t^2}\), that
$$\begin{aligned} \sigma _i \le \tilde{\sigma }_{i,t}/(1-c), \end{aligned}$$
where \(c = D\left( {\eta ^{1/2} + \left( {\eta + \sqrt{\frac{\log t}{T_i(t)}}}\right) ^{3/4}}\right) \). To avoid a division by zero, we cap c at \(2\eta \) and assume that \(\eta < 1/2\), so that \(c < 1\). This establishes that the rVUCB algorithm does indeed provide a high-confidence upper bound on the variance of the distributions.
After noticing this, the rest of the analysis is routine. Once an arm \(i \ne i^*\) has been pulled enough times to ensure that \(T_i(t) \ge \max \left\{ {\frac{16e^2\sigma _i^2(1+p)\log T}{\varDelta _i^2},\frac{\log T}{\eta ^2}}\right\} \), where \(p = D(\sqrt{\eta }+ (2\eta )^{3/4})\), we have the following chain of inequalities
$$\begin{aligned} \tilde{\mu }_{i,t} + \left( {\eta _0 + \sqrt{\frac{\log t}{T_i(t)}}}\right) e\tilde{\sigma }_{i,t}&\le \mu _i + \left( {\eta _0 + \sqrt{\frac{\log t}{T_i(t)}}}\right) e\tilde{\sigma }_{i,t} + \left( {\eta + \sqrt{\frac{\log t}{T_i(t)}}}\right) e\sigma _i\\&= \mu ^*- \varDelta _i + \left( {\eta _0 + \sqrt{\frac{\log t}{T_i(t)}}}\right) e\tilde{\sigma }_{i,t} + \left( {\eta + \sqrt{\frac{\log t}{T_i(t)}}}\right) e\sigma _i\\&\le \mu ^*\\&\le \tilde{\mu }_{i^*,t} + \left( {\eta + \sqrt{\frac{\log t}{T_{i^*}(t)}}}\right) e\sigma _{i^*}\\&\le \tilde{\mu }_{i^*,t} + \left( {\eta _0 + \sqrt{\frac{\log t}{T_{i^*}(t)}}}\right) e\tilde{\sigma }_{i^*,t} \end{aligned}$$
where the first and fourth steps follow from (2), the second step follows from the definitions, the third step uses the fact that \(T_i(t)\) is large enough and \(\eta _0\) is small enough, and the final step uses (3) and the fact that \(\eta \le \eta _0\) by definition. The above shows that once an arm is pulled sufficiently many times, it will never appear as the highest upper bound estimate in the rUCB-Tune algorithm and hence will never get pulled again. The rest of the proof is now routine. \(\square \)
B Proofs from Sect. 5
Proof
(Sketch of Lemma 1) The proof is similar to that of previous results by Gentile et al. (2014, Lemma 2) and Gentile et al. (2017, Lemma 1). We need only show the result for one specific value of t and one specific subset \(S \subset [t]\) with \(|S| = (1-\eta )\cdot t\). The result then follows from first taking a union bound over all such subsets, as is done by Bhatia et al. (2015), and then a union bound over all \(t \le T\), which imposes an additional logarithmic factor.
For a fixed \({\mathbf {z}}\in {\mathbb {R}}^d\), and any \(t \in [T]\), Gentile et al. (2014, Claim 1) show that
$$\begin{aligned} {\mathbb {E}}\left[ {{\min _{k \in \{1,\ldots ,n_t\}}({\mathbf {z}}^\top {\mathbf {x}}^{t,k})^2\,|\,n_t}}\right] \ge 1/4, \end{aligned}$$
since we have assumed, for the sake of simplicity, that the arms are sampled from a standard Gaussian; a similar result holds for general sub-Gaussian distributions too. For any subset \(S \subset [t]\), the proof then continues as in the analysis of Gentile et al. (2014, Lemma 2), using optional skipping and a Freedman-style matrix tail bound to obtain, as a consequence of the above, the following high-confidence estimate, which holds with probability at least \(1-\delta \),
$$\begin{aligned} \min _{\begin{array}{c} {\tau \in S}\\ {k_\tau \in \{1,\ldots ,n_\tau \}} \end{array}} \lambda _{\min }\left( {\sum _{\tau \in S} {\mathbf {x}}^{\tau ,k_\tau } ({\mathbf {x}}^{\tau ,k_\tau })^\top }\right) \ge B\left( |S|,\frac{\delta }{2d}\right) ~, \end{aligned}$$
(4)
where
$$\begin{aligned} B(T,\delta ) = T/4 - 8\left( \log (T/\delta ) + \sqrt{T\,\log (T/\delta )} \right) . \end{aligned}$$
Continuing with the union bounds as described above finishes the proof. \(\square \)
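As a side remark on how this estimate is used downstream: once \(|S|\) exceeds \(C_0\log (2d|S|/\delta )\) for a sufficiently large absolute constant \(C_0\) (a crude sufficient condition, not an attempt at tight constants), the two subtracted terms in \(B\left( |S|,\frac{\delta }{2d}\right) \) are dominated by the leading \(|S|/4\) term, so that
$$\begin{aligned} B\left( |S|,\frac{\delta }{2d}\right) \ge \frac{|S|}{8}. \end{aligned}$$
In other words, \(\lambda _{\min }\) grows linearly with \(|S|\), which is the kind of \(\varOmega (t)\) growth of \(\lambda _{\min }(M_t)\) invoked in the proof of Theorem 7.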
Proof of Theorem 6
To avoid clutter, we will replace \(\hat{G}_t\) by G in the following. Let \(\varvec{\epsilon }_G\) and \({\mathbf {b}}_G\) denote the noise and corruption values at the time instances in G, so that \({\mathbf {r}}_G = X_G^\top {\mathbf {w}}^*+ \varvec{\epsilon }_G + {\mathbf {b}}_G\). Note that \(M_t = X_GX_G^\top \). We have
$$\begin{aligned} \bar{\mathbf {w}}^t&= (X_GX_G^\top )^{-1}X_G {\mathbf {r}}_G\\&= (X_GX_G^\top )^{-1}X_G (X_G^\top {\mathbf {w}}^*+ \varvec{\epsilon }_G + {\mathbf {b}}_G)\\&= {\mathbf {w}}^*+ (X_GX_G^\top )^{-1}X_G (\varvec{\epsilon }_G + {\mathbf {b}}_G) \end{aligned}$$
Now, following the proof technique of Abbasi-Yadkori et al. (2011) requires us to bound \(\left\| {X_G(\varvec{\epsilon }_G + {\mathbf {b}}_G)} \right\| _{M_t^{-1}}\); indeed, using the fact that \(M_t = X_GX_G^\top \), we have \(\left\| {\bar{\mathbf {w}}^t - {\mathbf {w}}^*} \right\| _{M_t} = \left\| {X_G(\varvec{\epsilon }_G + {\mathbf {b}}_G)} \right\| _{M_t^{-1}}\). The triangle inequality gives us
$$\begin{aligned} \left\| {X_G(\varvec{\epsilon }_G + {\mathbf {b}}_G)} \right\| _{M_t^{-1}} \le \left\| {X_G\varvec{\epsilon }_G} \right\| _{M_t^{-1}} + \left\| {X_G{\mathbf {b}}_G} \right\| _{M_t^{-1}}. \end{aligned}$$
Let \(G_t = \left\{ {\tau \le t: b_\tau = 0}\right\} \) be the set of clean points till time t. Since the results of Bhatia et al. (2015, Theorem 10) ensure that the output of Torrent satisfies \(\left\| {\hat{{\mathbf {w}}}^t- {\mathbf {w}}^*} \right\| _2 \le {\mathcal O}\left( {{\sigma _0}}\right) \), we are assured with probability at least \(1 - \frac{1}{t^2}\) that \(G_t \subseteq \hat{G}_t\). Thus, we get
$$\begin{aligned} \left\| {X_G\varvec{\epsilon }_G} \right\| ^2_{M_t^{-1}}&= \varvec{\epsilon }_G^\top X_G^\top (X_GX_G^\top )^{-1}X_G \varvec{\epsilon }_G\\&= \varvec{\epsilon }_{G_t}^\top X_{G_t}^\top (X_GX_G^\top )^{-1}X_{G_t} \varvec{\epsilon }_{G_t}\\&\le \varvec{\epsilon }_{G_t}^\top X_{G_t}^\top (X_{G_t}X_{G_t}^\top )^{-1}X_{G_t} \varvec{\epsilon }_{G_t} \end{aligned}$$
where the second step follows from the fact that we can canonically set \(\epsilon _\tau = 0\) for the corrupted time instances (i.e., for \(\tau < t\) with \(\tau \notin G_t\)) by redefining \(b_\tau := b_\tau + \epsilon _\tau \), and the last step uses the fact that \(G_t \subseteq G\), so that \(X_{G_t}X_{G_t}^\top \preceq X_GX_G^\top \). However, the square root of the quantity \(\varvec{\epsilon }_{G_t}^\top X_{G_t}^\top (X_{G_t}X_{G_t}^\top )^{-1}X_{G_t} \varvec{\epsilon }_{G_t}\) can be bounded by \(\sigma _0\sqrt{d\log T}\) using the self-normalized martingale inequality of Abbasi-Yadkori et al. (2011, Theorem 1), since \(G_t\) is a set of uncorrupted points, to which the standard results continue to apply. The second quantity \(\left\| {X_G{\mathbf {b}}_G} \right\| _{M_t^{-1}}\) can be bounded similarly, using the fact that \(\left\| {{\mathbf {b}}_G} \right\| _0 \le 2\eta \cdot t\) and that, since \(\left\| {\hat{{\mathbf {w}}}^t- {\mathbf {w}}^*} \right\| _2 \le {\mathcal O}\left( {{\sigma _0}}\right) \) by Bhatia et al. (2015, Theorem 10), any corrupted point \(\tau \) that may have landed in the set \(\hat{G}_t\) must satisfy \(\left| {b_\tau } \right| \le \sigma _0\sqrt{\log T}\). This finishes the proof. Note that the last argument, \(\left| {b_\tau } \right| \le \sigma _0\sqrt{\log T}\), reveals that the pruning step is indeed a noise-removal step: it prunes away any point whose reward was excessively corrupted. \(\square \)
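For intuition about the pruning step analysed above, here is a minimal sketch of a Torrent-style alternating procedure of the kind the proof refers to (a least-squares fit on an active set, followed by hard thresholding of the residuals). This is our own illustrative sketch rather than the paper's implementation; the names are ours, and keep_frac plays the role of the retained fraction \(1-\eta \).

```python
import numpy as np

def torrent_style_fit(X, r, keep_frac, n_iters=20):
    """X: (n, d) matrix whose rows are the pulled arm vectors; r: (n,) observed rewards.
    Alternates a least-squares fit on the active set with retention of the
    points having the smallest residuals."""
    n = X.shape[0]
    active = np.arange(n)                   # start by trusting every point
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        w, *_ = np.linalg.lstsq(X[active], r[active], rcond=None)
        residuals = np.abs(r - X @ w)       # residuals of all points under the current fit
        k = max(1, int(keep_frac * n))
        active = np.argsort(residuals)[:k]  # keep the k best-fitting points
    return w, active                        # active plays the role of the estimated clean set G
```

Points whose rewards were excessively corrupted produce large residuals and are pruned away, which is exactly the noise-removal behaviour noted at the end of the proof.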
Proof of Theorem 7
The proof is mostly routine and follows the proof of a similar result by Abbasi-Yadkori et al. (2011, Theorem 3). Let us define \((\hat{{\mathbf {x}}}^t,\tilde{{\mathbf {w}}}^t) = \underset{({\mathbf {x}},{\mathbf {w}}) \in A_t \times C_{t-1}}{\arg \max }\ \left\langle {{\mathbf {x}}},{{\mathbf {w}}}\right\rangle \). Then
$$\begin{aligned} {\mathbb {E}}\left[ {{\left\langle {{\mathbf {w}}^*},{{\mathbf {x}}^{t,*}}\right\rangle - r_t \,|\,{\mathcal {H}}^t}}\right] \le {}&(1-\eta )\left( {\left\langle {{\mathbf {w}}^*},{{\mathbf {x}}^{t,*}}\right\rangle - \left\langle {{\mathbf {w}}^*},{\hat{{\mathbf {x}}}^t}\right\rangle }\right) \\&{}+ \eta \left( {\left\langle {{\mathbf {w}}^*},{{\mathbf {x}}^{t,*}}\right\rangle + B}\right) \\ \le {}&(1-\eta )\left( {\left\langle {\tilde{{\mathbf {w}}}^t},{\hat{{\mathbf {x}}}^t}\right\rangle - \left\langle {{\mathbf {w}}^*},{\hat{{\mathbf {x}}}^t}\right\rangle }\right) + \eta \left( {\left\langle {{\mathbf {w}}^*},{{\mathbf {x}}^{t,*}}\right\rangle + B}\right) \\ ={}&(1-\eta )\left\langle {\tilde{{\mathbf {w}}}^t - {\mathbf {w}}^*},{\hat{{\mathbf {x}}}^t}\right\rangle + \eta \left( {\left\langle {{\mathbf {w}}^*},{{\mathbf {x}}^{t,*}}\right\rangle + B}\right) \\ ={}&(1-\eta )\left( {\left\langle {\tilde{{\mathbf {w}}}^t - \bar{\mathbf {w}}^t},{\hat{{\mathbf {x}}}^t}\right\rangle - \left\langle {{\mathbf {w}}^*- \bar{\mathbf {w}}^t},{\hat{{\mathbf {x}}}^t}\right\rangle }\right) \\&{}+ \eta \left( {\left\langle {{\mathbf {w}}^*},{{\mathbf {x}}^{t,*}}\right\rangle + B}\right) \\ \le {}&(1-\eta )\left\| {\hat{{\mathbf {x}}}^t} \right\| _{M_t^{-1}}\left( {\left\| {\tilde{{\mathbf {w}}}^t - \bar{\mathbf {w}}^t} \right\| _{M_t} + \left\| {{\mathbf {w}}^*- \bar{\mathbf {w}}^t} \right\| _{M_t}}\right) \\&{}+ \eta \left( {1 + B}\right) , \end{aligned}$$
Now, the SSC properties guarantee \(\lambda _{\min }(M_t) = \varOmega \left( {{t}}\right) \) which gives us \(\left\| {\hat{{\mathbf {x}}}^t} \right\| _{M_t^{-1}} \le {\mathcal O}\left( {{\frac{1}{\sqrt{t}}}}\right) \). This finishes the proof upon using Theorem 6 and simple manipulations. \(\square \)
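For completeness, the simple manipulations referred to above amount to summing the per-round bound over the horizon: using
$$\begin{aligned} \sum _{t=1}^T \frac{1}{\sqrt{t}} \le 2\sqrt{T}, \end{aligned}$$
the first term contributes an \({\mathcal O}(\sqrt{T})\) amount to the regret (up to the confidence-width factor supplied by Theorem 6 and logarithmic terms), while the second term accumulates to \(\eta (1+B)T\).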