1 Introduction

Multi-armed bandits have undergone a renaissance in machine learning research [14, 26] with a range of deep theoretical results discovered, while applications to real-world sequential decision making under uncertainty abound, ranging from news [15] and movie recommendation [22], to crowd sourcing [30] and self-driving databases [19, 21]. The relative simplicity of the stochastic bandit setting, as compared to more general partially-observable Markov decision processes (POMDPs), typically admits regret analysis where bandit learners enjoy bounded cumulative regret—the gap between a learner’s cumulative reward to time T and the cumulative reward possible with a fixed but optimal-with-hindsight policy. While many bandit learners are celebrated for attaining sublinear regret or average regret converging to zero, such long-term performance goals say little about the short-term performance of today’s popular bandit algorithms.

Indeed, the bandit setting is well known to be the simplest Markov decision process setting to require balancing of exploration—attempting infrequent actions in case of higher-than-expected rewards—with exploitation—greedy selection of actions that so far appear fruitful. Even in the stochastic setting, where rewards are drawn from stationary (context conditional) distributions, the underlying distributions are unknown and considered adversarially chosen. In other words, there is no free lunch (in the worst case) without significant exploration in early rounds.

The relatively poor early round performance of bandit learners is known as the cold start problem and can be costly in high-stakes domains. Li et al. [15] suggested that bandit learners be warm started or pre-trained somehow prior to such deployment, in the context of online media recommendation and advertising where poor performance leads to user dissatisfaction and financial loss. However, little systematic research has explored the cold start problem. Intuitively, warm start is related to transfer learning [9] and domain adaptation [10], while Shivaswamy and Joachims [25] proposed warm-starting methods for non-contextual bandits and Zhang et al. [32] modify any bandit policy to make use of pre-training from (batch) supervised learning via manipulation of the policy’s importance sampling and weighting, which determines the relative importance of one data point \((\varvec{x},y)\) over another, ultimately resulting in a weighted linear regression. Another work by Li et al. [16] employs virtual plays before committing to an action in every round, which implicitly assumes that the existing logged data are perfectly aligned with the unknown bandit data. A similar assumption is made implicitly by Bouneffouf et al. [6], who combine prior historical observations and clustering information. Other works have proposed approaches to the item-user cold-start problem, such as that proposed by Wang et al. [31], who passively assign a user to each item on top of the usual bandit which selects an item for a user. The warm-start problem is also related to the conservative bandit problem, where the usual bandit setting applies under the existence of a baseline policy and a performance constraint [13]. This paper advocates for Thompson sampling (TS) [28] as a natural framework for warm start bandits. 
Although the prior used in Thompson sampling can be misspecified, as discussed by Liu and Li [17], our extension to the LinTS contextual bandit not only affords more flexible forms of warm start, but quantifies prior uncertainty, and admits regret analysis. Furthermore, this idea can be extended into other bandit algorithms, such as \(\epsilon \)-greedy and LinUCB.

Flexibility in warm start is paramount, as not all settings requiring warm start will necessarily admit prior supervised learning as assumed previously [32]. Indeed, bandits are typically motivated when there is an absence of direct supervision, and only indirect rewards are available. Our framework offers unprecedented flexibility. We advocate that prior knowledge could come from: bandit learning on a previous, related task; domain expert knowledge or knowledge extracted from a rule-based, non-adaptive baseline system; or indeed prior supervised learning.

We introduce a new motivation for warm-start bandits from the database systems domain. Database indices, data structures used by database management systems to execute queries more rapidly, may be formed on any combination of table columns. Unfortunately, the best choice of index depends on unknown query workloads and potentially unstable system performance. Offline solutions to index selection have been the foundations of the automated tools provided by database vendors [3, 11, 33]. Recognising that database administrators cannot practically foresee future database loads, online solutions, where the choice of the representative workload and the cost-benefit analysis of materialising a configuration are automated, have been proposed [7, 8, 12, 18, 23, 24]. Unfortunately, most such approaches lack any form of performance guarantee. Recent work has demonstrated compelling potential for linear bandits for index selection [21], complete with regret bound guarantees; however, the cold start problem is likely to limit deployment, as vendors and users alike may be concerned about out-of-box performance. We demonstrate that a warm-start bandit can deliver strong short-term improvement for database index selection without costing long-term results.

In summary, this paper makes the following contributions:

  • We propose a framework for warm-starting contextual bandits based on LinTS and extend our technique to \(\epsilon \)-greedy and LinUCB;

  • Unlike past efforts to warm-start bandit learners, which strictly apply to supervised learning only, our warm start linear bandit seen in Algorithms 2, 3 and 4 can incorporate prior knowledge from any form of prior learning, such as supervised learning [32], prior bandit learning, or manual construction of a prior by a domain expert. Notably, our warm start approach incorporates uncertainty quantification;

  • We introduce a method to automatically tune the hyperparameters used in Algorithms 2, 3 and 4;

  • We present regret bounds for warm start LinTS and LinUCB that demonstrate sublinear regret for long-term performance;

  • Experiments on database index selection (using data derived from standard system benchmarks), classification task data and synthetic data demonstrate performance improvement in the short term with performance competitive with baselines (where such baselines are able to be run); and

  • We have expanded experiments to demonstrate the effect of increased pre-training on the performance in both accurate and misspecified settings.

2 Background: contextual bandits and linear Thompson sampling

The stochastic contextual multi-armed bandit (MAB) problem is a game proceeding in rounds \(t \in [T]=\{1, 2, \ldots , T\}\). In round t the MAB learner,

  1. observes k possible actions or arms \(i \in [k]\) each with adversarially chosen context vector \(\varvec{x}_t(i) \in {\mathbb {R}}^d\);

  2. selects or pulls an arm \(i_t \in [k]\);

  3. observes random reward \(R_{i_t}(t)\) for the pulled arm \(i_t\), where each \(R_i(t)\mid \varvec{x}_t(i)\sim P_{i\mid \varvec{x}_t(i)}\) independently over \(i\in [k], t\in [T]\).

The MAB learner’s goal is to maximise its cumulative expected reward—the total expected reward over all rounds—which is equivalent to minimising the cumulative regret up to round T:

$$\begin{aligned} Reg(T) = \sum _{t=1}^T {\mathbb {E}}\left[ R_{i^\star _t}(t)\mid \varvec{x}_t(i^\star _t)\right] - {\mathbb {E}}\left[ R_{i_t}(t)\mid \varvec{x}_t(i_t)\right] , \end{aligned}$$

where \(i^\star _t \in {\arg \max }_{i\in [k]} {\mathbb {E}}\left[ R_i(t)\mid \varvec{x}_t(i)\right] \), that is, an optimal arm to pull at round t. When a MAB algorithm’s cumulative regret Reg(T) is sub-linear in T, the average regret Reg(T)/T goes to zero. Such an algorithm is said to be a “no regret” learner or Hannan consistent.
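The regret definition above can be illustrated with a small numerical sketch. The function name `cumulative_regret` and the toy table of expected rewards are hypothetical and chosen only for illustration; in practice the expected rewards are unknown to the learner and are available only in simulation.

```python
import numpy as np

def cumulative_regret(expected_rewards, pulled):
    """Cumulative regret Reg(t) for t = 1..T.

    expected_rewards: (T, k) array, entry [t, i] = E[R_i(t) | x_t(i)].
    pulled: length-T sequence of pulled arm indices i_t.
    """
    expected_rewards = np.asarray(expected_rewards, dtype=float)
    pulled = np.asarray(pulled, dtype=int)
    best = expected_rewards.max(axis=1)                      # expected reward of i*_t
    got = expected_rewards[np.arange(len(pulled)), pulled]   # expected reward of i_t
    return np.cumsum(best - got)

# Toy example with 3 rounds and 2 arms (made-up numbers).
mu = np.array([[0.5, 0.25], [0.0, 1.0], [0.75, 0.25]])
reg = cumulative_regret(mu, [1, 1, 0])
```

Here only the first pull is suboptimal, so the cumulative regret stays flat after round 1, which is the behaviour a “no regret” learner approaches on average.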

Thompson sampling (TS), a Bayesian approach within the family of randomised probability matching algorithms, is one of the earliest design patterns for MAB learning [28]. Each modelled arm’s reward likelihood is endowed with a prior. Arms are then pulled based on their posteriors: e.g., parameters for each arm can be drawn from the corresponding posteriors, and then arm selection may proceed (greedily) by maximising reward likelihood.

Linear Thompson sampling (LinTS) [2, 4] is an algorithm with sub-linear cumulative regret, when the context-conditional reward satisfies a linear relationship

$$\begin{aligned} r_t(i_t) = R_{i_t}(t) \mid \varvec{x}_t(i_t) = \varvec{\theta }_\star ^T \varvec{x}_t(i_t) + \epsilon _{t}(i_t)~, \end{aligned}$$

where additive noise \(\epsilon _{t}(i_t)\) is conditionally R-sub-Gaussian and \(\varvec{\theta _\star } \in {\mathbb {R}}^d\) is an unknown vector-valued parameter shared among all of the k arms.

Like most approaches to linear contextual bandit learning, LinTS adopts (online) ridge regression fitting for estimating the unknown parameter. For any regularisation parameter \(\lambda \in {\mathbb {R}}^+\), define the matrix \(\varvec{V}_t\) as

$$\begin{aligned} \varvec{V}_t = \lambda \varvec{I} + \sum _{s=1}^{t-1}\varvec{x}_s(i_s)\varvec{x}_s^T(i_s)~. \end{aligned}$$

Then, Abeille et al. [2] demonstrated that we can estimate the unknown parameter \(\varvec{\theta }_\star \) as

$$\begin{aligned} \hat{\varvec{\theta }}_t = \varvec{V}_t^{-1}\sum _{s=1}^{t-1}\varvec{x}_s(i_s)r_s(i_s). \end{aligned}$$

Earlier versions of LinTS [4] do not include a tunable regularisation parameter.

A result due to Abbasi-Yadkori et al. [1] is used within LinTS. Assuming \(\Vert \varvec{\theta }_\star \Vert \le S\), then for any \(\delta \in (0,1)\), with probability at least \(1-\delta \):

$$\begin{aligned} \Vert \hat{\varvec{\theta }}_t - \varvec{\theta }_\star \Vert _{\varvec{V}_t}&\le \beta _t(\delta )~,\\ \beta _t(\delta )&= R \sqrt{2\log \frac{\det (\varvec{V}_t)^{1/2}\det (\varvec{V}_1)^{-1/2}}{\delta }} + \sqrt{\lambda } S~. \end{aligned}$$
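As a minimal sketch of how the confidence radius \(\beta _t(\delta )\) above might be evaluated numerically, the following function works with log-determinants to avoid overflow. The function name `beta` and its default arguments are assumptions made for illustration; the sketch covers the cold-start case \(\varvec{V}_1 = \lambda \varvec{I}\).

```python
import numpy as np

def beta(V_t, V_1, delta, R=1.0, S=1.0, lam=1.0):
    """Confidence radius beta_t(delta), assuming V_1 = lam * I."""
    # slogdet returns (sign, log|det|); both matrices are positive definite.
    logdet_t = np.linalg.slogdet(V_t)[1]
    logdet_1 = np.linalg.slogdet(V_1)[1]
    # log( det(V_t)^{1/2} det(V_1)^{-1/2} / delta )
    log_term = 0.5 * (logdet_t - logdet_1) - np.log(delta)
    return R * np.sqrt(2.0 * log_term) + np.sqrt(lam) * S

# The radius grows as observations accumulate in V_t.
V_1 = np.eye(2)
x = np.array([1.0, 0.0])
radius_before = beta(V_1, V_1, delta=0.05)
radius_after = beta(V_1 + np.outer(x, x), V_1, delta=0.05)
```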

In Thompson sampling, we may introduce a perturbation parameter \(\varvec{\eta }_t\in {\mathbb {R}}^d\), which, after rotation and scaling by the inverse square root matrix \(\varvec{V}_t^{-1/2}\) and further scaling by the oversampling factor \(\beta _t(\delta ')\), promotes exploration around the point estimate \(\hat{\varvec{\theta }}_t\):

$$\begin{aligned} \tilde{\varvec{\theta }}_t = \hat{\varvec{\theta }}_t + \beta _t(\delta ')\varvec{V}_t^{-1/2}\varvec{\eta }_t~. \end{aligned}$$

Moreover, Abeille et al. [2] have shown that if \(\varvec{\eta }_t\) follows distribution \({\mathcal {D}}^{TS}\) with the following properties:

  1. There exists \(p>0\) such that, for all \(\Vert \varvec{u}\Vert =1\) we have \({\mathbb {P}}_{\varvec{\eta }\sim {\mathcal {D}}^{TS}}(\varvec{u}^T\varvec{\eta }\ge 1)\ge p\); and

  2. There exist positive constants c and \(c'\) such that, for all \(\delta \in (0,1)\) we have the inequality \({\mathbb {P}}_{\varvec{\eta }\sim {\mathcal {D}}^{TS}}\left( \Vert \varvec{\eta }\Vert \le \sqrt{cd \log \frac{c'd}{\delta }}\right) \ge 1-\delta ~\),

then LinTS is Hannan consistent. We adopt a standard multivariate Gaussian for \(\varvec{\eta }_t\) which satisfies the above properties [2]. With all of these definitions in mind, the version of LinTS used in this paper can be summarised as shown in Algorithm 1.
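The ingredients above (ridge estimate, perturbation, greedy selection and rank-one update) can be sketched as follows. This is an illustrative sketch under our own assumptions rather than a reproduction of Algorithm 1: the function names `lints_round` and `lints_update` are hypothetical, the oversampling factor is passed in as a plain number, and \(\varvec{V}_t^{-1/2}\) is computed by eigendecomposition.

```python
import numpy as np

def lints_round(X, V, b, beta_t, rng):
    """Select an arm for one round; X is the (k, d) matrix of contexts."""
    theta_hat = np.linalg.solve(V, b)             # ridge estimate V_t^{-1} b_t
    # V^{-1/2} = Q diag(w^{-1/2}) Q^T for symmetric positive-definite V.
    w, Q = np.linalg.eigh(V)
    V_inv_sqrt = (Q / np.sqrt(w)) @ Q.T
    eta = rng.standard_normal(V.shape[0])         # eta_t ~ N(0, I)
    theta_tilde = theta_hat + beta_t * V_inv_sqrt @ eta
    return int(np.argmax(X @ theta_tilde))        # greedy w.r.t. the sample

def lints_update(V, b, x, r):
    """Rank-one update after observing reward r for pulled context x."""
    return V + np.outer(x, x), b + r * x

# One round with made-up contexts and reward.
rng = np.random.default_rng(0)
d, lam = 2, 1.0
V, b = lam * np.eye(d), np.zeros(d)
X = np.array([[1.0, 0.0], [0.0, 1.0]])
arm = lints_round(X, V, b, beta_t=1.0, rng=rng)
V, b = lints_update(V, b, X[arm], r=0.5)
```

Setting `beta_t=0` removes the perturbation and reduces the selection step to purely greedy exploitation of the ridge estimate.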

3 Warm-starting linear bandits

We now detail our flexible algorithmic framework for warm-starting contextual bandits, beginning with linear Thompson sampling for which we derive a new regret bound.

3.1 Thompson sampling

Given the foundation of Thompson sampling in Bayesian inference, it is natural to look to manipulating the prior as a means of injecting a priori knowledge of the reward structure before the bandit is put into operation. The Algorithm 1 implementation of LinTS due to Abeille et al. [2] decomposes the prior and posterior distributions on \(\varvec{\theta }_t\) as a Gaussian centred at the point estimate \(\hat{\varvec{\theta }}_t\) with covariance based on oversampling factor \(\beta _t(\delta ')\) and the matrix \(\varvec{V}_t\) via the random perturbation vector \(\varvec{\eta }_t\). Our approach to warm start is to focus on manipulating the initial point estimate \(\hat{\varvec{\theta }}_1\) and the matrix \(\varvec{V}_1\) to incorporate available prior knowledge into LinTS.

Remark 1

Although Algorithm 1 appears to offer the freedom to select any \(\hat{\varvec{\theta }}_1\), Eqs. (1) and (2) do not present an immediate route to adapting subsequent point estimates \(\hat{\varvec{\theta }}_t\). Generalising Eq. (2) to point estimate \(\hat{\varvec{\theta }}_t = \varvec{V}_t^{-1}(\lambda \hat{\varvec{\theta }}_1 + \sum _{s=1}^{t-1}\varvec{x}_s(i_s)r_s(i_s))\) is unintuitive and does not clearly admit regret analysis.

We adopt an intuitive approach of adapting Algorithm 1 to model the difference between an initial guess derived from some process occurring before bandit learning, and the actual parameter. This pre-deployment process could be batch supervised learning, an earlier bandit deployment on a related decision problem, or simply a prior manually constructed by a domain expert. Our general framework is completely agnostic and generalises earlier approaches to warm-starting bandits such as [32]. Without loss of generality we refer to this earlier process as the first phase and the basis for which initial parameters are designed as the first phase dataset. Let \(\varvec{\theta }_\star = \varvec{\mu }_\star + \bar{\varvec{\delta }}_\star \), where \(\varvec{\mu }_\star \) is the true parameter of the first phase dataset and \(\bar{\varvec{\delta }}_\star \) represents the concept drift between first phase and bandit deployment. With this reparametrisation, our linear model becomes:

$$\begin{aligned} r_t(i_t)&= \varvec{\theta }_\star ^T\varvec{x}_t(i_t) + \epsilon _{t}(i_t) = (\varvec{\mu }_\star + \bar{\varvec{\delta }}_\star )^T\varvec{x}_t(i_t) + \epsilon _{t}(i_t)\\ r_t(i_t) - \varvec{\mu }_\star ^T \varvec{x}_t(i_t)&= \bar{\varvec{\delta }}_\star ^T\varvec{x}_t(i_t) + \epsilon _{t}(i_t)\\ y_t(i_t)&= \bar{\varvec{\delta }}_\star ^T\varvec{x}_t(i_t) + \epsilon _{t}(i_t)\;. \end{aligned}$$

Therefore, our problem has reduced from estimating \(\varvec{\theta }_\star \) to estimating \(\bar{\varvec{\delta }}_\star \).

Consider a Bayesian linear regression model with the unknown true value of first phase dataset \(\varvec{\mu }_\star \) modelled by random variable \(\varvec{\mu }\sim {\mathcal {N}}(\hat{\varvec{\mu }}, \varvec{\Sigma }_\mu )\) with conjugate context-conditional Gaussian likelihood. We then model the difference parameter \(\bar{\varvec{\delta }}_\star \) as \(\bar{\varvec{\delta }} \sim {\mathcal {N}}(\varvec{0}, \alpha ^{-1} \varvec{I})\). If \(\varvec{\theta } = \varvec{\mu } + \bar{\varvec{\delta }}\) is the random variable modelling \(\varvec{\theta }_\star \), then \(\varvec{\theta } \sim {\mathcal {N}}(\hat{\varvec{\mu }}, \varvec{\Sigma }_\mu +\alpha ^{-1}\varvec{I})\) owing to the Gaussian’s stability property. Finally, since \(\hat{\varvec{\mu }}\) is known, we can model \(\varvec{\theta }\) as \(\varvec{\theta } = \hat{\varvec{\mu }} + \varvec{\delta }\), that is, a random variable centred at \(\hat{\varvec{\mu }}\) which is shifted by drift \(\varvec{\delta } \sim {\mathcal {N}}(\varvec{0}, (\varvec{\Sigma }_\mu + \alpha ^{-1} \varvec{I}_d))\).

We next generalise the coupled recurrence Eqs. (1) and (2) for efficient incremental computation of the generalised posterior estimates.

Proposition 1

Consider linear regression likelihood \(y_i = \varvec{\theta }^T\varvec{x}_i + \epsilon _i\), where \(\epsilon _i \sim {\mathcal {N}}(0, R^2)\), and prior \(\varvec{\theta } \sim {\mathcal {N}}(\varvec{0}, \varvec{V}_1^{-1})\). Then the posterior conditioned on data \(\varvec{z}_i=(\varvec{x}_i, y_i)\) for \(i\in [t]\) is given by \({\mathcal {N}}(\hat{\varvec{\theta }}_{t+1}, R^2\varvec{V}_{t+1}^{-1})\), where the point estimates \(\hat{\varvec{\theta }}_t\) are defined by Eq. (2), and we replace Eq. (1) for \(\varvec{V}_t\) with

$$\begin{aligned} \varvec{V}_t = R^2\varvec{V}_1 + \sum _{s=1}^{t-1}\varvec{x}_s(i_s)\varvec{x}_s^T(i_s)~, \end{aligned}$$

where \(R^2\) is the variance of the measurement noise.


Proof The posterior distribution is:

$$\begin{aligned} p(&\varvec{\theta }\mid y_1, \cdots , y_n)\\ {}&\propto \exp \left\{ -\frac{1}{2}\left[ \sum _{i=1}^n \left( \frac{y_i - \varvec{\theta }^T\varvec{x}_i}{R}\right) ^2+\varvec{\theta }^T\varvec{V}_1\varvec{\theta } \right] \right\} \\&\propto \exp \left\{ -\frac{1}{2}\left[ \varvec{\theta }^T\left( \frac{1}{R^2}\sum _{i=1}^n \varvec{x}_i\varvec{x}_i^T\right) \varvec{\theta }-\frac{2}{R^2}\varvec{\theta }^T\sum _{i=1}^n y_i \varvec{x}_i + \varvec{\theta }^T\varvec{V}_1\varvec{\theta } \right] \right\} \\&= \exp \left\{ -\frac{1}{2}\left[ \varvec{\theta }^T\left( \varvec{V}_1 + \frac{1}{R^2}\sum _{i=1}^n \varvec{x}_i\varvec{x}_i^T\right) \varvec{\theta }\right. \right. \\&\qquad \left. \left. - \varvec{\theta }^T \left( \frac{1}{R^2}\sum _{i=1}^n y_i \varvec{x}_i\right) - \left( \frac{1}{R^2}\sum _{i=1}^n y_i \varvec{x}_i\right) ^T \varvec{\theta }\right] \right\} \;. \end{aligned}$$

To avoid clutter, let \(\bar{\varvec{V}}_{n+1}= \varvec{V}_1 + \frac{1}{R^2}\sum _{i=1}^n \varvec{x}_i\varvec{x}_i^T\) and \(\bar{\varvec{b}}_{n+1} = \frac{1}{R^2}\sum _{i=1}^n y_i \varvec{x}_i\). Therefore, our posterior distribution can be rewritten as

$$\begin{aligned}&p(\varvec{\theta }\mid y_1, \cdots , y_n)\\ \propto&\exp \left\{ -\frac{1}{2}\left[ \varvec{\theta }^T\bar{\varvec{V}}_{n+1}\varvec{\theta } - \varvec{\theta }^T \bar{\varvec{b}}_{n+1} - \bar{\varvec{b}}_{n+1}^T\varvec{\theta }\right] \right\} \\ \propto&\exp \left\{ -\frac{1}{2}\left[ \varvec{\theta }^T\bar{\varvec{V}}_{n+1}\varvec{\theta } \right. - \varvec{\theta }^T \bar{\varvec{V}}_{n+1} \bar{\varvec{V}}_{n+1}^{-1} \bar{\varvec{b}}_{n+1}- \bar{\varvec{b}}_{n+1}^T\bar{\varvec{V}}_{n+1}^{-T}\bar{\varvec{V}}_{n+1}\varvec{\theta }\right. \\&\qquad \qquad \qquad +\left. \left. \bar{\varvec{b}}_{n+1}^T\bar{\varvec{V}}_{n+1}^{-T}\bar{\varvec{V}}_{n+1}\bar{\varvec{V}}_{n+1}^{-1}\bar{\varvec{b}}_{n+1}\right] \right\} \\ =&\exp \left\{ -\frac{1}{2}\left( \varvec{\theta } - \bar{\varvec{V}}_{n+1}^{-1}\bar{\varvec{b}}_{n+1}\right) ^T\bar{\varvec{V}}_{n+1}\left( \varvec{\theta } - \bar{\varvec{V}}_{n+1}^{-1}\bar{\varvec{b}}_{n+1}\right) \right\} \,, \end{aligned}$$

which is proportional to the density of \({\mathcal {N}}(\bar{\varvec{V}}_{n+1}^{-1}\bar{\varvec{b}}_{n+1}, \bar{\varvec{V}}_{n+1}^{-1})\). Therefore, our estimator for \(\varvec{\theta }\) would be

$$\begin{aligned} \hat{\varvec{\theta }}_{n+1} = \bar{\varvec{V}}_{n+1}^{-1}\bar{\varvec{b}}_{n+1} = \varvec{V}_{n+1}^{-1}\varvec{b}_{n+1}, \end{aligned}$$

where we have defined

$$\begin{aligned} \varvec{V}_{n+1}= R^2\varvec{V}_1 + \sum _{i=1}^n \varvec{x}_i\varvec{x}_i^T, \quad \varvec{b}_{n+1} = \sum _{i=1}^n y_i \varvec{x}_i. \end{aligned}$$

This completes the proof. \(\square \)
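As a numerical sanity check of Proposition 1, the following sketch compares the textbook Bayesian linear regression posterior (precision \(\varvec{V}_1 + \frac{1}{R^2}\sum \varvec{x}_i\varvec{x}_i^T\)) with the rescaled form used in the proposition. All data are randomly generated and the variable names are our own; nothing here is taken from the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, R = 3, 20, 0.5
A = rng.standard_normal((d, d))
V1 = A @ A.T + np.eye(d)                  # prior precision (positive definite)
X = rng.standard_normal((n, d))           # rows are the contexts x_i
y = X @ rng.standard_normal(d) + R * rng.standard_normal(n)

# Textbook form: precision V1 + X^T X / R^2 and mean precision^{-1} X^T y / R^2.
V_bar = V1 + X.T @ X / R**2
mean_bar = np.linalg.solve(V_bar, X.T @ y / R**2)
cov_bar = np.linalg.inv(V_bar)

# Rescaled form of Proposition 1: V = R^2 V1 + X^T X and b = X^T y.
V = R**2 * V1 + X.T @ X
mean = np.linalg.solve(V, X.T @ y)
cov = R**2 * np.linalg.inv(V)

print(np.allclose(mean, mean_bar), np.allclose(cov, cov_bar))
```

Both comparisons agree up to floating-point error because \(\varvec{V}_{n+1} = R^2\bar{\varvec{V}}_{n+1}\), so the factor \(R^2\) cancels in the mean and reappears in the covariance.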

Our approach comes with an appealing interpretation when setting \(\bar{\varvec{\delta }} \sim {\mathcal {N}}(\varvec{0}, \alpha ^{-1} \varvec{I})\): when we are confident that our pre-training guess is very close to the true parameter, we can set drift \(\alpha ^{-1}\) to be very small and close to 0. However, when we are not as confident, \(\alpha ^{-1}\) is naturally set large. Large \(\alpha ^{-1}\) creates more “deviation” or error from our first phase parameter \(\varvec{\mu }_\star \). This suggests a promising new direction which we highlight in future work in Sect. 6.

Our simple reduction of warm-start bandit learning to LinTS admits a regret bound. We follow the pattern of the regret analysis of Abeille et al. [2] with differences detailed next.

Observe first that \(\Vert \hat{\varvec{\theta }}_t - \varvec{\theta }_\star \Vert _{\varvec{V}_t} = \Vert (\hat{\varvec{\theta }}_t - \hat{\varvec{\mu }}) - (\varvec{\theta }_\star -\hat{\varvec{\mu }})\Vert _{\varvec{V}_t} = \Vert \hat{\varvec{\delta }}_t - \varvec{\delta }_\star \Vert _{\varvec{V}_t} \le \beta _t(\delta ')\). Accordingly the argument yielding the confidence ellipsoid \(\beta _t(\delta ')\) stated in [1, Theorem 2] bounding \(\Vert \hat{\varvec{\theta }}_t - \varvec{\theta }_\star \Vert _{\varvec{V}_t}\) applies in our case; the full proof of the modified result can be found in the Appendix. However, as our initial matrix \(\varvec{V}_1\) generalises \(\lambda \varvec{I}\), we must alter the penultimate proof step of Abeille et al. [2] as follows:

  • the inequality proposed by Abbasi-Yadkori et al. [1], which is used to define \(\beta _t(\delta )\) in their paper, is not valid in our scenario. This is corrected by using the version of \(\beta _t(\delta )\) presented in this paper, removing the assumption that \(\varvec{V}_1 = \frac{\lambda }{R^2} \varvec{I}\) and leaving it in terms of \(\varvec{V}_1\):

    $$\begin{aligned} R \sqrt{2\log \frac{\det (\varvec{V}_t)^{1/2}\det (R^2\varvec{V}_1)^{-1/2}}{\delta }} + \sqrt{\lambda _{max}(R^2\varvec{V}_1)} S \end{aligned}$$

  • the inequality of [2, Proposition 2] is no longer valid in our case. However, the last inequality in [20] has modified [2, Proposition 2] into:

    $$\begin{aligned} \sum _{s=1}^t\Vert \varvec{x}_s\Vert ^2_{\varvec{V}_s^{-1}} \le 2\log \left( \frac{\det (\varvec{V}_{t+1})}{\det (R^2\varvec{V}_1)}\right) \end{aligned}$$

    and hence serves our purpose; and

  • in proving [2, Theorem 1] the authors used the fact that \(\varvec{V}_t^{-1} \le \frac{1}{\lambda }\varvec{I}\). This is not the case in our setting, but we can generalise the result with similar reasoning yielding \(\varvec{V}_t^{-1} \le \frac{1}{\lambda _{min}(R^2\varvec{V}_1)}\varvec{I}\), where \(\lambda _{min}(R^2\varvec{V}_1)\) denotes the minimum eigenvalue of the matrix \(R^2\varvec{V}_1\).

We also need to change the definition of S, since our problem has shifted from estimating \(\varvec{\theta }\) to estimating \(\varvec{\delta }\). Therefore, after modifying the framework, the warm-start linear Thompson sampling bandit can be summarised as in Algorithm 2 and admits the following regret bound.

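The reduction described in this section can be sketched in code as follows. This is our own reading of the text and not a reproduction of Algorithm 2: we assume the prior precision is \(\varvec{V}_1 = (\varvec{\Sigma }_\mu + \alpha ^{-1}\varvec{I})^{-1}\), that the bandit regresses on the residual \(y_t = r_t - \hat{\varvec{\mu }}^T\varvec{x}_t\), and that the point estimate is \(\hat{\varvec{\theta }}_t = \hat{\varvec{\mu }} + \varvec{V}_t^{-1}\varvec{b}_t\). The class name `WarmStartLinTS` is hypothetical, and for simplicity the confidence level `delta` is used directly in place of \(\delta ' = \delta /(4T)\).

```python
import numpy as np

class WarmStartLinTS:
    """Illustrative warm-start LinTS sketch (assumptions stated in the text)."""

    def __init__(self, mu_hat, Sigma_mu, alpha, R=1.0, S=1.0, delta=0.05, seed=0):
        self.mu_hat = np.asarray(mu_hat, dtype=float)
        d = len(self.mu_hat)
        self.R, self.S, self.delta = R, S, delta
        # Prior precision V_1 of the drift delta ~ N(0, Sigma_mu + I / alpha).
        V1 = np.linalg.inv(np.asarray(Sigma_mu, dtype=float) + np.eye(d) / alpha)
        self.V0 = R**2 * V1                # R^2 V_1, the initial matrix
        self.V = self.V0.copy()
        self.b = np.zeros(d)
        self.rng = np.random.default_rng(seed)

    def theta_hat(self):
        return self.mu_hat + np.linalg.solve(self.V, self.b)

    def beta(self):
        # Modified radius: R sqrt(2 log(det(V_t)^{1/2} det(R^2 V_1)^{-1/2} / delta))
        # plus sqrt(lambda_max(R^2 V_1)) S.
        logdet = np.linalg.slogdet(self.V)[1] - np.linalg.slogdet(self.V0)[1]
        lam_max = np.linalg.eigvalsh(self.V0)[-1]
        return (self.R * np.sqrt(2.0 * (0.5 * logdet - np.log(self.delta)))
                + np.sqrt(lam_max) * self.S)

    def select(self, X):
        w, Q = np.linalg.eigh(self.V)
        V_inv_sqrt = (Q / np.sqrt(w)) @ Q.T
        eta = self.rng.standard_normal(len(self.b))
        theta_tilde = self.theta_hat() + self.beta() * V_inv_sqrt @ eta
        return int(np.argmax(X @ theta_tilde))

    def update(self, x, r):
        self.V += np.outer(x, x)
        self.b += (r - self.mu_hat @ x) * x   # residual y_t = r_t - mu_hat^T x_t
```

A large `alpha` with small `Sigma_mu` makes \(\varvec{V}_1\) large, so early estimates stay close to `mu_hat`; a small `alpha` lets the observed rewards override the initial guess quickly.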

Theorem 2

(Warm-start LinTS regret bound) Under the assumptions that:

  1. \(\Vert \varvec{x}\Vert \le 1\) for all \(\varvec{x} \in {\mathcal {X}}\);

  2. \(\Vert \varvec{\delta }\Vert \le S\) for some known \(S\in {\mathbb {R}}^+\); and

  3. the conditionally R-sub-Gaussian process \(\{\epsilon _t\}_t\) is a martingale difference sequence given the filtration \({\mathcal {F}}_t^x = ({\mathcal {F}}_1, \sigma (\varvec{x}_1, r_1, \cdots , r_{t-1}, \varvec{x}_t))\) with \({\mathcal {F}}_1\) denoting any information on prior knowledge,

along with the definition of \({\mathcal {D}}^{TS}\) given in Sect. 2, then with probability at least \(1 - \delta \), with \(\delta ' = \delta /(4T)\) and \(\gamma _t = \beta _t(\delta ')\sqrt{cd\log ((c'd)/\delta )} \), the regret of LinTS can be decomposed as

$$\begin{aligned} Reg(T) = R^{TS}(T) + R^{RLS}(T), \end{aligned}$$

with each term bounded as

$$\begin{aligned} R^{TS}(T)&\le \frac{4\gamma _T(\delta ')}{p}\left( \sqrt{2T\log \frac{\det (\varvec{V}_{T+1})}{\det (R^2\varvec{V}_1)}}+\sqrt{\frac{8T}{\lambda _{min}(R^2\varvec{V}_1)}\log \frac{4}{\delta }}\right) \\ R^{RLS}(T)&\le \left( \beta _T(\delta ') + \gamma _T(\delta ')\right) \sqrt{2T\log \frac{\det (\varvec{V}_{T+1})}{\det (R^2\varvec{V}_1)}}~. \end{aligned}$$

3.2 Extension to \(\epsilon \)-Greedy and LinUCB learners

The core idea of our warm-starting method as derived for linear Thompson sampling lies in the method of setting up the initial phase of the bandit. The same expression of initial set up can be applied to other contextual bandit algorithms such as \(\epsilon \)-Greedy and LinUCB.

In the \(\epsilon \)-greedy algorithm, we balance exploration and exploitation by means of relatively naïve randomness: in each round we (uniformly) explore with probability \(\epsilon \) and exploit with probability \(1-\epsilon \). Specifically, by incorporating warm start, this means that at each round we choose an arm at random uniformly from the set [k] with probability \(\epsilon \), and choose an arm at random uniformly from the set \(S = \arg \max _{i \in [k]} \hat{\varvec{\theta }}_t^T\varvec{x}_t(i)\) with probability \(1-\epsilon \). We summarise the warm-start \(\epsilon \)-greedy algorithm in Algorithm 3.

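The selection rule just described can be sketched as below. The function name `eps_greedy_select` is hypothetical, and the warm-started estimate `theta_hat` is assumed to be supplied by the caller (for instance as \(\hat{\varvec{\mu }} + \varvec{V}_t^{-1}\varvec{b}_t\)); this is an illustration of the rule, not a reproduction of Algorithm 3.

```python
import numpy as np

def eps_greedy_select(X, theta_hat, eps, rng):
    """Pick an arm given the (k, d) context matrix X and current estimate."""
    scores = X @ theta_hat
    if rng.random() < eps:
        # Explore: uniform over all k arms.
        return int(rng.integers(len(scores)))
    # Exploit: uniform over the set of maximisers, to break ties fairly.
    best = np.flatnonzero(np.isclose(scores, scores.max()))
    return int(rng.choice(best))

# Made-up example: three arms with one-hot contexts.
rng = np.random.default_rng(0)
arm = eps_greedy_select(np.eye(3), np.array([0.0, 2.0, 1.0]), eps=0.1, rng=rng)
```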

We can also extend our warm-starting technique to LinUCB using the fact that \(\varvec{\theta }\sim {\mathcal {N}}(\hat{\varvec{\mu }} + \varvec{V}_{t}^{-1}\varvec{b}_{t},R^2\varvec{V}_{t}^{-1})\), which is a powerful result. It was proposed by Li et al. [15] that one way to interpret their algorithm is to look at the distribution of the expected payoff \(\varvec{\theta }_\star ^T\varvec{x}_t\). With the affine transformation property of multivariate Gaussian distributions, we have that \(\varvec{\theta }^T\varvec{x} \sim {\mathcal {N}}(\hat{\varvec{\theta }}_t^T\varvec{x},R^2\varvec{x}^T\varvec{V}_{t}^{-1}\varvec{x})\). Therefore, the upper bound of such a quantity is:

$$\begin{aligned} \hat{\varvec{\mu }}^T\varvec{x} + (\varvec{V}_{t}^{-1}\varvec{b}_{t})^T\varvec{x}+\rho R\sqrt{\varvec{x}^T\varvec{V}_{t}^{-1}\varvec{x}} \end{aligned}$$

for some value \(\rho \), which is left as a hyperparameter. The summary of our warm-start LinUCB algorithm can be seen in Algorithm 4.

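The upper-bound score above can be computed for all arms at once, as in the following sketch. The function name `linucb_scores` is hypothetical and the numbers in the example are made up; the arm with the largest score would then be pulled.

```python
import numpy as np

def linucb_scores(X, mu_hat, V, b, rho, R=1.0):
    """Warm-start upper-bound score for every row x of the (k, d) matrix X."""
    theta_hat = mu_hat + np.linalg.solve(V, b)     # mu_hat + V_t^{-1} b_t
    # Quadratic forms x^T V_t^{-1} x, one per arm.
    widths = np.einsum("ij,ij->i", X, np.linalg.solve(V, X.T).T)
    return X @ theta_hat + rho * R * np.sqrt(widths)

# Made-up example: the less-explored second direction gets the larger bonus.
scores = linucb_scores(
    X=np.eye(2), mu_hat=np.zeros(2), V=np.diag([4.0, 1.0]), b=np.zeros(2), rho=2.0
)
arm = int(np.argmax(scores))
```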

Theorem 3

(Warm-start LinUCB regret bound) The regret bound of warm-started LinUCB follows an argument of Lattimore and Szepesvári [14] very closely. The regret, whose complete derivation is provided in Appendix, admits bound

$$\begin{aligned} Reg(T)&\le \left( R\sqrt{2\log \left( \frac{\det (\varvec{V}_T)^{\frac{1}{2}}\det (R^2{\varvec{V}}_1)^{-\frac{1}{2}}}{\delta } \right) } + \sqrt{\lambda _{max}(R^2\varvec{V}_1)} S\right) \varvec{\cdot }\\&\qquad \qquad \qquad \qquad \sqrt{8T\log \left( \frac{\det (\varvec{V}_{T+1})}{\det (R^2\varvec{V}_1)}\right) }~. \end{aligned}$$

3.3 A regret lower bound

We here present a lower bound for the warm-started linear contextual \(\epsilon \)-greedy bandit algorithm. Consider the best-case scenario for \(\epsilon \)-Greedy with constant \(\epsilon \), that is, that we have the true weight as our initial guess, i.e., \(\hat{{\varvec{\mu }}}=\varvec{\theta }_\star \). Assume that we use the hyperparameter \(\alpha \rightarrow \infty \), which ensures the weight’s resistance to changes from observations, i.e., \(\hat{\varvec{\theta }}_t = \hat{{\varvec{\mu }}}= \varvec{\theta }_\star \) for all t. With this setting, denoting \(\Delta _{i,t}\ge 0\) as the difference between the expected rewards of the optimal arm and arm i at round t, the regret is \(\frac{\epsilon }{K}\sum _{t=1}^T\sum _{i=1}^K\Delta _{i,t}\). This argument, detailed in Lemma 4, proves a lower bound since it is derived from a best case scenario.

Lemma 4

The regret for warm-started \(\epsilon \)-greedy is at best \(\frac{\epsilon }{K}\sum _{t=1}^T\sum _{i=1}^K\Delta _{i,t}\).


Proof Since \(\hat{\varvec{\theta }}_t = \varvec{\theta }_\star \) for all t, each exploitation round will yield one of the optimal arms with probability 1. Assume that there are K arms in total. Let E denote the event that exploration occurs, and \(A_i\) be the event that arm i is chosen. Then, the expected cumulative regret for the linear contextual \(\epsilon \)-greedy is:

$$\begin{aligned} R(T)&= \sum _{t=1}^T[0P(E^c) + \sum _{i=1}^K \Delta _{i,t}P(E\cap A_i)]\\&= \frac{\epsilon }{K}\sum _{t=1}^T\sum _{i=1}^K\Delta _{i,t}\\&= \frac{\epsilon T}{K}\sum _{i=1}^K\bar{\Delta }_i~, \end{aligned}$$

where \(\bar{\Delta }_i\) is the average of \(\Delta _{i,t}\) over t, i.e., \(\bar{\Delta }_{i}=\frac{1}{T}\sum _{t=1}^T\Delta _{i,t}\). \(\square \)
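The closed-form expression in Lemma 4 is straightforward to evaluate. The following sketch, with a hypothetical function name and a made-up table of gaps \(\Delta _{i,t}\), computes it and checks that the two equivalent forms in the proof agree.

```python
import numpy as np

def eps_greedy_best_case_regret(gaps, eps):
    """Best-case expected regret (eps / K) * sum_t sum_i Delta_{i,t}.

    gaps: (T, K) array of Delta_{i,t} >= 0, with zeros for optimal arms.
    """
    gaps = np.asarray(gaps, dtype=float)
    K = gaps.shape[1]
    return eps / K * gaps.sum()

# Made-up gaps for T = 2 rounds and K = 3 arms.
gaps = np.array([[0.0, 0.5, 1.0], [0.25, 0.0, 0.25]])
regret = eps_greedy_best_case_regret(gaps, eps=0.3)

# Equivalent form (eps * T / K) * sum_i mean_t Delta_{i,t}.
T, K = gaps.shape
regret_alt = 0.3 * T / K * gaps.mean(axis=0).sum()
```

With constant \(\epsilon \) the expression grows linearly in T whenever some arm has a positive average gap, which is why a receding schedule for \(\epsilon \) is needed for sub-linear regret.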

Note that in this analysis we have used a constant \(\epsilon \) for our \(\epsilon \)-greedy algorithm. In practice, the value of \(\epsilon \) can be scheduled to recede over time. Auer et al. [5] have shown that in the case of non-contextual bandits, this regime enjoys a sub-linear upper regret bound.

Reduction from non-contextual to contextual bandits The above lower bound of the contextual \(\epsilon \)-greedy algorithm leads naturally to a lower bound for non-contextual bandits. The non-contextual bandit differs from its contextual counterpart in that it does not provide any context. In each round, the true means of each non-contextual arm remain constant and are independent of each other (i.e., \(\theta _{i,t} = \theta _i\) for all t); thus, the parameters to estimate are \(\theta _i\) for arm \(i \in [K]\). A non-contextual bandit can be formulated as a contextual bandit, as shown in Lemma 5. By performing such a reduction, essentially using a contextual bandit to act in a non-contextual setting, we can relate lower bounds between the settings.

Lemma 5

A non-contextual bandit can be formulated as a contextual bandit. Therefore, any fundamental limitations for non-contextual bandits must also hold for contextual bandits.


Proof Let the non-contextual bandit arms be \(i = 1, \dots , K\) and let the expected reward for arm i be \(\theta _i\). A contextual bandit equivalent can be constructed by setting the context for arm i as \(\varvec{x}(i) = \varvec{e}_i\), the ith standard basis vector of \({\mathbb {R}}^K\), i.e., the vector whose ith element is 1 and whose other elements are 0. Furthermore, assuming that the shared model is used, then the ith element of the true weight \(\varvec{\theta }_\star \) can be taken to be \(\theta _i\). This setting leads us to set the initial weight \(\hat{{\varvec{\mu }}}= \begin{bmatrix} {\hat{\mu }}_1&\cdots&{\hat{\mu }}_K\end{bmatrix}^T\) to provide an initial guess of the true mean of each arm \(\mu _i\) for \(i\in [K]\), with \(\varvec{V}_1 = \text {diag}(\lambda _1, \cdots , \lambda _K)\) reflecting the confidence we have for our initial estimate. A diagonal matrix is particularly chosen for this purpose since the means of each arm are independent of each other. Thus, the (contextual) estimate of \(\varvec{\theta }_\star \) is

$$\begin{aligned} \hat{\varvec{\theta }}_{t+1} = \hat{{\varvec{\mu }}}+ \varvec{V}_{t+1}^{-1}\varvec{b}_{t+1} = \hat{{\varvec{\mu }}}+ \left( \varvec{V}_1 + \sum _{s=1}^t\varvec{x}_s\varvec{x}_s^T\right) ^{-1}\sum _{s=1}^t(r_s - \hat{{\varvec{\mu }}}^T\varvec{x}_s)\varvec{x}_s. \end{aligned}$$

Now since \(\varvec{x}_s = \varvec{e}_{i_s}\), and noticing that \(\varvec{e}_i\varvec{e}_i^T = \text {diag}(\mathbb {1}(i=1),\cdots , \mathbb {1}(i=K))\) for all \(i \in [K]\), i.e., a matrix with all zero entries except at entry \((i, i)\) with value 1, we have

$$\begin{aligned} \sum _{s=1}^t\varvec{x}_s\varvec{x}_s^T&= \text {diag}\left( \sum _{s=1}^t\mathbb {1}(i_s=1),\cdots ,\sum _{s=1}^ t\mathbb {1}(i_s=K)\right) =\text {diag}(T_1,\cdots ,T_K), \\ r_s - \hat{{\varvec{\mu }}}^T\varvec{x}_s&= r_s - {\hat{\mu }}_{i_s} \end{aligned}$$


$$\begin{aligned} \sum _{s=1}^t(r_s - \hat{{\varvec{\mu }}}^T\varvec{x}_s)\varvec{x}_s=\begin{bmatrix}w_1&\dots&w_K\end{bmatrix}^T, \end{aligned}$$

where \(T_i\) is the number of times arm i is pulled and \(w_i = \sum _{s=1}^t (r_s - {\hat{\mu }}_i) \mathbb {1}(i_s = i) = \sum _{s=1}^t r_s\mathbb {1}(i_s = i) - T_{i}{\hat{\mu }}_i\) is the total sum of all the reward differences observed by arm i. Therefore, the estimate of the weight is

$$\begin{aligned} \hat{\varvec{\theta }}_{t+1}&= \hat{{\varvec{\mu }}}+ \varvec{V}_{t+1}^{-1}\varvec{b}_{t+1} \\&= \begin{bmatrix}{\hat{\mu }}_1&\cdots&{\hat{\mu }}_K\end{bmatrix}^T + \\&\qquad \qquad [\text {diag}(\lambda _1,\cdots ,\lambda _K)+\text {diag}(T_1,\cdots ,T_K)]^{-1} \begin{bmatrix}w_1&\cdots&w_K\end{bmatrix}^T \\&= \begin{bmatrix}{\hat{\mu }}_1&\cdots&{\hat{\mu }}_K\end{bmatrix}^T +[\text {diag}(\lambda _1+T_1,\cdots ,\lambda _K+T_K)]^{-1} \begin{bmatrix}w_1&\cdots&w_K\end{bmatrix}^T \\&= \begin{bmatrix}{\hat{\mu }}_1+\frac{w_1}{\lambda _1+T_1}&\dots&{\hat{\mu }}_K+\frac{w_K}{\lambda _K+T_K}\end{bmatrix}^T\\&= \begin{bmatrix}\frac{{\hat{\mu }}_1\lambda _1+\sum _{s=1}^t r_s\mathbb {1}(i_s=1)}{\lambda _1+T_1}&\dots&\frac{{\hat{\mu }}_K\lambda _K+\sum _{s=1}^t r_s\mathbb {1}(i_s=K)}{\lambda _K+T_K}\end{bmatrix}^T\\&= \begin{bmatrix}{\hat{\theta }}_1&\dots&{\hat{\theta }}_K\end{bmatrix}^T. \end{aligned}$$

This result can be interpreted as follows: for each arm \(i\in [K]\), our estimate of the true mean \(\theta _i\) is its sample mean augmented with a pseudo-observation of mean \({\hat{\mu }}_i\) worth \(\lambda _i\) observations. Indeed, when we choose \(\lambda _i=0\) for all \(i\in [K]\), we recover each arm’s mean estimate as typically calculated by a non-contextual \(\epsilon \)-greedy bandit algorithm. With this, when we exploit, we choose an arm which maximises \(\hat{\varvec{\theta }}^T \varvec{x}(i) = \hat{\varvec{\theta }}^T \varvec{e}_i = {\hat{\theta }}_i\), which is the same as what is performed in the non-contextual case. \(\square \)
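The equivalence can also be checked numerically. The following is a minimal sketch, assuming NumPy; the function names and the toy values of \(\hat{\varvec{\mu }}\), \(\lambda _i\) and the reward stream are illustrative assumptions rather than part of our experiments. It computes the contextual update with one-hot contexts and compares it with the per-arm pseudo-observation mean.

```python
import numpy as np

def contextual_estimate(mu_hat, lam, arms, rewards):
    """theta = mu_hat + V^{-1} b, with one-hot contexts x_s = e_{i_s}."""
    K = len(mu_hat)
    V = np.diag(lam).astype(float)      # V_1 = diag(lambda_1, ..., lambda_K)
    b = np.zeros(K)
    for i, r in zip(arms, rewards):
        x = np.zeros(K)
        x[i] = 1.0
        V += np.outer(x, x)             # accumulates diag(T_1, ..., T_K)
        b += (r - mu_hat @ x) * x       # accumulates the vector (w_1, ..., w_K)
    return mu_hat + np.linalg.solve(V, b)

def per_arm_estimate(mu_hat, lam, arms, rewards):
    """(mu_hat_i * lambda_i + sum of rewards of arm i) / (lambda_i + T_i)."""
    K = len(mu_hat)
    T = np.bincount(arms, minlength=K)
    S = np.bincount(arms, weights=rewards, minlength=K)
    return (mu_hat * lam + S) / (lam + T)

# Toy values (assumed for illustration only).
mu_hat = np.array([0.2, 0.5, 0.8])
lam = np.array([1.0, 2.0, 4.0])
arms = np.array([0, 1, 1, 2, 0, 2, 2])
rewards = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0])
same = np.allclose(contextual_estimate(mu_hat, lam, arms, rewards),
                   per_arm_estimate(mu_hat, lam, arms, rewards))
```

Setting every \(\lambda _i\) to zero in this sketch reduces both functions to the plain per-arm sample mean, provided each arm has been pulled at least once.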

Since a non-contextual bandit can be formulated as a contextual bandit, our approach may be applied to warm-start a non-contextual bandit. Its lower bound when the \(\epsilon \)-greedy algorithm is used follows the lower bound of contextual \(\epsilon \)-greedy, with \(\Delta _{i,t} = \bar{\Delta }_i\) for all t since the mean reward (hence the regret of each arm) is stationary across t. In other words, Lemma 4 is also a fundamental lower bound in our warm-start setting.

4 Experiments

We now report on a comprehensive suite of experimental evaluations of our warm-start framework against a number of baselines and different datasets. We are interested in the benefit of warm start over cold start—in such cases we focus on short-term performance differences, as this is a practical limitation of bandits in high-stakes applications. We also explore the impact of prior misspecification as a potential risk of incorrect warm start. We summarise our experiments next and then describe them with results in more detail below.

Datasets Experiments in database index selection explore the effect of warm start in selecting a single index per round where queries arrive to the database in batches and rewards correspond to (negative) execution time. We use a commercial database system, and the standard TPC-H benchmark [29]. Results on two OpenML datasets (Letters and Numbers) test bandits on online multi-class classification, as a benchmark previously used to evaluate the ARRoW warm-start technique [32]. These datasets are advantageous to ARRoW in that they supply the (restrictive) kind of prior knowledge needed—supervised pre-training. Experiments on synthetic data provide sufficient control of the environment to explore limitations of our warm-start approach.

Baselines On the database index selection task, we use cold start TS as a natural and fair baseline. On the OpenML datasets we include the ARRoW warm-start framework, which was originally tested in the same way. We also demonstrate the performance of both frameworks on the \(\epsilon \)-greedy and LinUCB learners, as well as LinTS. Throughout, cold start corresponds to having no pre-training dataset (i.e., Algorithm 1); hot start in the synthetic experiment corresponds to having 100% accuracy on the pre-training parameter \(\varvec{\mu }_\star \); and warm start corresponds to having an estimate of the pre-training parameter \(\varvec{\mu }_\star \), namely \(\hat{\varvec{\mu }}\). By its very nature, we can only produce hot start results with the artificial dataset, since 100% accuracy on the pre-training parameter would require an infinite number of observations in the real-world database index selection problem.

Hardware All experiments are performed on a commodity laptop equipped with Intel Core i7-6600u (2 cores, 2.60 GHz, 2.81 GHz), 16 GB RAM, and 256 GB disk (Sandisk X400 SSD) running Windows 10. In database experiments, we report cold runs only: we clear database buffer caches prior to query execution—the memory setting thus does not impact our findings.

4.1 Database index selection

As the real-world problem of database index selection motivated this work, we begin with a demonstration in this setting. In a database management system, an index is a data structure used to speed up database execution of a set of queries (a.k.a. a workload). While a huge space of possible indices could be considered, only a few can actually be created due to memory constraints (since each index occupies space in memory). With a tremendous number of candidate indices, it is impractical for humans to decide which indices to create without assistance. A recent effort has been made to automate this task by using bandits [21] to propose an optimal set of indices to boost workload execution. We adopt this recent framework in our work and extend it to support warm start. The aim of this experiment is to demonstrate that the warm-started bandit yields similar performance to the cold-started bandit in the long run while performing better in earlier rounds. The consequence of such a demonstration is a system more suitable for deployment.

In particular, our problem setting is as follows. At round \(t=1,2,\ldots ,T\), we observe a workload \(W_t\) with a set of queries, and the system recommends one index \(i_t\) out of the set of all possible indices \({\mathcal {I}}\). After index \(i_t\) is created, we execute the queries in workload \(W_t\). Our aim is to minimise the query execution time; we do not take into account the time it takes to create the index \(i_t\). After \(W_t\) is executed, the index \(i_t\) is dropped and the buffer is cleaned.

Fig. 1 Cold start vs. warm-start LinTS for database index selection on the TPC-H benchmark

In this paper, the adopted database comes from the TPC-H benchmark [29]. This publicly available industrial benchmark comes with a set of predefined query templates. A query template is a parameterised query whose parameter values (a.k.a. conditions) are missing, keeping only the structure of the query and leaving number and string values as variables. We chose five query templates at random and instantiated them with actual parameter values in each round. These queries are used as the workload in both the pre-training and deployment phases.

It should be noted that the values of R and S are unknown in the real-world dataset. In this case, we treat them as hyperparameters which need to be chosen, in addition to \(\alpha \).

In running this experiment, we have used the context features as described by Perera et al. [21], with the reward being the performance gain, defined as \(t_{no\_index} - t_{i}\), where \(t_{no\_index}\) is the execution time of the whole workload without any indices and \(t_i\) is the execution time of all queries in the workload using index i.

Due to the lack of information on the optimal index, it is impossible to compute the regret for each round. Therefore, for this real-world experiment, we present the average execution time (loss) of workload \(W_t\) under the recommendations of both algorithms, which can be found in Fig. 1.

Results It can be seen that the warm-started LinTS outperforms the cold-started LinTS, both in early rounds and cumulatively. This can be explained by the query templates used to pre-train the warm-started bandit resembling the templates used in the testing dataset. This leads to the warm-started bandit’s guess of the initial weight \(\varvec{\theta }_1 = \hat{\varvec{\mu }}\) being closer to the actual weight \(\varvec{\theta }_\star \) than the initial guess of \(\varvec{\theta }_1 = \varvec{0}\) made by the cold-started bandit.

4.2 OpenML classification dataset

We chose two of the datasets used in [32], which correspond to letters and numbers identification, respectively. We split the data such that 10% is used as the supervised learning examples and the other 90% is used as the actual bandit rounds. This advantages ARRoW [32], since supervised examples are the only form of prior knowledge it permits. We try all learners presented in this paper on these datasets: \(\epsilon \)-greedy, LinUCB and LinTS. As for the hyperparameters, we used \(\epsilon = 0.0125\) for \(\epsilon \)-greedy, \(\rho R = 0.2\) for LinUCB, \(\beta _t(\delta )=1\) for LinTS on the Letters dataset and \(\beta _t(\delta )=0.05\) for LinTS on the Numbers dataset, with \(R=0.25\). All of these hyperparameters were found iteratively by grid search.

As described in [32], we transform the dataset into one capable of evaluating bandit algorithms by mapping the classes to the arms and defining the cost of each class as \(c(a) = \mathbb {1}(a\ne y)\) given example \((x,y)\). For the classification problem, we also modify our bandit algorithm, which usually shares its parameter across the arms. Since the context of each arm is the same in the classification task, we distinguish the arms by giving each its own parameter, leading to the disjoint bandit with arm i having the weight \(\varvec{\theta }_{i,t}\). As such, its reward is modelled by the equation \(r_t(i) = \varvec{\theta }_{i,\star }^T\varvec{x}_t(i) + \epsilon _t(i)\).
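The transformation from a labelled example to bandit feedback can be sketched as follows. This is an illustrative sketch only, assuming NumPy and integer class labels in \(\{0,\ldots ,K-1\}\); the function names are hypothetical and not taken from [32].

```python
import numpy as np

def cost_vector(y, n_classes):
    """Full cost vector c(a) = 1(a != y) over all arms (classes)."""
    costs = np.ones(n_classes)
    costs[y] = 0.0
    return costs

def bandit_feedback(y, chosen_arm):
    """In the bandit rounds only the cost of the pulled arm is revealed."""
    return float(chosen_arm != y)
```

The full cost vector is what a supervised (pre-training) example provides, whereas a bandit round exposes only the single entry for the chosen arm.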

We have used the term cost instead of rewards in this dataset, which requires minor modification of the learners: we change the argmax operation into argmin and in the case of LinUCB, the upper confidence bound in Line 5 to lower confidence bound \(\hat{\varvec{\theta }}_{i,t}^T\varvec{x}_t(i) - \rho R\sqrt{\varvec{x}_t^T(i)\varvec{V}_{t}^{-1}\varvec{x}_t(i)}\).
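A hedged sketch of the resulting cost-based LinUCB selection rule is given below. It assumes NumPy, a context shared by all arms, and (as an assumption of this sketch for the disjoint model) one inverse matrix per arm; the function name and array layout are hypothetical.

```python
import numpy as np

def select_arm_lcb(theta_hat, V_inv, x, rho_R):
    """Pick the arm minimising theta_i^T x - rho_R * sqrt(x^T V_i^{-1} x).

    theta_hat: (K, d) per-arm weight estimates
    V_inv:     (K, d, d) per-arm inverse matrices (assumed layout)
    x:         (d,) context shared by all arms
    """
    mean_cost = theta_hat @ x
    width = np.sqrt(np.einsum("d,kde,e->k", x, V_inv, x))
    return int(np.argmin(mean_cost - rho_R * width))
```

An arm with a larger confidence width receives a lower optimistic cost, so under-explored arms remain attractive exactly as in the reward-based upper confidence bound.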

The ARRoW algorithm presented in [32] is also executed partially, with the size of the class \(|\Lambda |\) set to 1. We chose the best performing \(\lambda \) to be compared against our algorithm, for fairness. We note that sensitivity analysis in Figs. 3 and 4 demonstrates that the choices are generally not very important.

Following a suggestion of the original ARRoW paper, we evaluate [32, Algorithm Line 5] by solving

$$\begin{aligned} \arg \min _{f\in {\mathcal {F}}}&\left\{ (1-\lambda )\sum _{(x,c)\in S}\sum _{a=1}^K(f(x,a)- c(a))^2\right. \\ +&\left. \lambda \sum _{\tau =1}^t\frac{1}{p_{\tau ,a_\tau }}(f(x_\tau ,a_\tau )-c_\tau (a_\tau ))^2\right\} \end{aligned}$$

where \(f(x,a)\) is a linear function and \({\mathcal {F}}\) is the class of all linear functions; the solution can be obtained via weighted linear regression.
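One way to realise this weighted regression is sketched below. The sketch assumes NumPy and a disjoint linear model \(f(x,a) = \varvec{\theta }_a^T x\), under which the objective separates across arms; the function name, argument layout and the small ridge term added for numerical stability are assumptions of this sketch, not details taken from [32].

```python
import numpy as np

def arrow_weighted_fit(X_sup, C_sup, X_ban, a_ban, c_ban, p_ban, lam, ridge=1e-8):
    """Per-arm weighted least squares for the combined objective.

    Supervised examples (X_sup, C_sup) carry weight (1 - lam) for every arm;
    bandit examples carry weight lam / p for the arm that was actually pulled.
    """
    K, d = C_sup.shape[1], X_sup.shape[1]
    theta = np.zeros((K, d))
    for a in range(K):
        mask = a_ban == a
        X = np.vstack([X_sup, X_ban[mask]])
        y = np.concatenate([C_sup[:, a], c_ban[mask]])
        w = np.concatenate([np.full(len(X_sup), 1.0 - lam), lam / p_ban[mask]])
        A = X.T @ (w[:, None] * X) + ridge * np.eye(d)   # weighted Gram matrix
        theta[a] = np.linalg.solve(A, X.T @ (w * y))
    return theta
```

With \(\lambda = 0\) the fit uses the supervised examples only, and with \(\lambda = 1\) it uses only the importance-weighted bandit feedback.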

Another algorithm we used for comparison is by Li et al. [16], hereby labelled as WWW’21 for convenience (denoting the publication venue). This algorithm employs virtual plays in every round by sampling the context according to a cdf \(F_X(\varvec{x})\), estimated by its empirical cdf \({\hat{F}}_X(\varvec{x})\), ultimately equivalent to random sampling of the seen contexts with replacement. A feedback is provided by an offline evaluator whenever the online confidence band is wider than the offline counterpart. The virtual plays are continued indefinitely until the offline evaluator does not give a feedback.

We present the results for the OpenML datasets in Fig. 2, where we have labelled our algorithm diff since it models the difference between the true parameter and the guessed weight. It can be seen that our algorithm performs as well as previous algorithms, while still offering the flexibility to choose the initial guess.

Fig. 2 Comparisons of both our and ARRoW warm-start frameworks on the (column i) Letters and (ii) Numbers datasets, with learners a \(\epsilon \)-greedy, b LinUCB and c LinTS

Sensitivity analysis for this experiment (with accurate prior) is presented in Figs. 3 and 4. As mentioned, neither ARRoW nor our warm-start approach are very sensitive to their hyperparameters, while the algorithm proposed by Li et al. [16] does not require any hyperparameter tuning. These results also support our choice of \(\alpha = 10^7\) across these experiments.

Fig. 3 Sensitivity analysis showing total cumulative cost achieved vs. hyperparameter on the Letters dataset. Column (i) demonstrates ARRoW results with varying \(\lambda \) while column (ii) shows our warm-start approach Diff with varying \(\alpha \). Finally the learners vary over a \(\epsilon \)-greedy, b LinUCB, c LinTS

Fig. 4 Sensitivity analysis showing total cumulative cost achieved vs. hyperparameter on the Numbers dataset. Column (i) demonstrates ARRoW results with varying \(\lambda \) while column (ii) shows our warm-start approach Diff with varying \(\alpha \). Finally the learners vary over a \(\epsilon \)-greedy, b LinUCB, c LinTS

Effect of warm-start on exploration hyperparameters In this section, we present the final cumulative cost as a means of measuring the performance of warm-started bandit under different exploration hyperparameters. As previously observed from Figs. 3 and 4, the temperature hyperparameter does not appear to have a significant impact on final performance. Thus, for this analysis, we again fixed the value \(\alpha = 10^7\). We reran the experiment for both Letters and Numbers datasets using the \(\epsilon \)-greedy, LinUCB, and LinTS algorithms, varying the value of the exploration hyperparameters \(\epsilon \), \(\rho R\) and \(\beta \), respectively. The results, as shown in Fig. 5, suggest that lower values of the exploration hyperparameters are preferred. This is intuitive since a goal of warm-starting bandits is to reduce the demand on exploration during initial rounds. This effect is very prominent, especially in the \(\epsilon \)-greedy algorithm. This can be explained by the fact that exploration in the \(\epsilon \)-greedy is strictly dictated by the value of \(\epsilon \), while in LinUCB and LinTS the exploration terms are partly influenced by the matrix \(\varvec{V}_t\), which initially depends on the covariance matrix \(\varvec{\Sigma }_\mu \). Therefore, in \(\epsilon \)-greedy, we recommend ‘manually’ reducing the exploration hyperparameter \(\epsilon \), while in LinUCB the exploration is partially automatically reduced thanks to the lower exploration boost when \(\varvec{\Sigma }_\mu \) has smaller eigenvalues.

Fig. 5 Effect of warm-starting the bandits showing total cumulative cost achieved vs. exploration hyperparameter. Column (i) is on the Letters dataset while column (ii) is on Numbers. The learners vary over a \(\epsilon \)-greedy, b LinUCB and c LinTS. The performance appears better when the exploration hyperparameter is relatively small

Effect of pre-training data ratio on performance As previously done in Zhang et al. [32], we can explore the fraction of the dataset available for pre-training. In this section, we present how the cumulative cost evolves as the ratio of the pre-training dataset to the total dataset changes. Here the total dataset refers to the union of the pre-training dataset and the bandit deployment dataset. In particular, we investigate the performance for each of the ratios in \(\{0, 0.001, 0.002, 0.003, 0.004, 0.005, 0.01, 0.05, 0.1\}\). For fairness, all experiments with different ratios on the same dataset share the same deployment data; thus the maximum ratio in the experiment, which is 0.1, is used to determine the deployment dataset. Since there are 20,000 examples in the Letters dataset and 2000 in the Numbers dataset, we used the last 18,000 and 1800 examples of the Letters and Numbers datasets, respectively. Figure 6 supports the intuition that higher ratios likely lead to better performance. This effect is particularly apparent during the initial increase, while the gain gradually fades away as the ratio is increased further. This diminishing return can be explained since the biggest improvement in the correctness of \(\theta \) occurs at the beginning of the supervised learning, whereas its accuracy, while increasing, improves more slowly as more data are observed.
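The split can be sketched as below. This is an illustrative sketch in plain Python; the function name is hypothetical, and taking the pre-training examples from the start of the dataset is an assumption of the sketch (the text above fixes only the deployment portion).

```python
def split_indices(n_total, ratio, max_ratio=0.1):
    """Return (pre-training indices, deployment indices) as ranges.

    The deployment set is fixed by max_ratio so that every ratio is
    evaluated on the same bandit rounds.
    """
    if not 0.0 <= ratio <= max_ratio:
        raise ValueError("ratio must lie in [0, max_ratio]")
    deploy_start = round(n_total * max_ratio)
    n_pretrain = round(n_total * ratio)
    return range(0, n_pretrain), range(deploy_start, n_total)

# Letters: 20,000 examples, 5% used for pre-training.
pretrain_idx, deploy_idx = split_indices(20_000, 0.05)
```

A ratio of 0 yields an empty pre-training range, which corresponds to the cold-start case.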

Fig. 6 Effect of different ratios of pre-training data (fraction of full dataset used in pre-training). Column (i) is on the Letters dataset while column (ii) is on Numbers. The learners vary over a \(\epsilon \)-greedy, b LinUCB and c LinTS. The 2001st to 20,000th data and the 201st to 2000th data are used as the deployment data in the Letters and Numbers datasets, respectively, regardless of the ratio used

Effect of misspecified pre-training data ratio on performance A series of experiments investigating sensitivity to the warm-start temperature and exploration hyperparameters was carried out. We also investigated the effect of the fraction of dataset used as pre-training in both settings: accurate prior and misspecified prior.

We investigated the effect of a misspecified prior with both datasets. For this, we need to create another dataset in which the true weight \(\varvec{\theta }_\star \) is different from the deployment dataset’s. To do this, we trained a linear regression on the whole dataset for each arm i, giving us the disjoint parameter \(\varvec{\theta }_1(i)\), which is then transformed by a rotation matrix \(\varvec{R}_\gamma \) to give a new parameter \(\varvec{\theta }_2(i) = \varvec{R}_\gamma \varvec{\theta }_1(i)\). For each datum at round t used for pre-training, we extracted the context \(\varvec{x}_t(i)\) for all arms, then calculated \(d_r(i) = (\varvec{\theta }_2(i)-\varvec{\theta }_1(i))^T\varvec{x}_t(i)\). This acts as the perturbation of the original reward \(r_t(i)\), yielding the inaccurate reward \(r_t'(i) = r_t(i) + d_r(i)\). In our data generation, we calculated the similarity between the two parameters, obtaining \(\cos (\varvec{\theta }_1, \varvec{\theta }_2) = \frac{\langle \varvec{\theta }_1, \varvec{\theta }_2 \rangle }{\Vert \varvec{\theta }_1\Vert \Vert \varvec{\theta }_2\Vert }= \frac{1}{\sqrt{2}}\) for all arms and both datasets. This consistent rotation attempts to maintain a similar amount of misspecification across datasets; however, as we shall see, properties of the data interact with the magnitude of perturbation.
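The perturbation can be illustrated with the sketch below, assuming NumPy. The particular rotation used here (the same angle applied in each consecutive coordinate pair, which requires an even dimension) is only one possible construction chosen for this sketch because it gives cosine similarity \(\cos \gamma \) for every vector; the function names are hypothetical.

```python
import numpy as np

def block_rotation(d, gamma):
    """Rotate by angle gamma in each consecutive coordinate pair (even d assumed)."""
    if d % 2:
        raise ValueError("this sketch assumes an even dimension d")
    c, s = np.cos(gamma), np.sin(gamma)
    R = np.zeros((d, d))
    for j in range(0, d, 2):
        R[j:j + 2, j:j + 2] = [[c, -s], [s, c]]
    return R

def perturbed_reward(r, x, theta1, theta2):
    """r'_t(i) = r_t(i) + (theta_2(i) - theta_1(i))^T x_t(i)."""
    return r + (theta2 - theta1) @ x

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

With \(\gamma = \pi /4\) this construction gives a cosine similarity of \(1/\sqrt{2}\) between \(\varvec{\theta }_1(i)\) and \(\varvec{\theta }_2(i)\) regardless of the arm.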

Due to the nature of the semi-synthetic dataset generation process, the reward might no longer be in \(\{0,1\}\) as previously generated from the classification problem. This observation does not affect the validity of the model, or the appropriateness of warm-start in this setting, thanks to the flexibility of reward structures accommodated.

We present our results in Fig. 7. In contrast to the previous experiment, we no longer have the privilege of a pre-training dataset that closely resembles the deployment data. It can be seen that for the Letters dataset, some warm-starting provides a modest initial boost to performance, while warm-starting appears to hurt performance on the Numbers dataset.

Fig. 7 Effect of different fractions of misspecified pre-training data. Column (i) is on the Letters dataset while column (ii) is on Numbers. The learners vary over a \(\epsilon \)-greedy, b LinUCB and c LinTS. The 2001st to 20,000th data and the 201st to 2000th data are used as the deployment data in the Letters and Numbers datasets, respectively, regardless of the ratio used

4.3 Synthetic experiments

In generating the artificial dataset, we started by choosing a value for \(\varvec{\theta }_\star \). In this case, we chose \(\varvec{\theta }_\star ^T = \begin{bmatrix} 0.1&0.3&0.5&0.7&0.9 \end{bmatrix}\), with the bandit having 10 arms. After the value of \(\varvec{\theta }_\star \) is chosen, we generate a random vector \(\varvec{x}_t(i) \in {\mathbb {R}}^d,\,d=5\), where each element is drawn from the uniform distribution \(U(0, 1)\), for each \(i = 1, 2, \cdots , 10\), followed by taking the inner product and adding Gaussian noise \(\epsilon _i(t) \sim {\mathcal {N}}(0, R^2),\,R=0.25\), independent of the arm i and round number t. The noisy reward \(r_i(t) = \varvec{\theta }_\star ^T \varvec{x}_t(i) + \epsilon _i(t)\) is saved, as well as the regret of pulling arm i, namely \(\max _{j \in [K]} \varvec{\theta }_\star ^T \varvec{x}_t(j) - \varvec{\theta }_\star ^T \varvec{x}_t(i)\). This makes it possible to compare all bandit algorithms equally without needing off-policy evaluation. We repeat this process 100,000 times, which corresponds to 100,000 rounds of the second phase dataset.
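The generation procedure can be sketched as follows, assuming NumPy. The function name and seed are hypothetical, and the sketch stores the regret of each arm as the non-negative gap between the best arm's expected reward and that arm's expected reward.

```python
import numpy as np

def generate_rounds(theta_star, n_rounds, n_arms=10, noise_sd=0.25, seed=0):
    """Return contexts, noisy rewards and per-arm regrets for every round."""
    rng = np.random.default_rng(seed)
    d = len(theta_star)
    X = rng.uniform(0.0, 1.0, size=(n_rounds, n_arms, d))    # contexts x_t(i)
    expected = X @ theta_star                                 # (n_rounds, n_arms)
    rewards = expected + rng.normal(0.0, noise_sd, size=expected.shape)
    regrets = expected.max(axis=1, keepdims=True) - expected  # 0 for the best arm
    return X, rewards, regrets

theta_star = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
X, rewards, regrets = generate_rounds(theta_star, n_rounds=1000)
```

Because rewards and regrets are stored for every arm in every round, any learner can be replayed on the same data and its cumulative regret read off directly.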

Fig. 8 Artificial dataset experimental results for a an accurate prior and b a misspecified prior, comparing cold-, warm- and hot-start LinTS

To generate the pre-training dataset, we first choose the value of \(\alpha ^{-1}\), before sampling the true parameter deviation \(\varvec{\delta }_\star \sim {\mathcal {N}}(\varvec{0}, \alpha ^{-1}\varvec{I})\). After the value \(\varvec{\delta }_\star \) is sampled, we calculate \(\varvec{\mu }_\star = \varvec{\theta }_\star - \varvec{\delta }_\star \) and follow exactly the same process used to generate the second phase dataset. We generated two types of pre-training dataset: accurate prior, where we chose \(\alpha ^{-1} = 10^{-4}\), and misspecified prior, where we chose \(\alpha ^{-1} = 0.25\). We produced 10,000 rounds of pre-training data.
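Sampling the pre-training parameter can be sketched as below, assuming NumPy; the function name and seed are hypothetical. Note that \(\alpha ^{-1}\) is a variance, so the standard deviation passed to the sampler is its square root.

```python
import numpy as np

def sample_pretraining_parameter(theta_star, alpha_inv, seed=0):
    """Draw delta_star ~ N(0, alpha^{-1} I) and return (mu_star, delta_star)."""
    rng = np.random.default_rng(seed)
    delta_star = rng.normal(0.0, np.sqrt(alpha_inv), size=theta_star.shape)
    return theta_star - delta_star, delta_star

theta_star = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
mu_accurate, delta_accurate = sample_pretraining_parameter(theta_star, 1e-4)
mu_misspec, delta_misspec = sample_pretraining_parameter(theta_star, 0.25)
```

The pre-training rounds are then generated exactly as for the deployment phase, with \(\varvec{\mu }_\star \) in place of \(\varvec{\theta }_\star \).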

We observed that, with the dataset generated both from the accurate and misspecified prior regime, \(\alpha = 10\) seems to be the cut-off point where all algorithms work quite well. Therefore, we plot for all warm-starting methods the cumulative regret for \(\alpha = 10\), as shown in Fig. 8.

Results In the accurate prior regime, it is clear that the hot-started and warm-started bandits outperform the cold-started bandit. This can be explained by the fact that the value of \(\varvec{\theta }_\star \) is closer to \(\hat{\varvec{\mu }}\) or \(\varvec{\mu }_\star \) as opposed to \(\varvec{0}\). However, the opposite problem occurs when the prior is misspecified, as the cold-start bandit slightly outperforms the hot-started bandit and warm-started bandit, due to the fact that \(\varvec{\theta }_\star \) is closer to \(\varvec{0}\) compared to \(\hat{\varvec{\mu }}\) or \(\varvec{\mu }_\star \).

It should also be noted that we have held the hyperparameter \(\alpha \) fixed across all regimes here. When the hyperparameter \(\alpha \) is tuned optimally, the hot-started and cold-started bandits are able to perform even better, as the pre-training dataset is treated as if it were the real dataset.

5 Towards adaptive drift hyperparameter

In this section, we take a closer look at a key hyperparameter of our warm-start algorithms: the drift hyperparameter \(\alpha \) which controls how much exploration follows pre-training. While this has so far been set manually, based on how much the operator believes pre-training to be aligned with deployment time, in practice we believe this parameter may sometimes be difficult to set.

Limitations of the current approach Our current approach to warm-starting, as applied in Algorithms 2, 3 and 4, centres on the selection of the drift hyperparameter. This drift hyperparameter \(\alpha \) has been used as a means for temperature tuning: how much can we trust the initial weight guess? With an accurate prior, a sufficiently large value of \(\alpha \) will give the bandit an early advantage in the deployment phase as unnecessary exploration is eliminated. On the other hand, although the warm-started bandit is somewhat insensitive to \(\alpha \) with an accurate prior, its sensitivity is greatly amplified when the prior is highly misspecified; a large \(\alpha \) value makes the bandit retain its highly misaligned initial guess and resist changes suggested by observations. Therefore, it is advantageous to choose a value of \(\alpha \) which is not too far from its optimum. Alternatively, we may attempt to adapt \(\alpha \) based on data, which is the approach adopted in this section.

Empirical Bayes We choose the value of \(\alpha \) using the fact that, even though this hyperparameter is completely unknown before the deployment phase starts, a better estimate can be made as we observe more data from the deployment phase. If the data match how the initial weight was chosen, we may decide to put more trust in \(\hat{{\varvec{\mu }}}\) (large \(\alpha \)). On the other hand, we may decide to doubt our initial weight when the observed data do not support it (small \(\alpha \)). This strategy invites adoption of empirical Bayes, a general method of using observations to estimate or set prior distributions.

Assumptions In an attempt to do this, we make a hierarchical structure assumption such that \(\bar{\varvec{\delta }}\mid \alpha \sim {\mathcal {N}}(\varvec{0}, \alpha ^{-1}\varvec{I}_d)\), where \(\alpha \sim \Gamma (\bar{\alpha }, \bar{\beta })\) for convenience. Furthermore, in order to obtain a well-known distribution, we also assume that \(\varvec{\theta }_{\star } = \hat{{\varvec{\mu }}}+ \bar{\varvec{\delta }}_{\star }\) as represented by the random variable \(\varvec{\theta } = \hat{{\varvec{\mu }}}+ \bar{\varvec{\delta }}\) for deterministic \(\hat{{\varvec{\mu }}}\), where the dissimilarity between \(\hat{{\varvec{\mu }}}\) and \(\varvec{\theta }_\star \) is captured by the random variable \(\alpha \) embedded in \(\bar{\varvec{\delta }}\). Compared to the initial assumption, \(\alpha \) is now treated as a random variable and the variance of the initial guess \(\varvec{\Sigma }_{\mu }\) is now absorbed and partially represented by \(\alpha \).

Lemma 6

With the above assumptions, the marginal \(\bar{\varvec{\delta }}\) follows a multivariate student-t distribution with degrees of freedom \(\nu _t\), location \(\varvec{\mu }_t\) and scale matrix \(\varvec{\Sigma }_t\), denoted \(St(\nu _t, \mu _t, \varvec{\Sigma }_t)\), with \(\nu _t = 2\bar{\alpha }= 2\bar{\beta }\alpha _t, \mu _t = \varvec{0}, \varvec{\Sigma }_t = \frac{\bar{\beta }}{\bar{\alpha }}\varvec{I}_d = \alpha _t^{-1}\varvec{I}_d\).


Firstly, notice that in the case of a one-dimensional (scalar) weight, the joint distribution of \((\bar{\delta }, \alpha )\) collapses to a normal-gamma distribution. It is a standard result that the marginal distribution of \(\bar{\delta }\) then follows a non-standardised Student-t distribution with degrees of freedom \(\nu _t = 2\alpha \), location \(\mu _t = \mu \) and scale \(\sigma ^2_t = \frac{\beta }{\alpha }\), so we expect a similar result for multidimensional \(\bar{\varvec{\delta }}\).

To prove the main result, we compute the required marginal density by marginalising \(\alpha \) out of the joint distribution, itself found by multiplying the model’s likelihood and prior. We note that the integrand of the fifth equation is the pdf of a gamma distribution with shape \(\bar{\alpha }+\frac{d}{2}\) and rate \(\bar{\beta }+\frac{1}{2}\bar{\varvec{\delta }}^T\bar{\varvec{\delta }}\), and hence integrates to 1:

$$\begin{aligned} p_{\bar{\varvec{\delta }}}(\bar{\varvec{\delta }})&= \int _0^\infty p_{\bar{\varvec{\delta }}\mid \alpha }(\bar{\varvec{\delta }}\mid \alpha )p_{\alpha }(\alpha )\,d\alpha \\&= \int _0^\infty (2\pi )^{-\frac{d}{2}}\det (\alpha ^{-1}\varvec{I}_d)^ {-\frac{1}{2}}\exp \left\{ -\frac{1}{2}\bar{\varvec{\delta }}^T(\alpha ^{-1}\varvec{I}_d)^{-1}\bar{\varvec{\delta }}\right\} \varvec{\cdot }\\&\qquad \qquad \frac{\bar{\beta }^{\bar{\alpha }}}{\Gamma (\bar{\alpha })} \alpha ^{\bar{\alpha }-1}e^{-\bar{\beta }\alpha }\,d\alpha \\&= \int _0^{\infty }(2\pi )^{-\frac{d}{2}}\alpha ^{\frac{d}{2}} \exp \left\{ -\frac{\alpha }{2}\bar{\varvec{\delta }}^T\bar{\varvec{\delta }}\right\} \frac{\bar{\beta }^{\bar{\alpha }}}{\Gamma (\bar{\alpha })}\alpha ^{\bar{\alpha }-1}\exp \{-\bar{\beta }\alpha \}\,d\alpha \\&= \frac{(2\pi )^{-\frac{d}{2}}\bar{\beta }^{\bar{\alpha }}}{\Gamma (\bar{\alpha })} \int _0^\infty \alpha ^{\bar{\alpha }+\frac{d}{2}-1}\exp \left\{ -\alpha (\bar{\beta }+\frac{1}{2}\bar{\varvec{\delta }}^T\bar{\varvec{\delta }})\right\} \,d\alpha \\&= \frac{(2\pi )^{-\frac{d}{2}}\bar{\beta }^{\bar{\alpha }}}{\Gamma (\bar{\alpha })} \frac{\Gamma (\bar{\alpha }+\frac{d}{2})}{(\bar{\beta }+\frac{1}{2} \bar{\varvec{\delta }}^T\bar{\varvec{\delta }})^{\bar{\alpha }+\frac{d}{2}}}\varvec{\cdot }\\&\qquad \qquad \int _0^\infty \frac{(\bar{\beta }+\frac{1}{2}\bar{\varvec{\delta }}^T\bar{\varvec{\delta }})^ {\bar{\alpha }+\frac{d}{2}}}{\Gamma (\bar{\alpha }+\frac{d}{2})}\alpha ^{(\bar{\alpha }+\frac{d}{2})-1}\exp ^{-(\bar{\beta }+\frac{1}{2}\bar{\varvec{\delta }}^T\bar{\varvec{\delta }})\alpha }\,d\alpha \\&= \frac{(2\pi )^{-\frac{d}{2}}\bar{\beta }^{\bar{\alpha }}\Gamma (\bar{\alpha }+\frac{d}{2})}{(\bar{\beta }+\frac{1}{2}\bar{\varvec{\delta }}^T\bar{\varvec{\delta }})^{\bar{\alpha }+\frac{d}{2}}\Gamma (\bar{\alpha })}\\&= \frac{1}{2^{\frac{d}{2}}\pi ^{\frac{d}{2}}}\bar{\beta }^{\bar{\alpha }} \frac{\Gamma (\frac{2\bar{\alpha }+d}{2})}{\Gamma (\frac{2\bar{\alpha 
}}{2})}(\bar{\beta }+\frac{1}{2}\bar{\varvec{\delta }}^T\bar{\varvec{\delta }})^{-\frac{2\bar{\alpha }+d}{2}}\\&= \frac{1}{2^{\frac{d}{2}}\pi ^{\frac{d}{2}}}\bar{\beta }^{\bar{\alpha }} \frac{\Gamma (\frac{2\bar{\alpha }+d}{2})}{\Gamma (\frac{2\bar{\alpha }}{2})}\bar{\beta }^ {-\frac{2\bar{\alpha }+d}{2}}\left( 1+\frac{1}{2}\bar{\varvec{\delta }}^T\frac{1}{\bar{\beta }} \varvec{I}_d\bar{\varvec{\delta }}\right) ^{-\frac{2\bar{\alpha }+d}{2}}\\&= \frac{1}{2^{\frac{d}{2}}\pi ^{\frac{d}{2}}}\frac{\Gamma (\frac{2\bar{\alpha }+d}{2})}{\Gamma (\frac{2\bar{\alpha }}{2})}\bar{\beta }^{-\frac{d}{2}}\left( 1+\frac{1}{2\bar{\alpha }} \bar{\varvec{\delta }}^T\frac{\bar{\alpha }}{\bar{\beta }}\varvec{I}_d\bar{\varvec{\delta }}\right) ^{-\frac{2\bar{\alpha }+d}{2}}\\&= \frac{1}{2^{\frac{d}{2}}\pi ^{\frac{d}{2}}} \frac{\bar{\alpha }^\frac{d}{2}}{\bar{\alpha }^\frac{d}{2}}\frac{\Gamma (\frac{2\bar{\alpha }+d}{2})}{\Gamma (\frac{2\bar{\alpha }}{2})}\bar{\beta }^{-\frac{d}{2}}\left( 1+\frac{1}{2\bar{\alpha }}\bar{\varvec{\delta }}^T\left( \frac{\bar{\beta }}{\bar{\alpha }}\varvec{I}_d\right) ^{-1}\bar{\varvec{\delta }}\right) ^{-\frac{2\bar{\alpha }+d}{2}}\\&= \frac{1}{(2\bar{\alpha })^{\frac{d}{2}}\pi ^{\frac{d}{2}}}\left( \frac{\bar{\alpha }}{\bar{\beta }}\right) ^{\frac{d}{2}}\frac{\Gamma (\frac{2\bar{\alpha }+d}{2})}{\Gamma (\frac{2\bar{\alpha }}{2})}\left[ 1+\frac{1}{2\bar{\alpha }}\bar{\varvec{\delta }}^T\left( \frac{\bar{\beta }}{\bar{\alpha }}\varvec{I}_d\right) ^{-1}\bar{\varvec{\delta }}\right] ^{-\frac{2\bar{\alpha }+d}{2}}\\&= \frac{\Gamma (\frac{2\bar{\alpha }+d}{2})}{\Gamma (\frac{2\bar{\alpha }}{2})(2\bar{\alpha })^ {\frac{d}{2}}\pi ^{\frac{d}{2}}\left[ \left( \frac{\bar{\beta }}{\bar{\alpha }}\right) ^d\right] ^ {\frac{1}{2}}}\left[ 1+\frac{1}{2\bar{\alpha }}\bar{\varvec{\delta }}^T\left( \frac{\bar{\beta }}{\bar{\alpha }}\varvec{I}_d\right) ^ {-1}\bar{\varvec{\delta }}\right] ^{-\frac{2\bar{\alpha }+d}{2}}\\&= \frac{\Gamma (\frac{2\bar{\alpha }+d}{2})}{\Gamma 
(\frac{2\bar{\alpha }}{2})(2\bar{\alpha })^ {\frac{d}{2}}\pi ^{\frac{d}{2}}\left[ \det \left( \frac{\bar{\beta }}{\bar{\alpha }}\varvec{I}_d\right) \right] ^ {\frac{1}{2}}}\varvec{\cdot }\\&\qquad \qquad \left[ 1+\frac{1}{2\bar{\alpha }}(\bar{\varvec{\delta }}-\varvec{0})^T\left( \frac{\bar{\beta }}{\bar{\alpha }}\varvec{I}_d\right) ^{-1}(\bar{\varvec{\delta }}-\varvec{0})\right] ^{-\frac{2\bar{\alpha }+d}{2}}\\&= \frac{\Gamma (\frac{\nu _t+d}{2})}{\Gamma (\frac{\nu _t}{2})\nu _t^{\frac{d}{2}} \pi ^{\frac{d}{2}}\left( \det \varvec{\Sigma _t}\right) ^{\frac{1}{2}}}\left[ 1+\frac{1}{\nu _t}(\bar{\varvec{\delta }}-\varvec{\mu _t})^T\varvec{\Sigma _t}^{-1}(\bar{\varvec{\delta }}-\varvec{\mu _t})\right] ^{-\frac{\nu _t+d}{2}}, \end{aligned}$$

which is the density of a multivariate t-distribution with \(\nu _t=2\bar{\alpha }, \varvec{\mu }_t=\varvec{0}\) and \(\varvec{\Sigma }_t=\alpha _t^{-1}\varvec{I}_d=\frac{\bar{\beta }}{\bar{\alpha }}\varvec{I}_d\). Therefore, we conclude that \(\bar{\varvec{\delta }}\sim St(2\bar{\alpha }, \varvec{0}, \frac{\bar{\beta }}{\bar{\alpha }}\varvec{I}_d)\), i.e., a student-t distribution with zero location and spherical scale matrix. Notice that we can express \(\nu _t\) in terms of \(\alpha _t\) and \(\bar{\beta }\) as \(\nu _t = 2\bar{\beta }\alpha _t\) since \(\alpha _t=\frac{\bar{\alpha }}{\bar{\beta }}\). By setting the hyperparameters in terms of \((\alpha _t, \bar{\beta })\), we control the prior of \(\alpha \) by its mean \(\alpha _t\) and variance \(\frac{\alpha _t}{\bar{\beta }}\), which is more intuitive than using its shape and rate \((\bar{\alpha },\bar{\beta })\). \(\square \)
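Lemma 6 can be sanity-checked numerically. The sketch below, assuming NumPy, integrates the Gaussian-gamma mixture over \(\alpha \) with a simple trapezoid rule and compares the result with the closed-form Student-t density; the function names, grid settings and test point are assumptions of this sketch.

```python
import math
import numpy as np

def student_t_pdf(delta, nu, scale_sq):
    """Density of St(nu, 0, scale_sq * I_d) evaluated at delta."""
    d = len(delta)
    log_norm = (math.lgamma((nu + d) / 2) - math.lgamma(nu / 2)
                - 0.5 * d * math.log(nu * math.pi * scale_sq))
    quad = float(delta @ delta) / scale_sq
    return math.exp(log_norm - 0.5 * (nu + d) * math.log1p(quad / nu))

def mixture_pdf(delta, a_bar, b_bar, n_grid=200_001, upper=200.0):
    """Integrate N(delta; 0, alpha^{-1} I_d) * Gamma(alpha; a_bar, b_bar) over alpha."""
    d = len(delta)
    alpha = np.linspace(1e-9, upper, n_grid)
    log_gauss = 0.5 * d * np.log(alpha / (2 * math.pi)) - 0.5 * alpha * float(delta @ delta)
    log_gamma = (a_bar * math.log(b_bar) - math.lgamma(a_bar)
                 + (a_bar - 1) * np.log(alpha) - b_bar * alpha)
    f = np.exp(log_gauss + log_gamma)
    return float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(alpha)))  # trapezoid rule

delta = np.array([0.3, -0.5, 0.2])
a_bar, b_bar = 3.0, 2.0
numeric = mixture_pdf(delta, a_bar, b_bar)
closed_form = student_t_pdf(delta, nu=2 * a_bar, scale_sq=b_bar / a_bar)
```

The two values agree to within the accuracy of the numerical integration, with \(\nu _t = 2\bar{\alpha }\) and scale \(\bar{\beta }/\bar{\alpha }\) as stated in the lemma.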

Following Song and Xia [27], we adopt noise such that

$$\begin{aligned} \varvec{\epsilon } \sim St\left( 2\bar{\alpha }+ d, \varvec{0}, \frac{2\bar{\alpha }}{2\bar{\alpha }+d}\left( 1+\frac{1}{2\bar{\beta }}\Vert \bar{\varvec{\delta }}\Vert _2^2\right) \beta _t^{-1}\varvec{I}_n\right) ~. \end{aligned}$$

Adaptive hyperparameter algorithm Since \(\bar{\varvec{\delta }}\) follows a Student-t distribution, our assumptions follow the premise laid out by Song and Xia [27]. By rewriting \(\varvec{X} = \begin{bmatrix}\varvec{x}_1&\cdots&\varvec{x}_n\end{bmatrix}^T\) and \(\varvec{y} = \begin{bmatrix}y_1&\cdots&y_n\end{bmatrix}^T\), the values of \(\alpha _t\) and \(\beta _t\) can then be optimised by the q-EM algorithm of Song and Xia [27], summarised in Algorithm 5. This algorithm takes \(\bar{\beta }\) as its hyperparameter, which controls the degrees of freedom in the underlying distribution of \(\bar{\varvec{\delta }}\): when a Gaussian distribution of \(\bar{\varvec{\delta }}\) is preferred, we let \(\nu _t \rightarrow \infty \) by letting \(\bar{\beta }\rightarrow \infty \), recovering the Gaussian distribution from the t-distribution.
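The Gaussian limit can be checked numerically. The sketch below is restricted to the univariate case \(d=1\) for simplicity, which is an assumption of this illustration; the helper names and the test point are likewise illustrative. It compares the log-density of \(St(2\bar{\beta }\alpha _t, 0, \alpha _t^{-1})\) with that of a zero-mean Gaussian of variance \(\alpha _t^{-1}\) as \(\bar{\beta }\) grows.

```python
import math


def student_t_logpdf(x, nu, scale2):
    """Log-density of a univariate Student-t with location 0 and squared scale scale2."""
    return (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
            - 0.5 * math.log(nu * math.pi * scale2)
            - (nu + 1) / 2 * math.log1p(x * x / (nu * scale2)))


def gaussian_logpdf(x, var):
    """Log-density of a zero-mean Gaussian with variance var."""
    return -0.5 * math.log(2 * math.pi * var) - x * x / (2 * var)


alpha_t = 4.0   # illustrative prior mean of alpha
x = 0.7         # arbitrary evaluation point
gaps = []
for beta_bar in (1.0, 1e2, 1e4):
    nu = 2 * beta_bar * alpha_t   # degrees of freedom grow linearly with beta_bar
    gap = abs(student_t_logpdf(x, nu, 1 / alpha_t) - gaussian_logpdf(x, 1 / alpha_t))
    gaps.append(gap)
```

In this sketch the entries of `gaps` shrink as \(\bar{\beta }\) increases, consistent with the t-distribution approaching the Gaussian.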

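Algorithm 5 (figure): q-EM optimisation of the hyperparameters \(\alpha _t\) and \(\beta _t\), following Song and Xia [27]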

Some steps in Algorithm 5 require expensive computations. To mitigate such costs, Song and Xia [27] suggest diagonalising the Gram matrix \(\varvec{X}^T\varvec{X} = \varvec{P}\varvec{D}\varvec{P}^T\) and computing the following quantities beforehand:

$$\begin{aligned} \varvec{y}_p = \varvec{X}^T\varvec{y},\quad \varvec{y}_{pV} = \varvec{P}^T\varvec{y}_p,\quad \Vert \varvec{y}\Vert _2^2~. \end{aligned}$$

The required quantities in each iteration can then be calculated as:

$$\begin{aligned} \varvec{\mu }_{opt}&= \varvec{P}\left( \varvec{D}+\frac{\alpha _t}{\beta _t}\varvec{I}_d\right) ^{-1}\varvec{y}_{pV}\\ \varvec{y}^T\varvec{B}_{opt}^{-1}\varvec{y}&= \beta _t\left( \Vert \varvec{y}\Vert _2^2-\varvec{y}_p^T\varvec{\mu _{opt}}\right) \\ tr(\varvec{C}_{opt})&= \frac{\nu +\varvec{y}^T\varvec{B}_{opt}^{-1}\varvec{y}}{\nu +n}tr((\alpha _t\varvec{I}_d+\beta _t\varvec{D})^{-1})\\ tr(\varvec{X}^T\varvec{X}\varvec{C}_{opt})&= \frac{\nu +\varvec{y}^T\varvec{B}_{opt}^{-1}\varvec{y}}{\nu +n}tr(\varvec{D}(\alpha _t\varvec{I}_d+\beta _t\varvec{D})^{-1})\\ \Vert \varvec{y}-\varvec{X}\varvec{\mu }_{opt}\Vert _2^2&= \Vert \varvec{y}\Vert _2^2-2\varvec{y}_p^T\varvec{\mu }_{opt} + \Vert \varvec{X}\varvec{\mu }_{opt}\Vert _2^2~, \end{aligned}$$

where the Woodbury matrix identity is used in the second equation and the cyclic property of the trace operation is used in the fourth equation. We note that when one wishes to store \(\varvec{X}^T\varvec{X}\) rather than \(\varvec{X}\), the last term of the final equation can be calculated as

$$\begin{aligned} \Vert \varvec{X}\varvec{\mu }_{opt}\Vert _2^2 = \varvec{\mu }_{opt}^T(\varvec{X}^T\varvec{X})\varvec{\mu }_{opt}=\Vert \varvec{\mu }_{opt}\Vert _{\varvec{X}^T\varvec{X}}^2~. \end{aligned}$$

These quantities can then be used to calculate \(b_{opt}\) and \(c_{opt}\) which yield new \(\alpha _t\) and \(\beta _t\) until convergence.
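The precomputation can be sketched in NumPy and checked against direct evaluation. Several details here are assumptions of the sketch rather than statements from the text: \(\varvec{B}_{opt}\) is taken to be \(\beta _t^{-1}\varvec{I}_n+\alpha _t^{-1}\varvec{X}\varvec{X}^T\) (a form consistent with the Woodbury step in the second equation), the data are random, and the values of `alpha_t`, `beta_t` and `nu` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 4
X = rng.normal(size=(n, d))      # synthetic design matrix (assumption)
y = rng.normal(size=n)           # synthetic rewards (assumption)
alpha_t, beta_t, nu = 2.0, 16.0, 8.0

# One-off precomputation: X itself is not needed afterwards.
gram = X.T @ X
D, P = np.linalg.eigh(gram)      # gram = P @ diag(D) @ P.T
y_p = X.T @ y
y_pV = P.T @ y_p
y_sq = y @ y

# Per-iteration quantities, using only the cached terms.
mu_opt = P @ (y_pV / (D + alpha_t / beta_t))
quad = beta_t * (y_sq - y_p @ mu_opt)                 # y^T B_opt^{-1} y
factor = (nu + quad) / (nu + n)
tr_C = factor * np.sum(1.0 / (alpha_t + beta_t * D))  # tr(C_opt)
tr_GC = factor * np.sum(D / (alpha_t + beta_t * D))   # tr(X^T X C_opt)
resid = y_sq - 2 * (y_p @ mu_opt) + mu_opt @ gram @ mu_opt

# Direct (more expensive) reference values for comparison.
A = alpha_t * np.eye(d) + beta_t * gram
mu_direct = np.linalg.solve(A, beta_t * y_p)
B = np.eye(n) / beta_t + X @ X.T / alpha_t            # assumed form of B_opt
quad_direct = y @ np.linalg.solve(B, y)
resid_direct = np.sum((y - X @ mu_direct) ** 2)
```

After the single eigendecomposition, each iteration in this sketch costs \(O(d^2)\) for \(\varvec{\mu }_{opt}\) and \(O(d)\) for the traces, with no \(n\times n\) matrix formed outside the reference check.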

Regret bound Algorithm 5 may be invoked at the start of each round to give updated values of \(\alpha _t\) and \(\beta _t\). However, under this adaptive scheme, \(\alpha ^{-1}\) is no longer independent of the other variables. This violates one of the assumptions made in [1], where the choice of \(\lambda \) is independent of other variables. Therefore, the validity of the oversampling factor becomes questionable. As the regret analysis for LinTS depends on the validity of the upper bound provided by [1], this analysis in turn becomes invalid as well. Regret analysis for the adaptive case therefore remains an open problem. A possible remedy is to halt the hyperparameter optimisation updates after a certain number of rounds, in which case \(\alpha ^{-1}\) can be viewed as constant in the long run, and the following holds as a direct consequence of Theorems 2 and 3.

Corollary 7

Consider a multi-armed bandit agent with hyperparameters updated as per Algorithm 5 every round up to round \(n_s\), after which no further update is invoked. Then, round \(n_s+1\) can be treated as the first bandit round with constant hyperparameters.

For LinTS, this is equivalent to

$$\begin{aligned} Reg(T+n_s) = R^{TS}(T+n_s) + R^{RLS}(T+n_s), \end{aligned}$$

with each term bounded as

$$\begin{aligned} R^{TS}(T+n_s)&\le R^{TS}(n_s)+\frac{4\bar{\gamma }_T(\delta ')}{p}\left( \sqrt{2T\log \frac{\det (\varvec{V}_{n_s+T+1})}{\det (R^2\varvec{V}_{n_s+1})}}+\right. \\&\qquad \qquad \qquad \qquad \qquad \qquad \left. \sqrt{\frac{8T}{\lambda _{min}(R^2\varvec{V}_{n_s+1})}\log \frac{4}{\delta }}\ \right) \\ R^{RLS}(T+n_s)&\le R^{RLS}(n_s)+\left( \bar{\beta }_T(\delta ') + \bar{\gamma }_T(\delta ')\right) \sqrt{2T\log \frac{\det (\varvec{V}_{n_s+T+1})}{\det (R^2\varvec{V}_{n_s+1})}}~, \end{aligned}$$

where \(R^{TS}(n_s)\) and \(R^{RLS}(n_s)\) are constant, \(\bar{\gamma }_T(\delta ) = \bar{\beta }_T(\delta ')\sqrt{cd\log ((c'd)/\delta )} \) and \(\bar{\beta }_T(\delta )\) is the upper bound of the ellipsoid whose rounds start at \(n_s\), defined as:

$$\begin{aligned} \bar{\beta }_T(\delta ) = R\sqrt{2\log \left( \frac{\det (\varvec{V}_{n_s+T})^{\frac{1}{2}}\det (R^2{\varvec{V}}_{n_s+1})^{-\frac{1}{2}}}{\delta } \right) } + \sqrt{\lambda _{max}(R^2{\varvec{V}}_{n_s+1})} {\bar{S}}~, \end{aligned}$$

where \({\bar{S}}\) is defined such that \(\Vert \hat{\varvec{\theta }}_{n_s+1}-\varvec{\theta }_{\star }\Vert \le {\bar{S}}\).

For LinUCB, this is equivalent to

$$\begin{aligned} Reg(T+n_s) \le Reg(n_s) + \bar{\beta }_T(\delta )\sqrt{8T\log \left( \frac{\det (\varvec{V}_{n_s+T+1})}{\det (R^2\varvec{V}_{n_s+1})}\right) }~, \end{aligned}$$

where \(Reg(n_s)\) is constant and \(\bar{\beta }_T(\delta )\) is defined as above.
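To make the LinUCB bound concrete, the following sketch evaluates \(\bar{\beta }_T(\delta )\) and the additional regret term numerically, as a direct transcription of the two displayed formulas. The function names, argument names and the example matrices are hypothetical; in particular, the example design matrices are arbitrary and are not drawn from the experiments.

```python
import numpy as np


def beta_bar_T(V_start, V_last, R, S_bar, delta):
    """Evaluate beta-bar_T(delta), with V_start = V_{n_s+1} and V_last = V_{n_s+T}."""
    d = V_start.shape[0]
    _, logdet_start = np.linalg.slogdet(V_start)
    _, logdet_last = np.linalg.slogdet(V_last)
    logdet_scaled = 2 * d * np.log(R) + logdet_start   # log det(R^2 V_{n_s+1})
    log_arg = 0.5 * logdet_last - 0.5 * logdet_scaled - np.log(delta)
    lam_max = R**2 * np.linalg.eigvalsh(V_start)[-1]   # largest eigenvalue of R^2 V_{n_s+1}
    return R * np.sqrt(2.0 * log_arg) + np.sqrt(lam_max) * S_bar


def linucb_extra_regret(V_start, V_last, V_next, R, S_bar, delta, T):
    """Evaluate the term added to Reg(n_s), with V_next = V_{n_s+T+1}."""
    d = V_start.shape[0]
    _, logdet_start = np.linalg.slogdet(V_start)
    _, logdet_next = np.linalg.slogdet(V_next)
    log_ratio = logdet_next - (2 * d * np.log(R) + logdet_start)
    return beta_bar_T(V_start, V_last, R, S_bar, delta) * np.sqrt(8.0 * T * log_ratio)


# Illustrative inputs only (assumptions): d = 2, R = 1/4, S_bar = 1, delta = 0.1.
V_s = np.eye(2)
V_T = np.diag([5.0, 3.0])
R, S_bar, delta, T = 0.25, 1.0, 0.1, 6
beta = beta_bar_T(V_s, V_T, R, S_bar, delta)
extra = linucb_extra_regret(V_s, V_T, V_T, R, S_bar, delta, T)
```

Working with log-determinants via `np.linalg.slogdet` avoids overflow when the design matrices are large; the sketch assumes the logarithm arguments are positive, as in the example.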

Experimental results To demonstrate the advantage of adaptive hyperparameter tuning, we repeated the experiment on the artificial dataset. We generated two types of pre-training data: accurate and misspecified. For the accurate dataset, we chose true \(\alpha ^{-1} = 10^{-4}\), and for the misspecified dataset, we chose true \(\alpha ^{-1} = 100\). Notice that such a high value of \(\alpha ^{-1}\) in the misspecified dataset is intentionally extreme, chosen to demonstrate the capability of the adaptive hyperparameter algorithm, and hence does not reflect a real-world setting. For the bandit, we used LinUCB with \(\rho =0.2\), initial bandit hyperparameters \(\alpha _t = 1\) and \(\beta _t=1/R^2=16\) (both unchanged over time for bandits with manually chosen hyperparameters), and \(\bar{\beta }=1\) with \(tol= 0.1\) as the convergence tolerance for both \(\alpha _t\) and \(\beta _t\) in hyperparameter tuning. As shown in Fig. 9a, the adaptive hyperparameter algorithm is capable of exploiting the accurate prior, even outperforming its non-adaptive counterpart. On the other hand, when the prior is highly misspecified (Fig. 9b), the warm-started bandit without automatic hyperparameter tuning performs disastrously, while our adaptive hyperparameter algorithm is able to detect the mismatch and ignore the initial guess, recovering performance close to what would have been obtained had the cold-start regime been deployed.

Fig. 9 Experimental results for (a) an accurate prior and (b) a misspecified prior, comparing cold-start (cold), warm-start with non-adaptive hyperparameters (warm manual) and warm-start with adaptive hyperparameters (warm auto) using LinUCB

6 Conclusions and future work

In this paper, we have developed a flexible framework for warm starting linear contextual bandits that inherits the flexibility of Bayesian inference in incorporating prior knowledge. Our approach generalises the linear Thompson sampler of Abeille et al. [2] by permitting arbitrary Gaussian priors to potentially improve short-term performance, while maintaining the regret bound that guarantees the long-term performance of Hannan consistency. While little attention has been paid to the warm-start problem since the direction was suggested by Li et al. [15], the few existing works on warm start are far less flexible in catering to potential sources of prior knowledge, and in how uncertainty is quantified. We motivate the opportunity for warm start in the database systems domain, where bandit-based index selection could be pre-trained prior to deployment by users, and we demonstrate the practical potential for warm start on a standard database benchmark. We have also contributed an approach to adapting the key hyperparameters responsible for controlling the exploration temperature, based on the misspecification of pre-training.

As warm-start bandits remain relatively unexplored, we believe that they offer a number of intriguing future directions for research, well suited to the Thompson sampling framework on which our approach was developed.

Adaptive oversampling factor In this paper, it is assumed that the \(\ell _2\)-norm of the parameter is bounded by S. However, this bound may not be known with confidence in some applications. In such cases the algorithms are still valid, but the regret bounds may not be. As more data are observed, we gain information (accuracy) about \(\varvec{\delta }_\star \): the variance of the random variable \(\varvec{\delta }\) drops. Therefore, one may wish to bound \(\Vert \varvec{\delta }\Vert \) with some level of probability. It is interesting to note that the magnitude of S is closely related to the drift hyperparameter; potentially, both quantities could be optimised jointly by a single algorithm.

Reward unit mismatch When pre-training data are provided, the units of the pre-training and deployed datasets may differ. An interesting problem arises by noticing that the performance of a contextual bandit algorithm is measured not by how close the predicted reward is to the actual reward, but rather by the ranking of the arm values. As such, it is the direction of the initial guess of \(\varvec{\theta }\) that is important, not its norm. A simple solution could be to learn a constant that scales the pre-training rewards to the deployed rewards. Ideally, this scalar would be incorporated into the Warm Start LinTS, provided performance is not sacrificed.