Abstract
Abstract
Multi-armed bandits achieve excellent long-term performance in practice and sublinear cumulative regret in theory. However, a real-world limitation of bandit learning is poor performance in early rounds due to the need for exploration—a phenomenon known as the cold-start problem. While this limitation may be necessary in the general classical stochastic setting, in practice, where "pretraining" data or knowledge is available, it is natural to attempt to "warm-start" bandit learners. This paper provides a theoretical treatment of warm-start contextual bandit learning, adopting Linear Thompson Sampling as a principled framework for flexibly transferring domain knowledge as might be captured by bandit learning in a prior related task, a supervised pretrained Bayesian posterior, or domain expert knowledge. Under standard conditions, we prove a general regret bound. We then apply our warm-start algorithmic technique to other common bandit learners—the \(\epsilon \)-greedy and upper-confidence bound contextual learners. An upper regret bound is then provided for LinUCB. Our suite of warm-start learners is evaluated in experiments with both artificial and real-world datasets, including a motivating task of tuning a commercial database. A comprehensive range of experimental results is presented, highlighting the effect of different hyperparameters and quantities of pretraining data.
1 Introduction
Multi-armed bandits have undergone a renaissance in machine learning research [14, 26], with a range of deep theoretical results discovered, while applications to real-world sequential decision making under uncertainty abound, ranging from news [15] and movie recommendation [22], to crowd sourcing [30] and self-driving databases [19, 21]. The relative simplicity of the stochastic bandit setting, as compared to more general partially observable Markov decision processes (POMDPs), typically admits regret analysis where bandit learners enjoy bounded cumulative regret—the gap between a learner's cumulative reward to time T and the cumulative reward possible with a fixed but optimal-with-hindsight policy. While many bandit learners are celebrated for attaining sublinear regret, or average regret converging to zero, such long-term performance goals say little about the short-term performance of today's popular bandit algorithms.
Indeed, the bandit setting is well known to be the simplest Markov decision process setting to require balancing of exploration—attempting infrequent actions in case of higher-than-expected rewards—with exploitation—greedy selection of actions that so far appear fruitful. Even in the stochastic setting, where rewards are drawn from stationary (context-conditional) distributions, the underlying distributions are unknown and considered adversarially chosen. In other words, there is no free lunch (in the worst case) without significant exploration in early rounds.
The relatively poor early-round performance of bandit learners is known as the cold-start problem and can be costly in high-stakes domains. Li et al. [15] suggested that bandit learners be warm-started or pretrained somehow prior to deployment, in the context of online media recommendation and advertising where poor performance leads to user dissatisfaction and financial loss. However, little systematic research has explored the cold-start problem. Intuitively, warm start is related to transfer learning [9] and domain adaptation [10]. Shivaswamy and Joachims [25] proposed warm-starting methods for non-contextual bandits, while Zhang et al. [32] modify any bandit policy to make use of pretraining from (batch) supervised learning by manipulating the policy's importance sampling and weighting, which determines the relative importance of one datum \((\varvec{x},y)\) over another—ultimately resulting in a weighted linear regression. Another work by Li et al. [16] employs virtual plays before committing to an action in every round, which implicitly assumes that the existing logged data are perfectly aligned with the unknown bandit data. A similar assumption is made implicitly by Bouneffouf et al. [6], who combine prior historical observations and clustering information. Other works have addressed the item-user cold-start problem, such as that of Wang et al. [31], who passively assign a user to each item on top of the usual bandit, which selects an item for a user. The warm-start problem is also related to the conservative bandit problem, where the usual bandit setting applies under the existence of a baseline policy and a performance constraint [13]. This paper advocates for Thompson sampling (TS) [28] as a natural framework for warm-start bandits.
Although the prior used in Thompson sampling can be misspecified, as discussed by Liu and Li [17], our extension to the LinTS contextual bandit not only affords more flexible forms of warm start, but also quantifies prior uncertainty and admits regret analysis. Furthermore, this idea can be extended to other bandit algorithms, such as \(\epsilon \)-greedy and LinUCB.
Flexibility in warm start is paramount, as not all settings requiring warm start will admit prior supervised learning as assumed previously [32]. Indeed, bandits are typically motivated by an absence of direct supervision, where only indirect rewards are available. Our framework is flexible in this regard: prior knowledge could come from bandit learning on a previous, related task; from domain expert knowledge or knowledge extracted from a rule-based, non-adaptive baseline system; or indeed from prior supervised learning.
We introduce a new motivation for warm-start bandits from the database systems domain. Database indices, data structures used by database management systems to execute queries more rapidly, may be formed on any combination of table columns. Unfortunately, the best choice of index depends on unknown query workloads and potentially unstable system performance. Offline solutions to index selection have been the foundation of the automated tools provided by database vendors [3, 11, 33]. Recognising that database administrators cannot practically foresee future database loads, online solutions, where the choice of the representative workload and the cost-benefit analysis of materialising a configuration are automated, have been proposed [7, 8, 12, 18, 23, 24]. Unfortunately, most such approaches lack any form of performance guarantee. Recent work has demonstrated compelling potential for linear bandits in index selection [21], complete with regret-bound guarantees; however, the cold-start problem is likely to limit deployment, as vendors and users alike may be concerned about out-of-the-box performance. We demonstrate that a warm-start bandit can deliver strong short-term improvement for database index selection without sacrificing long-term results.
In summary, this paper makes the following contributions:

We propose a framework for warm-starting contextual bandits based on LinTS and extend our technique to \(\epsilon \)-greedy and LinUCB;

Unlike past efforts to warm-start bandit learners, which apply strictly to supervised pretraining, our warm-start linear bandits in Algorithms 2, 3 and 4 can incorporate prior knowledge from any form of prior learning, such as supervised learning [32], prior bandit learning, or manual construction of a prior by a domain expert. Notably, our warm-start approach incorporates uncertainty quantification;

We introduce a method to automatically tune the hyperparameters used in Algorithms 2, 3 and 4;

We present regret bounds for warm-start LinTS and LinUCB that demonstrate sublinear long-term regret;

Experiments on database index selection (using data derived from standard system benchmarks), classification task data and synthetic data demonstrate short-term performance improvement while remaining competitive with baselines (where such baselines are able to be run); and

We present expanded experiments demonstrating the effect of increased pretraining on performance in both accurate and misspecified settings.
2 Background: contextual bandits and linear Thompson sampling
The stochastic contextual multi-armed bandit (MAB) problem is a game proceeding in rounds \(t \in [T]=\{1, 2, \ldots , T\}\). In round t the MAB learner,

1.
observes k possible actions or arms \(i \in [k]\) each with adversarially chosen context vector \(\varvec{x}_t(i) \in {\mathbb {R}}^d\);

2.
selects or pulls an arm \(i_t \in [k]\);

3.
observes random reward \(R_{i_t}(t)\) for the pulled arm \(i_t\), where each \(R_i(t)\mid \varvec{x}_t(i)\sim P_{i\mid \varvec{x}_t(i)}\) independently over \(i\in [k], t\in [T]\).
The MAB learner's goal is to maximise its cumulative expected reward—the total expected reward over all rounds—which is equivalent to minimising the cumulative regret up to round T:
$$\begin{aligned} \mathrm {Reg}(T) = \sum _{t=1}^{T} {\mathbb {E}}\left[ R_{i^\star _t}(t)\mid \varvec{x}_t(i^\star _t)\right] - {\mathbb {E}}\left[ R_{i_t}(t)\mid \varvec{x}_t(i_t)\right] , \end{aligned}$$
where \(i^\star _t \in {\arg \max }_{i\in [k]} {\mathbb {E}}\left[ R_i(t)\mid \varvec{x}_t(i)\right] \), that is, an optimal arm to pull at round t. When a MAB algorithm's cumulative regret Reg(T) is sublinear in T, the average regret Reg(T)/T goes to zero. Such an algorithm is said to be a "no regret" learner or Hannan consistent.
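To make the protocol and regret notion concrete, here is a minimal simulation sketch. All names and numbers (`d`, `k`, `T`, the synthetic linear reward model) are our own illustrative choices, not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, T = 5, 4, 200
theta_star = rng.normal(size=d)          # unknown to the learner

def cumulative_regret(policy):
    """Run T rounds; return Reg(T) for `policy`, a function
    mapping a (k, d) matrix of arm contexts to an arm index."""
    regret = 0.0
    for t in range(T):
        X = rng.normal(size=(k, d))      # contexts x_t(i), one row per arm
        means = X @ theta_star           # E[R_i(t) | x_t(i)] under a linear model
        i_t = policy(X)
        regret += means.max() - means[i_t]   # instantaneous regret of pulled arm
    return regret

# A uniformly random policy explores forever: its regret grows linearly in T.
reg = cumulative_regret(lambda X: int(rng.integers(len(X))))
```

A Hannan-consistent learner would instead drive `reg / T` to zero as `T` grows.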
Thompson sampling (TS), a Bayesian approach within the family of randomised probability matching algorithms, is one of the earliest design patterns for MAB learning [28]. Each modelled arm’s reward likelihood is endowed with a prior. Arms are then pulled based on their posteriors: e.g., parameters for each arm can be drawn from the corresponding posteriors, and then arm selection may proceed (greedily) by maximising reward likelihood.
Linear Thompson sampling (LinTS) [2, 4] is an algorithm with sublinear cumulative regret when the context-conditional reward satisfies a linear relationship
$$\begin{aligned} R_{i_t}(t) = \varvec{x}_t(i_t)^T\varvec{\theta }_\star + \epsilon _{t}(i_t), \end{aligned}$$
where the additive noise \(\epsilon _{t}(i_t)\) is conditionally R-sub-Gaussian and \(\varvec{\theta }_\star \in {\mathbb {R}}^d\) is an unknown vector-valued parameter shared among all of the k arms.
Like most approaches to linear contextual bandit learning, LinTS adopts (online) ridge regression fitting for estimating the unknown parameter. For any regularisation parameter \(\lambda \in {\mathbb {R}}^+\), define the matrix \(\varvec{V}_t\) as
$$\begin{aligned} \varvec{V}_{t} = \lambda \varvec{I} + \sum _{s=1}^{t-1} \varvec{x}_s(i_s)\varvec{x}_s(i_s)^T~. \end{aligned}$$ (1)
Then, Abeille et al. [2] demonstrated that we can estimate the unknown parameter \(\varvec{\theta }_\star \) as
$$\begin{aligned} \hat{\varvec{\theta }}_{t} = \varvec{V}_t^{-1}\sum _{s=1}^{t-1} \varvec{x}_s(i_s) r_s(i_s)~. \end{aligned}$$ (2)
Earlier versions of LinTS [4] do not include a tunable regularisation parameter.
A result due to Abbasi-Yadkori et al. [1] is used within LinTS. Assuming \(\Vert \varvec{\theta }_\star \Vert \le S\), then with probability at least \(1-\delta \) for \(\delta \in (0,1)\):
$$\begin{aligned} \Vert \hat{\varvec{\theta }}_t - \varvec{\theta }_\star \Vert _{\varvec{V}_t} \le \beta _t(\delta ) = R\sqrt{2\log \frac{\det (\varvec{V}_t)^{1/2}\lambda ^{-d/2}}{\delta }} + \lambda ^{1/2} S~. \end{aligned}$$
In Thompson sampling, we may introduce a perturbation parameter \(\varvec{\eta }_t\in {\mathbb {R}}^d\) which, after rotation and scaling by the inverse square root matrix \(\varvec{V}_t^{-1/2}\), and scaling by the oversampling factor \(\beta _t(\delta ')\), promotes exploration around the point estimate \(\hat{\varvec{\theta }}_t\):
$$\begin{aligned} \tilde{\varvec{\theta }}_t = \hat{\varvec{\theta }}_t + \beta _t(\delta ')\varvec{V}_t^{-1/2}\varvec{\eta }_t~. \end{aligned}$$
Moreover, Abeille et al. [2] have shown that if \(\varvec{\eta }_t\) follows a distribution \({\mathcal {D}}^{TS}\) with the following properties:

1.
There exists \(p>0\) such that, for all \(\Vert \varvec{u}\Vert =1\) we have \({\mathbb {P}}_{\varvec{\eta }\sim {\mathcal {D}}^{TS}}(\varvec{u}^T\varvec{\eta }\ge 1)\ge p\); and

2.
There exist positive constants c and \(c'\) such that, for all \(\delta \in (0,1)\) we have the inequality \({\mathbb {P}}_{\varvec{\eta }\sim {\mathcal {D}}^{TS}}\left( \Vert \varvec{\eta }\Vert \le \sqrt{cd \log \frac{c'd}{\delta }}\right) \ge 1-\delta ~\),
then LinTS is Hannan consistent. We adopt a standard multivariate Gaussian for \(\varvec{\eta }_t\) which satisfies the above properties [2]. With all of these definitions in mind, the version of LinTS used in this paper can be summarised as shown in Algorithm 1.
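As an illustration of the LinTS pattern just described, a single round can be sketched as follows. This is not a reproduction of Algorithm 1: the dimensions are placeholders, `beta` stands in for the oversampling factor \(\beta _t(\delta ')\), and the standard Gaussian perturbation is the \({\mathcal {D}}^{TS}\) choice adopted above:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, lam, beta = 5, 4, 1.0, 1.0    # beta stands in for beta_t(delta')

# Sufficient statistics of online ridge regression (Eqs. (1) and (2)).
V = lam * np.eye(d)                 # V_1 = lambda * I
b = np.zeros(d)                     # running sum of x_s(i_s) r_s(i_s)

def lints_round(X):
    """One LinTS round: perturb the ridge estimate, then act greedily."""
    theta_hat = np.linalg.solve(V, b)
    eta = rng.normal(size=d)        # eta ~ N(0, I) satisfies the D^TS conditions
    # Multiplying by a square root of V^{-1} rotates/scales the perturbation.
    V_inv_sqrt = np.linalg.cholesky(np.linalg.inv(V))
    theta_tilde = theta_hat + beta * V_inv_sqrt @ eta
    return int(np.argmax(X @ theta_tilde))

def update(x, r):
    """Rank-one update of the statistics after observing reward r for context x."""
    global V, b
    V += np.outer(x, x)
    b += r * x

X = rng.normal(size=(k, d))         # this round's contexts, one row per arm
arm = lints_round(X)
update(X[arm], r=1.0)               # plug in the observed reward here
```

In practice the inverse and Cholesky factor would be maintained incrementally rather than recomputed each round.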
3 Warm-starting linear bandits
We now detail our flexible algorithmic framework for warm-starting contextual bandits, beginning with linear Thompson sampling, for which we derive a new regret bound.
3.1 Thompson sampling
Given the foundation of Thompson sampling in Bayesian inference, it is natural to look to manipulating the prior as a means of injecting a priori knowledge of the reward structure before the bandit is put into operation. The Algorithm 1 implementation of LinTS, due to Abeille et al. [2], decomposes the prior and posterior distributions on \(\varvec{\theta }_t\) as a Gaussian centred at the point estimate \(\hat{\varvec{\theta }}_t\) with covariance based on the oversampling factor \(\beta _t(\delta ')\) and the matrix \(\varvec{V}_t\) via the random perturbation vector \(\varvec{\eta }_t\). Our approach to warm start is to manipulate the initial point estimate \(\hat{\varvec{\theta }}_1\) and the matrix \(\varvec{V}_1\) to incorporate available prior knowledge into LinTS.
Remark 1
Although Algorithm 1 appears to offer the freedom to select any \(\hat{\varvec{\theta }}_1\), Eqs. (1) and (2) do not present an immediate route to adapting subsequent point estimates \(\hat{\varvec{\theta }}_t\). Generalising Eq. (2) to the point estimate \(\hat{\varvec{\theta }}_t = \varvec{V}_t^{-1}(\lambda \hat{\varvec{\theta }}_1 + \sum _{s=1}^{t-1}\varvec{x}_s(i_s)r_s(i_s))\) is unintuitive and does not clearly admit regret analysis.
We adopt an intuitive approach of adapting Algorithm 1 to model the difference between an initial guess, derived from some process occurring before bandit learning, and the actual parameter. This pre-deployment process could be batch supervised learning, an earlier bandit deployment on a related decision problem, or simply a prior manually constructed by a domain expert. Our general framework is agnostic to this choice and generalises earlier approaches to warm-starting bandits such as [32]. Without loss of generality we refer to this earlier process as the first phase and the data on which initial parameters are based as the first phase dataset. Let \(\varvec{\theta }_\star = \varvec{\mu }_\star + \bar{\varvec{\delta }}_\star \), where \(\varvec{\mu }_\star \) is the true parameter of the first phase dataset and \(\bar{\varvec{\delta }}_\star \) represents the concept drift between the first phase and bandit deployment. With this reparametrisation, our linear model becomes:
$$\begin{aligned} R_{i_t}(t) = \varvec{x}_t(i_t)^T(\varvec{\mu }_\star + \bar{\varvec{\delta }}_\star ) + \epsilon _t(i_t)~. \end{aligned}$$
Therefore, our problem has reduced from estimating \(\varvec{\theta }_\star \) to estimating \(\bar{\varvec{\delta }}_\star \).
Consider a Bayesian linear regression model with the unknown true value of the first phase dataset \(\varvec{\mu }_\star \) modelled by random variable \(\varvec{\mu }\sim {\mathcal {N}}(\hat{\varvec{\mu }}, \varvec{\Sigma }_\mu )\) with conjugate context-conditional Gaussian likelihood. We then model the difference parameter \(\bar{\varvec{\delta }}_\star \) as \(\bar{\varvec{\delta }} \sim {\mathcal {N}}(\varvec{0}, \alpha ^{-1} \varvec{I})\). If \(\varvec{\theta } = \varvec{\mu } + \bar{\varvec{\delta }}\) is the random variable modelling \(\varvec{\theta }_\star \), then \(\varvec{\theta } \sim {\mathcal {N}}(\hat{\varvec{\mu }}, \varvec{\Sigma }_\mu +\alpha ^{-1}\varvec{I})\) owing to the Gaussian's stability property. Finally, since \(\hat{\varvec{\mu }}\) is known, we can model \(\varvec{\theta }\) as \(\varvec{\theta } = \hat{\varvec{\mu }} + \varvec{\delta }\), that is, a random variable centred at \(\hat{\varvec{\mu }}\) which is shifted by drift \(\varvec{\delta } \sim {\mathcal {N}}(\varvec{0}, \varvec{\Sigma }_\mu + \alpha ^{-1} \varvec{I}_d)\).
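A minimal sketch of how this warm-start prior might be assembled, assuming the drift prior \(\varvec{\delta } \sim {\mathcal {N}}(\varvec{0}, \varvec{\Sigma }_\mu + \alpha ^{-1}\varvec{I})\) so that the initial precision matrix is its inverse; `warm_start_init` and all numbers are our own illustrative choices:

```python
import numpy as np

def warm_start_init(mu_hat, sigma_mu, alpha):
    """Initial state for a warm-started learner: theta = mu_hat + delta with
    drift delta ~ N(0, Sigma_mu + I/alpha), so the initial precision matrix
    is V_1 = (Sigma_mu + I/alpha)^{-1}."""
    d = len(mu_hat)
    prior_cov = sigma_mu + np.eye(d) / alpha
    V1 = np.linalg.inv(prior_cov)
    return mu_hat.copy(), V1

mu_hat = np.array([0.5, -0.2, 1.0])      # point estimate from the first phase
sigma_mu = 0.1 * np.eye(3)               # its (pretraining) covariance
# Confident pretraining: large alpha adds little extra drift variance.
_, V1_confident = warm_start_init(mu_hat, sigma_mu, alpha=1e3)
# Unconfident pretraining: small alpha inflates the prior covariance.
_, V1_vague = warm_start_init(mu_hat, sigma_mu, alpha=1e-1)
```

The more confident the pretraining, the larger the initial precision, and hence the more observations the bandit needs before moving away from \(\hat{\varvec{\mu }}\).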
We next generalise the coupled recurrence Eqs. (1) and (2) for efficient incremental computation of the generalised posterior estimates.
Proposition 1
Consider the linear regression likelihood \(y_i = \varvec{\theta }^T\varvec{x}_i + \epsilon _i\), where \(\epsilon _i \sim {\mathcal {N}}(0, R^2)\), and prior \(\varvec{\theta } \sim {\mathcal {N}}(\varvec{0}, \varvec{V}_1^{-1})\). Then the posterior conditioned on data \(\varvec{z}_i=(\varvec{x}_i, y_i)\) for \(i\in [t]\) is given by \({\mathcal {N}}(\hat{\varvec{\theta }}_{t+1}, R^2\varvec{V}_{t+1}^{-1})\), where the \(\hat{\varvec{\theta }}_t\) point estimates are defined by Eq. (2), and we replace Eq. (1) for \(\varvec{V}_t\) with
$$\begin{aligned} \varvec{V}_{t+1} = R^2\varvec{V}_1 + \sum _{i=1}^{t} \varvec{x}_i\varvec{x}_i^T~, \end{aligned}$$
where \(R^2\) is the variance of the measurement noise.
Proof
The posterior distribution is:
$$\begin{aligned} p(\varvec{\theta }\mid \varvec{z}_1,\ldots ,\varvec{z}_n) \propto \exp \left( -\frac{1}{2}\varvec{\theta }^T\varvec{V}_1\varvec{\theta }\right) \prod _{i=1}^n \exp \left( -\frac{(y_i-\varvec{\theta }^T\varvec{x}_i)^2}{2R^2}\right) . \end{aligned}$$
To avoid clutter, let \(\bar{\varvec{V}}_{n+1}= \varvec{V}_1 + \frac{1}{R^2}\sum _{i=1}^n \varvec{x}_i\varvec{x}_i^T\) and \(\bar{\varvec{b}}_{n+1} = \frac{1}{R^2}\sum _{i=1}^n y_i \varvec{x}_i\). Therefore, our posterior distribution can be rewritten as
$$\begin{aligned} p(\varvec{\theta }\mid \varvec{z}_1,\ldots ,\varvec{z}_n) \propto \exp \left( -\frac{1}{2}\varvec{\theta }^T\bar{\varvec{V}}_{n+1}\varvec{\theta } + \varvec{\theta }^T\bar{\varvec{b}}_{n+1}\right) , \end{aligned}$$
which is proportional to \({\mathcal {N}}(\bar{\varvec{V}}_{n+1}^{-1}\bar{\varvec{b}}_{n+1}, \bar{\varvec{V}}_{n+1}^{-1})\). Therefore, our estimator for \(\varvec{\theta }\) would be
$$\begin{aligned} \hat{\varvec{\theta }}_{n+1} = \bar{\varvec{V}}_{n+1}^{-1}\bar{\varvec{b}}_{n+1} = \varvec{V}_{n+1}^{-1}\varvec{b}_{n+1}, \end{aligned}$$
where we have defined
$$\begin{aligned} \varvec{V}_{n+1} = R^2\bar{\varvec{V}}_{n+1} = R^2\varvec{V}_1 + \sum _{i=1}^n \varvec{x}_i\varvec{x}_i^T \quad \text {and}\quad \varvec{b}_{n+1} = R^2\bar{\varvec{b}}_{n+1} = \sum _{i=1}^n y_i\varvec{x}_i. \end{aligned}$$
This completes the proof. \(\square \)
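Proposition 1 is easy to check numerically: the batch posterior mean computed from \(\bar{\varvec{V}}\) and \(\bar{\varvec{b}}\) in the proof coincides with the incremental form using \(\varvec{V}_{t+1}=R^2\varvec{V}_1+\sum _i\varvec{x}_i\varvec{x}_i^T\). A small self-contained check with synthetic data and an arbitrary prior precision (our own illustrative numbers):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, R = 3, 50, 0.5
V1 = np.diag([2.0, 1.0, 0.5])        # prior precision: theta ~ N(0, V1^{-1})
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + R * rng.normal(size=n)

# Closed form from the proof: Vbar = V1 + X^T X / R^2, bbar = X^T y / R^2.
Vbar = V1 + X.T @ X / R**2
bbar = X.T @ y / R**2
theta_batch = np.linalg.solve(Vbar, bbar)

# Form of Proposition 1: V_{n+1} = R^2 V1 + sum x x^T, built incrementally,
# with theta_{n+1} = V_{n+1}^{-1} sum y_i x_i.
V = R**2 * V1
b = np.zeros(d)
for x_i, y_i in zip(X, y):
    V += np.outer(x_i, x_i)
    b += y_i * x_i
theta_inc = np.linalg.solve(V, b)

assert np.allclose(theta_batch, theta_inc)
```

The two estimators agree because \(\varvec{V}_{n+1}^{-1}\varvec{b}_{n+1} = (R^2\bar{\varvec{V}}_{n+1})^{-1}(R^2\bar{\varvec{b}}_{n+1}) = \bar{\varvec{V}}_{n+1}^{-1}\bar{\varvec{b}}_{n+1}\).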
Our approach comes with an appealing interpretation when setting \(\bar{\varvec{\delta }} \sim {\mathcal {N}}(\varvec{0}, \alpha ^{-1} \varvec{I})\): when we are confident that our pretraining guess is very close to the true parameter, we can set the drift variance \(\alpha ^{-1}\) to be very small and close to 0. When we are less confident, \(\alpha ^{-1}\) is naturally set large. A large \(\alpha ^{-1}\) permits more "deviation" or error from our first phase parameter \(\varvec{\mu }_\star \). This suggests a promising new direction which we highlight in future work in Sect. 6.
Our simple reduction of warmstart bandit learning to LinTS admits a regret bound. We follow the pattern of the regret analysis of Abeille et al. [2] with differences detailed next.
Observe first that \(\Vert \hat{\varvec{\theta }}_t - \varvec{\theta }_\star \Vert _{\varvec{V}_t} = \Vert (\hat{\varvec{\theta }}_t - \hat{\varvec{\mu }}) - (\varvec{\theta }_\star - \hat{\varvec{\mu }})\Vert _{\varvec{V}_t} = \Vert \hat{\varvec{\delta }}_t - \varvec{\delta }_\star \Vert _{\varvec{V}_t} \le \beta _t(\delta ')\). Accordingly, the argument yielding the confidence ellipsoid \(\beta _t(\delta ')\) stated in [1, Theorem 2] bounding \(\Vert \hat{\varvec{\theta }}_t - \varvec{\theta }_\star \Vert _{\varvec{V}_t}\) applies in our case; a full proof of its modification can be found in the Appendix. However, as our initial matrix \(\varvec{V}_1\) generalises \(\lambda \varvec{I}\), we must alter the penultimate proof steps of Abeille et al. [2] as follows:

the inequality proposed by Abbasi-Yadkori et al. [1] which is used to define \(\beta _t(\delta )\) in their paper is not valid in our scenario. This is corrected by using the version of \(\beta _t(\delta )\) presented in this paper, removing the assumption that \(\varvec{V}_1 = \frac{\lambda }{R^2} \varvec{I}\) and leaving it in terms of \(\varvec{V}_1\):
$$\begin{aligned} R \sqrt{2\log \frac{\det (\varvec{V}_t)^{1/2}\det (R^2\varvec{V}_1)^{-1/2}}{\delta }} + \sqrt{\lambda _{max}(R^2\varvec{V}_1)}\, S; \end{aligned}$$ 
the inequality of [2, Proposition 2] is no longer valid in our case. However, the last inequality in [20] has modified [2, Proposition 2] into:
$$\begin{aligned} \sum _{s=1}^t\Vert \varvec{x}_s\Vert ^2_{\varvec{V}_s^{-1}} \le 2\log \left( \frac{\det (\varvec{V}_{t+1})}{\det (R^2\varvec{V}_1)}\right) \end{aligned}$$and hence serves our purpose; and

in proving [2, Theorem 1] the authors used the fact that \(\varvec{V}_t^{-1} \le \frac{1}{\lambda }\varvec{I}\). This is not the case in our setting, but we can generalise the result with similar reasoning, yielding \(\varvec{V}_t^{-1} \le \frac{1}{\lambda _{min}(R^2\varvec{V}_1)}\varvec{I}\), where \(\lambda _{min}(R^2\varvec{V}_1)\) denotes the minimum eigenvalue of the matrix \(R^2\varvec{V}_1\).
We also need to change the definition of S, since our problem has shifted from estimating \(\varvec{\theta }\) to estimating \(\varvec{\delta }\). Therefore, after modifying the framework, the warm-start linear Thompson sampling bandit can be summarised as in Algorithm 2 and admits the following regret bound.
Theorem 2
(Warm-start LinTS regret bound) Under the assumptions that:

1.
\(\Vert \varvec{x}\Vert \le 1\) for all \(\varvec{x} \in {\mathcal {X}}\);

2.
\(\Vert \varvec{\delta }\Vert \le S\) for some known \(S\in {\mathbb {R}}^+\); and

3.
the conditionally R-sub-Gaussian process \(\{\epsilon _t\}_t\) is a martingale difference sequence given the filtration \({\mathcal {F}}_t^x = ({\mathcal {F}}_1, \sigma (\varvec{x}_1, r_1, \ldots , r_{t-1}, \varvec{x}_t))\), with \({\mathcal {F}}_1\) denoting any information on prior knowledge,
along with the definition of \({\mathcal {D}}^{TS}\) given in Sect. 2, then with probability at least \(1 - \delta \), with \(\delta ' = \delta /(4T)\) and \(\gamma _t = \beta _t(\delta ')\sqrt{cd\log ((c'd)/\delta )} \), the regret of LinTS can be decomposed as
with each term bounded as
3.2 Extension to \(\epsilon \)-greedy and LinUCB learners
The core idea of our warm-starting method, as derived for linear Thompson sampling, lies in how the initial phase of the bandit is set up. The same initialisation can be applied to other contextual bandit algorithms such as \(\epsilon \)-greedy and LinUCB.
In the \(\epsilon \)-greedy algorithm, we balance exploration and exploitation by means of relatively naïve randomness: in each round we (uniformly) explore with probability \(\epsilon \) and exploit with probability \(1-\epsilon \). Specifically, incorporating warm start, this means that at each round we choose an arm uniformly at random from the set [k] with probability \(\epsilon \), and choose an arm uniformly at random from the set \(S = \arg \max _{i \in [k]} \hat{\varvec{\theta }}_t^T\varvec{x}_t(i)\) with probability \(1-\epsilon \). We summarise the warm-start \(\epsilon \)-greedy algorithm in Algorithm 3.
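A minimal sketch of this selection rule (not a reproduction of Algorithm 3; function name and numbers are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)

def eps_greedy_arm(theta_hat, X, eps):
    """Warm-start epsilon-greedy choice: explore uniformly over the k arms
    with probability eps; otherwise exploit the current estimate greedily,
    breaking ties uniformly over the argmax set S."""
    k = len(X)
    if rng.random() < eps:
        return int(rng.integers(k))              # uniform exploration over [k]
    scores = X @ theta_hat
    best = np.flatnonzero(scores == scores.max())
    return int(rng.choice(best))                 # uniform over the argmax set S

theta_hat = np.array([1.0, 0.0])                 # warm-started estimate
X = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
arm = eps_greedy_arm(theta_hat, X, eps=0.0125)
```

Warm start enters only through `theta_hat` and its update, which follows the same initialisation as in Sect. 3.1.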
We can also extend our warm-starting technique to LinUCB, using the fact that \(\varvec{\theta }\sim {\mathcal {N}}(\hat{\varvec{\mu }} + \varvec{V}_{t}^{-1}\varvec{b}_{t},R^2\varvec{V}_{t}^{-1})\). Li et al. [15] proposed that one way to interpret their algorithm is to look at the distribution of the expected payoff \(\varvec{\theta }_\star ^T\varvec{x}_t\). By the affine transformation property of multivariate Gaussian distributions, we have that \(\varvec{\theta }^T\varvec{x} \sim {\mathcal {N}}(\hat{\varvec{\theta }}_t^T\varvec{x},R^2\varvec{x}^T\varvec{V}_{t}^{-1}\varvec{x})\). Therefore, the upper bound of this quantity is:
$$\begin{aligned} \hat{\varvec{\theta }}_t^T\varvec{x} + \rho R\sqrt{\varvec{x}^T\varvec{V}_t^{-1}\varvec{x}} \end{aligned}$$
for some value \(\rho \), which is left as a hyperparameter. Our warm-start LinUCB algorithm is summarised in Algorithm 4.
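The scoring step can be sketched as follows (illustrative only, not Algorithm 4; `rho_R` bundles the product \(\rho R\), matching how it is reported as a single hyperparameter in the experiments):

```python
import numpy as np

def linucb_arm(mu_hat, b, V, X, rho_R):
    """Warm-start LinUCB: score each arm by mean + rho * std of theta^T x,
    where theta ~ N(mu_hat + V^{-1} b, R^2 V^{-1}); rho_R bundles rho * R."""
    theta_hat = mu_hat + np.linalg.solve(V, b)
    V_inv = np.linalg.inv(V)
    means = X @ theta_hat
    widths = np.sqrt(np.einsum('ij,jk,ik->i', X, V_inv, X))  # sqrt(x^T V^{-1} x)
    return int(np.argmax(means + rho_R * widths))

mu_hat = np.array([0.2, 0.8])    # warm-start point estimate
V = np.eye(2)                    # initial precision, e.g. from the warm-start prior
b = np.zeros(2)                  # no bandit observations yet
X = np.array([[1.0, 0.0], [0.0, 1.0]])
arm = linucb_arm(mu_hat, b, V, X, rho_R=0.2)
```

Before any observations, the confidence widths are equal across these unit-vector contexts, so the warm-start means alone decide the arm.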
Theorem 3
(Warm-start LinUCB regret bound) The regret bound of warm-started LinUCB follows an argument of Lattimore and Szepesvári [14] closely. The regret, whose complete derivation is provided in the Appendix, admits the bound
3.3 A regret lower bound
We here present a lower bound for the warm-started linear contextual \(\epsilon \)-greedy algorithm. Consider the best-case scenario for \(\epsilon \)-greedy with constant \(\epsilon \): that we have the true weight as our initial guess, i.e., \(\hat{{\varvec{\mu }}}=\varvec{\theta }_\star \). Assume that we use the hyperparameter \(\alpha \rightarrow \infty \), which ensures the weight's resistance to changes from observations, i.e., \(\hat{\varvec{\theta }}_t = \hat{{\varvec{\mu }}}= \varvec{\theta }_\star \) for all t. With this setting, denoting by \(\Delta _{i,t}\ge 0\) the difference between the expected rewards of the optimal arm and arm i at round t, the regret is \(\frac{\epsilon }{K}\sum _{t=1}^T\sum _{i=1}^K\Delta _{i,t}\). This argument, detailed in Lemma 4, proves a lower bound since it is derived from a best-case scenario.
Lemma 4
The regret for warm-started \(\epsilon \)-greedy is at best \(\frac{\epsilon }{K}\sum _{t=1}^T\sum _{i=1}^K\Delta _{i,t}\).
Proof
Since \(\hat{\varvec{\theta }}_t = \varvec{\theta }_\star \) for all t, each exploitation round will yield one of the optimal arms with probability 1. Assume that there are K arms in total. Let E denote the event that exploration occurs, and \(A_i\) be the event that arm i is chosen. Then, the expected cumulative regret for the linear contextual \(\epsilon \)-greedy is:
$$\begin{aligned} \sum _{t=1}^T\sum _{i=1}^K {\mathbb {P}}(E)\,{\mathbb {P}}(A_i\mid E)\,\Delta _{i,t} = \frac{\epsilon }{K}\sum _{t=1}^T\sum _{i=1}^K\Delta _{i,t} = \frac{\epsilon T}{K}\sum _{i=1}^K\bar{\Delta }_{i}, \end{aligned}$$
where \(\bar{\Delta }_i\) is the average of \(\Delta _{i,t}\) over t, i.e., \(\bar{\Delta }_{i}=\frac{1}{T}\sum _{t=1}^T\Delta _{i,t}\). \(\square \)
Note that in this analysis we have used a constant \(\epsilon \) for our \(\epsilon \)-greedy algorithm. In practice, the value of \(\epsilon \) can be scheduled to decay over time. Auer et al. [5] have shown that in the case of non-contextual bandits, this regime enjoys a sublinear upper regret bound.
Reduction from non-contextual to contextual bandits The above lower bound for the contextual \(\epsilon \)-greedy algorithm leads naturally to a lower bound for non-contextual bandits. A non-contextual bandit differs from its contextual counterpart in that no context is provided. In each round, the true mean of each non-contextual arm remains constant, and the arms are independent of each other (i.e., \(\theta _{i,t} = \theta _i\) for all t); thus, the parameters to estimate are \(\theta _i\) for each arm \(i \in [K]\). A non-contextual bandit can be formulated as a contextual bandit, as shown in Lemma 5. By performing such a reduction, essentially using a contextual bandit to act in a non-contextual setting, we can relate lower bounds between the settings.
Lemma 5
A non-contextual bandit can be formulated as a contextual bandit. Therefore, any fundamental limitations for non-contextual bandits must also hold for contextual bandits.
Proof
Let the non-contextual bandit arms be \(i = 1, \dots , K\) and let the expected reward for arm i be \(\theta _i\). A contextual bandit equivalent can be constructed by setting the context for arm i as \(\varvec{x}(i) = \varvec{e}_i\), the standard basis of \({\mathbb {R}}^K\), i.e., the vector whose i-th element is 1 and whose other elements are 0. Furthermore, assuming that the shared model is used, the i-th element of the true weight \(\varvec{\theta }_\star \) can be taken to be \(\theta _i\). This setting leads us to set the initial weight \(\hat{{\varvec{\mu }}}= \begin{bmatrix} {\hat{\mu }}_1&\cdots&{\hat{\mu }}_K\end{bmatrix}^T\) to provide an initial guess of the true mean of each arm \(\mu _i\) for \(i\in [K]\), with \(\varvec{V}_1 = \text {diag}(\lambda _1, \cdots , \lambda _K)\) reflecting the confidence we have in our initial estimate. A diagonal matrix is chosen for this purpose since the means of the arms are independent of each other. Thus, the (contextual) estimate of \(\varvec{\theta }_\star \) is
$$\begin{aligned} \hat{\varvec{\theta }}_{t+1} = \hat{{\varvec{\mu }}} + \varvec{V}_{t+1}^{-1}\sum _{s=1}^t (r_s - \hat{{\varvec{\mu }}}^T\varvec{x}_s)\varvec{x}_s~. \end{aligned}$$
Now since \(\varvec{x}_s = \varvec{e}_{i_s}\), and noticing that \(\varvec{e}_i\varvec{e}_i^T = \text {diag}(\mathbb {1}(i=1),\cdots , \mathbb {1}(i=K))\) for all \(i \in [K]\), i.e., a matrix with all zero entries except at entry (i, i) with value 1, we have
$$\begin{aligned} \varvec{V}_{t+1} = \varvec{V}_1 + \sum _{s=1}^t \varvec{e}_{i_s}\varvec{e}_{i_s}^T = \text {diag}(\lambda _1 + T_1, \cdots , \lambda _K + T_K) \end{aligned}$$
and
$$\begin{aligned} \sum _{s=1}^t (r_s - \hat{{\varvec{\mu }}}^T\varvec{e}_{i_s})\varvec{e}_{i_s} = \begin{bmatrix} w_1&\cdots&w_K\end{bmatrix}^T, \end{aligned}$$
where \(T_i\) is the number of times arm i is pulled and \(w_i = \sum _{s=1}^t (r_s - {\hat{\mu }}_i) \mathbb {1}(i_s = i) = \sum _{s=1}^t r_s\mathbb {1}(i_s = i) - T_{i}{\hat{\mu }}_i\) is the total sum of all the reward differences observed by arm i. Therefore, the estimate of the weight is
$$\begin{aligned} \hat{\varvec{\theta }}_{t+1} = \begin{bmatrix} \frac{\lambda _1{\hat{\mu }}_1 + \sum _{s=1}^t r_s\mathbb {1}(i_s=1)}{\lambda _1 + T_1}&\cdots&\frac{\lambda _K{\hat{\mu }}_K + \sum _{s=1}^t r_s\mathbb {1}(i_s=K)}{\lambda _K + T_K}\end{bmatrix}^T~. \end{aligned}$$
This result can be interpreted as follows: for each arm \(i\in [K]\), our estimate of the true mean \(\theta _i\) is its sample mean with pseudo-observations of mean \({\hat{\mu }}_i\) worth \(\lambda _i\) observations. Indeed, when we choose \(\lambda _i=0\) for all \(i\in [K]\), we recover each arm's mean estimate as typically calculated by a non-contextual \(\epsilon \)-greedy algorithm. With this, when we exploit, we choose an arm which maximises \(\hat{\varvec{\theta }}^T \varvec{x}(i) = \hat{\varvec{\theta }}^T \varvec{e}_i = {\hat{\theta }}_i\), which is the same as what is performed in the non-contextual case. \(\square \)
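The pseudo-observation interpretation above can be illustrated numerically; the arm counts, reward sums, and \(\lambda _i\) values below are our own illustrative numbers:

```python
import numpy as np

K = 3
lam = np.array([2.0, 2.0, 2.0])      # lambda_i: pseudo-observation counts
mu_hat = np.array([0.5, 0.5, 0.5])   # initial guess of each arm's mean

# Contextual formulation: arm i's context is the standard basis vector e_i.
contexts = np.eye(K)

# Hypothetical pull counts and per-arm reward sums after t rounds.
T_i = np.array([10, 0, 5])
reward_sums = np.array([7.0, 0.0, 1.0])

# Estimate from the reduction: each arm's sample mean with lambda_i
# pseudo-observations of value mu_hat_i mixed in.
theta_hat = (lam * mu_hat + reward_sums) / (lam + T_i)

# An arm never pulled keeps its warm-start guess exactly.
unpulled_keeps_prior = (theta_hat[1] == mu_hat[1])
# With lambda_i = 0 we would recover the plain sample mean for pulled arms.
plain_mean_arm0 = reward_sums[0] / T_i[0]
```

As the pull count \(T_i\) grows, the pseudo-observations are washed out and the estimate converges to the ordinary sample mean.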
Since a non-contextual bandit can be formulated as a contextual bandit, our approach may be applied to warm-start a non-contextual bandit. Its lower bound when the \(\epsilon \)-greedy algorithm is used follows the lower bound of contextual \(\epsilon \)-greedy, with \(\Delta _{i,t} = \bar{\Delta }_i\) for all t, since the mean reward (hence the regret of each arm) is stationary across t. In other words, Lemma 4 is a fundamental lower bound in our warm-start setting also.
4 Experiments
We now report on a comprehensive suite of experimental evaluations of our warm-start framework against a number of baselines and datasets. We are interested in the benefit of warm start over cold start—in such cases we focus on short-term performance differences, as poor early performance is a practical limitation of bandits in high-stakes applications. We also explore the impact of prior misspecification as a potential risk of incorrect warm start. We summarise our experiments next and then describe them, with results, in more detail below.
Datasets Experiments in database index selection explore the effect of warm start in selecting a single index per round, where queries arrive at the database in batches and rewards correspond to (negative) execution time. We use a commercial database system and the standard TPC-H benchmark [29]. Results on two OpenML datasets (Letters and Numbers) test bandits on online multi-class classification, a benchmark previously used to evaluate the ARRoW warm-start technique [32]. These datasets are advantageous to ARRoW in that they supply the (restrictive) kind of prior knowledge it needs—supervised pretraining. Experiments on synthetic data provide sufficient control of the environment to explore limitations of our warm-start approach.
Baselines On the database index selection task, we use cold-start TS as a natural and fair baseline. On the OpenML datasets we include the ARRoW warm-start framework, which was originally tested in the same way. We also demonstrate the performance of both frameworks on the \(\epsilon \)-greedy and LinUCB learners, as well as LinTS. Throughout, cold start corresponds to having no pretraining dataset (i.e., Algorithm 1); hot start, in the synthetic experiments, corresponds to knowing the pretraining parameter \(\varvec{\mu }_\star \) with 100% accuracy; and warm start corresponds to having an estimate \(\hat{\varvec{\mu }}\) of the pretraining parameter \(\varvec{\mu }_\star \). By its very nature, we can only produce hot-start results with the artificial dataset, since 100% accuracy on the pretraining parameter would require an infinite number of observations in the real-world database index selection problem.
Hardware All experiments are performed on a commodity laptop equipped with an Intel Core i7-6600U (2 cores, 2.60 GHz base, 2.81 GHz boost), 16 GB RAM, and a 256 GB disk (SanDisk X400 SSD) running Windows 10. In database experiments, we report cold runs only: we clear database buffer caches prior to query execution—the memory setting thus does not impact our findings.
4.1 Database index selection
As the real-world problem of database index selection motivated this work, we begin with a demonstration in this setting. In a database management system, an index is a data structure used to speed up the execution of a set of queries (a.k.a. a workload). While a huge space of possible indices could be considered, only a few can actually be created due to memory constraints (since each index occupies space in memory). With a tremendous number of candidate indices, it is impractical for humans to decide which indices to create without assistance. A recent effort has been made to automate this task by using bandits [21] to propose an optimal set of indices to boost workload execution. We adopt this framework in our work and extend it to support warm start. The aim of this experiment is to demonstrate that the warm-started bandit yields performance similar to the cold-started bandit in the long run while performing better in earlier rounds. The consequence of such a demonstration is a system more suitable for deployment.
In particular, our problem setting is as follows. At round \(t=1,2,\ldots ,T\), we observe a workload \(W_t\) with a set of queries, and the system recommends one index \(i_t\) out of the set of all possible indices \({\mathcal {I}}\). After index \(i_t\) is created, we execute the queries in workload \(W_t\). Our chosen aim is to minimise the query execution time, noting that we do not take into account the time it takes to create the index \(i_t\). After \(W_t\) is executed, the index \(i_t\) is dropped and the buffer is cleaned.
In this paper, the adopted database comes from the TPC-H benchmark [29]. This publicly available industrial benchmark comes with a set of predefined query templates. A query template is a parameterised query whose parameter values (a.k.a. conditions) are missing, keeping only the structure of the query and leaving numeric and string values as variables. We chose five query templates at random and instantiated them with actual parameter values in each round. These queries are used as the workload in both the pretraining and deployment phases.
It should be noted that the values of R and S are unknown in the real-world dataset. In this case, we treat them as hyperparameters which need to be chosen, in addition to \(\alpha \).
In running this experiment, we have used the context features as described by Perera et al. [21], with the reward being the performance gain \(t_{no\_index} - t_{i}\), where \(t_{no\_index}\) corresponds to the execution time of the whole workload without any indices and \(t_i\) to the execution time of all queries in the workload using index i.
Since the optimal index is unknown, the regret cannot be computed for each round. Therefore, with this real-world experiment, we present the average execution time (loss) of workload \(W_t\) under each algorithm's recommendations, shown in Fig. 1.
Results It can be seen that the warm-started LinTS outperforms the cold-started LinTS, both in short-term rounds and cumulatively. This can be explained by the query templates used to pretrain the warm-started bandit resembling the templates used in the testing dataset. As a result, the warm-started bandit's initial guess \(\varvec{\theta }_1 = \hat{\varvec{\mu }}\) is closer to the actual weight \(\varvec{\theta }_\star \) than the cold-started bandit's initial guess \(\varvec{\theta }_1 = \varvec{0}\).
4.2 OpenML classification dataset
We chose two of the datasets used in [32], which correspond to letter and number identification, respectively. We split the data so that 10% is used as supervised learning examples and the remaining 90% as the actual bandit rounds. This split favours ARRoW [32], since supervised examples are its only permissible form of prior knowledge. We evaluate all learners presented in this paper on these datasets: \(\epsilon \)-greedy, LinUCB and LinTS. As for hyperparameters, we used \(\epsilon = 0.0125\) for \(\epsilon \)-greedy, \(\rho R = 0.2\) for LinUCB, and for LinTS \(\beta _t(\delta )=1\) on the Letters dataset and \(\beta _t(\delta )=0.05\) on the Numbers dataset, with \(R=0.25\). All of these hyperparameters were found by grid search.
As described in [32], we transform each dataset into one suitable for evaluating bandit algorithms by mapping the classes to arms, with the cost of each class given by \(c(a) = \mathbb {1}(a\ne y)\) for example (x, y). For the classification problem, we also modify our bandit algorithms, which usually share their parameter across arms: since the context of each arm is the same in the classification task, we instead maintain per-arm parameters, yielding a disjoint bandit in which arm i has weight \(\varvec{\theta }_{i,t}\). Its reward is then modelled as \(r_t(i) = \varvec{\theta }_{i,\star }^T\varvec{x}_t(i) + \epsilon _t(i)\).
We have used the term cost instead of reward for these datasets, which requires minor modification of the learners: we change the argmax operation into argmin and, in the case of LinUCB, the upper confidence bound in Line 5 into the lower confidence bound \(\hat{\varvec{\theta }}_{i,t}^T\varvec{x}_t(i) - \rho R\sqrt{\varvec{x}_t^T(i)\varvec{V}_{t}^{-1}\varvec{x}_t(i)}\).
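The argmin/lower-confidence-bound modification for the disjoint model can be sketched as below. `select_arm_lcb` is an illustrative helper rather than code from the paper; `thetas[i]` and `Vs[i]` are the per-arm estimate and design matrix, and `contexts[i]` plays the role of \(\varvec{x}_t(i)\).

```python
import numpy as np

def select_arm_lcb(thetas, Vs, contexts, rho_R):
    """Cost-minimising arm selection for the disjoint LinUCB variant: pick
    the arm with the smallest lower confidence bound
    theta_i^T x_t(i) - rho*R*sqrt(x_t(i)^T V_t^{-1} x_t(i))."""
    lcbs = []
    for theta, V, x in zip(thetas, Vs, contexts):
        width = rho_R * np.sqrt(x @ np.linalg.solve(V, x))  # confidence width
        lcbs.append(theta @ x - width)                      # lower bound on cost
    return int(np.argmin(lcbs))                             # argmin, not argmax
```

Replacing `argmin` and the sign of the width recovers the usual reward-maximising selection.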
The ARRoW algorithm presented in [32] is also executed in a restricted form, with the size of the class \(\Lambda \) set to 1. For fairness, we chose the best-performing \(\lambda \) to compare against our algorithm. We note that the sensitivity analysis in Figs. 3 and 4 demonstrates that these choices are generally not very important.
We follow a suggestion of the original ARRoW paper to evaluate [32, Algorithm Line 5], computing
where f(x, a) is a linear function and \({\mathcal {F}}\) is the class of all linear functions; the solution can be obtained via weighted linear regression.
Another algorithm we used for comparison is that of Li et al. [16], hereafter labelled WWW’21 for convenience (denoting the publication venue). This algorithm employs virtual plays in every round, sampling the context according to a CDF \(F_X(\varvec{x})\) estimated by its empirical counterpart \({\hat{F}}_X(\varvec{x})\), which is ultimately equivalent to random sampling of the seen contexts with replacement. Feedback is provided by an offline evaluator whenever the online confidence band is wider than the offline counterpart. Virtual plays continue until the offline evaluator provides no feedback.
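The virtual-play sampling step reduces to resampling observed contexts; a minimal sketch of just that step follows (the offline evaluator and the confidence-band comparison are omitted, and the function name is our own).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_virtual_contexts(seen_contexts, n_virtual):
    """Sampling from the empirical CDF F_hat_X is equivalent to drawing the
    seen contexts uniformly at random with replacement."""
    seen = np.asarray(seen_contexts, dtype=float)
    idx = rng.integers(0, len(seen), size=n_virtual)  # uniform with replacement
    return seen[idx]
```

In the full algorithm of Li et al. [16], each sampled context triggers a virtual play whose feedback comes from the offline evaluator rather than the environment.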
We present the results for the OpenML datasets in Fig. 2, where we have labelled our algorithm diff, reflecting the fact that it models the difference between the true parameter and the guessed weight. It can be seen that our algorithm performs as well as previous algorithms, while still offering the flexibility to choose the initial guess.
Sensitivity analysis for this experiment (with accurate prior) is presented in Figs. 3 and 4. As mentioned, neither ARRoW nor our warm-start approach is very sensitive to its hyperparameters, while the algorithm proposed by Li et al. [16] does not require any hyperparameter tuning. These results also support our choice of \(\alpha = 10^7\) across these experiments.
Effect of warm-start on exploration hyperparameters In this section, we present the final cumulative cost as a measure of the performance of warm-started bandits under different exploration hyperparameters. As previously observed from Figs. 3 and 4, the temperature hyperparameter does not appear to have a significant impact on final performance; thus, for this analysis, we again fixed \(\alpha = 10^7\). We re-ran the experiment on both the Letters and Numbers datasets using the \(\epsilon \)-greedy, LinUCB, and LinTS algorithms, varying the exploration hyperparameters \(\epsilon \), \(\rho R\) and \(\beta \), respectively. The results, shown in Fig. 5, suggest that lower values of the exploration hyperparameters are preferred. This is intuitive, since a goal of warm-starting bandits is to reduce the demand for exploration during initial rounds. The effect is most prominent in the \(\epsilon \)-greedy algorithm, which can be explained by the fact that exploration in \(\epsilon \)-greedy is dictated entirely by the value of \(\epsilon \), while in LinUCB and LinTS the exploration terms are partly driven by the matrix \(\varvec{V}_t\), which initially depends on the covariance matrix \(\varvec{\Sigma }_\mu \). Therefore, in \(\epsilon \)-greedy we recommend ‘manually’ reducing the exploration hyperparameter \(\epsilon \), while in LinUCB exploration is partially reduced automatically, thanks to the smaller exploration boost when \(\varvec{\Sigma }_\mu \) has smaller eigenvalues.
Effect of pretraining data ratio on performance As previously done by Zhang et al. [32], we can vary the fraction of the dataset available for pretraining. In this section, we present how the cumulative cost evolves as the ratio of the pretraining dataset to the total dataset changes, where the total dataset refers to the union of the pretraining dataset and the bandit deployment dataset. In particular, we investigate performance for each ratio in \(\{0, 0.001, 0.002, 0.003, 0.004, 0.005, 0.01, 0.05, 0.1\}\). For fairness, all experiments with different ratios on the same dataset share the same deployment data; thus the maximum ratio in the experiment, 0.1, determines the deployment dataset. Since there are 20,000 examples in the Letters dataset and 2000 in the Numbers dataset, we used the last 18,000 and 1800 examples, respectively. Figure 6 supports the intuition that higher ratios likely lead to better performance. This effect is particularly apparent during the initial increase, while the gain gradually fades as the ratio increases further. This diminishing return can be explained by the fact that the largest improvement in the accuracy of \(\theta \) occurs at the beginning of supervised learning, whereas its accuracy, while still increasing, improves more slowly as more data are observed.
Effect of misspecified pretraining data ratio on performance We carried out a series of experiments investigating sensitivity to the warm-start temperature and exploration hyperparameters, and also investigated the effect of the fraction of the dataset used for pretraining in both settings: accurate prior and misspecified prior.
We investigated the effect of a misspecified prior with both datasets. For this, we needed to create another dataset in which the true weight \(\varvec{\theta }_\star \) differs from the deployment dataset's. To do this, we trained a linear regression on the whole dataset for each arm i, giving the disjoint parameter \(\varvec{\theta }_1(i)\), which was then transformed by a rotation matrix \(\varvec{R}_\gamma \) to give a new parameter \(\varvec{\theta }_2(i) = \varvec{R}_\gamma \varvec{\theta }_1(i)\). For each datum at round t used for pretraining, we extracted the context \(\varvec{x}_t(i)\) for all arms, then calculated \(d_r(i) = (\varvec{\theta }_2(i)-\varvec{\theta }_1(i))^T\varvec{x}_t(i)\). This acts as a perturbation of the original reward \(r_t(i)\), yielding the inaccurate reward \(r_t'(i) = r_t(i) + d_r(i)\). In our data generation, we calculated the similarity between the two parameters, yielding \(\cos (\varvec{\theta }_1, \varvec{\theta }_2) = \frac{\langle \varvec{\theta }_1, \varvec{\theta }_2 \rangle }{\Vert \varvec{\theta }_1\Vert \Vert \varvec{\theta }_2\Vert }= \frac{1}{\sqrt{2}}\) for all arms and both datasets. This consistent rotation attempts to maintain a similar amount of misspecification across datasets; however, as we shall see, properties of the data interact with the magnitude of the perturbation.
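To illustrate, one way to realise such a rotation is sketched below. This is a hypothetical reconstruction, not the paper's \(\varvec{R}_\gamma \): rotating inside the plane spanned by \(\varvec{\theta }_1\) and an orthogonal direction preserves the norm and makes the cosine similarity exactly \(\cos \gamma = 1/\sqrt{2}\) at \(\gamma = \pi /4\).

```python
import numpy as np

def rotate_parameter(theta, gamma=np.pi / 4):
    """Rotate theta by angle gamma within the plane spanned by theta and a
    unit vector orthogonal to it, so cos(theta, theta_rotated) = cos(gamma)
    and the norm is preserved. Illustrative; the paper's R_gamma may differ."""
    theta = np.asarray(theta, dtype=float)
    u = np.zeros_like(theta)
    u[np.argmin(np.abs(theta))] = 1.0                  # any direction not parallel
    u = u - (u @ theta) / (theta @ theta) * theta      # Gram-Schmidt step
    u /= np.linalg.norm(u)                             # unit, orthogonal to theta
    return np.cos(gamma) * theta + np.sin(gamma) * np.linalg.norm(theta) * u

def perturbed_reward(r, theta1, theta2, x):
    """Inaccurate pretraining reward r'(i) = r(i) + (theta2 - theta1)^T x."""
    return r + (theta2 - theta1) @ x
```

With \(\gamma = \pi /4\) this reproduces the reported similarity \(1/\sqrt{2}\) for any parameter vector.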
Due to the nature of the semi-synthetic dataset generation process, the reward may no longer lie in \(\{0,1\}\) as originally generated from the classification problem. This observation does not affect the validity of the model, or the appropriateness of warm start in this setting, thanks to the flexibility of reward structures accommodated.
We present our results in Fig. 7. Unlike the previous experiment, we no longer have the privilege of pretraining data that closely resemble the deployment data. It can be seen that for the Letters dataset, some warm-starting provides a modest initial boost to performance, while warm-starting appears to hurt performance on the Numbers dataset.
4.3 Synthetic experiments
In generating the artificial dataset, we started by choosing a value for \(\varvec{\theta }_\star \). In this case, we chose \(\varvec{\theta }_\star ^T = \begin{bmatrix} 0.1&0.3&0.5&0.7&0.9 \end{bmatrix}\), with the bandit having 10 arms. After \(\varvec{\theta }_\star \) is chosen, we generate a random vector \(\varvec{x}_t(i) \in {\mathbb {R}}^d,\,d=5\), with each element drawn from the uniform distribution U(0, 1), for each \(i = 1, 2, \ldots , 10\); we then take the inner product and add Gaussian noise \(\epsilon _i(t) \sim {\mathcal {N}}(0, R^2),\,R=0.25\), independent of the arm i and round t. The noisy reward \(r_i(t) = \varvec{\theta }_\star ^T \varvec{x}_t(i) + \epsilon _i(t)\) is saved, as is the regret of pulling arm i, namely \(\varvec{\theta }_\star ^T \varvec{x}_t(i) - \max _{j \in [k]} \varvec{\theta }_\star ^T \varvec{x}_t(j)\). This makes it possible to compare all bandit algorithms equally without needing off-policy evaluation. We repeat this process 100,000 times, corresponding to 100,000 rounds of the second-phase dataset.
To generate the pretraining dataset, we first choose the value of \(\alpha ^{-1}\), before sampling the true parameter deviation \(\varvec{\delta }_\star \sim {\mathcal {N}}(\varvec{0}, \alpha ^{-1}\varvec{I})\). After \(\varvec{\delta }_\star \) is sampled, we calculate \(\varvec{\mu }_\star = \varvec{\theta }_\star - \varvec{\delta }_\star \) and follow exactly the process used to generate the second-phase dataset. We generated two types of pretraining dataset: accurate prior, where we chose \(\alpha ^{-1} = 10^{-4}\), and misspecified prior, where we chose \(\alpha ^{-1} = 0.25\). We produced 10,000 rounds' worth of pretraining data.
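The two-phase generation process can be sketched as follows. This is a minimal numpy reconstruction with far fewer rounds than the paper's 100,000 and 10,000, and with an arbitrary random seed; the exact random streams are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, R = 5, 10, 0.25
theta_star = np.array([0.1, 0.3, 0.5, 0.7, 0.9])

def generate_rounds(theta, n_rounds):
    """Each round: k contexts with U(0,1) entries, noisy linear rewards, and
    the (non-positive) regret of pulling each arm versus the best arm."""
    X = rng.uniform(0, 1, size=(n_rounds, k, d))          # x_t(i)
    means = X @ theta                                     # theta^T x_t(i)
    rewards = means + rng.normal(0, R, size=(n_rounds, k))
    regrets = means - means.max(axis=1, keepdims=True)    # <= 0 by construction
    return X, rewards, regrets

# Deployment (second-phase) data from theta*, then pretraining data generated
# identically but from mu* = theta* - delta*, delta* ~ N(0, alpha^{-1} I).
X_dep, r_dep, reg_dep = generate_rounds(theta_star, 1_000)   # paper: 100,000
alpha_inv = 1e-4                                             # accurate-prior regime
delta_star = rng.normal(0, np.sqrt(alpha_inv), size=d)
mu_star = theta_star - delta_star
X_pre, r_pre, _ = generate_rounds(mu_star, 500)              # paper: 10,000
```

Setting `alpha_inv = 0.25` instead reproduces the misspecified-prior regime described above.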
We observed that, with datasets generated under both the accurate and misspecified prior regimes, \(\alpha = 10\) appears to be the cutoff point at which all algorithms work quite well. We therefore plot the cumulative regret of all warm-starting methods at \(\alpha = 10\), as shown in Fig. 8.
Results In the accurate prior regime, it is clear that the hot-started and warm-started bandits outperform the cold-started bandit. This can be explained by the fact that \(\varvec{\theta }_\star \) is closer to \(\hat{\varvec{\mu }}\) or \(\varvec{\mu }_\star \) than to \(\varvec{0}\). The opposite occurs when the prior is misspecified: the cold-start bandit slightly outperforms the hot-started and warm-started bandits, since \(\varvec{\theta }_\star \) is then closer to \(\varvec{0}\) than to \(\hat{\varvec{\mu }}\) or \(\varvec{\mu }_\star \).
It should also be noted that we have held the hyperparameter \(\alpha \) fixed across all regimes here. When \(\alpha \) is tuned optimally, the hot-started and warm-started bandits can perform even better, as the pretraining dataset is then treated as if it were the real dataset.
5 Towards adaptive drift hyperparameter
In this section, we take a closer look at a key hyperparameter of our warm-start algorithms: the drift hyperparameter \(\alpha \), which controls how much exploration follows pretraining. While it has so far been set manually, based on how well the operator believes pretraining is aligned with deployment, in practice this parameter may sometimes be difficult to set.
Limitations of the current approach The advantage of our current approach to warm-starting, as applied in Algorithms 2, 3 and 4, centres on the selection of the drift hyperparameter. This drift hyperparameter \(\alpha \) serves as a temperature: how much can we trust the initial weight guess? With an accurate prior, a sufficiently large value of \(\alpha \) gives the bandit an early advantage in the deployment phase, as unnecessary exploration is eliminated. On the other hand, although the warm-started bandit is somewhat insensitive to \(\alpha \) under an accurate prior, its sensitivity grows substantially when the prior is highly misspecified; a large \(\alpha \) makes the bandit retain its highly misaligned initial guess and resist changes driven by observations. It is therefore advantageous to choose a value of \(\alpha \) not too far from its optimum. Alternatively, we may attempt to adapt \(\alpha \) based on data, the approach adopted in this section.
Empirical Bayes We choose the value of \(\alpha \) using the fact that, even though this hyperparameter is completely unknown before the deployment phase starts, a better estimate can be made as more deployment-phase data are observed. If the data match how the initial weight was chosen, we may decide to place more trust in \(\hat{{\varvec{\mu }}}\) (large \(\alpha \)). On the other hand, we may decide to doubt our initial weight when the observed data do not support it (small \(\alpha \)). This strategy invites the adoption of empirical Bayes, a general method of using observations to estimate or set prior distributions.
Assumptions In an attempt to do this, we make a hierarchical structure assumption such that \(\bar{\varvec{\delta }}\mid \alpha \sim {\mathcal {N}}(\varvec{0}, \alpha ^{-1}\varvec{I}_d)\), where \(\alpha \sim \Gamma (\bar{\alpha }, \bar{\beta })\) for convenience. Furthermore, in order to obtain a well-known distribution, we also assume that \(\varvec{\theta }_{\star } = \hat{{\varvec{\mu }}}+ \bar{\varvec{\delta }}_{\star }\), represented by the random variable \(\varvec{\theta } = \hat{{\varvec{\mu }}}+ \bar{\varvec{\delta }}\) for deterministic \(\hat{{\varvec{\mu }}}\), where the dissimilarity between \(\hat{{\varvec{\mu }}}\) and \(\varvec{\theta }_\star \) is captured by the random variable \(\alpha \) embedded in \(\bar{\varvec{\delta }}\). Compared to the initial assumption, \(\alpha \) is now treated as a random variable, and the variance of the initial guess \(\varvec{\Sigma }_{\mu }\) is now absorbed into, and partially represented by, \(\alpha \).
Lemma 6
With the above assumptions, the marginal \(\bar{\varvec{\delta }}\) follows a multivariate Student-t distribution with degrees of freedom \(\nu _t\), location \(\varvec{\mu }_t\) and scale matrix \(\varvec{\Sigma }_t\), denoted \(St(\nu _t, \varvec{\mu }_t, \varvec{\Sigma }_t)\), with \(\nu _t = 2\bar{\alpha } = 2\bar{\beta }\alpha _t\), \(\varvec{\mu }_t = \varvec{0}\), \(\varvec{\Sigma }_t = \frac{\bar{\beta }}{\bar{\alpha }}\varvec{I}_d = \alpha _t^{-1}\varvec{I}_d\).
Proof
Firstly, notice that in the case of a one-dimensional (scalar) weight, the joint distribution of \((\bar{\delta }, \alpha )\) collapses to a normal-gamma distribution. It is a standard result that the marginal distribution of \(\bar{\delta }\) then follows a non-standardised Student-t distribution with degrees of freedom \(\nu _t = 2\bar{\alpha }\), location \(\mu _t = \mu \) and scale \(\sigma ^2_t = \frac{\bar{\beta }}{\bar{\alpha }}\), so we expect a similar result for multidimensional \(\bar{\varvec{\delta }}\).
To prove the main result, we compute the required marginal density by marginalising \(\alpha \) out of the joint distribution, itself obtained by multiplying the model's likelihood and prior; we note that the integrand of the fifth equation is the pdf of a gamma distribution with shape \(\bar{\alpha }+\frac{d}{2}\) and rate \(\bar{\beta }+\frac{1}{2}\bar{\varvec{\delta }}^T\bar{\varvec{\delta }}\), and hence integrates to 1:
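A sketch of this computation, reconstructed from the stated likelihood \({\mathcal {N}}(\bar{\varvec{\delta }}; \varvec{0}, \alpha ^{-1}\varvec{I}_d)\) and prior \(\Gamma (\alpha ; \bar{\alpha }, \bar{\beta })\) rather than copied from the paper's own numbered display:

```latex
\begin{align*}
p(\bar{\boldsymbol{\delta}})
  &= \int_0^\infty p(\bar{\boldsymbol{\delta}} \mid \alpha)\, p(\alpha)\, \mathrm{d}\alpha
   = \int_0^\infty \frac{\alpha^{d/2}}{(2\pi)^{d/2}}
       e^{-\frac{\alpha}{2}\bar{\boldsymbol{\delta}}^T\bar{\boldsymbol{\delta}}}
     \cdot \frac{\bar{\beta}^{\bar{\alpha}}}{\Gamma(\bar{\alpha})}\,
       \alpha^{\bar{\alpha}-1} e^{-\bar{\beta}\alpha}\, \mathrm{d}\alpha \\
  &= \frac{\bar{\beta}^{\bar{\alpha}}}{(2\pi)^{d/2}\,\Gamma(\bar{\alpha})}
     \int_0^\infty \alpha^{\bar{\alpha}+\frac{d}{2}-1}
       e^{-\left(\bar{\beta}+\frac{1}{2}\bar{\boldsymbol{\delta}}^T\bar{\boldsymbol{\delta}}\right)\alpha}\,
       \mathrm{d}\alpha
   = \frac{\Gamma\!\left(\bar{\alpha}+\frac{d}{2}\right)\bar{\beta}^{\bar{\alpha}}}
          {(2\pi)^{d/2}\,\Gamma(\bar{\alpha})}
     \left(\bar{\beta}+\tfrac{1}{2}\bar{\boldsymbol{\delta}}^T\bar{\boldsymbol{\delta}}\right)^{-\bar{\alpha}-\frac{d}{2}} \\
  &\propto \left(1 + \frac{1}{2\bar{\alpha}}\,
       \bar{\boldsymbol{\delta}}^T \left(\tfrac{\bar{\beta}}{\bar{\alpha}}\varvec{I}_d\right)^{-1}
       \bar{\boldsymbol{\delta}}\right)^{-\frac{2\bar{\alpha}+d}{2}},
\end{align*}
```

The final kernel is that of \(St(2\bar{\alpha }, \varvec{0}, \frac{\bar{\beta }}{\bar{\alpha }}\varvec{I}_d)\), in agreement with the lemma.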
which is a multivariate t-distribution with \(\nu _t=2\bar{\alpha }\), \(\varvec{\mu }_t=\varvec{0}\) and \(\varvec{\Sigma }_t=\alpha _t^{-1}\varvec{I}_d=\frac{\bar{\beta }}{\bar{\alpha }}\varvec{I}_d\). Therefore, we conclude that \(\bar{\varvec{\delta }}\sim St(2\bar{\alpha }, \varvec{0}, \frac{\bar{\beta }}{\bar{\alpha }}\varvec{I}_d)\), i.e., a Student-t distribution with zero mean and spherical covariance. Notice that we can express \(\nu _t\) in terms of \(\alpha _t\) and \(\bar{\beta }\) as \(\nu _t = 2\bar{\beta }\alpha _t\), since \(\alpha _t=\frac{\bar{\alpha }}{\bar{\beta }}\). By setting the hyperparameters in terms of \((\alpha _t, \bar{\beta })\), we control the prior of \(\alpha \) by its mean \(\alpha _t\) and variance \(\frac{\alpha _t}{\bar{\beta }}\), which is more intuitive than its shape and rate \((\bar{\alpha },\bar{\beta })\). \(\square \)
Following Song and Xia [27], we adopt a noise model such that
Adaptive hyperparameter algorithm Since \(\bar{\varvec{\delta }}\) follows a Student-t distribution, our assumptions match the premise laid out by Song and Xia [27]. Writing \(\varvec{X} = \begin{bmatrix}\varvec{x}_1&\cdots&\varvec{x}_n\end{bmatrix}^T\) and \(\varvec{y} = \begin{bmatrix}y_1&\cdots&y_n\end{bmatrix}^T\), the values of \(\alpha _t\) and \(\beta _t\) can then be optimised by the qEM algorithm of Song and Xia [27], summarised in Algorithm 5. This algorithm takes \(\bar{\beta }\) as its hyperparameter, which controls the degrees of freedom of the underlying distribution of \(\bar{\varvec{\delta }}\): when a Gaussian distribution of \(\bar{\varvec{\delta }}\) is preferred, we let \(\nu _t \rightarrow \infty \) by letting \(\bar{\beta }\rightarrow \infty \), recovering the Gaussian distribution from the t-distribution.
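The spirit of adapting \((\alpha _t, \beta _t)\) from observed data can be illustrated with the classic evidence-approximation fixed point for Bayesian linear regression. This is a well-known stand-in, not the qEM algorithm of Song and Xia [27]; it shows how both precision hyperparameters can be re-estimated from data alone.

```python
import numpy as np

def evidence_update(X, y, alpha=1.0, beta=1.0, n_iter=50, tol=1e-6):
    """Evidence-approximation (MacKay-style) fixed-point updates for the
    prior precision alpha and noise precision beta of Bayesian linear
    regression. NOT the qEM algorithm of [27]; an illustrative stand-in."""
    n, d = X.shape
    G = X.T @ X
    eigvals = np.linalg.eigvalsh(G)                     # spectrum of the Gram matrix
    m = np.zeros(d)
    for _ in range(n_iter):
        A = alpha * np.eye(d) + beta * G                # posterior precision
        m = beta * np.linalg.solve(A, X.T @ y)          # posterior mean
        gamma = np.sum(beta * eigvals / (alpha + beta * eigvals))
        alpha_new = gamma / (m @ m)                     # effective # parameters
        beta_new = (n - gamma) / np.sum((y - X @ m) ** 2)
        converged = (abs(alpha_new - alpha) < tol and abs(beta_new - beta) < tol)
        alpha, beta = alpha_new, beta_new
        if converged:
            break
    return alpha, beta, m
```

As in the text's intuition, data that support the current mean estimate drive the fitted precision up, while poorly supported estimates drive it down.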
Some steps in Algorithm 5 require expensive computations. To mitigate these costs, Song and Xia [27] suggest diagonalising the Gram matrix \(\varvec{X}^T\varvec{X} = \varvec{P}\varvec{D}\varvec{P}^T\) and computing the following quantities beforehand:
The required quantities in each iteration can then be calculated as:
where the Woodbury matrix identity is used in the second equation and the cyclic property of the trace operation is used in the fourth equation. We argue that when one wishes to store \(\varvec{X}^T\varvec{X}\) and not \(\varvec{X}\), the second term of the last equality can be calculated as
These quantities can then be used to calculate \(b_{opt}\) and \(c_{opt}\), which yield new values of \(\alpha _t\) and \(\beta _t\), iterating until convergence.
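As a concrete check of the diagonalisation trick (an illustrative sketch, not the paper's exact precomputation): with \(\varvec{X}^T\varvec{X} = \varvec{P}\varvec{D}\varvec{P}^T\) computed once, every inverse \((\alpha \varvec{I} + \beta \varvec{X}^T\varvec{X})^{-1}\) needed across iterations is available without a fresh matrix inversion.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.normal(size=(n, d))

# One-off eigendecomposition of the Gram matrix: X^T X = P diag(D) P^T.
D, P = np.linalg.eigh(X.T @ X)

def inv_via_diagonalisation(alpha, beta):
    """(alpha I + beta X^T X)^{-1} = P diag(1/(alpha + beta D)) P^T.
    Its trace also comes for free as sum(1/(alpha + beta D))."""
    return P @ np.diag(1.0 / (alpha + beta * D)) @ P.T
```

Each iteration then only rescales the precomputed spectrum, rather than re-inverting a \(d \times d\) matrix.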
Regret bound Algorithm 5 may be invoked at the start of each round to give updated values of \(\alpha _t\) and \(\beta _t\). However, under this adaptive scheme, \(\alpha ^{-1}\) is no longer independent of the other variables. This violates one of the assumptions made in [1], where the choice of \(\lambda \) is independent of the other variables, so the validity of the oversampling factor becomes questionable. As the regret analysis for LinTS depends on the validity of the upper bound provided by [1], it in turn becomes invalid as well. Regret analysis for the adaptive case thus becomes another open problem. A possible remedy is to halt the hyperparameter optimisation after a certain number of rounds, in which case \(\alpha ^{-1}\) may be viewed as constant in the long run, as a direct consequence of Theorems 2 and 3.
Corollary 7
Consider a multi-armed bandit agent with hyperparameters updated as per Algorithm 5 every round up to round \(n_s\), after which no further update is invoked. Then round \(n_s+1\) can be treated as the first bandit round with constant hyperparameters.
For LinTS, this is equivalent to
with each of the term bounded as
where \(R^{RS}(n_s)\) and \(R^{RLS}(n_s)\) are constant, \(\bar{\gamma }_T(\delta ) = \bar{\beta }_T(\delta ')\sqrt{cd\log ((c'd)/\delta )} \) and \(\bar{\beta }_T(\delta )\) is the upper bound of the ellipsoid whose rounds start at \(n_s\), defined as:
where \({\bar{S}}\) is defined such that \(\Vert \hat{\varvec{\theta }}_{n_s+1}\varvec{\theta }_{\star }\Vert \le {\bar{S}}\).
For LinUCB, this is equivalent to
where \(Reg(n_s)\) is constant and \(\bar{\beta }_T(\delta )\) is defined as above.
Experimental results To demonstrate the advantage of adaptive hyperparameter tuning, we repeated the experiment on the artificial dataset. We generated two types of pretraining data: accurate, where we chose true \(\alpha ^{-1} = 10^{-4}\), and misspecified, where we chose true \(\alpha ^{-1} = 100\). Notice that such a high value of \(\alpha ^{-1}\) in the misspecified dataset is intentionally extreme, chosen to demonstrate the capability of the adaptive hyperparameter algorithm, and does not reflect a real-world setting. For the bandit, we used LinUCB with \(\rho =0.2\), initial \(\alpha _t = 1\), initial \(\beta _t=1/R^2=16\) (both held fixed over time for bandits with manually chosen hyperparameters), and \(\bar{\beta }=1\) with \(tol= 0.1\) as the convergence requirement for both \(\alpha _t\) and \(\beta _t\). As shown in Fig. 9a, the adaptive hyperparameter algorithm is capable of exploiting the accurate prior, even outperforming its non-adaptive counterpart. On the other hand, when the prior is highly misspecified (Fig. 9b), a disastrous result occurs for the warm-started bandit without automatic hyperparameters, while our adaptive algorithm detects the mismatch, ignores the initial guess, and approaches the performance that the cold-start regime would have achieved.
6 Conclusions and future work
In this paper, we have developed a flexible framework for warm-starting linear contextual bandits that inherits the flexibility of Bayesian inference in incorporating prior knowledge. Our approach generalises the linear Thompson sampler of Abeille et al. [2] by permitting arbitrary Gaussian priors, potentially improving short-term performance while maintaining the regret bound that guarantees the long-term performance of Hannan consistency. While little attention has been paid to the warm-start problem since the direction was suggested by Li et al. [15], the few existing works on warm start are far less flexible in catering to potential sources of prior knowledge, and in how uncertainty is quantified. We motivate the opportunity for warm start in the database systems domain, where bandit-based index selection could be pretrained by users prior to deployment, and we demonstrate the practical potential of warm start on a standard database benchmark. We have also contributed an approach to adapting the key hyperparameters responsible for controlling the exploration temperature, based on detected misspecification of pretraining.
Being relatively unexplored, warm-start bandits offer a number of intriguing future directions for research, well suited to the Thompson sampling framework on which our approach is built.
Adaptive oversampling factor In this paper, it is assumed that the \(\ell _2\)-norm of the parameter is bounded by S. However, this bound may not be known with confidence in some applications; in such cases the algorithms remain valid, but the bounds may not. As more data are observed, we gain information (accuracy) about \(\varvec{\delta }_\star \): the variance of the random variable \(\varvec{\delta }\) drops. One may therefore wish to bound \(\Vert \varvec{\delta }\Vert \) with some level of probability. It is interesting to note that the magnitude of S is closely related to the drift hyperparameter; potentially both quantities could be optimised jointly by a single algorithm.
Reward unit mismatch When pretraining data are provided, there is a potential difference between the units of the pretraining and deployment datasets. An interesting problem arises on noticing that the performance of a contextual bandit algorithm is measured not by how close the predicted reward is to the actual reward, but by the ranking of the arm values. As such, it is the direction of the initial guess of \(\varvec{\theta }\) that matters, not its norm. A simple solution could be to learn a constant scaling from the pretraining rewards to the deployed rewards. Ideally this scalar would be incorporated into Warm Start LinTS, provided performance is not sacrificed.
References
Abbasi-Yadkori Y, Pál D, Szepesvári C (2011) Improved algorithms for linear stochastic bandits. Adv Neural Inf Process Syst 24:2312–2320
Abeille M, Lazaric A et al (2017) Linear Thompson sampling revisited. Electron J Stat 11(2):5165–5197
Agrawal S, Chaudhuri S, Kollár L, Marathe AP, Narasayya VR, Syamala M (2004) Database tuning advisor for Microsoft SQL Server 2005. In: VLDB
Agrawal S, Goyal N (2013) Thompson sampling for contextual bandits with linear payoffs. In: International conference on machine learning, pp 127–135
Auer P, Cesa-Bianchi N, Fischer P (2002) Finite-time analysis of the multi-armed bandit problem. Mach Learn 47(2):235–256
Bouneffouf D, Parthasarathy S, Samulowitz H, Wistuba M (2019) Optimal exploitation of clustering and history information in multi-armed bandit. arXiv preprint arXiv:1906.03979
Bruno N, Chaudhuri S (2006) To tune or not to tune? A lightweight physical design alerter. In: VLDB
Bruno N, Chaudhuri S (2007) An online approach to physical design tuning. In: ICDE
Cao B, Pan SJ, Zhang Y, Yeung DY, Yang Q (2010) Adaptive transfer learning. In: AAAI, p 7
Csurka G (2017) Domain adaptation in computer vision applications. Springer, Berlin
Dageville B, Das D, Dias K, Yagoub K, Zaït M, Ziauddin M (2004) Automatic SQL tuning in Oracle 10g. In: VLDB
Das S, Grbic M, Ilic I, Jovandic I, Jovanovic A, Narasayya VR, Radulovic M, Stikic M, Xu G, Chaudhuri S (2019) Automatically indexing millions of databases in Microsoft Azure SQL database. In: SIGMOD
Kazerouni A, Ghavamzadeh M, Abbasi-Yadkori Y, Van Roy B (2017) Conservative contextual linear bandits. In: Advances in neural information processing systems 30
Lattimore T, Szepesvári C (2020) Bandit algorithms. Cambridge University Press, Cambridge
Li L, Chu W, Langford J, Schapire RE (2010) A contextual-bandit approach to personalized news article recommendation. In: WWW
Li Y, Xie H, Lin Y, Lui JC (2021) Unifying offline causal inference and online bandit learning for data driven decision. In: Proceedings of the web conference 2021, WWW ’21, association for computing machinery, New York, NY, USA, pp 2291–2303. https://doi.org/10.1145/3442381.3449982
Liu CY, Li L (2015) On the prior sensitivity of Thompson sampling. arXiv preprint arXiv:1506.03378
Ma L, Van Aken D, Hefny A, Mezerhane G, Pavlo A, Gordon GJ (2018) Query-based workload forecasting for self-driving database management systems. In: SIGMOD
Marcus R, Negi P, Mao H, Tatbul N, Alizadeh M, Kraska T (2020) Bao: learning to steer query optimizers. arXiv:2004.03814 [cs.DB]
Oetomo B, Perera M, Borovica-Gajic R, Rubinstein BI (2019) A note on bounding regret of the C\(^{2}\)UCB contextual combinatorial bandit. arXiv preprint arXiv:1902.07500
Perera RM, Oetomo B, Rubinstein BIP, Borovica-Gajic R (2021) DBA bandits: self-driving index tuning under ad-hoc, analytical workloads with safety guarantees. In: 2021 IEEE 37th international conference on data engineering, ICDE
Qin L, Chen S, Zhu X (2014) Contextual combinatorial bandit and its application on diversified online recommendation. In: SDM
Sattler KU, Schallehn E, Geist I (2004) Autonomous query-driven index tuning. In: IDEAS
Schnaitter K, Abiteboul S, Milo T, Polyzotis N (2007) Online index selection for shifting workloads. In: ICDEW
Shivaswamy P, Joachims T (2012) Multi-armed bandit problems with history. In: Artificial intelligence and statistics, PMLR, pp 1046–1054
Slivkins A (2019) Introduction to multi-armed bandits. Found Trends Mach Learn 12(1–2):1–286
Song C, Xia ST (2016) Bayesian linear regression with Student-t assumptions. arXiv preprint arXiv:1604.04434
Thompson WR (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3–4):285–294
TPC (n.d.) TPC-H benchmark. http://www.tpc.org/tpch/
Tran-Thanh L, Stein S, Rogers A, Jennings NR (2014) Efficient crowdsourcing of unknown experts using bounded multi-armed bandits. Artif Intell 214:89–111
Wang L, Wang C, Wang K, He X (2017) BiUCB: a contextual bandit algorithm for cold-start and diversified recommendation. In: 2017 IEEE international conference on big knowledge (ICBK), IEEE, pp 248–253
Zhang C, Agarwal A, Daumé III H, Langford J, Negahban S (2019) Warm-starting contextual bandits: robustly combining supervised and bandit feedback. In: Proceedings of the 36th international conference on machine learning, pp 7335–7344
Zilio DC, Rao J, Lightstone S, Lohman GM, Storm AJ, GarciaArellano C, Fadden S (2004) DB2 design advisor: integrated automatic physical database design. In: VLDB
Acknowledgements
We gratefully acknowledge support from the Australian Research Council Discovery Project DP220102269, as well as Discovery Early Career Researcher Award DE230100366.
Funding
Open Access funding enabled and organized by CAUL and its Member Institutions
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
A preliminary abridged version of this paper appeared as: Bastian Oetomo, R. Malinga Perera, Renata Borovica-Gajic, and Benjamin I. P. Rubinstein. “Cutting to the Chase with Warm-Start Contextual Bandits.” 2021 IEEE 21st International Conference on Data Mining (ICDM), pp. 459–468, 2021.
Appendices
A Full proof of the confidence ellipsoid of the warm-started bandit
We now detail the full proof of Theorem 2 by extending a previous analysis [1]. We restate our estimate of the parameter for convenience:
where for \(n\ge 2\) we have defined
Let \(\varvec{X}_{1:t}\) and \(\varvec{Y}_{1:t}\) be matrices comprising the contexts and the rewards up to round t, respectively, and \(\varvec{\epsilon }_{1:t}\) be the vector containing their corresponding sub-Gaussian noise, that is:
Therefore, we can write \(\hat{\varvec{\theta }}_t\) as
To avoid clutter, let \(\varvec{X} = \varvec{X}_{1:t-1}, \varvec{Y} = \varvec{Y}_{1:t-1}, \varvec{\epsilon }=\varvec{\epsilon }_{1:t-1}\). Then, we have \(\varvec{V}_t = \bar{\varvec{V}}_1 + \varvec{X}^T\varvec{X}\). Therefore, we can expand the expression of \(\hat{\varvec{\theta }}_t\) above as:
Next, for any vector \(\varvec{c}\) of appropriate size, we would like to obtain:
Now, as we have assumed that \(\bar{\varvec{V}}_1\) is positive definite, and since \(\varvec{V}_t\) is the sum of \(\bar{\varvec{V}}_1\) and the positive semi-definite matrix \(\varvec{X}^T\varvec{X}\), it follows that \(\varvec{V}_t\) is also positive definite; thus the inner products are well-defined. Therefore, we can invoke the Cauchy–Schwarz inequality to obtain
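The inequality invoked here is the weighted Cauchy–Schwarz bound \(|\varvec{a}^T\varvec{b}| \le \Vert \varvec{a}\Vert _{\varvec{V}^{-1}}\Vert \varvec{b}\Vert _{\varvec{V}}\) for positive definite \(\varvec{V}\). A minimal numeric sanity check (an illustrative sketch; the function name is ours, not from the paper's code):

```python
import numpy as np

def weighted_norms(V, a, b):
    """Return (|a^T b|, ||a||_{V^{-1}} * ||b||_V) for positive definite V."""
    Vinv = np.linalg.inv(V)
    lhs = float(abs(a @ b))
    rhs = float(np.sqrt(a @ Vinv @ a) * np.sqrt(b @ V @ b))
    return lhs, rhs

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
V = A @ A.T + np.eye(5)       # positive definite, playing the role of V_t
a = rng.standard_normal(5)    # plays the role of c
b = rng.standard_normal(5)    # plays the role of theta_hat - theta_star
lhs, rhs = weighted_norms(V, a, b)
assert lhs <= rhs + 1e-12     # weighted Cauchy–Schwarz holds
```

The bound follows by writing \(\varvec{a}^T\varvec{b} = (\varvec{V}^{-1/2}\varvec{a})^T(\varvec{V}^{1/2}\varvec{b})\) and applying the ordinary Cauchy–Schwarz inequality.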
Now [1, Theorem 1], with \(\varvec{V} = \bar{\varvec{V}}_1\), yields with probability at least \(1-\delta \) that
Furthermore, since \(\varvec{c}\) can be any vector, we choose \(\varvec{c} = \varvec{V}_t(\hat{\varvec{\theta }}_t - \varvec{\theta }_\star )\), which yields
and
Combining both expressions above, we have:
Now we use the fact that \(\varvec{V}_s \le \varvec{V}_t\) for \(s\le t\); thus, we can bound:
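The fact used here, that \(\varvec{V}_s \le \varvec{V}_t\) in the Loewner order implies \(\Vert \varvec{x}\Vert _{\varvec{V}_t^{-1}} \le \Vert \varvec{x}\Vert _{\varvec{V}_s^{-1}}\), can likewise be checked numerically (an illustrative sketch; names are ours):

```python
import numpy as np

def vinv_norm(V, x):
    """||x||_{V^{-1}} = sqrt(x^T V^{-1} x) for positive definite V."""
    return float(np.sqrt(x @ np.linalg.solve(V, x)))

rng = np.random.default_rng(1)
d = 4
V_s = np.eye(d)                    # Gram matrix at the earlier round s
X = rng.standard_normal((10, d))   # contexts observed between rounds s and t
V_t = V_s + X.T @ X                # V_s <= V_t in the Loewner order
x = rng.standard_normal(d)
# Adding rank-one context terms can only shrink the V^{-1}-norm of x.
assert vinv_norm(V_t, x) <= vinv_norm(V_s, x) + 1e-12
```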
Thus, we conclude that
B Full proof of the regret bound of warm-start LinUCB
The regret analysis for LinUCB is included here for completeness and follows closely the proof laid out by Lattimore and Szepesvári [14]. Let \({\mathcal {C}}_t\) be a closed confidence set containing \(\varvec{\theta }_\star \) with high probability such that \({\mathcal {C}}_t \subseteq \{ \varvec{\theta } \in {\mathbb {R}}^d: \Vert \varvec{\theta }-\hat{\varvec{\theta }}_t\Vert _{\varvec{V}_{t}}\le \beta _t \}\). Furthermore, let \(\tilde{\varvec{\theta }}_t \in {\mathcal {C}}_t\) be such that \(\tilde{\varvec{\theta }}_t^T\varvec{x}_t = UCB_t(\varvec{x}_t)\). This implies that
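For an ellipsoidal confidence set, the maximization defining \(UCB_t\) has the standard closed form \(UCB_t(\varvec{x}) = \hat{\varvec{\theta }}_t^T\varvec{x} + \beta _t\Vert \varvec{x}\Vert _{\varvec{V}_t^{-1}}\) [14]. A sketch checking that the closed form dominates every feasible point of the ellipsoid (function names are ours, not the paper's code):

```python
import numpy as np

def ucb(x, theta_hat, V, beta):
    """Closed-form max of theta^T x over the ellipsoid ||theta - theta_hat||_V <= beta."""
    return float(theta_hat @ x + beta * np.sqrt(x @ np.linalg.solve(V, x)))

rng = np.random.default_rng(2)
d, beta = 3, 1.5
A = rng.standard_normal((d, d))
V = A @ A.T + np.eye(d)            # positive definite V_t
theta_hat = rng.standard_normal(d)
x = rng.standard_normal(d)
L = np.linalg.cholesky(V)
for _ in range(1000):
    u = rng.standard_normal(d)
    u = u / np.linalg.norm(u) * rng.uniform(0, 1)       # ||u||_2 <= 1
    theta = theta_hat + beta * np.linalg.solve(L.T, u)  # ||theta - theta_hat||_V <= beta
    assert theta @ x <= ucb(x, theta_hat, V, beta) + 1e-9
```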
Therefore,
where we have defined \(\tilde{\varvec{\delta }}_t\) as the vector \(\tilde{\varvec{\theta }}_t\) relative to \(\hat{\varvec{\mu }}\). The next step follows from Jensen’s inequality:
where
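The Jensen (equivalently Cauchy–Schwarz) step in regret proofs of this kind is typically a bound of the form \(\sum _{t=1}^T r_t \le \sqrt{T\sum _{t=1}^T r_t^2}\) on the instantaneous regrets. A quick numeric check (an illustrative sketch, not the paper's code):

```python
import numpy as np

def sum_vs_root_sum_sq(r):
    """Return (sum_t r_t, sqrt(T * sum_t r_t^2)) for instantaneous regrets r."""
    r = np.asarray(r, dtype=float)
    return float(r.sum()), float(np.sqrt(len(r) * (r ** 2).sum()))

rng = np.random.default_rng(3)
r = rng.uniform(0, 2, size=100)   # nonnegative per-round regrets
lhs, rhs = sum_vs_root_sum_sq(r)
assert lhs <= rhs + 1e-12         # sum is dominated by sqrt(T * sum of squares)
```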
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Oetomo, B., Perera, R.M., Borovica-Gajic, R. et al. Cutting to the chase with warm-start contextual bandits. Knowl Inf Syst 65, 3533–3565 (2023). https://doi.org/10.1007/s10115-023-01861-2