Cutting to the chase with warm-start contextual bandits

Multi-armed bandits achieve excellent long-term performance in practice and sublinear cumulative regret in theory. However, a real-world limitation of bandit learning is poor performance in early rounds due to the need for exploration—a phenomenon known as the cold-start problem. While this limitation may be necessary in the general classical stochastic setting, in practice where "pre-training" data or knowledge is available, it is natural to attempt to "warm-start" bandit learners. This paper provides a theoretical treatment of warm-start contextual bandit learning, adopting Linear Thompson Sampling as a principled framework for flexibly transferring domain knowledge as might be captured by bandit learning in a prior related task, a supervised pre-trained Bayesian posterior, or domain expert knowledge. Under standard conditions, we prove a general regret bound. We then apply our warm-start algorithmic technique to other common bandit learners—the ε-greedy and upper-confidence bound contextual learners. An upper regret bound is then provided for LinUCB. Our suite of warm-start learners is evaluated in experiments with both artificial and real-world datasets, including a motivating task of tuning a commercial database. A comprehensive range of experimental results is presented, highlighting the effect of different hyperparameters and quantities of pre-training data.

The stochastic bandit setting, as compared to more general partially-observable Markov decision processes (POMDPs), typically admits regret analysis in which bandit learners enjoy bounded cumulative regret: the gap between a learner's cumulative reward up to time T and the cumulative reward achievable by a fixed but optimal-in-hindsight policy. While many bandit learners are celebrated for attaining sublinear regret, or average regret converging to zero, such long-term performance goals say little about the short-term performance of today's popular bandit algorithms.
Indeed, the bandit setting is well known to be the simplest Markov decision process setting to require balancing of exploration (attempting infrequent actions in case of higher-than-expected rewards) with exploitation (greedy selection of actions that so far appear fruitful). Even in the stochastic setting, where rewards are drawn from stationary (context-conditional) distributions, the underlying distributions are unknown and considered adversarially chosen. In other words, there is no free lunch (in the worst case) without significant exploration in early rounds.
The relatively poor early-round performance of bandit learners is known as the cold-start problem and can be costly in high-stakes domains. Li et al. [15] suggested that bandit learners be warm-started or pre-trained somehow prior to deployment, in the context of online media recommendation and advertising, where poor performance leads to user dissatisfaction and financial loss. However, little systematic research has explored the cold-start problem. Intuitively, warm start is related to transfer learning [9] and domain adaptation [10]. Shivaswamy and Joachims [25] proposed warm-starting methods for non-contextual bandits, while Zhang et al. [32] modify any bandit policy to make use of pre-training from (batch) supervised learning via manipulation of the policy's importance sampling and weighting, which determines the relative importance of one datum (x, y) over another, ultimately resulting in a weighted linear regression. Another work by Li et al. [16] employs virtual plays before committing to an action in every round, which implicitly assumes that the existing logged data are perfectly aligned with the unknown bandit data. A similar assumption is made implicitly by Bouneffouf et al. [6], who combine prior historical observations and clustering information. Other works have proposed approaches to the item-user cold-start problem, such as that of Wang et al. [31], who passively assign a user to each item on top of the usual bandit, which selects an item for a user. The warm-start problem is also related to the conservative bandit problem, where the usual bandit setting applies under the existence of a baseline policy and a performance constraint [13]. This paper advocates for Thompson sampling (TS) [28] as a natural framework for warm-start bandits.
Although the prior used in Thompson sampling can be misspecified, as discussed by Liu and Li [17], our extension to the LinTS contextual bandit not only affords more flexible forms of warm start, but also quantifies prior uncertainty and admits regret analysis. Furthermore, this idea can be extended to other bandit algorithms, such as ε-greedy and LinUCB.
Flexibility in warm start is paramount, as not all settings requiring warm start will necessarily admit prior supervised learning as assumed previously [32]. Indeed, bandits are typically motivated when there is an absence of direct supervision, and only indirect rewards are available. Our framework offers unprecedented flexibility. We advocate that prior knowledge could come from: bandit learning on a previous, related task; domain expert knowledge or knowledge extracted from a rule-based, non-adaptive baseline system; or indeed prior supervised learning.
We introduce a new motivation for warm-start bandits from the database systems domain. Database indices, a data structure used by database management systems to execute queries more rapidly, may be formed on any combination of table columns. Unfortunately the best choice of index depends on unknown query workloads and potentially unstable system per-formance. Offline solutions to index selection have been the foundations of the automated tools provided by database vendors [3,11,33]. Recognising that database administrators cannot practically foresee future database loads, online solutions, where the choice of the representative workload and the cost-benefit analysis of materialising a configuration are automated, have been proposed [7,8,12,18,23,24]. Unfortunately, most such approaches lack any form of performance guarantee. Recent work has demonstrated compelling potential for linear bandits for index selection [21] complete with regret bound guarantees, however the cold start problem is likely to limit deployment as vendors and users alike may be concerned about out-of-box performance. We demonstrate that a warm-start bandit can deliver strong short-term improvement for database index selection without costing long-term results.
In summary, this paper makes the following contributions:
- We propose a framework for warm-starting contextual bandits based on LinTS and extend our technique to ε-greedy and LinUCB;
- Unlike past efforts to warm-start bandit learners, which strictly apply to supervised learning only, our warm-start linear bandits seen in Algorithms 2, 3 and 4 can incorporate prior knowledge from any form of prior learning, such as supervised learning [32], prior bandit learning, or manual construction of a prior by a domain expert. Notably, our warm-start approach incorporates uncertainty quantification;
- We introduce a method to automatically tune the hyperparameters used in Algorithms 2, 3 and 4;
- We present regret bounds for warm-start LinTS and LinUCB that demonstrate sublinear regret for long-term performance;
- Experiments on database index selection (using data derived from standard system benchmarks), classification task data and synthetic data demonstrate performance improvement in the short term, with performance competitive with baselines (where such baselines are able to be run); and
- We have expanded experiments to demonstrate the effect of increased pre-training on performance in both accurate and misspecified settings.

Background: contextual bandits and linear Thompson sampling
The stochastic contextual multi-armed bandit (MAB) problem is a game proceeding in rounds t ∈ [T] = {1, 2, . . . , T}. In round t the MAB learner: 1. observes k possible actions or arms i ∈ [k], each with adversarially chosen context vector x_t(i) ∈ R^d; 2. pulls an arm i_t ∈ [k]; and 3. observes a reward r_t(i_t). The MAB learner's goal is to maximise its cumulative expected reward (the total expected reward over all rounds), which is equivalent to minimising the cumulative regret up to round T: Reg(T) = Σ_{t=1}^{T} E[r_t(i_t*)] − E[r_t(i_t)], where i_t* ∈ arg max_{i∈[k]} E[r_t(i)], that is, an optimal arm to pull at round t. When a MAB algorithm's cumulative regret Reg(T) is sub-linear in T, the average regret Reg(T)/T goes to zero. Such an algorithm is said to be a "no regret" learner or Hannan consistent.
Thompson sampling (TS), a Bayesian approach within the family of randomised probability matching algorithms, is one of the earliest design patterns for MAB learning [28]. Each modelled arm's reward likelihood is endowed with a prior. Arms are then pulled based on their posteriors: e.g., parameters for each arm can be drawn from the corresponding posteriors, and then arm selection may proceed (greedily) by maximising reward likelihood.
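To make the probability-matching pattern concrete, the following is a minimal non-contextual sketch under a Beta-Bernoulli model (our choice for illustration; the paper's setting is linear-Gaussian): each arm's Bernoulli reward likelihood is given a Beta prior, a mean is sampled per arm from its posterior, and the arm maximising the sampled mean is pulled.

```python
import random

def thompson_step(successes, failures, rng=random):
    """One round of Beta-Bernoulli Thompson sampling.

    successes[i] and failures[i] count observed rewards for arm i;
    a Beta(1 + s, 1 + f) posterior mean is sampled per arm and the
    arm with the largest sample is pulled (greedy in the sample)."""
    samples = [rng.betavariate(1 + s, 1 + f)
               for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=samples.__getitem__)

# Simulate two arms: arm 1 pays off far more often, so its posterior
# concentrates near its true mean and it ends up pulled most.
rng = random.Random(0)
true_means = [0.2, 0.8]
succ, fail = [0, 0], [0, 0]
pulls = [0, 0]
for _ in range(500):
    i = thompson_step(succ, fail, rng)
    pulls[i] += 1
    if rng.random() < true_means[i]:
        succ[i] += 1
    else:
        fail[i] += 1
```

The posterior sampling step is what distinguishes TS from greedy selection: arms with few observations retain wide posteriors and are still occasionally sampled above the incumbent.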
Linear Thompson sampling (LinTS) [2,4] is an algorithm with sub-linear cumulative regret when the context-conditional reward satisfies a linear relationship r_t(i_t) = θ^T x_t(i_t) + ε_t(i_t), where the additive noise ε_t(i_t) is conditionally R-sub-Gaussian and θ ∈ R^d is an unknown vector-valued parameter shared among all of the k arms.
Like most approaches to linear contextual bandit learning, LinTS adopts (online) ridge regression for estimating the unknown parameter. For any regularisation parameter λ ∈ R+, define the matrix V_t as

V_t = λI + Σ_{s=1}^{t−1} x_s(i_s) x_s(i_s)^T .   (1)

Then, Abeille et al. [2] demonstrated that we can estimate the unknown parameter θ as

θ̂_t = V_t^{−1} Σ_{s=1}^{t−1} x_s(i_s) r_s(i_s) .   (2)

Earlier versions of LinTS [4] do not include a tunable regularisation parameter. A result due to Abbasi-Yadkori et al. [1] is used within LinTS. Assuming ‖θ‖ ≤ S, then with probability at least 1 − δ ∈ (0, 1):

‖θ̂_t − θ‖_{V_t} ≤ β_t(δ) = R sqrt( 2 log( det(V_t)^{1/2} det(λI)^{−1/2} / δ ) ) + λ^{1/2} S .

In Thompson sampling, we may introduce a perturbation parameter η_t ∈ R^d which, after rotation and scaling by the inverse square root V_t^{−1/2} of the matrix V_t, and scaling by oversampling factor β_t(δ'), promotes exploration around the point estimate θ̂_t:

θ̃_t = θ̂_t + β_t(δ') V_t^{−1/2} η_t .

Moreover, Abeille et al. [2] have shown that if η_t follows a distribution D^TS with the following properties: 1. (anti-concentration) there exists p > 0 such that, for all ‖u‖ = 1, we have P_{η∼D^TS}(u^T η ≥ 1) ≥ p; and 2. (concentration) there exist positive constants c and c' such that, for all δ ∈ (0, 1), we have P_{η∼D^TS}( ‖η‖ ≤ sqrt(c d log(c' d / δ)) ) ≥ 1 − δ; then LinTS is Hannan consistent. We adopt a standard multivariate Gaussian for η_t, which satisfies the above properties [2]. With all of these definitions in mind, the version of LinTS used in this paper can be summarised as shown in Algorithm 1.
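A compact sketch of one LinTS round under these definitions, with the oversampling factor treated as a fixed constant beta rather than the round-dependent β_t(δ') (an assumption for brevity; function names are ours):

```python
import numpy as np

def lints_choose_arm(contexts, V, b, beta, rng):
    """One LinTS round (sketch): draw a perturbed parameter
    theta_tilde = theta_hat + beta * V^{-1/2} eta with eta ~ N(0, I),
    then pull the arm maximising theta_tilde^T x."""
    V_inv = np.linalg.inv(V)
    theta_hat = V_inv @ b
    w, Q = np.linalg.eigh(V)             # V is symmetric positive definite
    V_inv_sqrt = Q @ np.diag(w ** -0.5) @ Q.T
    eta = rng.standard_normal(len(b))
    theta_tilde = theta_hat + beta * V_inv_sqrt @ eta
    return int(np.argmax(contexts @ theta_tilde))

def lints_update(V, b, x, r):
    """Rank-one recurrences behind Eqs. (1)-(2): V += x x^T, b += r x."""
    return V + np.outer(x, x), b + r * x

# Simulate a shared-parameter linear bandit with unit-norm contexts.
rng = np.random.default_rng(0)
d, k, lam, beta, R = 3, 5, 1.0, 0.5, 0.1
theta = np.array([1.0, -0.5, 0.25])
V, b = lam * np.eye(d), np.zeros(d)
correct = 0
for t in range(300):
    X = rng.standard_normal((k, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)   # enforce ||x|| <= 1
    i = lints_choose_arm(X, V, b, beta, rng)
    correct += i == int(np.argmax(X @ theta))
    r = X[i] @ theta + R * rng.standard_normal()
    V, b = lints_update(V, b, X[i], r)
```

After a few hundred rounds the ridge estimate V^{−1}b approaches θ and the sampled perturbation shrinks with V_t^{−1/2}, so exploitation dominates.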

Warm-starting linear bandits
We now detail our flexible algorithmic framework for warm-starting contextual bandits, beginning with linear Thompson sampling for which we derive a new regret bound.

Thompson sampling
Given the foundation of Thompson sampling in Bayesian inference, it is natural to look to manipulating the prior as a means of injecting a priori knowledge of the reward structure before the bandit is put into operation. Algorithm 1, the implementation of LinTS due to Abeille et al. [2], decomposes the prior and posterior distributions on θ̃_t as a Gaussian centred at the point estimate θ̂_t with covariance based on the oversampling factor β_t(δ') and the matrix V_t, via the random perturbation vector η_t. Our approach to warm start is to focus on manipulating the initial point estimate θ̂_1 and the matrix V_1 to incorporate available prior knowledge into LinTS.

Remark 1
Although Algorithm 1 appears to offer the freedom to select any θ̂_1, Eqs. (1) and (2) do not present an immediate route to adapting subsequent point estimates θ̂_t. Generalising Eq. (2) to the point estimate θ̂_t = V_t^{−1}(λθ̂_1 + Σ_{s=1}^{t−1} x_s(i_s) r_s(i_s)) is unintuitive and does not clearly admit regret analysis.
We adopt an intuitive approach of adapting Algorithm 1 to model the difference between an initial guess, derived from some process occurring before bandit learning, and the actual parameter. This pre-deployment process could be batch supervised learning, an earlier bandit deployment on a related decision problem, or simply a prior manually constructed by a domain expert. Our general framework is completely agnostic and generalises earlier approaches to warm-starting bandits such as [32]. Without loss of generality we refer to this earlier process as the first phase and the data on which initial parameters are based as the first-phase dataset. Let θ = μ̄ + δ̄, where μ̄ is the true parameter of the first-phase dataset and δ̄ represents the concept drift between the first phase and bandit deployment. With this reparametrisation, our linear model becomes r_t(i) = (μ̄ + δ̄)^T x_t(i) + ε_t(i). Therefore, our problem has reduced from estimating θ to estimating δ̄. Consider a Bayesian linear regression model with the unknown true value μ̄ of the first-phase parameter modelled by random variable μ ∼ N(μ̂, Σ_μ) with conjugate context-conditional Gaussian likelihood. We then model the difference parameter δ̄ as δ̄ ∼ N(0, α^{−1}I). If θ = μ + δ̄ is the random variable modelling θ, then θ ∼ N(μ̂, Σ_μ + α^{−1}I) owing to the Gaussian's stability property. Finally, since μ̂ is known, we can model θ as θ = μ̂ + δ, that is, a random variable centred at μ̂ which is shifted by drift δ ∼ N(0, Σ_μ + α^{−1}I_d).
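A sketch of how these warm-start quantities might be produced from first-phase data, assuming the first phase is a Bayesian ridge fit with a zero-mean Gaussian prior of standard deviation sigma0 on μ (the helper name warm_start_prior and all defaults are our own, for illustration):

```python
import numpy as np

def warm_start_prior(X_pre, y_pre, R=0.1, alpha=1e3, sigma0=10.0):
    """Build warm-start quantities from first-phase data (sketch).

    A Bayesian ridge fit on (X_pre, y_pre) yields the posterior
    N(mu_hat, Sigma_mu) over the first-phase parameter; the bandit
    parameter is then modelled as theta = mu_hat + delta with
    delta ~ N(0, Sigma_mu + alpha^{-1} I), via Gaussian stability."""
    d = X_pre.shape[1]
    # Posterior of mu under prior N(0, sigma0^2 I), noise variance R^2.
    A = np.eye(d) / sigma0**2 + (X_pre.T @ X_pre) / R**2
    Sigma_mu = np.linalg.inv(A)
    mu_hat = Sigma_mu @ (X_pre.T @ y_pre) / R**2
    Sigma_theta = Sigma_mu + np.eye(d) / alpha   # drift-inflated covariance
    return mu_hat, Sigma_theta

# Example: recover a known first-phase parameter from 200 noisy samples.
rng = np.random.default_rng(1)
mu_bar = np.array([0.5, -1.0])
X_pre = rng.standard_normal((200, 2))
y_pre = X_pre @ mu_bar + 0.1 * rng.standard_normal(200)
mu_hat, Sigma_theta = warm_start_prior(X_pre, y_pre)
```

The drift term α^{−1}I keeps the prior covariance full-rank even when the first-phase posterior is very confident, matching the interpretation discussed below Proposition 1.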
We next generalise the coupled recurrence Eqs. (1) and (2) for efficient incremental computation of the generalised posterior estimates.

Proposition 1 Consider the linear regression likelihood y ∼ N(θ^T x, R²) with prior θ ∼ N(μ̂, Σ_μ + α^{−1}I), where R² is the variance of the measurement noise. Then the posterior mean of θ is computed by the point estimates of Eq. (2) generalised to θ̂_t = V_t^{−1}(V_1 μ̂ + Σ_{s=1}^{t−1} x_s(i_s) r_s(i_s)), where we replace V_1 = λI with V_1 = R²(Σ_μ + α^{−1}I)^{−1}.
Proof The posterior distribution under the Gaussian prior N(μ̂, Σ) with Σ = Σ_μ + α^{−1}I and likelihood variance R² has precision Σ^{−1} + R^{−2} Σ_{i=1}^{n} x_i x_i^T. To avoid clutter, let V̄_{n+1} = V_1 + Σ_{i=1}^{n} x_i x_i^T with V_1 = R² Σ^{−1}. Therefore, our posterior distribution can be rewritten as N( V̄_{n+1}^{−1}(V_1 μ̂ + Σ_{i=1}^{n} y_i x_i), R² V̄_{n+1}^{−1} ). Therefore, our estimator for θ would be θ̂ = V̄_{n+1}^{−1}(V_1 μ̂ + Σ_{i=1}^{n} y_i x_i). This completes the proof.
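The identity behind the proof can be checked numerically: the incremental recurrences initialised with V_1 = R²(Σ_μ + α^{−1}I)^{−1} should reproduce the batch Bayesian posterior mean exactly. A sketch under these assumptions, with a random symmetric positive-definite matrix standing in for the prior covariance:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, R = 3, 50, 0.5
mu_hat = rng.standard_normal(d)          # first-phase point estimate
A = rng.standard_normal((d, d))
Sigma = A @ A.T + np.eye(d)              # stands in for Sigma_mu + alpha^{-1} I

X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Incremental form of Proposition 1: V_1 = R^2 Sigma^{-1}, rank-one updates.
V = R**2 * np.linalg.inv(Sigma)
b = V @ mu_hat
for x, r in zip(X, y):
    V = V + np.outer(x, x)
    b = b + r * x
theta_inc = np.linalg.solve(V, b)

# Batch posterior mean under prior N(mu_hat, Sigma), noise variance R^2.
P = np.linalg.inv(np.linalg.inv(Sigma) + X.T @ X / R**2)
theta_batch = P @ (np.linalg.solve(Sigma, mu_hat) + X.T @ y / R**2)
```

Multiplying the batch precision through by R² shows term-by-term agreement with the incremental form, so the two estimates coincide up to floating-point error.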
Our approach comes with an appealing interpretation when setting δ̄ ∼ N(0, α^{−1}I): when we are confident that our pre-training guess is very close to the true parameter, we can set the drift variance α^{−1} very small and close to 0. However, when we are less confident, α^{−1} is naturally set large. A large α^{−1} permits more "deviation" or error from our first-phase parameter μ̄. This suggests a promising new direction which we highlight in future work in Sect. 6.
Our simple reduction of warm-start bandit learning to LinTS admits a regret bound. We follow the pattern of the regret analysis of Abeille et al. [2] with differences detailed next.
Observe first that r_t(i_t) − μ̂^T x_t(i_t) = δ^T x_t(i_t) + ε_t(i_t), so warm-start LinTS performs regularised regression on δ. Accordingly, the argument yielding the confidence ellipsoid β_t(δ') stated in [1, Theorem 2], bounding ‖θ̂_t − θ‖_{V_t}, applies in our case; the full proof of its modification can be found in the Appendix. However, as our initial matrix V_1 generalises λI, we must alter the penultimate proof steps of Abeille et al. [2] as follows:
- the inequality proposed by Abbasi-Yadkori et al. [1], which is used to define β_t(δ) in their paper, is not valid in our scenario. This is corrected by using the version of β_t(δ) presented in this paper, removing the assumption that V_1 = (λ/R²)I and leaving it in terms of V_1;
- the inequality of [2, Proposition 2] is no longer valid in our case. However, the last inequality in [20] has modified [2, Proposition 2] into a form that serves our purpose; and
- in proving [2, Theorem 1], the authors used the fact that V_t^{−1} ⪯ (1/λ)I. This is not the case in our setting, but since V_t ⪰ V_1 we can generalise the result with similar reasoning, yielding V_t^{−1} ⪯ λ_max(V_1^{−1}) I.
We also need to change the definition of S, since our problem has shifted from estimating θ to estimating δ. Therefore, after modifying the framework, the warm-start linear Thompson sampling bandit can be summarised as in Algorithm 2 and admits the following regret bound.
Theorem 2 (Warm-start LinTS regret bound) Under the assumptions that: 1. ‖x‖ ≤ 1 for all x ∈ X; and 2. ‖δ‖ ≤ S for some known S ∈ R+; the regret of warm-start LinTS (Algorithm 2, the warm-start linear Thompson sampler) can be decomposed, following Abeille et al. [2], as Reg(T) = R^TS(T) + R^RLS(T), with each of the terms bounded sub-linearly in T.

Extension to ε-greedy and LinUCB learners
The core idea of our warm-starting method, as derived for linear Thompson sampling, lies in how the initial phase of the bandit is set up. The same initialisation can be applied to other contextual bandit algorithms such as ε-greedy and LinUCB.
In the ε-greedy algorithm, we balance exploration and exploitation by means of relatively naïve randomness: in each round we (uniformly) explore with probability ε and exploit with probability 1 − ε. Incorporating warm start, this means that at each round we choose an arm uniformly at random from the set [k] with probability ε, and choose an arm uniformly at random from the set S = arg max_{i∈[k]} θ̂_t^T x_t(i) with probability 1 − ε. We summarise the warm-start ε-greedy algorithm in Algorithm 3. We can also extend our warm-starting technique to LinUCB, using the fact that θ ∼ N(θ̂_t, R² V_t^{−1}). It was proposed by Li et al. [15] that one way to interpret their algorithm is to look at the distribution of the expected payoff θ^T x_t. By the affine transformation property of multivariate Gaussian distributions, we have θ^T x_t(i) ∼ N(θ̂_t^T x_t(i), R² x_t(i)^T V_t^{−1} x_t(i)). Therefore, an upper confidence bound on this quantity is θ̂_t^T x_t(i) + ρR sqrt(x_t(i)^T V_t^{−1} x_t(i)) for some value ρ, which is left as a hyperparameter. Our warm-start LinUCB algorithm is summarised in Algorithm 4.
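Hedged sketches of the two warm-start selection rules described above (the helper names are ours; theta_hat and V are whatever estimates the warm-start recurrences currently hold):

```python
import numpy as np

def eps_greedy_arm(contexts, theta_hat, eps, rng):
    """Warm-start eps-greedy: explore uniformly with probability eps,
    otherwise exploit the current point estimate theta_hat."""
    if rng.random() < eps:
        return int(rng.integers(len(contexts)))
    return int(np.argmax(contexts @ theta_hat))

def linucb_arm(contexts, theta_hat, V, rho):
    """Warm-start LinUCB: mean payoff theta_hat^T x plus a rho-scaled
    confidence width sqrt(x^T V^{-1} x) per arm."""
    V_inv = np.linalg.inv(V)
    means = contexts @ theta_hat
    widths = np.sqrt(np.einsum('ij,jk,ik->i', contexts, V_inv, contexts))
    return int(np.argmax(means + rho * widths))
```

With ε = 0 (or a large V_1 shrinking the widths), both rules reduce to greedy exploitation of the warm-start estimate, which is the intended effect of a confident prior.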

Theorem 3 (Warm-start LinUCB regret bound)
The regret bound of warm-started LinUCB follows an argument of Lattimore and Szepesvári [14] very closely. The regret, whose complete derivation is provided in the Appendix, admits a bound sub-linear in T.

A regret lower bound
We here present a lower bound for the warm-started linear contextual ε-greedy algorithm. Consider the best-case scenario for ε-greedy with constant ε, that is, that we have the true weight as our initial guess, i.e., μ̂ = θ. Assume that we use the hyperparameter α → ∞, which ensures the weight's resistance to changes from observations, i.e., θ̂_t = μ̂ = θ for all t. In this setting, denoting by Δ_{i,t} ≥ 0 the difference between the expected rewards of the optimal arm and arm i at round t, the regret is (ε/K) Σ_{t=1}^{T} Σ_{i=1}^{K} Δ_{i,t}. This argument, detailed in Lemma 4, proves a lower bound since it is derived from a best-case scenario.

Lemma 4 The regret for warm-started ε-greedy is at best Reg(T) = (ε/K) Σ_{t=1}^{T} Σ_{i=1}^{K} Δ_{i,t}.
Proof Since θ̂_t = θ for all t, each exploitation round will yield one of the optimal arms with probability 1. Assume that there are K arms in total. Let E denote the event that exploration occurs, and A_i the event that arm i is chosen. Then, the expected cumulative regret for the linear contextual ε-greedy is Σ_{t=1}^{T} P(E) Σ_{i=1}^{K} P(A_i | E) Δ_{i,t} = Σ_{t=1}^{T} ε Σ_{i=1}^{K} (1/K) Δ_{i,t} = (ε/K) Σ_{t=1}^{T} Σ_{i=1}^{K} Δ_{i,t}. Note that in this analysis we have used a constant ε for our ε-greedy algorithm. In practice, the value of ε can be scheduled to recede over time. Auer et al. [5] have shown that in the case of non-contextual bandits, this regime enjoys a sub-linear upper regret bound.
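The lemma's expression can be sanity-checked by simulation under the best-case assumptions (θ̂_t = θ, constant ε, stationary gaps; the arm means below are illustrative values of our own):

```python
import numpy as np

# Best case of Lemma 4: exploitation always picks an optimal arm, so all
# regret comes from the eps-fraction of uniform exploration rounds. The
# empirical cumulative regret should match (eps/K) * sum_t sum_i gap_i.
rng = np.random.default_rng(3)
K, T, eps = 4, 20000, 0.1
means = np.array([1.0, 0.7, 0.5, 0.2])   # stationary expected rewards
gaps = means.max() - means               # per-arm gaps Delta_i
regret = 0.0
for _ in range(T):
    if rng.random() < eps:               # exploration round: uniform arm
        i = int(rng.integers(K))
    else:                                # exploitation round: optimal arm
        i = int(np.argmax(means))
    regret += gaps[i]
predicted = eps / K * gaps.sum() * T
```

Because the predicted quantity is linear in T for constant ε, the best case is still linear-regret, which is exactly why a receding ε schedule is needed for sub-linearity.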
Reduction from non-contextual to contextual bandits The above lower bound for the contextual ε-greedy algorithm leads naturally to a lower bound for non-contextual bandits. The non-contextual bandit differs from its contextual counterpart in that it does not provide any context. In each round, the true means of the non-contextual arms remain constant and are independent of each other (i.e., θ_{i,t} = θ_i for all t); thus, the parameters to estimate are θ_i for arms i ∈ [K]. A non-contextual bandit can be formulated as a contextual bandit, as shown in Lemma 5. By performing such a reduction, essentially using a contextual bandit to act in a non-contextual setting, we can relate lower bounds between the settings.

Lemma 5 A non-contextual bandit can be formulated as a contextual bandit. Therefore, any fundamental limitations for non-contextual bandits must also hold for contextual bandits.
Proof Let the non-contextual bandit arms be i = 1, . . . , K and let the expected reward for arm i be θ_i. A contextual bandit equivalent can be constructed by setting the context for arm i as x(i) = e_i, the standard basis vector of R^K whose i-th element is 1 and whose other elements are 0. Furthermore, assuming that the shared model is used, the i-th element of the true weight θ can be taken to be θ_i. This setting leads us to set the initial weight μ̂ = (μ̂_1, . . . , μ̂_K)^T to provide an initial guess of the true mean of each arm μ_i for i ∈ [K], with V_1 = diag(λ_1, . . . , λ_K) reflecting the confidence we have in our initial estimate. A diagonal matrix is particularly chosen for this purpose since the means of the arms are independent of each other. Thus, the (contextual) estimate of θ is θ̂_t = μ̂ + δ̂_t with δ̂_t = V_t^{−1} Σ_{s=1}^{t−1} (r_s − μ̂^T x_s) x_s. Now since x_s = e_{i_s}, and noticing that e_i e_i^T = diag(1(i = 1), . . . , 1(i = K)) for all i ∈ [K], i.e., a matrix with all zero entries except a 1 at entry (i, i), we have V_t = diag(λ_1 + T_1, . . . , λ_K + T_K), where T_i is the number of times arm i is pulled and S_i is the total sum of the reward differences r_s − μ̂_i observed for arm i. Therefore, the estimate of the weight is θ̂_i = μ̂_i + S_i/(λ_i + T_i) = (λ_i μ̂_i + Σ_{s: i_s=i} r_s)/(λ_i + T_i). This result can be interpreted as follows: for each arm i ∈ [K], our estimate of the true mean θ_i is its sample mean with a pseudo-observation of mean μ̂_i worth λ_i observations. Indeed, when we choose λ_i = 0 for all i ∈ [K], we recover each arm's mean estimate as typically calculated by a non-contextual ε-greedy bandit algorithm. With this, when we exploit, we choose an arm which maximises θ̂^T x(i) = θ̂^T e_i = θ̂_i, which is the same as what is performed in the non-contextual case.
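The collapse of the contextual estimate into per-arm pseudo-count means can be verified directly (the numbers below are illustrative; we use the equivalent combined form V_t^{−1}(V_1 μ̂ + Σ r_s e_{i_s}) of the estimate):

```python
import numpy as np

K = 3
mu_hat = np.array([0.5, 0.2, 0.8])            # initial per-arm guesses
lam = np.array([2.0, 1.0, 4.0])               # confidence in each guess
pulls = [(0, 1.0), (0, 0.0), (2, 1.0), (1, 1.0), (0, 1.0)]  # (arm, reward)

# Contextual computation with contexts x(i) = e_i:
# V_t = diag(lam) + sum e_i e_i^T, b = V_1 mu_hat + sum r e_i.
V = np.diag(lam).astype(float)
b = V @ mu_hat
for i, r in pulls:
    e = np.eye(K)[i]
    V += np.outer(e, e)
    b += r * e
theta_contextual = np.linalg.solve(V, b)

# Direct per-arm computation: (lam_i mu_i + S_i) / (lam_i + T_i).
T_i = np.bincount([i for i, _ in pulls], minlength=K)
S_i = np.bincount([i for i, _ in pulls],
                  weights=[r for _, r in pulls], minlength=K)
theta_direct = (lam * mu_hat + S_i) / (lam + T_i)
```

Setting all λ_i = 0 recovers plain per-arm sample means, as noted in the proof.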
Since a non-contextual bandit can be formulated as a contextual bandit, our approach may be applied to warm-start a non-contextual bandit. Its lower bound when the ε-greedy algorithm is used follows the lower bound of contextual ε-greedy, with Δ_{i,t} = Δ̄_i for all t, since the mean reward (and hence the regret of each arm) is stationary across t. In other words, Lemma 4 is a fundamental lower bound on our warm-start setting also.

Experiments
We now report on a comprehensive suite of experimental evaluations of our warm-start framework against a number of baselines and different datasets. We are interested in the benefit of warm start over cold start; in such cases we focus on short-term performance differences, as this is a practical limitation of bandits in high-stakes applications. We also explore the impact of prior misspecification as a potential risk of incorrect warm start. We summarise our experiments next and then describe them, with results, in more detail below.
Datasets Experiments in database index selection explore the effect of warm start in selecting a single index per round, where queries arrive to the database in batches and rewards correspond to (negative) execution time. We use a commercial database system and the standard TPC-H benchmark [29]. Results on two OpenML datasets (Letters and Numbers) test bandits on online multi-class classification, a benchmark previously used to evaluate the ARRoW warm-start technique [32]. These datasets are advantageous to ARRoW in that they supply the (restrictive) kind of prior knowledge it needs: supervised pre-training. Experiments on synthetic data provide sufficient control of the environment to explore limitations of our warm-start approach.
Baselines On the database index selection task, we use cold-start TS as a natural and fair baseline. On the OpenML datasets we include the ARRoW warm-start framework, which was originally tested in the same way. We also demonstrate the performance of both frameworks on the ε-greedy and LinUCB learners, as well as LinTS. Throughout, cold start corresponds to having no pre-training dataset (i.e., Algorithm 1); hot start in the synthetic experiment corresponds to having 100% accuracy on the pre-training parameter μ̄; and warm start corresponds to having an estimate μ̂ of the pre-training parameter μ̄. By its very nature, we can only produce hot-start results with the artificial dataset, since 100% accuracy on the pre-training parameter would require an infinite number of observations in the real-world database index selection problem.
Hardware All experiments are performed on a commodity laptop equipped with Intel Core i7-6600u (2 cores, 2.60 GHz, 2.81 GHz), 16 GB RAM, and 256 GB disk (Sandisk X400 SSD) running Windows 10. In database experiments, we report cold runs only: we clear database buffer caches prior to query execution, so the memory setting does not impact our findings.

Database index selection
As the real-world problem of database index selection motivated this work, we begin with a demonstration in this setting. In a database management system, an index is a data structure used to speed up database execution of a set of queries (a.k.a. a workload). While a huge space of possible indices could be considered, only a few can actually be created due to memory constraints (since each index occupies space in memory). With a tremendous number of indices, it is impractical for humans to decide which indices to create without assistance. A recent effort has been made to automate this task by using bandits [21] to propose an optimal set of indices to boost workload execution. We adopt this recent framework in our work and expand it to support warm start. The aim of this experiment is to demonstrate that the warm-started bandit yields performance similar to the cold-started bandit in the long run while performing better in earlier rounds. The consequence of such a demonstration is a system more suitable for deployment.
In particular, our problem setting is as follows. At round t = 1, 2, . . . , T, we observe a workload W_t with a set of queries, and the system recommends one index i_t out of the set of all possible indices I. After index i_t is created, we execute the queries in workload W_t. Our chosen aim is to minimise the query execution time, noting that we do not take into account the time it takes to create the index i_t. After the workload is executed, the index i_t is dropped and the buffer is cleaned.
In this paper, the adopted database comes from the TPC-H benchmark [29]. This publicly available industrial benchmark comes with a set of predefined query templates. A query template is a parameterised query whose parameter values (a.k.a. conditions) are missing, keeping only the structure of the query and leaving number and string values as variables. We chose five query templates at random and instantiated them with actual parameter values in each round. These queries are used as the workload in both the pre-training and deployment phases.
Fig. 1 Cold start vs. warm-start LinTS for database index selection on the TPC-H benchmark
It should be noted that the values of R and S are unknown for the real-world dataset. In this case, we treat these as hyperparameters which need to be chosen, in addition to α.
In running this experiment, we used the context features described by Perera et al. [21], with the reward being the performance gain t_no_index − t_i, where t_no_index corresponds to the execution time of the whole workload without any indices and t_i to the execution time of all queries in the workload using index i.
Due to the lack of information on the optimal index, it is impossible to compute the regret for each round. Therefore, for this real-world experiment, we present the average execution time (loss) of workload W_t under each algorithm's recommendations, which can be found in Fig. 1.
Results It can be seen that the warm-started LinTS outperforms the cold-started LinTS, both in short-term rounds and cumulatively. This can be explained by the query templates used to pre-train the warm-started bandit resembling the templates used in the testing dataset, which leads the warm-started bandit's initial guess θ̂_1 = μ̂ to be closer to the actual weight θ than the cold-started bandit's initial guess θ̂_1 = 0.

OpenML classification dataset
We chose two of the datasets used in [32], which correspond to letter and number identification, respectively. We split the data such that 10% is used as the supervised learning examples and the other 90% as the actual bandit rounds. This advantages ARRoW [32], for which supervised pre-training is the only permissible form of prior knowledge. We try all learners presented in this paper on these datasets: ε-greedy, LinUCB and LinTS. As for the hyperparameters, we used ε = 0.0125 for ε-greedy, ρR = 0.2 for LinUCB, and for LinTS β_t(δ) = 1 on the Letters dataset and β_t(δ) = 0.05 on the Numbers dataset, with R = 0.25. All of these hyperparameters were found iteratively by grid search.
As described in [32], we transform each dataset into one capable of evaluating bandit algorithms by mapping the classes to arms, with the cost of each class given by c(a) = 1(a ≠ y) for example (x, y). For the classification problem, we also modify our bandit algorithm, which usually shares its parameter across the arms. Since the context of each arm is the same for the classification task, we instead distinguish the arms by making the parameters disjoint, with arm i having its own weight θ_{i,t}; its reward is then modelled by the equation r_t(i) = θ_{i,t}^T x_t(i) + ε_t(i). We use the term cost instead of reward for this dataset, which requires minor modification of the learners: we change the argmax operation into argmin and, in the case of LinUCB, the upper confidence bound in Line 5 into a lower confidence bound. The ARRoW algorithm presented in [32] is also executed partially, with the size of its hyperparameter class set to 1. We chose the best-performing λ to compare against our algorithm, for fairness. We note that the sensitivity analysis in Figs. 3 and 4 demonstrates that these choices are generally not very important.
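A sketch of the dataset transformation described above, with hypothetical helper names of our own: classes become arms, costs are mistake indicators, and the disjoint learner selects by argmin over per-arm weights.

```python
import numpy as np

def classification_to_bandit(X, y, num_classes):
    """Map a classification dataset to bandit rounds (sketch).

    Classes become arms; pulling arm a on example (x, y) incurs cost
    c(a) = 1(a != y). With a disjoint model, each arm i keeps its own
    weight theta_i, and the round's context x is shared across arms."""
    for x, label in zip(X, y):
        costs = np.array([float(a != label) for a in range(num_classes)])
        yield x, costs

def disjoint_greedy_arm(x, thetas):
    """Cost-minimising (argmin) selection over per-arm weight rows."""
    return int(np.argmin(thetas @ x))
```

Only the observed cost of the pulled arm would be revealed to the learner in an actual run, which is what makes this a bandit rather than a supervised problem.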
We follow a suggestion of the original ARRoW paper in evaluating [32, Algorithm Line 5], evaluating the arg min over the class F of all linear functions, the solution of which can be obtained via weighted linear regression. Another algorithm we use for comparison is that of Li et al. [16], hereafter labelled WWW'21 for convenience (denoting the publication venue). This algorithm employs virtual plays in every round by sampling the context according to a cdf F_X(x), estimated by its empirical cdf F̂_X(x), ultimately equivalent to random sampling of the seen contexts with replacement. Feedback is provided by an offline evaluator whenever the online confidence band is wider than its offline counterpart. The virtual plays continue until the offline evaluator no longer gives feedback.
We present the results for the OpenML datasets in Fig. 2, where our algorithm is labelled Diff, reflecting the fact that it models the difference between the true parameter and the guessed weight. It can be seen that our algorithm performs as well as previous algorithms, while still offering the flexibility to choose the initial guess.
Sensitivity analysis for this experiment (with accurate prior) is presented in Figs. 3 and 4. As mentioned, neither ARRoW nor our warm-start approach is very sensitive to its hyperparameters, while the algorithm proposed by Li et al. [16] does not require any hyperparameter tuning. These results also support our choice of α = 10^7 across these experiments.

Effect of warm-start on exploration hyperparameters
In this section, we present the final cumulative cost as a means of measuring the performance of warm-started bandit under different exploration hyperparameters. As previously observed from Figs. 3 and 4, the temperature hyperparameter does not appear to have a significant impact on final performance. Thus, for this analysis, we again fixed the value α = 10 7 . We reran the experiment for both Letters and Numbers datasets using the -greedy, LinUCB, and LinTS algorithms, varying

Fig. 2 Comparisons of both our and ARRoW warm-start frameworks on the (column i) Letters and (ii)
Numbers datasets, with learners a -greedy, b LinUCB and c LinTS the value of the exploration hyperparameters , ρ R and β, respectively. The results, as shown in Fig. 5, suggest that lower values of the exploration hyperparameters are preferred. This is intuitive since a goal of warm-starting bandits is to reduce the demand on exploration during initial rounds. This effect is very prominent, especially in the -greedy algorithm. This can be explained by the fact that exploration in the -greedy is strictly dictated by the value of , while in LinUCB and LinTS the exploration terms are partly influenced by the matrix V t , Column (i) demonstrates ARRoW results with varying λ while column (ii) shows our warm-start approach Diff with varying α. Finally the learners vary over a -greedy, b LinUCB, c LinTS which initially depends on the covariance matrix μ . Therefore, in -greedy, we recommend 'manually' reducing the exploration hyperparameter , while in LinUCB the exploration is partially automatically reduced thanks to the lower exploration boost when μ has smaller eigenvalues.
Effect of pre-training data ratio on performance As previously done in Zhang et al. [32], we can explore the fraction of the dataset available for pre-training. In this section, we present how the cumulative cost evolves as the ratio of the pre-training dataset to the total dataset changes, where the total dataset refers to the union of the pre-training dataset and the bandit deployment dataset. In particular, we investigate the performance over a range of such ratios. Different ratios on the same dataset share the same deployment data; thus the maximum ratio in the experiment, 0.1, is used to determine the deployment dataset. Since there are 20,000 data points in the Letters dataset and 2,000 in the Numbers dataset, we used the last 18,000 and 1,800 data points, respectively. Figure 6 supports the intuition that higher ratios likely lead to better performance. This effect is particularly apparent during the initial increase, while the gain gradually fades as the ratio increases further. This diminishing return can be explained by the fact that the biggest improvement in the correctness of θ occurs at the beginning of the supervised learning, whereas its accuracy, while still increasing, improves more slowly as more data are observed.

Fig. 5 Final cumulative cost for (a) ϵ-greedy, (b) LinUCB and (c) LinTS; the performance appears better when the exploration hyperparameter is relatively small
Effect of misspecified pre-training data ratio on performance A series of experiments investigating sensitivity to the warm-start temperature and exploration hyperparameters was carried out above. We also investigated the effect of the fraction of the dataset used for pre-training in both settings: accurate prior and misspecified prior.
We investigated the effect of a misspecified prior on both datasets. For this, we need to create another dataset in which the true weight θ differs from that of the deployment dataset. To do this, we trained a linear regression on the whole dataset for each arm i, giving us the disjoint parameter θ_1(i), which is then transformed by a rotation matrix R_γ to give a new parameter θ_2(i) = R_γ θ_1(i). For each datum at round t used for pre-training, we extracted the context x_t(i) for all arms, then calculated d_r(i) = (θ_2(i) − θ_1(i))^T x_t(i). This acts as a perturbation of the original reward r_t(i), yielding the inaccurate reward r̃_t(i) = r_t(i) + d_r(i). In our data generation, we calculated the cosine similarities cos(θ_1(i), θ_2(i)) between the two parameters for all arms and both datasets. This consistent rotation attempts to maintain a similar amount of misspecification across datasets; however, as we shall see, properties of the data interact with the magnitude of the perturbation.
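As an illustration, the rotation-based perturbation can be sketched as follows. This is a minimal sketch: the paper does not specify the form of R_γ, so we assume (as one possible choice) a Givens rotation in the plane of the first two coordinates, and the function names are ours.

```python
import numpy as np

def rotate_first_plane(theta, gamma):
    """Apply a Givens rotation by angle gamma in the plane of the first
    two coordinates, leaving all other coordinates unchanged."""
    d = len(theta)
    R = np.eye(d)
    c, s = np.cos(gamma), np.sin(gamma)
    R[0, 0], R[0, 1] = c, -s
    R[1, 0], R[1, 1] = s, c
    return R @ theta

def perturb_reward(r, x, theta1, theta2):
    """Perturb the original reward r for context x by the gap between
    the rotated and original parameters: d_r = (theta2 - theta1)^T x."""
    return r + (theta2 - theta1) @ x

theta1 = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
theta2 = rotate_first_plane(theta1, gamma=0.3)
# Cosine similarity between the original and rotated parameter.
cos_sim = theta1 @ theta2 / (np.linalg.norm(theta1) * np.linalg.norm(theta2))
```

Because a rotation preserves the norm of θ_1(i), the induced misspecification is purely directional, with the angle γ controlling its magnitude.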
Due to the nature of the semi-synthetic dataset generation process, the reward may no longer lie in {0, 1} as previously generated from the classification problem. This observation does not affect the validity of the model, or the appropriateness of warm-start in this setting, thanks to the flexibility of the reward structures accommodated.
We present our results in Fig. 7. Unlike in the previous experiment, we no longer have the privilege of pre-training data that closely resemble the deployment data. It can be seen that for the Letters dataset, some warm-starting provides a modest initial boost to performance, while warm-starting appears to hurt performance on the Numbers dataset.

Synthetic experiments
In generating the artificial dataset, we started by choosing a value for θ. In this case, we chose θ^T = (0.1, 0.3, 0.5, 0.7, 0.9), with the bandit having 10 arms. After the value of θ is chosen, we generate a random vector x_t(i) ∈ R^d, d = 5, where each element is drawn from the uniform distribution U(0, 1) for each i = 1, 2, ..., 10, followed by taking the inner product and adding Gaussian noise ε_i(t) ∼ N(0, R²), R = 0.25, independent of the arm i and round number t. The noisy reward r_i(t) = θ^T x_t(i) + ε_i(t) is saved, as well as the regret of pulling arm i, namely max_{j∈[k]} θ^T x_t(j) − θ^T x_t(i). This makes it possible to compare all bandit algorithms equally without needing off-policy evaluation. We repeat this process 100,000 times, which corresponds to 100,000 rounds of the second-phase dataset.
To generate the pre-training dataset, we first choose the value of α^{−1}, before sampling the true parameter deviation δ ∼ N(0, α^{−1}I). After δ is sampled, we calculate μ = θ − δ and conduct the process exactly as for the second-phase dataset. We generated two types of pre-training dataset: an accurate prior, for which we chose α^{−1} = 10^{−4}, and a misspecified prior, for which we chose α^{−1} = 0.25. We produced 10,000 rounds' worth of pre-training data.
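The two-phase generation described above can be sketched as follows. This is a minimal sketch with fewer rounds than the paper uses, and `generate_rounds` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, R = 5, 10, 0.25
theta = np.array([0.1, 0.3, 0.5, 0.7, 0.9])

def generate_rounds(weights, n_rounds):
    """Generate contexts, noisy rewards, and per-arm regrets for n_rounds."""
    X = rng.uniform(0.0, 1.0, size=(n_rounds, k, d))   # x_t(i) ~ U(0,1)^d
    means = X @ weights                                # theta^T x_t(i)
    rewards = means + rng.normal(0.0, R, size=(n_rounds, k))
    regrets = means.max(axis=1, keepdims=True) - means # regret of pulling each arm
    return X, rewards, regrets

# Deployment-phase dataset (100,000 rounds in the paper; fewer here).
X_dep, r_dep, reg_dep = generate_rounds(theta, 1000)

# Pre-training dataset: sample delta ~ N(0, alpha^{-1} I), set mu = theta - delta.
alpha_inv = 1e-4                                       # accurate-prior regime
delta = rng.normal(0.0, np.sqrt(alpha_inv), size=d)
mu = theta - delta
X_pre, r_pre, _ = generate_rounds(mu, 100)
```

Storing the regret of every arm alongside the reward is what allows all bandit algorithms to be compared directly, without off-policy evaluation.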
We observed that, with datasets generated under both the accurate and misspecified prior regimes, α = 10 appears to be the cut-off point at which all algorithms work quite well. Therefore, for all warm-starting methods, we plot the cumulative regret for α = 10, as shown in Fig. 8.

Results
In the accurate prior regime, it is clear that the hot-started and warm-started bandits outperform the cold-started bandit. This can be explained by the fact that the value of θ is closer to μ̂ or μ than to 0. However, the opposite occurs when the prior is misspecified: the cold-started bandit slightly outperforms the hot-started and warm-started bandits, since θ is then closer to 0 than to μ̂ or μ. It should be noted that we have held the hyperparameter α fixed across all regimes here. When α is tuned optimally, the hot-started and warm-started bandits are able to perform even better, as the pre-training dataset is then treated as if it were the real dataset.

Towards adaptive drift hyperparameter
In this section, we take a closer look at a key hyperparameter of our warm-start algorithms: the drift hyperparameter α, which controls how much exploration follows pre-training. While this has so far been set manually, based on how well the operator believes pre-training to be aligned with deployment, in practice this parameter may sometimes be difficult to set.
Limitations of the current approach The advantage of our current approach to warm-starting, as applied in Algorithms 2, 3 and 4, has centred on the selection of the drift hyperparameter. This drift hyperparameter α has been used as a means of temperature tuning: how much can we trust the initial weight guess? With an accurate prior, a sufficiently large value of α gives the bandit an early advantage in the deployment phase, as unnecessary exploration is eliminated. On the other hand, although the warm-started bandit is somewhat insensitive to α under an accurate prior, its sensitivity increases greatly when the prior is highly misspecified: a large α makes the bandit retain its highly misaligned initial guess and resist changes driven by observations. Therefore, it is advantageous to choose a value of α that is not too far from its optimum. Alternatively, we may attempt to adapt α based on data, which is the approach adopted in this section.
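To make the role of α concrete, the following minimal sketch shows how a drift-style precision hyperparameter trades off the initial guess μ̂ against observed data in a Bayesian ridge (MAP) estimate. It assumes a prior θ ∼ N(μ̂, α^{−1}I) and unit-variance noise, and is an illustration of the trust trade-off rather than the exact update used in our Algorithms 2-4.

```python
import numpy as np

def warm_ridge_estimate(X, y, mu_hat, alpha):
    """MAP estimate of theta under prior theta ~ N(mu_hat, alpha^{-1} I)
    with unit-variance Gaussian noise: shrinks towards mu_hat as alpha grows."""
    d = X.shape[1]
    A = X.T @ X + alpha * np.eye(d)
    return np.linalg.solve(A, X.T @ y + alpha * mu_hat)
```

As α → ∞ the estimate collapses onto μ̂ (full trust in pre-training), while as α → 0 it approaches the ordinary least-squares fit of the deployment data alone (no trust).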

Empirical Bayes
We choose the value of α using the fact that, even though this hyperparameter is completely unknown before the deployment phase starts, a better estimate can be made as we observe more data from the deployment phase. If the data match how the initial weight was chosen, we may decide to put more trust in μ̂ (large α). On the other hand, we may decide to doubt our initial weight when the observed data do not support it (small α). This strategy invites the adoption of empirical Bayes, a general method of using observations to estimate or set prior distributions.
Assumptions In an attempt to do this, we make a hierarchical structural assumption that δ̃ | α ∼ N(0, α^{−1}I_d), where α ∼ Gamma(ᾱ, β̄) for convenience. Furthermore, in order to obtain a well-known distribution, we also assume θ = μ̂ + δ̃ for deterministic μ̂, where the dissimilarity between μ̂ and θ is captured by the random variable α embedded in δ̃. Compared with the initial assumption, α is now treated as a random variable, and the variance of the initial guess μ̂ is now absorbed and partially represented by α.

Lemma 6
With the above assumptions, the marginal δ̃ follows a multivariate Student-t distribution with degrees of freedom ν_t, location μ_t and scale matrix Σ_t, denoted St(ν_t, μ_t, Σ_t).

Proof Firstly, notice that in the case of a one-dimensional (scalar) weight, the joint distribution of (δ̃, α) collapses to a normal-gamma distribution. It is a standard result that the marginal distribution of δ̃ then follows a non-standardised Student-t distribution with degrees of freedom ν_t = 2ᾱ, location μ_t = μ̂ and scale σ²_t = β̄/ᾱ, so we expect a similar result for multidimensional δ̃. To prove the main result, we compute the required marginal density by marginalising α out of the joint distribution, itself found by multiplying the model's likelihood and prior, and noting that the integrand is the pdf of a gamma distribution with shape ᾱ + d/2 and rate β̄ + ½ δ̃^T δ̃, hence integrating to 1. This yields a multivariate t-distribution with ν_t = 2ᾱ, μ_t = 0 and Σ_t = α_t^{−1} I_d = (β̄/ᾱ) I_d. Therefore, we conclude that δ̃ ∼ St(2ᾱ, 0, (β̄/ᾱ) I_d), i.e., a Student-t distribution with zero mean and spherical covariance. Notice that we can express ν_t in terms of α_t and β̄ as ν_t = 2β̄α_t, since α_t = ᾱ/β̄. By setting the hyperparameters in terms of (α_t, β̄), we control the prior of α by its mean α_t and variance α_t/β̄, which is more intuitive than specifying its shape and rate (ᾱ, β̄).
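The scalar case of Lemma 6 can be checked numerically. The following sketch (our own sanity check, not part of the paper) draws from the hierarchy α ∼ Gamma(ᾱ, β̄), δ | α ∼ N(0, α^{−1}), and compares the empirical variance of δ with the Student-t variance σ²ν/(ν − 2) = (β̄/ᾱ) · 2ᾱ/(2ᾱ − 2) = β̄/(ᾱ − 1).

```python
import numpy as np

rng = np.random.default_rng(1)
a_bar, b_bar = 3.0, 2.0            # Gamma shape (a_bar) and rate (b_bar) for alpha

# Hierarchical sampling: alpha ~ Gamma(a_bar, rate=b_bar), delta | alpha ~ N(0, 1/alpha).
n = 500_000
alpha = rng.gamma(shape=a_bar, scale=1.0 / b_bar, size=n)
delta = rng.normal(0.0, 1.0 / np.sqrt(alpha))

# Lemma 6 (scalar case): delta ~ St(nu=2*a_bar, 0, sigma^2=b_bar/a_bar),
# whose variance is sigma^2 * nu / (nu - 2) = b_bar / (a_bar - 1).
empirical_var = delta.var()
theoretical_var = b_bar / (a_bar - 1)
```

With ᾱ = 3 and β̄ = 2, the theoretical variance is 1, and the Monte Carlo estimate agrees closely.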
Adaptive hyperparameter algorithm Since δ̃ follows a Student-t distribution, our assumptions follow the noise premise laid out by Song and Xia [27]. Writing X = (x_1 ⋯ x_n)^T and y = (y_1 ⋯ y_n)^T, the values of α_t and β_t can then be optimised by the q-EM algorithm following Song and Xia [27], summarised in Algorithm 5. This algorithm takes β̄ as its hyperparameter, which controls the degrees of freedom of the underlying distribution of δ̃: when a Gaussian distribution of δ̃ is preferred, we let ν_t → ∞ by letting β̄ → ∞, recovering the Gaussian distribution from the t-distribution. Some steps in Algorithm 5 require expensive computations. To mitigate such costs, Song and Xia [27] suggest diagonalising the Gram matrix X^T X = PDP^T and computing the relevant quantities once beforehand; the quantities required in each iteration can then be calculated cheaply from this decomposition.

Algorithm 5 Adaptive Optimisation of α_t and β_t
Input: X, y, β̄, α_t, β_t, tol

In deriving the per-iteration quantities, the Woodbury matrix identity is used in the second equation and the cyclic property of the trace in the fourth. We note that when one wishes to store X^T X rather than X itself, the second term of the last equality can be computed directly from the stored Gram matrix. These quantities are then used to calculate b_opt and c_opt, which yield new values of α_t and β_t, iterating until convergence.
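The cost-saving step can be sketched as follows: diagonalise X^T X once, after which quantities such as tr((X^T X + α_t I)^{−1}), and the corresponding ridge-style solution for any new α_t, follow from the eigenvalues alone. This is a generic illustration of the trick; the function names are ours, not Song and Xia's.

```python
import numpy as np

def precompute(X, y):
    """One-off eigendecomposition of the Gram matrix: X^T X = P D P^T."""
    evals, P = np.linalg.eigh(X.T @ X)
    Pty = P.T @ (X.T @ y)
    return evals, P, Pty

def per_iteration_quantities(evals, P, Pty, alpha):
    """Cheap per-iteration quantities involving (X^T X + alpha I)^{-1},
    computed from the precomputed eigendecomposition only."""
    inv_diag = 1.0 / (evals + alpha)
    trace_inv = inv_diag.sum()                 # tr((X^T X + alpha I)^{-1})
    theta = P @ (inv_diag * Pty)               # (X^T X + alpha I)^{-1} X^T y
    return trace_inv, theta
```

Each EM iteration then costs O(d²) instead of the O(d³) of a fresh matrix inversion, since only the diagonal term depends on the current α_t.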
Regret bound Algorithm 5 may be invoked at the start of each round to give updated values of α_t and β_t. However, under this adaptive hyperparameter, α^{−1} is no longer independent of the other variables. This violates one of the assumptions made in [1], where the choice of λ is independent of the other variables, so the validity of the oversampling factor becomes questionable. As the regret analysis for LinTS depends on the validity of the upper bound provided by [1], it in turn becomes invalid as well. As such, regret analysis for the adaptive case remains an open problem. A possible remedy is to halt the hyperparameter optimisation update after a certain number of rounds, in which case α^{−1} may be viewed as constant in the long run, and bounds follow as a direct consequence of Theorems 2 and 3.
where R^{TS}(n_s) and R^{RLS}(n_s) are constant, γ_T(δ) = β_T(δ′) √(cd log((c′d)/δ)), and β_T(δ) is the radius of the confidence ellipsoid for the rounds starting at n_s, where S̃ is defined such that ‖θ̃_{n_s+1} − θ‖ ≤ S̃. For LinUCB, this is equivalent to a bound in which Reg(n_s) is constant and β_T(δ) is defined as above.
Experimental results To demonstrate the advantage of adaptive hyperparameter tuning, we repeated the experiment on the artificial dataset. We generated two types of pre-training data: accurate and misspecified. For the accurate dataset we chose true α^{−1} = 10^{−4}, and for the misspecified dataset we chose true α^{−1} = 100. Note that such a high value of α^{−1} in the misspecified dataset is intentionally extreme, to demonstrate the capability of the adaptive hyperparameter algorithm, and hence does not reflect a real-world setting. For the bandit, we used LinUCB with ρ = 0.2, initial bandit hyperparameters α_t = 1 and β_t = 1/R² = 16 (both unchanged over time for bandits with manually chosen hyperparameters), and β̄ = 1 with tol = 0.1 as the convergence requirement for the hyperparameter tuning of both α_t and β_t. As shown in Fig. 9a, the adaptive hyperparameter algorithm is capable of exploiting the accurate prior, even outperforming its non-adaptive counterpart. On the other hand, when the prior is highly misspecified, as in Fig. 9b, a disastrous result occurs for the warm-started bandit without automatic hyperparameters, while our adaptive hyperparameter algorithm is able to detect the mismatch and ignore the initial guess, approaching the performance that would have been obtained had the cold-start regime been deployed.
Fig. 9 Experimental results for (a) an accurate prior and (b) a misspecified prior, comparing cold-start (cold), warm-start with non-adaptive hyperparameters (warm manual) and warm-start with adaptive hyperparameters (warm auto) using LinUCB

Our approach generalises the linear Thompson sampler of Abeille et al. [2] by permitting arbitrary Gaussian priors for potentially improving short-term performance, while maintaining the regret bound that guarantees the long-term performance of Hannan consistency. While little attention has been paid to the warm-start problem since the direction was suggested by Li et al. [15], the few existing works on warm start are far less flexible in catering to potential sources of prior knowledge, and in how uncertainty is quantified. We motivate the opportunity for warm start in the database systems domain, where bandit-based index selection could be pre-trained prior to deployment by users, and we demonstrate the practical potential of warm start on a standard database benchmark. We have also contributed an approach to adapting the key hyperparameters responsible for controlling the exploration temperature, based on the misspecification of pre-training.
As a relatively unexplored area, warm-start bandits offer a number of intriguing directions for future research, well suited to the Thompson sampling framework on which our approach was developed.
Adaptive oversampling factor In this paper, it is assumed that the ℓ₂-norm of the parameter is bounded by S. However, this may not be known with confidence in some applications. In such cases the algorithms are still valid, but the bounds may not be. However, as more data are observed, we gain information (accuracy) about δ: the variance of the random variable δ drops. Therefore, one may wish to bound δ with some level of probability. It is interesting to note that the size of S is closely related to the drift hyperparameter; potentially both quantities could be optimised jointly by one algorithm.
Reward unit mismatch When pre-training data are provided, there is a potential mismatch between the units of the pre-training and deployed datasets. An interesting observation is that the performance of a contextual bandit algorithm is not measured by how close the predicted reward is to the actual reward, but rather by the ranking of the arm values. As such, it is the direction of the initial guess of θ that matters, not its norm. A simple solution could be to learn a constant that scales the pre-training rewards to the deployed rewards. Ideally this scalar would be incorporated into Warm-Start LinTS, provided performance is not sacrificed.
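A least-squares fit of such a scalar can be sketched as follows. This is a minimal illustration assuming a small set of paired observations whose rewards are available in both units; the function name is hypothetical.

```python
import numpy as np

def fit_reward_scale(r_pre, r_dep):
    """Least-squares scalar s minimising ||s * r_pre - r_dep||^2,
    i.e. s = <r_pre, r_dep> / <r_pre, r_pre>."""
    r_pre = np.asarray(r_pre, dtype=float)
    r_dep = np.asarray(r_dep, dtype=float)
    return float(r_pre @ r_dep) / float(r_pre @ r_pre)
```

Since only the direction of θ affects the arm ranking, rescaling pre-training rewards by such a scalar leaves the induced ranking unchanged while aligning the magnitudes.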

Funding Open Access funding enabled and organized by CAUL and its Member Institutions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

A. Full proof of the confidence ellipsoid of warm-started bandit
We now detail the full proof of Theorem 2 by extending a previous analysis [1]. We restate our estimate of the parameter for convenience. Let X_{1:t} and Y_{1:t} be the matrices comprising the contexts and the rewards up to round t, respectively, and let ε_{1:t} be the vector containing their corresponding sub-Gaussian noise. We can then write θ̂_t as

θ̂_t = (X^T_{1:t−1} X_{1:t−1} + V̄_1)^{−1} X^T_{1:t−1} Y_{1:t−1}.
To avoid clutter, let X = X_{1:t−1}, Y = Y_{1:t−1}, ε = ε_{1:t−1}. Then we have V_t = V̄_1 + X^T X, and we can expand the expression for θ̂_t above accordingly. Next, we would like to bound c^T(θ̂_t − θ) for any vector c of appropriate size. Since we have assumed that V̄_1 is positive definite, and V_t is the sum of a positive definite and a positive semi-definite matrix, V_t is also positive definite, so the inner products weighted by V_t and V_t^{−1} are well defined. We can therefore invoke the Cauchy-Schwarz inequality. Now [1, Theorem 1], with V = V̄_1, yields, with probability at least 1 − δ, a bound on ‖X^T ε‖_{V_t^{−1}}. Furthermore, since c can be any vector, we choose c = V_t(θ̂_t − θ). Combining both expressions, and using the fact that V_s ≼ V_t for s ≤ t, we obtain the claimed bound on ‖θ̂_t − θ‖_{V_t}, which concludes the proof.

B Full proof of the regret bound of warm-start LinUCB
The regret analysis for LinUCB is included here for completeness and follows closely the proof laid out by Lattimore and Szepesvári [14]. Let C_t be a closed confidence set containing θ with high probability, such that C_t ⊆ {θ ∈ R^d : ‖θ − θ̂_t‖_{V_t} ≤ β_t}. Furthermore, let θ̃_t ∈ C_t be such that θ̃_t^T x_t = UCB_t(x_t), and define δ̃_t as the vector θ̃_t relative to μ̂. The per-round regret is then bounded via the width of the confidence set, and the next step follows from Jensen's inequality, where

β_T = R √( 2 log( det(V_T)^{1/2} det(R² V̄_1)^{−1/2} / δ ) ) + √( λ_max(R² V̄_1) ) S.