1 Introduction

Kernel selection is a fundamental problem of online kernel learning, which focuses on how to select kernel functions for online kernel learning algorithms on the fly. This problem is also termed online kernel selection, and is related to the more general online model selection (Foster et al. 2017; Muthukumar et al. 2019). Different from offline kernel selection, where we first execute kernel selection on a training set and then learn a predictor for the subsequent prediction tasks, in online kernel selection the kernel selection and online prediction procedures are integrated and form a sequential prediction procedure. Given a collection of kernel functions \(\{\kappa _i\}^K_{i=1}\), which induce K reproducing kernel Hilbert spaces (RKHSs) \(\{{\mathcal {H}}_i\}^K_{i=1}\), an adversary sequentially sends the learner examples \(({\mathbf {x}}_t,y_t)\in \mathbb {R}^d\times \mathbb {R}, t=1,\ldots ,T\). The learner chooses a sequence of kernels \(\{\kappa _{I_t}\}^T_{t=1}\) and a sequence of hypotheses \(\{f_{t}\}^T_{t=1}\). At each round t, the learner suffers a loss \(\ell (f_t({\mathbf {x}}_t),y_t)\). The standard performance measure is the regret. The regret with respect to (w.r.t.) \({\mathcal {H}}_i,i\in [K]\) is defined as follows

$$\begin{aligned} \mathrm {Reg}({\mathcal {H}}_i) := \sum ^T_{t=1}\ell (f_{t}({\mathbf {x}}_t),y_t) - \min _{f\in {\mathcal {H}}_i}\sum ^T_{t=1}\ell (f({\mathbf {x}}_t),y_t). \end{aligned}$$
(1)

Since the best kernel function for the current learning task is unknown, the learner hopes to adapt to any \({\mathcal {H}}_i\) up to a small cost.

A major challenge of online kernel selection is the high computational complexity of evaluating kernel functions, which requires operating on the observed examples and thus incurs O(T) per-round time and space complexity. We can address this problem from two computational perspectives. The first aims at reducing the computational complexity, and most previous work follows this line. The random feature based online kernel selection approach (Nguyen et al. 2017) embeds the implicit RKHSs into relatively low-dimensional explicit feature spaces, in which the time and space complexity of evaluating kernel functions are linear in the dimension of the random feature spaces. The sketch based online kernel selection approach (Zhang and Liao 2018, 2020) maintains a budget and incrementally constructs sketched hypothesis spaces, in which the time and space complexity are linear in the budget size. Another approach reduces online kernel selection to a problem of prediction with expert advice, and uses a master algorithm to wrap computationally efficient online kernel learning algorithms, including budgeted online kernel learning (Crammer et al. 2003; Dekel et al. 2008; Orabona et al. 2009; Koppel et al. 2019), and online kernel learning based on low-rank matrix approximation or projection onto a low-dimensional space (Lu et al. 2016; Jézéquel et al. 2019). For instance, Foster et al. (2017) studied online model selection in Banach spaces and developed a multi-scale expert advice algorithm, which can adapt to the loss ranges of different hypothesis sets.

The second computational perspective limits the usable computational resources and is more practical for online learning problems. Previous work did not consider this computational perspective, or only indirectly considered memory constraints (Nguyen et al. 2017; Zhang and Liao 2018). Thus many fundamental problems induced by computational constraints have been overlooked. The first fundamental problem is how the regret depends on the computational constraints, T and K, where K is the number of candidate kernel functions. For instance, given a memory budget B, it is still unclear how the lower bound on the regret depends on B, T and K. The second problem is what the differences between memory constraints and time constraints are. The main obstacle induced by the computational constraints is how to avoid splitting the available computational resources over the K RKHSs. Existing approaches allocate the computational resources across the K RKHSs, and thus may not be optimal.

In this paper, we study online kernel selection under computational constraints, where the kernel selection and online prediction procedures are restricted by a memory budget or a time budget of \({\mathcal {T}}\) quanta. We focus on the worst-case regret analysisFootnote 1 and solve the above two fundamental problems. To start with, we make mild assumptions that relate the memory budget and time budget to an example budget. Thus we only consider online kernel selection approaches that operate on a subset of the observed examples. For unconstrained RKHSs and convex loss functions, we separately prove lower bounds on the regret under a memory budget and a time budget. Our proof technique is novel: it relies on a sequence of equi-distant instances and does not require orthogonality or approximate orthogonality in RKHSs. For online kernel selection with memory constraints, we reduce the problem to prediction with expert advice, and establish two nearly optimal algorithms with different regret bounds. The key techniques are a memory sharing scheme and a hypothesis space discretization scheme. For online kernel selection with time constraints, we consider two cases. If \(K\le d\), the number of features, this problem is equivalent to the case of memory constraints. In the case of \(K>d\), the two problems are different. We reduce the problem to the multi-armed bandit problem with an additional observation, and establish a nearly optimal algorithm. The key is a decoupled exploration-exploitation scheme. Table 1 gives a summary of the main results.

Table 1 Summary of main results

1.1 Related work

Online kernel learning with a memory budget has been studied for years (Crammer et al. 2003; Dekel et al. 2008; Orabona et al. 2009). The bounded online gradient descent algorithm (Zhao et al. 2012) enjoys a \(O((\Vert f\Vert ^2_{{\mathcal {H}}}+1){T}/{\sqrt{B}})\) expected regret bound for the hinge loss. However, the matching lower bound is still unknown. Dekel et al. (2008) proved an incomplete hardness result: there exists a sequence of examples and a fixed hypothesis that makes no mistakes, while any online kernel learning algorithm with limited memory always makes mistakes. How the lower bound depends on the memory budget is still unclear. For smooth loss functions, Zhang et al. (2013) proved a \(\varOmega (T/B)\) lower bound on the regret in the case of \(B=O(\sqrt{T})\). Cesa-Bianchi et al. (2015) studied the complexity of offline kernel learning with memory constraints, and proved several lower bounds on the optimization error, which is different from regret. Our work studies the lower bounds for online kernel selection with computational constraints, and the results also apply to online kernel learning.

Agarwal et al. (2011) initiated the study of computationally budgeted model selection, where the model selection procedure is restricted to a time budget. For a finite collection of model classes, by reducing the problem to a stochastic bandit problem, an upper-confidence bound algorithm was established, which achieves the model selection oracle inequality. The algorithm is not suitable for online kernel selection, since the environment may not be i.i.d. Our work is also related to online multiple kernel learning (Jin et al. 2010; Hoi et al. 2013). Given K candidate RKHSs, at each round t, the goal is to learn a linear combination of K predictions. Sahoo et al. (2014) proposed budgeted online multi-kernel regression algorithms, which use a budget B to limit the number of support vectors. However, they did not prove how the regret upper bound depends on B. Besides, the per-round time complexity of such algorithms is linear in K. Within time constraints, such algorithms allocate the time resources across the K RKHSs, which may not be optimal. Our work reveals how the upper bound depends on the computational constraints, T and K, and fills in the omitted regret analysis.

There is other related work, including parameter-free online learning (McMahan and Abernethy 2013; McMahan and Orabona 2014; Cutkosky and Boahen 2016), and model selection for multi-armed bandit problems (Agarwal et al. 2017; Foster et al. 2019), where the CORRAL algorithm (Agarwal et al. 2017) was proposed for selecting bandit algorithms on the fly. For the problems we focus on, the sub-algorithms are online kernel learning algorithms rather than bandit algorithms, thus CORRAL is not the best candidate. Parameter-free online learning aims at making regret bounds depend on \(\Vert f\Vert _{{\mathcal {H}}}\) rather than \((\Vert f\Vert ^2_{{\mathcal {H}}}+1)\). Previous work did not consider computational constraints. Our work achieves this goal within memory constraints.

1.2 Contributions

We study online kernel selection in the regime of memory constraints or time constraints, and analyze the regret in the worst case. Our contributions can be summarized as follows.

  • We prove worst-case lower bounds on the regret of budgeted online kernel selection algorithms with memory constraints or time constraints. The lower bounds on the regret reveal the lower bounds on the computational constraints that are necessary for achieving a given upper bound on the regret. As a byproduct, our results apply to online kernel learning with memory constraints and complete the partial result established by Dekel et al. (2008).

  • We identify, for the first time, the condition under which online kernel selection with time constraints is different from that with memory constraints.

  • We separately propose nearly optimal algorithms for the two kinds of computational constraints, which rely on several new techniques, such as memory sharing, hypothesis space discretization and a decoupled exploration-exploitation scheme.

2 Problem setup

Let \({\mathcal {I}}_T:=\{({\mathbf {x}}_t,y_t)\}_{t\in [T]}\) be a sequence of examples, where \({\mathbf {x}}_t\in {\mathcal {X}}\subset \mathbb {R}^d\) is an instance, \(y_t\in [-Y,Y]\) is the output and \([T] = \{1,\ldots ,T\}\). Let \(\kappa (\cdot ,\cdot ):\mathbb {R}^d \times \mathbb {R}^d \rightarrow \mathbb {R}\) be a positive semidefinite kernel function, and \({\mathcal {H}}\) be the RKHS associated with \(\kappa\), such that, for any \(f\in {\mathcal {H}}\), (i) \(\langle f,\kappa ({\mathbf {x}},\cdot )\rangle _{{\mathcal {H}}}=f({\mathbf {x}}), \forall {\mathbf {x}}\in {\mathcal {X}}\), and (ii) \({\mathcal {H}}=\overline{\mathrm {span}(\kappa ({\mathbf {x}},\cdot )\vert {\mathbf {x}}\in {\mathcal {X}})}\). We define \(\langle \cdot ,\cdot \rangle _{{\mathcal {H}}}\) as the inner product in \({\mathcal {H}}\), which induces the norm \(\Vert f\Vert _{{\mathcal {H}}}=\sqrt{\langle f,f\rangle _{{\mathcal {H}}}}\). We assume the loss function \(\ell :\mathbb {R}\times {[-Y,Y]}\rightarrow \mathbb {R}_{+}\) is convex in its first argument.

Given a collection of kernel functions \({\mathcal {K}} = \{\kappa _i\}^K_{i=1}\), which induce K RKHSs \({\mathcal {H}}=\{{\mathcal {H}}_i\}^K_{i=1}\), if an oracle gave the best kernel \(\kappa ^*\) for \({\mathcal {I}}_T\), then we would just need to learn a sequence of hypotheses in \({\mathcal {H}}^*\). Lacking prior knowledge of \({\mathcal {H}}^*\), the learner hopes to develop a kernel selection algorithm and generate a sequence of hypotheses \(\{f_t\}^T_{t=1}\) which is competitive with that generated by the same algorithm running in \({\mathcal {H}}^*\) alone. The regret of the algorithm w.r.t. \({\mathcal {H}}_i\in {\mathcal {H}}\) is defined in (1). For the sake of clarity, we restate it as follows,

$$\begin{aligned} \mathrm {Reg}({\mathcal {H}}_i) := \sum ^T_{t=1}\ell (f_{t}({\mathbf {x}}_t),y_t) - \min _{f\in {\mathcal {H}}_i}\sum ^T_{t=1}\ell (f({\mathbf {x}}_t),y_t). \end{aligned}$$

To adapt to the unknown \({\mathcal {H}}^*\), a feasible approach is to keep sub-linear regret w.r.t. any \({\mathcal {H}}_i\).

To achieve this goal, the main challenge is the high time and space complexity. If we do not limit the model size, then the per-round time complexity and the space complexity would be O(T). In this paper, we consider online kernel selection under computational constraints, including a memory budget or a time budget, and analyze the worst-case regret. Next, we define the two kinds of computational constraints.

Definition 1

(Memory Budget) Define a memory budget of \({\mathcal {T}}\) quanta as the maximal memory that any online kernel selection algorithm can use.

Definition 2

(Time Budget) Let the arrival-time interval between \({\mathbf {x}}_t\) and \({\mathbf {x}}_{t+1}, t=1,\ldots ,T\), be less than \({\mathcal {T}}\) quanta. Define a time budget of \({\mathcal {T}}\) quanta as the maximal time interval within which any online kernel selection algorithm must output the predictions of \({\mathbf {x}}_t\) and \({\mathbf {x}}_{t+1}\).

In Definition 1, the term “quanta” is the unit of memory, such as “byte”. In Definition 2, the term “quanta” is the unit of time, such as “millisecond” or “second”. We further assume that the base kernels satisfy the following property.

Assumption 1

For all \(\kappa _i\in {\mathcal {K}}\) and \({\mathbf {u}},{\mathbf {v}}\in {\mathcal {X}}\), let \(\kappa _i({\mathbf {u}},{\mathbf {v}})\) be a function of \(\langle {\mathbf {u}},{\mathbf {v}}\rangle\), \(\Vert {\mathbf {u}}\Vert _2\) and \(\Vert {\mathbf {v}}\Vert _2\), and \(\kappa _i({\mathbf {u}},{\mathbf {u}})\in [0,D_{i}]\).

Such kernels are also called Euclidean kernels (Kothari and Livni 2020). For simplicity, let \(D:=\max _iD_i\). Common kernel functions, such as shift-invariant kernels and polynomial kernels with bounded degree, satisfy this assumption. We further give three key assumptions, which reduce the memory budget and time budget to an example budget.

Assumption 2

Let the memory budget be linear in the space complexity of the algorithm, and the time budget be linear in the time complexity of the algorithm.

The space complexity is defined as the memory required by the algorithm. Thus it is natural to assume that the memory budget is linear in the space complexity of the algorithm. Similarly, assume that m multiplication operations can be executed within a unit of time. For a given time budget of \({\mathcal {T}}\) quanta, the algorithm can execute \(m{\mathcal {T}}\) multiplication operations. The time complexity of the algorithm is defined as the total number of multiplication operations. Thus we can also assume that the time budget is linear in the time complexity.

Assumption 3

Under the condition of Assumptions 1 and 2, for any kernel \(\kappa \in {\mathcal {K}}\), there exist positive integers \(\alpha\) and \(\beta\) such that any budgeted online kernel learning algorithm running in \({\mathcal {H}}_{\kappa }\) can maintain a budget storing \(B\le \alpha {\mathcal {T}}\) examples within a memory budget of \({\mathcal {T}}\) quanta, or can execute \(B\le \beta {\mathcal {T}}\) kernel evaluations at each round within a time budget of \({\mathcal {T}}\) quanta. If the space complexity and time complexity of the algorithm are linear in B, then “\(=\)” holds.

Assumption 4

Under the condition of Assumption 3, let the maximal memory budget \({\mathcal {T}}\) satisfy \(B=T\), and the maximal time budget \({\mathcal {T}}\) satisfy \(B=T\).

Assumption 4 means that there is no need to assume an infinite \({\mathcal {T}}\) unless T is infinite. The reason is that any algorithm can store at most T examples. In practice, \({\mathcal {T}}\) may be very small. In Assumption 3, the budgeted online kernel learning algorithms are algorithms that operate on a subset of the observed examples, such as the Forgetron (Dekel et al. 2008), BOGD (Zhao et al. 2012) and BSGD (Wang et al. 2012), to name but a few. We claim that \(\alpha\) and \(\beta\) are independent of the kernel function. This is reasonable, since the memory cost is used to store the support vectors and coefficient vectors, and the time cost of computing \(\kappa (\mathbf{u},\mathbf{v})\) is dominated by computing \(\langle \mathbf{u},\mathbf{v}\rangle\), \({\Vert \mathbf{u}\Vert _2}\) and \({\Vert \mathbf{v}\Vert _2}\). We only focus on convex loss functions. Online gradient descent has the lowest space and time complexity, which is O(dB), where B is the budget size. For algorithms whose time complexity is \(O(dB^\gamma ),\gamma >1\), “=” does not hold in Assumption 3. Based on the above three assumptions, we only consider online kernel selection algorithms that work in implicit RKHSs and operate on finitely many examples. For the sake of clarity, we call such algorithms budgeted online kernel selection algorithms.

Next we restate the main questions.

Q 1:

How does the regret depend on \({\mathcal {T}}\), T and K in the worst case?

Q 2:

What are the differences between memory constraints and time constraints?

To answer the two questions, we need to solve the following two problems: (i) proving the lower bounds on the regret under memory constraints or time constraints, and (ii) establishing algorithms achieving the lower bounds. Our main contribution is providing nearly complete answers to the two questions.

3 Online kernel selection with memory constraints

In this section, we give both a lower bound on the regret for online kernel selection with a memory budget and two simple algorithms nearly achieving the lower bound.

3.1 Lower bound

We select K Gaussian kernel functions \(\kappa _i({\mathbf {x}},{\mathbf {z}}) = \exp \left( -\frac{\Vert {\mathbf {x}}-{\mathbf {z}}\Vert ^2_2}{\sigma _i}\right)\), \(i\in [K]\) as the candidates. Without loss of generality, let \(0<\sigma _1<\ldots <\sigma _K\), where \(\sigma _K\) is a bounded constant. We can also create candidates from other kernel functions, such as polynomial kernels, or the mixture of polynomial kernels and Gaussian kernels.

Theorem 1

Let \(\ell (\cdot ,\cdot )\) be the hinge loss or the absolute loss. There exist K kernel functions \(\{\kappa _{i}\}^K_{i=1}\) selected by the learner, and a sequence of examples \(\{({\mathbf {x}}_t,y_t)\}^T_{t=1}\) selected by an oblivious adversary, where \(y_t\in \{-1,1\}\), such that, for a memory budget of \({\mathcal {T}}\) quanta, under the condition of Assumption 3, for all \(\kappa _i\), the expected regret of any budgeted online kernel selection algorithm satisfies

$$\begin{aligned} \mathbb {E}\left[ \sum ^T_{t=1}\ell (f_{t}({\mathbf {x}}_t),y_t)\right] - \sum ^T_{t=1}\ell (f^*_i({\mathbf {x}}_t),y_t) =\left\{ \begin{array}{ll} \varOmega \left( \left\| f^*_i\right\| _{{\mathcal {H}}_{i}}L\sqrt{T}\right) &{} \mathrm {if}~T=O(\alpha {\mathcal {T}}),\\ \varOmega \left( \left\| f^*_i\right\| _{{\mathcal {H}}_{i}}L\frac{T}{\sqrt{\alpha {\mathcal {T}}}}\right) &{} \mathrm {otherwise}, \end{array} \right. \end{aligned}$$
(2)

where L is the Lipschitz constant of \(\ell\), and \(f^*_i=\mathrm {argmin}_{f\in {\mathcal {H}}_i}\sum ^T_{t=1}\ell (f({\mathbf {x}}_t),y_t)\).

According to the lower bound, we can infer the relation between the upper bound on the regret and the lower bound on the required memory budget. In the case of \(T=O(\alpha {\mathcal {T}})\), the optimal upper bound on the regret is \(O\left( \left\| f^*_i\right\| _{{\mathcal {H}}_{i}}L\sqrt{T}\right)\). In the case of \(T=\varOmega (\alpha {\mathcal {T}})\), the optimal upper bound is \(O\left( \left\| f^*_i\right\| _{{\mathcal {H}}_{i}}L\frac{T}{\sqrt{\alpha {\mathcal {T}}}}\right)\). Let \(T(\alpha {\mathcal {T}})^{-\frac{1}{2}}\le CT^{\upsilon },\frac{1}{2}\le \upsilon <1\), where C is a constant. Solving the inequality yields that the required lower bound on the memory budget satisfies \({\mathcal {T}}\ge C^{-2}\alpha ^{-1} T^{2(1-\upsilon )}\). In the worst case, achieving a \(O(T^\upsilon ),\frac{1}{2}\le \upsilon <1\) regret bound requires a memory budget of order \(\varOmega (T^{2-2\upsilon })\). The lower bound on the regret seems surprising and may not be a strong result, since it is independent of K. We will show that it is optimal up to an additional penalty term.

If \(K=1\), then Theorem 1 reveals the lower bound for budgeted online kernel learning algorithms. We cannot obtain a \(O(\Vert f^*_1\Vert _{{\mathcal {H}}_1}L\sqrt{T})\) regret bound unless the memory budget satisfies \({\mathcal {T}}=\varOmega (T/\alpha )\). The BOGD algorithm (Zhao et al. 2012) enjoys a \(O((\Vert f^*_1\Vert ^2_{{\mathcal {H}}_1}+1)L{T}/{\sqrt{\alpha {\mathcal {T}}}})\) expected regret bound, which is optimal w.r.t. T, but sub-optimal w.r.t. \(\Vert f^*_1\Vert _{{\mathcal {H}}_1}\). Dekel et al. (2008) proved an incomplete hardness result for online kernel learning under a memory budget B. There always exist \(B+1\) examples such that any algorithm storing only B examples makes \(T=B+1\) mistakes. Besides, there is a hypothesis \(f^*_1\in {\mathcal {H}}_1\) satisfying \(\Vert f^*_1\Vert _{{\mathcal {H}}_1}=\sqrt{B+1}\) that never makes a mistake and attains a hinge loss of 0. Actually, their lower bound on the mistakes equals the lower bound on the regret for the hinge loss; or rather, the lower bound on the regret is \(B+1 = \Vert f^*_1\Vert _{{\mathcal {H}}_1}\sqrt{T}\), where we use the specific identity \(T=B+1\). The weakness of this lower bound is that it cannot be extended to the case \(B=o(T)\). Our result in Theorem 1 provides a complete answer to the question.

3.2 A nearly optimal algorithm for any K

An intuitive approach is to allocate the memory budget across the K base kernels. According to the lower bound (2), such an approach increases the regret by a factor of order \(O(\sqrt{K})\). Recall that any hypothesis \(f_i\in {\mathcal {H}}_i\) can be represented by \(f_i=\sum ^T_{t=1}a_{t,i}\kappa _i({\mathbf {x}}_t,\cdot )\). Thus the memory cost is used to store the support vectors \(\{({\mathbf {x}}_t,y_t)^T_{t=1}:a_{t,i}\ne 0\}\) and the coefficients \(\{(a_{t,i})^T_{t=1}:a_{t,i}\ne 0\}\). Based on this observation, we present an algorithm that shares the support vectors and a single coefficient vector among the K hypotheses \(\{f_i\}^K_{i=1}\).

Instead of selecting kernels from a finite collection \(\{\kappa _1,\ldots ,\kappa _K\}\), we will select kernels from an infinite kernel space \({\mathcal {K}}\) defined as follows,

$$\begin{aligned} {\mathcal {K}}=\left\{ \kappa =\sum ^K_{i=1}p_i\kappa _i:\sum ^K_{i=1}p_i=1,p_i\ge 0\right\} . \end{aligned}$$

The learning of the weight vector \(\mathbf{p}\) will be clarified later. At the beginning of round t, assume that there is a weight vector \(\mathbf{p}_t\). We learn a new kernel \(\kappa _{\mathbf{p}_t}=\sum ^K_{i=1}p_{t,i}\kappa _i\), which induces a RKHS \({\mathcal {H}}_{\mathbf{p}_t}\) with embedding \(\phi _{\mathbf{p}_t}:{\mathcal {X}}\rightarrow {\mathcal {H}}_{\mathbf{p}_t}\) defined as follows

$$\begin{aligned} \phi _{\mathbf{p}_t}(\mathbf{x}) = \left( \sqrt{p_{t,1}}\phi ^\top _{\kappa _1}(\mathbf{x}), \ldots ,\sqrt{p_{t,K}}\phi ^\top _{\kappa _K}(\mathbf{x})\right) ^\top ,\forall \mathbf{x}\in {\mathcal {X}}, \end{aligned}$$
(3)

where \(\phi _{\kappa _i}\) is the embedding induced by \(\kappa _i\). We select a hypothesis \(f_t\in {\mathcal {H}}_{\mathbf{p}_t}\), defined by

$$\begin{aligned} f_t=\sum ^{{t-1}}_{\tau =1}a_{\tau }\phi _{\mathbf{p}_t}(\mathbf{x}_{\tau }) =&\left( \sqrt{p_{t,1}}\sum ^{{t-1}}_{\tau =1}a_{\tau }\phi ^\top _{\kappa _1}(\mathbf{x}_{\tau }), \ldots ,\sqrt{p_{t,K}}\sum ^{{t-1}}_{\tau =1}a_{\tau } \phi ^\top _{\kappa _K}(\mathbf{x}_{\tau })\right) ^\top \nonumber \\ =&\left( {\sqrt{p_{t,1}}}f_{t,1},\ldots ,\sqrt{p_{t,K}}f_{t,K}\right) . \end{aligned}$$
(4)

The prediction is given by \(f_t(\mathbf{x}_t)= \langle f_t,\phi _{\mathbf{p}_t}(\mathbf{x}_t)\rangle _{{{\mathcal {H}}_{\mathbf{p}_t}}} =\sum ^K_{i=1}p_{t,i}f_{t,i}(\mathbf{x}_t)\), or \(\mathrm {sign}(f_{t}(\mathbf{x}_t))\) for classification. Although there are K hypotheses \(\{f_{t,i}\}^K_{i=1}\), we just need to maintain a single set of support vectors and a single coefficient vector \((a_1,\ldots ,{a_{t-1}})\).
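To make the memory sharing concrete, the following Python sketch (our own illustration with hypothetical names, not the paper's code) evaluates the K outputs \(f_{t,i}(\mathbf{x}_t)\) from one shared support set and one shared coefficient vector, and combines them into the prediction \(\sum ^K_{i=1}p_{t,i}f_{t,i}(\mathbf{x}_t)\).

```python
import numpy as np

def predict(x_t, S, a, kernels, p_t):
    """Shared-memory prediction: one support set S and one coefficient
    vector a serve all K hypotheses f_{t,i}.

    S       : list of stored support vectors (d-dimensional arrays)
    a       : shared coefficients, len(a) == len(S)
    kernels : list of K kernel functions kappa_i(u, v) -> float
    p_t     : weight vector over the K kernels, entries sum to 1
    """
    if not S:
        return 0.0, np.zeros(len(kernels))
    # f_{t,i}(x_t) = sum_tau a_tau * kappa_i(x_tau, x_t), same a for every i
    f_vals = np.array([
        sum(a_tau * kappa(x_tau, x_t) for a_tau, x_tau in zip(a, S))
        for kappa in kernels
    ])
    # f_t(x_t) = sum_i p_{t,i} * f_{t,i}(x_t)
    return float(np.dot(p_t, f_vals)), f_vals
```

Since an extra kernel adds no support vectors, only K extra outputs and weights, the memory cost stays of order \(O(dB+K)\), matching the space complexity discussed in Remark 1.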

To respect the memory constraint, we propose a simple example-adding strategy. At any round t, let \(\nabla _{f_t}:=\ell '(f_t(\mathbf{x}_t),y_t)\phi _{\mathbf{p}_t}(\mathbf{x}_t)\) be the (sub-)gradient of \(\ell (f_t(\mathbf{x}_t),y_t)\) w.r.t. \(f_t\). We define a Bernoulli random variable \(\rho _{t}\in \{0,1\}\) satisfying

$$\begin{aligned} \mathbb {P}[\rho _{t}=1]=\min \left\{ 1,\frac{C}{z_{t}}\right\} \cdot \mathbb {I}_{\nabla _{f_t}\ne 0}, \end{aligned}$$
(5)

where \(C>0\) is a constant and \(z_t>0\) depends on t. The definitions of C and \(z_t\) are given in Theorem 2. Let S be a buffer storing the support vectors. We sample \(\rho _{t}\sim \mathrm {Ber}(\mathbb {P}[\rho _{t}=1],1)\). If \(\rho _{t}=1\), then we update \(f_{t}\) and add the current example to the buffer, i.e., \(S = S\cup \{(\mathbf{x}_t,y_t)\}\). Let \({\tilde{\nabla }}_{f_t}\) be an estimator of \(\nabla _{f_t}\), defined as follows,

$$\begin{aligned} {\tilde{\nabla }}_{f_t}:=\frac{\nabla _{f_t}}{\mathbb {P}[\rho _t=1]}\mathbb {I}_{\rho _t=1} =\tilde{\ell '}(f_t(\mathbf{x}_t),y_t)\phi _{\mathbf{p}_t}(\mathbf{x}_t), \quad \tilde{\ell '}(f_t(\mathbf{x}_t),y_t):= \frac{\ell '(f_t(\mathbf{x}_t),y_t)}{\mathbb {P}[\rho _t=1]}\mathbb {I}_{\rho _t=1}. \end{aligned}$$

We update the hypothesis by online gradient descent

$$\begin{aligned} f_{t+1} =f_{t}-{\lambda }\tilde{\ell '}(f_t(\mathbf{x}_t),y_t)\phi _{\mathbf{p}_t}(\mathbf{x}_t), \end{aligned}$$

where \(\lambda\) is the learning rate (or stepsize) of gradient descent. According to (3) and the definition of \(f_t\) (4), the above updating can be rewritten by

$$\begin{aligned} f_{t+1,i} =f_{t,i}-{\lambda }\tilde{\ell '}(f_t(\mathbf{x}_t),y_t)\phi _{\kappa _i}(\mathbf{x}_t),\quad \forall i=1,\ldots ,K. \end{aligned}$$

For simplicity, we define \(\nabla _{t,i}:=\ell '(f_t(\mathbf{x}_t),y_t)\phi _{\kappa _i}(\mathbf{x}_t)\).
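A minimal sketch of this randomized budgeted update (hypothetical names; `loss_deriv` computes \(\ell '\) and `lam` is the learning rate \(\lambda\)): with probability \(\min \{1,C/z_t\}\) the example is stored and the importance-weighted gradient step enters through a single appended coefficient, which moves all K hypotheses \(f_{t,i}\) simultaneously; otherwise \(f_{t+1}=f_t\).

```python
import numpy as np

rng = np.random.default_rng(0)

def budgeted_update(x_t, y_t, y_hat, S, a, loss_deriv, lam, C, z_t):
    """One round of the randomized update (5): store (x_t, y_t) with
    probability min(1, C / z_t) and take an importance-weighted OGD step."""
    g = loss_deriv(y_hat, y_t)          # ell'(f_t(x_t), y_t)
    if g == 0.0:                        # zero (sub-)gradient: no update
        return S, a
    prob = min(1.0, C / z_t)
    if rng.random() < prob:             # the event rho_t = 1
        S.append(x_t)
        # appending one shared coefficient realizes, for every i,
        # f_{t+1,i} = f_{t,i} - lam * (g / prob) * phi_{kappa_i}(x_t)
        a.append(-lam * g / prob)
    return S, a
```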

To update \(\mathbf{p}_t\), we reduce the problem to prediction with expert advice. Let \(c_{t,i}\) be a criterion evaluating the base kernel \(\kappa _i\), \(i=1,\ldots ,K\), which serves as the loss of the i-th action.

$$\begin{aligned} c_{t,i}=\left\{ \begin{array}{ll} \frac{\ell '(f_{t}({\mathbf {x}}_t),y_t) \left( f_{t,i}({\mathbf {x}}_t)-\min _{j=1,\ldots ,K}f_{t,j}({\mathbf {x}}_t)\right) }{\max \{\ell _{m},1\}}&{} \quad \mathrm {if}~\ell '(f_{t}({\mathbf {x}}_t),y_t)>0,\\ \frac{\ell '(f_{t}({\mathbf {x}}_t),y_t) \left( f_{t,i}({\mathbf {x}}_t)-\max _{j=1,\ldots ,K}f_{t,j}({\mathbf {x}}_t)\right) }{\max \{\ell _{m},1\}}&{}\quad \mathrm {otherwise}, \end{array} \right. \end{aligned}$$
(6)

where \(\ell _{m}=\max _{t}\{\vert \ell '(f_{t}({\mathbf {x}}_t),y_t)\vert \cdot \max _{i,j} \left( f_{t,i}({\mathbf {x}}_t)-f_{t,j}({\mathbf {x}}_t)\right) \}\) and can be tuned by the doubling trick. Let \({\mathcal {E}}(K)\) be the exponential weights algorithm in (Cesa-Bianchi and Lugosi 2006) (see Sect. 4.2). Then \(\mathbf{p}_{t+1}=(p_{t+1,1},\ldots ,p_{t+1,K})\) can be computed as follows,

$$\begin{aligned} p_{t+1,i}=\frac{p_{t,i}\exp (-\eta c_{t,i})}{\sum ^K_{j=1}p_{t,j}\exp (-\eta c_{t,j})}, \end{aligned}$$

where \(\eta\) is the learning rate.
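The weight update is standard exponential weights over the criterion (6); a sketch under the assumption that the per-kernel outputs `f_vals` (as in the prediction sketch above) and the derivative \(\ell '\) are available (names are ours):

```python
import numpy as np

def update_weights(p_t, f_vals, g, ell_m, eta):
    """Exponential weights step on the criterion (6).

    p_t    : current weights over the K base kernels
    f_vals : array of f_{t,i}(x_t), i = 1, ..., K
    g      : ell'(f_t(x_t), y_t)
    ell_m  : normalizing constant (tuned by the doubling trick)
    eta    : learning rate of exponential weights
    """
    anchor = f_vals.min() if g > 0 else f_vals.max()
    c_t = g * (f_vals - anchor) / max(ell_m, 1.0)  # nonnegative losses
    w = p_t * np.exp(-eta * c_t)
    return w / w.sum()
```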

We name the algorithm LKMBooks (Learning Kernel for Memory BOunded Online Kernel Selection). The algorithm description is shown in Algorithm 1.

[Algorithm 1: LKMBooks (pseudocode figure)]

Theorem 2

Let \(E_t=\{\tau <t:\nabla _{f_{\tau }}\ne 0\}\), \(B=\alpha {\mathcal {T}}\) and \(C=B\). Let \(z_{t}=(1-\upsilon )T^{1-\upsilon }(\vert E_t\vert +1)^{\upsilon }\), where \(0\le \upsilon <1\). If there exists a \(\upsilon \in [0,1)\) satisfying \((1-\upsilon )T^{1-\upsilon } > B\), then for any sequence \({\mathcal {I}}_T\), with probability at least \(1-\delta\), LKMBooks guarantees that

$$\begin{aligned} \vert S\vert \le B+\frac{2}{3}\ln \frac{1}{\delta }+\sqrt{2B\ln \frac{1}{\delta }}. \end{aligned}$$

Otherwise, \(\vert S\vert \le B\).

Theorem 2 shows that our algorithm does not exceed the memory constraint with high probability. \(z_t\) determines the probability that the current example is added to the budget. It is worth noting that the key in \(z_t\) is the value of \(\upsilon\). If \(\upsilon =0\), then each support vector is added to the budget with the same probability. We can also use a non-uniform probability distribution, i.e., \(\upsilon >0\). In this case, the probability decreases as the number of support vectors increases. In the experiments, we always set \(\upsilon >0\) and empirically find that the non-uniform probability distribution performs better. In theory, the two kinds of probability distributions are equivalent in the sense that they induce the same budget size and regret bounds.

Theorem 3

Given a memory budget of \({\mathcal {T}}\) quanta, under the condition of Assumption 3, let \(B=\alpha {\mathcal {T}}\). Assume that \(\ell\) satisfies \(\vert \ell '(f({\mathbf {x}}),y)\vert \le L\). Let \({\mathcal {K}}=\{\kappa _i\}^K_{i=1}\) be a collection of kernel functions, and \(\eta =\sqrt{8\ln (K)/T}\). If \(B<T\), then let \(\lambda ={\sqrt{(1+\upsilon )B}}/{(\sqrt{(1-\upsilon )D}LT)}\). Otherwise, let \(\lambda =1/(\sqrt{DT}L)\). For any \(\kappa _i\in {\mathcal {K}}\), the expected regret of LKMBooks satisfies

$$\begin{aligned} \mathbb {E}\left[ \mathrm {Reg}({\mathcal {H}}_{i})\right] \le O\left( \max \{\ell _{m},1\}\sqrt{T\ln {K}}+ (\Vert f^*_i\Vert ^2_{{\mathcal {H}}_{i}}+1)L \max \left\{ \sqrt{T},\frac{T}{\sqrt{\alpha {\mathcal {T}}}}\right\} \right) . \end{aligned}$$

Remark 1

LKMBooks is similar to the online multi-kernel learning algorithm in (Jin et al. 2010) (Algorithm 5, denoted by DA-OMKL-O for simplicity) and the budgeted online multi-kernel regression algorithm in (Sahoo et al. 2014) (denoted by BOMKR for simplicity), since the three algorithms use a convex combination of the K outputs \(\{f_{t,i}({\mathbf {x}}_t)\}^K_{i=1}\). The difference is that DA-OMKL-O and BOMKR give \(\{f_{t,i}\}^K_{i=1}\) different coefficient vectors, whereas LKMBooks makes \(\{f_{t,i}\}^K_{i=1}\) share a single coefficient vector. Besides, DA-OMKL-O does not limit the number of support vectors, and one of the two versions of BOMKR also cannot share the support vectors. The space complexity of LKMBooks is \(O(dB+K)\). The two versions of BOMKR suffer \(O(dB+KB+K)\) and O(KBd) space complexity, respectively. In the case of \(K \gg d\), LKMBooks has the lowest space complexity. Moreover, BOMKR does not come with a regret bound.

We now consider the optimality w.r.t. \({\mathcal {T}}, T\) and K. Compared with the lower bound (2), LKMBooks is optimal up to an additional penalty term of order \(O(\max \{\ell _{m},1\}\sqrt{T\ln {K}})\), which comes from the intrinsic complexity of prediction with expert advice. The penalty term is a lower-order term. Thus LKMBooks avoids a dependence on \(O(\sqrt{K})\). However, LKMBooks depends on \((\Vert f^*_i\Vert ^2_{{\mathcal {H}}}+1)\), which is much worse than \(\Vert f^*_i\Vert _{{\mathcal {H}}}\). The reason is that LKMBooks uses online gradient descent (OGD) to update the hypothesis, and the standard regret bound of OGD depends on \((\Vert f^*_i\Vert ^2_{{\mathcal {H}}}+1)\) (Orabona 2013). Using OGD aims at sharing a single coefficient vector. Next we show an optimal algorithm for the case of \(K<d/\ln {\sqrt{T}}\).

3.3 Adapt to the norm of competitor for \(K<d/\ln {\sqrt{T}}\)

To adapt to \(\Vert f^*_i\Vert _{{\mathcal {H}}}\), we propose a hypothesis space discretization scheme. For each \(\kappa _i\), \(i=1,\ldots ,K\), we define the feasible hypothesis space \(\mathbb {H}_i=\{f\in {\mathcal {H}}_i:\Vert f\Vert _{{\mathcal {H}}_i}\le U\}\). We discretize (0, U] as follows

$$\begin{aligned} (0,U] =\left( 0,\mathrm {e}^{\lceil \ln {U_{\min }}\rceil }\right] \bigcup ^{\lceil \ln {U}\rceil -1}_{j=\lceil \ln {U_{\min }}\rceil } \left( \mathrm {e}^j,\mathrm {e}^{j+1}\right] . \end{aligned}$$
(7)

This technique is also known as the peeling technique. The key is the choice of U and \(U_{\min }\), which depend on the memory budget \({\mathcal {T}}\) and will be determined later. For any \(f\in \mathbb {H}_i\), there exists some j such that \(\Vert f\Vert _{{\mathcal {H}}_i} \in (0,\mathrm {e}^{\lceil \ln {U_{\min }}\rceil }]\) or \(\Vert f\Vert _{{\mathcal {H}}_i}\in (\mathrm {e}^j,\mathrm {e}^{j+1}]\). Let \(M=\lceil \ln {U}\rceil -\lceil \ln {U_{\min }}\rceil +1\). We construct \(K':=KM\) nested hypothesis spaces

$$\begin{aligned} \mathbb {H}_{i,j}=\{f\in {\mathcal {H}}_i:\Vert f\Vert _{{\mathcal {H}}_i}\le U_j\}, \quad i=1,\ldots ,K,\quad j=1,\ldots ,M, \end{aligned}$$

where \(U_j=\mathrm {e}^{j+\lceil \ln {U_{\min }}\rceil -1}\). Thus \(\mathbb {H}_{i,1}\subset \ldots \subset \mathbb {H}_{i,M}\subset {\mathcal {H}}_i\). For the sake of clarity, we define two index functions \(h:[K]\times [M]\rightarrow [K']\) and \(h^{*}: [K']\rightarrow [K]\times [M]\). Specifically, h(i, j) maps (i, j) to the h(i, j)-th element in \([K']\). Similarly, \(h^{*}(k)\) maps \(k\in [K']\) to \((h^{*}(k)_1,h^{*}(k)_2)\), where \(h^{*}(k)_1=\lfloor (k-1)/M\rfloor + 1\) and \(h^{*}(k)_2 = k-(h^{*}(k)_1-1)M\).
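As an illustration (our own sketch), the two index functions are simply row-major indexing of the \(K\times M\) grid of spaces, i.e. \(h(i,j)=(i-1)M+j\):

```python
import math

def h(i, j, M):
    """Map (i, j), i in [K], j in [M], to an index in [K'] (row-major)."""
    return (i - 1) * M + j

def h_star(k, M):
    """Inverse map: recover (i, j) from k in [K'] = [K * M]."""
    i = (k - 1) // M + 1
    j = k - (i - 1) * M
    return i, j

def radius(j, U_min):
    """Radius of the j-th shell: U_j = e^{j + ceil(ln U_min) - 1}."""
    return math.exp(j + math.ceil(math.log(U_min)) - 1)

# sanity check of the bijection for K = 3, M = 4
K, M = 3, 4
assert all(h_star(h(i, j, M), M) == (i, j)
           for i in range(1, K + 1) for j in range(1, M + 1))
```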

To share the support vectors, we use an oblivious example-adding strategy. The term “oblivious” means that the strategy is independent of the algorithm. At any round t, let \(\rho _{t}\in \{0,1\}\) be a Bernoulli random variable satisfying

$$\begin{aligned} \mathbb {P}[\rho _{t}=1]=\min \left\{ 1,\frac{C}{z_{t}}\right\} . \end{aligned}$$

Let \(\{f_{t,i,j}\}^T_{t=1}\) be a sequence of hypotheses in \(\mathbb {H}_{i,j}\) and \(\nabla _{t,i,j}:=\nabla _{f_{t,i,j}}\ell (f_{t,i,j}(\mathbf{x}_t),y_t)\) be the (sub-)gradient w.r.t. \(f_{t,i,j}\), \(i\in [K],j\in [M]\). At the end of round t, we sample \(\rho _{t}\sim \mathrm {Ber}(\mathbb {P}[\rho _{t}=1],1)\). If \(\rho _{t}=1\), then we update the hypotheses \(f_{t,i,j}\) and add the current example to the buffer, i.e., \(S = S\cup \{(\mathbf{x}_t,y_t)\}\). Let \({\tilde{\nabla }}_{t,i,j}\) be an estimator of \(\nabla _{t,i,j}\), defined as follows,

$$\begin{aligned} {\tilde{\nabla }}_{t,i,j}=\frac{\nabla _{t,i,j}}{\mathbb {P}[\rho _{t}=1]}\mathbb {I}_{\rho _{t}=1}. \end{aligned}$$

We update the hypothesis by online gradient descent

$$\begin{aligned} {\overline{f}}_{t+1,i,j}=f_{t,i,j}-{\lambda _{i,j}}{\tilde{\nabla }}_{t,i,j}, \quad f_{t+1,i,j} =\mathop {\arg \min }_{f\in \mathbb {H}_{i,j}}\Vert f-{\overline{f}}_{t+1,i,j}\Vert ^2_{{\mathcal {H}}_{i}}. \end{aligned}$$
(8)

The projection of any \(f\in {\mathcal {H}}_i\) onto \(\mathbb {H}_{i,j}\) is defined by \(g=\min \{1,\frac{U_j}{\Vert f\Vert _{{\mathcal {H}}_i}}\}f\).
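In coefficient form the projection is a rescaling: writing \(f=\sum _\tau a_\tau \kappa _i(\mathbf{x}_\tau ,\cdot )\), we have \(\Vert f\Vert ^2_{{\mathcal {H}}_i}=\mathbf{a}^\top G\mathbf{a}\) with the Gram matrix \(G_{\tau \sigma }=\kappa _i(\mathbf{x}_\tau ,\mathbf{x}_\sigma )\), so projecting onto \(\mathbb {H}_{i,j}\) only scales \(\mathbf{a}\). A sketch with hypothetical names (the Gram matrix is recomputed here for clarity, although the norm can be maintained incrementally, cf. Sect. 4.2):

```python
import numpy as np

def project_coefficients(a, S, kappa, U_j):
    """Project f = sum_tau a[tau] * kappa(S[tau], .) onto the ball
    {f : ||f|| <= U_j}, i.e. return min(1, U_j / ||f||) * a."""
    a = np.asarray(a, dtype=float)
    G = np.array([[kappa(u, v) for v in S] for u in S])  # Gram matrix
    norm_f = np.sqrt(max(a @ G @ a, 0.0))                # ||f||_{H_i}
    if norm_f <= U_j:
        return a
    return (U_j / norm_f) * a
```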

Next we describe the kernel selection procedure. Let \({\mathcal {E}}(K')\) be an algorithm for prediction with expert advice. We select a hypothesis space \(\mathbb {H}_{h^{*}(I_t)_1,h^{*}(I_t)_2}\), where \(I_t\sim \mathbf{p}_t\), and make the prediction \({\hat{y}}_t=f_{t,h^{*}(I_t)_1,h^{*}(I_t)_2}({\mathbf {x}}_t)\) or \(\mathrm {sign}({\hat{y}}_t)\). For each action \(h(i,j)\in [K']\), let the criterion be \(c_{t,h(i,j)}=\ell (f_{t,i,j}({\mathbf {x}}_t),y_t)\). For all \(f\in \mathbb {H}_{i,j}\), assume that there is a function \(g(U_j,D_i,Y)\) satisfying \(c_{t,h(i,j)}\le g(U_j,D_i,Y)\). At the end of round t, we send \(\mathbf{c}_t=(c_{t,1},\ldots ,c_{t,K'})\) to \({\mathcal {E}}(K')\). To adapt to the norm of the competitor, \({\mathcal {E}}(K')\) needs to achieve a multi-scale regret bound. We let \({\mathcal {E}}(K')\) be the MSMW algorithm of Bubeck et al. (2019), which is shown in Algorithm 3.

We name this algorithm PFMBooks (Parameter-Free for Memory BOunded Online Kernel Selection).

[Algorithm 2: PFMBooks (pseudocode figure)]
[Algorithm 3: \({\mathcal {E}}(K')\), the modified MSMW algorithm (pseudocode figure)]

Theorem 4

Let \(B=\alpha {\mathcal {T}}\), \(C=B\) and \(z_{t}=2(1-\upsilon )T^{1-\upsilon }t^{\upsilon }\), where \(0\le \upsilon <1\). Under the condition of Assumption 4, there exists a \(\upsilon \in [0,1)\) such that \(2(1-\upsilon )T^{1-\upsilon } > B\). For any sequence \({\mathcal {I}}_T\), with probability at least \(1-\delta\), PFMBooks guarantees that

$$\begin{aligned} \vert S\vert \le \frac{B}{2}+\frac{2}{3}\ln \frac{1}{\delta }+\sqrt{B\ln \frac{1}{\delta }}. \end{aligned}$$

The proof is the same as that of Theorem 2. PFMBooks ensures \(\vert S\vert =O(B/2)\) with high probability and maintains KM coefficient vectors. The total space complexity is \(O(\frac{dB}{2}+\frac{BKM}{2})=O(dB)=O(d\alpha {\mathcal {T}})\) in the case of \(K<d/M\). We will set \(U_{\min }=U/\sqrt{T}\) in Theorem 6, and thus \(M<1+\ln \sqrt{T}\). Hence PFMBooks does not exceed the total memory constraint with high probability. Next we state an important assumption, which is easily satisfied and forms the basis for obtaining the final regret bound.

Assumption 5

For any sequence of examples \({\mathcal {I}}_T:=\{({\mathbf {x}}_t,y_t)\}_{t\in [T]}\), let \(\vert y_t\vert \le Y\). For any hypothesis \(f\in {\mathcal {H}}_i,i=1,\ldots ,K\) and \(({\mathbf {x}},y)\in {\mathcal {I}}_T\), there always exists a function \(g(\Vert f\Vert _{{\mathcal {H}}_i},D_i,Y):\mathbb {R}^3\rightarrow \mathbb {R}\) such that \(\ell (f({\mathbf {x}}),y) \le g(\Vert f\Vert _{{\mathcal {H}}_i},D_i,Y)\) and \(g(\Vert f\Vert _{{\mathcal {H}}_i},D_i,Y)=\varTheta (1+\Vert f\Vert _{{\mathcal {H}}_i})\).

Many loss functions satisfy Assumption 5, such as the \(\varepsilon\)-insensitive hinge loss, and the \(\varepsilon\)-insensitive absolute loss. For instance, if \(\ell (f({\mathbf {x}}),y)=\vert f({\mathbf {x}})-y\vert\), then we can define \(g(\Vert f\Vert _{{\mathcal {H}}_i},D_i,Y)=\Vert f\Vert _{{\mathcal {H}}_i}\sqrt{D_i}+Y\). If \(\ell (f({\mathbf {x}}),y)=\max \{0,1-yf({\mathbf {x}})\}\), then we can define \(g(\Vert f\Vert _{{\mathcal {H}}_i},D_i,Y)=1+Y\Vert f\Vert _{{\mathcal {H}}_i}\sqrt{D_i}\). Next we show the multi-scale regret bound of \({\mathcal {E}}(K')\).

Theorem 5

Let \(\eta =\sqrt{2\ln (K'T)/T}\) and \(U=\varTheta (B)\). Under the condition of Assumption 5, \(\forall k\in [K']\), the expected regret of \({\mathcal {E}}(K')\) satisfies

$$\begin{aligned} \sum ^T_{t=1}\langle c_{t},\mathbf{p}_t\rangle -\sum ^T_{t=1}c_{t,k} ={O}\left( g(U_{h^{*}(k)_2},D_{h^{*}(k)_1},Y)\sqrt{T\ln {(K'T)}}\right) . \end{aligned}$$

Remark 2

\({\mathcal {E}}(K')\) is slightly different from the original MSMW algorithm in Bubeck et al. (2019): (i) MSMW uses “reward” as the feedback, while \({\mathcal {E}}(K')\) uses “loss”; (ii) the initial distributions of MSMW and \({\mathcal {E}}(K')\) are different. Although we can transform “loss” to “reward” by \(r_{t,k}=g(U_{h^{*}(k)_2},D_{h^{*}(k)_1},Y)-c_{t,k}\), where \(r_{t,k}\) is the reward of the k-th action, the regret bound would increase by a term \(\sum ^T_{t=1}[\sum ^{K'}_{k=1}p_{t,k}g(U_{h^*(k)_2},D_{h^*(k)_1},Y) -g(U_{h^*(k)_2},D_{h^*(k)_1},Y)]\), which cannot adapt to the scale of an individual action. Thus we need a different proof. We present a simpler proof in the Appendix. One of the keys is using a different initial distribution.

Theorem 6

Given a memory budget of \({\mathcal {T}}\) quanta, under the condition of Assumption 3, let \(B=\alpha {\mathcal {T}}\). Let \(U=\varTheta (\sqrt{B})\), \(U_{\min }=U/\sqrt{T}\) and \(\lambda _{i,j}=\frac{U_j\sqrt{(1+\upsilon )B}}{\sqrt{2(1-\upsilon )D_i}LT}\). The expected regret of PFMBooks w.r.t. any \({\mathcal {H}}_i,i=1,\ldots ,K\) satisfies

$$\begin{aligned} \mathbb {E}\left[ \mathrm {Reg}({\mathcal {H}}_i)\right] =O\left( \Vert f^*_i\Vert _{{\mathcal {H}}_i}L \max \left\{ \sqrt{T\ln (K'T)},\frac{T}{\sqrt{\alpha {\mathcal {T}}}}\right\} \sqrt{\ln (KT)} +\sqrt{T\ln (K'T)}\right) . \end{aligned}$$

Remark 3

In Theorem 1, the lower bound does not limit \(\Vert f^*_i\Vert _{{\mathcal {H}}_i}\). Our upper bound may be invalid if \(U<\Vert f^*_i\Vert _{{\mathcal {H}}_i}\). Inspecting the hard examples in the proof of Theorem 1, we find that \(\Vert f^*_i\Vert _{{\mathcal {H}}_i} =\varTheta (\sqrt{B})\). Thus our upper bound is still valid if \(U =\varTheta (\sqrt{B})\).

The expectation is w.r.t. the randomness of \({\mathcal {E}}(K')\) and the randomness of \(\{\rho _t\}^{T-1}_{t=1}\). Compared with the upper bound in Theorem 3, PFMBooks improves the dependence on \(\Vert f^*_i\Vert _{{\mathcal {H}}_i}\). Compared with the lower bound (2), PFMBooks is optimal up to a factor of order \(O(\sqrt{\ln (K'T)})\) and a small penalty term of order \(O\left( \sqrt{T\ln (K'T)}\right)\).

4 Online kernel selection with time constraints

In this section, we give both a lower bound on the regret for online kernel selection with a time budget and a simple algorithm nearly achieving the lower bound.

4.1 Lower bound

For the sake of clarity, we introduce the notion of resource allocation. Any kernel selection algorithm needs to specify a kernel selection strategy and a resource allocation strategy simultaneously. In this work, we consider static resource allocation, defined as follows.

Definition 3

(Static Resource Allocation) Define a static resource allocation \(R({\mathcal {T}}_1,\ldots ,{\mathcal {T}}_K)\) as a strategy that allocates a time budget of \(0<{\mathcal {T}}_i\le {\mathcal {T}}\) quanta to kernel function \(\kappa _i\) before the game starts, and does not change the allocation later.

For any budgeted kernel selection algorithm with static resource allocation \(R({\mathcal {T}}_1,\ldots ,{\mathcal {T}}_K)\), the following theorem gives a lower bound on the regret.

Theorem 7

Let \(\ell (\cdot ,\cdot )\) be the hinge loss or the absolute loss. There exist K kernel functions \(\{\kappa _{i}\}^K_{i=1}\) chosen by the learner, and a sequence of examples \(\{({\mathbf {x}}_t,y_t)\}^T_{t=1}\) chosen by an oblivious adversary, where \(y_t\in \{-1,1\}\), such that for a time budget of \({\mathcal {T}}\) quanta, under the condition of Assumption 3, for all \(\kappa _i\), the expected regret of any budgeted online kernel selection algorithm with static resource allocation \(R({\mathcal {T}}_1,\ldots ,{\mathcal {T}}_K)\) satisfies

$$\begin{aligned} \mathbb {E}[L_{T}(f_{t})] - L_{T}(f^*_i) =\left\{ \begin{array}{llll} \varOmega \left( \left\| f^*_i\right\| _{{\mathcal {H}}_{i}}L\sqrt{T}\right) &{}\mathrm {if}\quad T = O(\beta \mathop {\max }_{j\in [K]}{\mathcal {T}}_j),\\ \varOmega \left( \left\| f^*_i\right\| _{{\mathcal {H}}_{i}}L \frac{T}{\sqrt{\beta \mathop {\max }_{j\in [K]}{\mathcal {T}}_j}}\right) &{}\mathrm {otherwise}, \end{array} \right. \end{aligned}$$
(9)

where L is the Lipschitz constant of \(\ell\), \(L_{T}(f):=\sum ^T_{t=1}\ell (f({\mathbf {x}}_t),y_t)\), and \(f^*_i\in {\mathcal {H}}_i = \overline{\mathrm {span}(\kappa _i({\mathbf {x}}_1,\cdot ),\ldots , \kappa _i({\mathbf {x}}_T,\cdot ))}\).

The lower bound also reveals that, in the worst case, achieving a \(O(T^\upsilon ),\frac{1}{2}\le \upsilon <1\) regret bound requires a time budget of order \(\varOmega (T^{2-2\upsilon })\). To design algorithms achieving the lower bound (9), it is necessary to adopt the \(R({\mathcal {T}},\ldots ,{\mathcal {T}})\) resource allocation.

We first highlight the difference between memory constraints and time constraints. Recall that the space complexity of LKMBooks is \(O(dB+K)\). The time complexity of LKMBooks is \(O(dB+KB+K)\), not \(O(KdB+K)\). The reason is that, under Assumption 1, the main time cost of computing \(\kappa _i(\mathbf{x}_t,\mathbf{x}_{\tau })\) for all \(\mathbf{x}_{\tau }\in S\) is computing the norm \({\Vert \mathbf{x}_t-\mathbf{x}_\tau \Vert _2}\) or the inner product \(\langle \mathbf{x}_t,\mathbf{x}_{\tau }\rangle\). Since LKMBooks only maintains a single S, we can first compute the norms or inner products between \(\mathbf{x}_t\) and the support vectors in S, and then reuse them for all K kernels. Thus the time complexity of computing \(f_{t,i}(\mathbf{x}_t)\) for all \(i=1,\ldots ,K\) is of order \(O(dB+KB)\). If \(K\le d\), the two constraints are equivalent and LKMBooks is also a nearly optimal algorithm under time constraints. Things are different in the case of \(K>d\). Assume that \(K=d^{\nu },\nu >1\). If an algorithm achieves the lower bound (9), then it must adopt the \(R({\mathcal {T}},\ldots ,{\mathcal {T}})\) resource allocation. Let the available budget of such an algorithm be \(B_1\), and let \(B_2\) be the available budget of LKMBooks. According to Assumption 3, we have the two identities \(dB_1={\mathcal {T}}\) and \((d+K)B_2={\mathcal {T}}\), which imply \(B_2=O(K^{\frac{1-\nu }{\nu }}B_1)\). Substituting into Theorem 3, LKMBooks increases the regret by a factor of order \(O(K^{\frac{\nu -1}{2\nu }})\).

Thus in the case of \(K<d\), we can directly use LKMBooks or PFMBooks. Next we propose a nearly optimal algorithm for the case of \(K>d\). The algorithm adopts the \(R({\mathcal {T}}/2,\ldots ,{\mathcal {T}}/2)\) resource allocation.

4.2 A nearly optimal algorithm for \(K>d\)

A simple observation is that we need not evaluate all of the base kernels at each round. An intuitive approach is to select a single kernel function \(\kappa _{I_t}\) and use the hypothesis \(f_{t,I_t}\) to make the prediction. Such an approach was adopted in (Yang et al. 2012), where the kernel selection problem is reduced to a K-armed bandit problem. However, the resulting regret bound is far from optimal for online kernel selection. At each round, the approach constructs the estimated gradient \({\tilde{\nabla }}_{t,i}=\nabla _{t,i}/p_{t,i}\). The second moment is of order \(\max _{t}\nabla _{t,i}/p_{t,i}\), which may be large. To address this issue, we propose a simple exploration-exploitation scheme.

For each \(\kappa _i\), we define the feasible hypothesis space \(\mathbb {H}_i=\{f\in {\mathcal {H}}_i:\Vert f\Vert _{{\mathcal {H}}_i}\le U\}\). We slightly modify Algorithm 1. The key difference is that we randomly evaluate two kernel functions at each round. The two kernel functions are selected by a decoupled exploration-exploitation scheme, defined as follows

  • Exploitation: select a kernel function \(\kappa _{I_t}\sim \mathbf{p}_t\),

  • Exploration: select another kernel function \(\kappa _{J_t}\sim {\mathcal {K}}\) uniformly.

Note that it is possible that \(\kappa _{I_t}=\kappa _{J_t}\). The exploration procedure ensures that each kernel is selected with probability at least 1/K.
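A minimal sketch of the decoupled sampling (our own illustration; indices are 0-based here):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_kernels(p_t):
    """Decoupled exploration-exploitation: I_t drives the prediction,
    J_t (uniform) drives the update; the two may coincide."""
    K = len(p_t)
    I_t = rng.choice(K, p=p_t)   # exploitation: I_t ~ p_t
    J_t = rng.integers(K)        # exploration:  J_t ~ Uniform([K])
    return I_t, J_t

# Each kernel i is touched with probability
# P[i in {I_t, J_t}] = (K - 1) / K * p_{t,i} + 1 / K >= 1 / K,
# which matches the importance weight used later in (10).
```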

Let \(S_i,i=1,\ldots ,K\), be K buffers storing the support vectors. At each round t, we output the prediction \({\hat{y}}_t=f_{t,I_t}({\mathbf {x}}_t)\) or \(\mathrm {sign}({\hat{y}}_t)\). However, we do not update \(f_{t,I_t}\) unless \(I_t=J_t\). The goal is to add \(({\mathbf {x}}_t,y_t)\) to each \(S_{i}\) with equal probability. After receiving \(y_t\), we compute the gradient \(\nabla _{f_{t,J_t}}\ell (f_{t,J_t}({\mathbf {x}}_t),y_t)\). If it is nonzero, we decide whether to update \(f_{t,J_t}\). Let \(\rho _{t,i}\in \{0,1\}\) be a Bernoulli random variable satisfying

$$\begin{aligned} \mathbb {P}[\rho _{t,i}=1]=\min \left\{ 1,\frac{C}{z_{t,i}}\right\} \cdot \mathbb {I}_{\nabla _{t,i}\ne 0}, \quad i=1,\ldots ,K, \end{aligned}$$

If \(\rho _{t,J_t}=1\), then we update \(f_{t,J_t}\) and add the current example to the buffer, i.e., \(S_{J_t} = S_{J_t}\cup \{(\mathbf{x}_t,y_t)\}\). Let \({\tilde{\nabla }}_{t,i}\) be an estimator of \(\nabla _{t,i}\), defined as follows,

$$\begin{aligned} {\tilde{\nabla }}_{t,i}=\frac{\nabla _{t,i}}{\mathbb {P}[i=J_t]\cdot \mathbb {P}[\rho _{t,i}=1]} \mathbb {I}_{i=J_t}\mathbb {I}_{\rho _{t,i}=1}. \end{aligned}$$

We update the hypothesis \(f_{t,i}\) following (8), where the projection can be computed incrementally in O(1) time.

To update \(\mathbf{p}_t\), we define a K-armed adversarial bandit problem with an additional observation, in which the algorithm may observe two losses per round. \(\forall i\in [K]\), let \(c_{t,i}={\ell (f_{t,i}(\mathbf{x}_t),y_t)}/{\ell _{m}}\), where \({\ell _{m}=\max _{t,i}\{\ell (f_{t,i}(\mathbf{x}_t),y_t)\}}\) is a normalizing constant and can be tuned by the doubling trick. The key is the estimated loss \({\tilde{c}}_{t,i}\), defined as follows,

$$\begin{aligned} {\tilde{c}}_{t,i}=\frac{c_{t,i}}{\mathbb {P}[i\in \{I_t,J_t\}]} \mathbb {I}_{i\in \{I_t,J_t\}}, \quad \mathbb {P}[i\in \{I_t,J_t\}]=\frac{K-1}{K}p_{t,i}+\frac{1}{K}. \end{aligned}$$
(10)

We update \(\mathbf{p}_t\) by online stochastic mirror descent (OSMD) with the negative entropy regularizer (Bubeck and Cesa-Bianchi 2012),

$$\begin{aligned} p_{t+1}=\mathop {\arg \min }_{\mathbf{p}\in \varDelta _{K-1}} \left\{ \langle \mathbf{p},{\tilde{c}}_t\rangle +{\mathcal {D}}_{\psi _t}(\mathbf{p},\mathbf{p}_{t})\right\} , \end{aligned}$$
(11)

where \(\psi _t(\mathbf{p})=\frac{1}{\eta _t}\sum ^K_{i=1}p_i\ln {p_i}\) is the negative entropy regularizer with learning rate \(\eta _t\), and \({\mathcal {D}}_{\psi _t}\) is the induced Bregman divergence.
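With the negative entropy regularizer, the OSMD update (11) over the simplex has the familiar multiplicative-weights closed form. A sketch (our own illustration, with a fixed learning rate `eta` and hypothetical names):

```python
import numpy as np

def estimated_loss(c_t, i, p_t, I_t, J_t):
    """Importance-weighted loss estimate (10) for arm i."""
    K = len(p_t)
    prob = (K - 1) / K * p_t[i] + 1.0 / K   # P[i in {I_t, J_t}]
    return c_t[i] / prob if i in (I_t, J_t) else 0.0

def osmd_update(p_t, c_tilde, eta):
    """OSMD step (11) with negative entropy: the minimizer of
    <p, c_tilde> + D_psi(p, p_t) over the simplex is
    p_{t+1,i} proportional to p_{t,i} * exp(-eta * c_tilde_i)."""
    w = p_t * np.exp(-eta * np.asarray(c_tilde))
    return w / w.sum()
```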

We name the algorithm BATBooks (Bandit with Additional observation for Time BOunded Online Kernel Selection). The algorithm description is shown in Algorithm 4.

[Algorithm 4: BATBooks (pseudocode figure)]

Theorem 8

Let \(B=\beta {\mathcal {T}}\), \(C=KB\) and \(z_{t,i}=2(1-\upsilon )T^{1-\upsilon }t^{\upsilon }\), where \(0\le \upsilon <1\). For any sequence \({\mathcal {I}}_T\), with probability at least \(1-\delta\), BATBooks guarantees that

$$\begin{aligned} \vert S_{i}\vert \le \frac{B}{2}+\frac{2}{3}\ln \frac{K}{\delta }+\sqrt{B\ln \frac{K}{\delta }}. \end{aligned}$$

For all \(i=1,\ldots ,K\), we have \(\vert S_{i}\vert =O(B/2)\). BATBooks evaluates two hypotheses at each round. The total time complexity is \(O(dB)=O(d\beta {\mathcal {T}})\). Thus BATBooks does not exceed the total time budget with high probability.

Theorem 9

Let \(c_{t}\in [0,1]^{K}\) be any loss vector, and \({\tilde{C}}_{T,*}= \min _{i\in [K]}\sum ^T_{t=1}{\tilde{c}}_{t,i}\), where \({\tilde{c}}_{t,i}\) is the estimator of \(c_{t,i}\) defined in (10). Let \(\eta =\min \{\sqrt{2\ln {K}/(K{\tilde{C}}_{T,*})},\frac{1}{K}\}\). BATBooks guarantees

$$\begin{aligned} \mathbb {E}\left[ \sum ^T_{t=1}[\langle \mathbf{p}_t,c_t\rangle -c_{t,i}]\right] \le 2\sqrt{2\mathbb {E}\left[ \sum ^T_{t=1}c_{t,i}\right] K\ln {K}}. \end{aligned}$$

We obtain an expected small-loss regret bound for the bandit problem with an additional observation, which may be of independent interest. Seldin et al. (2014) proved a worst-case expected regret bound for this problem; thus we improve the previous result. Note that if \(\{c_t\}^T_{t=1}\) are fixed loss vectors, then we can remove the expectation operator.

Theorem 10

Given a time budget of \({\mathcal {T}}\) quanta, under the condition of Assumption 3, let \(B:=\beta {\mathcal {T}}\). Let \(U=\varTheta (\sqrt{B})\) and let \(\ell\) satisfy \(\vert \ell '(f({\mathbf {x}}),y)\vert \le L\). If there exists a \(\upsilon \in [0,1)\) satisfying

$$\begin{aligned} 2(1-\upsilon )T^{1-\upsilon } > KB, \end{aligned}$$
(12)

then, for any \(\mathbb {H}_{i}, i\in [K]\), with \(\lambda _{i}=\frac{\sqrt{(1+\upsilon )B}}{\sqrt{2(1-\upsilon )D_i}LT}\), the expected regret of BATBooks satisfies,

$$\begin{aligned} \mathbb {E}\left[ \mathrm {Reg}(\mathbb {H}_{i})\right] ={O}\left( \sqrt{(U+1)L_T(f^*_i)K\ln {K}} +(\Vert f^*_i\Vert ^2_{{\mathcal {H}}_i}+1)L\sqrt{D_i}\frac{T}{\sqrt{\beta {\mathcal {T}}}}\right) . \end{aligned}$$

If condition (12) cannot be satisfied, then let \(\lambda _{i}=\frac{1}{\sqrt{KD_iT}L}\). The expected regret satisfies,

$$\begin{aligned} \mathbb {E}\left[ \mathrm {Reg}(\mathbb {H}_{i})\right] ={O}\left( \sqrt{(U+1)L_T(f^*_i)K\ln {K}} +(\Vert f^*_i\Vert ^2_{{\mathcal {H}}_i}+1)L\sqrt{D_iTK}\right) . \end{aligned}$$

Remark 4

We show for the first time that online kernel selection with time constraints differs from that with memory constraints only in the case of \(K>d\), which answers our second question, Q 2. Thus in the case of \(K\le d\), we can just use Algorithm 1 or Algorithm 2. None of the previous work identified such a condition. The online multi-kernel learning algorithms in (Hoi et al. 2013; Sahoo et al. 2014) and the online kernel selection algorithm in (Yang et al. 2012) randomly update a hypothesis to reduce the time complexity. We prove that such an approach is unnecessary unless \(K>d\).

We analyze the optimality w.r.t. \({\mathcal {T}}\), T and K. First we consider a small time budget, i.e., \(B< 2T/K\) (condition (12) is satisfied). Compared with the lower bound (9), BATBooks has an additional cost of order \(O(\sqrt{UL_T(f^*_i)K\ln {K}})\). Then we consider a large time budget, i.e., \(2T/K\le B \le T\) (condition (12) is not satisfied). In this case BATBooks is sub-optimal by a multiplicative factor of order \(O(\sqrt{K})\) and suffers the same additional cost. Although \(U=\varTheta (\sqrt{B})\), we have \(L_T(f^*_i)=0\) for the hard examples in the proof of Theorem 7. In this case, our upper bounds are nearly optimal w.r.t. T, K and \({\mathcal {T}}\).

Next we consider the dependence on \(\Vert f^*_i\Vert _{{\mathcal {H}}_i}\). Note that \(L_T(f^*_i)\) and U cannot both be large. If \(L_T(f^*_i)\) is very large, then \(\Vert f^*_i\Vert _{{\mathcal {H}}_i}\) would be small, and we can ensure that U is small. Using Assumption 5, we have \(L_T(f^*_i)=O(\Vert f^*_i\Vert _{{\mathcal {H}}_i}T)\). Thus the additional cost would be \(O(\sqrt{U\Vert f^*_i\Vert _{{\mathcal {H}}_i}TK\ln {K}})\). Our bounds depend on \(O(\sqrt{U\Vert f^*_i\Vert _{{\mathcal {H}}_i}})\) and \(O(\Vert f^*_i\Vert ^2_{{\mathcal {H}}_i})\), which are worse than the lower bound in Theorem 7. Improving the dependence on \(\Vert f^*_i\Vert _{{\mathcal {H}}_i}\) is left to future work.

5 Experiments

In this section, we conduct numerical experiments to verify our theoretical results. Overall, our goal is to verify the following results:

(G 1):

Online kernel selection improves the learning performance relative to online single-kernel learning with an empirically preset kernel.

(G 2):

The superiority of the memory sharing scheme: within the same memory constraint, our algorithms outperform algorithms that do not share the memory.

(G 3):

In the worst case, the time constraints are the same as the memory constraints in the case of \(K < d\). Thus Algorithm 1 is also nearly optimal for online kernel selection with time constraints.

(G 4):

In the worst case, the time constraints are different from the memory constraints in the case of \(K\ge d\); that is, Algorithm 4 is better than Algorithm 1 in the case of \(K > d\).

We first state the experimental setting, and then show the experimental results for online kernel selection with memory constraints and time constraints, respectively.

5.1 Experimental setting

We compare our algorithms with the following baseline algorithms,

  • NORMA (Budgeted online kernel learning algorithm) (Kivinen et al. 2004)

  • BOGD (Budgeted online kernel learning algorithm) (Zhao et al. 2012)

  • OKS (Online Kernel Selection) (Yang et al. 2012)

  • OMKC (Online multi-kernel classification) (Hoi et al. 2013)

  • ISKA (Incremental sketched kernel alignment) (Zhang and Liao 2018)

  • BOMKR (Budgeted online multi-kernel regression) (Sahoo et al. 2014)

  • BOMKR-V (Variant of BOMKR).

The baseline algorithms for online classification are BOGD, OKS, OMKC and ISKA. The other algorithms, together with OKS, are used for online regression.

We use 9 Gaussian kernels, \(\kappa (\mathbf{u},\mathbf{v})=\exp (-{\Vert \mathbf{u}-\mathbf{v}\Vert ^2}/{(2\sigma ^2)})\), with the kernel width \(\sigma\) chosen from \(2^{-4:1:4}\). We adopt the best kernel function in hindsight for NORMA and BOGD. BOMKR-V is a variant of BOMKR obtained by changing the loss function. We test the algorithms on online regression and online classification tasks. The datasets are shown in Table 2 and were downloaded from the WEKA and UCI machine learning repositories.Footnote 2 The datasets ailerons-v, Hardware-v, Twitter-v and Adv-SUSY-v are constructed from ailerons, Hardware, Twitter and Adv-SUSY, respectively. For instance, we extract the first 6 features of ailerons to form ailerons-v. The goal is to make \(d < K\) (\(K=9\)). We preprocess Hardware and Twitter by dividing by the standard deviation. Note that we convert magic04, a9a and SUSY into adversarial datasets, denoted by Adv-magic04, Adv-a9a and Adv-SUSY. Our approach to constructing the adversarial datasets is as follows (a code sketch is given after the list): at each round \(t = 1,\ldots ,T\),

  • If \(t\le \lceil T/20\rceil\), let Adv-magic04 equal magic04.

  • If \(t\ge \lceil T/20\rceil +1\), we multiply the features of magic04 by \(2^{-3}\).

The same operation is applied to Adv-a9a and Adv-SUSY. There are two reasons for constructing adversarial datasets: (i) in online learning, the data may not be i.i.d. and may be provided by a malicious adversary; (ii) our theoretical results hold in the worst case. The three adversarial datasets essentially yield hard learning tasks.
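A sketch of this construction (our own illustration; `X` is the feature matrix in arrival order):

```python
import numpy as np

def make_adversarial(X, scale=2.0 ** -3, frac=1 / 20):
    """Keep the first ceil(T * frac) rounds unchanged and multiply the
    features of all later rounds by `scale` (2^{-3} in our experiments)."""
    T = X.shape[0]
    cut = int(np.ceil(T * frac))
    X_adv = X.astype(float, copy=True)
    X_adv[cut:] *= scale    # rounds t >= ceil(T/20) + 1
    return X_adv
```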

Table 2 Basic information of datasets

For online regression, we adopt the absolute loss \(\ell ({\hat{y}}_t,y)=\vert {\hat{y}}_t-y\vert\) except for NORMA and BOMKR. NORMA adopts the \(\varepsilon\)-insensitive absolute loss \(\ell ({\hat{y}}_t,y)=\max (0,\vert {\hat{y}}_t-y\vert -\varepsilon _t)+\nu \varepsilon _t\), and updates \(\varepsilon _t\) on the fly. For BOMKR, we adopt the version that uses NORMA as a sub-algorithm (Sahoo et al. 2014). We set \(\nu =0.5\) and \(\varepsilon _1=0.001\). For online classification, we adopt the hinge loss \(\ell ({\hat{y}}_t,y)=\max \{0,1-{\hat{y}}_ty\}\). We measure the Average Absolute Loss (AAL), defined by \(\mathrm {AAL}=\frac{1}{T}\sum ^T_{t=1}\vert {\hat{y}}_t-y_t\vert\), for online regression, and the Average Mistake Rate (AMR), defined by \(\mathrm {AMR}=\frac{1}{T}\sum ^T_{t=1}\mathbb {I}_{{\hat{y}}_t\ne y_t}\), for online classification. For OKS, we choose the smoothing parameter \(\delta \in \{0.2,0.02,0.002\}\). For all of the baseline algorithms, we set the stepsize of gradient descent to \({5}/{\sqrt{T}}\). The other hyper-parameters are set to the values recommended in the original papers. For PFMBooks, we set \(g(U_j,D_i)= U_j+0.1\), where \(D_i=1\) for Gaussian kernels, and set \(\eta =\sqrt{8\ln (KMT)/T}\). For LKMBooks, we set \(\eta =\sqrt{8\ln (K)/T}\). All algorithms are implemented in R on a Windows machine with a 2.5 GHz Core(TM) i5-7200U CPU. To weaken the effect of randomization, we execute each experiment 20 times with random permutations of all datasets and average the results.
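For reference, the two measures correspond to the following trivial computations (our own sketch):

```python
import numpy as np

def aal(y_hat, y):
    """Average Absolute Loss (online regression)."""
    return float(np.mean(np.abs(np.asarray(y_hat) - np.asarray(y))))

def amr(y_hat, y):
    """Average Mistake Rate (online classification)."""
    return float(np.mean(np.asarray(y_hat) != np.asarray(y)))
```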

5.2 Memory constraints

5.2.1 Online regression

Let \({\mathcal {T}}\) be a given memory budget. By Assumptions 2 and 3, we can reduce \({\mathcal {T}}\) to an example budget of size B. We must ensure that all algorithms have the same space complexity. Table 3 shows the results. Since OKS does not control the number of support vectors, we use a heuristic variant, called BOKS, which stops updating the hypothesis once the number of support vectors reaches B. We use NORMA as the baseline; that is, for a memory budget \({\mathcal {T}}\), NORMA can use an example budget of size \(B_0\). The third row of Table 3 lists the available budget of each algorithm, which depends on the relation between d and K. BOKS and BOMKR do not share the memory and maintain K different sets of support vectors. For LKMBooks and PFMBooks, we set \(\upsilon =\frac{1}{3}\) to satisfy \(2(1-\upsilon )T^{1-\upsilon } > B\) (see Theorems 2 and 4), and set the stepsize to the values in Theorems 3 and 6. For PFMBooks, we set \(U=\sqrt{B}\) and \(U_{\min }=U/\sqrt{T}\), as stated in Theorem 6. Since LKMBooks and PFMBooks satisfy the memory constraint only with high probability, we stop updating the hypotheses once the actual budget exceeds the available budget in Table 3 (see the sketch after this paragraph).
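The stopping rule amounts to the following R sketch; the names are hypothetical, and the add step stands in for the algorithm-specific randomized adding strategy.

# A hedged sketch of the budget cap: a new support vector is added only
# while the support set S is below the available budget B.
capped_add <- function(S, new_sv, B) {
  if (length(S) < B) S[[length(S) + 1]] <- new_sv  # otherwise freeze the hypothesis
  S
}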

Table 3 Space complexity and the available budget of individual algorithm
Table 4 AAL (Average Absolute Loss) comparison within memory constraints

Table 4 shows the empirical results. Boldface in each column marks the best-performing algorithm. NORMA performs well on some datasets, for two reasons: (i) we select the best kernel width in hindsight for NORMA, that is, we test all of the candidate kernel widths and select the one with the minimal AAL; (ii) NORMA uses a good learning rate on those datasets. Tuning the learning rate is a separate problem for online learning algorithms; to avoid this issue, we set a fixed learning rate for the baseline algorithms and use the theoretical values for our algorithms. The first column of Table 4 gives the optimal kernel width of NORMA on each dataset. For instance, NORMA-2 means that the optimal kernel width on the housing dataset is \(\sigma =2\). The optimal kernel width differs across datasets, so if we fixed a single kernel for all datasets, NORMA would perform badly on some of them. By contrast, the online kernel selection algorithms and online multiple kernel learning algorithms perform well on all datasets (except for BOKS). The results verify the first goal, G 1.

Next we analyze BOMKR. Since BOMKR does not share the support vectors, for every \(i\in [K]\) the available budget for constructing \(\{f_{t,i}\}^T_{t=1}\) is \(\frac{B_0}{K}\ll B_0\); thus BOMKR performs badly. LKMBooks, PFMBooks and BOMKR-V share the support vectors, and their available budgets are \(B_0\), \(\frac{dB_0}{d+K'}\) and \(\frac{dB_0}{d+K}\), respectively; thus they perform well on all of the datasets. We also find that BOMKR-V performs worse than NORMA on some datasets, mainly because the learning rate of BOMKR-V is not well tuned. Since PFMBooks is applicable only in the case \(K<d/\lceil \ln {T}\rceil\), we do not run it on the two low dimensional datasets, housing and elevators. PFMBooks performs much better than all of the other algorithms on the Slice dataset, because PFMBooks is parameter-free and uses a suitable learning rate; for all of the other algorithms, including LKMBooks, we do not tune a suitable learning rate for each individual dataset. The results verify the second goal, G 2.

5.2.2 Online classification

The overall parameter setting is the same as for online regression, except that LKMBooks uses the same learning rate as the baseline algorithms, i.e., \(\lambda ={5}/{\sqrt{T}}\). We set \(U_{\min }=5\) for PFMBooks. For the hinge loss, if f satisfies \(\Vert f\Vert _{{\mathcal {H}}}<1\), then \(L_T(f)=\sum ^T_{t=1}(1-y_tf({\mathbf {x}}_t))=\varTheta (T)\); thus we set \(U_{\min }>1\) (a one-line derivation is given below). OMKC is an algorithmic framework from which four algorithms are derived (Hoi et al. 2013). Under memory constraints, algorithms are allowed to incur a higher time cost. Thus we adopt \(\mathrm {OMKC}_{D,D}\), which has the best prediction performance but also the highest time cost among the four algorithms. We set the hyper-parameters of \(\mathrm {OMKC}_{D,D}\) to the values recommended in the original paper.
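For completeness, the \(\varTheta (T)\) claim follows from the reproducing property, assuming \(\kappa ({\mathbf {x}},{\mathbf {x}})\le 1\) (which holds for the Gaussian kernels used here):

$$\begin{aligned} \vert f({\mathbf {x}}_t)\vert =\vert \langle f,\kappa ({\mathbf {x}}_t,\cdot )\rangle _{{\mathcal {H}}}\vert \le \Vert f\Vert _{{\mathcal {H}}}\sqrt{\kappa ({\mathbf {x}}_t,{\mathbf {x}}_t)}<1, \end{aligned}$$

so the hinge loss never vanishes and \(L_T(f)\ge (1-\Vert f\Vert _{{\mathcal {H}}})T\).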

We again reduce \({\mathcal {T}}\) to an example budget of size B and ensure that all algorithms have the same space complexity. If the number of support vectors of \(\mathrm {OMKC}_{D,D}\) reaches B, then we stop updating the hypotheses. We use BOGD as the baseline, whose space complexity is O(Bd); given a memory budget \({\mathcal {T}}\), BOGD can use an example budget of size \(B_0\). The space complexity of \(\mathrm {OMKC}_{D,D}\) is \(O(B(d+K))\), thus \(B=\frac{dB_0}{d+K}\). The space complexity of ISKA is \(O(Bd+K)\), thus \(B=B_0\). Table 3 gives the example budget sizes of the other algorithms.
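As a worked instance of this reduction, take the dimensions of Adv-SUSY (\(d=18\), \(K=9\)) and a hypothetical baseline budget \(B_0=300\); equating the space budgets gives

$$\begin{aligned} B(d+K)=B_0d \quad \Rightarrow \quad B=\frac{dB_0}{d+K}=\frac{18\times 300}{18+9}=200. \end{aligned}$$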

Table 5 AMR (Average Mistake Rate) comparison within memory constraints

Table 5 shows the empirical results. BOGD performs well on all datasets, since we select the optimal kernel width in hindsight. The first column shows that the optimal kernel width can differ across datasets, consistent with the results in Table 4. We thus conclude that, if BOGD were equipped with a single fixed kernel function for all datasets, it would perform worse than the other algorithms. The results verify G 1.

Next we analyze \(\mathrm {OMKC}_{D,D}\), which performs badly on the last three datasets. We call the last three datasets hard datasets and call mushrooms an easy dataset, since the mistake rates on mushrooms are very small. Recall that \(\mathrm {OMKC}_{D,D}\) can use a budget of size \(\frac{dB_0}{d+K}\). \(\mathrm {OMKC}_{D,D}\) does not share the memory, and thus allocates the budget over K hypothesis sequences, i.e., \(\{f_{t,i}\}^T_{t=1},i\in [K]\); each hypothesis sequence obtains a budget of size roughly \(\frac{1}{K}\cdot \frac{dB_0}{d+K}\). It therefore performs badly on the hard datasets. For mushrooms, the number of mistakes is very small, so a small budget suffices. For instance, in the case \(B_0=200\), the mistake rate of \(\mathrm {OMKC}_{D,D}\) is roughly \(0.62\%\), i.e., about \(0.0062\,T\approx 50\) mistakes, where \(T=8124\). Thus the optimal hypothesis sequence \(\{f_{t,i^*}\}^T_{t=1}\) only needs a budget of about 50. LKMBooks shares the memory and performs well on the hard datasets. The experimental results do not match our theoretical results exactly, since we report mistake rates rather than average cumulative losses; our theoretical results are regret bounds, not mistake bounds. Even so, the experimental results on the hard datasets still verify G 2.

ISKA also shares the memory and performs better than our algorithms on mushrooms and magic04, since it employs an elaborate removal strategy, while our algorithms use simple randomized adding strategies. However, the regret bounds of ISKA do not reveal this superiority; we conjecture that data-dependent regret bounds could explain it. Besides, ISKA performs worse than our algorithms on the two adversarial datasets. The kernel selection procedure of ISKA consists of two phases: during the first phase, ISKA converges to an empirically optimal kernel; during the second phase, it always chooses that kernel. The adversary can easily change the optimal kernel by scaling the features of the instances, making ISKA converge to a bad kernel. Our algorithms choose kernels randomly and can converge to the optimal kernel defined on the whole dataset. Thus our algorithms are more robust than ISKA in adversarial environments.

5.3 Time constraints

5.3.1 Online regression

Let \({\mathcal {T}}\) be a given time budget. We also enforce the time constraint by fixing the budget size. Specifically, we choose BOMKR as the baseline, whose budget is set to \(B_0\). Denote by \(t_{\mathrm {p}}\) the average per-round running time of BOMKR. We tune the budgets of the other algorithms to ensure the same running time as \(t_{\mathrm {p}}\) (a tuning sketch is given below). For BATBooks, we set the learning rate \(\eta =4\sqrt{\ln {K}/(K{\tilde{C}}_{T,*})}\), where \({\tilde{C}}_{T,*}\) is tuned by the doubling trick, \(U=B^{\frac{1}{3}}_0\) and \(\ell _{\max }=1\). For the parameter \(\upsilon\), we choose the maximal value in \(\{1/i\}_{i=3,4,\ldots ,12}\) satisfying condition (12). For the other algorithms, the parameter settings remain unchanged.
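A hedged R sketch of this tuning, assuming a hypothetical run_round(B) that executes one round of the algorithm with budget B; in practice the per-round time would be averaged over many rounds.

# Double the budget until the measured per-round time reaches the
# baseline's per-round time t_p (all names hypothetical).
tune_budget <- function(run_round, t_p, B = 10, B_max = 1e5) {
  while (B < B_max) {
    t_B <- unname(system.time(run_round(B))["elapsed"])
    if (t_B >= t_p) break
    B <- 2 * B
  }
  B
}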

Table 6 AAL (Average Absolute Loss) comparison within time constraints

Table 6 shows the empirical results. First, we consider the results on the four high dimensional datasets, elevators, ailerons, Hardware and Twitter, where \(K<d\). Within the same time budget, LKMBooks shows the best performance except for NORMA. Although LKMBooks is designed for memory constraints, it is still nearly optimal under time constraints. In the second and fifth columns, the available budgets of the algorithms differ, since their per-round time complexities differ. It may seem strange that BOKS has the maximal available budget; the reason is that BOKS allocates its available budget \(B_0\) over the K hypotheses \(\{f_{t,i}\}^K_{i=1}\), so the available budget of each \(f_{t,i}\) is less than \(B_0\). The results verify the third goal, G 3.

Next we consider the four low dimensional datasets, housing, ailerons-v, Hardware-v and Twitter-v, where \(K>d\). Within the same time budget, BATBooks shows the best performance on all datasets except for NORMA. NORMA performs well because it has the lowest time complexity and we set the optimal kernel width in hindsight. Interestingly, the available budget of BATBooks is similar to that of NORMA, because the two algorithms have nearly the same per-round time complexity, \(O(dB+K)\) and \(O(dB)\), respectively. BATBooks performs better than LKMBooks in the case \(d<K\), which verifies the fourth goal, G 4.

5.3.2 Online classification

For LKMBooks, the parameters follow the setting in Sect. 5.2.2. For BATBooks, the parameters follow the setting in Sect. 5.3.1, except that the stepsize is set to \(\lambda =\frac{U\sqrt{(1+\upsilon )B}}{\sqrt{2(1-\upsilon )}LT}\), which is slightly different from that of Theorem 10. We choose \(\mathrm {OMKC}_{D,D}\) as the baseline, whose budget is set to \(B_0\). Let \(t_{\mathrm {p}}\) be the average per-round running time of \(\mathrm {OMKC}_{D,D}\). We tune the budgets of the other algorithms to ensure the same running time as \(t_{\mathrm {p}}\).

Table 7 AMR (Average Mistake Rate) comparison within time constraints

Table 7 shows the empirical results. We first consider the results on the two high-dimensional datasets, mushrooms and Adv-a9a, where \(K\ll d\). Within the same time budget, LKMBooks performs better than BATBooks. For Adv-SUSY, we have \(K\approx d\) (\(K=9\), \(d=18\)), and LKMBooks shows performance similar to BATBooks; the same holds for Adv-magic04, where \(K=9\) and \(d=10\). Besides, \(\mathrm {OMKC}_{D,D}\) performs much better than the other algorithms on mushrooms, for the same reason as in the analysis of mushrooms in Sect. 5.2.2. As a whole, in the case \(K\le d\), LKMBooks performs well on most datasets. The results verify G 3.

Next we consider the two low-dimensional datasets, cod-rna and Adv-SUSY-v, where \(d<K\). We find that LKMBooks performs slightly better than BATBooks on cod-rna, and worse than BATBooks on Adv-SUSY-v. The results do not fully verify G 4, for possibly two reasons: (i) for cod-rna, we have \(d\approx K\) (\(d=8\), \(K=9\)); (ii) the performance measure is the mistake rate, not the average cumulative loss. Even so, our algorithms still perform better than \(\mathrm {OMKC}_{D,D}\) and ISKA.

6 Conclusion and discussion

In this paper, we studied computationally budgeted online kernel selection, where the kernel selection and online prediction procedures face memory constraints or time constraints. We proved lower bounds on the regret under each of the two kinds of computational constraints, and developed several simple algorithms that nearly achieve the lower bounds. We also identified the condition under which online kernel selection with a time constraint differs from that with a memory constraint.

This work opens up many directions for future research. One of the most important is to identify sufficient conditions under which a constant computational budget admits a sub-linear regret bound. Model selection aims at choosing the inductive bias that matches the data, thereby improving the learning performance of algorithms; worst-case regret guarantees therefore do not reveal the essence of model selection. The sufficient conditions would play the role of inductive bias. To this end, it is necessary to establish some kind of data-dependent regret bound. Although much work has focused on achieving data-dependent regret bounds for general online learning problems, such as prediction with expert advice, multi-armed bandits and online convex optimization, little of it considers computational constraints.

Worst-case regret analysis also deserves further study. In the case of memory constraints and \(K>d/\ln \sqrt{T}\), our algorithm cannot adapt to the norm of the competitor; thus the regret bound is far from optimal in terms of \(\Vert f^*_i\Vert _{{\mathcal {H}}_i}\). In the case of time constraints and \(K>d\), if \({\mathcal {T}} =\omega (T/K)\), then there is a gap of order \(\sqrt{K}\) between the lower bound and the upper bound; it remains open whether this gap can be removed. Moreover, that algorithm also cannot adapt to the norm of the competitor.