Abstract
We study the problem of online kernel selection under computational constraints, where the memory or time of the kernel selection and online prediction procedures is restricted to a fixed budget. In this paper, we analyze the worst-case lower bounds on the regret of online kernel selection algorithms that operate on a subset of the observed examples, and design algorithms enjoying corresponding upper bounds. We also identify the condition under which online kernel selection with time constraints is different from that with memory constraints. To design algorithms, we reduce the problems to two sequential decision problems, that is, the problem of prediction with expert advice and the multi-armed bandit problem with an additional observation. Our algorithms introduce some new techniques, such as memory sharing, hypothesis space discretization and a decoupled exploration-exploitation scheme. Numerical experiments on online regression and classification are conducted to verify our theoretical results.
1 Introduction
Kernel selection is a fundamental problem of online kernel learning, which focuses on how to select kernel functions for online kernel learning algorithms on the fly. This problem is also termed online kernel selection, and is related to the more general online model selection (Foster et al. 2017; Muthukumar et al. 2019). Unlike offline kernel selection, where we first execute kernel selection on a training set and then learn a predictor for the subsequent prediction tasks, in online kernel selection the kernel selection and online prediction procedures are integrated and form a sequential prediction procedure. Given a collection of kernel functions \(\{\kappa _i\}^K_{i=1}\), which induce K reproducing kernel Hilbert spaces (RKHSs) \(\{{\mathcal {H}}_i\}^K_{i=1}\), an adversary sequentially sends the learner an example \(({\mathbf {x}}_t,y_t)\in \mathbb {R}^d\times \mathbb {R}, t=1,\ldots ,T\). The learner will choose a sequence of kernels \(\{\kappa _{I_t}\}^T_{t=1}\) and a sequence of hypotheses \(\{f_{t}\}^T_{t=1}\). At each round t, the learner suffers a loss \(\ell (f_t({\mathbf {x}}_t),y_t)\). A general performance measure is the regret. The regret with respect to (w.r.t.) \({\mathcal {H}}_i,i\in [K]\) is defined as follows
Since the best kernel function for the current learning task is unknown, the learner hopes to adapt to any \({\mathcal {H}}_i\) up to a small cost.
A major challenge of online kernel selection is the high computational complexity of evaluating kernel functions, which requires operating on the observed examples and thus incurs an O(T) per-round time complexity and space complexity. We can solve this problem from two computational perspectives. The first computational perspective aims at reducing the computational complexity. Most previous work followed this line. The random feature based online kernel selection approach (Nguyen et al. 2017) embedded the implicit RKHSs into relatively low-dimensional explicit feature spaces, in which the time and space complexity of evaluating kernel functions are linear in the dimension of the random feature spaces. The sketch based online kernel selection approach (Zhang and Liao 2018, 2020) maintained a budget and incrementally constructed sketched hypothesis spaces, in which the time and space complexity are linear in the budget size. Another approach reduces online kernel selection to a problem of prediction with expert advice, and uses some master algorithm to wrap computationally efficient online kernel learning algorithms, including budgeted online kernel learning (Crammer et al. 2003; Dekel et al. 2008; Orabona et al. 2009; Koppel et al. 2019), low-rank matrix approximation based online kernel learning and projection to a low-dimensional space (Lu et al. 2016; Jézéquel et al. 2019). For instance, Foster et al. (2017) studied online model selection in Banach space and developed a multi-scale expert advice algorithm, which can adapt to the loss ranges of different hypothesis sets.
The second computational perspective limits the usable computational resources and is more practical for online learning problems. Previous work did not consider this new computational perspective, or only indirectly considered the memory constraints (Nguyen et al. 2017; Zhang and Liao 2018). Thus many fundamental problems induced by computational constraints have been overlooked. The first fundamental problem is how the regret depends on the computational constraints, T and K, where K is the number of candidate kernel functions. For instance, given a memory budget B, it is still unclear how the lower bound on the regret depends on B, T and K. The second problem is what the differences between memory constraints and time constraints are. The main obstacle induced by the computational constraints is how to avoid allocating the available computational resources over K RKHSs. Existing approaches allocate the computational resources over the K RKHSs, and thus may not be optimal.
In this paper, we study online kernel selection under computational constraints, where the kernel selection and online prediction procedures are restricted by a memory budget or a time budget of \({\mathcal {T}}\) quanta. We focus on the worst-case regret analysis and solve the above two fundamental problems. To start with, we make mild assumptions that relate the memory budget and time budget to the example budget. Thus we only consider online kernel selection approaches that operate on a subset of observed examples. For unconstrained RKHSs and convex loss functions, we separately prove a lower bound on the regret under a memory budget and under a time budget. Our proof technique is novel: it relies on a sequence of equi-distant instances and does not require orthogonality or approximate orthogonality in RKHSs. For online kernel selection with memory constraints, we reduce it to the problem of prediction with expert advice, and establish two nearly optimal algorithms with different regret bounds. The key ingredients are a memory sharing scheme and a hypothesis space discretization scheme. For online kernel selection with time constraints, we consider two cases. If \(K\le d\), where d is the number of features, this problem is equivalent to the case of memory constraints. For the case of \(K>d\), the two problems are different. We reduce the latter case to the multi-armed bandit problem with an additional observation, and establish a nearly optimal algorithm. The key is a decoupled exploration-exploitation scheme. Table 1 gives a summary of the main results.
1.1 Related work
Online kernel learning with a memory budget has been studied for years (Crammer et al. 2003; Dekel et al. 2008; Orabona et al. 2009). The bounded online gradient descent algorithm (Zhao et al. 2012) enjoys an \(O((\Vert f\Vert ^2_{{\mathcal {H}}}+1){T}/{\sqrt{B}})\) expected regret bound for the hinge loss. However, the matching lower bound is still unknown. Dekel et al. (2008) proved an incomplete hardness result: there exists a sequence of examples and a fixed hypothesis that makes no mistakes, but any online kernel learning algorithm with limited memory always makes mistakes. How the lower bound depends on the memory budget is still unclear. For smooth loss functions, Zhang et al. (2013) proved an \(\varOmega (T/B)\) lower bound on the regret in the case of \(B=O(\sqrt{T})\). Cesa-Bianchi et al. (2015) studied the complexity of offline kernel learning with memory constraints, and proved several lower bounds on the optimization error, which is different from regret. Our work studies the lower bounds for online kernel selection with computational constraints and also applies to online kernel learning.
Agarwal et al. (2011) initiated the study of computationally budgeted model selection, where the model selection procedure is restricted to a time budget. For a finite collection of model classes, by reducing the problem to a stochastic bandit problem, an upper-confidence bound algorithm was established, which can achieve the model selection oracle inequality. The algorithm is not suitable for online kernel selection, since the environments may not be i.i.d. Our work is also related to online multiple kernel learning (Jin et al. 2010; Hoi et al. 2013). Given K candidate RKHSs, at each round t, the goal is to learn a linear combination of K predictions. Sahoo et al. (2014) proposed budgeted online multi-kernel regression algorithms, which use a budget B to limit the number of support vectors. However, they did not prove how the regret upper bound depends on B. Besides, the per-round time complexity of such algorithms is linear in K. Within time constraints, such algorithms allocate the time resources to K RKHSs, which may not be optimal. Our work reveals how the upper bound depends on the computational constraints, T and K, and fills in the omitted regret analysis.
Other related work includes parameter-free online learning (McMahan and Abernethy 2013; McMahan and Orabona 2014; Cutkosky and Boahen 2016), and model selection for the multi-armed bandit problems (Agarwal et al. 2017; Foster et al. 2019), where the CORRAL algorithm (Agarwal et al. 2017) was proposed for selecting bandit algorithms on the fly. For the problems we focus on, the sub-algorithms are online kernel learning algorithms rather than bandit algorithms, thus CORRAL is not the best candidate. Parameter-free online learning aims at making regret bounds depend on \(\Vert f\Vert _{{\mathcal {H}}}\) rather than \((\Vert f\Vert ^2_{{\mathcal {H}}}+1)\). Previous work did not consider computational constraints. Our work can achieve this goal within memory constraints.
1.2 Contributions
We study online kernel selection in the regime of memory constraints or time constraints, and analyze the regret in the worst case. Our contributions can be summarized as follows.
- We prove the worst-case lower bounds on the regret of budgeted online kernel selection algorithms with memory constraints or time constraints. The lower bounds on the regret reveal the lower bounds on the computational constraints that are necessary for achieving a given upper bound on the regret. As a byproduct, our results also apply to online kernel learning with memory constraints and complement the incomplete result established by Dekel et al. (2008).
- We identify, for the first time, the condition under which online kernel selection with time constraints is different from that with memory constraints.
- We separately propose nearly optimal algorithms for the two computational constraints, which introduce some new techniques, such as memory sharing, hypothesis space discretization and a decoupled exploration-exploitation scheme.
2 Problem setup
Let \({\mathcal {I}}_T:=\{({\mathbf {x}}_t,y_t)\}_{t\in [T]}\) be a sequence of examples, where \({\mathbf {x}}_t\in {\mathcal {X}}\subset \mathbb {R}^d\) is an instance, \(y_t\in [-Y,Y]\) is the output and \([T] = \{1,\ldots ,T\}\). Let \(\kappa (\cdot ,\cdot ):\mathbb {R}^d \times \mathbb {R}^d \rightarrow \mathbb {R}\) be a positive semidefinite kernel function, and \({\mathcal {H}}\) be the RKHS associated with \(\kappa\), such that, for any \(f\in {\mathcal {H}}\), (i) \(\langle f,\kappa ({\mathbf {x}},\cdot )\rangle _{{\mathcal {H}}}=f({\mathbf {x}}), \forall {\mathbf {x}}\in {\mathcal {X}}\), and (ii) \({\mathcal {H}}=\overline{\mathrm {span}(\kappa ({\mathbf {x}},\cdot )\vert {\mathbf {x}}\in {\mathcal {X}})}\). We define \(\langle \cdot ,\cdot \rangle _{{\mathcal {H}}}\) as the inner product in \({\mathcal {H}}\), which induces the norm \(\Vert f\Vert _{{\mathcal {H}}}=\sqrt{\langle f,f\rangle _{{\mathcal {H}}}}\). We assume that the loss function \(\ell :\mathbb {R}\times {[-Y,Y]}\rightarrow \mathbb {R}_{+}\) is convex in its first argument.
Let \({\mathcal {K}} = \{\kappa _i\}^K_{i=1}\) be a collection of kernel functions, which induce K RKHSs \({\mathcal {H}}=\{{\mathcal {H}}_i\}^K_{i=1}\). If an oracle gives the best kernel \(\kappa ^*\) for \({\mathcal {I}}_T\), then we just need to learn a sequence of hypotheses in \({\mathcal {H}}^*\). Lacking prior knowledge of \({\mathcal {H}}^*\), the learner hopes to develop some kernel selection algorithm and generate a sequence of hypotheses \(\{f_t\}^T_{t=1}\), which is competitive with that generated by the same algorithm running in \({\mathcal {H}}^*\) solely. The regret of the algorithm w.r.t. \({\mathcal {H}}_i\in {\mathcal {H}}\) is defined in (1). For the sake of clarity, we restate it as follows,
To adapt to the unknown \({\mathcal {H}}^*\), a feasible approach is to keep sub-linear regret w.r.t. any \({\mathcal {H}}_i\).
To achieve this goal, the main challenge is the high time and space complexity. If we do not limit the model size, then the per-round time complexity and the space complexity would be O(T). In this paper, we consider online kernel selection under computational constraints, including a memory budget or a time budget, and analyze the worst-case regret. Next, we define the two kinds of computational constraints.
Definition 1
(Memory Budget) Define a memory budget of \({\mathcal {T}}\) quanta as the maximal memory that any online kernel selection algorithm can use.
Definition 2
(Time Budget) Let the interval between the arrival times of \({\mathbf {x}}_t\) and \({\mathbf {x}}_{t+1}, t=1,\ldots ,T\) be less than \({\mathcal {T}}\) quanta. Define a time budget of \({\mathcal {T}}\) quanta as the maximal time interval between the moments at which any online kernel selection algorithm outputs the predictions for \({\mathbf {x}}_t\) and \({\mathbf {x}}_{t+1}\).
In Definition 1, the term “quanta” is the unit of memory, such as “byte”. In Definition 2, the term “quanta” is the unit of time, such as “millisecond” or “second”. We further assume that the base kernels satisfy the following property.
Assumption 1
For all \(\kappa _i\in {\mathcal {K}}\) and \({\mathbf {u}},{\mathbf {v}}\in {\mathcal {X}}\), let \(\kappa _i({\mathbf {u}},{\mathbf {v}})\) be a function of \(\langle {\mathbf {u}},{\mathbf {v}}\rangle\), \(\Vert {\mathbf {u}}\Vert _2\) and \(\Vert {\mathbf {v}}\Vert _2\), and \(\kappa _i({\mathbf {u}},{\mathbf {u}})\in [0,D_{i}]\).
Such kernels are also called Euclidean kernels (Kothari and Livni 2020). For simplicity, let \(D:=\max _iD_i\). Common kernel functions, such as shift-invariant kernels and polynomial kernels with bounded degree, satisfy the assumption. We further give three key assumptions, which reduce the memory budget and time budget to an example budget.
Assumption 2
Let the memory budget be linear in the space complexity of the algorithm, and the time budget be linear in the time complexity of the algorithm.
The space complexity is defined as the memory required by the algorithm. Thus it is intuitive to assume that the memory budget is linear in the space complexity of the algorithm. Similarly, assume that m multiply operations can be executed within a unit of time. For a given time budget of \({\mathcal {T}}\) quanta, the algorithm can execute \(m{\mathcal {T}}\) multiply operations. The time complexity of the algorithm is defined as the total number of multiply operations. Thus we can also assume that the time budget is linear in the time complexity.
Assumption 3
Under the condition of Assumptions 1 and 2, for any kernel \(\kappa \in {\mathcal {K}}\), there exist positive integers \(\alpha\) and \(\beta\), such that any budgeted online kernel learning algorithm running in \({\mathcal {H}}_{\kappa }\) can maintain a budget storing \(B\le \alpha {\mathcal {T}}\) examples within a memory budget of \({\mathcal {T}}\) quanta, or can execute \(B\le \beta {\mathcal {T}}\) kernel evaluations at each round within a time budget of \({\mathcal {T}}\) quanta. If the space complexity and time complexity of the algorithm are linear in B, then “\(=\)” holds.
Assumption 4
Under the condition of Assumption 3, let the maximal memory budget \({\mathcal {T}}\) satisfy \(B=T\), and the maximal time budget \({\mathcal {T}}\) satisfy \(B=T\).
Assumption 4 means that there is no need to assume an infinite \({\mathcal {T}}\) unless T is infinite. The reason is that any algorithm can store T examples at most. In practice, \({\mathcal {T}}\) may be very small. In Assumption 3, the budgeted online kernel learning algorithms are those algorithms that operate on a subset of the observed examples, such as Forgetron (Dekel et al. 2008), BOGD (Zhao et al. 2012) and BSGD (Wang et al. 2012), to name but a few. We claim that \(\alpha\) and \(\beta\) are independent of the kernel function. This is reasonable, since the memory cost is used to store the support vectors and coefficient vectors, and the time cost of computing \(\kappa (\mathbf{u},\mathbf{v})\) is that of computing \(\langle \mathbf{u},\mathbf{v}\rangle\), \({\Vert \mathbf{u}\Vert _2}\) and \({\Vert \mathbf{v}\Vert _2}\). We only focus on convex loss functions. Online gradient descent has the lowest space and time complexity, which is O(dB), where B is the budget size. For algorithms whose time complexity is \(O(dB^\gamma ),\gamma >1\), “=” does not hold in Assumption 3. Based on the above three assumptions, we only consider online kernel selection algorithms that work in implicit RKHSs and operate on finitely many examples. For the sake of clarity, we denote such algorithms as budgeted online kernel selection algorithms.
Next we restate the main questions.
- Q1: How does the regret depend on \({\mathcal {T}}\), T and K in the worst case?
- Q2: What are the differences between memory constraints and time constraints?
To answer the two questions, we need to solve the following two problems, (i) proving the lower bounds on the regret under memory constraints or time constraints and, (ii) establishing algorithms achieving the lower bounds. Our main contributions are providing nearly complete answers to the questions.
3 Online kernel selection with memory constraints
In this section, we give both a lower bound on the regret for online kernel selection with a memory budget and two simple algorithms nearly achieving the lower bound.
3.1 Lower bound
We select K Gaussian kernel functions \(\kappa _i({\mathbf {x}},{\mathbf {z}}) = \exp \left( -\frac{\Vert {\mathbf {x}}-{\mathbf {z}}\Vert ^2_2}{\sigma _i}\right)\), \(i\in [K]\) as the candidates. Without loss of generality, let \(0<\sigma _1<\ldots <\sigma _K\), where \(\sigma _K\) is a bounded constant. We can also create candidates from other kernel functions, such as polynomial kernels, or the mixture of polynomial kernels and Gaussian kernels.
Theorem 1
Let \(\ell (\cdot ,\cdot )\) be the hinge loss or the absolute loss. There exist K kernel functions \(\{\kappa _{i}\}^K_{i=1}\) selected by the learner, and a sequence of examples \(\{({\mathbf {x}}_t,y_t)\}^T_{t=1}\) selected by an oblivious adversary, where \(y_t\in \{-1,1\}\), such that, for a memory budget of \({\mathcal {T}}\) quanta, under the condition of Assumption 3, for all \(\kappa _i\), the expected regret of any budgeted online kernel selection algorithm satisfies
where L is the Lipschitz constant of \(\ell\), and \(f^*_i=\mathrm {argmin}_{f\in {\mathcal {H}}_i}\sum ^T_{t=1}\ell (f({\mathbf {x}}_t),y_t)\).
According to the lower bound, we can infer the relation between the upper bound on the regret and the lower bound on the required memory budget. In the case of \(T=O(\alpha {\mathcal {T}})\), the optimal upper bound on the regret is \(O\left( \left\| f^*_i\right\| _{{\mathcal {H}}_{i}}L\sqrt{T}\right)\). In the case of \(T=\varOmega (\alpha {\mathcal {T}})\), the optimal upper bound is \(O\left( \left\| f^*_i\right\| _{{\mathcal {H}}_{i}}L\frac{T}{\sqrt{\alpha {\mathcal {T}}}}\right)\). Let \(T(\alpha {\mathcal {T}})^{-\frac{1}{2}}\le CT^{\upsilon },\frac{1}{2}\le \upsilon <1\), where C is a constant. Solving the inequality yields that the required lower bound on the memory budget satisfies \({\mathcal {T}}\ge C^{-2}\alpha ^{-1} T^{2(1-\upsilon )}\). In the worst case, achieving a \(O(T^\upsilon ),\frac{1}{2}\le \upsilon <1\) regret bound requires a memory budget of order \(\varOmega (T^{2-2\upsilon })\). The lower bound on the regret seems surprising and may not be a strong result, since it is independent of K. We will show that it is optimal up to an additional penalty term.
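The relation between a target regret rate and the required memory budget can be checked numerically. The following is a minimal sketch, assuming only the inequality \(T(\alpha {\mathcal {T}})^{-\frac{1}{2}}\le CT^{\upsilon }\) stated above; the function names `min_memory_budget` and `regret_rate` and the default values of `alpha` and `C` are illustrative choices, not part of the analysis.

```python
import math


def min_memory_budget(T, upsilon, alpha=1.0, C=1.0):
    """Smallest budget (in quanta) with T / sqrt(alpha * budget) <= C * T**upsilon."""
    if not 0.5 <= upsilon < 1.0:
        raise ValueError("upsilon must lie in [1/2, 1)")
    # Solving the inequality for the budget gives C^-2 * alpha^-1 * T^(2(1 - upsilon)).
    return T ** (2.0 * (1.0 - upsilon)) / (C ** 2 * alpha)


def regret_rate(T, budget, alpha=1.0):
    """The T / sqrt(alpha * budget) factor of the bound (norm and Lipschitz factors omitted)."""
    return T / math.sqrt(alpha * budget)


T = 10_000
for upsilon in (0.5, 0.75, 0.9):
    budget = min_memory_budget(T, upsilon)
    print(f"upsilon={upsilon}: budget >= {budget:.1f}, rate = {regret_rate(T, budget):.1f}")
```

With these illustrative defaults, a smaller target exponent \(\upsilon\) demands a larger budget, and \(\upsilon =\frac{1}{2}\) requires a budget of order T.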
If \(K=1\), then Theorem 1 reveals the lower bound of budgeted online kernel learning algorithms. We cannot provide an \(O(\Vert f^*_1\Vert _{{\mathcal {H}}_1}L\sqrt{T})\) regret bound unless the memory budget \({\mathcal {T}}=\varOmega (T/\alpha )\). The BOGD algorithm (Zhao et al. 2012) enjoys an \(O((\Vert f^*_1\Vert ^2_{{\mathcal {H}}_1}+1)L{T}/{\sqrt{\alpha {\mathcal {T}}}})\) expected regret bound which is optimal w.r.t. T, but sub-optimal w.r.t. \(\Vert f^*_1\Vert _{{\mathcal {H}}_1}\). Dekel et al. (2008) proved an incomplete hardness result for online kernel learning under a memory budget B. There always exist \(B+1\) examples, such that any algorithm only storing B examples will make \(T=B+1\) mistakes. Besides, there is a hypothesis \(f^*_1\in {\mathcal {H}}_1\) satisfying \(\Vert f^*_1\Vert _{{\mathcal {H}}_1}=\sqrt{B+1}\) that never makes mistakes and attains a hinge loss of 0. Actually, their lower bound on the mistakes equals the lower bound on the regret for the hinge loss, or rather, the lower bound on the regret is \(B+1 = \Vert f^*_1\Vert _{{\mathcal {H}}_1}\sqrt{T}\), where we use the specific identity \(T=B+1\). The weakness of this lower bound is that it cannot be extended to the case \(B=o(T)\). Our result in Theorem 1 provides a complete answer to the question.
3.2 A nearly optimal algorithm for any K
An intuitive approach is to allocate the memory budget to the K base kernels. According to the lower bound (2), such an approach will increase the regret by a factor of order \(O(\sqrt{K})\). Recall that any hypothesis \(f_i\in {\mathcal {H}}_i\) can be represented by \(f_i=\sum ^T_{t=1}a_{t,i}\kappa _i({\mathbf {x}}_t,\cdot )\). Thus the memory cost is used to store the support vectors \(\{({\mathbf {x}}_t,y_t)^T_{t=1}:a_{t,i}\ne 0\}\), and the coefficients \(\{(a_{t,i})^T_{t=1}:a_{t,i}\ne 0\}\). Based on this observation, we will present an algorithm that shares the support vectors and a coefficient vector among K different hypotheses \(\{f_i\}^K_{i=1}\).
Instead of selecting kernels from a finite collection \(\{\kappa _1,\ldots ,\kappa _K\}\), we will select kernels from an infinite kernel space \({\mathcal {K}}\) defined as follows,
The learning of the weight vector \(\mathbf{p}\) will be clarified later. At the beginning of round t, assume that there is a weight vector \(\mathbf{p}_t\). We learn a new kernel \(\kappa _{\mathbf{p}_t}=\sum ^K_{i=1}p_{t,i}\kappa _i\), which induces an RKHS \({\mathcal {H}}_{\mathbf{p}_t}\) with embedding \(\phi _{\mathbf{p}_t}:{\mathcal {X}}\rightarrow {\mathcal {H}}_{\mathbf{p}_t}\) defined as follows
where \(\phi _{\kappa _i}\) is the embedding induced by \(\kappa _i\). We select a hypothesis \(f_t\in {\mathcal {H}}_{\mathbf{p}_t}\), defined by
The prediction is given by \(f_t(\mathbf{x}_t)= \langle f_t,\phi _{\mathbf{p}_t}(\mathbf{x}_t)\rangle _{{{\mathcal {H}}_{\mathbf{p}_t}}} =\sum ^K_{i=1}p_{t,i}f_{t,i}(\mathbf{x}_t)\), or \(\mathrm {sign}(f_{t}(\mathbf{x}_t))\) for classification. Although there are K hypotheses \(\{f_{t,i}\}^K_{i=1}\), we just need to maintain a single set of support vectors and a single coefficient vector \((a_1,\ldots ,{a_{t-1}})\).
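The memory-sharing idea can be illustrated with a short sketch. It assumes that each \(f_{t,i}\) is the kernel expansion \(\sum _{\tau }a_{\tau }\kappa _i(\mathbf{x}_{\tau },\cdot )\) over the shared support vectors with the shared coefficients, and it uses Gaussian candidates of the form \(\exp (-\Vert \mathbf{x}-\mathbf{z}\Vert ^2_2/\sigma _i)\); the random data, the bandwidths and the weight vector are hypothetical.

```python
import numpy as np


def gaussian_kernel(X, Z, sigma):
    """Matrix of exp(-||x - z||_2^2 / sigma) for all rows x of X and z of Z."""
    sq_dist = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / sigma)


rng = np.random.default_rng(0)
sigmas = [0.5, 1.0, 2.0]             # hypothetical bandwidths of K = 3 candidates
p = np.array([0.2, 0.5, 0.3])        # hypothetical weight vector on the simplex
support = rng.normal(size=(5, 3))    # shared support vectors (B = 5, d = 3)
a = rng.normal(size=5)               # shared coefficient vector
x = rng.normal(size=(1, 3))          # current instance

# Per-kernel predictions f_{t,i}(x), all reusing the same support set and coefficients.
per_kernel = np.array([a @ gaussian_kernel(support, x, s)[:, 0] for s in sigmas])
combined_a = p @ per_kernel

# Prediction computed directly with the combined kernel sum_i p_i * kappa_i.
K_p = sum(w * gaussian_kernel(support, x, s) for w, s in zip(p, sigmas))
combined_b = a @ K_p[:, 0]

# Stored quantities: d*B support entries, B coefficients and K weights.
stored_numbers = support.size + a.size + p.size
print(combined_a, combined_b, stored_numbers)
```

Both ways of computing the prediction agree, and the number of stored values does not grow with the product of K and B, which is the point of sharing.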
To keep the memory constraints, we propose a simple example adding strategy. At any round t, let \(\nabla _{f_t}:=\ell '(f_t(\mathbf{x}_t),y_t)\phi _{\mathbf{p}_t}(\mathbf{x}_t)\) be the (sub-)gradient of \(\ell (f_t(\mathbf{x}_t),y_t)\) w.r.t. \(f_t\). We define a Bernoulli random variable \(\rho _{t}\in \{0,1\}\) satisfying
where \(C>0\) is a constant and \(z_t>0\) depends on t. The definition of C and \(z_t\) will be given in Theorem 2. Let S be a buffer storing the support vectors. We sample \(\rho _{t}\sim \mathrm {Ber}(\mathbb {P}[\rho _{t}=1],1)\). If \(\rho _{t}=1\), then we update \(f_{t}\) and add the current example into the buffer, i.e., \(S = S\cup (\mathbf{x}_t,y_t)\). Let \({\tilde{\nabla }}_{f_t}\) be an estimator of \(\nabla _{f_t}\), which is defined as follows,
We update the hypothesis by online gradient descent
where \(\lambda\) is the learning rate (or stepsize) of gradient descent. According to (3) and the definition of \(f_t\) (4), the above updating can be rewritten by
For simplicity, we define \(\nabla _{t,i}:=\ell '(f_t(\mathbf{x}_t),y_t)\phi _{\kappa _i}(\mathbf{x}_t)\).
To update \(\mathbf{p}_t\), we reduce this problem to a problem of prediction with expert advice. Let \(c_{t,i}\) be a criterion evaluating base \(\kappa _i\), \(i=1,\ldots ,K\), which serves as the loss of the i-th action.
where \(\ell _{m}=\max _{t}\{\vert \ell '(f_{t}({\mathbf {x}}_t),y_t)\vert \cdot \max _{i,j} \left( f_{t,i}({\mathbf {x}}_t)-f_{t,j}({\mathbf {x}}_t)\right) \}\) and can be tuned by the doubling trick. Let \({\mathcal {E}}(K)\) be the exponential weights algorithm in (Cesa-Bianchi and Lugosi 2006) (see Sect. 4.2). Then \(\mathbf{p}_{t+1}=(p_{t+1,1},\ldots ,p_{t+1,K})\) can be computed as follows,
where \(\eta\) is the learning rate.
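For intuition, the following is a minimal sketch of a standard exponential weights update, assuming the textbook multiplicative form \(p_{t+1,i}\propto p_{t,i}\exp (-\eta c_{t,i})\). This form, the function name `exponential_weights_update` and the loss values are assumptions for illustration and may differ in detail from the update used by the algorithm.

```python
import numpy as np


def exponential_weights_update(p, losses, eta):
    """Return the next weight vector: p_i * exp(-eta * loss_i), renormalized."""
    # Subtracting the minimum loss leaves the normalized weights unchanged
    # and avoids underflow when eta * loss is large.
    w = p * np.exp(-eta * (losses - losses.min()))
    return w / w.sum()


K, T = 3, 1000
eta = np.sqrt(8.0 * np.log(K) / T)     # the learning rate used in Theorem 3
p = np.full(K, 1.0 / K)                # uniform initial weights
losses = np.array([0.0, 0.5, 1.0])     # hypothetical criteria c_{t,i} for one round
p_next = exponential_weights_update(p, losses, eta)
print(p_next)
```

Kernels with a smaller criterion receive a larger weight in the next round, while the weights remain a probability vector.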
We name the algorithm LKMBooks (Learning Kernel for Memory BOunded Online Kernel Selection). The algorithm description is shown in Algorithm 1.
Theorem 2
Let \(E_t=\{\tau <t:\nabla _{f_{\tau }}\ne 0\}\), \(B=\alpha {\mathcal {T}}\) and \(C=B\). Let \(z_{t}=(1-\upsilon )T^{1-\upsilon }(\vert E_t\vert +1)^{\upsilon }\), where \(0\le \upsilon <1\). If there exists a \(\upsilon \in [0,1)\) satisfying \((1-\upsilon )T^{1-\upsilon } > B\), then for any sequence \({\mathcal {I}}_T\), with probability at least \(1-\delta\), LKMBooks guarantees that
Otherwise, \(\vert S\vert \le B\).
Theorem 2 shows that our algorithm will not exceed the memory constraint with high probability. \(z_t\) determines the probability that any support vector is added into the budget. It is worth noting that the key to \(z_t\) is the value of \(\upsilon\). If \(\upsilon =0\), then each support vector is added into the budget with the same probability. We can also use a non-uniform probability distribution, i.e., \(\upsilon >0\). In this case, the probability decreases as the number of support vectors increases. In experiments, we always set \(\upsilon >0\) and empirically find that the non-uniform probability distribution performs better. In theory, the two kinds of probability distributions are equivalent in the sense that they induce the same budget size and regret bounds.
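The effect of \(\upsilon\) on the sampling schedule can be sketched as follows. The formula for \(z_t\) is the one in Theorem 2; the adding probability is assumed here to be \(C/z_t\) with \(C=B\) (clipped to 1), which is an assumption consistent with the description above rather than a quoted definition, and the function names are hypothetical.

```python
def z_value(T, upsilon, num_nonzero):
    """z_t = (1 - upsilon) * T^(1 - upsilon) * (|E_t| + 1)^upsilon, as in Theorem 2."""
    return (1.0 - upsilon) * T ** (1.0 - upsilon) * (num_nonzero + 1) ** upsilon


def add_probability(B, T, upsilon, num_nonzero):
    """Assumed form of P[rho_t = 1]: C / z_t with C = B, clipped to at most 1."""
    return min(1.0, B / z_value(T, upsilon, num_nonzero))


T, B = 10_000, 40
# The condition (1 - upsilon) * T^(1 - upsilon) > B holds for both settings below.
for upsilon in (0.0, 0.5):
    probs = [add_probability(B, T, upsilon, n) for n in (0, 3, 99)]
    print(f"upsilon={upsilon}: {probs}")
```

Under this assumed form, \(\upsilon =0\) gives a constant probability \(B/T\), while \(\upsilon >0\) starts with a larger probability that decays as \(\vert E_t\vert\) grows.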
Theorem 3
Given a memory budget of \({\mathcal {T}}\) quanta, under the condition of Assumption 3, let \(B=\alpha {\mathcal {T}}\). Assume that \(\ell\) satisfies \(\vert \ell '(f({\mathbf {x}}),y)\vert \le L\). Let \({\mathcal {K}}=\{\kappa _i\}^K_{i=1}\) be a collection of kernel functions, and \(\eta =\sqrt{8\ln (K)/T}\). If \(B<T\), then let \(\lambda ={\sqrt{(1+\upsilon )B}}/{(\sqrt{(1-\upsilon )D}LT)}\). Otherwise, let \(\lambda =1/(\sqrt{DT}L)\). For any \(\kappa _i\in {\mathcal {K}}\), the expected regret of LKMBooks satisfies
Remark 1
LKMBooks is similar to the online multi-kernel learning algorithm in (Jin et al. 2010) (Algorithm 5, denoted by DA-OMKL-O for simplicity), and the budgeted online multi-kernel regression algorithm in (Sahoo et al. 2014) (denoted by BOKMR for simplicity), since the three algorithms use a convex combination of K outputs \(\{f_{t,i}({\mathbf {x}}_t)\}^K_{i=1}\). The difference is that DA-OMKL-O and BOKMR make \(\{f_{t,i}\}^K_{i=1}\) possess different coefficient vectors, whereas LKMBooks makes \(\{f_{t,i}\}^K_{i=1}\) share a single coefficient vector. Besides, DA-OMKL-O does not limit the support vectors, and one of the two versions of BOKMR also cannot share the support vectors. The space complexity of LKMBooks is \(O(dB+K)\). The two versions of BOKMR suffer an \(O(dB+KB+K)\) and an O(KBd) space complexity, respectively. For the case of \(K \gg d\), LKMBooks has the lowest space complexity. Moreover, no regret bound was provided for BOKMR.
We consider the optimality w.r.t. \({\mathcal {T}}, T\) and K. Compared with the lower bound (2), LKMBooks is optimal up to an additional penalty term of order \(O(\max \{\ell _{m},1\}\sqrt{T\ln {K}})\), which comes from the intrinsic complexity of prediction with expert advice. The penalty term is a lower order term. Thus LKMBooks avoids the dependence on \(O(\sqrt{K})\). However, LKMBooks depends on \((\Vert f^*_i\Vert ^2_{{\mathcal {H}}}+1)\), which is much worse than \(\Vert f^*_i\Vert _{{\mathcal {H}}}\). The reason is that LKMBooks uses online gradient descent (OGD) to update the hypothesis. The standard regret bound of OGD depends on \((\Vert f^*_i\Vert ^2_{{\mathcal {H}}}+1)\) (Orabona 2013). OGD is used so that a single coefficient vector can be shared. Next we show an optimal algorithm for the case of \(K<d/\ln {\sqrt{T}}\).
3.3 Adapt to the norm of competitor for \(K<d/\ln {\sqrt{T}}\)
To adapt to \(\Vert f^*_i\Vert _{{\mathcal {H}}}\), we propose a hypothesis space discretization scheme. For each \(\kappa _i\), \(i=1,\ldots ,K\), we define the feasible hypothesis space by \(\mathbb {H}_i=\{f\in {\mathcal {H}}_i:\Vert f\Vert _{{\mathcal {H}}_i}\le U\}\). We discretize (0, U] as follows
This technique is also known as the peeling technique. The key is the choice of U and \(U_{\min }\), which depends on the memory budget \({\mathcal {T}}\) and will be determined later. For any \(f\in \mathbb {H}_i\), there exists some j such that \(\Vert f\Vert _{{\mathcal {H}}_i} \in (0,\mathrm {e}^{\lceil \ln {U_{\min }}\rceil }]\) or \((\mathrm {e}^j,\mathrm {e}^{j+1}]\). Let \(M=\lceil \ln {U}\rceil -\lceil \ln {U_{\min }}\rceil +1\). We construct \(K':=KM\) nested hypothesis spaces
where \(U_j=\mathrm {e}^{j+\lceil \ln {U_{\min }}\rceil -1}\). Thus \(\mathbb {H}_{i,1}\subset \ldots \subset \mathbb {H}_{i,M}\subset {\mathcal {H}}_i\). For the sake of clarity, we define two index functions \(h:[K]\times [M]\rightarrow [K']\) and \(h^{*}: [K']\rightarrow [K]\times [M]\). Specifically, h(i, j) maps (i, j) to the h(i, j)-th element in \([K']\). Similarly, \(h^{*}(k)\) maps \(k\in [K']\) to \((h^{*}(k)_1,h^{*}(k)_2)\), where \(h^{*}(k)_1=\lfloor (k-1)/M\rfloor + 1\) and \(h^{*}(k)_2 = k-(h^{*}(k)_1-1)M\).
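The discretization and the index functions can be sketched in a few lines. The formulas for M, \(U_j\) and \(h^{*}\) follow the definitions above; the explicit form \(h(i,j)=(i-1)M+j\) is an assumption, chosen because it is the inverse of the stated \(h^{*}\), and the values of U, \(U_{\min }\) and K are hypothetical.

```python
import math


def num_scales(U, U_min):
    """M = ceil(ln U) - ceil(ln U_min) + 1."""
    return math.ceil(math.log(U)) - math.ceil(math.log(U_min)) + 1


def radius(j, U_min):
    """U_j = e^(j + ceil(ln U_min) - 1) for j = 1, ..., M."""
    return math.exp(j + math.ceil(math.log(U_min)) - 1)


def h(i, j, M):
    """Assumed flattening of (i, j) in [K] x [M] to an index in [K * M]."""
    return (i - 1) * M + j


def h_star(k, M):
    """Map k in [K * M] back to (i, j), following the stated formulas."""
    i = (k - 1) // M + 1
    j = k - (i - 1) * M
    return i, j


U, U_min, K = 100.0, 1.0, 3            # hypothetical values
M = num_scales(U, U_min)
radii = [radius(j, U_min) for j in range(1, M + 1)]
print(M, radii)
print([h_star(k, M) for k in range(1, K * M + 1)])
```

The radii grow geometrically, the largest one covers U, and every index in \([K']\) corresponds to exactly one pair (i, j).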
To share the support vectors, we use an oblivious example adding strategy. The term “oblivious” means that the strategy is independent of algorithms. At any round t, let \(\rho _{t}\in \{0,1\}\) be a Bernoulli random variable satisfying
Let \(\{f_{t,i,j}\}^T_{t=1}\) be a sequence of hypotheses in \(\mathbb {H}_{i,j}\) and \(\nabla _{t,i,j}=:\nabla _{f_{t,i,j}}\ell (f_{t,i,j}(\mathbf{x}_t),y_t)\) be the (sub-)gradient w.r.t. \(f_{t,i,j}\), \(i\in [K],j\in [M]\). At the end of round t, we sample \(\rho _{t}\sim \mathrm {Ber}(\mathbb {P}[\rho _{t}=1],1)\). If \(\rho _{t}=1\), then we update the hypothesis \(f_{t,i,j}\) and add the current example into the buffer, i.e., \(S = S\cup (\mathbf{x}_t,y_t)\). Let \({\tilde{\nabla }}_{t,i,j}\) be an estimator of \(\nabla _{t,i,j}\), which is defined as follows,
We update the hypothesis by online gradient descent
The projection of any \(f\in {\mathcal {H}}_i\) onto \(\mathbb {H}_{i,j}\) is defined by \(g=\min \{1,\frac{U_j}{\Vert f\Vert _{{\mathcal {H}}_i}}\}f\).
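For a hypothesis stored as a kernel expansion, this projection amounts to rescaling the coefficient vector. The sketch below assumes \(f=\sum _{\tau }a_{\tau }\kappa (\mathbf{x}_{\tau },\cdot )\), so that \(\Vert f\Vert ^2_{{\mathcal {H}}}=\mathbf{a}^{\top }\mathbf{G}\mathbf{a}\) with Gram matrix \(\mathbf{G}\); the Gaussian kernel, the random data and the function names are hypothetical.

```python
import numpy as np


def rkhs_norm(a, gram):
    """RKHS norm of f = sum_tau a_tau * kappa(x_tau, .), i.e. sqrt(a^T G a)."""
    return float(np.sqrt(max(a @ gram @ a, 0.0)))


def project(a, gram, radius):
    """Coefficients of min{1, radius / ||f||} * f."""
    norm = rkhs_norm(a, gram)
    scale = 1.0 if norm == 0.0 else min(1.0, radius / norm)
    return scale * a


rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))                                   # support vectors
gram = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
a = rng.normal(size=6)                                        # coefficients

norm = rkhs_norm(a, gram)
shrunk = project(a, gram, 0.5 * norm)      # outside the ball: rescaled onto it
unchanged = project(a, gram, 2.0 * norm)   # inside the ball: left as is
print(norm, rkhs_norm(shrunk, gram), rkhs_norm(unchanged, gram))
```

Only the coefficients change, so the projection does not alter the shared set of support vectors.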
Next we show the kernel selection procedure. Let \({\mathcal {E}}(K')\) be an algorithm for prediction with expert advice. We select a hypothesis space \(\mathbb {H}_{h^{*}(I_t)_1,h^{*}(I_t)_2}\), where \(I_t\sim \mathbf{p}_t\), and make prediction \({\hat{y}}_t=f_{t,h^{*}(I_t)_1,h^{*}(I_t)_2}({\mathbf {x}}_t)\) or \(\mathrm {sign}({\hat{y}}_t)\). For each action \(h(i,j)\in [K']\), let the criterion be \(c_{t,h(i,j)}=\ell (f_{t,i,j}({\mathbf {x}}_t),y_t)\). For all \(f\in \mathbb {H}_{i,j}\), assume that there is a function \(g(U_j,D_i,Y)\) satisfying \(c_{t,h(i,j)}\le g(U_j,D_i,Y)\). At the end of round t, we send \(\mathbf{c}_t=(c_{t,1},\ldots ,c_{t,K'})\) to \({\mathcal {E}}(K')\). To adapt to the norm of the competitor, \({\mathcal {E}}(K')\) needs to achieve a multi-scale regret bound. Let \({\mathcal {E}}(K')\) be the MSMW algorithm in Bubeck et al. (2019), which is shown in Algorithm 3.
We name this algorithm PFMBooks (Parameter-Free for Memory BOunded Online Kernel Selection).
Theorem 4
Let \(B=\alpha {\mathcal {T}}\), \(C=B\) and \(z_{t}=2(1-\upsilon )T^{1-\upsilon }t^{\upsilon }\), where \(0\le \upsilon <1\). Under the condition of Assumption 4, there exists a \(\upsilon \in [0,1)\) such that \(2(1-\upsilon )T^{1-\upsilon } > B\). For any sequence \({\mathcal {I}}_T\), with probability at least \(1-\delta\), PFMBooks guarantees that
The proof is the same as that of Theorem 2. PFMBooks ensures \(\vert S\vert =O(B/2)\) with high probability and maintains KM coefficient vectors. The total space complexity is \(O(\frac{dB}{2}+\frac{BKM}{2})=O(dB)=O(d\alpha {\mathcal {T}})\) in the case of \(K<d/M\). We will set \(U_{\min }=U/\sqrt{T}\) in Theorem 6, and thus \(M<1+\ln \sqrt{T}\). Hence PFMBooks does not exceed the total memory constraint with high probability. Next we state an important assumption, which is easily satisfied and forms the basis for obtaining the final regret bound.
Assumption 5
For any sequence of examples \({\mathcal {I}}_T:=\{({\mathbf {x}}_t,y_t)\}_{t\in [T]}\), let \(\vert y_t\vert \le Y\). For any hypothesis \(f\in {\mathcal {H}}_i,i=1,\ldots ,K\) and \(({\mathbf {x}},y)\in {\mathcal {I}}_T\), there always exists a function \(g(\Vert f\Vert _{{\mathcal {H}}_i},D_i,Y):\mathbb {R}^3\rightarrow \mathbb {R}\) such that \(\ell (f({\mathbf {x}}),y) \le g(\Vert f\Vert _{{\mathcal {H}}_i},D_i,Y)\) and \(g(\Vert f\Vert _{{\mathcal {H}}_i},D_i,Y)=\varTheta (1+\Vert f\Vert _{{\mathcal {H}}_i})\).
Many loss functions satisfy Assumption 5, such as the \(\varepsilon\)-insensitive hinge loss, and the \(\varepsilon\)-insensitive absolute loss. For instance, if \(\ell (f({\mathbf {x}}),y)=\vert f({\mathbf {x}})-y\vert\), then we can define \(g(\Vert f\Vert _{{\mathcal {H}}_i},D_i,Y)=\Vert f\Vert _{{\mathcal {H}}_i}\sqrt{D_i}+Y\). If \(\ell (f({\mathbf {x}}),y)=\max \{0,1-yf({\mathbf {x}})\}\), then we can define \(g(\Vert f\Vert _{{\mathcal {H}}_i},D_i,Y)=1+Y\Vert f\Vert _{{\mathcal {H}}_i}\sqrt{D_i}\). Next we show the multi-scale regret bound of \({\mathcal {E}}(K')\).
Theorem 5
Let \(\eta =\sqrt{2\ln (K'T)/T}\) and \(U=\varTheta (B)\). Under the condition of Assumption 5, \(\forall k\in [K']\), the expected regret of \({\mathcal {E}}(K')\) satisfies
Remark 2
\({\mathcal {E}}(K')\) is slightly different from the original MSMW algorithm in Bubeck et al. (2019) in two respects: (i) MSMW uses “reward” as the feedback, whereas \({\mathcal {E}}(K')\) uses “loss” as the feedback; (ii) the initial distributions of MSMW and \({\mathcal {E}}(K')\) are different. Although we can transform “loss” to “reward” by \(r_{t,k}=g(U_{h^{*}(k)_2},D_{h^{*}(k)_1},Y)-c_{t,k}\), where \(r_{t,k}\) is the reward of the k-th action, the regret bound then incurs an additional term \(\sum ^T_{t=1}[\sum ^{K'}_{k=1}p_{t,k}g(U_{h^*(k)_2},D_{h^*(k)_1},Y) -g(U_{h^*(k)_2},D_{h^*(k)_1},Y)]\), which cannot adapt to the scale of an individual action. Thus we need a different proof. We present a simpler proof in the Appendix. One key ingredient is the use of a different initial distribution.
Theorem 6
Given a memory budget of \({\mathcal {T}}\) quanta, under the condition of Assumption 3, let \(B=\alpha {\mathcal {T}}\). Let \(U=\varTheta (\sqrt{B})\), \(U_{\min }=U/\sqrt{T}\) and \(\lambda _{i,j}=\frac{U_j\sqrt{(1+\upsilon )B}}{\sqrt{2(1-\upsilon )D_i}LT}\). The expected regret of PFMBooks w.r.t. any \({\mathcal {H}}_i,i=1,\ldots ,K\) satisfies
Remark 3
In Theorem 1, the lower bound does not limit \(\Vert f^*_i\Vert _{{\mathcal {H}}_i}\). Our upper bound may be invalid if \(U<\Vert f^*_i\Vert _{{\mathcal {H}}_i}\). Inspecting the hard examples in the proof of Theorem 1, we find that \(\Vert f^*_i\Vert _{{\mathcal {H}}_i} =\varTheta (\sqrt{B})\). Thus our upper bound is still valid if \(U =\varTheta (\sqrt{B})\).
The expectation is w.r.t. the randomness of \({\mathcal {E}}(K')\) and the randomness of \(\{\rho _t\}^{T-1}_{t=1}\). Compared with the upper bound in Theorem 3, PFMBooks improves the dependence on \(\Vert f^*_i\Vert _{{\mathcal {H}}_i}\). Compared with the lower bound (2), PFMBooks is optimal up to a factor of order \(O(\sqrt{\ln (K'T)})\) and a small penalty term of order \(O\left( \sqrt{T\ln (K'T)}\right)\).
4 Online kernel selection with time constraints
In this section, we give both a lower bound on the regret for online kernel selection with a time budget and a simple algorithm nearly achieving the lower bound.
4.1 Lower bound
For the sake of clarity, we introduce the notion of resource allocation. Any kernel selection algorithm needs to specify a kernel selection strategy and a resource allocation strategy simultaneously. In this work, we consider the static resource allocation defined as follows.
Definition 3
(Static Resource Allocation) Define a static resource allocation \(R({\mathcal {T}}_1,\ldots ,{\mathcal {T}}_K)\) as a strategy that allocates a time budget of \(0<{\mathcal {T}}_i\le {\mathcal {T}}\) quanta to kernel function \(\kappa _i\) before the game, and does not change later.
For any budgeted kernel selection algorithm with static resource allocation \(R({\mathcal {T}}_1,\ldots ,{\mathcal {T}}_K)\), the following theorem gives a lower bound on the regret.
Theorem 7
Let \(\ell (\cdot ,\cdot )\) be the hinge loss or the absolute loss. There exist K kernel functions \(\{\kappa _{i}\}^K_{i=1}\) chosen by the learner, and a sequence of examples \(\{({\mathbf {x}}_t,y_t)\}^T_{t=1}\) chosen by an oblivious adversary, where \(y_t\in \{-1,1\}\), such that for a time budget of \({\mathcal {T}}\) quanta, under the condition of Assumption 3, for all \(\kappa _i\), the expected regret of any budgeted online kernel selection algorithm with static resource allocation \(R({\mathcal {T}}_1,\ldots ,{\mathcal {T}}_K)\) satisfies
where L is the Lipschitz constant of \(\ell\), and \(f^*_i\in {\mathcal {H}}_i = \overline{\mathrm {span}(\kappa _i({\mathbf {x}}_1,\cdot ),\ldots , \kappa _i({\mathbf {x}}_t,\cdot ))}\).
The lower bound also reveals that, in the worst case, achieving an \(O(T^\upsilon ),\frac{1}{2}\le \upsilon <1\) regret bound requires a time budget of order \(\varOmega (T^{2-2\upsilon })\). To design algorithms achieving the lower bound (9), it is necessary to adopt the \(R({\mathcal {T}},\ldots ,{\mathcal {T}})\) resource allocation.
We first highlight the difference between memory constraints and time constraints. Recall that the space complexity of LKMBooks is \(O(dB+K)\). The time complexity of LKMBooks is \(O(dB+KB+K)\), not \(O(KdB+K)\). The reason is that, under Assumption 1, the main time cost of computing \(\kappa _i(\mathbf{x}_t,\mathbf{x}_{\tau })\) for all \(\mathbf{x}_{\tau }\in S\) is computing the norm \({\Vert \mathbf{x}_t-\mathbf{x}_\tau \Vert _2}\) or the inner product \(\langle \mathbf{x}_t,\mathbf{x}_{\tau }\rangle\). Since LKMBooks only maintains a single S, we can first compute the norms or inner products between \(\mathbf{x}_t\) and the support vectors in S, and then reuse them for all K kernels. Thus the time complexity of computing \(f_{t,i}(\mathbf{x}_t)\) for all \(i=1,\ldots ,K\) is of order \(O(dB+KB)\). If \(K\le d\), the two constraints are equivalent and LKMBooks is also a nearly optimal algorithm for the case of time constraints. The situation is different for the case of \(K>d\). Assume that \(K=d^{\nu },\nu >1\). If an algorithm achieves the lower bound (9), then it must adopt the \(R({\mathcal {T}},\ldots ,{\mathcal {T}})\) resource allocation. Let \(B_1\) be the available budget of such an algorithm, and \(B_2\) be the available budget of LKMBooks. According to Assumption 3, we have the two identities \(dB_1={\mathcal {T}}\) and \((d+K)B_2={\mathcal {T}}\), which imply \(B_2=\frac{d}{d+K}B_1=O(K^{\frac{1-\nu }{\nu }}B_1)\). Substituting into Theorem 3, the regret of LKMBooks increases by a factor of order \(O(K^\frac{\nu -1}{2\nu })\).
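The \(O(dB+KB)\) cost can be illustrated with a short Python sketch. It assumes Gaussian kernels that differ only in their width and a single shared buffer of support vectors; the function and argument names are hypothetical and not taken from the algorithms above.

```python
import math

def gaussian_kernel_values(x_t, support, sigmas):
    """Evaluate K Gaussian kernels between x_t and every support vector.

    The squared distances cost O(d * B) and are computed once; each kernel
    then reuses them at O(1) per support vector, giving O(d*B + K*B) overall.
    """
    # O(d * B): one pass over the shared buffer, independent of K.
    sq_dists = [sum((a - b) ** 2 for a, b in zip(x_t, x_tau)) for x_tau in support]
    # O(K * B): each kernel width only rescales the shared distances.
    return [[math.exp(-s / (2.0 * sigma ** 2)) for s in sq_dists] for sigma in sigmas]

def predict_all(x_t, support, coefs, sigmas):
    """Return f_{t,i}(x_t) for every kernel i.

    coefs[i][tau] is the coefficient of kappa_i(x_tau, .) in f_{t,i}.
    """
    values = gaussian_kernel_values(x_t, support, sigmas)
    return [sum(c * v for c, v in zip(coefs[i], values[i])) for i in range(len(sigmas))]
```

If each kernel kept its own buffer, the distance pass would have to be repeated K times, which is the \(O(KdB)\) cost that sharing avoids.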
Thus for the case of \(K<d\), we can directly use LKMBooks or PFMBooks. Next we propose a nearly optimal algorithm for the case of \(K>d\). The algorithm adopts the \(R({\mathcal {T}}/2,\ldots ,{\mathcal {T}}/2)\) resource allocation.
4.2 A nearly optimal algorithm for \(K>d\)
A simple observation is that we need not evaluate all of the base kernels at each round. An intuitive approach is to select a single kernel function, \(\kappa _{I_t}\), and use the hypothesis \(f_{t,I_t}\) to make the prediction. Such an approach has been adopted in (Yang et al. 2012), where the kernel selection problem is reduced to a K-armed bandit problem. However, the resulting regret bound is far from optimal for online kernel selection. At each round, the approach constructs the estimated gradient \({\tilde{\nabla }}_{t,i}=\nabla _{t,i}/p_{t,i}\). The second moment is of order \(\max _{t}\nabla _{t,i}/p_{t,i}\), which may be large. To address this issue, we propose a simple exploration-exploitation scheme.
For each \(\kappa _i\), we define the feasible hypothesis space by \(\mathbb {H}_i=\{f\in {\mathcal {H}}_i:\Vert f\Vert _{{\mathcal {H}}_i}\le U\}\). We slightly modify Algorithm 1. The key difference is that we randomly evaluate two kernel functions at each round. The two kernel functions are selected by a decoupled exploration-exploitation scheme, which is defined as follows
-
Exploitation: select a kernel function \(\kappa _{I_t}\sim \mathbf{p}_t\),
-
Exploration: select another kernel function \(\kappa _{J_t}\sim {\mathcal {K}}\) uniformly.
Note that it is possible that \(\kappa _{I_t}=\kappa _{J_t}\). The exploration procedure makes each kernel be selected with a high probability.
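The two draws of the decoupled scheme can be sketched in Python as follows. This is only an illustration: indices are 0-based here, and the function name and the use of a seeded `random.Random` instance are assumptions made for the example.

```python
import random

def select_kernels(p_t, rng):
    """Draw the exploitation index I_t ~ p_t and the exploration index J_t uniformly."""
    K = len(p_t)
    I_t = rng.choices(range(K), weights=p_t, k=1)[0]  # exploitation: follows p_t
    J_t = rng.randrange(K)                            # exploration: ignores p_t
    return I_t, J_t
```

Since \(J_t\) is drawn independently of \(\mathbf{p}_t\), every kernel is explored with probability \(1/K\) at each round, no matter how concentrated \(\mathbf{p}_t\) becomes.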
Let \(S_i,i=1,\ldots ,K\) be K buffers storing the support vectors. At each round t, we output the prediction \({\hat{y}}_t=f_{t,I_t}({\mathbf {x}}_t)\) or \(\mathrm {sign}({\hat{y}}_t)\). However, we do not update \(f_{t,I_t}\) unless \(I_t=J_t\). The goal is to make \(({\mathbf {x}}_t,y_t)\) be added into each \(S_{i}\) with equal probability. After receiving \(y_t\), we compute the gradient \(\nabla _{f_{t,J_t}}\ell (f_{t,J_t}({\mathbf {x}}_t),y_t)\). If \(\nabla _{f_{t,J_t}}\ell (f_{t,J_t}({\mathbf {x}}_t),y_t)\ne 0\), then we decide whether to update \(f_{t,J_t}\). Let \(\rho _{t,i}\in \{0,1\}\) be a Bernoulli random variable satisfying
If \(\rho _{t,J_t}=1\), then we update \(f_{t,J_t}\) and add the current example into the buffer, i.e., \(S_{J_t} = S_{J_t}\cup (\mathbf{x}_t,y_t)\). Let \({\tilde{\nabla }}_{t,i}\) be an estimator of \(\nabla _{t,i}\), defined as follows,
We update the hypothesis \(f_{t,i}\) following (8), where the projection can be computed incrementally in time O(1).
To update \(\mathbf{p}_t\), we define a K-armed adversarial bandit problem with an additional observation in which the algorithm may obtain two losses. \(\forall i\in [K]\), let \(c_{t,i}={\ell (f_{t,i}(\mathbf{x}_t),y_t)}/{\ell _{m}}\), where \({\ell _{m}=\max _{t,i}\{\ell (f_{t,i}(\mathbf{x}_t),y_t)\}}\) is a normalizing constant and can be tuned by the doubling trick. The key is the estimated loss \({\tilde{c}}_{t,i}\) defined as follows,
We update \(\mathbf{p}_t\) by online stochastic mirror descent (OSMD) with the negative entropy regularizer (Bubeck and Cesa-Bianchi 2012),
where \(\psi _t(\mathbf{p})=\sum ^K_{i=1}\eta _tp_i\ln {p_i}\) and \({\mathcal {D}}_{\psi _t}\) is the Bregman divergence induced by \(\psi _t\).
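With the negative entropy regularizer on the probability simplex, mirror descent is well known to take the form of a multiplicative (exponential-weights) update. The Python sketch below shows that closed form under two assumptions: a fixed learning rate `eta`, and the usual scaling in which the exponent is \(-\eta {\tilde{c}}_{t,i}\), which may differ from the exact scaling used in the update above. The estimated losses are taken as an input and their construction is not reproduced.

```python
import math

def exp_weights_update(p_t, c_tilde, eta):
    """One mirror-descent step with negative entropy on the simplex.

    p_{t+1,i} is proportional to p_{t,i} * exp(-eta * c_tilde[i]).
    """
    # Subtracting the minimum loss leaves the normalized result unchanged
    # and avoids underflow when the estimated losses are large.
    c_min = min(c_tilde)
    weights = [p * math.exp(-eta * (c - c_min)) for p, c in zip(p_t, c_tilde)]
    total = sum(weights)
    return [w / total for w in weights]
```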
We name the algorithm BATBooks (Bandit with Additional observation for Time BOunded Online Kernel Selection). The algorithm description is shown in Algorithm 4.
Theorem 8
Let \(B=\beta {\mathcal {T}}\), \(C=KB\) and \(z_{t,i}=2(1-\upsilon )T^{1-\upsilon }t^{\upsilon }\), where \(0\le \upsilon <1\). For any sequence \({\mathcal {I}}_T\), with probability at least \(1-\delta\), BATBooks guarantees that
For all \(i=1,\ldots ,K\), we have \(\vert S_{i}\vert =O(B/2)\). BATBooks evaluates two hypotheses at each round. The total time complexity is \(O(dB)=O(d\beta {\mathcal {T}})\). Thus BATBooks does not exceed the total time budget with high probability.
Theorem 9
Let \(c_{t}\in [0,1]^{K}\) be any loss vector, and \({\tilde{C}}_{T,*}= \min _{i\in [K]}\sum ^T_{t=1}{\tilde{c}}_{t,i}\), where \({\tilde{c}}_{t,i}\) is the estimator of \(c_{t,i}\) defined in (10). Let \(\eta =\min \{\sqrt{2\ln {K}/(K{\tilde{C}}_{T,*})},\frac{1}{K}\}\). BATBooks guarantees
We can obtain an expected small-loss regret bound for bandit with an additional observation, which may be of independent interest. Seldin et al. (2014) proved the worst-case expected regret bound for this problem. Thus we improve the previous result. Note that if \(\{c_t\}^T_{t=1}\) are fixed loss vectors, then we can remove the expectation operation.
Theorem 10
Given a time budget of \({\mathcal {T}}\) quanta, under the condition of Assumption 3, let \(B=:\beta {\mathcal {T}}\). Let \(U=\varTheta (\sqrt{B})\) and \(\ell\) satisfy \(\vert \ell '(f({\mathbf {x}}),y)\vert \le L\). If there exists a \(\upsilon \in [0,1)\) satisfying
then for any \(\mathbb {H}_{i}, i\in [K]\), let \(\lambda _{i}=\frac{\sqrt{(1+\upsilon )B}}{\sqrt{2(1-\upsilon )D_i}LT}\), the expected regret of BATBooks satisfies,
If condition (12) cannot be satisfied, then let \(\lambda _{i}=\frac{1}{\sqrt{KD_iT}L}\). The expected regret satisfies,
Remark 4
We show, for the first time, that online kernel selection with time constraints is different from that with memory constraints only in the case of \(K>d\), which answers our second question, Q 2. Thus for the case of \(K\le d\), we can just use Algorithm 1 or Algorithm 2. No previous work identifies such a condition. The online multi-kernel learning algorithms in (Hoi et al. 2013; Sahoo et al. 2014) and the online kernel selection algorithm in (Yang et al. 2012) randomly update a hypothesis to reduce the time complexity. We prove that such an approach is unnecessary unless \(K>d\).
We analyze the optimality w.r.t. \({\mathcal {T}}\), T and K. First we consider a small time budget, i.e., \(B< 2T/K\) (condition (12) is satisfied). Compared with the lower bound (9), BATBooks has an additional cost of order \(O(\sqrt{UL_T(f^*_i)K\ln {K}})\). Then we consider a large time budget, i.e., \(2T/K\le B \le T\) (condition (12) is not satisfied). BATBooks is sub-optimal by a multiplicative factor of order \(O(\sqrt{K})\) and the same additional cost. Although \(U=\varTheta (\sqrt{B})\), we have \(L_T(f^*_i)=0\) for the hard examples in the proof of Theorem 7. In this case, our upper bounds are nearly optimal w.r.t. T, K and \({\mathcal {T}}\).
Next we consider the dependence on \(\Vert f^*_i\Vert _{{\mathcal {H}}_i}\). Note that \(L_T(f^*_i)\) and U cannot be large simultaneously. If \(L_T(f^*_i)\) is large, then \(\Vert f^*_i\Vert _{{\mathcal {H}}_i}\) would be small, and we can ensure that U is small. Using Assumption 5, we have \(L_T(f^*_i)=O(\Vert f^*_i\Vert _{{\mathcal {H}}_i}T)\). Thus the additional cost would be \(O(\sqrt{U\Vert f^*_i\Vert _{{\mathcal {H}}_i}TK\ln {K}})\). Our bounds depend on \(O(\sqrt{U\Vert f^*_i\Vert _{{\mathcal {H}}_i}})\) and \(O(\Vert f^*_i\Vert ^2_{{\mathcal {H}}_i})\), which are worse than the lower bound in Theorem 7. Improving the dependence on \(\Vert f^*_i\Vert _{{\mathcal {H}}_i}\) is left to future work.
5 Experiments
In this section, we conduct numerical experiments to verify our theoretical results. As a whole, our goal is to verify the following results,
- (G 1): Online kernel selection improves the learning performance relative to online single kernel learning with an empirically preset kernel.
- (G 2): The memory sharing scheme is superior. Under the same memory constraint, our algorithms perform better than algorithms that do not share the memory.
- (G 3): In the worst case, time constraints are equivalent to memory constraints for the case of \(K < d\). Thus Algorithm 1 is also nearly optimal for online kernel selection with time constraints.
- (G 4): In the worst case, time constraints are different from memory constraints for the case of \(K > d\), that is, Algorithm 4 is better than Algorithm 1 for the case of \(K > d\).
We first state the experimental setting, and then show the experimental results for online kernel selection with memory constraints and time constraints, respectively.
5.1 Experimental setting
We compare our algorithms with the following baseline algorithms,
-
NORMA (Budgeted online kernel learning algorithm) (Kivinen et al. 2004)
-
BOGD (Budget online kernel learning algorithm) (Zhao et al. 2012)
-
OKS (Online Kernel Selection) (Yang et al. 2012)
-
OMKC (Online multi-kernel classification) (Hoi et al. 2013)
-
ISKA (Incremental sketched kernel alignment) (Zhang and Liao 2018)
-
BOMKR (Budget online multi-kernel regression) (Sahoo et al. 2014)
-
BOMKR-V (Variant of BOMKR).
The baseline algorithms for online classification include BOGD, OKS, OMKC and ISKA. The other algorithms including OKS are used for online regression.
We use 9 Gaussian kernels, \(\kappa (\mathbf{u},\mathbf{v})=\exp (-{\Vert \mathbf{u}-\mathbf{v}\Vert ^2}/{(2\sigma ^2)})\), with the kernel width \(\sigma\) chosen from \(2^{-4:1:4}\), i.e., \(\sigma \in \{2^{-4},2^{-3},\ldots ,2^{4}\}\). We adopt the best kernel function in hindsight for NORMA and BOGD. BOMKR-V is a variant of BOMKR obtained by changing the loss function. We test the algorithms on online regression and online classification tasks. The datasets, shown in Table 2, are downloaded from the WEKA and UCI machine learning repositories. ailerons-v, Hardware-v, Twitter-v and Adv-SUSY-v are constructed from ailerons, Hardware, Twitter and Adv-SUSY, respectively. For instance, we extract the first 6 features of ailerons to form ailerons-v. The goal is to obtain datasets with \(d < K\) (\(K=9\)). We preprocess Hardware and Twitter by dividing by the standard deviation. Note that we convert magic04, a9a and SUSY to adversarial datasets, denoted by Adv-magic04, Adv-a9a and Adv-SUSY. Our approach to constructing adversarial datasets is as follows: At each round \(t = 1,\ldots ,T\),
-
If \(t\le \lceil T/20\rceil\), the example in Adv-magic04 is identical to that in magic04.
-
If \(t\ge \lceil T/20\rceil +1\), we multiply the features of magic04 by \(2^{-3}\).
The same operation is applied to construct Adv-a9a and Adv-SUSY. There are two reasons for constructing adversarial datasets: (i) in online learning, the data may not be i.i.d., and may be provided by a malicious adversary; (ii) our theoretical results hold in the worst case. The three adversarial datasets essentially yield hard learning tasks.
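The construction can be written in a few lines of Python. This is a sketch under the assumption that the dataset is given as a list of feature rows in arrival order and that the labels are left unchanged; the function and parameter names are hypothetical.

```python
def make_adversarial(features, scale=2.0 ** -3, prefix_divisor=20):
    """Keep the first ceil(T / prefix_divisor) rows and rescale the features of the rest."""
    T = len(features)
    cutoff = (T + prefix_divisor - 1) // prefix_divisor  # integer ceil(T / 20)
    return [
        list(row) if t < cutoff else [scale * v for v in row]
        for t, row in enumerate(features)  # t is 0-based, so rounds 1..cutoff are kept
    ]
```

Shrinking the features by \(2^{-3}\) changes which Gaussian kernel width fits best after the first \(\lceil T/20\rceil\) rounds, so a method that commits to a kernel early is penalized.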
For online regression, we adopt the absolute loss \(\ell ({\hat{y}}_t,y)=\vert {\hat{y}}_t-y\vert\) except for NORMA and BOMKR. NORMA adopts the \(\varepsilon\)-insensitive absolute loss \(\ell ({\hat{y}}_t,y)=\max (0,\vert {\hat{y}}_t-y\vert -\varepsilon _t)+\nu \varepsilon _t\), and updates \(\varepsilon _t\) on the fly. For BOMKR, we adopt the version that uses NORMA as a sub-algorithm (Sahoo et al. 2014). We set \(\nu =0.5\) and \(\varepsilon _1=0.001\). For online classification, we adopt the hinge loss \(\ell ({\hat{y}}_t,y)=\max \{0,1-{\hat{y}}_ty\}\). We measure the Average Absolute Loss (AAL) defined by \(\mathrm {AAL}=\frac{1}{T}\sum ^T_{t=1}\vert {\hat{y}}_t-y_t\vert\) for online regression, and measure the Average Mistake Rate (AMR) defined by \(\mathrm {AMR}=\frac{1}{T}\sum ^T_{t=1}\mathbb {I}_{{\hat{y}}_t\ne y_t}\) for online classification. For OKS, we choose the smoothing parameter \(\delta \in \{0.2,0.02,0.002\}\). For all of the baseline algorithms, we set the stepsize of gradient descent to \({5}/{\sqrt{T}}\). The other hyper-parameters are set to the values recommended in the original papers. For PFMBooks, we set \(g(U_j,D_i)= U_j+0.1\), where \(D_i=1\) for the Gaussian kernel, and set \(\eta =\sqrt{8\ln (KMT)/T}\). For LKMBooks, we set \(\eta =\sqrt{8\ln (K)/T}\). All algorithms are implemented in R on a Windows machine with a 2.5 GHz Core(TM) i5-7200U CPU. To reduce the effect of randomness, we execute each experiment 20 times with random permutations of all datasets and average the results.
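For reference, the two performance measures and the hinge loss can be computed as in the following Python sketch. The experiments themselves were implemented in R; the function names here are hypothetical, and for AMR the predictions are assumed to be the already-thresholded labels \(\mathrm {sign}({\hat{y}}_t)\).

```python
def average_absolute_loss(y_pred, y_true):
    """AAL: mean of |y_hat_t - y_t| over all rounds (online regression)."""
    return sum(abs(p - y) for p, y in zip(y_pred, y_true)) / len(y_true)

def average_mistake_rate(labels_pred, labels_true):
    """AMR: fraction of rounds in which the predicted label is wrong."""
    mistakes = sum(1 for p, y in zip(labels_pred, labels_true) if p != y)
    return mistakes / len(labels_true)

def hinge_loss(score, y):
    """Hinge loss max{0, 1 - score * y} for a real-valued score and y in {-1, +1}."""
    return max(0.0, 1.0 - score * y)
```

Note that AMR counts mistakes while the regret bounds are stated for the cumulative loss, so a method can have a small hinge loss yet a different ranking under AMR.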
5.2 Memory constraints
5.2.1 Online regression
Let \({\mathcal {T}}\) be a given memory budget. According to Assumptions 2 and 3, we can reduce \({\mathcal {T}}\) to an example budget of size B. We must ensure that all algorithms have the same space complexity. Table 3 shows the results. Since OKS does not control the number of support vectors, we use a heuristic variant, called BOKS, which stops updating the hypothesis once the number of support vectors equals B. We use NORMA as the baseline, that is, for a memory budget \({\mathcal {T}}\), NORMA can use an example budget of size \(B_0\). The third row of Table 3 is the available budget of each algorithm, which depends on the relation between d and K. BOKS and BOMKR do not share the memory and maintain K different sets of support vectors. For LKMBooks and PFMBooks, we set \(\upsilon =\frac{1}{3}\) to satisfy \(2(1-\upsilon )T^{1-\upsilon } > B\) (see Theorems 2 and 4), and set the stepsize to the values in Theorems 3 and 6. For PFMBooks, we set \(U=\sqrt{B}\), \(U_{\min }=U/\sqrt{T}\) as stated in Theorem 6. Since LKMBooks and PFMBooks only satisfy the memory constraints with high probability, we stop updating hypotheses when the actual budget exceeds the available budget in Table 3.
Table 4 shows the empirical results. The bold entry in each column indicates the algorithm with the best performance. NORMA performs well on some datasets. There are two reasons: (i) we select the best kernel width in hindsight for NORMA, that is, we test all of the candidate kernel widths and select the one with minimal AAL; (ii) NORMA uses a good learning rate on those datasets. Tuning the learning rate is another problem of online learning algorithms. To avoid this issue, we set a fixed learning rate for the baseline algorithms and use the theoretical values for our algorithms. In the first column of Table 4, we give the optimal kernel width of NORMA on each dataset. For instance, NORMA-2 means that the optimal kernel width is \(\sigma =2\) on the housing dataset. The optimal kernel width differs across datasets. Thus if we empirically set a fixed kernel for all datasets, then NORMA will perform badly on some datasets. In contrast, the online kernel selection algorithms and online multi-kernel learning algorithms perform well on all datasets (except for BOKS). The results verify the first goal, G 1.
Next we analyze BOMKR. Since BOMKR does not share the support vectors, \(\forall i\in [K]\), the available budget for constructing \(\{f_{t,i}\}^T_{t=1}\) is \(\frac{B_0}{K}\ll B_0\). Thus BOMKR performs poorly. LKMBooks, PFMBooks and BOMKR-V share the support vectors, and their available budgets are \(B_0\), \(\frac{dB_0}{(d+K')}\) and \(\frac{dB_0}{(d+K)}\), respectively. Thus they perform well on all of the datasets. We also find that BOMKR-V performs worse than NORMA on some datasets. The main reason is that the learning rate of BOMKR-V is not well tuned. Since PFMBooks is applicable only for the case of \(K<d/\lceil \ln {T}\rceil\), we do not run it on the two low-dimensional datasets, housing and elevators. PFMBooks performs much better than all of the other algorithms on the Slice dataset. The reason is that PFMBooks is parameter-free and uses a suitable learning rate. For all of the other algorithms, including LKMBooks, we do not set a suitable learning rate for each individual dataset. The results verify the second goal, G 2.
5.2.2 Online classification
The overall parameter setting is the same as that of online regression, except that LKMBooks uses the same learning rate as the baseline algorithms, i.e., \(\lambda ={5}/{\sqrt{T}}\). Let \(U_{\min }=5\) for PFMBooks. For the hinge loss, if f satisfies \(\Vert f\Vert _{{\mathcal {H}}}<1\), then \(L_T(f)=\sum ^T_{t=1}(1-y_tf({\mathbf {x}}_t))=\varTheta (T)\). Thus we set \(U_{\min }>1\). OMKC is an algorithm framework, from which four algorithms are derived (Hoi et al. 2013). In the case of memory constraints, algorithms are allowed to incur a higher time cost. Thus we adopt \(\mathrm {OMKC}_{D,D}\), which has the best prediction performance but also the highest time cost among the four algorithms. We set the hyper-parameters of \(\mathrm {OMKC}_{D,D}\) to the values recommended in the original paper.
We still reduce \({\mathcal {T}}\) to an example budget of size B and ensure that all algorithms have the same space complexity. If the number of support vectors of \(\mathrm {OMKC}_{D,D}\) equals B, then we stop updating hypotheses. We use BOGD as the baseline, whose space complexity is O(Bd). Given a memory budget \({\mathcal {T}}\), BOGD can use an example budget of size \(B_0\). The space complexity of \(\mathrm {OMKC}_{D,D}\) is \(O(B(d+K))\). Thus \(B=\frac{dB_0}{d+K}\). The space complexity of ISKA is \(O(Bd+K)\). Thus \(B=B_0\). Table 3 gives the size of the example budget of the other algorithms.
Table 5 shows the empirical results. BOGD performs well on all datasets, since we select the optimal kernel width in hindsight. The first column shows that the optimal kernel width differs across datasets, which is consistent with the result of Table 4. Thus we conclude that, if BOGD is equipped with a fixed kernel function for all datasets, then it will perform worse than the other algorithms. The results verify G 1.
Next we analyze \(\mathrm {OMKC}_{D,D}\), which performs poorly on the last three datasets. We call the last three datasets hard datasets and call mushrooms an easy dataset, since the mistake rates are very small on mushrooms. Recall that \(\mathrm {OMKC}_{D,D}\) can use a budget of size \(\frac{dB_0}{d+K}\). \(\mathrm {OMKC}_{D,D}\) does not share the memory, and thus it allocates the budget over K hypothesis sequences, i.e., \(\{f_{t,i}\}^T_{t=1},i\in [K]\). In this way, each hypothesis sequence obtains a budget of size approximately \(\frac{1}{K}\cdot \frac{dB_0}{d+K}\). Thus it performs poorly on the hard datasets. For mushrooms, since the number of mistakes is very small, a small budget is enough. For instance, for the case of \(B_0=200\), the number of mistakes of \(\mathrm {OMKC}_{D,D}\) is roughly \(0.62\%\times T\approx 50\), where \(T=8124\). Thus the optimal hypothesis sequence \(\{f_{t,i^*}\}^T_{t=1}\) only needs a budget of size about 50. LKMBooks shares the memory and performs well on the hard datasets. The experimental results do not match our theoretical results well, since we measure the mistake rates rather than the average cumulative losses. Our theoretical results are regret bounds, not mistake bounds. Even so, the experimental results on the hard datasets still verify G 2.
ISKA also shares the memory and performs better than our algorithms on mushrooms and magic04, since it employs an elaborate removing strategy, while our algorithms just use simple randomized adding strategies. However, the regret bounds of ISKA do not reveal this superiority. We conjecture that data-dependent regret bounds can explain it. Besides, ISKA performs worse than our algorithms on the two adversarial datasets. The kernel selection procedure of ISKA consists of two phases. During the first phase, ISKA converges to an empirically optimal kernel. During the second phase, ISKA always chooses the empirically optimal kernel. The adversary can easily change the optimal kernel by scaling the features of instances and make ISKA converge to a bad kernel. Our algorithms randomly choose kernels and can converge to the optimal kernel defined on the whole dataset. Thus our algorithms are more robust than ISKA in adversarial environments.
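The budget conversion above is simple arithmetic, sketched here in Python. The sketch ignores the constants hidden in the big-O terms and rounds down to an integer, both of which are assumptions made for illustration; the function name is hypothetical.

```python
def example_budget(B0, d, per_example_cost):
    """Example budget B that matches the baseline memory of about B0 * d quanta.

    per_example_cost is the memory charged per stored example, e.g. d + K for
    an algorithm with space O(B * (d + K)) and d for one with space O(B * d + K).
    """
    return (d * B0) // per_example_cost
```

For example, with hypothetical values \(B_0=200\), \(d=10\) and \(K=9\), an algorithm with space \(O(B(d+K))\) receives `example_budget(200, 10, 19)`, i.e., 105 examples, whereas one with space \(O(Bd+K)\) keeps all 200.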
5.3 Time constraints
5.3.1 Online regression
Let \({\mathcal {T}}\) be a given time budget. We also enforce the time constraints by fixing the budget size. To be specific, we choose BOMKR as the baseline, where the budget is set to \(B_0\). Denote the average per-round running time of BOMKR by \(t_{\mathrm {p}}\). We tune the budget of the other algorithms so that their running time matches \(t_{\mathrm {p}}\). For BATBooks, we set the learning rate \(\eta =4\sqrt{\ln {K}/(K{\tilde{C}}_{T,*})}\), where \({\tilde{C}}_{T,*}\) is tuned by the doubling trick, \(U=B^{\frac{1}{3}}_0\) and \(\ell _{\max }=1\). For the parameter \(\upsilon\), we choose the maximal value from \(\{1/i\}_{i=3,4,\ldots ,12}\) that satisfies the condition (12). For the other algorithms, the parameter setting remains unchanged.
Table 6 shows the empirical results. First, we consider the results on four high-dimensional datasets, elevators, ailerons, Hardware and Twitter. In this case, we have \(K<d\). Within the same time budget, LKMBooks shows the best performance except for NORMA. Although LKMBooks is designed for memory constraints, it is still nearly optimal for time constraints. In the second and fifth columns, the available budgets of the algorithms are different, since the per-round time complexities are different. It may seem strange that BOKS has the maximal available budget. The reason is that BOKS allocates the available budget \(B_0\) to K hypotheses \(\{f_{t,i}\}^K_{i=1}\). Thus the available budget of each \(f_{t,i}\) is less than \(B_0\). The results verify the third goal, G 3.
Next we consider the four low-dimensional datasets, housing, ailerons-v, Hardware-v and Twitter-v. In this case, we have \(K>d\). Within the same time budget, BATBooks shows the best performance on all datasets except for NORMA. NORMA performs well, since it has the lowest time complexity and we set the optimal kernel width in hindsight. It is interesting that the available budget of BATBooks is similar to that of NORMA. The reason is that the two algorithms have nearly the same per-round time complexity, namely \(O(dB+K)\) and O(dB), respectively. BATBooks performs better than LKMBooks for the case of \(d<K\), which verifies the fourth goal, G 4.
5.3.2 Online classification
For LKMBooks, the parameters follow the setting in Sect. 5.2.2. For BATBooks, the parameters follow the setting in Sect. 5.3.1, except that the stepsize is set to \(\lambda =\frac{U\sqrt{(1+\upsilon )B}}{\sqrt{2(1-\upsilon )}LT}\), which is slightly different from that of Theorem 10. We choose \(\mathrm {OMKC}_{D,D}\) as the baseline, where the budget is set to \(B_0\). Let \(t_{\mathrm {p}}\) be the average per-round running time of \(\mathrm {OMKC}_{D,D}\). We tune the budget of the other algorithms so that their running time matches \(t_{\mathrm {p}}\).
Table 7 shows the empirical results. We first consider the results on two high-dimensional datasets, mushrooms and Adv-a9a, in which \(K\ll d\). Within the same time budget, LKMBooks performs better than BATBooks. For Adv-SUSY, we have \(K\approx d\) (\(K=9, d=18\)). LKMBooks shows similar performance to BATBooks. The same result holds for Adv-magic04, in which \(K=9\) and \(d=10\). Besides, \(\mathrm {OMKC}_{D,D}\) performs much better than the other algorithms on mushrooms. The reason is the same as in the analysis of mushrooms in Sect. 5.2.2. As a whole, for the case of \(K\le d\), LKMBooks performs well on most datasets. The results verify G 3.
Next we consider the two low-dimensional datasets, cod-rna and Adv-SUSY-v, in which \(d<K\). We find that LKMBooks performs slightly better than BATBooks on cod-rna, and performs worse than BATBooks on Adv-SUSY-v. The results do not fully verify G 4. There may be two reasons: (i) for cod-rna, we have \(d\approx K\) (\(d=8, K=9\)); (ii) the performance measure is the mistake rate, not the average cumulative loss. Even so, our algorithms still perform better than \(\mathrm {OMKC}_{D,D}\) and ISKA.
6 Conclusion and discussion
In this paper, we studied the computationally budgeted online kernel selection, where the kernel selection and online prediction procedures face memory constraints or time constraints. We separately proved a lower bound on the regret under the two kinds of computational constraints, and developed several simple algorithms that nearly achieve the lower bounds. We also identified the condition under which online kernel selection with a time constraint is different from that with a memory constraint.
This work will open up many directions for future research. One of the most important research is to identify the sufficient conditions under which a constant computational constraint can achieve a sub-linear regret bound. Model selection aims at choosing the inductive bias that matches the data and improving the learning performance of algorithms. Thus the worst-case regret guarantees do not reveal the essence of model selection. The sufficient conditions play the role of inductive bias. To this end, it is necessary to establish some kind of data-dependent regret bounds. Although many work has focus on achieving data-dependent regret bounds for general online learning problem, such as prediction with expert advice, multi-armed bandit problems, online convex optimization and so on, few of them considers the computational constraints.
The worst-case regret analysis also needs further study. For the case of memory constraints and \(K>d/\ln \sqrt{T}\), our algorithm cannot adapt to the norm of the competitor. Thus the regret bound is far from optimal in terms of \(\Vert f^*_i\Vert _{{\mathcal {H}}_i}\). For the case of time constraints and \(K>d\), if \({\mathcal {T}} =\omega (T/K)\), then there is a gap of order \(\sqrt{K}\) between the lower bound and the upper bound. It is necessary to study whether this gap can be removed. Besides, in this case the algorithm also cannot adapt to the norm of the competitor.
Availability of data and material
All data and materials as well as custom code support our claims and comply with field standards.
Notes
The worst-case regret is the regret that holds on any sequence of examples, that is, \(\max _{({\mathbf {x}}_t,y_t)_{t=1,\ldots ,T}}\mathrm {Reg}({\mathcal {H}}_i)\). We aim to prove a lower bound on \(\max _{({\mathbf {x}}_t,y_t)_{t=1,\ldots ,T}}\mathrm {Reg}({\mathcal {H}}_i)\) that holds for any algorithm, that is, a lower bound on \(\min _{\{f_t\}_{t=1,\ldots ,T}}\max _{({\mathbf {x}}_t,y_t)_{t=1,\ldots ,T}}\mathrm {Reg}({\mathcal {H}}_i)\), and to design algorithms enjoying a corresponding upper bound.
References
Agarwal, A., Duchi, J.C., Bartlett, P.L., & Levrard, C. (2011). Oracle inequalities for computationally budgeted model selection. In Proceedings of the 24th Annual Conference on Learning Theory (pp. 69–86).
Agarwal, A., Luo, H., Neyshabur, B., & Schapire, R.E. (2017). Corralling a band of bandit algorithms. In Proceedings of the 30th Annual Conference on Learning Theory (pp. 12–38).
Bubeck, S., & Cesa-Bianchi, N. (2012). Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1), 1–122.
Bubeck, S., Devanur, N. R., Huang, Z., & Niazadeh, R. (2019). Multi-scale online learning: Theory and applications to online auctions and pricing. Journal of Machine Learning Research, 20(62), 1–37.
Cesa-Bianchi, N., & Lugosi, G. (2006). Prediction, learning, and games. Cambridge University Press.
Cesa-Bianchi, N., Mansour, Y., & Shamir, O. (2015). On the complexity of learning with kernels. In Proceedings of the 28th Annual Conference on Learning Theory (pp. 297–325).
Crammer, K., Kandola, J. S., & Singer, Y. (2003). Online classification on a budget. Advances in Neural Information Processing Systems, 16, 225–232.
Cutkosky, A., & Boahen, K. (2016). Online convex optimization with unconstrained domains and losses. Advances in Neural Information Processing Systems, 29, 748–756.
Dekel, O., Shalev-Shwartz, S., & Singer, Y. (2008). The forgetron: A kernel-based perceptron on a budget. SIAM Journal on Computing, 37(5), 1342–1372.
Foster, D. J., Kale, S., Mohri, M., & Sridharan, K. (2017). Parameter-free online learning via model selection. Advances in Neural Information Processing Systems, 30, 6022–6032.
Foster, D. J., Krishnamurthy, A., & Luo, H. (2019). Model selection for contextual bandits. Advances in Neural Information Processing Systems, 32, 14741–14752.
Hoi, S. C. H., Jin, R., Zhao, P., & Yang, T. (2013). Online multiple kernel classification. Machine Learning, 90(2), 289–316.
Jézéquel, R., Gaillard, P., & Rudi, A. (2019). Efficient online learning with kernels for adversarial large scale problems. Advances in Neural Information Processing Systems, 32, 9427–9436.
Jin, R., Hoi, S.C.H., & Yang, T. (2010). Online multiple kernel learning: Algorithms and mistake bounds. In Proceedings of the 21st International Conference on Algorithmic Learning Theory (pp. 390–404).
Kivinen, J., Smola, A. J., & Williamson, R. C. (2004). Online learning with kernels. IEEE Transactions on Signal Processing, 52(8), 2165–2176.
Koppel, A., Warnell, G., Stump, E., & Ribeiro, A. (2019). Parsimonious online learning with kernels via sparse projections in function space. Journal of Machine Learning Research, 20(3), 1–44.
Kothari, P.K., & Livni, R. (2020). On the expressive power of kernel methods and the efficiency of kernel learning by association schemes. In Proceedings of the 31st International Conference on Algorithmic Learning Theory (pp. 422–450).
Lu, J., Hoi, S. C. H., Wang, J., Zhao, P., & Liu, Z. (2016). Large scale online kernel learning. Journal of Machine Learning Research, 17(47), 1–43.
McMahan, B., & Abernethy, J. (2013). Minimax optimal algorithms for unconstrained linear optimization. Advances in Neural Information Processing Systems, 26, 2724–2732.
McMahan, H.B., & Orabona, F. (2014). Unconstrained online linear learning in Hilbert spaces: Minimax algorithms and normal approximations. In Proceedings of The 27th Conference on Learning Theory (pp. 1020–1039).
Muthukumar, V., Ray, M., Sahai, A., & Bartlett, P. (2019). Best of many worlds: Robust model selection for online supervised learning. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (pp. 3177–3186).
Nguyen, T.D., Le, T., Bui, H., & Phung, D. (2017). Large-scale online kernel learning with random feature reparameterization. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (pp. 2543–2549).
Orabona, F. (2013). Dimension-free exponentiated gradient. Advances in Neural Information Processing Systems, 26, 1806–1814.
Orabona, F., Keshet, J., & Caputo, B. (2009). Bounded kernel-based online learning. Journal of Machine Learning Research, 10, 2643–2666.
Sahoo, D., Hoi, S.C.H., & Li, B. (2014). Online multiple kernel regression. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD (pp. 293–302).
Seldin, Y., Bartlett, P.L., Crammer, K., & Abbasi-Yadkori, Y. (2014). Prediction with limited advice and multiarmed bandits with paid observations. In Proceedings of the 31st International Conference on Machine Learning (pp. 280–287).
Wang, Z., Crammer, K., & Vucetic, S. (2012). Breaking the curse of kernelization: Budgeted stochastic gradient descent for large-scale SVM training. Journal of Machine Learning Research, 13(1), 3103–3131.
Yang, T., Mahdavi, M., Jin, R., Yi, J., & Hoi, S.C.H. (2012). Online kernel selection: Algorithms and evaluations. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence (pp. 1197–1202).
Zhang, L., Yi, J., Jin, R., Lin, M., & He, X. (2013). Online kernel learning with a near optimal sparsity bound. In Proceedings of the 30th International Conference on Machine Learning (pp. 621–629).
Zhang, X., & Liao, S. (2018). Online kernel selection via incremental sketched kernel alignment. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (pp. 3118–3124).
Zhang, X., & Liao, S. (2020). Hypothesis sketching for online kernel selection in continuous kernel space. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (pp. 2498–2504).
Zhao, P., Wang, J., Wu, P., Jin, R., & Hoi, S.C.H. (2012). Fast bounded online gradient descent algorithms for scalable kernel-based online learning. In Proceedings of the 29th International Conference on Machine Learning (pp. 1075–1082).
Funding
This work was supported in part by the National Natural Science Foundation of China under Grants No. 62076181.
Author information
Contributions
The two authors contributed equally to the study conception and design. The first draft of the manuscript was written by Junfan Li, and the second author commented on previous versions of the manuscript. Both authors read and approved the final manuscript.
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to declare that are relevant to the content of this article.
Consent for publication
Not applicable.
Ethical approval
Not applicable.
Code availability
custom code.
Consent to participate
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Editors: Yu-Feng Li, Mehmet Gönen, Kee-Eung Kim.
Appendices
Appendix
Proof of Theorem 1
Proof
We use the hinge loss \(\ell (u,y)=\max \{0,1-yu\}\) as an example. Our analysis is also applicable to the absolute loss. We select K Gaussian kernel functions \(\kappa _i({\mathbf {x}},{\mathbf {z}})=\exp (-\frac{\Vert {\mathbf {x}}-{\mathbf {z}}\Vert ^2_2}{\sigma _i})\), \(i=1,\ldots ,K\) as the candidates. Without loss of generality, we assume that \(0<\sigma _1<\sigma _2<\ldots <\sigma _K\). Our proof is based on a sequence of instances \({\mathcal {S}}=\{{\mathbf {x}}_t\}^T_{t=1}\), such that
where D is a constant. For \(r=1,\ldots ,K\) and \(i\ne j\), we have
where \(c_1< \ldots<c_K< 1\). In the Euclidean space \(\mathbb {R}^d\), we can always find \(d+1\) points satisfying this property. Note that we do not require the instances to be orthogonal or approximately orthogonal in the RKHSs, which differs from the techniques adopted by Dekel et al. (2008), Zhang et al. (2013) and Cesa-Bianchi et al. (2015). We assume that \(T\le d+1\) and that T is even. Next we design the strategy by which the adversary sends examples to the learner.
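The displayed property is missing from this copy of the proof, so the following is only a sketch under an assumption: that the instances are pairwise equidistant, \(\Vert {\mathbf {x}}_i-{\mathbf {x}}_j\Vert _2=D\) for \(i\ne j\), so that each Gaussian kernel takes a single off-diagonal value \(c_r=\exp (-D^2/\sigma _r)\). The helper names `equidistant_points` and `gaussian_kernel` are hypothetical. The code builds \(d+1\) such points from the standard basis plus one extra point and evaluates the resulting \(c_r\).

```python
import math
from itertools import combinations

def equidistant_points(d, D=1.0):
    """Return d+1 points in R^d whose pairwise Euclidean distances all equal D."""
    scale = D / math.sqrt(2.0)
    # Scaled standard basis vectors: pairwise distance sqrt(2) * scale = D.
    points = [[scale if j == i else 0.0 for j in range(d)] for i in range(d)]
    # Extra point c * (1, ..., 1), with c solving d*c^2 - 2c - 1 = 0,
    # which is also at distance sqrt(2) from every basis vector before scaling.
    c = (1.0 + math.sqrt(1.0 + d)) / d
    points.append([scale * c] * d)
    return points

def gaussian_kernel(x, z, sigma):
    """Gaussian kernel in the paper's parameterization exp(-||x - z||^2 / sigma)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / sigma)

d, D = 5, 1.5
sigmas = [0.5, 1.0, 2.0, 4.0]  # assumed widths with sigma_1 < ... < sigma_K
pts = equidistant_points(d, D)
dists = [math.dist(x, z) for x, z in combinations(pts, 2)]
c_values = [gaussian_kernel(pts[0], pts[1], s) for s in sigmas]
```

Under this assumption the ordering \(0<\sigma _1<\ldots <\sigma _K\) translates directly into \(c_1<\ldots <c_K<1\), which is the ordering the proof uses.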
Before the game, the adversary assigns a label \(y_t\) for each instance \({\mathbf {x}}_t\), satisfying \(y_t=1\) if t is odd, otherwise, \(y_t=-1\). Define a sequence of example pairs \(s_i=\{({\mathbf {x}}_i,y_i),({\mathbf {x}}_{i+1},y_{i+1})\}\), where \(i=1,3,5,\ldots\). The adversary assigns the examples \(\{({\mathbf {z}}_t,y'_t)\}^T_{t=1}\) as follows,
- Case 1 \(T\le 2\mathrm {e}^{\frac{1}{4}}B\). If t is odd, the adversary selects \(({\mathbf {z}}_t,y'_t)\in s_t\) uniformly. Otherwise, the adversary assigns \(({\mathbf {z}}_t,y'_t)\in s_{t-1}\setminus \{({\mathbf {z}}_{t-1},y'_{t-1})\}\).
- Case 2 \(T\ge 2\mathrm {e}^{\frac{1}{4}}B+1\). If \(t\le 2B\) and t is odd, the adversary selects \(({\mathbf {z}}_t,y'_t)\in s_t\) uniformly. If \(t\le 2B\) and t is even, the adversary assigns \(({\mathbf {z}}_t,y'_t)\in s_{t-1}\setminus \{({\mathbf {z}}_{t-1},y'_{t-1})\}\). If \(t\ge 2B+1\), the adversary divides the time horizon \(\{2B+1,\ldots ,T\}\) into consecutive epochs of length m, except for the last epoch. We require that m is even. Assume there are \(\varDelta +1\) epochs. Let \(m= \lceil \frac{T-2B}{(2\mathrm {e}^{\frac{1}{4}}-2)B+1}\rceil\). If m is odd, then we replace m with \(m+1\). Thus \(\varDelta = \lfloor \frac{T-2B}{m}\rfloor\). If the length of the last epoch is odd, then we add one more new example. For the r-th epoch, denote the start point as \(s_r = (r-1)m+2B+1\) and the end point as \(e_r = rm+2B\). If \(t= s_r\), then the adversary selects \(({\mathbf {z}}_{t},y'_{t}) \in s_{t}\) uniformly, and assigns \(({\mathbf {z}}_{t+1},y'_{t+1}) \in s_{t}\setminus \{({\mathbf {z}}_{t},y'_{t})\}\). If \(t=s_r+2n,n=1,2,\ldots ,\frac{m}{2}-1\), the adversary first constructs an example pair \({\bar{s}}_{t} = \{(\bar{{\mathbf {x}}}_{t},{\bar{y}}_t),(\bar{{\mathbf {x}}}_{t+1},{\bar{y}}_{t+1})\}\). The adversary samples \((\bar{{\mathbf {x}}}_{t},{\bar{y}}_t)\) from \(S_r\) uniformly, where \(S_r\) is the set of examples selected up to the end of the \(s_r\)-th round, and then samples \((\bar{{\mathbf {x}}}_{t+1},{\bar{y}}_{t+1})\) from \(S_{r}\setminus \{({\mathbf {x}},y)\in S_r:y={\bar{y}}_t\}\) uniformly. After that, the adversary selects \(({\mathbf {z}}_{t},y'_{t}) \in {\bar{s}}_{t}\) uniformly, and assigns \(({\mathbf {z}}_{t+1},y'_{t+1}) \in {\bar{s}}_{t}\setminus \{({\mathbf {z}}_{t},y'_{t})\}\).
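The epoch bookkeeping of Case 2 can be illustrated with a short sketch. It only computes the epoch length m, the number of full epochs \(\varDelta\), and the start and end rounds \(s_r, e_r\) from T and B as described above; the function name `epoch_schedule` and the example values of T and B are assumptions made for illustration, and the budget symbol in the formula for m is read as B.

```python
import math

def epoch_schedule(T, B):
    """Epoch length m, number of full epochs Delta, and (s_r, e_r) pairs for Case 2."""
    denom = (2.0 * math.exp(0.25) - 2.0) * B + 1.0
    m = math.ceil((T - 2 * B) / denom)
    if m % 2 == 1:
        m += 1  # the construction requires an even epoch length
    delta = (T - 2 * B) // m
    # r-th epoch: s_r = (r - 1) * m + 2B + 1 and e_r = r * m + 2B.
    epochs = [((r - 1) * m + 2 * B + 1, r * m + 2 * B) for r in range(1, delta + 1)]
    return m, delta, epochs

T, B = 1000, 20  # example values satisfying T >= 2 * e^{1/4} * B + 1
m, delta, epochs = epoch_schedule(T, B)
upper = (2.0 * math.exp(0.25) - 2.0) * B + 1.0
```

Because m is at least \((T-2B)/((2\mathrm {e}^{\frac{1}{4}}-2)B+1)\), the number of full epochs never exceeds \((2\mathrm {e}^{\frac{1}{4}}-2)B+1\), which is the fact used later when the losses are summed over epochs.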
Let \(S_{t}\) be the budget maintained by the learner at the beginning of round t, satisfying \(\vert S_{t}\vert \le B\). The hypothesis \(f_t\) used by the learner has the form \(f_t = \sum _{{\mathbf {x}}_\tau \in S_{t,I_t}}a_\tau \kappa _{I_t}({\mathbf {x}}_\tau ,\cdot )\), where \(I_t\in [K]\) is the index of the kernel function selected by the learner, and \(S_{t,I_t}\) is the budget allocated to \(\kappa _{I_t}\), satisfying \(\bigcup ^K_{i=1}S_{t,i}=S_t\). Note that it is possible that \(S_{t,1}=\ldots =S_{t,K}\).
Case 1 \(T\le 2\mathrm {e}^{\frac{1}{4}}B\). If t is odd, then it is easy to verify that \(f_t({\mathbf {z}}_t) = f_t({\mathbf {x}}_t) = f_t({\mathbf {x}}_{t+1})\). Thus the expected loss of the learner is
If t is even, we have \(\ell _t(f_t({\mathbf {z}}_t),y_t)\ge 0\). Thus the cumulative loss of the learner is larger than \(\frac{T}{2}\). For each \({\mathcal {H}}_i\), let the optimal hypothesis be \(f^*_i = \sum ^T_{\tau =1}a_{\tau }\kappa _i({\mathbf {x}}_{\tau },\cdot )\). Next we need to solve for the coefficients \(a_1,\ldots ,a_T\).
First we require \(f^*_i\) satisfying condition (13),
From the above condition, we can obtain the relation \(f^*_i({\mathbf {x}}_{t}) = f^*_i({\mathbf {x}}_{t+2})\) for any t, i.e.,
Since \(c_i\ne 1\), we have \(a_t = a_{t+2}\). Thus \(f^*_i\) has the form
Furthermore, substituting \(f^*_i({\mathbf {x}}_1)=1\) and \(f^*_i({\mathbf {x}}_2)=-1\) yields conditions (14) and (15),
Since we assume that T is even, solving the above two equations produces
Thus the optimal hypothesis \(f^*_i\) is
which satisfies \(\sum ^T_{t=1}\ell (f^*_i({\mathbf {x}}_t),y_t) = 0\), and
Then the regret of any budgeted online kernel selection algorithm can be bounded as follows
where \(L=\max _t\Vert -y_t\kappa _i({\mathbf {x}}_t,\cdot )\Vert _{{\mathcal {H}}_i} = 1\).
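The first step of Case 1 can be checked with a small sketch. The display giving the learner's expected loss is missing from this copy, so the code reflects one reading of the argument, not the authors' exact formula: on an odd round the learner's prediction u is the same for both candidate examples, whose labels are \(+1\) and \(-1\) with probability \(1/2\) each, so the expected hinge loss is \(\frac{1}{2}\max \{0,1-u\}+\frac{1}{2}\max \{0,1+u\}\ge 1\). The function names are hypothetical.

```python
import random

def hinge(u, y):
    """Hinge loss max{0, 1 - y * u}."""
    return max(0.0, 1.0 - y * u)

def expected_odd_round_loss(u):
    """Expected hinge loss when the label is +1 or -1 with probability 1/2 each
    and the prediction u does not depend on which example was drawn."""
    return 0.5 * hinge(u, 1) + 0.5 * hinge(u, -1)

def case1_adversary(T, rng):
    """Yield (instance index, label) for rounds 1..T (T even), following Case 1."""
    for t in range(1, T, 2):
        pair = [(t, 1), (t + 1, -1)]  # s_t: odd index labelled +1, even index labelled -1
        first = rng.choice(pair)      # odd round: uniform draw from the pair
        pair.remove(first)
        yield first
        yield pair[0]                 # even round: the remaining example

losses = [expected_odd_round_loss(u / 10) for u in range(-50, 51)]
rounds = list(case1_adversary(10, random.Random(0)))
```

Summing the bound over the \(T/2\) odd rounds gives the cumulative loss of at least \(T/2\) used above, whatever values the learner predicts.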
Case 2 \(T\ge 2\mathrm {e}^{\frac{1}{4}}B+1\). For the first 2B rounds, the expected cumulative loss of any algorithm is larger than B. For \(t=2B+1,\ldots ,T\), we first analyze the expected loss in a fixed epoch. In the r-th epoch, \(r=1,\ldots ,\varDelta\), if \(t=s_r\), then the expected instantaneous loss is larger than 1. If \(t=s_r+2n,n=1,2,\ldots ,\frac{m}{2}-1\), the probability that \({\mathbf {z}}_t\) and \({\mathbf {z}}_{t+1}\) are not in \(S_{t,I_t}\) is
where \(B_{I_t,y'_t}\) is the number of examples in \(S_{t,I_t}\) whose label is \(y'_t\). In this case, we still have \(f_t({\mathbf {z}}_t) = f_t(\bar{{\mathbf {x}}}_{t}) = f_t(\bar{{\mathbf {x}}}_{t+1})\). Thus at round \(t=s_r+2n\), the expected instantaneous loss is larger than \(1-\frac{2B}{\vert S_r\vert }\). The expected loss in the r-th epoch satisfies
Summing over \(t=1,2,\ldots ,T\) gives
where we use the fact \(\varDelta = \left\lfloor \frac{T-2B}{m}\right\rfloor \le (2\mathrm {e}^{\frac{1}{4}}-2)B+1.\) The optimal hypothesis \(f^*_i\) is
According to the analysis in Case 1, we have \(\sum ^T_{t=1}\ell _t(f^*_i({\mathbf {x}}_t),y_t)=0\), and
where we omit the constant 4 in the square root. A lower bound on the expected cumulative loss of any budgeted online kernel selection algorithm is as follows,
We can verify that
Thus the expected regret can be lower bounded as follows
Combining the two cases gives the desired lower bound. \(\square\)
Proof of Theorem 2
Before giving the detailed proof, we state an important lemma.
Lemma 1
(Bernstein’s inequality for martingales)
Let \(X_1,\ldots ,X_n\) be a bounded martingale difference sequence with respect to the filtration \({\mathcal {F}}=({\mathcal {F}}_i)_{1\le i\le n}\), with \(\vert X_i\vert \le a\). Let \(S_i=\sum ^i_{j=1}X_j\) be the associated martingale. Denote the sum of the conditional variances by
Then for all constants \(a,v>0\), with probability at least \(1-\delta\),
Lemma 1 is derived from Lemma 1.8 in (Cesa-Bianchi and Lugosi 2006).
Proof
First, assume that \(B<T\). In this case, there exists a \(\upsilon \in [0,1)\) such that \(B<(1-\upsilon )T^{(1-\upsilon )}\). Thus \(\mathbb {P}[\rho _t=1]=\frac{B}{(1-\upsilon )T^{(1-\upsilon )}(\vert E_t\vert +1)^{\upsilon }}\). After the \((T-1)\)-th round, the number of support vectors in S satisfies \(\vert S\vert = \sum ^{T-1}_{t=1}\mathbb {I}_{\rho _{t}=1}\). Define a random variable \(X_{t}\) as follows
Conditioned on \(\rho _1,\ldots ,\rho _{t-1}\), it can be verified that \(\mathbb {E}[X_t]=0\) and \(\vert X_t\vert \le 1\). Thus \(X_{1},\ldots ,X_{T-1}\) form a bounded martingale difference sequence. The sum of conditional variances satisfies
Using Lemma 1, with probability at least \(1-\delta\),
Next we consider \(B= T\). In this case, there is no \(\upsilon\) satisfying \(B<(1-\upsilon )T^{1-\upsilon }\). Thus \(\mathbb {P}[\rho _{t}=1]=1, t\in E_t\). We have \(\vert S\vert \le T= B\). Combining the two cases concludes the proof. \(\square\)
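The sampling rule in the case \(B<T\) can be simulated to see how the budget size behaves. This is an illustrative sketch, not the algorithm as stated in the paper: it assumes every round has a nonzero gradient, so that \(\vert E_t\vert =t-1\), and the names `sampling_prob` and `simulate_budget` and the values of B, T and \(\upsilon\) are hypothetical. The expected budget size is the sum of the sampling probabilities, and an integral comparison, \(\sum ^{n}_{k=1}k^{-\upsilon }\le n^{1-\upsilon }/(1-\upsilon )\), shows that this sum is at most \(B/(1-\upsilon )^2\). This elementary bound on the expectation should not be read as the high-probability statement of Theorem 2.

```python
import random

def sampling_prob(B, T, upsilon, num_updates):
    """P[rho_t = 1] from the proof, with num_updates playing the role of |E_t|."""
    return B / ((1 - upsilon) * T ** (1 - upsilon) * (num_updates + 1) ** upsilon)

def simulate_budget(B, T, upsilon, rng):
    """Number of stored examples after T-1 rounds, assuming |E_t| = t - 1."""
    size = 0
    for t in range(1, T):
        if rng.random() < sampling_prob(B, T, upsilon, t - 1):
            size += 1
    return size

B, T, upsilon = 20, 2500, 0.5  # satisfies B < (1 - upsilon) * T^(1 - upsilon) = 25
expected_size = sum(sampling_prob(B, T, upsilon, k) for k in range(T - 1))
rng = random.Random(0)
sizes = [simulate_budget(B, T, upsilon, rng) for _ in range(200)]
mean_size = sum(sizes) / len(sizes)
```

The martingale argument in the proof controls how far \(\vert S\vert\) can deviate from this expectation, which is where Lemma 1 enters.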
Proof of Theorem 3
Proof
Let \({\mathbf {r}}\in \varDelta _{K-1}\). We consider the regret w.r.t. any \(f\in {\mathcal {H}}_{\kappa _{{\mathbf {r}}}}\). We split the regret into two components,
where the last inequality is derived from (6). According to Theorem 2.2 in Cesa-Bianchi and Lugosi (2006), with \(\eta =\sqrt{8\ln (K)/T}\), the first term can be bounded as follows,
Next we analyze \(\varXi _2\). Recalling that any \(f\in {\mathcal {H}}_{\kappa _{{\mathbf {r}}}}\) can be represented as follows
where \(f_i=\sum ^T_{t=1}\alpha _t\phi ^\top _{\kappa _i}({\mathbf {x}}_t)\). Thus \(\varXi _2\) can be rewritten as follows,
If \(B<T\), then the standard analysis of online gradient descent with a constant learning rate, i.e., \(\lambda _t=\lambda\), yields
where \(\lambda ={\sqrt{(1+\upsilon )B}}/{(\sqrt{(1-\upsilon )D}LT)}\).
If \(B=T\), which implies \(\mathbb {P}[\rho _{t}=1]=1\) for \(t\in E_t\), then
where \(\lambda =1/(\sqrt{DT}L)\). Let \(\mathbf{r}\) satisfy \(r_i=1\). Combining the bounds on \(\varXi _1\) and \(\varXi _2\) yields
Replacing f with \(f^*_i\) and \(B=\alpha {\mathcal {T}}\) concludes the proof. \(\square\)
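The bound on \(\varXi _1\) relies on the exponentially weighted average forecaster with \(\eta =\sqrt{8\ln (K)/T}\). The sketch below is a generic version of that forecaster, not the paper's algorithm: it assumes the per-round losses of the K experts lie in \([0,1]\), uses randomly generated losses for illustration, and measures the learner by the weighted average loss. Under these assumptions the guarantee of Theorem 2.2 in Cesa-Bianchi and Lugosi (2006) is a regret of at most \(\sqrt{(T/2)\ln K}\).

```python
import math
import random

def hedge_regret(losses, eta):
    """Regret of the exponentially weighted forecaster on a T x K loss matrix in [0, 1]."""
    K = len(losses[0])
    cum = [0.0] * K          # cumulative loss of each expert
    learner_loss = 0.0
    for row in losses:
        base = min(cum)      # shift for numerical stability; normalized weights are unchanged
        weights = [math.exp(-eta * (c - base)) for c in cum]
        total = sum(weights)
        learner_loss += sum(w * l for w, l in zip(weights, row)) / total
        cum = [c + l for c, l in zip(cum, row)]
    return learner_loss - min(cum)

T, K = 2000, 5
rng = random.Random(1)
losses = [[rng.random() for _ in range(K)] for _ in range(T)]
eta = math.sqrt(8 * math.log(K) / T)
regret = hedge_regret(losses, eta)
bound = math.sqrt(T / 2 * math.log(K))
```

The guarantee is deterministic, so `regret <= bound` holds for any loss matrix with entries in \([0,1]\), not only for the random one generated here.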
Proof of Theorem 5
Proof
According to the analysis in (Bubeck et al. 2019), the probability updating of \({\mathcal {E}}(K')\) is equivalent to the following online mirror descent
where \(\psi _t(\mathbf{p})=\frac{1}{\eta _t}\sum ^{K'}_{k=1}g(U_{h^*(k)_2},D_{h^*(k)_1}) p_k\ln {p_k}\) is the weighted negative entropy regularizer, and \({\mathcal {D}}_{\psi _t}(\mathbf{u},\mathbf{v}) =\psi _t(\mathbf{u})-\psi _t(\mathbf{v})-\langle \nabla \psi _t(\mathbf{v}),\mathbf{u}-\mathbf{v}\rangle\) is Bregman divergence. Let \(\mathbf{u}\in \varDelta _{K'-1}\). The expected regret w.r.t. any competitor \(\mathbf{u}\in \varDelta _{K'-1}\) is as follows
where we use a constant learning rate, i.e., \(\eta _t=\eta\). Next we analyze the two terms separately.
The first derivative of the regularizer w.r.t. \(p_k\) is
The Bregman divergence between any \(\mathbf{u},\mathbf{v}\in \varDelta _{K'-1}\) is
Thus the first term can be rewritten as follows
Next we analyze the second term. We use the updating rule of \({\mathcal {E}}(K')\) (see Algorithm 3).
where we use the fact \(\exp (-x)\le 1-x+\frac{x^2}{2}\) for \(x\ge 0\) and the definition of \({\bar{p}}_{t+1,i}\). Combining with the two terms, we obtain
Denote \(A_{\min }=\{k_{\min }\in [K'], k_{\min }=\mathrm {argmin}_{k\in [K']}g(U_{h^*(k)_2},D_{h^*(k)_1},Y)\}\). Let the initial distribution \(\mathbf{p}_{1}\) satisfy \(p_{1,k}=(1-\frac{1}{U\sqrt{T}})\frac{1}{\vert A_{\min }\vert }+\frac{1}{K'U\sqrt{T}}\) for \(k\in A_{\min }\), and \(p_{1,k}=\frac{1}{K'U\sqrt{T}}\) for \(k\notin A_{\min }\). We compare with the i-th action. Let \(u_i=1\) and \(u_k=0\) for \(k\ne i\). Then we have
Let \(C_{T,i}:=\sum ^T_{t=1}c_{t,i}\). Subtracting \(C_{T,i}\) on both sides yields
where \(g_{\min }=\min _{k\in [K']}g(U_{h^*(k)_2},D_{h^*(k)_1},Y)\) and \(g_{\max }=\max _{k\in [K']}g(U_{h^*(k)_2},D_{h^*(k)_1},Y)\). Using Assumption 5, we have \(g_{\max }=\max _{k\in [K']}g(U_{h^*(k)_2},D_{h^*(k)_1},Y)=\varTheta (\max _jU_j+1)\). Besides, \(\max _jU_j = U=\varTheta (\sqrt{B})\) and \(B\le T\) (see Assumption 4) and \(g_{\min }=U_1=\varTheta (U/\sqrt{T})\). Omitting the lower order terms, we complete the proof. \(\square\)
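The weighted negative entropy regularizer used in this proof has a simple Bregman divergence, which can be checked numerically. Writing \(g_k\) as shorthand for the weight attached to \(p_k\) and using a constant \(\eta\), the definition \({\mathcal {D}}_{\psi }(\mathbf{u},\mathbf{v})=\psi (\mathbf{u})-\psi (\mathbf{v})-\langle \nabla \psi (\mathbf{v}),\mathbf{u}-\mathbf{v}\rangle\) with \(\partial \psi /\partial p_k=\frac{g_k}{\eta }(\ln p_k+1)\) simplifies to \(\frac{1}{\eta }\sum _k g_k(u_k\ln \frac{u_k}{v_k}-u_k+v_k)\). The sketch compares the two expressions; the function names and the sample values of g, u, v and \(\eta\) are assumptions, and the closed form is derived here from the definition, not copied from the missing displays.

```python
import math

def psi(p, g, eta):
    """Weighted negative entropy (1 / eta) * sum_k g_k * p_k * ln(p_k)."""
    return sum(gk * pk * math.log(pk) for gk, pk in zip(g, p)) / eta

def grad_psi(p, g, eta):
    """Partial derivatives (g_k / eta) * (ln(p_k) + 1)."""
    return [gk * (math.log(pk) + 1.0) / eta for gk, pk in zip(g, p)]

def bregman_generic(u, v, g, eta):
    """Bregman divergence computed directly from its definition."""
    inner = sum(gr * (uk - vk) for gr, uk, vk in zip(grad_psi(v, g, eta), u, v))
    return psi(u, g, eta) - psi(v, g, eta) - inner

def bregman_closed_form(u, v, g, eta):
    """Simplified form: a weighted, unnormalized relative entropy."""
    return sum(gk * (uk * math.log(uk / vk) - uk + vk) for gk, uk, vk in zip(g, u, v)) / eta

u = [0.2, 0.3, 0.5]
v = [0.5, 0.25, 0.25]
g = [1.0, 2.0, 4.0]
eta = 0.1
generic = bregman_generic(u, v, g, eta)
closed = bregman_closed_form(u, v, g, eta)
```

When all weights \(g_k\) are equal and both arguments lie on the simplex, the closed form reduces to the usual Kullback-Leibler divergence scaled by \(g/\eta\).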
Proof of Theorem 6
Proof
For any \(f\in \mathbb {H}_i\), let \(\mathbb {H}_{i,j}\) be the smallest hypothesis space that contains f. If \(j=1\), then \(\Vert f\Vert _{{\mathcal {H}}_i}\le U_1\). Otherwise, we have \(\mathrm {e}^{-1}U_j< \Vert f\Vert _{{\mathcal {H}}_i}\le U_j\). We analyze the regret w.r.t. f.
where \(\varXi _1\) comes from Theorem 5. Next we analyze \(\varXi _2\).
Using the convexity of loss function, we have
Let \(\lambda _{t,i,j}=\lambda _{i,j}\). Using the property of projection, we have
Rearranging terms and summing over \(t=1,\ldots ,T\) yields
Let \(\mathbb {E}_t\) be the conditional expectation w.r.t. \(\rho _t\). Taking expectation w.r.t. \(\{\rho _t\}^T_{t=1}\) yields
where we set \(\lambda _{i,j}=\frac{U_j\sqrt{(1+\upsilon )B}}{\sqrt{2(1-\upsilon )D_i}LT}\). Next we further consider two cases: (i) \(j>1\), (ii) \(j=1\).
- Case (i) \(j>1\). Using the fact \(\mathrm {e}^{-1}U_j\le \Vert f\Vert _{{\mathcal {H}}_i} \le U_j\), we have \(\mathbb {E}[\varXi _{2}]\le \mathrm {e}\Vert f\Vert _{{\mathcal {H}}_i}LT\sqrt{\frac{2D_i}{B}}\).
- Case (ii) \(j=1\). Recall that \(U_{\min }=U/\sqrt{T}\) and \(U=\varTheta (\sqrt{B})\). Then \(U_1 \le \mathrm {e}\sqrt{B/T}\) (see (7)), and we obtain \(\mathbb {E}[\varXi _{2}]\le \mathrm {e}L\sqrt{2D_iT}\).
Combining the results of Case (i) and Case (ii), we obtain
Next we show the final regret. Using Assumption 5, we can rewrite \(\varXi _1\) as follows
Combining with \(\varXi _{2}\) and \(\varXi _{1}\) yields
where \(K'=K(\lceil \ln {U}\rceil -\lceil \ln {(U/\sqrt{T})}\rceil +1)\).
According to Assumption 4, we have \(B\le T\). If \(B=T\), then the expected regret becomes
Combining the two cases concludes the proof. \(\square\)
Proof of Theorem 7
Proof
The proof is similar to that of Theorem 1, so we omit the details. For a static resource allocation \({\mathcal {R}}({\mathcal {T}}_1,\ldots ,{\mathcal {T}}_K)\), let \(j^*=\mathrm {argmax}_{j\in [K]}{\mathcal {T}}_j\). According to Assumption 3, we have \(B_{j^*} = \beta {\mathcal {T}}_{j^*}\). We again choose K Gaussian kernel functions \(\kappa _i({\mathbf {x}},{\mathbf {z}})=\exp (-\frac{\Vert {\mathbf {x}}-{\mathbf {z}}\Vert ^2_2}{\sigma _i})\), \(i=1,\ldots ,K\) as the candidates. The strategy by which the adversary sends examples to the learner is the same as that in the proof of Theorem 1, except that we replace B with \(B_{j^*}\). Therefore, for all \(\kappa _i\), the expected regret of any budgeted online kernel selection algorithm satisfies
which recovers the desired result. \(\square\)
Proof of Theorem 8
Proof
First, assume that \(B<\frac{2T}{K}\). In this case, there exists \(\upsilon\) such that \(B<\frac{2(1-\upsilon )T}{K}\). We consider a fixed \(i\in [K]\). After the \((T-1)\)-th round, the number of support vectors in \(S_{i}\) is \(\vert S_{i}\vert =\sum ^{T-1}_{t=1}\mathbb {I}_{\rho _{t,i}=1}\cdot \mathbb {I}_{i=J_t}\). Define a random variable \(X_{t}\) as follows
Conditioned on \((\rho _{\tau },J_{\tau })_{\tau <t}\), we obtain \(\mathbb {E}_t[X_t]=0\) and \(\vert X_t\vert \le 1\). Thus \(X_1,\ldots ,X_{T-1}\) form a bounded martingale difference sequence. The sum of conditional variances satisfies
where \(E_{T,i} = \{t<T,\nabla _{t,i}\ne 0\}\). Using Lemma 1, with probability at least \(1-\delta\),
Then we consider \(\frac{2T}{K}\le B \le T\). In this case, there is no \(\upsilon\) satisfying \(B<\frac{2(1-\upsilon )T}{K}\). Thus \(\mathbb {P}[\rho _{t,i}=1]=1\) for \(t\in E_{T,i}\). The same proof technique yields, with probability at least \(1-\delta\),
Combining the two cases and applying the union bound over \(i=1,\ldots ,K\) concludes the proof. \(\square\)
Proof of Theorem 9
Proof
Part of the analysis is the same as that of Theorem 5. We start with (16). Replacing \(g(U_{h^*(k)_2},D_{h^*(k)_1},Y)\) with 1 yields
in which we use the fact \({\tilde{c}}_{t,i}=\frac{c_{t,i}}{\mathbb {P}[i\in \{I_t,J_t\}]}\mathbb {I}_{i\in \{I_t,J_t\}}\le Kc_{t,i}\). Combining with \({\mathcal {D}}_{\psi }(\mathbf{u},\mathbf{p}_1)\) yields
Let the initial distribution \(\mathbf{p}_{1}\) satisfy \(p_{1,i}=\frac{1}{K}\) for all \(i=1,\ldots ,K\). We compare with the i-th action. Let \(u_i=1\) and \(u_k=0\) for \(k\ne i\). Then we have
Now we replace i with \(i^*=\mathrm {argmin}_{i\in [K]}\sum ^T_{t=1}{\tilde{c}}_{t,i}\). For simplicity, let \(\sum ^T_{t=1}c_{t,i^*}={\tilde{C}}_{T,*}\). Subtracting \({\tilde{C}}_{T,*}\) on both sides yields
where \(\eta =\min \{\sqrt{2\ln {K}/(K{\tilde{C}}_{T,*})},\frac{1}{K}\}\). Thus, for any \(i\in [K]\), we have
Taking expectation yields the desired result. \(\square\)
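The importance-weighted estimate \({\tilde{c}}_{t,i}=\frac{c_{t,i}}{\mathbb {P}[i\in \{I_t,J_t\}]}\mathbb {I}_{i\in \{I_t,J_t\}}\) and the inequality \({\tilde{c}}_{t,i}\le Kc_{t,i}\) can be illustrated under one explicit assumption that is not stated in this section: \(I_t\) is drawn from \(\mathbf{p}_t\) and \(J_t\) is drawn uniformly from \([K]\), independently of \(I_t\). Under that assumption \(\mathbb {P}[i\in \{I_t,J_t\}]=1-(1-p_{t,i})(1-\frac{1}{K})\ge \frac{1}{K}\). The sketch enumerates all pairs \((I_t,J_t)\) to confirm that the estimate is unbiased; the names and the sample values of p and c are hypothetical.

```python
from itertools import product

def inclusion_prob(p, i):
    """P[i in {I, J}] when I ~ p and J is uniform on [K], independently (assumed)."""
    K = len(p)
    return 1.0 - (1.0 - p[i]) * (1.0 - 1.0 / K)

def estimated_loss(c, p, i, I, J):
    """Importance-weighted estimate of c[i] given the observed pair (I, J)."""
    return c[i] / inclusion_prob(p, i) if i in (I, J) else 0.0

p = [0.7, 0.2, 0.1]  # sampling distribution over K = 3 actions
c = [0.3, 0.9, 0.5]  # true losses in this round
K = len(p)
# Exact expectation of the estimate over all K * K outcomes of (I, J).
expectation = [
    sum(p[I] * (1.0 / K) * estimated_loss(c, p, i, I, J)
        for I, J in product(range(K), repeat=2))
    for i in range(K)
]
largest = [c[i] / inclusion_prob(p, i) for i in range(K)]
```

The uniform draw is what keeps the inclusion probability at or above \(1/K\) even for actions with tiny \(p_{t,i}\), so the estimate stays bounded by \(Kc_{t,i}\) regardless of \(\mathbf{p}_t\).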
Proof of Theorem 10
The proof is similar to that of Theorem 6. We again consider two cases: Case 1: \(B<\frac{2T}{K}\) and Case 2: \(\frac{2T}{K}\le B\le T\).
1.1 Case 1 \(B <\frac{2T}{K}\)
We analyze the regret w.r.t. f. Recalling the regret decomposition (17),
Next we analyze \(\varXi _1\) and \(\varXi _2\) separately. As in the proof of Theorem 6, we have
Let \(\mathbb {E}_{t}\) be the conditional expectation w.r.t. \(J_t\) and \(\rho _{t,J_t}\). Taking expectation w.r.t. \(\{J_{\tau },\rho _{\tau ,J_\tau }\}^{T}_{\tau =1}\), we can obtain
where we set \(\lambda _{i,j}=\frac{\sqrt{(1+\upsilon )B}}{\sqrt{2(1-\upsilon )D_i}LT}\). Next we give the final regret.
Using Theorem 9 and the fact \(\mathbb {E}\left[ \sum ^T_{t=1}\ell (f_{t,i}({\mathbf {x}}_t),y_t)\right] \le \sum ^T_{t=1}\ell (f({\mathbf {x}}_t),y_t)+\mathbb {E}[\varXi _{2}]\), we have
where \(L_T(f):=\sum ^T_{t=1}\ell (f({\mathbf {x}}_t),y_t)\). Using Assumption 5, we have,
Combining with \(\varXi _{1}\) and \(\varXi _{2}\) yields
Replacing f with \(f^*_i\) yields the desired result.
1.2 Case 2 \(\frac{2T}{K}\le B\le T\)
In this case, \(\mathbb {P}[\rho _{t,i}=1]=1\) for \(i=J_t\).
where we set \(\lambda _{i,j}={\frac{1}{\sqrt{KD_iT}L}}\). Combining with \(\varXi _{1}\) and \(\varXi _{2}\), we obtain the regret,
Combining the results of Case 1 and Case 2, we conclude the proof.
About this article
Cite this article
Li, J., Liao, S. Worst-case regret analysis of computationally budgeted online kernel selection. Mach Learn 111, 937–976 (2022). https://doi.org/10.1007/s10994-021-06082-8
DOI: https://doi.org/10.1007/s10994-021-06082-8