1 Introduction

In machine learning, online learning and kernel learning are two active research topics that have been studied separately for years. Online learning is designed to sequentially learn a prediction model based on the feedback of answers to previous questions and possibly additional side information (Shalev-Shwartz 2007). It differs from typical supervised learning algorithms, which learn a classification model from a collection of given training examples. Kernel learning aims to learn an effective kernel function for a given learning task from training data (Lanckriet et al. 2004; Sonnenburg et al. 2006; Hoi et al. 2007). An example of kernel learning is Multiple Kernel Learning (MKL) (Bach et al. 2004; Sonnenburg et al. 2006), which finds the optimal combination of multiple kernels to optimize the performance of kernel based learning methods.

Among various existing algorithms proposed for online learning (Freund and Schapire 1999; Crammer et al. 2006), several studies are devoted to examining kernel techniques in online learning settings (Freund and Schapire 1999; Crammer et al. 2006; Kivinen et al. 2001; Jyrki Kivinen and Williamson 2004). However, most of the existing kernel based online learning algorithms assume that the kernel function is given a priori, significantly limiting their applications to real-world problems. As an attempt to overcome this limitation, we introduce a new research problem, Online Multiple Kernel Classification (OMKC), which aims to learn multiple kernel classifiers and their linear combination simultaneously. The main challenge arising from OMKC is that both the kernel classifiers and their linear combination must be learned online simultaneously. More importantly, the solutions for the kernel classifiers and their linear combination are strongly correlated, making OMKC significantly more challenging than a typical online learning problem.

To this end, we propose a novel OMKC framework for online learning with multiple kernels, which fuses two kinds of online learning techniques: the Perceptron algorithm (Rosenblatt 1958) that learns a classifier for a given kernel, and the Hedge algorithm (Freund and Schapire 1997) that linearly combines multiple classifiers. We further develop kernel selection strategies that randomly choose a subset of kernels for model updating and combination, thus improving the learning efficiency significantly. We analyze the mistake bounds for the proposed OMKC algorithms. Our empirical studies with 15 datasets show promising performance of the proposed OMKC algorithms compared to the state-of-the-art algorithms for online kernel learning.

The rest of this paper is organized as follows. Section 2 reviews the related work in online learning and kernel learning. Section 3 defines the problem of online classification over multiple kernels and presents the proposed framework. Section 4 presents the OMKC algorithms with different kernel selection strategies. Section 5 presents our empirical studies that extensively evaluate the performance of the proposed OMKC algorithms. Section 6 discusses some open issues and future directions. Section 7 concludes this paper.

2 Related work

This section briefly reviews major related work on online learning and kernel learning.

2.1 Online learning

Recent years have witnessed a variety of online learning algorithms proposed and studied in different contexts and applications (Crammer et al. 2006; Yang et al. 2010a). For more references, we refer readers to the overviews of online learning in Shalev-Shwartz (2007) and Cesa-Bianchi and Lugosi (2006), and the references therein.

A large number of recent studies in online learning are based on the framework of maximum margin learning. Most of these algorithms either extend or enhance the well-known Perceptron algorithm (Agmon 1954; Rosenblatt 1958; Novikoff 1962), a pioneering online learning algorithm for linear prediction models. Exemplar algorithms in this category include the Relaxed Online Maximum Margin Algorithm (ROMMA) (Li and Long 2002), the Approximate Maximal Margin Classification Algorithm (ALMA) (Gentile 2001), the Margin Infused Relaxed Algorithm (MIRA) (Crammer and Singer 2003), the NORMA algorithm (Kivinen et al. 2001; Jyrki Kivinen and Williamson 2004), the online Passive-Aggressive (PA) algorithms (Crammer et al. 2006), the Double Updating Online Learning (DUOL) algorithm (Zhao et al. 2011), and the recent family of confidence-weighted learning algorithms (Dredze and Crammer 2008; Wang et al. 2012). Among them, several studies introduce kernel functions into online learning to achieve nonlinear classification (Kivinen et al. 2001; Freund and Schapire 1999; Crammer et al. 2006; Zhao et al. 2011). Similar to these studies, our OMKC framework is also a kernel based approach for online learning.

Moreover, some online learning studies concern the budget issue, i.e., online learning on a budget (Crammer et al. 2003; Cavallanti et al. 2007), which has received much interest recently. It differs from typical online learning methods in that the number of support vectors is bounded when training the classification models. Example algorithms include the first approach to bounding the otherwise unlimited growth of the support set (Crammer et al. 2003), the shifting Perceptron (Cavallanti et al. 2007), the Forgetron (Dekel et al. 2008), the Projectron (Orabona et al. 2008), and the recent bounded online gradient descent algorithms (Zhao et al. 2012).

In addition to the above online learning studies, our work is also related to online prediction with expert advice (Freund and Schapire 1997; Littlestone and Warmuth 1989, 1994; Vovk 1998). The most well-known and successful work is probably the Hedge algorithm (Freund and Schapire 1997), which was a direct generalization of Littlestone and Warmuth’s Weighted Majority (WM) algorithm (Littlestone and Warmuth 1989, 1994). Other recent studies include the improved theoretical bounds (Cesa-Bianchi et al. 2007) and the parameter-free hedging algorithm (Chaudhuri et al. 2009) for decision-theoretic online learning. We refer readers to the book (Cesa-Bianchi and Lugosi 2006) for a more in-depth discussion of this subject.

2.2 Kernel learning

How to find an effective kernel for a given task is critical to most kernel based methods in machine learning (Shawe-Taylor and Cristianini 2004; Cristianini et al. 2001). Most kernel methods assume that a predefined parametric kernel, e.g. a polynomial kernel or an RBF kernel, is given a priori and the parameters of these kernel functions are usually determined empirically by cross validation. Several studies proposed to learn parametric or semi-parametric kernel functions/matrices from labeled and/or unlabeled data. Exemplar techniques include cluster kernels (Chapelle et al. 2002), diffusion kernels (Kondor and Lafferty 2002), marginalized kernels (Kashima et al. 2003), idealized kernel learning (Kwok and Tsang 2003), and graph-based spectral kernel learning approaches (Zhu et al. 2004; Hoi et al. 2006; Bousquet and Herrmann 2002).

Another form of kernel learning, known as Multiple Kernel Learning (MKL) (Lanckriet et al. 2004), aims to find the optimal combination of multiple kernels for a classification task. Exemplar algorithms include the convex optimization approach (Lanckriet et al. 2004), the semi-infinite linear program (SILP) approach (Sonnenburg et al. 2006), the subgradient descent approach (Rakotomamonjy et al. 2008), and the level method (Xu et al. 2008). In addition, several recent studies (Zien and Ong 2007; Ji et al. 2008; Tang et al. 2009) addressed other MKL problems, such as MKL on multi-class and multi-labeled data, the compositional kernel combination method (Lee et al. 2007), multi-layer MKL (Zhuang et al. 2011b), and unsupervised MKL (Zhuang et al. 2011c). Our work differs from the existing MKL methods in that our goal is to resolve online classification tasks, while most existing MKL methods were mainly developed to tackle batch classification tasks.

Besides learning kernels from labeled examples, several studies addressed the challenge of learning kernel matrices from side information (e.g., pairwise constraints). Methods in this category include nonparametric kernel learning (Hoi et al. 2007; Zhuang et al. 2009, 2011a; Chen et al. 2009), low-rank kernel learning (Kulis et al. 2006, 2009), generalized maximum entropy models (Yang et al. 2010b), and indefinite kernel learning (Chen and Ye 2008, 2009). Finally, there are some emerging studies for online multiple kernel learning (Jie et al. 2010; Martins et al. 2011) that address other issues such as multi-class learning or structured prediction. We note that these studies might have been developed in parallel with or after our earlier conference paper (Jin et al. 2010). Our work differs from them in that we focus on enhancing online classification performance by choosing and combining multiple kernels, in which only a subset of kernels is selected for updating and combination during the online learning process. It is the kernel selection strategies developed in this work that make the proposed learning algorithms significantly more efficient than the existing approaches for online multiple kernel learning.

3 Proposed framework for online classification with multiple kernels

We introduce the problem setting and regular Multiple Kernel Learning (MKL), and then present the proposed framework of online multiple kernel classification.

3.1 Problem setting and Multiple Kernel Learning

Consider a set of training examples \(\mathcal{D} = \{(x_{i}, y_{i}), i=1, \ldots, n\}\), where \(x_i \in \mathbb{R}^d\) and \(y_i \in \{-1,+1\}\), and a collection of m kernel functions \(\mathcal{K}=\{\kappa_{i}:\mathcal{X}\times\mathcal{X} \rightarrow \mathbb{R}, i=1, \ldots, m\}\). The goal of multiple kernel learning is to learn a kernel-based prediction function by identifying the optimal combination of the m kernels, denoted by \(\theta=(\theta_1,\ldots,\theta_m)\), that minimizes the margin-based classification error. It is cast into the optimization problem below:

$$\min_{\theta \in \Delta}\ \min_{f \in \mathcal{H}_{K(\theta)}}\ \frac{1}{2}\|f\|^2_{\mathcal{H}_{K(\theta)}} + C\sum_{i=1}^n \ell\bigl(f(x_i), y_i\bigr) \quad (1)$$

where

$$\Delta = \bigl\{\theta \in \mathbb{R}_+^m : \theta^\top \mathbf{1}_m = 1\bigr\}, \qquad K(\theta)(\cdot,\cdot) = \sum_{i=1}^m \theta_i\,\kappa_i(\cdot,\cdot), \qquad \ell\bigl(f(x), y\bigr) = \max\bigl(0, 1 - y f(x)\bigr)$$
In the above formulation, we use the notation \(\mathbf{1}_m\) to represent an m-dimensional vector with all elements equal to 1. The problem can also be cast into the following mini-max optimization problem:

$$\min_{\theta \in \Delta}\ \max_{\alpha \in \Xi}\ \alpha^\top \mathbf{1}_n - \frac{1}{2}(\alpha \circ y)^\top \Biggl(\sum_{i=1}^m \theta_i K^i\Biggr) (\alpha \circ y) \quad (2)$$

where \(K^i \in \mathbb{R}^{n\times n}\) with \(K^{i}_{j,l} = \kappa_{i}(x_{j}, x_{l})\), \(\Xi=\{\alpha \,|\, \alpha\in[0,C]^n\}\), and ∘ denotes the element-wise product between two vectors. We refer to the above formulation as a regular batch MKL problem. Despite some encouraging results achieved recently (Rakotomamonjy et al. 2008; Xu et al. 2008), developing an efficient and scalable MKL algorithm to solve this challenging optimization task remains an open research problem. Unlike the recent online MKL studies (Jie et al. 2010; Martins et al. 2011), which are mainly concerned with optimizing the kernel combination, in this paper we present a new framework for online multiple kernel classification that focuses on effective online combination of multiple kernel classifiers.
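To make the formulation concrete, the following sketch (an illustrative NumPy snippet, not the authors' implementation; the kernel choices and data are arbitrary) builds the combined kernel matrix \(\sum_i \theta_i K^i\) from a list of base kernel functions:

```python
import numpy as np

def combined_kernel(kernels, theta, X):
    """Weighted combination of base kernel matrices: K(theta) = sum_i theta_i * K^i."""
    n = X.shape[0]
    K = np.zeros((n, n))
    for kappa, th in zip(kernels, theta):
        # K^i with entries kappa_i(x_j, x_l)
        Ki = np.array([[kappa(X[j], X[l]) for l in range(n)] for j in range(n)])
        K += th * Ki
    return K

# Two toy base kernels: a linear kernel and a Gaussian kernel with unit width.
linear = lambda x, z: float(x @ z)
gaussian = lambda x, z: float(np.exp(-np.linalg.norm(x - z) ** 2 / 2.0))

X = np.array([[1.0, 0.0], [0.0, 1.0]])
K = combined_kernel([linear, gaussian], [0.5, 0.5], X)
# K is symmetric positive semi-definite, being a convex combination of PSD matrices.
```

Since each \(K^i\) is positive semi-definite and \(\theta \in \Delta\) is a convex combination, the resulting \(K(\theta)\) is again a valid kernel matrix.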

3.2 The proposed framework of Online Multiple Kernel Classification

The proposed Online Multiple Kernel Classification (OMKC) framework is based on the fusion of two online learning methods: the Perceptron algorithm (Rosenblatt 1958) and the Hedge algorithm (Freund and Schapire 1997). In particular, for each kernel, the Perceptron algorithm is employed to learn a kernel-based classifier, and the Hedge algorithm is used to update the combination weights of these classifiers. Algorithm 1 shows the detailed steps of the proposed framework.

Algorithm 1 Deterministic Algorithm for OMKC (OMKC(D,D))

In this framework, we use \(w^i_t\) to denote the combination weight of the i-th kernel classifier at round t, which is initialized to 1. At each learning round, we update the weight \(w^i_t\) following the Hedge algorithm:

$$w^i_{t+1} = w^i_t \beta^{z^i_t} $$

where β∈(0,1) is a discount weight parameter, employed to penalize a kernel classifier that makes an incorrect prediction at each learning step, and \(z^i_t \in \{0,1\}\) indicates whether the i-th kernel classifier makes a mistake on the prediction of the example \(x_t\).
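The interplay of the two updates can be sketched as follows (an illustrative Python snippet, not the authors' implementation; it assumes the standard support-set representation of a kernel Perceptron, \(f_i(x) = \sum_{(x_s, y_s)} y_s \kappa_i(x_s, x)\)). One round consists of a combined prediction, Perceptron updates for the mistaken classifiers, and the Hedge weight update:

```python
import numpy as np

def omkc_dd_round(x_t, y_t, kernels, supports, weights, beta=0.8):
    """One round of the deterministic OMKC sketch.

    supports[i] holds (x_s, y_s) pairs defining the i-th kernel Perceptron
    f_i(x) = sum_s y_s * kernels[i](x_s, x); weights are the Hedge weights.
    """
    m = len(kernels)
    # Prediction of each kernel classifier (ties broken toward +1).
    preds = np.array([
        np.sign(sum(y_s * kernels[i](x_s, x_t) for x_s, y_s in supports[i]) or 1.0)
        for i in range(m)
    ])
    theta = weights / weights.sum()              # normalized combination weights
    y_hat = np.sign(float(theta @ preds) or 1.0)  # combined prediction
    for i in range(m):
        z_i = 1 if y_t * preds[i] <= 0 else 0    # mistake indicator z^i_t
        if z_i:
            supports[i].append((x_t, y_t))       # Perceptron update on a mistake
        weights[i] *= beta ** z_i                # Hedge update: w <- w * beta^z
    return y_hat
```

With a correctly predicted example, both the support set and the weight of a classifier stay unchanged; a mistake adds the example as a support vector and discounts the weight by β.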

Next we derive a theorem that shows the mistake bound for Algorithm 1. Throughout this paper, we assume κ(x,x)≤1 for any x. For convenience of the following discussion, we define the notations:

$$\theta_i^t = \frac{w_i^t}{\sum_{j=1}^m w_j^t}, \qquad z_i^t = I\bigl(y_t f_i^t(x_t) \leq 0\bigr)$$

where \(f_{i}^{t}(x)\) represents the classifier at trial t constructed using the kernel function \(\kappa_{i}(\cdot,\cdot)\), and I(x) is an indicator function that outputs 1 when x is true and 0 otherwise. Here, \(\theta_{i}^{t}\) essentially defines the mixture of the kernel classifiers, and \(z_{i}^{t}\) indicates whether training example \((x_t, y_t)\) is misclassified by the i-th kernel classifier at trial t. Finally, we define the optimal margin classification error for the kernel \(\kappa_{i}(\cdot,\cdot)\) with respect to a collection of training examples \(\mathcal{L} = \{(x_{t}, y_{t}), t=1, \ldots, T\}\) as

$$F(\kappa_i, \ell, \mathcal{L}) = \min_{f \in \mathcal{H}_{\kappa_i}} \Biggl(\|f\|^2_{\mathcal{H}_{\kappa_i}} + 2\sum_{t=1}^T \ell\bigl(f(x_t), y_t\bigr)\Biggr) \quad (3)$$

Theorem 1

After receiving a sequence of T training examples, denoted by \(\mathcal{L}=\{(x_{t}, y_{t}), t=1, \ldots, T\}\), the number of mistakes M made by running Algorithm  1, denoted by

$$M =\sum_{t=1}^T I(y_t \hat{y}_t \leq 0) = \sum_{t=1}^T I \Biggl(\sum_{i=1}^m \theta_i^t z_i^t \geq 0.5 \Biggr) $$

is bounded as follows

$$M \leq \frac{2\ln(1/\beta)}{1-\beta}\min_{1\leq i\leq m}\sum_{t=1}^T z_i^t + \frac{2\ln m}{1-\beta} \quad (4)$$
$$\phantom{M} \leq \frac{2\ln(1/\beta)}{1-\beta}\min_{1\leq i\leq m} F(\kappa_i, \ell, \mathcal{L}) + \frac{2\ln m}{1-\beta} \quad (5)$$

By choosing \(\beta = \frac{\sqrt{T}}{\sqrt{T} + \sqrt{\ln m}}\), we have

$$M \leq 2 \biggl( \biggl(1+\sqrt{\frac{\ln m}{T}} \biggr)\min _{1 \leq i \leq m} F(\kappa_i, \ell, \mathcal{L}) + \ln m + \sqrt{T \ln m} \biggr) $$

The proof of this theorem essentially combines the proofs of the Perceptron algorithm and the Hedge algorithm; the details can be found in the Appendix. We note that the mistake bound in the above theorem can be improved if we further tune the stepsize or the classification margin γ. However, since the focus of this study is online multiple kernel classification, we simply fix both parameters to 1. The above theorem also provides a suggestion for setting the parameter β. It is important to note that the value of β suggested by the bound could be highly overestimated due to the rough approximation of \(\sum_{t=1}^{T} z_{i}^{t}\) by T. We will examine empirically how β affects the prediction accuracy of the proposed algorithm.
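As a quick numerical illustration of the suggested choice \(\beta = \frac{\sqrt{T}}{\sqrt{T} + \sqrt{\ln m}}\) (the values of T and m below are arbitrary examples):

```python
import math

def suggested_beta(T, m):
    """Discount parameter suggested by the bound: beta = sqrt(T) / (sqrt(T) + sqrt(ln m))."""
    return math.sqrt(T) / (math.sqrt(T) + math.sqrt(math.log(m)))

beta = suggested_beta(T=10000, m=16)  # illustrative values of T and m
```

For T = 10000 and m = 16 this gives β ≈ 0.98, noticeably larger than the value β = 0.8 fixed empirically in Sect. 5, which is in line with the remark that the suggested value may be overestimated.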

The main shortcoming of Algorithm 1 is its high computational cost. First, at step 5, to make a prediction \(\hat{y}_{t}\), Algorithm 1 requires combining predictions from all the kernel classifiers. Second, between steps 7 and 11, Algorithm 1 requires updating all the kernel classifiers. When the number of kernels is large, both are computationally expensive. In the subsequent sections, we study kernel selection strategies that reduce the computational cost of Algorithm 1 by selecting only a subset of kernels for prediction and updating. To distinguish it from those approaches, we refer to Algorithm 1 as the deterministic approach, or “OMKC(D,D)” for short, because all the kernels are used for both prediction and updating.

4 Online Multiple Kernel Classification (OMKC) algorithms

4.1 OMKC by stochastic combination

Our first effort is to improve the computational efficiency of Algorithm 1 by selecting a subset of kernels for prediction. Algorithm 2 shows the key steps. It introduces \(q_i(t), i=1,\ldots,m\), the probability of sampling the i-th kernel at the t-th iteration, which is computed as follows:

$$q_i(t) = \frac{w_i(t)}{\sum_{j=1}^m w_j(t)}, \quad i=1,\ldots,m \quad (6)$$

Only the sampled kernel classifiers are combined to make the prediction. We refer to this stochastic selection approach as stochastic combination, and to Algorithm 2 as OMKC(D,S). We have the following theorem for the mistake bound of Algorithm 2.
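A sketch of the stochastic combination step is given below (illustrative Python; the per-kernel independent Bernoulli reading of the sampling, and the fallback to the highest-weight kernel when nothing is sampled, are our assumptions rather than a transcription of Algorithm 2):

```python
import numpy as np

def stochastic_combination(preds, weights, rng):
    """Combine only a sampled subset of kernel predictions (OMKC(D,S) sketch).

    Each kernel i enters the combination via an independent Bernoulli draw
    with probability q_i(t) = w_i(t) / sum_j w_j(t).
    """
    q = weights / weights.sum()
    mask = rng.random(len(weights)) < q        # sampled kernels
    if not mask.any():                         # assumption: fall back to the top-weight kernel
        mask[np.argmax(weights)] = True
    score = float((q * mask) @ preds)
    return 1.0 if score >= 0 else -1.0

rng = np.random.default_rng(0)
y_hat = stochastic_combination(np.array([1.0, -1.0, 1.0]), np.array([4.0, 1.0, 4.0]), rng)
```

Kernels with small Hedge weights are rarely sampled, so the prediction cost concentrates on the few classifiers that have performed well so far.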

Algorithm 2 Stochastic Combination Algorithm for OMKC (OMKC(D,S))

Theorem 2

After receiving a sequence of T training examples, denoted by \(\mathcal{L}=\{(x_{t}, y_{t}), t=1, \ldots, T\}\), the number of mistakes M made by running Algorithm  2 is bounded as follows if \(\beta = \frac{\sqrt{T}}{\sqrt{T} + \sqrt{\ln m}}\)

$$\mathrm{E}[M] \leq 2 \biggl( \biggl(1+\sqrt{\frac{\ln m}{T}} \biggr)\min _{1 \leq i \leq m} F(\kappa_i, \ell, \mathcal{L}) + \ln m + \sqrt{T \ln m} \biggr) $$

The proof of Theorem 2 is identical to that of Theorem 1. Compared to Algorithm 1, we see that the mistake bound in expectation for Algorithm 2 remains unchanged, implying that the stochastic selection approach employed in Algorithm 2 does not significantly affect the overall performance.

4.2 OMKC by stochastic updating

Our second approach is to improve the learning efficiency of Algorithm 1 by sampling, based on the weights assigned to the kernel classifiers, a subset of kernel classifiers for updating. Specifically, we introduce the sampling probability \(p_i(t)\), computed by smoothing \(q_i(t)\) (defined in (6)) with a uniform distribution δ/m, i.e.,

$$p_i(t) = (1 -\delta) q_i(t) + \delta/m,\quad i=1,\ldots,m $$

The smoothing parameter δ guarantees that each kernel classifier is selected with probability at least δ/m, preventing the sampling probability \(p_i(t)\) from concentrating on a few kernels. A similar idea was used in the study of the multi-armed bandit problem (Auer et al. 2003) to trade off exploration and exploitation. Based on the probabilities \(p_i(t)\), we sample a subset of kernels by independent Bernoulli sampling in each trial, one draw for each kernel classifier, i.e.,

$$m_i(t) = \textsf{Bernoulli}\_\textsf{Sampling}(p_i(t)),\quad i=1,\ldots,m $$

where \(m_i(t) \in \{0,1\}\) denotes the sampling result. The i-th kernel is selected if and only if \(m_i(t)=1\). Algorithm 3 shows the detailed steps. We refer to this kernel selection strategy as the stochastic updating approach, and to Algorithm 3 as OMKC(S,D). The theorem below shows the mistake bound of Algorithm 3.
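The sampling step of the stochastic updating strategy can be sketched as follows (illustrative Python, not the authors' Matlab code):

```python
import numpy as np

def stochastic_update_mask(weights, delta, rng):
    """Sample which kernel classifiers to update (OMKC(S,D) sketch).

    p_i(t) = (1 - delta) * q_i(t) + delta / m smooths the weight-based
    probabilities so that every kernel keeps at least a delta/m chance
    of being explored.
    """
    m = len(weights)
    q = weights / weights.sum()
    p = (1.0 - delta) * q + delta / m
    return (rng.random(m) < p).astype(int)     # m_i(t) in {0, 1}

rng = np.random.default_rng(0)
mask = stochastic_update_mask(np.array([8.0, 1.0, 1.0]), delta=0.01, rng=rng)
# Only classifiers with mask[i] == 1 would receive a Perceptron update this round.
```

Note that the smoothed probabilities \(p_i(t)\) still sum to one, since the \(q_i(t)\) do.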

Algorithm 3 Stochastic-Update Algorithm for OMKC (OMKC(S,D))

Theorem 3

After receiving a sequence of T training examples, denoted by \(\mathcal{L} = \{(x_{t}, y_{t}), t=1, \ldots, T\}\), the expected number of mistakes made by Algorithm  3, denoted by

$$M= \mathrm{E} \Biggl[ \sum_{t=1}^T I \Biggl(\sum_{i=1}^m q_i(t) z_i(t) \geq 0.5 \Biggr) \Biggr], $$

is bounded as follows

$$M \leq \frac{2m\ln(1/\beta)}{\delta(1 - \beta)} \min\limits _{1 \leq i \leq m} F( \kappa_i, \ell, \mathcal{L}) + \frac{2m\ln m}{\delta(1 - \beta)} $$

By choosing \(\beta = \frac{\sqrt{T}}{\sqrt{T} + \sqrt{\ln m}}\), we have

$$M \leq \frac{2m}{\delta} \biggl( \biggl(1+\sqrt{\frac{\ln m}{T}} \biggr) \min \limits _{1 \leq i \leq m} F(\kappa_i, \ell, \mathcal{L}) + \ln m + \sqrt{T\ln m } \biggr) $$

Proof

Similar to the proof of Theorem 1, we first bound \(\ln(W_{T+1}/W_1)\) from below and above by

$$-\ln(1/\beta)\sum_{t=1}^Tm_i(t)z_i(t)- \ln m\leq \ln \biggl(\frac{W_{T+1}}{W_1} \biggr)\leq -(1-\beta)\sum _{t=1}^T\sum_{i=1}^mq_i(t)m_i(t)z_i(t), $$

which leads to the following inequality

$$(1-\beta)\sum_{t=1}^T\sum _{i=1}^mq_i(t)m_i(t)z_i(t) \leq \ln(1/\beta)\sum_{t=1}^Tm_i(t)z_i(t)+ \ln m $$

Taking expectation on both sides, we have

$$\mathrm{E}\Biggl[(1-\beta)\sum_{t=1}^T\sum _{i=1}^mq_i(t)m_i(t)z_i(t) \Biggr]\leq \mathrm{E}\Biggl[\ln(1/\beta)\sum_{t=1}^Tm_i(t)z_i(t) \Biggr]+\ln m $$

Since \(p_i(t) \geq \delta/m\), we have

$$\frac{\delta(1-\beta)}{m}\mathrm{E}\Biggl[\sum_{t=1}^T \sum_{i=1}^mq_i(t)z_i(t) \Biggr]\leq \mathrm{E}\Biggl[\ln(1/\beta)\sum_{t=1}^Tm_i(t)z_i(t) \Biggr]+\ln m $$

Using the result in Theorem 1, we have the following inequality for any \(f \in \mathcal{H}_{\kappa_{i}}\)

$$\sum_{t=1}^T m_i(t) z_i(t) \leq \|f\|^2_{\mathcal{H}_{\kappa_i}} + 2\sum_{t=1}^T \ell\bigl(f(x_t), y_t\bigr)$$

Combining the above results, we have

$$\frac{\delta(1-\beta)}{m}\mathrm{E}\Biggl[\sum_{t=1}^T \sum_{i=1}^m q_i(t) z_i(t)\Biggr] \leq \ln(1/\beta)\, F(\kappa_i, \ell, \mathcal{L}) + \ln m$$
Following the same argument as in Theorem 1, we have the result in the theorem. □

As indicated in Theorem 3, the dependence of the mistake bound on m is O(m ln m). Since Algorithm 3 essentially chooses a few kernel classifiers to update at each iteration, the algorithm is similar to the multi-armed bandit problem. It is therefore not surprising to see an O(m ln m) dependence for our algorithm, since the same dependence appears in the regret bound for the multi-armed bandit problem, where m is the number of arms.

It is interesting to note that the mistake bound in Theorem 3 is inversely proportional to δ, indicating that a larger δ may potentially lead to a better mistake bound for the combined kernel classifier. In the extreme case of δ=1, the kernel classifiers are chosen uniformly at random for updating. In practice, however, we found that choosing kernel classifiers uniformly at random usually leads to poor performance, because it wastes updates on the kernel classifiers with low prediction accuracy. Regarding this inconsistency between the theoretical and empirical results, we conjecture that the mistake bound is not tight enough to reveal the true behavior of the algorithm.

Besides this practical issue, another problem with choosing δ=1 is that a larger δ usually leads to a larger number of updates, as revealed by the following corollary, and thus to a higher computational cost.

Corollary 1

After receiving a sequence of T training examples, denoted by \(\mathcal{L} = \{(x_{t}, y_{t}), t=1, \ldots, T\}\), the expected number of updates made by Algorithm  3, denoted by

$$U= \mathrm{E} \Biggl[\sum_{t=1}^T\sum _{i=1}^m m_i(t) z_i(t) \Biggr], $$

is bounded as follows if \(\beta = \frac{\sqrt{T}}{\sqrt{T} + \sqrt{\ln m}}\)

$$U \leq \frac{(1-\delta)m}{\delta} \biggl( \biggl(1+\sqrt{\frac{\ln m}{T}} \biggr) \min \limits _{1 \leq i \leq m} F(\kappa_i, \ell, \mathcal{L}) + \ln m + \sqrt{T\ln m } \biggr) + \delta T $$

Proof

According to the definitions, we have the following result:

$$U = \sum_{t=1}^T\sum_{i=1}^m \mathrm{E}\bigl[m_i(t) z_i(t)\bigr] = \mathrm{E}\Biggl[\sum_{t=1}^T\sum_{i=1}^m p_i(t)\, z_i(t)\Biggr] = (1-\delta)\,\mathrm{E}\Biggl[\sum_{t=1}^T\sum_{i=1}^m q_i(t)\, z_i(t)\Biggr] + \frac{\delta}{m}\,\mathrm{E}\Biggl[\sum_{t=1}^T\sum_{i=1}^m z_i(t)\Biggr] \leq (1-\delta)\,\mathrm{E}\Biggl[\sum_{t=1}^T\sum_{i=1}^m q_i(t)\, z_i(t)\Biggr] + \delta T$$

where the last step uses \(z_i(t) \leq 1\).
Following the same argument as in Theorem 2, we have the result in Corollary 1. □

As indicated by the above corollary, a larger δ leads to a potentially larger number of updates. When δ is chosen appropriately, it can improve the prediction performance through the exploration of more kernels. However, when δ is too large, it not only increases the number of updates but also over-trains on the poor kernels, leading to high computational cost, overly complex models, and even worse prediction accuracy.

4.3 OMKC by stochastic updating & stochastic combination

Our final approach combines the two kernel selection strategies, i.e., the stochastic combination approach and the stochastic updating approach. Algorithm 4 shows the details of this approach, which we refer to as OMKC(S,S). Algorithm 4 is clearly the most computationally efficient of the four approaches.

Algorithm 4 Stochastic Algorithm for OMKC (OMKC(S,S))

4.4 Summary of four OMKC algorithms

By choosing different selection strategies for classifier updating and combination, we develop four variants of OMKC algorithms. Table 1 summarizes the four proposed variants obtained by mixing different updating and combination strategies.

Table 1 Summary of the variants of OMKC algorithms. Below, U and C denote the selection strategies for Update and Combination, respectively; S and D denote the Stochastic and Deterministic approaches, respectively

Among the four algorithms, OMKC(D,D) is the most computationally intensive, updating and combining all the kernel classifiers at each iteration, while OMKC(S,S) is the most efficient, selectively updating and combining only a subset of kernel classifiers at each iteration. OMKC(D,S) and OMKC(S,D) lie between these two extremes. To better understand the advantages and disadvantages of these four algorithms under different situations, we comprehensively examine their empirical performance in our experiments.

5 Experimental results

The goal of our empirical study is to answer the following questions: (1) Are the proposed OMKC algorithms more effective than regular online learning algorithms with a single kernel (e.g., Perceptron) for online classification? (2) Are the proposed OMKC algorithms more effective than the state-of-the-art online MKL method in the literature for online classification? (3) How do the efficiency and efficacy of the OMKC algorithms using the stochastic strategy compare to those of the OMKC algorithms using the deterministic strategy? (4) Among all the proposed OMKC algorithms, which achieves the best accuracy, efficiency, and sparsity? (5) How does the discount weight parameter β affect the performance of the proposed OMKC algorithms?

5.1 Experimental testbed and setup

In our experiments, we test the algorithms over a testbed of 15 diverse datasets obtained from LIBSVM and the UCI machine learning repository. These datasets were chosen quite arbitrarily, with different sizes and dimensions, in order to examine every aspect of the performance of our algorithms. The details of these datasets are shown in Table 2.

Table 2 The details of 15 diverse datasets used in our experiments

We evaluate the empirical performance of the proposed online multiple kernel learning algorithms for online classification tasks. In particular, we predefine a pool of 16 kernel functions, including 3 polynomial kernels \(k(x_{i},x_{j})=(x_{i}^{\top}x_{j})^{p}\) with degree parameter p = 1, 2, and 3, and 13 Gaussian kernels \(k(x_i,x_j)=\exp(-\|x_i-x_j\|^2/(2\sigma^2))\) with kernel width parameter \(\sigma \in \{2^{-6}, 2^{-5}, \ldots, 2^{6}\}\).
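This kernel pool can be written down directly (an illustrative Python sketch; the experiments themselves were run in Matlab):

```python
import numpy as np

def build_kernel_pool():
    """The 16 predefined kernels: polynomials of degree 1, 2, 3 and
    13 Gaussian kernels with width sigma in {2^-6, ..., 2^6}."""
    pool = []
    for p in (1, 2, 3):
        pool.append(lambda x, z, p=p: float(x @ z) ** p)  # polynomial kernel
    for log_sigma in range(-6, 7):
        s = 2.0 ** log_sigma
        pool.append(lambda x, z, s=s:
                    float(np.exp(-np.linalg.norm(x - z) ** 2 / (2 * s ** 2))))  # Gaussian kernel
    return pool

pool = build_kernel_pool()  # len(pool) == 16
```

The default-argument trick (`p=p`, `s=s`) captures each parameter value at definition time, so the 16 lambdas do not all share the last loop value.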

We compare the proposed four variants of OMKC algorithms with the following baseline algorithms:

  • Perceptron: the well-known Perceptron baseline algorithm with a linear kernel (Rosenblatt 1958; Freund and Schapire 1999);

  • Perceptron(u): another Perceptron baseline algorithm with an unbiased/uniform combination of all the kernels;

  • Perceptron(*): we conduct an online validation procedure to search for the best kernel in the pool (using the first 10% of training examples), and then apply the Perceptron algorithm with that kernel;

  • OM-2: a state-of-the-art online learning algorithm for multiple kernel learning (Jie et al. 2010; Orabona et al. 2010);

For performance metrics, following the setup of a regular online learning task, we adopt the mistake rate, i.e., the percentage of mistakes made by the online learner over the total number of predictions. In addition, we also measure the number of support vectors of the classifiers learned by the online learning algorithms. Finally, we measure the average running time (including model updating and online prediction) for learning the classifiers.

Regarding the parameter setup, for the proposed OMKC algorithms the parameters β and δ are simply fixed to 0.8 and 0.01, respectively; we empirically examine the impact of these parameters in Sect. 5.6. Further, to obtain stable average results, all online learning experiments were conducted over 20 random permutations of each dataset, and all reported results were averaged over these 20 runs, where each run was conducted over a single pass of the permuted dataset. Our experiments were run on a PC with a 2.3 GHz CPU and 16 GB RAM using a Matlab implementation.

5.2 Evaluation of the deterministic OMKC algorithm

Table 3 summarizes the average experimental results comparing the proposed \(\mathrm{OMKC}_{\mathrm{(D,D)}}\) algorithm with three Perceptron based algorithms (i.e., Perceptron, Perceptron(u), and Perceptron(*)) and the OM-2 algorithm for online MKL on the 15 datasets. Based on the experimental results, we performed Student's t-tests and highlighted the best results in the table, including results that are statistically indistinguishable from the top result. We discuss the performance comparison as follows.

Table 3 Comparison of the OMKC algorithm with the OM-2 and three Perceptron based algorithms. We conducted the student t-test on the mistake results and highlighted the best results for each dataset. Among 15 datasets, OMKC(D,D) achieved the best on 12 datasets, while OM-2 and Perceptron(*) achieved the best on 4 datasets and 1 dataset, respectively

First of all, we examine the performance of the three Perceptron based algorithms. We observe that Perceptron(u), using the unbiased combination of all kernels, usually outperforms the regular Perceptron using a linear kernel, except on a few datasets (e.g., wdbc, breast, and spambase) where Perceptron(u) is considerably worse. Further, among the three Perceptron algorithms, Perceptron(*) with the best kernel significantly outperforms the other two in most cases, except on a couple of datasets (e.g., ionosphere and votes84). This result shows that it is important to identify the best kernel for an online learning task.

Secondly, we examine the performance of the OM-2 algorithm in comparison to the three Perceptron algorithms. We observe that this online MKL algorithm is often more effective than, or at least comparable to, the two regular Perceptron algorithms, i.e., Perceptron with a linear kernel and Perceptron(u) with an unbiased combined kernel. In addition, comparing OM-2 with Perceptron(*), which uses the best kernel, we found them in general quite comparable: Perceptron(*) tends to perform considerably better on some datasets (such as australian, diabetes, wdbc, breast, fourclass, splice, svmguide1), while OM-2 tends to perform better on the others. This observation shows that both identifying the best kernel and combining multiple kernels are important, and each approach exhibits its advantages in different scenarios of online learning tasks.

Thirdly, among all the compared algorithms, \(\mathrm{OMKC}_{\mathrm{(D,D)}}\) achieves the best overall performance, obtaining the best results on 12 out of 15 datasets and significantly outperforming both the Perceptron(*) and OM-2 algorithms, which obtained the best results on only 1 and 3 of the 15 datasets, respectively. Comparing the proposed \(\mathrm{OMKC}_{\mathrm{(D,D)}}\) algorithm with Perceptron(*) in detail, we found that it consistently outperforms Perceptron(*) on almost all datasets. This shows that \(\mathrm{OMKC}_{\mathrm{(D,D)}}\) is excellent at tracking the best kernel classifier in the online learning task. Finally, comparing \(\mathrm{OMKC}_{\mathrm{(D,D)}}\) with OM-2, we found that it outperforms OM-2 on most datasets, except for three (svmguide3, a3a, w5a) where OM-2 tends to perform better. This encouraging result shows that the proposed \(\mathrm{OMKC}_{\mathrm{(D,D)}}\) algorithm is more effective at combining multiple kernels than the state-of-the-art online MKL algorithm for online learning tasks.

Finally, despite the considerably better predictive performance achieved by the proposed \(\mathrm{OMKC}_{\mathrm{(D,D)}}\) algorithm, we note that it has two limitations. The first is its high computational cost for learning and prediction: \(\mathrm{OMKC}_{\mathrm{(D,D)}}\) adopts the deterministic updating strategy, which must check every kernel classifier at each iteration and update each kernel classifier whenever it makes a mistake. The second is its high model complexity, i.e., the number of support vectors learned by \(\mathrm{OMKC}_{\mathrm{(D,D)}}\) is significantly larger than for the other algorithms. For example, on the dataset “ionosphere”, the number of support vectors learned by \(\mathrm{OMKC}_{\mathrm{(D,D)}}\) is almost 12 times that of OM-2. The high learning and model complexities make \(\mathrm{OMKC}_{\mathrm{(D,D)}}\) less efficient and less attractive for applications with a large number of kernels. This also indicates the importance of exploring stochastic variants of OMKC algorithms, which are evaluated in the next section.

5.3 Evaluation of stochastic strategies for OMKC algorithms

In this section, we evaluate the performance of several variants of the OMKC algorithms using different stochastic strategies. In particular, we examine two kinds of stochastic strategies, stochastic updating and stochastic combination, in comparison to their deterministic counterparts. Table 4 shows the experimental results comparing the deterministic \(\mathrm{OMKC}_{\mathrm{(D,D)}}\) algorithm with the other three OMKC algorithms. As before, we highlighted the best results in the table after performing Student's t-tests on the mistake rates. We analyze the experimental results in detail below.

Table 4 Comparison of four OMKC algorithms and the OM-2 algorithm. We conducted Student's t-tests on the mistake rates and highlighted the best results for each dataset. OM-2 achieved the best result on only 3 out of 15 datasets, while the best OMKC algorithm achieved the best result on 11 out of 15 datasets

5.3.1 Deterministic updating vs. stochastic updating

By comparing the results of different OMKC algorithms in Table 4, we have several observations for the comparisons between the deterministic and stochastic updating approaches.

First of all, by comparing \(\mathrm{OMKC}_{\mathrm{(D,D)}}\) and \(\mathrm{OMKC}_{\mathrm{(S,D)}}\), which adopt the same deterministic classifier combination approach, we found that \(\mathrm{OMKC}_{\mathrm{(S,D)}}\) with the stochastic updating strategy improves both the time efficiency and the model complexity over \(\mathrm{OMKC}_{\mathrm{(D,D)}}\). This is not difficult to understand, since \(\mathrm{OMKC}_{\mathrm{(S,D)}}\) selectively updates only a subset of kernel classifiers at each iteration, and thus runs more efficiently and produces fewer support vectors. Moreover, by examining the mistake rates, we found that \(\mathrm{OMKC}_{\mathrm{(S,D)}}\) achieves mistake rates comparable to or even better than those of \(\mathrm{OMKC}_{\mathrm{(D,D)}}\), which validates the efficacy of the stochastic updating strategy.

Second, by comparing another pair of OMKC algorithms, \(\mathrm{OMKC}_{\mathrm{(D,S)}}\) and \(\mathrm{OMKC}_{\mathrm{(S,S)}}\), which adopt the same stochastic classifier combination approach, we can draw a similar observation. In particular, \(\mathrm{OMKC}_{\mathrm{(S,S)}}\) not only significantly improves the learning efficiency over \(\mathrm{OMKC}_{\mathrm{(D,S)}}\), but also achieves better or at least comparable predictive performance.

Third, by comparing the two OMKC algorithms with the stochastic updating strategy against the OM-2 algorithm, we found that both \(\mathrm{OMKC}_{\mathrm{(S,D)}}\) and \(\mathrm{OMKC}_{\mathrm{(S,S)}}\) achieve performance considerably better than, or at least comparable to, the existing online MKL algorithm on most datasets.

The above observations confirm that the stochastic updating strategy not only effectively improves the learning efficiency of OMKC, but also maintains highly comparable, and sometimes even better, predictive performance for online learning tasks.
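The stochastic updating strategy can be sketched as a per-kernel Bernoulli sampling step: at each round, only the sampled kernel classifiers are checked and updated. The exact sampling probabilities used by the OMKC algorithms (including their smoothing term) are not reproduced here; the max-normalized weights with a small floor below are an assumption for illustration, the floor ensuring that currently poor kernels still receive occasional updates.

```python
import numpy as np

def sample_update_set(weights, rng, floor=0.05):
    """Bernoulli-sample which kernel classifiers to update this round.

    Each kernel i is selected independently with a probability derived
    from its max-normalized Hedge weight, kept above a small floor so
    that low-weight kernels are not starved of updates entirely.
    (Illustrative sketch; not the paper's exact sampling scheme.)
    """
    probs = np.maximum(weights / weights.max(), floor)
    return [i for i in range(len(weights)) if rng.random() < probs[i]]
```

In expectation, only a fraction of the kernel classifiers is touched per round, which is why the stochastic updating variants run faster and accumulate fewer support vectors than their deterministic counterparts.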

5.3.2 Deterministic combination vs. stochastic combination

Similarly, we can also draw several observations about the comparisons between the deterministic and stochastic classifier combination approaches.

First of all, by comparing \(\mathrm{OMKC}_{\mathrm{(D,D)}}\) and \(\mathrm{OMKC}_{\mathrm{(D,S)}}\), which adopt the same deterministic updating strategy, we found that \(\mathrm{OMKC}_{\mathrm{(D,S)}}\) with the stochastic classifier combination strategy significantly reduces the model complexity compared to \(\mathrm{OMKC}_{\mathrm{(D,D)}}\). For most datasets, \(\mathrm{OMKC}_{\mathrm{(D,S)}}\) reduces the number of support vectors by over 90%. Despite this significant gain in model complexity, \(\mathrm{OMKC}_{\mathrm{(D,S)}}\) tends to yield slightly worse predictive performance; nonetheless, its mistake rates remain quite competitive with those of OM-2.

Second, by comparing \(\mathrm{OMKC}_{\mathrm{(S,D)}}\) and \(\mathrm{OMKC}_{\mathrm{(S,S)}}\), which adopt the same stochastic updating strategy, we found that \(\mathrm{OMKC}_{\mathrm{(S,S)}}\) with the stochastic combination strategy significantly reduces both the model complexity and the computational time cost. In addition, by examining the predictive performance, \(\mathrm{OMKC}_{\mathrm{(S,S)}}\) achieves mistake rates fairly comparable to those of \(\mathrm{OMKC}_{\mathrm{(S,D)}}\).

The above observations show that the OMKC algorithms using the stochastic classifier combination strategy considerably improve the learning efficiency while maintaining predictive performance comparable to that of their deterministic counterparts.
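Similarly, the stochastic combination strategy can be sketched as follows: at prediction time, only a weight-sampled subset of the kernel classifiers votes. The sampling probabilities and the fallback rule below are illustrative assumptions on our part, not the exact scheme in the paper.

```python
import numpy as np

def stochastic_combine(scores, weights, rng, floor=0.05):
    """Predict with a randomly sampled subset of kernel classifiers.

    scores  : real-valued outputs f_i(x) of the m kernel classifiers
    weights : current Hedge weights (numpy array)
    Classifiers are sampled via their max-normalized weights with a
    small floor (an assumption for this sketch); only sampled
    classifiers contribute a weighted vote to the final sign.
    """
    probs = np.maximum(weights / weights.max(), floor)
    mask = rng.random(len(weights)) < probs
    if not mask.any():                        # fall back to the best kernel
        mask[np.argmax(weights)] = True
    votes = weights[mask] * np.sign(scores[mask])
    return 1 if votes.sum() >= 0 else -1
```

Since low-weight classifiers are rarely sampled, the combined predictor effectively concentrates on the few good kernels, which explains the sparser models observed for the stochastic combination variants.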

5.4 Evaluation of varied online learning sizes

To further examine the performance of the OMKC algorithms with respect to different dataset sizes for online learning, Fig. 1 and Fig. 2 show how the average mistake rates and the average numbers of support vectors change during the online learning process. We draw the following observations from these results.

Fig. 1

Comparison of average mistake rates during the online learning processes

Fig. 2

Comparison of average sizes of support vectors during the online learning processes

First, the results in Fig. 1 validate that the more examples are received in the online learning process, the better the performance achieved by the proposed OMKC algorithms. In particular, at the beginning of an online learning task, the Perceptron(*) algorithm with the best kernel produces a smaller mistake rate than all of the OMKC algorithms; as more training examples are received, however, the predictive performance of the OMKC algorithms improves more rapidly than that of the Perceptron(*) algorithm.

Second, the results in Fig. 2 again verify that the \(\mathrm{OMKC}_{\mathrm{(D,D)}}\) algorithm produces the most complex classifier among all the OMKC algorithms, while \(\mathrm{OMKC}_{\mathrm{(D,S)}}\) produces the simplest classifier, which is even sparser than that of \(\mathrm{OMKC}_{\mathrm{(S,S)}}\) once more training examples have been received. This may seem surprising at first, but it is easy to interpret: \(\mathrm{OMKC}_{\mathrm{(D,S)}}\), using the deterministic updating strategy, steadily increases the weights of the good kernels while decreasing the weights of the poor ones. As a result, when the stochastic combination strategy is applied, only a very small set of good kernels is selected for the final classifier combination (most of the poor kernels are discarded due to their small sampling weights).

Third, it is somewhat surprising that \(\mathrm{OMKC}_{\mathrm{(S,D)}}\) sometimes performs even better than the deterministic \(\mathrm{OMKC}_{\mathrm{(D,D)}}\) algorithm, and similarly that \(\mathrm{OMKC}_{\mathrm{(S,S)}}\) usually performs better than \(\mathrm{OMKC}_{\mathrm{(D,S)}}\). This is, however, possible and reasonable. Although \(\mathrm{OMKC}_{\mathrm{(D,D)}}\) is good at tracking the best kernel, it is not always guaranteed to achieve the best performance, primarily because it may rely too heavily on the best kernel classifier, which itself may not always outperform a combination of multiple kernels. In contrast, \(\mathrm{OMKC}_{\mathrm{(S,D)}}\) exploits all the kernel classifiers more effectively through the stochastic updating strategy, and thus achieves a good tradeoff between tracking the best kernel classifier and combining multiple kernel classifiers.

To further verify the above argument, Fig. 3 shows how the normalized weights for tracking the best kernel classifier evolve for the different OMKC algorithms. These weights indicate how strongly each algorithm trusts the best kernel during the online learning process. The results clearly show that \(\mathrm{OMKC}_{\mathrm{(D,D)}}\) and \(\mathrm{OMKC}_{\mathrm{(D,S)}}\), which use the deterministic updating strategy, are significantly better at tracking the best kernel than \(\mathrm{OMKC}_{\mathrm{(S,D)}}\) and \(\mathrm{OMKC}_{\mathrm{(S,S)}}\). These results are consistent with our previous analysis of the relationships among the different OMKC algorithms.

Fig. 3

Evaluation of the normalized weights in tracking the best kernel achieved by different OMKC algorithms

5.5 Evaluation of computational time cost

In addition to learning accuracy, time efficiency is another important concern for online learning. In our experiments, we also examine the time efficiency of the different OMKC algorithms. In particular, we are interested in how much the stochastic algorithms can reduce the computational time cost of the deterministic \(\mathrm{OMKC}_{\mathrm{(D,D)}}\) algorithm. In our implementation, all the kernels are pre-computed and stored in memory to facilitate evaluation. We report the average CPU running times over the 20 runs, which differ in the random ordering of the sequential training examples.

First of all, from the experimental results in Table 3 and Table 4, we observe that among all the OMKC algorithms, \(\mathrm{OMKC}_{\mathrm{(S,S)}}\), using both the stochastic updating and the stochastic combination approach, is the most efficient; the two deterministic updating algorithms, \(\mathrm{OMKC}_{\mathrm{(D,D)}}\) and \(\mathrm{OMKC}_{\mathrm{(D,S)}}\), are the least efficient; and the \(\mathrm{OMKC}_{\mathrm{(S,D)}}\) algorithm is slower than \(\mathrm{OMKC}_{\mathrm{(S,S)}}\) but faster than \(\mathrm{OMKC}_{\mathrm{(D,D)}}\).

To examine the results more comprehensively, Table 5 shows the detailed quantitative evaluation of the time cost and the average speedup achieved by the stochastic algorithms over \(\mathrm{OMKC}_{\mathrm{(D,D)}}\) across the datasets. We have several observations. First, consistent with the previous results, the two deterministic updating algorithms, \(\mathrm{OMKC}_{\mathrm{(D,D)}}\) and \(\mathrm{OMKC}_{\mathrm{(D,S)}}\), have similar time efficiency, with \(\mathrm{OMKC}_{\mathrm{(D,S)}}\) being slightly more efficient because it selectively combines only a subset of kernel classifiers during online prediction. Second, comparing the two updating approaches, the results again verify that the stochastic updating algorithms are considerably more efficient than the deterministic updating algorithms. Specifically, compared with the \(\mathrm{OMKC}_{\mathrm{(D,D)}}\) algorithm, \(\mathrm{OMKC}_{\mathrm{(S,S)}}\) usually saves about 30%–80% of the time cost, while \(\mathrm{OMKC}_{\mathrm{(S,D)}}\) saves about 5%–60%.

Table 5 Evaluation of the time efficiency of the different OMKC algorithms. The last three columns show the time cost ratio of each of the other three OMKC algorithms to the \(\mathrm{OMKC}_{\mathrm{(D,D)}}\) algorithm

In addition, we observe that the larger the dataset, the more time the two stochastic updating algorithms typically save. For example, on the w5a dataset with 9888 examples, \(\mathrm{OMKC}_{\mathrm{(S,S)}}\) saves over 80% of the time cost, while \(\mathrm{OMKC}_{\mathrm{(S,D)}}\) saves over 60%. Finally, we note that the exact fraction of time saved by \(\mathrm{OMKC}_{\mathrm{(S,S)}}\) or \(\mathrm{OMKC}_{\mathrm{(S,D)}}\) also depends on the number of kernels. Nonetheless, these observations show that the stochastic algorithms are important and scalable for large-scale online learning tasks.

5.6 Effect of discount weight parameter β

One important parameter in the proposed OMKC algorithms is the discount weight parameter β. In all the previous experiments, we simply fixed β to 0.8. Although the mistake bound analysis prescribes a choice of β, those values tend to be highly overestimated due to the approximation of \(\sum_{t=1}^{T} z_{i}^{t}\) by T. In this section, we empirically examine the effect of the discount weight parameter. Figure 4 shows the performance of the proposed OMKC algorithms on several randomly chosen datasets with β varied from 0.05 to 1.0.

Fig. 4

Evaluation of the impact by the discount weight parameter (β)

According to Fig. 4, we observe that the value of β does affect the predictive performance of the OMKC algorithms on most datasets. Further, no single value of β is universally optimal for every dataset. Finally, we found that for both the \(\mathrm{OMKC}_{\mathrm{(D,D)}}\) and \(\mathrm{OMKC}_{\mathrm{(S,S)}}\) algorithms, the best β often falls between 0.7 and 0.9. We believe this is because when β is too small (e.g., smaller than 0.5), it penalizes the misclassified kernel classifiers too heavily at the very beginning of the learning process, leading to poor performance in finding the optimal combination weights. This is broadly consistent with our analysis, since the optimal choice of β is \(\sqrt{T}/[\sqrt{T} + \sqrt{\ln m}]\), which is close to 1 when T is large.
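As a quick check of this argument, the theoretical choice can be evaluated directly. The snippet below computes \(\beta = \sqrt{T}/[\sqrt{T} + \sqrt{\ln m}]\) for T rounds and m kernels; the helper name is ours.

```python
import math

def optimal_beta(T, m):
    """Discount weight suggested by the mistake-bound analysis:
    beta = sqrt(T) / (sqrt(T) + sqrt(ln m)),
    where T is the number of rounds and m > 1 the number of kernels."""
    return math.sqrt(T) / (math.sqrt(T) + math.sqrt(math.log(m)))
```

For instance, with T = 10000 and m = 16 the formula gives roughly 0.98, noticeably above the empirically good range of 0.7–0.9, illustrating why the theoretical value overestimates the best practical choice when the bound's approximations are loose.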

6 Discussions and future directions

Despite the encouraging results achieved, the current OMKC solutions can be improved in many aspects, since OMKC is a new and open research problem. Below we discuss several possible directions for future research.

First of all, the approach to learning each individual kernel-based classifier can be improved. The current approach used in our OMKC algorithms is an adaptation of the regular kernel Perceptron algorithm (Rosenblatt 1958; Freund and Schapire 1999). It may be possible to improve the learning performance and scalability by adopting more advanced online learning algorithms (Crammer et al. 2006; Orabona et al. 2008; Zhao et al. 2012).

Second, the approach to combining the kernel-based classifiers for prediction can be improved. Instead of using the Hedge algorithm, we might explore other, more general approaches for online prediction with expert advice, such as the “Follow the Perturbed Leader” approaches (Kalai and Vempala 2005; Hutter and Poland 2005).

Third, instead of assuming a finite set of given kernels, it might be possible to investigate OMKC for learning with an infinite number of kernels, which is somewhat similar to other existing infinite kernel learning studies (Argyriou et al. 2006; Chen and Ye 2008).

Fourth, the current algorithms are designed for online classification tasks. It is also possible to investigate online-to-batch conversion algorithms to extend our method to batch classification, following similar studies in the literature (Dekel and Singer 2005; Dekel 2008).

Fifth, to further speed up our techniques for very large scale applications, it should be straightforward to parallelize our method using emerging parallel computing technologies, such as multi-processor and multi-core programming techniques.

Finally, it is possible to extend the proposed OMKC framework to various real-world applications. For example, the current approach assumes online learning with two-class examples; future work is necessary to handle multi-class learning applications.

7 Conclusions

This paper investigated a new problem, “Online Multiple Kernel Classification” (OMKC), which aims to attack an online learning task by learning a kernel-based prediction function from a pool of predefined kernels. To address this challenge, we proposed a novel framework that combines two types of online learning algorithms: the Perceptron algorithm, which learns a classifier for a given kernel, and the Hedge algorithm, which combines multiple kernel classifiers by linear weighting. The key to an OMKC task is a proper selection strategy for choosing a set of kernels from the predefined pool, both for online classifier updates and for classifier combination in prediction. To address this key issue, we presented two kinds of selection strategies: (i) a deterministic approach that simply chooses all of the kernels, and (ii) a stochastic approach that randomly samples a subset of kernels according to their weights. Specifically, we proposed four variants of OMKC algorithms by adopting different online updating and combination strategies (i.e., deterministic or stochastic). Interestingly, each of these four OMKC algorithms enjoys different merits in different scenarios.

To examine the empirical performance of the proposed OMKC algorithms, we conducted extensive experiments on a testbed of 15 diverse real datasets. The promising results reveal three major findings: (1) all the OMKC algorithms always perform better than a regular Perceptron algorithm with an unbiased linear combination of multiple kernels, mostly perform better than the Perceptron algorithm with the best kernel found by validation, and often perform better than, or at least comparably to, a state-of-the-art online MKL algorithm; (2) between the two updating strategies, the stochastic updating strategy significantly improves efficiency while maintaining performance at least comparable to that of the deterministic approach; (3) between the two combination strategies, the deterministic combination strategy usually yields better results, while the stochastic combination strategy produces a significantly sparser classifier.