1 Introduction

Kernel algorithms have been successfully applied to various machine learning applications. Compared to other machine learning approaches, kernel algorithms have a strong theoretical foundation and have become a popular tool because of their guaranteed convergence and good generalization capacity. Support Vector Machines [3], Kernel Principal Component Analysis [16], and Kernel Ridge Regression [14] are examples of kernel algorithms.

Kernel methods represent the solution f of the learning problem in the form

$$\begin{aligned} f(x)=\sum _{i=1}^{N}\alpha _{i}k(x, x_i)\end{aligned}$$
(1)

where \(x_i \in \mathbb {R}^n\), \(i= 1, \dots , N\), are the given inputs, k is the reproducing kernel of the reproducing kernel Hilbert space (RKHS) in which f lies, and \(\alpha _i \in \mathbb {R}\), \(i = 1,2,\dots , N\), are the coefficients to be learned.

The performance of a kernel algorithm depends on the selection of the reproducing kernel. The selection of a suitable kernel can be automated using multiple kernel learning (MKL) algorithms, that is, algorithms that select the most suitable reproducing kernel from a pool of candidate kernels by themselves. Many MKL formulations have been proposed for learning the kernel; they are extensively surveyed in [12].

Generally, in multiple kernel learning algorithms, the reproducing kernel is defined as a linear combination of a set of kernels. Using this concept, (1) can be written as

$$\begin{aligned} f(x) = \sum _{i=1}^N \alpha _i \sum _{l=1}^P d_l k_l(x_i, x), \quad d_l \ge 0 \end{aligned}$$
(2)

where \(k_l\) are the reproducing kernels under consideration. The parameters in (2) can be optimized using either a two-step optimization [15] or a one-step optimization [11]. In the one-step method, all the parameters are updated in each iteration of the optimization algorithm. In the two-step method, the learning parameters (\(\alpha _i\)) are optimized in the first step with the kernel weights fixed, the kernel weights (\(d_l\)) are updated in the next step with the learning parameters fixed, and this process continues until convergence. The one-step method mostly uses an alignment measure [5] defined between kernels. The works [7, 9, 19] are extensions of the one-step optimization technique in which the objective is to minimize the alignment between the ideal kernel and the combination of kernels, using techniques such as semi-definite programming and advanced gradient-based methods. The works [6, 18] use the two-stage optimization technique for solving MKL. Faster optimization of the parameters for adapting to large-scale data sets is detailed in [2, 17]. Non-linear combinations of kernels have been used in [4].
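As a concrete illustration of the combined kernel in (2), the following minimal Python sketch (not the paper's MATLAB implementation; all names and parameter values here are assumed for illustration) forms a non-negatively weighted sum of Gaussian base kernels and evaluates \(f(x)\):

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma):
    """Gaussian (RBF) base kernel matrix between the rows of X1 and X2."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

def combined_kernel(X1, X2, d, sigmas):
    """Weighted combination K = sum_l d_l * K_l with d_l >= 0, as in (2)."""
    return sum(w * gaussian_kernel(X1, X2, s) for w, s in zip(d, sigmas))

# Illustrative usage: f(x) = sum_i alpha_i * sum_l d_l k_l(x_i, x)
X = np.random.randn(50, 3)              # training inputs (assumed)
x = np.random.randn(1, 3)               # a test input
alpha = np.random.randn(50)             # coefficients, assumed already learned
d = np.array([0.5, 0.3, 0.2])           # non-negative kernel weights
sigmas = [0.5, 1.0, 2.0]                # hyperparameters of the base kernels
f_x = alpha @ combined_kernel(X, x, d, sigmas)   # prediction as in (2)
```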

A binary classification approach for finding the optimal kernel associated with binary classification problems was used in [10]. In this approach the optimal kernel is a function \(f^* : \mathcal {X}^* \subset \mathbb {R}^P \rightarrow \mathbb {R}\) such that

$$\begin{aligned} f^*(z) = d^T z \end{aligned}$$
(3)

where \(\mathcal {X}^* = Range(k_1(\cdot , \cdot )) \times Range(k_2(\cdot , \cdot )) \times \dots \times Range(k_P(\cdot , \cdot ))\) and \(d = [d_1,d_2, \dots , d_P]^T \in \mathbb {R}^P\) is as given in (2). From (3) it is clear that \(f^*\) is a hyperplane defined on \(\mathcal {X}^*\). Using this approach, (2) is represented as

$$\begin{aligned} f(x) = \sum _{i=1}^N \alpha _i f^*(\tilde{K}(x,x_i)) \end{aligned}$$
(4)

where \(\tilde{K}(x,x_i)=[k_1(x,x_i) \;\;k_2(x,x_i) \;\;\dots \;\;k_P(x,x_i)]^T.\)

\(f^*\) is learned using the \(N^2\) data points \(\{(\tilde{K}(x_i,x_j), y_iy_j),\; i,j = 1, 2,\dots , N\}.\) The outputs for \(f^*\) are generated using the ideal kernel, that is, \(f^*(\tilde{K}(x_i,x_j))= k(x_i, x_j) = y_i y_j \), where \(x_i\) and \(x_j\) are input data points and \(y_i\) and \(y_j\) are the corresponding labels.
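A minimal sketch of how this \(N^2\)-point training set for \(f^*\) could be assembled is given below; the function and variable names are ours and are only illustrative:

```python
import numpy as np

def build_fstar_training_set(kernels, X, y):
    """Build the N^2 training pairs (K~(x_i, x_j), y_i * y_j) for f*.

    `kernels` is a list of P base kernel functions k_l(x, x'); the target
    for each pair is given by the ideal kernel y_i * y_j.
    """
    N = len(X)
    inputs, targets = [], []
    for i in range(N):
        for j in range(N):
            k_vec = np.array([k(X[i], X[j]) for k in kernels])  # K~(x_i, x_j) in R^P
            inputs.append(k_vec)
            targets.append(y[i] * y[j])
    return np.array(inputs), np.array(targets)   # shapes (N^2, P) and (N^2,)
```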

The main contribution of this paper is the formulation of MKL as a regression problem for solving regression data sets. For that, the methodology of [10] is adopted. We prove that the ideal kernel for this formulation is the same as that of [10]. The main challenge in that approach is that, for training \( f^* \), \( N^2 \) training points have to be stored in memory. In [10], a fast optimization algorithm using all \( N^2 \) points was employed for training \(f^*\). In contrast, we use a data compression technique, namely a supervised pre-clustering approach, for selecting the vital points. Kernel ridge regression is used for finding the models.

The rest of the paper is organized as follows. Section 2 gives the details of the proposed model: Sect. 2.1 proves that the ideal-kernel concept used in classification MKL algorithms is also valid for the MKL regression formulation; the concept of supervised pre-clustering is explained in Sect. 2.2; and the optimization procedure we followed is discussed in Sect. 2.3. Section 3 presents the experimental results and their analysis.

2 Regression Framework for MKL

We adopted the techniques used in [10] for developing the regression framework for MKL. This section explains the different components of the model we developed.

For developing \(f^*\) using regression, input and output data are needed. As the objective of MKL algorithms is to find the best possible kernel, we assume that the output of \(f^*\) is the same as the output of the best available kernel (the ideal kernel). Using the kernel ridge regression framework, we prove that the ideal kernel for regression is \( k(x_i,x_j) = y_i y_j \). The derivation is given below.

2.1 Ideal Kernel Over Regression Data

The cost function corresponding to kernel ridge regression can be stated as

$$\begin{aligned} \min _{\alpha \in \mathbb {R}^N} \;\; \frac{1}{2}\Vert K \alpha - y\Vert ^2 + \frac{\lambda }{2} \alpha ^T K \alpha \end{aligned}$$

where K is the \(N \times N\) kernel matrix, y is the training output vector, \(\lambda > 0\) is the regularization parameter, and \( \alpha \) is the solution vector. The optimal \( \alpha \) is given by

$$\begin{aligned} \alpha =(K+\lambda I)^{-1} y \end{aligned}$$
(5)

Let v be the actual output value for a data point x. Its predicted output \( v_{pred} \) can then be written as

$$\begin{aligned} \tilde{k}^T \; \alpha = v_{pred} \end{aligned}$$
(6)

where \( \tilde{k} = [k(x_1,x) \; k(x_2,x) \dots k(x_N,x)]^T\).

If the \(ij^{th}\) element of the kernel matrix is the ideal kernel value \(k(x_i, x_j) = y_iy_j\), then (5) can be written as

$$\begin{aligned} \alpha =(yy^T+\lambda I)^{-1} y \end{aligned}$$
(7)

where \(y = [y_1, y_2, \dots y_N]^T\)

With the ideal kernel, \( \tilde{k} = vy \), and hence (6) becomes

$$\begin{aligned} v_{pred} = vy^T \; \alpha \end{aligned}$$

Using Eq. (7)

$$\begin{aligned} v_{pred} = vy^T \;(yy^T+\lambda I)^{-1} y \end{aligned}$$
(8)

The inverse appearing in (8) can be computed using the Sherman-Morrison theorem: if A is an invertible square matrix and u, w are column vectors, then

$$\begin{aligned} (A+uw^T)^{-1} = A^{-1}-\frac{A^{-1} uw^T A^{-1}}{1+w^T A^{-1} u} \end{aligned}$$
(9)

Taking \( A = \lambda I \) and \( u = w = y \), we obtain

$$\begin{aligned} (\lambda I+yy^T)^{-1} = (\lambda I)^{-1}-\frac{(\lambda I)^{-1} yy^T (\lambda I)^{-1}}{1+y^T (\lambda I)^{-1} y} = \frac{I}{\lambda }-\frac{ \frac{yy^T}{\lambda ^2} }{1+\frac{y^Ty}{\lambda }} \end{aligned}$$
(10)
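Identities (9) and (10) can be verified numerically; the short sketch below (with arbitrary illustrative values) compares the Sherman-Morrison form against a direct matrix inverse:

```python
import numpy as np

# Numerical check of (9)-(10): (lambda*I + y y^T)^{-1} via Sherman-Morrison
# versus a direct inverse (arbitrary illustrative values).
N, lam = 5, 0.01
y = np.random.randn(N, 1)

direct = np.linalg.inv(lam * np.eye(N) + y @ y.T)
sherman = np.eye(N) / lam - (y @ y.T) / lam ** 2 / (1 + (y.T @ y).item() / lam)

assert np.allclose(direct, sherman)     # the two expressions agree
```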

Now

$$\begin{aligned} y^T(yy^T+\lambda I)^{-1} y = y^T \left( \frac{I}{\lambda }-\frac{ \frac{yy^T}{\lambda ^2} }{1+\frac{y^Ty}{\lambda }} \right) y = \frac{y^Ty}{\lambda } - \frac{ \frac{(y^Ty)^2}{\lambda ^2} }{1+\frac{y^Ty}{\lambda }} = \frac{\frac{y^Ty}{\lambda }}{1+\frac{y^Ty}{\lambda }} \end{aligned}$$
(11)

Therefore

$$\begin{aligned} y^T \;(yy^T+\lambda I)^{-1} y \rightarrow 1, \text {when} \; \lambda \rightarrow 0 \end{aligned}$$
(12)

Substituting Eq. (12) in Eq. (8) we get

$$\begin{aligned} v_{pred} = vy^T \;(yy^T+\lambda I)^{-1} y \approx v \times 1 = v \end{aligned}$$
(13)

This means that \( k(x_i, x_j) = y_iy_j \) is an ideal kernel for regression problems.
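The derivation can also be checked numerically. The sketch below (illustrative values only) builds the ideal kernel matrix, solves (7), and confirms that the prediction in (6) recovers the true output of a test point as \(\lambda \rightarrow 0\):

```python
import numpy as np

# Numerical illustration of (13): with the ideal kernel K = y y^T, kernel ridge
# regression recovers the true output v of a test point as lambda -> 0.
N, lam = 20, 1e-6
y = np.random.randn(N)                  # training outputs (arbitrary)
v = 1.7                                 # assumed true output of a test point

K = np.outer(y, y)                               # ideal kernel matrix, K_ij = y_i y_j
alpha = np.linalg.solve(K + lam * np.eye(N), y)  # equation (7)
k_test = v * y                                   # ideal kernel values k(x_i, x) = y_i v
v_pred = k_test @ alpha                          # equation (6)

print(v_pred)                           # approximately 1.7 for small lambda
```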

2.2 Data Compression

As discussed earlier, the number of data points used to train \(f^*\) scales as \(O(N^2)\). We use a supervised pre-clustering approach to compress the data in an efficient manner.

A supervised pre-clustering approach for scaling kernel-based regression was developed in [13], making use of the concepts of uniform continuity and compactness. In this approach the function f to be learned is uniformly continuous, since it is assumed to lie in a continuous RKHS \(\mathcal {F}\) whose members have a compact domain \(\mathcal {X}\). That is, for the function f and a given similarity threshold \(\epsilon \), there exists a radius \(\delta \), independent of \(x \in \mathcal {X}\), such that

$$\begin{aligned} \hat{d}(f(x), f(x')) < \epsilon \; \forall \; x' \in B(x,\delta ) \end{aligned}$$
(14)

The basic idea of pre-clustering is that data points which satisfy (14) can be considered “similar” and therefore form pre-clusters. The cluster centers are then used as a sparse data set for function estimation.

If \(M \ll N\) is the number of data points after compression, then \(f^*\) can be learned using the \(M^2 \ll N^2\) data points \(\left\{ \left( \tilde{K}(x_i,x_j), y_iy_j \right) , i,j = 1,2,\dots M \right\} .\)
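A simple greedy sketch of such a compression step is shown below; it is our illustrative reading of the pre-clustering idea rather than the exact procedure of [13], and the thresholds delta and eps are assumed inputs:

```python
import numpy as np

def greedy_preclusters(X, y, delta, eps):
    """Greedy supervised pre-clustering sketch (an illustrative reading of the
    idea, not the exact algorithm of [13]): a point joins an existing cluster
    if it lies within radius delta of the cluster center in input space and
    within eps of the center's output value; otherwise it starts a new cluster.
    The cluster centers form the compressed data set of size M << N."""
    centers, outputs = [], []
    for xi, yi in zip(X, y):
        for cx, cy in zip(centers, outputs):
            if np.linalg.norm(xi - cx) < delta and abs(yi - cy) < eps:
                break                      # a similar cluster absorbs the point
        else:
            centers.append(xi)             # no similar cluster: open a new one
            outputs.append(yi)
    return np.array(centers), np.array(outputs)
```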

2.3 Two Stage Approach

We use a two-stage optimization for finding f and \(f^*\): \(f^*\) is learned first, and f is then found using the learned \(f^*\). Kernel ridge regression is used to find both f and \(f^*\).

The \(M^2\) data points obtained from the pre-clustering approach are used to train \(f^*\), that is, the input data is \(\left\{ \left( \tilde{K}(x_i,x_j), y_iy_j \right) , i,j = 1,2,\dots M \right\} ,\) with the corresponding outputs generated using the ideal kernel. As \(f^*\) has the form of a hyperplane, it is assumed to lie in an RKHS whose reproducing kernel is the linear kernel.

Let \(K_f \) be the kernel matrix associated with f; its \(ij^{th}\) element is \((K_f)_{ij} = f^*(\tilde{K}(x_i,x_j))\), where \(\tilde{K}(x_i,x_j)\) is the vector of base kernel values defined in (4). The optimal \(\alpha \) associated with f is found by minimizing

$$\begin{aligned} \frac{1}{2}\Vert K_f \alpha - y\Vert ^2 + \frac{\lambda }{2} \alpha ^T \alpha \end{aligned}$$

On solving this equation, we get \( \alpha \) as

$$\begin{aligned} \alpha =(K_f+\lambda I)^{-1} y \end{aligned}$$
(15)
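Putting Sects. 2.1–2.3 together, the following sketch (our illustration; helper names and regularization values are assumptions, not the paper's settings) trains \(f^*\) by ridge regression with a linear kernel on the compressed points and then solves (15) for \(\alpha \):

```python
import numpy as np

def train_two_stage(base_kernels, Xc, yc, X, y, lam_star=1e-3, lam=1e-3):
    """Two-stage sketch of Sects. 2.1-2.3 (illustrative; names and
    regularization values are assumptions, not the paper's settings)."""
    # Stage 1: learn f*(z) = d^T z on the M^2 compressed pairs.
    # Inputs Z (M^2 x P) are base-kernel vectors, targets are y_i * y_j.
    M, P = len(Xc), len(base_kernels)
    Z = np.array([[k(Xc[i], Xc[j]) for k in base_kernels]
                  for i in range(M) for j in range(M)])
    t = np.array([yc[i] * yc[j] for i in range(M) for j in range(M)])
    # Ridge regression with a linear kernel; the primal solution
    # d = (Z^T Z + lam I)^{-1} Z^T t coincides with the dual one.
    d = np.linalg.solve(Z.T @ Z + lam_star * np.eye(P), Z.T @ t)

    # Stage 2: kernel matrix of f, (K_f)_ij = f*(K~(x_i, x_j)), then solve (15).
    N = len(X)
    Kf = np.array([[d @ np.array([k(X[i], X[j]) for k in base_kernels])
                    for j in range(N)] for i in range(N)])
    alpha = np.linalg.solve(Kf + lam * np.eye(N), y)
    return d, alpha
```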
Fig. 1. Compression rate

Table 1. TSMKLR results

3 Experiments

The algorithm we developed is named Two-Stage Multiple Kernel Learning for Regression (TSMKLR). The experimental results are given below.

3.1 Setup

We implemented the proposed algorithm in MATLAB. The performance of TSMKLR was compared with that of SimpleMKL [15] and SPG-MKL [8] (a modified version of GMKL [18]). The code for SimpleMKL [15] and SPG-MKL [8] was taken from the authors' web pages. All experiments were conducted on the same machine under similar conditions.

Using different hyperparameters in the reproducing kernel functions (Laplacian, Gaussian, and polynomial kernels), 42 base kernels were generated. The \(\sigma \) of both the Laplacian and Gaussian kernels was assigned values from \(\{2^{-9}, 2^{-8}, \dots , 2^{9}\}\), and polynomial kernels of degree 1, 2, 3, and 4 were used. The performance of the proposed model was assessed using the root mean square error (RMSE). The data sets were collected from the UCI repository [1].
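The kernel pool can be generated as in the sketch below; the exact parameterization of each base kernel (how \(\sigma \) enters the Laplacian and Gaussian kernels, and the offset of the polynomial kernel) is our assumption:

```python
import numpy as np

def make_base_kernels():
    """Build the pool of 42 base kernels described above (a sketch; the exact
    parameterization of each kernel is an assumption)."""
    kernels = []
    for p in range(-9, 10):                       # 19 bandwidths 2^-9, ..., 2^9
        sigma = 2.0 ** p
        kernels.append(lambda x, z, s=sigma:      # Laplacian kernel
                       np.exp(-np.linalg.norm(x - z, 1) / s))
        kernels.append(lambda x, z, s=sigma:      # Gaussian kernel
                       np.exp(-np.linalg.norm(x - z) ** 2 / (2 * s ** 2)))
    for deg in (1, 2, 3, 4):                      # polynomial kernels of degree 1-4
        kernels.append(lambda x, z, q=deg: (1.0 + x @ z) ** q)
    return kernels

assert len(make_base_kernels()) == 42             # 19 + 19 + 4
```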

3.2 Results and Analysis

The data was compressed using the pre-clustering approach; the compression ratios for the data sets are shown in Fig. 1. The compressed data was used to compute the training points for \( f^* \), and f was then computed using \( f^* \). The experimental results are shown in Table 1. TSMKLR produced superior results in comparison with the other models, and the differences between the results of TSMKLR and those of the other models were statistically significant.

A t-test was performed over the results of 30 hold-out repetitions to verify the statistical significance of the results (significance level 0.1). Based on this statistical significance measure, the models were ranked by their performance on each data set. For example, let \(M_1\) and \(M_2\) be two models and let \(P_1\) and \(P_2\) be the values of a performance measure P on a given data set D. Then \(M_1\) is said to be better than \(M_2\) with respect to P on D if \(P_1 > P_2\) and the difference is statistically significant.
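The ranking rule can be summarized by the following sketch (our illustration; it assumes independent two-sample t-tests over the 30 hold-out RMSE values, where lower RMSE is better):

```python
import numpy as np
from scipy import stats

def significantly_better(rmse_a, rmse_b, level=0.1):
    """Decide whether model A beats model B on one data set (a sketch of the
    ranking rule described above, adapted to RMSE where lower is better)."""
    rmse_a, rmse_b = np.asarray(rmse_a), np.asarray(rmse_b)
    _, p_value = stats.ttest_ind(rmse_a, rmse_b)   # t-test over the 30 hold-out runs
    return rmse_a.mean() < rmse_b.mean() and p_value < level
```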

4 Conclusion

We have extended the two-stage MKL binary classification framework to the regression domain. For that, we proved that the ideal kernel for regression is \(k(x_i,x_j) = y_iy_j\). A supervised pre-clustering approach was used to select the vital points. The experimental results show that the proposed framework is a suitable approach for finding the optimal kernel for regression data.