Neural Generalization of Multiple Kernel Learning

Multiple Kernel Learning (MKL) is a conventional way to learn the kernel function in kernel-based methods, and MKL algorithms enhance the performance of kernel methods. However, these algorithms have lower complexity than deep learning models and are inferior to them in terms of recognition accuracy. Deep learning models can learn complex functions by applying nonlinear transformations to data through several layers. In this paper, we show that a typical MKL algorithm can be interpreted as a one-layer neural network with linear activation functions. Based on this interpretation, we propose a Neural Generalization of Multiple Kernel Learning (NGMKL), which extends the conventional multiple kernel learning framework to a multi-layer neural network with nonlinear activation functions. Our experiments on several benchmarks show that the proposed method increases the complexity of MKL algorithms and leads to higher recognition accuracy.


Introduction
Kernel methods are a class of machine learning algorithms, which have been widely used in different types of problems (Grauman and Darrell, 2005; Vedaldi et al, 2009; Miwa et al, 2009; Longworth and Gales, 2009). The performance of kernel methods is highly dependent on the type of kernel and the parameters of that kernel. This raises the challenge of learning the kernel function. Multiple Kernel Learning (MKL), which learns the combination of a finite set of kernels, is a principled way to address this challenge (Lanckriet et al, 2004; Bach et al, 2004; Sonnenburg et al, 2006b; Bucak et al, 2013; Gönen and Alpaydın, 2011). In MKL algorithms, a conventional way for kernel learning is to combine the kernel functions linearly. More sophisticated approaches learn nonlinear combinations of kernels (Varma and Babu, 2009; Cortes et al, 2009; Xia and Hoi, 2012; Zhuang et al, 2011). However, it is not clear that these methods obtain better performance than a simple linear combination of kernels, and there is still room for improvement (Bucak et al, 2013).
Similar to other kernel-based methods, two well-known properties of typical MKL algorithms are (i) these methods usually converge to a global solution since they can be formulated as a convex optimization problem, and (ii) kernel functions can be used in large-margin classifiers, for example, Support Vector Machines (SVMs), which reduces overfitting and may lead to better generalization. Although these features have been very appealing in the traditional view of machine learning, their importance has been challenged by the remarkable successes of deep learning methods.
Deep learning models pass data through several layers, which produces a rich representation of the data and provides a way to learn complex functions. There are remarkable differences between the framework of deep models and kernel methods. In contrast to kernel methods, deep learning models are not usually trained by a convex optimization problem. While it might be considered a drawback, as argued by Bengio et al (2007), sacrificing convexity may be inevitable for learning complex functions. Furthermore, contrary to kernel methods that use hinge loss, which is margin-based, deep learning models use softmax with cross-entropy in the last layer for classification. However, Rosasco et al (2004) have shown that softmax with cross-entropy loss and hinge loss have very similar characteristics, and both entail a loss when the margin is not observed. Therefore, it can be said that deep learning methods that use softmax with cross-entropy are inherently margin-based. Finally, using stochastic gradient descent (SGD) with mini-batches for optimizing deep learning models provides a way for highly parallel computations via GPUs, which has led to state-of-the-art results in large-scale problems.
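To make this point concrete, for a binary label y ∈ {−1, +1} and a score f(x), the two losses can be written side by side (these are the standard textbook forms, not taken from this paper): the hinge loss is ℓ_hinge(y, f(x)) = max(0, 1 − y f(x)), while the two-class softmax with cross-entropy, applied to the difference of the two class scores, reduces to the logistic loss ℓ_log(y, f(x)) = log(1 + e^{−y f(x)}). Both are decreasing functions of the margin y f(x) and incur a penalty whenever the margin is small or negative.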
In recent years, finding the connection between kernel and deep learning methods has become an active research topic. These works have been done mainly with the purpose of boosting kernel methods (Cho and Saul, 2010; Mairal et al, 2014; Song et al, 2018) or achieving a better understanding of deep learning models (Daniely et al, 2016; Jacot et al, 2018; Belkin et al, 2018). In this paper, we investigate the connection between deep learning and a set of well-known algorithms in kernel methods, namely MKL, for improving kernel-based algorithms. We show that the conventional MKL algorithms can be interpreted as a one-layer neural network without nonlinear activation functions in which kernel values represent the input. From this point of view, we propose a Neural Generalization of Multiple Kernel Learning (NGMKL), which extends the shallow linear neural interpretation of MKL to a multilayer neural network with nonlinear activation functions. This practice leads to an MKL model with higher complexity, which improves the capacity of MKL models to learn more complex functions.
MKL models with deep structures have been investigated previously through Multilayer Multiple Kernel Learning (MLMKL) methods (Jiu and Sahbi, 2016, 2017, 2019). However, the framework of these methods still sticks to kernelized SVMs, which makes them significantly different from neural networks. More precisely, similar to ordinary MKL methods, MLMKL models use SVMs as classifiers, while neural networks utilize softmax for classification. Secondly, these models impose constraints on the parameters, such as the positivity of the combination weights, to keep the deep kernel function positive definite. However, we note that the rationale behind imposing positive definiteness on kernel functions is to obtain a convex objective function, which MLMKL cannot attain because of its deep structure. As a result, it may not be beneficial to constrain the kernel function to be a Mercer kernel in a deep structure. Finally, in contrast to typical deep learning models, in MLMKL models the weights are not usually trained simultaneously, and the training algorithm alternates between learning the kernel combination weights and the classifier's parameters.
In contrast to MKL and MLMKL algorithms, our proposed method utilizes a softmax classifier in the last layer. There are no constraints, such as the positivity of the combination weights, on the parameters of NGMKL, and all the parameters are trained simultaneously. A consequence of the neural interpretation of MKL is that the framework of deep learning algorithms can be leveraged, which, for instance, provides fast parallel computations via GPUs. Moreover, in the context of SGD, mini-batching allows high throughput since a large number of input examples can be processed on many GPU cores at once. We show that NGMKL outperforms conventional MKL algorithms on datasets commonly used in kernel methods.
The rest of the paper is organized as follows. Section 2 reviews the related work on connecting kernel methods and deep learning, including MLMKL models. In Section 3, we describe the conventional formulation of MKL algorithms. In Section 4, we show how MKL can be seen in the neural network framework and propose the NGMKL model. The comparison of NGMKL with ordinary MKL algorithms through experiments on several datasets is presented in Section 5. Finally, we conclude the paper in Section 6.

Related Work
Several works in the literature investigate the connections between kernel methods and neural networks. In this section, we review three types of these works: (i) kernel functions that provably imitate the architecture of multilayer neural networks, (ii) multilayer neural networks constructed by kernel approximation techniques, and (iii) MKL methods with multilayer structures.
Some methods link kernel methods with neural networks by proposing kernel functions that mimic the computation of neural networks. For example, it has been shown that, similar to a layer of a neural network, the feature map of arc-cosine kernels consists of a linear transformation using an infinite number of weights followed by a nonlinearity (Cho and Saul, 2010). However, the interpreted weights are fixed and have a Gaussian distribution with zero mean and unit variance. To tackle this drawback, Pandey and Dukkipati (2014a) proposed to learn the covariance of the interpreted weights by stretching a finite number of trained weights to an infinite number. Pandey and Dukkipati (2014b) proposed learning the covariance by an iterative approach similar to restricted Boltzmann machines. In arc-cosine kernels, one can change the nonlinearity of the interpreted neural network and extend the structure to multiple layers by using two distinct hyperparameters. The connections between neural networks with an infinite number of weights and kernel methods are also discussed in the context of Gaussian processes, a line of work that has attracted attention for a better understanding of deep learning in recent years (Hazan and Jaakkola, 2015; Wilson et al, 2016; Lee et al, 2017; Jacot et al, 2018).
While the aforementioned methods implicitly view a kernel function as an infinite-width neural network, other methods use kernel approximation techniques (Williams and Seeger, 2001; Rahimi and Recht, 2008) to explicitly construct one layer of a neural network from a kernel function. For example, convolutional kernel networks, which link convolutional neural networks with kernel methods, are constructed by consecutively approximating a convolution kernel (Mairal et al, 2014; Mohammadnia-Qaraei et al, 2018; Mairal, 2016). In another method, called M-DKMO, a neural network is applied to the feature maps of kernel functions approximated by the Nyström method to learn kernels in an end-to-end fashion (Song et al, 2018). Although this method uses neural networks for learning kernels, it differs from the proposed NGMKL method: NGMKL reformulates the MKL problem itself as a neural network, while M-DKMO uses neural networks on top of approximated kernel feature maps.
More similar to our work, some methods propose MKL problems in which the kernels are combined through a neural-network-like architecture. These methods, known as Multilayer Multiple Kernel Learning (MLMKL) models, were first introduced by Zhuang et al (2011). In the method of Zhuang et al (2011), the kernel function is defined as a convex combination of a set of base kernels followed by an exponential nonlinearity, which can be seen as a two-layer MKL problem. Rebai et al (2016) extended two-layer MLMKLs to several layers and optimized the combination parameters by backpropagation. Jiu and Sahbi (2017) utilized unsupervised, semi-supervised, and Laplacian SVM objective functions for MLMKL. To reduce computation and improve the scalability of MLMKL, Jiu and Sahbi (2019) proposed kernel approximation in a greedy layer-wise fashion. To make MLMKL more similar to neural networks, Sahbi (2019) formulated the base kernels as a neural operation and optimized the support vectors of MLMKL along with the other parameters.
Although MLMKL models combine multiple kernels in a multilayer structure, NGMKL differs from them because the multilayer neural architecture in MLMKL is used only for combining the base kernels. In other words, the kernel combination in MLMKL has a multilayer structure, while in our proposed method the whole MKL problem, namely kernel combination and classification, is interpreted as a single neural network that can be extended to a deep structure. As a result, firstly, MLMKLs use SVMs for classification, while NGMKL utilizes softmax with cross-entropy, similar to ordinary neural networks. Secondly, MLMKL has constraints on the parameters, whereas NGMKL is formulated as an unconstrained optimization problem. Finally, in most MLMKL models, the parameters for combining the kernels and the parameters of the classifier are learned alternately, whereas in NGMKL all the parameters are learned simultaneously.

Multiple Kernel Learning
In a typical kernel-based method, data is represented by a positive definite function called a kernel, which computes the similarity between data points. However, a single kernel might not be rich enough to capture the complex underlying patterns of real-world data. In MKL algorithms, data is represented by several kernels, and the aim is to learn the optimal combination of these representations. Using multiple kernels instead of a single kernel increases the learning capacity of kernel-based models. In this section, we describe the conventional framework of MKL models.
Let D = {(x_1, y_1), ..., (x_n, y_n)} consist of n pairs of feature vectors x_i ∈ R^d and their corresponding labels y_i ∈ {−1, +1}. Using a kernel k(·,·) : R^d × R^d → R, each input x is represented by the vector k_x = [k(x, x_1), ..., k(x, x_n)]^T. Then a linear function f : R^n → R parameterized by γ ∈ R^n can be used on top of k_x to predict the label of x:

    f(k_x) = γ^T k_x.    (1)

By using s different base kernels k_1, ..., k_s, the data can be represented by the vectors k_x^1, ..., k_x^s corresponding to these kernels. If these vectors are passed through the function f, the final classifier f_c : R^n → {−1, 1} is formed by combining the outputs of these functions:

    f_c(x) = sign( Σ_{j=1}^{s} β_j γ^T k_x^j ),    (2)

where β_j can be interpreted as the weight of the kernel k_j. In support vector machines, which seek a hyperplane with the maximum margin, the MKL problem can be formulated as the following optimization problem (Sonnenburg et al, 2006b):

    min_β max_α  1^T α − (1/2) (α ∘ y)^T ( Σ_{j=1}^{s} β_j K^j ) (α ∘ y)
    s.t.  0 ≤ α ≤ C·1,  y^T α = 0,  β_j ≥ 0,  Σ_{j=1}^{s} β_j = 1,    (3)

where K^j is the Gram matrix of the kernel k_j on the training data, ∘ denotes the elementwise product, 1 is a vector of ones, and C is the hyperparameter of the SVM. The decision function obtained by solving Eq. (3) is similar to Eq. (2), in which γ = α ∘ y.
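As a concrete illustration of Eqs. (1)-(2), the following sketch shows how the kernel representation k_x and the combined decision value could be computed; it is a minimal NumPy illustration with hypothetical helper names, not the solver of the optimization problem in Eq. (3).

    import numpy as np

    def kernel_representation(x, X_train, kernel):
        # k_x = [k(x, x_1), ..., k(x, x_n)]^T for one base kernel
        return np.array([kernel(x, x_i) for x_i in X_train])

    def mkl_decision(x, X_train, kernels, beta, gamma):
        # Eq. (2): sign of the weighted sum of the per-kernel linear scores
        score = sum(b * (gamma @ kernel_representation(x, X_train, k))
                    for b, k in zip(beta, kernels))
        return np.sign(score)

For example, with kernels = [lambda x, y: x @ y, lambda x, y: (x @ y) ** 2], learned kernel weights beta, and gamma = alpha * y_train obtained by solving Eq. (3), mkl_decision returns the predicted label in {−1, +1}.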

Proposed Model
In this section, we propose a formulation of the MKL problem as a neural network. The core element of a neural network is a basic module in which a linear operation followed by a nonlinear one is applied to the input:

    û = φ(Γu),    (4)

where u ∈ R^n is a feature vector, Γ ∈ R^{n×n} is a matrix of parameters, φ(·) is a nonlinear activation function, and û ∈ R^n is the output of the module.
In MKL, the multiplication of each kernelized representation of the data, k_x, with the parameter vector γ ∈ R^n (Eq. (1)) can be seen as a basic module of a neural network, but without a nonlinearity. By applying a nonlinear function φ(·), Eq. (1) becomes equivalent to this module of a neural network:

    z = φ(γ^T k_x).    (5)

Having multiple kernel representations k_x^1, ..., k_x^s for data x obtained by different kernels, by Eq. (5) we have parameters γ_1, ..., γ_s and representations z^1_1 = φ(γ_1^T k_x^1), ..., z^1_s = φ(γ_s^T k_x^s) corresponding to these kernels. We concatenate z^1_1, ..., z^1_s to form z^1 as follows:

    z^1 = [z^1_1, ..., z^1_s]^T.    (6)

Then we apply the basic module of neural networks to z^1 repeatedly:

    z^{i+1} = φ(W^i z^i),  i = 1, ..., l−1,    ŷ = softmax(W^l z^l),    (7)

and the label of the data is defined as k = argmax_i ŷ_i.
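A minimal NumPy sketch of this forward pass (Eqs. (5)-(7)) is given below; it assumes the per-kernel representations k_x^1, ..., k_x^s are precomputed and stored in a list, and all names and shapes are illustrative rather than the authors' implementation.

    import numpy as np

    def softmax(v):
        e = np.exp(v - v.max())
        return e / e.sum()

    def ngmkl_forward(k_x_list, gammas, Ws, phi=np.tanh):
        # Eqs. (5)-(6): one nonlinear scalar per base kernel, concatenated into z^1
        z = np.array([phi(g @ k_x) for g, k_x in zip(gammas, k_x_list)])
        # Eq. (7): hidden layers followed by the softmax output layer
        for W in Ws[:-1]:
            z = phi(W @ z)
        return softmax(Ws[-1] @ z)  # predicted label: argmax of the output

With two layers (the setting used in our experiments), Ws contains a single matrix that maps z^1 directly to the class scores.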
For training NGMKL, we use the cross-entropy loss function. Assume D = {(x_1, y_1), ..., (x_n, y_n)} consists of n pairs of data x_i ∈ R^d and corresponding labels y_i ∈ {0, 1}^c, where y_{ik} = 1 if k is the relevant class for the sample x_i, and y_{ik} = 0 otherwise. The cross-entropy loss for a sample (x, y), in which y_k = 1 and y_{i≠k} = 0, is defined as follows:

    L(x, y) = − Σ_{i=1}^{c} y_i log ŷ_i = − log ŷ_k.    (8)

We train all the parameters of NGMKL, including W^1, ..., W^l and γ_1, ..., γ_s, simultaneously using the backpropagation algorithm. For a two-layer NGMKL, in which ŷ = softmax(W^1 z^1), the derivatives of the cross-entropy loss with respect to the parameters are as follows:

    ∂L/∂w^1_{i*} = (ŷ_i − y_i) (z^1)^T,
    ∂L/∂γ_j = ( Σ_{i=1}^{c} (ŷ_i − y_i) w^1_{ij} ) φ′(γ_j^T k_x^j) k_x^j,    (9)

where w^1_{i*} is the i-th row of W^1 and ŷ = [ŷ_1, ..., ŷ_c]^T. As seen in Eq. (3), the SVM framework of MKL has several constraints on the parameters. These constraints preserve the convexity of the problem and ensure a large margin. However, in our proposed model there is no constraint on the parameters, and the goal of large-margin classification is achieved by combining softmax with cross-entropy loss and weight decay (which penalizes large weights). In Section 5, we show that although NGMKL is formulated as a non-convex optimization problem, it can outperform typical MKL models.
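To illustrate these gradients, the following sketch computes them for the two-layer case with NumPy; it assumes plain tanh as φ (the experiments use a scaled variant), one-hot labels, and helper names of our own choosing.

    import numpy as np

    def two_layer_grads(k_x_list, gammas, W1, y, phi=np.tanh):
        # forward pass: z^1_j = phi(gamma_j^T k_x^j), y_hat = softmax(W1 z^1)
        pre = np.array([g @ k for g, k in zip(gammas, k_x_list)])
        z1 = phi(pre)
        logits = W1 @ z1
        y_hat = np.exp(logits - logits.max())
        y_hat /= y_hat.sum()
        delta = y_hat - y                      # dL/dlogits for softmax + cross-entropy
        dW1 = np.outer(delta, z1)              # rows are (y_hat_i - y_i) * (z^1)^T
        dphi = 1.0 - z1 ** 2                   # tanh'(pre) expressed through z1
        dgammas = [(delta @ W1[:, j]) * dphi[j] * k_x_list[j]
                   for j in range(len(gammas))]
        return dW1, dgammas

These gradients can then be plugged into any SGD-style update, with weight decay added as described above.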

Experimental Results
In this section, we evaluate NGMKL on several classification problems and compare it with conventional MKL models. We use a two-layer NGMKL and three variants of this model, namely "NGMKL1", "NGMKL2", and "NGMKL3", which differ in the choice of the input kernels. In the following subsections, we describe the datasets used for the evaluation, the baseline models with which we compare NGMKL, the architecture and optimization settings of NGMKL, and the three variants of NGMKL.

Datasets
Since the conventional MKL models are not scalable, we use UCI datasets to evaluate the models. These datasets are described in Table 1. Following Rätsch et al (2001), each dataset is randomly split into training and testing data, and this practice is repeated 100 times, except for the image dataset, which has 20 splits.

NGMKL setup
We use a two-layer NGMKL in all the experiments. The nonlinearity of the hidden layer is the scaled hyperbolic tangent f(x) = 1.7159 tanh((2/3)x). We use the Stochastic Gradient Descent (SGD) algorithm to train our model, with a batch size of 40, a learning rate of 0.01, and a weight decay of 5 × 10^−6.
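For reference, a minimal sketch of these settings is given below (NumPy; the helper names and the plain SGD update rule are ours, and we assume the weight decay is applied as an L2 penalty added to the gradient).

    import numpy as np

    def scaled_tanh(x):
        # hidden-layer activation: 1.7159 * tanh((2/3) x)
        return 1.7159 * np.tanh((2.0 / 3.0) * x)

    BATCH_SIZE = 40
    LEARNING_RATE = 0.01
    WEIGHT_DECAY = 5e-6

    def sgd_step(param, grad, lr=LEARNING_RATE, wd=WEIGHT_DECAY):
        # plain SGD with L2 weight decay
        return param - lr * (grad + wd * param)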

Three variants of NGMKL
We use three variants of NGMKL, named "NGMKL1", "NGMKL2", and "NGMKL3", according to the method used for selecting the input kernels from the base kernels. There are 17 base kernels in all the models, including the NGMKL variants and the baselines: three polynomial kernels (k(x, y) = (x^T y)^d) of degrees 1, 2, and 3, as well as 14 Gaussian kernels (k(x, y) = exp(−||x − y||^2 / (2σ^2))) with kernel widths σ = 2^−6, ..., 2^7 (a sketch of these base kernels follows the list below). The three variants of NGMKL are as follows:

- NGMKL1: All 17 base kernels are used as the input kernels in this model.
- NGMKL2: ℓ1-MKL (Sonnenburg et al, 2006a) is used for selecting the input kernels from the base kernels. ℓ1-MKL produces sparse weights for combining the base kernels by penalizing the ℓ1 norm of the weights.
- NGMKL3: MKBoost-D1 (Xia and Hoi, 2012) is used for selecting the input kernels among the 17 base kernels. In each trial of the boosting method, MKBoost-D1 selects the kernel with the minimum error and adds it to the set of selected kernels. The selection is with replacement.
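The base kernel family above could be instantiated as in the following sketch (NumPy; the function names are ours, not from the paper):

    import numpy as np

    def poly_kernel(d):
        return lambda x, y: (x @ y) ** d

    def gaussian_kernel(sigma):
        return lambda x, y: np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

    # 3 polynomial kernels (degrees 1-3) and 14 Gaussian kernels (sigma = 2^-6, ..., 2^7)
    base_kernels = [poly_kernel(d) for d in (1, 2, 3)]
    base_kernels += [gaussian_kernel(2.0 ** p) for p in range(-6, 8)]
    assert len(base_kernels) == 17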

Comparison results
A comparison of the three variants of NGMKL is shown in Table 2. All NGMKL variants generally achieve similar results; however, NGMKL3 achieves better results than the other variants on 9 datasets. The second most successful variant is NGMKL2, achieving better results on 6 datasets, and NGMKL1 has a better error rate on the remaining datasets. Table 3 compares popular MKL algorithms with NGMKL3. Comparing NGMKL3 and the best competing method in each row, we have highlighted statistically significant results, determined by an unpaired t-test, in boldface. The results for the "monks" and "sonar" datasets are not statistically significant. Overall, for 17 datasets, the differences between NGMKL3 and the best other method are statistically significant. According to Table 3, all regular MKL algorithms have the same error rate because they have the same formulation but adopt different strategies to speed up the optimization process. However, NGMKL3 is superior to the traditional MKL methods on most datasets (12 out of 17).
In the next step, we evaluate NGMKL3 with different numbers of layers. For simplicity, we only compare three settings (L = 2, 3, 4). With a higher number of layers, the error rate remains unchanged on most of the datasets (16 out of 19), and only three of them (namely a1a, a3a, and a4a) show a decrease in the classification error rate (Figure 2). As seen from Table 1, these three datasets have more dimensions than the others. Since our proposed method (NGMKL3) can learn complex functions by extending the layers with nonlinear activation functions, the classification error on these three datasets decreases. However, extending the layers cannot improve accuracy on most datasets, especially the UCI datasets, since most of them are not complex enough. In addition, the three variants of our proposed method are not sensitive to the number of layers; therefore, we compared our proposed methods with the conventional methods using two layers.
In Figure 3, we compare the performance of NGMKL3 with two Artificial Neural Networks (ANNs) with two layers. The architecture of the ANN, including the layers, the neurons within each layer, and the activation functions, is the same as that of NGMKL3. To configure and train the ANN, two approaches can be considered. One approach is to use initial weights from MKL and freeze them during training (ANN1), and the second is to use random initial weights and train the whole neural network (ANN2). The reason behind the first approach is to demonstrate the importance of the kernel layer and provide

Conclusion
In this paper, we showed that the framework of MKL can be interpreted as a neural network with one hidden layer in which the input data is represented by a set of kernels. Using this interpretation, we proposed NGMKL, a neural generalization of the MKL problem. Like conventional neural networks, and unlike typical MKL algorithms, NGMKL employs a softmax layer for classification. Furthermore, there are no constraints on the parameters, and all the parameters are trained simultaneously. Moreover, it leverages nonlinear transformations, and the architecture can be extended to multiple layers. We evaluated NGMKL on the UCI datasets and compared it with typical MKL algorithms. The experimental results show that the proposed model can boost the performance of MKL models.

Figure 1: Architecture of the proposed network with L layers.

Figure 2: Evaluating the performance of NGMKL3 with different numbers of layers.

Figure 3: Evaluating the performance of NGMKL3 with two variants of Artificial Neural Networks (ANNs) with two layers.

Figure 4: The number of iterations needed to train NGMKL3 and ANN2 on the b.cancer (left subfigure) and waveform (right subfigure) datasets.

Table 1: Summary of the datasets used in the experiments, taken from the UCI repository.

Table 2: The results of our three proposed methods [mean classification error ± std] in percent on UCI datasets. The best performance on each dataset is highlighted in bold.

Table 3: The average testing errors [mean classification error ± std] in percent of five popular MKL algorithms and one of the proposed methods (NGMKL3). The relative ranking of each algorithm on each dataset is shown in parentheses. Rank 1 refers to the algorithm that achieves the highest accuracy on the given dataset.