Learning deep kernels in the space of dot product polynomials
Abstract
Recent literature has shown the merits of having deep representations in the context of neural networks. An emerging challenge in kernel learning is the definition of similar deep representations. In this paper, we propose a general methodology to define a hierarchy of base kernels with increasing expressiveness and combine them via multiple kernel learning (MKL) with the aim of generating overall deeper kernels. As a leading example, this methodology is applied to learning the kernel in the space of Dot-Product Polynomials (DPPs), that is, positive combinations of homogeneous polynomial kernels (HPKs). We show theoretical properties about the expressiveness of HPKs that make their combination empirically very effective. This can also be seen as learning the coefficients of the Maclaurin expansion of any positive definite dot-product kernel, thus making our proposed method generally applicable. We empirically show the merits of our approach by comparing the effectiveness of the kernel generated by our method against baseline kernels (including homogeneous and non-homogeneous polynomials, RBF, etc.) and against another hierarchical approach on several benchmark datasets.
Keywords
Multiple kernel learning, Kernel learning, Deep kernel
1 Introduction
Kernel methods have become a standard paradigm in machine learning and are applied in a multitude of different learning tasks. Their fortune is mainly due to their ability to perform well on different domains, provided that ad hoc kernels tailored to that domain can be designed. Given the crucial importance of the kernel adopted for the performance of a kernel machine, researchers are investigating the automatic learning of kernels, also known as kernel learning.
MKL algorithms are supported by several theoretical results bounding the difference between the true error and the empirical margin error (i.e. estimation error). These bounds limit the Empirical Rademacher Complexity (ERC) of the combination of kernels (Cortes et al. 2010; Kloft and Blanchard 2011; Cortes et al. 2013). However, empirical studies on MKL are giving conflicting outcomes concerning the real effectiveness of MKL. For example, doing better than the simple average (or sum) of base kernels seems surprisingly challenging (Xu et al. 2013). This can be due to two main reasons: (i) standard MKL algorithms are typically applied with base kernels which are not so dissimilar to each other and (ii) the combined shallow kernels do not have structural differences, e.g. they have the same degree of abstraction, thus producing shallow representations.
Up to now, MKL research has mainly focused on learning the combination weights. In this work, we take a different perspective on the MKL problem, investigating principled ways to design base kernels so as to make their supervised combination really effective. Specifically, aiming at building deeper kernels, a hierarchy of features of different degrees of abstraction is considered. Features at the top of the hierarchy will be more general and less expressive, while features at the bottom of the hierarchy will be more specific and more expressive. Features are then grouped based on a general-to-specific ordering (their level in the hierarchy) and base kernels are built according to this grouping, in such a way that the supervised MKL algorithm can detect the most effective level of abstraction for any given task. Similarly to the hierarchical kernel learning (HKL) approach in Bach (2009), features that can be embedded in a DAG will be considered.
As a further contribution of this paper, we give a characterization of the specificity of a representation (kernel function). Intuitively, more general representations correspond to kernels constructed on simpler features (e.g. the single variable \(x_i\)), at the top layers of the DAG, while, more specific representations correspond to kernels defined on elaborated features (e.g. high degree product of variables \(\prod _j x_j\)), at the bottom layers of the DAG. The characterization is based on the spectral ratio of the kernel matrices obtained in the target representation. We also prove relationships between the spectral ratio of a kernel matrix with its rank, with the radius of the Minimum Enclosing Ball (MEB) of examples in feature space, and with the ERC of linear functions using that representation.
Although the idea presented above is quite general and applicable in many different contexts, which include ANOVA kernels, kernels for structures, and convolution kernels in general, here we exemplify the approach by focusing on features which are a special kind of monomials (that is, products of powers of variables with nonnegative integer exponents, possibly multiplied by a constant). See Fig. 2 for an example. In this case, base kernels will consist of homogeneous polynomial kernels (HPKs) of different degrees, and their combination is a Dot-Product Polynomial (DPP) of the form \(K(\mathbf{x},\mathbf{z}) = \sum _{s=0}^{R}{a_s (\mathbf{x}\cdot \mathbf{z})^s}\). Exploiting the result in Schoenberg (1942) that any dot-product kernel of the form \(K(\mathbf{x},\mathbf{z}) = f(\mathbf{x}\cdot \mathbf{z})\) can be seen as a DPP, \(K(\mathbf{x},\mathbf{z}) = \sum _{s=0}^{+\infty }{a_s (\mathbf{x}\cdot \mathbf{z})^s}\), it turns out that proposing a method to learn the coefficients of a general DPP means giving a method that can virtually learn any dot-product kernel, including RBF and non-homogeneous polynomials.
A related but different idea is exploited in deep neural networks, for example by defining families of neural networks with polynomial activation functions, as is done in Livni et al. (2014). In this approach the polynomial features are learned as nonlinear combinations of the original variables.
Similarly, another example is described in Livni et al. (2013), where the authors present an efficient deep learning algorithm (with polynomial computational time). The layers of this deep architecture are created one by one and the final predictors of this algorithm are polynomial functions over the input space. Specifically, they create higher and higher levels of representation of the data, generating a good approximation of all the possible values obtained by using polynomial functions with bounded degree over the training set. The final linear combination (in the output layer) is a combination of sets of polynomial functions that depend on the coefficients of the previous layers.
We can easily note that the feature space on which these methods work is completely different from ours. In our case, the hierarchy of features intrinsic in the polynomial kernels is used. This allows us to apply the results of this paper to other kernel functions besides the dot-product kernels and polynomials, including most of the kernels for structures.

We propose a simple to compute qualitative measure of expressiveness of a kernel defined in terms of the trace (or nuclear) and Frobenius norms of the kernel matrix generated using that kernel and we show connections with the rank of the matrix, with the radius of the MEB, and with the ERC of linear functions defined in that feature space;

We propose an MKL-based approach to learn the coefficients of general DPPs and we support the proposal by showing empirically that this approach outperforms the baselines, including RBF, homogeneous and non-homogeneous polynomials (often significantly), on several benchmark datasets in terms of classification performance. Interestingly, the method is very robust to overfitting even when many base HPKs are used, which spares the tedious step of validating the kernel hyperparameters;

Finally, we present empirical evidence that building base kernels by exploiting the structure of the features and their dependencies makes the combined kernel improve upon alternatives which do not exploit the same structure. In particular, a comparison with the HKL approach (Bach 2009; Jawanpuria et al. 2015) on the same DPP learning task shows the advantages of our method in terms of effectiveness and efficiency.
2 Notation, background and related work
In this section, we present the notation and briefly discuss some background and related work useful for the comprehension of the remainder of the paper.
Throughout the paper we consider a binary classification problem with training examples \(\{(\mathbf{x}_1,y_1), \dots , (\mathbf{x}_{L},y_L)\}\), \(\mathbf{x}_i \in \mathbb {R}^m, \Vert \mathbf{x}_i\Vert _2=1, y_i \in \{-1,+1\}\). \(\mathbf{X}\in \mathbb {R}^{L \times m}\) will denote the matrix where examples are arranged in rows and \(\mathbf{y}\in \mathbb {R}^{L}\) the corresponding vector of labels. The symbol \(\mathbf{I}_L\) will indicate the \(L \times L\) identity matrix and \(\mathbf{1}_L\) the L-dimensional vector with all entries equal to 1. A generic entry of a matrix \(\mathbf{M}\) will be indicated by \(\mathbf{M}_{i,j}\) and \(\mathbf{M}_{:,j}\) corresponds to the j-th column vector of the matrix. Unless otherwise indicated, the norm \(\Vert \cdot \Vert \) will refer to the 2-norm of vectors, while \(\Vert \cdot \Vert _F\) and \(\Vert \cdot \Vert _T\) will refer to the Frobenius and trace matrix norms, respectively. \(\mathbf{B}(\mathbf{0},1)\) will denote the unit ball centered at the origin. Finally, \(\mathbb {R}_{+}\) will denote the set of nonnegative reals.
2.1 EasyMKL
EasyMKL (Aiolli and Donini 2015) is a recent MKL algorithm able to combine sets of base kernels by solving a simple quadratic problem. Besides its proven empirical effectiveness, a clear advantage of EasyMKL compared to other MKL methods is its high scalability with respect to the number of kernels to be combined. Specifically, its computational complexity is constant in memory and linear in time.
In the following, with no loss of generality, we consider coefficients of unit 1-norm (i.e. rescaled such that \(\Vert \varvec{\eta }^*\Vert _1 = 1\)) and normalized base kernels. Finally, the output kernel will be computed by \(k_{\mathrm {MKL}}(\mathbf{x},\mathbf{z}) = \sum _{s=0}^R \eta ^{*}_s k_s(\mathbf{x},\mathbf{z})\) which, in this case, can easily be shown to be a normalized kernel as well.
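The last remark can be illustrated numerically. The following is a minimal sketch, not the EasyMKL implementation: the helper names `normalize_kernel` and `combine_kernels` are hypothetical, and the weights are fixed by hand rather than learned. It only shows that a combination of normalized base kernels with nonnegative weights of unit 1-norm has a unit diagonal.

```python
import numpy as np

def normalize_kernel(K):
    """Return K_ij / sqrt(K_ii * K_jj), so that the diagonal becomes 1."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

def combine_kernels(kernels, eta):
    """Weighted sum of kernel matrices, with weights rescaled to unit 1-norm."""
    eta = np.asarray(eta, dtype=float)
    eta = eta / np.abs(eta).sum()
    return sum(w * K for w, K in zip(eta, kernels))

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))          # toy data, not normalized on purpose
G = X @ X.T                          # linear kernel matrix

# Base kernels: normalized homogeneous polynomial kernels of degree 0..3.
base = [normalize_kernel(G ** s) for s in range(4)]

# Hand-picked nonnegative weights (an assumption; EasyMKL would learn them).
K_mkl = combine_kernels(base, [0.1, 0.2, 0.3, 0.4])
```

Since every base kernel has ones on the diagonal and the weights sum to one, the diagonal of `K_mkl` is again all ones.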
2.2 Hierarchical kernel learning
Hierarchical kernel learning (HKL) (Bach 2009) is a generalization of the MKL framework. The idea is to design a hierarchy over the given base kernels/features. In particular, base kernels, each one defined on a single feature, are embedded in a DAG. An \(\ell _1/\ell _2\) block-norm regularization is then added so as to induce a group sparsity pattern. This implies that the prediction function will involve very few kernels. Also, the condition that the kernels be strictly positive makes this hierarchical penalization induce a strong sparsity pattern (Jawanpuria et al. 2015), that is, if a kernel/feature \(k_s\) is not selected, then none of the kernels associated with its descendants in the DAG are selected. Moreover, the weight \(\eta _s\) assigned to a kernel associated with a specific DAG node is always greater than the weight of the kernels associated with its descendants, basically giving a bias toward more general features. Interestingly, even if the DAG is exponentially large, the proposed HKL optimization algorithm is able to work with polynomial complexity.
As noted in Jawanpuria et al. (2015), the sparsity pattern enforced by HKL can lead to the selection of many redundant features, namely the ones at the top of the DAG. For this reason, in the same work, a variant of HKL, called generalized HKL (gHKL), is presented that partially overcomes this problem. The gHKL framework has a more flexible kernel selection pattern, obtained by using an \(\ell _1 / \ell _p\) regularizer with \(p \in (1,2]\), and maintains the polynomial complexity of the original method.
As stated in the original paper, this generic regularizer enables the gHKL formulation to be employed in the rule ensemble learning (REL) where the goal is to construct an ensemble of conjunctive propositional rules. From this point of view, the task of gHKL is slightly different from ours (i.e. classification).
2.3 Dot-product polynomial kernels
A generalized polynomial kernel can be built on top of any other valid base kernel as in \(k(\mathbf{x},\mathbf{z}) = p(k_0(\mathbf{x},\mathbf{z}))\), where the base kernel \(k_0\) is a valid kernel and \(p: \mathbb {R}\longrightarrow \mathbb {R}\) is a polynomial function with nonnegative coefficients, that is \(p(x) = \sum _{s=0}^{d} a_s x^s, a_s \ge 0.\) In this paper, we focus on Dot-Product Polynomials (DPPs), the class of generalized polynomial kernels where the simple dot product is used as base kernel, that is \(k(\mathbf{x},\mathbf{z}) = p(\mathbf{x}\cdot \mathbf{z})\).
A well-known result from harmonic theory (Schoenberg 1942; Kar and Karnick 2012) gives us an interesting connection between DPPs and general nonlinear dot-product kernels.
Theorem 1
(Kar and Karnick 2012) A function \(f : \mathbb {R}\rightarrow \mathbb {R}\) defines a positive definite kernel \(k : \mathbf{B}(\mathbf{0},1) \times \mathbf{B}(\mathbf{0},1) \rightarrow \mathbb {R}\) as \(k : (\mathbf{x},\mathbf{z}) \rightarrow f(\mathbf{x}\cdot \mathbf{z})\) iff f is an analytic function admitting a Maclaurin expansion with nonnegative coefficients, \(f(x) = \sum _{s=0}^{\infty } a_s x^s, a_s \ge 0\).
Table 1 Classical dot-product kernels formulated as DPPs with coefficients \(a_s\) (the symbol !! denotes the semifactorial of a number)

| Kernel | Definition | DPP coefficients \(a_s\) |
| --- | --- | --- |
| Polynomial (\(K_{D,c}\)) | \((\mathbf{x}\cdot \mathbf{z}+ c)^D\) | \(\displaystyle {{D}\atopwithdelims (){s}} c^{D-s}, \forall s \in \{0,\dots ,D\}\) |
| RBF (\(K_\mathrm{RBF}^{\gamma }\)) | \( e^{-\gamma \Vert \mathbf{x}-\mathbf{z}\Vert ^2}\) | \( e^{-2\gamma } \frac{(2\gamma )^{2s}}{s!}, \forall s \) |
| Rational quadratic | \(1 - \frac{ \Vert \mathbf{x}-\mathbf{z}\Vert _2^2 }{\Vert \mathbf{x}-\mathbf{z}\Vert _2^2 + c}\) | \( \left( -\frac{ 2 \prod _{j=1}^s 2+(j-1) }{(2+c)^{s+1}} + \frac{\prod _{j=1}^s 2+(j-1)}{(2+c)^s} \right) \frac{1}{s!}, \forall s\) |
| Cauchy | \(\left( 1 +\frac{\Vert \mathbf{x}-\mathbf{z}\Vert _2^2}{\gamma }\right) ^{-1}\) | \( \frac{s!!}{3^{s+1} \gamma ^s} \frac{1}{s!}, \forall s \) |
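As a sanity check of the DPP view, the polynomial kernel \((\mathbf{x}\cdot \mathbf{z}+c)^D\) can be compared with the DPP whose coefficients come from the binomial expansion, \(a_s = \binom{D}{s} c^{D-s}\). The sketch below is only illustrative; the function names `dpp_kernel` and `polynomial_coeffs` and the toy data are assumptions, not part of the original method.

```python
import numpy as np
from math import comb

def dpp_kernel(X, Z, coeffs):
    """Evaluate K = sum_s a_s (x . z)^s for all pairs of rows of X and Z."""
    G = X @ Z.T  # matrix of pairwise dot products
    return sum(a * G ** s for s, a in enumerate(coeffs))

def polynomial_coeffs(D, c):
    """DPP coefficients of (x . z + c)^D, from the binomial expansion."""
    return [comb(D, s) * c ** (D - s) for s in range(D + 1)]

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-norm examples

D, c = 4, 2.0
K_direct = (X @ X.T + c) ** D                       # closed form
K_dpp = dpp_kernel(X, X, polynomial_coeffs(D, c))   # DPP form
```

The two matrices agree up to floating-point error, which is just the binomial theorem applied entrywise.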
3 Learning the kernel from a hierarchy of features
In this section, the principal idea of the paper is described. Similarly to the HKL approach of Sect. 2.2, we also consider a hierarchical set of features which can be mapped into a DAG. However, differently from HKL, we propose to group the features layerwise, that is a different kernel \(k_s\) is built for each layer of the DAG. Kernels defined on bottom layers of the DAG will be more expressive leading to sparser kernel matrices, while kernels defined on top layers will be broader and their kernel matrices denser. In this way a hierarchy of representations of different levels of abstraction is created similarly to what happens in deep learning.
We will give a more formal definition of a measure of expressiveness for a kernel in Sect. 3.1. Generally speaking, we expect that the kernel expressiveness will increase going toward lower layers of the DAG. We also show the connection between our measure of expressiveness with the ERC and the rank of the kernel matrices induced by the kernel function.
Note that the procedure described above to construct base kernels is completely unsupervised and can be considered a sort of pre-training. The rationale of this construction is that features that are too general or too specific tend not to be useful. In particular, too general features are likely to be unable to discriminate, since they tend to emphasize similarities between examples, while, when only too specific features are used, diversity is emphasized instead, as examples are represented in such a way that distances are the same for every pair. It is important to note here that there is a difference between the expressiveness of a kernel (which does not depend on the concept to learn) and the informativeness of a kernel (which says how good the features of the kernel are at discriminating a given concept). Our intuition here is that different tasks defined on the same set of examples may need feature spaces of different expressiveness. Given a binary task, these different representations are aggregated using a maximum-margin-based MKL algorithm, for example EasyMKL, as presented in Sect. 3.2.
3.1 Complexity and expressiveness of kernel functions
Now, we propose a methodology to compare different representations on the basis of the complexity of the hypotheses space induced by the associated kernel function. Representations inducing lower complexity hypotheses will correspond to more abstract or general representations.
Kernel learning methods typically impose some regularization over the combined kernels to limit their expressiveness, with the hope of limiting the overfitting of the hypotheses constructed using those kernels. In the simplest case, the trace of the produced kernel can be used. However, the trace might not be the best choice of a measure of expressiveness for a kernel. For example, the identity matrix \(\mathbf{I}_L \in \mathbb {R}^{L \times L}\) and the constant matrix \(\mathbf{1}_L \mathbf{1}^{\top }_L \in \mathbb {R}^{L \times L}\) have the same trace but it is clear that the associated kernel functions have different expressiveness. In the first case, the examples are orthogonal in feature space and the expressiveness is maximal while, in the second case, they overlap and the expressiveness is minimal.
The expressiveness of a kernel function, that is the number of dichotomies that can be realized by a linear separator in that feature space, is better captured by the rank of the kernel matrices it produces. This can be motivated in several ways. A quite intuitive one can be given using the following theorem.
Theorem 2
Let \(\mathbf{K}\in \mathbb {R}^{L \times L}\) be a kernel matrix over a set of L examples. Let \(\text{ rank }(\mathbf{K})\) be the rank of \(\mathbf{K}\). Then, there exists at least one subset of examples of size \(\text{ rank }(\mathbf{K})\) that can be shattered by a linear function.
Proof
Let \(\mathbf{Y}\in \{-1,+1\}^{L \times L}\) be a diagonal matrix of binary labels for the examples (i.e. a diagonal matrix with the labels on the diagonal). The squared distance between the convex hull of the positive examples and the convex hull of the negative examples can be written as \(\rho ^2 = \min _{\varvec{\gamma } \in {\varGamma }}\varvec{\gamma }^{\top } \mathbf{Y}\mathbf{K}\mathbf{Y}\varvec{\gamma }\) where \({\varGamma }= \{\varvec{\gamma } \in \mathbb {R}_+^{L} \mid \sum _{y_i = +1} \gamma _i = 1 , \sum _{y_i = -1} \gamma _i = 1\}\). If the kernel matrix has maximal rank L, then by the Courant-Fischer theorem (see Shawe-Taylor and Cristianini 2004) we have \(\frac{\varvec{\gamma }^{\top } \mathbf{Y}\mathbf{K}\mathbf{Y}\varvec{\gamma }}{\Vert \mathbf{Y}\varvec{\gamma }\Vert ^2} \ge \lambda _L > 0\) for any \(\varvec{\gamma } \in {\varGamma }\), where \(\lambda _L\) is the minimum eigenvalue of \(\mathbf{K}\). Let \(L_+\) and \(L_-\) be the number of positive and negative examples; then we have \(\varvec{\gamma }^{\top } \mathbf{Y}\mathbf{K}\mathbf{Y}\varvec{\gamma } \ge \lambda _L \Vert \mathbf{Y}\varvec{\gamma }\Vert ^2 \ge (L_+^{-1} + L_-^{-1}) \lambda _L > 0\) for any \(\varvec{\gamma } \in {\varGamma }\). This implies \(\rho ^2 > 0\) (the set can be linearly separated using that feature space) for any possible labeling of the examples, that is, for any choice of the matrix \(\mathbf{Y}\). Now, suppose \(\text{ rank }(\mathbf{K}) < L\); then there exists a principal submatrix of \(\mathbf{K}\) of order \(\text{ rank }(\mathbf{K})\) with maximal rank, which corresponds to selecting a subset of \(\text{ rank }(\mathbf{K})\) examples that can be linearly shattered. \(\square \)
In this section we propose a new, simple to compute, expressiveness measure for kernel matrices, namely the spectral ratio. Next, we will show that this measure is strongly related to the rank of the matrix, to the radius of the MEB, and to the ERC of the hypotheses space associated with that representation.
Now, we are ready to give a qualitative measure of expressiveness of kernel functions, in terms of specificity and generality, as follows:
Definition 1
Let \(k_i, k_j\) be two kernel functions. We say that \(k_i\) is more general (or less expressive) than \(k_j\) (\(k_i \ge _G k_j\)) or, equivalently, that \(k_j\) is more specific (or more expressive) than \(k_i\) (\(k_j \le _G k_i\)) whenever, for any possible dataset \(\mathbf{X}\), we have \(\mathcal{C}(\mathbf{K}_{\mathbf{X}}^{(i)}) \le \mathcal{C}(\mathbf{K}_{\mathbf{X}}^{(j)})\), with \(\mathbf{K}_{\mathbf{X}}^{(i)}\) the kernel matrix evaluated on data \(\mathbf{X}\) using the kernel function \(k_i\).
3.1.1 Connection between SR and the rank of a kernel matrix

the identity matrix \(\mathbf{I}_L\) having rank equal to L has the maximal spectral ratio with \(\mathcal{C}(\mathbf{I}_L) = \sqrt{L}\) and \(\bar{\mathcal{C}}(\mathbf{I}_L)=1\);

the kernel \(\mathbf{K}= \mathbf{1}_L \mathbf{1}_L^{\top }\) having rank equal to 1 has the minimal spectral ratio with \(\mathcal{C}(\mathbf{1}_L \mathbf{1}_L^{\top }) = 1\) and \(\bar{\mathcal{C}}(\mathbf{1}_L \mathbf{1}_L^{\top })=0\);

it is invariant to multiplication with a positive scalar as \(\mathcal{C}(\alpha \mathbf{K}) = \mathcal{C}(\mathbf{K}), \forall \alpha >0\).
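The three properties above can be checked with a few lines of code. This is a minimal sketch under an explicit assumption: the spectral ratio is taken to be the ratio between the trace (nuclear) norm and the Frobenius norm, \(\mathcal{C}(\mathbf{K}) = \Vert \mathbf{K}\Vert _T / \Vert \mathbf{K}\Vert _F\), and its normalized version is taken to be \(\bar{\mathcal{C}}(\mathbf{K}) = (\mathcal{C}(\mathbf{K}) - 1)/(\sqrt{L} - 1)\). Both formulas are assumptions chosen to reproduce the values listed above; the function names are hypothetical.

```python
import numpy as np

def spectral_ratio(K):
    """Assumed definition: trace (nuclear) norm divided by Frobenius norm."""
    return np.linalg.norm(K, ord="nuc") / np.linalg.norm(K, ord="fro")

def normalized_spectral_ratio(K):
    """Assumed rescaling of the spectral ratio to the interval [0, 1]."""
    L = K.shape[0]
    return (spectral_ratio(K) - 1.0) / (np.sqrt(L) - 1.0)

L = 6
identity = np.eye(L)         # rank L, maximal expressiveness
constant = np.ones((L, L))   # rank 1, minimal expressiveness

sr_identity = spectral_ratio(identity)   # expected to equal sqrt(L)
sr_constant = spectral_ratio(constant)   # expected to equal 1
sr_scaled = spectral_ratio(3.5 * identity)
```

Under this assumed definition, `sr_identity` equals \(\sqrt{L}\), `sr_constant` equals 1, and rescaling the matrix by a positive constant leaves the ratio unchanged.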
3.1.2 Connection between SR and ERC
Theorem 3
Equation 6 gives a bound on the ERC dependent on the trace of the kernel.
Now, we can observe that, for a general kernel \(\mathbf{K}\), the value of \(\pmb {\alpha }^T \mathbf{K}\pmb {\alpha }\) can be bounded by the Frobenius norm of \(\mathbf{K}\), that is:
Proposition 1
Proof
Finally, we are ready to prove the following theorem that gives us a connection between the spectral ratio of a kernel matrix and the complexity of the induced hypotheses space:
Theorem 4
Proof
3.1.3 Connection between SR and the radius of the MEB
The aim of this section is to show that the SR, and hence the degree of sparsity of the kernel matrices, is related to the radius of the minimum enclosing ball (MEB). Given a dataset embedded in a feature space, the MEB is the smallest hypersphere containing all the data. We can show that the radius increases with the SR of a kernel. In fact (see for example Shawe-Taylor and Cristianini 2004), when considering a normalized kernel \(\mathbf{K}\in \mathbb {R}^{L \times L}\), the radius of the MEB of the examples in feature space can be computed by \(r^{*}(\mathbf{K}) = 1 - \min _{\pmb {\alpha }\in A} \pmb {\alpha }^{\top } \mathbf{K}\pmb {\alpha }\), where \(A = \{\pmb {\alpha }\in \mathbb {R}_+^L, \sum _i \alpha _i = 1\}\). A nice approximation of the radius can be computed as \(\tilde{r}(\mathbf{K}) = 1 - \bar{k}\), where \(\bar{k}\) is the average of the entries in the matrix \({\mathbf{K}}\). This much simpler formula is exact in the two extreme cases, as \(\tilde{r}(\mathbf{1}_L \mathbf{1}_L^{\top }) = r^{*}(\mathbf{1}_L \mathbf{1}_L^{\top }) = 0\) and \(\tilde{r}(\mathbf{I}_L) = r^{*}(\mathbf{I}_L) = 1 - 1/L\). In general, the approximation is a lower bound of the radius, that is \(\tilde{r}(\mathbf{K}) \le r^{*}(\mathbf{K})\), since \(\tilde{r}(\mathbf{K})\) is obtained by plugging the suboptimal \(\pmb {\alpha }= \frac{1}{L}\mathbf{1}_L\) into the minimization. In Fig. 1, the values of the MEB radius and of the average approximation have been plotted for kernels of increasing expressiveness over two different example datasets.
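The relation between \(\tilde{r}\) and \(r^{*}\) can be illustrated as follows. This is a sketch under assumptions: `approx_radius` implements \(1 - \bar{k}\) as given above, while `fw_radius` estimates \(r^{*}\) with a Frank-Wolfe iteration with exact line search over the simplex, a solver chosen here only for brevity (the text does not say how \(r^{*}\) was computed). Both function names are hypothetical.

```python
import numpy as np

def approx_radius(K):
    """Average-based approximation: 1 minus the mean entry of K."""
    return 1.0 - K.mean()

def fw_radius(K, n_iter=500):
    """Estimate 1 - min_alpha alpha^T K alpha over the simplex (Frank-Wolfe)."""
    L = K.shape[0]
    alpha = np.full(L, 1.0 / L)           # start from the uniform point
    for _ in range(n_iter):
        grad = 2.0 * K @ alpha
        d = -alpha.copy()
        d[np.argmin(grad)] += 1.0         # direction e_i - alpha
        curv = d @ K @ d
        if curv <= 1e-15:                 # flat direction, nothing to gain
            break
        # Exact minimizer of the quadratic along the segment, clipped to [0, 1].
        step = float(np.clip(-(d @ K @ alpha) / curv, 0.0, 1.0))
        if step <= 0.0:
            break
        alpha += step * d
    return 1.0 - alpha @ K @ alpha

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)
K = (X @ X.T) ** 2                        # a normalized kernel (unit diagonal)

r_tilde, r_fw = approx_radius(K), fw_radius(K)
```

Because the iteration starts from the uniform \(\pmb {\alpha }\) and each step cannot increase the objective, `r_fw` is never smaller than `r_tilde`, in agreement with the lower-bound property stated above.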
3.2 EasyMKL for learning over a hierarchy of feature spaces
In this section, the general approach proposed in this paper is summarized and its generalization ability is briefly discussed.
Finally, we briefly discuss the generalization ability of the method. In particular, we can use the result presented in Sect. 3.1.3 to give a radius-margin bound on the generalization error of the general method described above. Specifically, since the base kernels have increasing sparsity and expressiveness, the radius of the enclosing ball increases as well. Let \(r_{\pmb {\eta }}\) be the MEB radius of the produced kernel \(k_\mathrm{MKL}\); then it can be proved that \(r_{\pmb {\eta }} \le \max _s(r_s) = r_R\), where \(r_s\) is the radius of the MEB of the examples when the s-th feature space (\(k_s\)) is used (Do et al. 2009). Hence, looking for the EasyMKL solution \(\pmb {\eta }\) maximizing the margin \(\rho _{\pmb {\eta }}\) can be understood as trying to minimize the radius-margin bound of the expected leave-one-out error, namely \(\frac{1}{L}r_R^2/\rho _{\pmb {\eta }}^2\).
4 Learning the kernel in the space of DPPs
4.1 Structure of homogeneous polynomial kernels
It is well known that the feature space of a d-degree HPK corresponds to all possible monomials of degree d, that is \(\phi _j(\mathbf{x}) = \prod _{i=1}^{d} x_{j_i}\) where \(j \in {\{1,\dots ,m\}^d}\) enumerates all possible d-combinations with repetitions from m variables, that is \(j_i \in \{1,\dots ,m\}\). Note that there is a clear dependence between features of higher-order HPKs and features of lower-order HPKs.
For example, the value of the feature \(x_1 x_4 x_5 x_9\) in the 4-degree HPK gives us some information about the values of the features \(x_1, x_4, x_5\) and \(x_9\) in the 1-degree HPK, and vice versa. An illustration of this kind of dependency is depicted in Fig. 2. In general, we expect that the higher the order of the HPK, the sparser the kernel matrix produced. We will prove that this is true at least when the HPKs are normalized. Specifically, the following proposition shows that the exponent d of an HPK of the form \(K(\mathbf{x},\mathbf{z}) = (\mathbf{x}\cdot \mathbf{z})^d\) induces an order of expressiveness in the kernel functions.
Proposition 2
For any choice \(D \in \mathbb {N}\), the family of kernels \(\mathcal{K}_D = \{ k_0, \dots , k_D \}\), with \(k_d(\mathbf{x},\mathbf{z}) = \left( \frac{\mathbf{x}\cdot \mathbf{z}}{\Vert \mathbf{x}\Vert \Vert \mathbf{z}\Vert }\right) ^d\) the d-degree normalized homogeneous polynomial kernel, has monotonically increasing expressiveness, that is \(k_i \ge _G k_j\) when \(i \le j\).
Proof
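Independently of the formal argument, the ordering stated in Proposition 2 can be observed numerically on a toy dataset. The sketch below is not the proof; it assumes that the spectral ratio is the trace norm divided by the Frobenius norm (for a positive semidefinite matrix the trace norm is simply the trace), and it uses random data as a stand-in for a real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm rows
G = X @ X.T                                     # normalized dot products

ratios = []
for d in range(0, 11):
    K_d = G ** d                                # normalized HPK of degree d
    # Assumed spectral ratio: trace norm (= trace for PSD K) over Frobenius norm.
    ratios.append(np.trace(K_d) / np.linalg.norm(K_d, "fro"))

ratios = np.array(ratios)
```

The intuition is simple: the trace of a normalized kernel matrix is always L, while the off-diagonal entries have absolute value at most 1 and therefore shrink when raised to higher powers, so the Frobenius norm can only decrease with d.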
5 Experimental work
Datasets information: name, source, number of features and number of examples

| Data set | Source | Features | Examples |
| --- | --- | --- | --- |
| Haberman | UCI (Lichman 2013) | 3 | 306 |
| Liver | UCI | 6 | 345 |
| Diabetes | UCI | 8 | 768 |
| Abalone | UCI | 8 | 4177 |
| Australian | UCI | 14 | 690 |
| Pendigits | UCI | 16 | 4000 |
| Heart | UCI | 22 | 267 |
| German | Statlog | 24 | 1000 |
| Ionosphere | UCI | 34 | 351 |
| Splice | UCI | 60 | 1000 |
| Sonar | UCI | 60 | 208 |
| MNIST | LeCun et al. (1998) | 784 | 70000 |
| Colon | UCI | 2000 | 62 |
| Gisette | NIPS | 5000 | 4000 |
5.1 MKL for learning DPP on UCI datasets
In this section we describe the experiments we performed to test the accuracy, in terms of AUC, of the kernel generated by learning the coefficients of a dot-product polynomial using \(\mathcal{K}_D = \{k_0,\dots ,k_D\}\) as base HPKs, as defined in Sect. 4.1, and varying the value of D. This method is indicated with \(\mathbf{K}_{\mathrm {MKL}}\) in the following.

Each dataset is divided in 10 folds \(\mathbf{f}_1,\dots ,\mathbf{f}_{10}\) respecting the distribution of the labels, where \(\mathbf{f}_i\) contains the list of indexes of the examples in the ith fold;

One fold \(\mathbf{f}_j\) is selected as test set;

The remaining nine out of ten folds \( \mathbf{v}_j = \bigcup _{i = 1, i \ne j}^{10} \mathbf{f}_i\) are then used as validation set for the choice of the hyperparameters. In particular, another 10-fold cross validation over \(\mathbf{v}_j\) is performed;

The set \(\mathbf{v}_j\) is selected as training set to generate a model (using the validated hyperparameters);

The test fold \(\mathbf{f}_j\) is used as test set to evaluate the performance of the model;

The reported results are the averages (with standard deviations) obtained repeating the steps above over all the 10 possible test sets \(\mathbf{f}_j\) (i.e. for each j in \(\{1,\dots ,10\}\)).
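The evaluation protocol listed above can be sketched as a nested stratified cross validation. The code below is only an illustration under several assumptions: `fit_predict` is a kernel ridge stand-in for the actual learner (the paper uses EasyMKL), the hyperparameter grid and the toy data are invented for the example, and all function names are hypothetical.

```python
import numpy as np

def stratified_folds(y, n_folds, rng):
    """Split indices into n_folds groups preserving the label proportions."""
    folds = [[] for _ in range(n_folds)]
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        for pos, i in enumerate(idx):
            folds[pos % n_folds].append(i)
    return [np.array(f) for f in folds]

def auc(scores, y):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    diff = scores[y == 1][:, None] - scores[y == -1][None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

def fit_predict(K_train, y_train, K_test, lam):
    """Stand-in learner (kernel ridge); NOT the EasyMKL used in the paper."""
    coef = np.linalg.solve(K_train + lam * np.eye(len(y_train)), y_train)
    return K_test @ coef

def nested_cv(K, y, grid, n_folds=10, seed=0):
    rng = np.random.default_rng(seed)
    outer = stratified_folds(y, n_folds, rng)
    test_aucs = []
    for j in range(n_folds):
        test = outer[j]                                   # test fold f_j
        v = np.concatenate([outer[i] for i in range(n_folds) if i != j])
        inner = stratified_folds(y[v], n_folds, rng)      # inner CV over v_j

        def inner_score(lam):
            scores = []
            for f in inner:
                val = v[f]
                tr = np.setdiff1d(v, val)
                s = fit_predict(K[np.ix_(tr, tr)], y[tr], K[np.ix_(val, tr)], lam)
                scores.append(auc(s, y[val]))
            return np.mean(scores)

        best = max(grid, key=inner_score)                 # validated value
        s = fit_predict(K[np.ix_(v, v)], y[v], K[np.ix_(test, v)], best)
        test_aucs.append(auc(s, y[test]))
    return np.mean(test_aucs), np.std(test_aucs)

# Toy data (an assumption): labels given by the sign of the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = np.where(X[:, 0] > 0, 1.0, -1.0)
G = X @ X.T
mean_auc, std_auc = nested_cv(G + G ** 2, y, grid=[0.01, 0.1, 1.0])
```

The key design point is that the hyperparameter is chosen using only the nine folds in \(\mathbf{v}_j\), so the test fold \(\mathbf{f}_j\) never influences model selection.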

\(\mathbf{K}_D\): the weight \(\eta _D\) is set to 1 (and all the other weights are set to 0);

\(\mathbf{K}_{sum}\): the weight is set uniformly over the base kernels, that is \(\eta _s = \frac{1}{D+1}\) for \(s \in \{0,1,\dots ,D\}\) (as pointed before, this is generally a strong baseline);
 \(\mathbf{K}_{D,c}\): the weights are assigned using the polynomial kernel rule (see Table 1):
$$\begin{aligned} \eta _s \propto {{D}\atopwithdelims (){s}} c^{D-s}, \,\,\,\, s \in \{0,\dots ,D\}. \end{aligned}$$(9)
In this case, the value c is selected optimistically as the one from the set \(\{ 0.5, 1, 2, 3 \}\) which obtained the best AUC on the test set.

 \(\mathbf{K}_\mathrm{RBF}^{\gamma }\): the weights are assigned according to the truncated RBF rule (see Table 1):
$$\begin{aligned} \eta _s \propto \frac{(2\gamma )^{2s}}{s!}, \,\,\,\, s \in \{0,\dots ,D\}. \end{aligned}$$(10)
Again, the value for \(\gamma \) is selected optimistically as the one from the set \(\{ 2^i : i \in \{ -5, -4, \dots , 0, 1 \} \}\) which obtained the best AUC on the test set.
In all the cases above, we performed a stratified nested 10-fold cross validation to select the optimal EasyMKL parameter \({\varLambda }\) from the set of values \(\{ \frac{v}{1-v} : v \in \{0.0, 0.1, \dots , 0.9, 1.0\} \}\).
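The fixed-weight baselines and the grid of \({\varLambda }\) values can be written down explicitly. This is a sketch with hypothetical function names; it covers the single-kernel, uniform and polynomial rules, rescales the weights to unit 1-norm as assumed in Sect. 2.1, and treats \(v = 1\) as \({\varLambda } = +\infty \), which is an interpretation of the limiting case rather than something stated in the text.

```python
import numpy as np
from math import comb

def weights_single(D):
    """K_D baseline: all the weight on the degree-D kernel."""
    eta = np.zeros(D + 1)
    eta[D] = 1.0
    return eta

def weights_sum(D):
    """K_sum baseline: uniform weights over the D + 1 base kernels."""
    return np.full(D + 1, 1.0 / (D + 1))

def weights_polynomial(D, c):
    """K_{D,c} baseline: eta_s proportional to binom(D, s) * c^(D - s)."""
    eta = np.array([comb(D, s) * c ** (D - s) for s in range(D + 1)], dtype=float)
    return eta / eta.sum()

# Grid for the EasyMKL parameter: Lambda = v / (1 - v), v in {0.0, ..., 1.0}.
vs = np.linspace(0.0, 1.0, 11)
lambdas = [v / (1.0 - v) if v < 1.0 else np.inf for v in vs]

eta_poly = weights_polynomial(3, 1.0)   # proportional to [1, 3, 3, 1]
```

With \(c = 1\) and \(D = 3\) the polynomial rule gives weights proportional to the binomial coefficients 1, 3, 3, 1.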
5.2 Is the deep structure important?
In this section, we show empirically that the structure present in HPKs is indeed useful in order to obtain good results using our MKL approach. With this aim, we built two alternative sets of base kernel matrices \(\mathcal{Q}_D = \{\mathbf{K}_0^{(Q)},\dots ,\mathbf{K}_D^{(Q)}\}\) and \(\mathcal{R}_D = \{\mathbf{K}_0^{(R)},\dots ,\mathbf{K}_D^{(R)}\}\). Specifically, we considered the same set of features (monomials of degree less than or equal to D) for both families of base kernels, but the features are assigned to base kernels in a different way. When generating \(\mathcal{Q}_D\), the features are assigned to the kernels according to the degree rule, that is, features of degree d are assigned to the kernel \(\mathbf{K}_d^{(Q)}\). On the other hand, when generating \(\mathcal{R}_D\), the features are assigned randomly to one of its base kernels.
These results seem to confirm the importance of the deep structure imposed in \(\mathcal{Q}_D\) to obtain good results with EasyMKL. Interestingly, we noticed that the weights assigned by EasyMKL when the family \(\mathcal{R}_D\) was used were almost uniform thus generating a solution near to the one of the average kernel \(\mathbf{K}_{sum}\).
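The two families \(\mathcal{Q}_D\) and \(\mathcal{R}_D\) can be built explicitly for small problems. The following sketch makes some assumptions: monomials are enumerated as ordered index tuples (so that the degree-d block reproduces the HPK \((\mathbf{x}\cdot \mathbf{z})^d\)), the random assignment is uniform over the \(D+1\) kernels, and the function name and toy data are hypothetical.

```python
import itertools
import numpy as np

def monomial_features(X, d):
    """All degree-d monomials x_{j_1} * ... * x_{j_d}, one column per tuple j."""
    m = X.shape[1]
    cols = [np.prod(X[:, np.array(j, dtype=int)], axis=1)
            for j in itertools.product(range(m), repeat=d)]
    return np.column_stack(cols)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)
D = 3

# Q_D: features grouped by degree, one kernel matrix per degree.
blocks = [monomial_features(X, d) for d in range(D + 1)]
Q = [F @ F.T for F in blocks]

# R_D: the same features, each assigned at random to one of D + 1 kernels.
all_feats = np.hstack(blocks)
assignment = rng.integers(0, D + 1, size=all_feats.shape[1])
R = [all_feats[:, assignment == d] @ all_feats[:, assignment == d].T
     for d in range(D + 1)]
```

Both families use exactly the same features, so the unweighted sums of the two families coincide; only the grouping, and hence what a weighted combination can express, differs.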
5.3 Analysis of the spectral ratio
5.4 Analysis of the weights assigned to the base kernels
Here, we present an analysis of the weights assigned by EasyMKL to the base kernels in the family \(\mathcal{K}_D\) for different values of D. Figure 8 reports the histograms of the weights for \(D \in \{3,5,10\}\) on two datasets: Heart and Splice. Note that these results show how the optimal distribution of the weights, learned by EasyMKL, is not the trivial choice of a single kernel but is instead a combination of different kernels with similar expressiveness. In Heart, the weights are anti-correlated with the degree of the base kernels. However, this behavior is rarely observed; in fact, for the Splice dataset, most of the total weight is shared among base kernels of degree in the range [1, 5].
5.5 Catching the task complexity
In this section, we present experiments performed to show that, when base kernels of increasing expressiveness are given, the weights computed by EasyMKL change as the complexity of the task increases, giving more and more weight to more specific kernels.
For this, we generated a toy problem similar to the Madelon dataset used by Guyon.^{2} To generate it, the same scikit-learn implementation for Python has been used. The toy problem is a balanced binary classification task with 500 examples and 2 features. One of the features is informative, while the other is uncorrelated with the labels. The examples of different classes are initially arranged in two different clusters in the original space and then projected onto the unit sphere (the data was not linearly separable).
Starting from the original toy problem, noise is introduced in the task by swapping a fixed percentage of labels (randomly selected with replacement). Then, models are trained by learning the coefficients of a DPP using \(\mathcal{K}_D\) as base HPKs (\(D = 10\)). The hyperparameter \({\varLambda }\) has been fixed to 0.01 in this case.
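The label-noise step can be sketched as follows. This is only an illustration under stated assumptions: labels are taken to be in \(\{-1,+1\}\), the selection is read as sampling indices with replacement, and an index drawn more than once is flipped only once. The helper name `flip_labels` is hypothetical and does not come from the paper.

```python
import numpy as np

def flip_labels(y, fraction, seed=0):
    """Return a noisy copy of y (labels in {-1, +1}) and the flipped indices.

    Indices are drawn with replacement, so the number of distinct flipped
    labels can be smaller than round(fraction * len(y)).
    """
    rng = np.random.default_rng(seed)
    y_noisy = np.array(y, copy=True)
    n_draws = int(round(fraction * len(y_noisy)))
    drawn = rng.integers(0, len(y_noisy), size=n_draws)  # with replacement
    flipped = np.unique(drawn)        # assumption: each drawn index flips once
    y_noisy[flipped] = -y_noisy[flipped]
    return y_noisy, flipped

y = np.array([1, -1] * 250)           # balanced toy labels, 500 examples
y_noisy, flipped = flip_labels(y, fraction=0.10)
print(len(flipped), "labels flipped out of", len(y))
```

Because of the sampling with replacement, the effective noise level is at most the nominal percentage.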
We then observed how the center of mass of the list of assigned weights \(\pmb {\eta }= \{ \eta _0, \dots , \eta _D \}\) changed when increasing the complexity of the task. In particular, the center of mass is computed by \( \mathcal {W}(\pmb {\eta }) = \frac{1}{D} \sum _{s=0}^D s \, \eta _s.\) This value is 0 whenever \(\eta _0 = 1\) and \(\eta _j = 0, \ \forall j>0\), and 1 whenever \(\eta _D = 1\) and \(\eta _j = 0, \ \forall j=1,\dots ,D-1\). \(\mathcal {W}\) is higher if the weights are assigned to the most specific kernels.
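As a small worked example of the center of mass \(\mathcal{W}\), the sketch below evaluates it on the two extreme weight vectors mentioned above and on uniform weights. The function name `center_of_mass` is only illustrative.

```python
def center_of_mass(eta):
    """W(eta) = (1/D) * sum_{s=0}^{D} s * eta_s, for weights eta_0..eta_D."""
    D = len(eta) - 1
    if D < 1:
        raise ValueError("need at least two weights (D >= 1)")
    return sum(s * w for s, w in enumerate(eta)) / D

D = 10
all_on_degree_0 = [1.0] + [0.0] * D      # eta_0 = 1, all other weights 0
all_on_degree_D = [0.0] * D + [1.0]      # eta_D = 1, all other weights 0
uniform = [1.0 / (D + 1)] * (D + 1)      # equal weight on every degree

print(center_of_mass(all_on_degree_0))   # 0.0
print(center_of_mass(all_on_degree_D))   # 1.0
print(center_of_mass(uniform))           # 0.5 up to floating-point rounding
```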
5.6 Analysis of the computational complexity
In this section we present an analysis of the computational complexity of our method (with \(\mathcal{K}_{10}, \mathcal{K}_{20}\) and \(\mathcal{K}_{30}\) as families of base kernels), compared to the SVM with the Gaussian kernel.^{3} The theoretical analysis of the complexity of EasyMKL, presented in Aiolli and Donini (2015), shows that the computational complexity of EasyMKL increases linearly with the number of base kernels. In fact, the optimization problem presented in Eq. 1 has the same complexity as the standard SVM.
The difference in complexity between the two approaches is the evaluation of the base kernels and the computation of the weights, using the closed formula: \( \eta _s = \frac{\mathbf{d}(\pmb {\gamma }^*)}{\Vert \mathbf{d}(\pmb {\gamma }^*) \Vert } \,\,\,\, \forall s=1,\dots ,D \) (see Sect. 2.1).
Table 3 Computational time required by our method with three different families of base kernels (\(\mathcal{K}_{10}, \mathcal{K}_{20}\) and \(\mathcal{K}_{30}\)) compared to the standard SVM with the Gaussian kernel using a validation set of parameters with cardinality V. Entries are average training times in seconds
Dataset  \(\mathbf{K}_\mathrm{RBF}^{\gamma }\)  Our method \(\mathcal{K}_{10}\)  Our method \(\mathcal{K}_{20}\)  Our method \(\mathcal{K}_{30}\)
Heart  \(0.016 \times V\)  0.129  0.158  0.165
Ionosphere  \(0.034 \times V\)  0.243  0.276  0.341
Splice  \(0.139 \times V\)  2.092  2.400  2.882
The results are summarized in Table 3. From these results we can notice that the training time of our method is only about one order of magnitude larger than that of the simple SVM with a Gaussian kernel with fixed \(\gamma \) (i.e. with \(V=1\)). The difference is slightly higher when we use a larger number of base kernels. As we noticed in the previous experimental results, 30 HPKs provide a sufficient level of complexity to learn all the proposed tasks effectively.
5.7 A comparison with the generalized hierarchical kernel learning
In this set of experiments, the performance of the proposed method and the gHKL method presented in Sect. 2.2 are compared on a subset of UCI datasets. Unfortunately, gHKL is quite computationally demanding and could only cope with very small datasets with few features. In these experiments, we used the implementation of the \(\text{ gHKL }_\rho \) algorithm provided by the authors.^{4}
We performed a 10-fold cross validation for the AUC evaluation, tuning the parameter C of the SVM for \(\text{ gHKL }_\rho \) (Jawanpuria et al. 2015) with a 3-fold cross validation selecting C in \(\{10^i : i=-3,\dots ,3 \}\). The same procedure has been repeated for \(\rho \in \{1.1, 1.5, 2.0\}\). The number of base kernels is fixed to \(2^m\), where m is the number of features, as in the original paper (Jawanpuria et al. 2015). It is important to point out that with \(\rho = 2\) the HKL formulation of Bach (2009) is obtained.
For our algorithm, we fixed \(D=10\) (i.e. the family of base kernels is \(\mathcal {K}_{10}\)) and validated the parameter \({\varLambda }\) of EasyMKL by using the same methodology (3-fold cross validation) with \({\varLambda }\in \{ \frac{v}{1-v} : v \in \{0.0, 0.1, \dots , 0.9, 1.0\} \}\).
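The validation grid for \({\varLambda }\) can be enumerated as in the sketch below, reading the map as \({\varLambda } = v/(1-v)\). How the endpoint \(v = 1\) is handled is not stated in the text; representing it as \(+\infty\) is an assumption made here, and `lambda_grid` is a hypothetical helper name.

```python
import math

def lambda_grid(steps=10):
    """Map v in {0.0, 0.1, ..., 1.0} to Lambda = v / (1 - v)."""
    grid = []
    for k in range(steps + 1):
        v = k / steps
        # Assumption: v = 1 has no finite image, so it is stored as +infinity.
        grid.append(math.inf if k == steps else v / (1.0 - v))
    return grid

grid = lambda_grid()
for k, lam in enumerate(grid):
    print(f"v = {k / 10:.1f}  ->  Lambda = {lam:.4f}")
```

The grid is denser near \({\varLambda } = 0\) and grows quickly as v approaches 1, so equally spaced values of v cover several orders of magnitude of \({\varLambda }\).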
Nested 10-fold AUC\(_{\pm std}\) using EasyMKL (\(\mathbf{K}_\mathrm{MKL}\)) with \(\mathcal {K}_{10}\) as base family compared to gHKL\(_\rho \) with \(\rho \in \{ 1.1, 1.5, 2.0 \}\)
Dataset  \(\mathbf{K}_\mathrm{MKL}\)  \(g\mathrm{HKL}_{1.1}\)  \(g\mathrm{HKL}_{1.5}\)  \(g\mathrm{HKL}_{2.0}\)
Haberman  \(0.716_{\pm 0.014}\)  \(0.617_{\pm 0.166}\)  \(0.518_{\pm 0.110}\)  \(0.556_{\pm 0.070}\)
Liver  \(0.689_{\pm 0.056}\)  \(0.565_{\pm 0.109}\)  \(0.583_{\pm 0.110}\)  \(0.623_{\pm 0.038}\)
Diabetes  \(0.842_{\pm 0.027}\)  \(0.636_{\pm 0.118}\)  \(0.733_{\pm 0.058}\)  \(0.766_{\pm 0.046}\)
Australian  \(0.924_{\pm 0.081}\)  \(0.923_{\pm 0.101}\)  \(0.918_{\pm 0.049}\)  \(0.920_{\pm 0.045}\)
5.8 Experiments on the MNIST dataset
In this section we report on the performance of our method on the MNIST dataset (LeCun et al. 1998). The MNIST dataset of handwritten digits is a real-world benchmark dataset and it is widely used to evaluate the classification performance of pattern recognition algorithms. Digits are size-normalized and centered in a fixed-size image.
In our experiments, we have generated the family of base kernels \(\mathcal {K}_{D}\) using different values of D, and considered two different tasks. The first is the even-odd task, where the goal is to correctly discriminate between even and odd digits. Specifically, even digits (0, 2, 4, 6, 8) have been selected as positives and odd digits (1, 3, 5, 7, 9) as negatives. The second task is the typical multiclass task of recognizing the label of a given handwritten digit.
These results confirm the effectiveness in accuracy of our method. However, the average kernel \(\mathbf{K}_{sum}\) represents a strong baseline in this case maintaining a good performance even when adding a large number of base kernels (i.e. all the HPKs \(\mathcal {K}_{D}\) with \(D>10\)).
For the multiclass task, an all-pairs approach has been used. In particular, 45 binary tasks, one for each possible pair of classes, have been created. When a test example needs to be classified, each classifier is considered as a voter, and it votes for the class it predicts. Finally, the class with the highest number of votes is the class predicted by the algorithm.
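The voting rule of the all-pairs scheme can be sketched as follows. The pairwise classifiers here are toy stand-ins (plain functions), not trained SVMs, and the tie-breaking rule is an assumption, since the text does not specify one.

```python
from collections import Counter
from itertools import combinations

def all_pairs_predict(x, classes, pairwise_classifiers):
    """Predict the class of x by majority vote over all pairwise classifiers.

    pairwise_classifiers maps each pair (a, b), a before b in `classes`,
    to a function that returns either a or b for the input x.
    """
    votes = Counter()
    for pair in combinations(classes, 2):
        votes[pairwise_classifiers[pair](x)] += 1
    # Assumption: ties go to the tied class listed first in `classes`.
    return max(classes, key=lambda c: votes[c])

classes = list(range(10))                 # the ten digits

def make_toy_classifier(a, b):
    # Hypothetical stand-in for a trained binary model: vote for the
    # label numerically closer to x (ties go to a).
    return lambda x: a if abs(x - a) <= abs(x - b) else b

classifiers = {(a, b): make_toy_classifier(a, b)
               for a, b in combinations(classes, 2)}

print(len(classifiers))                                # 45
print(all_pairs_predict(7, classes, classifiers))      # 7
```

With 10 classes there are \(10 \cdot 9 / 2 = 45\) pairs, which matches the number of binary tasks reported above.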

Generation of the family of base HPK \(\mathcal {K}_{D}\), with \(D=8\);

Run of one EasyMKL for each binary task to learn a different kernel for each task (\({\varLambda }= \frac{0.01}{1-0.01}\));

Training of the 45 binary SVM models using the kernels computed above (fixing \(C=4.0\)).
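The first step above, together with the way learned weights turn a family of HPKs into a single kernel, can be sketched with NumPy. Several choices here are assumptions rather than details from the paper: the degrees are taken to run from 1 to D (controlled by the hypothetical `min_degree` argument), the rows of the data are normalized to unit norm, and uniform weights are used as a stand-in for the weights that EasyMKL would learn (which yields the average kernel rather than the learned one).

```python
import numpy as np

def hpk_family(X, D, min_degree=1):
    """Gram matrices of the homogeneous polynomial kernels <x, z>^d."""
    G = X @ X.T                                   # linear-kernel Gram matrix
    return [G ** d for d in range(min_degree, D + 1)]   # elementwise powers

def combine(kernels, eta):
    """Weighted sum of Gram matrices, sum_d eta_d * K_d."""
    return sum(w * K for w, K in zip(eta, kernels))

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))                       # toy data, not MNIST
X /= np.linalg.norm(X, axis=1, keepdims=True)     # unit-norm rows (assumption)

kernels = hpk_family(X, D=8)
eta = np.full(len(kernels), 1.0 / len(kernels))   # uniform stand-in weights
K = combine(kernels, eta)
print(K.shape)                                    # (6, 6)
```

A combined matrix such as `K` could then be passed to any SVM implementation that accepts a precomputed Gram matrix, one per binary task.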
Classification error % of our method (Our) with and without data deskewing, compared with the state-of-the-art results using the SVM with different kernels: RBF and polynomial with optimal degree (i.e. 4)
Also in this case, our methodology is able to create a model that outperforms the SVM with the optimal RBF kernel in terms of classification performance. Moreover, using deskewing, our method further improves its performance, reaching an error of \(0.8\,\%\) (i.e. 80 erroneous digit classifications out of 10,000).
6 Conclusion and future work
Starting from a new perspective of the MKL problem, we have investigated principled ways to design base kernels so as to make their supervised combination really effective. Specifically, a hierarchy of features of different levels of abstraction is considered. As a leading example of this methodology, an MKL approach is proposed to learn the kernel in the space of Dot-Product Polynomials (DPPs), that is, a positive combination of homogeneous polynomial kernels (HPKs). We have given a deep theoretical analysis and empirically shown the merits of our approach, comparing the effectiveness of the generated kernel against baseline kernels (including homogeneous and non-homogeneous polynomials, RBF, etc.) and against the hierarchical kernel learning (HKL) approach on many benchmark UCI/Statlog datasets and the large MNIST dataset. A deep experimental analysis has also been presented to gain more insight into the method.
In the future, we want to investigate extensions of the same methodology to general convolution kernels where the same type of hierarchy among features exists.
Footnotes
 1.
Note that, besides the running example of monomials used in this paper, other possibilities are available, including ANOVA features, subtrees of different length for trees, substrings of different length for strings, etc.
 2.
 3.
For these experiments, the scikit-learn implementation of SVM at http://scikit-learn.org/stable/modules/svm.html has been used.
 4.
References
Aiolli, F., & Donini, M. (2015). EasyMKL: A scalable multiple kernel learning algorithm. Neurocomputing, 169, 215–224. doi: 10.1016/j.neucom.2014.11.078.
Bach, F. R. (2009). Exploring large feature spaces with hierarchical multiple kernel learning. Advances in Neural Information Processing Systems, 21, 105–112.
Bucak, S. S., Jin, R., & Jain, A. K. (2014). Multiple kernel learning for visual object recognition: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7), 1354–1369.
Castro, E., Gómez-Verdejo, V., Martínez-Ramón, M., Kiehl, K. A., & Calhoun, V. D. (2014). A multiple kernel learning approach to perform classification of groups from complex-valued fMRI data analysis: Application to schizophrenia. NeuroImage, 87, 1–17.
Cortes, C., Kloft, M., & Mohri, M. (2013). Learning kernels using local Rademacher complexity. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, & K. Weinberger (Eds.), Advances in neural information processing systems (pp. 2760–2768). Red Hook: Curran Associates, Inc.
Cortes, C., Mohri, M., & Rostamizadeh, A. (2010). Generalization bounds for learning kernels. Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21-24, 2010, Haifa, Israel (pp. 247–254).
Damoulas, T., & Girolami, M. A. (2008). Probabilistic multi-class multi-kernel learning: On protein fold recognition and remote homology detection. Bioinformatics, 24(10), 1264–1270.
Do, H., Kalousis, A., Woznica, A., & Hilario, M. (2009). Margin and radius based multiple kernel learning. Machine Learning and Knowledge Discovery in Databases, European Conference, ECML PKDD 2009, Bled, Slovenia, September 7-11, 2009, Proceedings, Part I (pp. 330–343). doi: 10.1007/978-3-642-04180-8_39.
Gönen, M., & Alpaydin, E. (2011). Multiple kernel learning algorithms. Journal of Machine Learning Research, 12, 2211–2268.
Jawanpuria, P., Nath, J. S., & Ramakrishnan, G. (2015). Generalized hierarchical kernel learning. Journal of Machine Learning Research, 16, 617–652.
Kar, P., & Karnick, H. (2012). Random feature maps for dot product kernels. Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2012, La Palma, Canary Islands, April 21-23, 2012 (pp. 583–591).
Kloft, M., & Blanchard, G. (2011). The local Rademacher complexity of lp-norm multiple kernel learning. Advances in Neural Information Processing Systems, 24, 2438–2446.
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2323. doi: 10.1109/5.726791.
Lichman, M. (2013). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science. http://archive.ics.uci.edu/ml.
Livni, R., Shalev-Shwartz, S., & Shamir, O. (2013). An algorithm for training polynomial networks. arXiv preprint arXiv:1304.7045.
Livni, R., Shalev-Shwartz, S., & Shamir, O. (2014). On the computational efficiency of training neural networks. Advances in Neural Information Processing Systems, 27, 855–863.
Romera-Paredes, B., Aung, H., Bianchi-Berthouze, N., & Pontil, M. (2013). Multilinear multitask learning. Proceedings of the 30th International Conference on Machine Learning (pp. 1444–1452).
Schoenberg, I. J. (1942). Positive definite functions on spheres. Duke Mathematical Journal, 9(1), 96–108.
Shawe-Taylor, J., & Cristianini, N. (2004). Kernel methods for pattern analysis. Cambridge: Cambridge University Press.
Srebro, N. (2004). Learning with matrix factorizations. Ph.D. thesis, Massachusetts Institute of Technology.
Watrous, J. (2011). Theory of quantum information, lecture notes. https://cs.uwaterloo.ca/~watrous/LectureNotes.html.
Xu, X., Tsang, I. W., & Xu, D. (2013). Soft margin multiple kernel learning. IEEE Transactions on Neural Networks and Learning Systems, 24(5), 749–761. doi: 10.1109/TNNLS.2012.2237183.
Yang, J., Li, Y., Tian, Y., Duan, L., & Gao, W. (2009). Group-sensitive multiple kernel learning for object categorization. 2009 IEEE 12th International Conference on Computer Vision (pp. 436–443). IEEE.
Yu, S., Falck, T., Daemen, A., Tranchevent, L. C., Suykens, J. A., De Moor, B., et al. (2010). L2-norm multiple kernel learning and its application to biomedical data fusion. BMC Bioinformatics, 11(1), 309.
Zien, A., & Ong, C. S. (2007). Multiclass multiple kernel learning. Proceedings of the 24th International Conference on Machine Learning (pp. 1191–1198). ACM.