Abstract
Kernel learning is the problem of determining the best kernel (either from a dictionary of fixed kernels, or from a smooth space of kernel representations) for a given task. In this paper, we describe a new approach to kernel learning that establishes connections between the Fourier-analytic representation of kernels arising out of Bochner’s theorem and a specific kind of feedforward network using cosine activations. We analyze the complexity of this space of hypotheses and demonstrate empirically that our approach provides scalable kernel learning superior in quality to prior approaches.
This research was partially supported by the NSF under grants CCF-0953066, IIS-1251049 and CNS-1302688.
1 Introduction
Kernel methods have been a powerful tool in machine learning for decades and kernel learning is the problem of learning the “right” or “best” kernel for a given task. Broadly speaking, we can divide kernel learning methods into two categories. Multiple kernel learning (MKL) methods largely assume that the desired kernel can be represented as a combination of a dictionary of fixed kernels, and seek to learn their mixing weights. The other approach is based on a Fourier-analytic representation of shift-invariant kernels via Bochner’s theorem: roughly speaking, a kernel can be represented (in dual form) as a probability distribution, and so the search for a kernel becomes a search over distributions.
In both approaches, training the model is challenging with many thousands of training points and hundreds of dimensions. Standard training approaches either employ some form of convex or alternating optimization (for MKL) or parameterize the space of distributions in terms of known distributions and try to optimize their parameters.
In this paper, we describe continuous kernel learning (CKL), a new way of tackling this problem by establishing and exploiting a connection to feedforward networks. Working within the Fourier-analytic framework for kernel learning, we propose to search directly over the space of shift-invariant kernels instead of optimizing the parameters of a known family of distributions. In doing so, though we lose the ability to isolate parameters of a single learned kernel, we gain representability in terms of a nonlinear basis of cosines that can be naturally interpreted as activations for a feedforward network. This interpretation allows us to deploy the power of backpropagation on this network to learn the desired kernel representation. In addition, the generalization power of the cosine representation can be established formally using machinery from learning theory: this also helps guide the regularization that we use to learn the resulting kernel. We support these arguments with a suite of experiments on relatively large data sets (tens of thousands of points, hundreds of dimensions) that demonstrate that our learned kernels are more accurate than the state-of-the-art MKL methods.
In summary, our main contributions are:

We develop the continuous kernel learning (CKL) framework, a kernel learning method that learns an implicit representation of a kernel. We show that we can interpret the learning task as a feedforward network. This allows us to utilize recent advances in optimization technology from deep learning to train a classifier.

We prove VC-dimension and generalization bounds for a single Fourier embedding, which yields natural regularization techniques for CKL.

We show via experiments that CKL outperforms existing scalable MKL methods.
1.1 Technical Overview
The starting point for our work is the representation of any shift-invariant kernel^{Footnote 1} as an infinite linear combination of cosine basis elements via Bochner’s theorem [9], as first demonstrated by Rahimi and Recht [41]. This representation is typically used to generate a random low-dimensional embedding of the associated Hilbert space.
If we move away from a random low-dimensional embedding and embrace the entire distribution that we sample from, we reach infinite-width embeddings. Dealing with infinite-width embeddings simply means that we consider the expectation of the embedding over the distribution. Neal [36] linked infinite-width networks to Gaussian processes when the distribution is Gaussian. Much later, Cho and Saul [11] applied the technique to infinite-width rectified linear units (ReLUs), and showed a correspondence to a kernel they called the arccosine kernel. Hazan and Jaakkola [21] extended this result further, and analyzed the kernel corresponding to two infinite layers stacked in series. In all of this, a specific distribution is chosen in order to obtain a kernel.
In our work, we return to the infinite representation provided by Bochner’s theorem. Rather than picking a specific distribution over weights, we learn a distribution based on our training data. This effectively means we learn a representation of a kernel. While we cannot learn an infinite-width embedding directly, since the space of functions is itself infinite, we are able to construct approximate representations from a finite number of Fourier embeddings. Since the learned kernel representations are a form of kernel learning, we dub our technique continuous kernel learning (CKL).
2 Prior Kernel Learning Work
2.1 Multiple Kernel Learning (MKL)
Multiple Kernel Learning, or MKL, is an extension to kernelized support vector machines (SVMs) that employs a combination of kernels to extend the space of possible kernel functions. MKL algorithms learn not only the parameters of the SVM, but also the parameters of the kernel combination. In this sense, MKL algorithms seek to find the correct kernel function for the training data instead of relying on a predefined kernel function.
Lanckriet et al. [28] describe several convex optimization problems that learn the coefficients of a linear combination of kernel functions \(\kappa _\gamma (\cdot ,\cdot ) = \sum _i\gamma _i\kappa _i(\cdot ,\cdot )\). There are several algorithms to solve the MKL problem, including [1, 3, 17, 18, 42]. In addition to solving the MKL problem, MWUMKL [34] and SPG-GMKL [22] also work at scale.
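As a rough illustration of the combined kernel \(\kappa _\gamma \), the sketch below builds the Gram matrix of a nonnegative combination of a linear and a Gaussian kernel. The function names, the bandwidth, and the mixing weights are illustrative assumptions and are not taken from any of the cited solvers; learning the weights \(\gamma _i\), which is the actual MKL problem, is not shown.

```python
import numpy as np

def linear_gram(X):
    """Gram matrix of the linear kernel <x, x'>."""
    return X @ X.T

def gaussian_gram(X, gamma):
    """Gram matrix of exp(-gamma * ||x - x'||^2); gamma is an assumed bandwidth."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))  # clip tiny negative round-off

def combined_gram(grams, weights):
    """Weighted sum of base Gram matrices with nonnegative mixing weights."""
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0):
        raise ValueError("mixing weights must be nonnegative to stay positive semidefinite")
    return sum(w * K for w, K in zip(weights, grams))

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))                      # toy data, hypothetical
grams = [linear_gram(X), gaussian_gram(X, gamma=0.5)]
K = combined_gram(grams, weights=[0.3, 0.7])     # hand-picked weights, not learned
min_eig = np.linalg.eigvalsh(K).min()            # nonnegative up to round-off
```

A nonnegative combination of positive semidefinite Gram matrices is again positive semidefinite, which is why the combined matrix can be handed to a standard kernel SVM.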
2.2 Approaches Utilizing Bochner’s Theorem
The key mathematical tool that drives much of kernel learning work is Bochner’s theorem:
Theorem 1
(Bochner [9]). A continuous function \(k : \mathbb {R}^d\rightarrow \mathbb {R}\) is positive-definite iff \(k(\cdot )\) is the Fourier transform of a nonnegative measure.
Several papers have been published that explore the connection between Bochner’s theorem and learning a kernel^{Footnote 2}. A Bayesian view produces an interpretation of this optimization as learning the kernel of a Gaussian process (GP). Wilson and Adams [46] equate stationary (shift-invariant) kernels to the spectral density function of a GP. They observe that linear combinations of squared-exponential kernels are dense in the space of stationary kernels. The resulting kernel has few parameters and is relatively easy to interpret.
Yang et al. [51] extend the ideas in [46] and combine them with the principles from Fastfood [29]. The authors also discuss variants of their algorithms such as computing a piecewise linear kernel. Similarly, the BaNK method by Oliva et al. [37] learns a kernel using the GP technique and trains the kernel using MCMC. Finally in the GP vein, Wilson et al. [47] integrate a deep network as input to the GP, treating the GP as an “infinite-dimensional” layer of the network, and optimize the parameters of the GP simultaneously with the parameters of the network using backpropagation.
Băzăvan et al. [10], in contrast, optimize Fourier embeddings, but decompose each \(\omega _i\) into a parameter \(\sigma _i\) multiplied by a nonlinear function of a uniform random variable to represent the sample. The uniform variable is resampled during optimization as the parameter is learned.
3 Continuous Kernel Learning
3.1 Bochner’s Theorem
A couple of observations must be made in order for Theorem 1 to be relevant to our setting. First, we observe that (for the purposes of this paper) a positive definite function \(k(\cdot )\) is a positive definite kernel \(\kappa (\cdot ,\cdot )\) when \(\kappa (\mathbf {x},\mathbf {x}') = k(\mathbf {x}-\mathbf {x}')\). A kernel of this type is a shift-invariant kernel. Examples include the Gaussian or RBF kernel (\(e^{-\Vert \mathbf {x}-\mathbf {x}'\Vert ^2/\sigma ^2}\)) and the Laplacian kernel (\(e^{-\lambda \Vert \mathbf {x}-\mathbf {x}'\Vert }\)).
Next, any nonnegative measure \(\mu : \mathbb {R}^d\rightarrow \mathbb {R}^+\) can be converted to a probability distribution if we normalize by \(Z = \int _{\mathbb {R}^d} d\mu \). Since Fourier transforms are linear, we can normalize the kernel by the same factor Z and maintain the equivalence. So without loss of generality, we can assume that the measure \(\mu \) is a probability measure. This equivalence between shift-invariant kernel and distribution is important in the rest of this paper.
3.2 Fourier Embeddings
Rahimi and Recht [41] built on Bochner’s theorem by observing that the Fourier transform of \(\mu \) is also an expectation:
\[k(\mathbf {x}-\mathbf {x}') = \int _{\mathbb {R}^d} e^{i{\varvec{\omega }}^\top (\mathbf {x}-\mathbf {x}')}\,d\mu ({\varvec{\omega }}) = E_{{\varvec{\omega }}}\left[ \zeta _{{\varvec{\omega }}}(\mathbf {x})\overline{\zeta _{{\varvec{\omega }}}(\mathbf {x}')}\right] \]
if \(\zeta _{{\varvec{\omega }}}(\mathbf {x}) = e^{i{\varvec{\omega }}^\top \mathbf {x}}\) and \({\varvec{\omega }}\sim \mathscr {D}_\mu \), where \(\mathscr {D}_\mu \) is the probability distribution over Borel sets on \(\mathbb {R}^d\) with measure \(\mu \). This shows that \(\zeta _{{\varvec{\omega }}}(\mathbf {x})\overline{\zeta _{{\varvec{\omega }}}(\mathbf {x}')}\) is an unbiased estimate of \(k(\mathbf {x} - \mathbf {x}')\). Because \(k(\mathbf {x} - \mathbf {x}')\) is real, we know that \(E_{{\varvec{\omega }}}[\zeta _{{\varvec{\omega }}}(\mathbf {x})\overline{\zeta _{{\varvec{\omega }}}(\mathbf {x}')}]\) has no imaginary component. A straightforward Chernoff-type argument [35, see Ch. 4] shows that averaging \(\zeta _{{\varvec{\omega }}}(\mathbf {x})\overline{\zeta _{{\varvec{\omega }}}(\mathbf {x}')}\) over D samples of \({\varvec{\omega }}\) produces a bound on the error of the estimate that diminishes exponentially in D. The lifting map then becomes \(\varPhi (\mathbf {x})=\sqrt{1/D}(\zeta _{{\varvec{\omega }}_1}(\mathbf {x}),\dots ,\zeta _{{\varvec{\omega }}_D}(\mathbf {x}))\). The inner product \(\langle {\varPhi (\mathbf {x})},{\overline{\varPhi (\mathbf {x}')}}\rangle \) is then the desired average.
We can avoid complex numbers by using \(z_{{\varvec{\omega }},b}(\mathbf {x}) = \sqrt{2} \cos ({\varvec{\omega }}^\top \mathbf {x} + b)\) with \({\varvec{\omega }}\sim \mathscr {D}_\mu \) and \(b \sim U[0,2\pi )\), which offers the same unbiased estimate (see [41]). Since the factor \(\sqrt{2}\) is already part of \(z_{{\varvec{\omega }},b}\), the lifting map in this case is \(\varPhi (\mathbf {x})=\sqrt{1/D}(z_{{\varvec{\omega }}_1,b_1}(\mathbf {x}),\dots ,z_{{\varvec{\omega }}_D,b_D}(\mathbf {x}))\). In this work we will refer to these maps (of the real or complex type) as Fourier embeddings. In [41] these embeddings are called random Fourier features, because they are selected at random from the distribution that is Fourier-dual to the approximated kernel. We will demonstrate that Fourier embeddings of this type need not be selected at random, and can in fact be optimized.
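The real-valued embedding can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' code: it assumes the Gaussian kernel is parameterized as \(e^{-\Vert \mathbf {x}-\mathbf {x}'\Vert ^2/(2\sigma ^2)}\), so that the dual distribution is a Gaussian with variance \(1/\sigma ^2\) per coordinate, and the names `fourier_embedding` and `gaussian_gram` are hypothetical.

```python
import numpy as np

def fourier_embedding(X, W, b):
    """Real Fourier embedding: sqrt(1/D) * sqrt(2) * cos(X W + b).

    X has shape (n, d), W has shape (d, D), b has shape (D,).
    """
    D = W.shape[1]
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

def gaussian_gram(X, sigma):
    """Exact Gram matrix of exp(-||x - x'||^2 / (2 sigma^2)) (assumed parameterization)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
d, D, sigma = 5, 20000, 1.5
W = rng.normal(0.0, 1.0 / sigma, size=(d, D))    # omega ~ N(0, I / sigma^2)
b = rng.uniform(0.0, 2.0 * np.pi, size=D)        # b ~ U[0, 2 pi)
X = rng.normal(size=(4, d))                      # toy inputs, hypothetical

Phi = fourier_embedding(X, W, b)
K_approx = Phi @ Phi.T                           # average of z(x) z(x') over D samples
K_exact = gaussian_gram(X, sigma)
max_err = np.max(np.abs(K_approx - K_exact))     # shrinks as D grows
```

Replacing the random draw of `W` with a matrix that is updated by gradient steps is the change that turns this fixed embedding into a learned one.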
Our Approach. Our approach is most similar to that in Băzăvan et al. [10]. Like the authors of [10], we recognize that we can optimize the parameters \(\{{\varvec{\omega }}_i\}\) of a Fourier embedding. Băzăvan et al. [10] decompose \({\varvec{\omega }}_i\) as follows:
\[{\varvec{\omega }}_i = \varvec{\sigma }_i \odot h(\mathbf {u}_i)\]
where \(\varvec{\sigma }_i\) is the parameter of a shift-invariant kernel, \(\odot \) is the Hadamard (elementwise) product of two vectors, h is an elementwise nonlinear function (essentially an inverse quantile function), and \(\mathbf {u}_i\) is a sample drawn from a multivariate uniform distribution (cube). The procedure is to optimize \(\varvec{\sigma }_i\) and periodically resample \(\mathbf {u}_i\). This has the advantage of being able to represent the kernel with its parameter \(\varvec{\sigma }_i\), which adds to clarity, but the kernel must be one of a particular class of shift-invariant kernels that decomposes into this form. A Gaussian kernel, however, does decompose this way.
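For the Gaussian case, this kind of decomposition can be sketched as follows. The sketch assumes that h is the standard normal quantile function (inverse CDF) and that the scale vector plays the role of an inverse bandwidth per coordinate; both are our reading of the description above, not details confirmed by [10]. The helper name `frequencies_from_uniform` is hypothetical.

```python
import numpy as np
from statistics import NormalDist

# Elementwise standard normal quantile function (assumed choice of h).
_inv_cdf = np.vectorize(NormalDist().inv_cdf)

def frequencies_from_uniform(scale, u):
    """Map uniform samples u of shape (d, D) to frequencies scale * h(u).

    scale has shape (d,) and is the only quantity that would be optimized;
    u can be resampled without touching scale.
    """
    u = np.clip(u, 1e-12, 1.0 - 1e-12)           # keep the quantile finite
    return scale[:, None] * _inv_cdf(u)

rng = np.random.default_rng(0)
d, D = 3, 20000
scale = np.array([0.5, 1.0, 2.0])                # hypothetical per-coordinate scales
u = rng.uniform(size=(d, D))                     # sample from the unit cube
W = frequencies_from_uniform(scale, u)           # each row is Gaussian with std scale[j]
row_std = W.std(axis=1)
```

Because the randomness lives entirely in `u`, the kernel stays identifiable through `scale`, which is the interpretability advantage mentioned above.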
In contrast, we sample the vectors \({\varvec{\omega }}_i\) from the distribution \(\mathscr {D}_\mu \), and then optimize them directly. The weights \(\{{\varvec{\omega }}_i\}\) become different vectors \(\{{\varvec{\omega }}_i'\} \subset \mathbb {R}^d\) – and are now very unlikely to be drawn i.i.d. from the distribution \(\mathscr {D}_\mu \) anymore. As in prior approaches, by learning the embeddings, we learn the kernel, because the Bochner equivalence between distributions and kernels guarantees this. We use backpropagation to learn the weights, avoiding the need to resample at every step, and allowing us to take advantage of recent neural network technology to perform scalable optimization. While other approaches focus on decomposing the representation of the kernels into individual kernel components and learn their parameters, we avoid this and focus only on producing the final weights \({\varvec{\omega }}_i'\). We lose the clarity and sparsity of individual kernel parameters but gain the flexibility of learning a representation of a shift-invariant kernel free of individual base kernels, and recent technology allows us to do this training quickly.
For brevity, we refer to the \(d\times D\) matrices \(\mathbf {W}\) (for the \(\{{\varvec{\omega }}_i\}\)) and \(\mathbf {W}'\) (for the \(\{{\varvec{\omega }}_i'\}\)), since there are D samples from \(\mathbb {R}^d\).
3.3 Generalization Bounds in Fourier Embeddings
We now examine the capacity of this class of kernels by analyzing its VC-dimension. Note that the cosine function complicates this analysis since it has nontrivial gradient almost everywhere.
Fortunately we can exploit an observation already well-known in kernel learning that a narrow kernel function, for example, a Gaussian kernel with a small variance, is more likely to overfit (and therefore have higher capacity). This is because a narrow kernel function only allows the model to examine a very small range around each point, so a new point is unlikely to be affected by the model at all. Because the kernel is the Fourier transform of a distribution, a narrow kernel function corresponds to a distribution with high variance – using the same example, a Gaussian kernel with variance parameter \(\sigma ^2\) is the Fourier transform of a Gaussian distribution with variance \(1/\sigma ^2\). So a small variance in the kernel corresponds to a high variance in the distribution, and vice versa. In fact, we can demonstrate that if the norm of the embedding parameter \(\omega \) is high, then this translates to higher capacity.
Let \(z(x) = e^{2\pi i x}\), \(\mathsf {Re}(z)\) and \(\mathsf {Im}(z)\) be the real and imaginary components of z, respectively, let [a..b] refer to the set of integers between a and b, inclusive (in other words, \(\{n\in \mathbb {Z}\mid a\le n\le b\}\)), and let \(\mathbf {1}_{P}(x)\) be the indicator (or characteristic) function of \(P:\mathbb {R}\rightarrow \{0,1\}\).
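The variance duality can be made concrete with the characteristic function of a Gaussian. The identity below is a standard fact; it assumes the kernel is normalized as \(e^{-\Vert \varvec{\delta }\Vert ^2/(2\sigma ^2)}\), which is one common convention and may differ by a constant factor from the parameterization used elsewhere in the paper.

```latex
\[
E_{\boldsymbol{\omega}\sim\mathcal{N}(\mathbf{0},\,\sigma^{-2}\mathbf{I})}
\left[ e^{i\boldsymbol{\omega}^\top\boldsymbol{\delta}} \right]
= \exp\!\left(-\frac{\Vert\boldsymbol{\delta}\Vert^2}{2\sigma^2}\right),
\qquad \boldsymbol{\delta} = \mathbf{x}-\mathbf{x}'.
\]
```

Halving \(\sigma \) narrows the kernel and doubles the standard deviation of \({\varvec{\omega }}\), so typical frequency vectors have twice the norm.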
Definition 1
An \(({\varvec{\omega }},\beta ,d)\)-range is the set \( \{\mathbf {x}\in \mathbb {R}^d\mid \mathsf {Im}(z({\varvec{\omega }}\cdot \mathbf {x}+\beta ))\ge 0,\,\Vert \mathbf {x}\Vert <1\} \) where \(d\ge 1\) is an integer, \({\varvec{\omega }}\in \mathbb {R}^d\), and \(\beta \in [0,1)\).
Definition 2
Let \(\mathscr {G}_d(R)\) be the set of all \(({\varvec{\omega }},\beta ,d)\)-ranges such that \(\Vert {\varvec{\omega }}\Vert _2\le R\).
Lemma 1
The decision function \(\mathbf {1}_{\mathsf {Im}(z(w x+\beta ))\ge 0}\) induces a unique binary labeling for the set \(x\in \{1/2^{i}\}_{i=1}^n\) for every integer value of \(w\in [1..2^n]\), and any \(\beta \in (0,2^{-(n+1)})\).
Proof
For any integer \(w\in [1..2^n]\) and \(i\in [1..n]\), choose the binary label as 0 if \(z(w / 2^i + \beta )\) lands in the upper half-plane of \(\mathbb {C}\), and 1 if it lands in the lower half-plane. The label can be read as the most significant fractional digit of the binary representation of \(w/2^i\), as long as \(\beta \in (0,2^{-(n+1)})\) ^{Footnote 3}. The labeling is then unique for integer values of w up to \(2^n\). \(\square \)
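The argument of Lemma 1 can be checked numerically for a small n. The sketch below is an illustration under the choices n = 4 and \(\beta = 2^{-(n+2)}\), which are ours; it compares the label of each point \(1/2^i\) with bit \(i-1\) of w, which is the most significant fractional binary digit of \(w/2^i\).

```python
import math

def label(w, x, beta):
    """0 if z(w*x + beta) lies in the upper half-plane, 1 otherwise."""
    return 0 if math.sin(2.0 * math.pi * (w * x + beta)) >= 0.0 else 1

n = 4                                            # small illustrative size
beta = 2.0 ** -(n + 2)                           # any value in (0, 2^-(n+1)) works
points = [2.0 ** -i for i in range(1, n + 1)]    # 1/2, 1/4, ..., 1/2^n

# Labeling of the n points induced by each integer weight w in [1..2^n].
labelings = {w: tuple(label(w, x, beta) for x in points)
             for w in range(1, 2 ** n + 1)}

# First fractional binary digit of w / 2^i is bit (i - 1) of w.
bits = {w: tuple((w >> (i - 1)) & 1 for i in range(1, n + 1))
        for w in labelings}

num_distinct = len(set(labelings.values()))      # equals 2^n if all labelings differ
```

Since the \(2^n\) integer weights produce \(2^n\) distinct labelings of n points, the point set is shattered.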
Clearly, every \(({\varvec{\omega }},\beta ,d)\)-range corresponds to a binary classifier and the range space \((\mathbb {R}^d,\mathscr {G}_d(R))\) is the hypothesis space of interest. We denote the unbounded range space \(\cup _R \mathscr {G}_d(R)\) by \(\mathscr {G}_d(\infty )\).
Theorem 2
The VC-dimension of the range space \((\mathbb {R}^d,\mathscr {G}_d(R))\) is \(\varTheta (\max \{d\log R,d+1\})\).
We prove this theorem in two parts.
Lemma 2
The VC-dimension of \((\mathbb {R}^d,\mathscr {G}_d(R))\) is at least \(d\max \{\lfloor \log _2 R\rfloor ,1\} + 1\).
Proof
Let \(n = \lfloor \log _2 R\rfloor \), for \(R\ge 2\). We now construct a set of dn points. Along each axis of \(\mathbb {R}^d\), place n points with corresponding coordinate from the set \(\{1/2^{i}\}_{i=1}^n\). From Lemma 1, we know that we can induce a binary labeling on every axis-restricted set, using integers \([1..2^n]\). Given \({\varvec{\omega }}\in [1..2^n]^d\), each \(\omega _j\in [1..2^n]\) will give a unique labeling to the points on axis \(j\in [1..d]\), independent of every other axis. Therefore we can uniquely label the whole set of dn points, for all possible labelings.
To add one more point to the set, we select a point \(\mathbf {c}\), the d-dimensional vector with all coordinates equal to a constant c, and make sure that we can find values \(\beta _+\) and \(\beta _-\) so that \(\langle {\mathbf {c}},{{\varvec{\omega }}}\rangle +\beta _+ \ge 0\) and \(\langle {\mathbf {c}},{{\varvec{\omega }}}\rangle +\beta _- < 0\), independently of \({\varvec{\omega }}\). Observe that \(\langle {\mathbf {c}},{{\varvec{\omega }}}\rangle = c\sum _j\omega _j\), and that \(d\le \sum _j\omega _j\le d2^n\). For \(\langle {\mathbf {c}},{{\varvec{\omega }}}\rangle +\beta _- < 0\) we need that \(\beta _-<-\langle {\mathbf {c}},{{\varvec{\omega }}}\rangle \) for all \({\varvec{\omega }}\), since the choice of \(\beta \) must be independent of \({\varvec{\omega }}\). This means that first, \(c<0\) since \(\beta _->0\) and \(\sum _j\omega _j>0\). Then \(-cd\le -\langle {\mathbf {c}},{{\varvec{\omega }}}\rangle \le -cd2^n\), so we need to pick \(\beta _-<-cd\). Similarly, we require \(\beta _+\ge -cd2^n\), and since \(\beta _+<2^{-(n+1)}\), we need \(-c<1/(d2^{2n+1})\). Set \(c=-1/(d2^{2n+2})\), \(\beta _+=2^{-(n+2)}\), and \(\beta _- = 2^{-(2n+3)}\). We can now uniquely label \(dn+1\) points for all possible labelings, when \(R>2\).
Regardless of the value of R, there is always a unique labeling of \(d+1\) points induced by the range space, since we can restrict to a ball small enough that \(\mathsf {Im}(z(\omega x+\beta )) = \sin (2\pi (\omega x+\beta ))\) is monotonic for appropriate values of \(\beta \). Within the ball, the range space is effectively the range space of half-spaces, which has VC-dimension \(d+1\). \(\square \)
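The first part of this construction, the dn axis-aligned points, can be verified by brute force for small sizes. The sketch below uses d = 2, n = 3, and \(\beta = 2^{-(n+2)}\), which are illustrative choices of ours; it does not cover the extra point \(\mathbf {c}\) or the norm constraint on \({\varvec{\omega }}\).

```python
import itertools
import math

def in_range(omega, beta, x):
    """Membership test Im(z(<omega, x> + beta)) >= 0 with z(t) = exp(2 pi i t)."""
    t = sum(w * xi for w, xi in zip(omega, x)) + beta
    return math.sin(2.0 * math.pi * t) >= 0.0

d, n = 2, 3                                      # small illustrative sizes
beta = 2.0 ** -(n + 2)                           # inside (0, 2^-(n+1))

# n points on each of the d coordinate axes, with coordinates 1/2, ..., 1/2^n.
points = []
for j in range(d):
    for i in range(1, n + 1):
        p = [0.0] * d
        p[j] = 2.0 ** -i
        points.append(tuple(p))

# Collect every labeling induced by integer weight vectors in [1..2^n]^d.
labelings = {
    tuple(in_range(omega, beta, p) for p in points)
    for omega in itertools.product(range(1, 2 ** n + 1), repeat=d)
}
num_labelings = len(labelings)                   # 2^(d*n) means the set is shattered
```

Each axis behaves like an independent copy of Lemma 1, so the number of labelings multiplies across axes.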
Corollary 1
The VC-dimension of the range space \((\mathbb {R}^d,\mathscr {G}_d(\infty ))\) is unbounded.
To prove the corresponding upper bound, we use the notion of the shatter function of \((\mathbb {R}^d,\mathscr {G}_d(R))\) [20]. For a positive integer n, the shatter function of a range space is the maximum number of subsets induced by the range space on any set of n points \(X_n\). That is, any range \(\mathscr {R}\) induces a subset of \(X_n\) simply by the intersection \(\mathscr {R} \cap X_n\), and the shatter function counts all unique subsets of this type.
Lemma 3
The shatter function of \((\mathbb {R}^d,\mathscr {G}_d(R))\) is \(O(R^dn^{d+1})\).
Proof
We can first observe that \(\Vert {\varvec{\omega }}\Vert _2\le R\) implies that \(\Vert {\varvec{\omega }}\Vert _\infty \le R\). This implies that \(|\omega _j|\le R\) for every \(j\in [1..d]\). Treating each coordinate separately this way, each term in \(\langle {{\varvec{\omega }}},{\mathbf {x}}\rangle + \beta \) contributes a factor in the shatter function.
For a fixed \({\varvec{\omega }}\), the number of subsets of a set of n points selected by \(({\varvec{\omega }},\beta ,d)\)-ranges is O(n), because as \(\beta \) changes, at most one point enters or leaves the upper half-plane at a time (because the points all travel at the same speed around the unit circle).
For fixed \(\beta \), and fixed \({\varvec{\omega }}\) save for some coordinate \(\omega _j\), on the other hand, how often a point enters or leaves the upper half-plane as \(\omega _j\) varies in (0, R] depends upon the value of \(x_j\). For higher values of \(|x_j|\), the mapped point travels more rapidly. In fact, for \(x_j = 1\), z takes R revolutions around the circle, so it enters and exits the upper half-plane 2R times. The number of subsets is bounded by \( \sum _{i=1}^n 2R|x_i| = 2R \sum _{i=1}^n |x_i| \le 2Rn\). We take the absolute value because a negative \(x_i\) simply changes the direction of travel of \(z(\omega _j x_i + \beta )\). Everything else remains the same. For \({\varvec{\omega }}\) and \(\beta \) varying independently, we now have the bound stated in the lemma. \(\square \)
Lemma 4
The VC-dimension of \((\mathbb {R}^d,\mathscr {G}_d(R))\) is \(O(d\log R)\).
Proof
Follows directly from the relationship between the shatter function and VCdimension [20]. \(\square \)
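One way to spell out this step, as a sketch with an unspecified constant C hidden in the bound of Lemma 3: if some set of m points is shattered, then all \(2^m\) of its subsets must be induced, so the shatter function at m is at least \(2^m\).

```latex
\[
2^m \;\le\; C\,R^d\,m^{d+1}
\quad\Longrightarrow\quad
m \;\le\; \log_2 C + d\log_2 R + (d+1)\log_2 m .
\]
```

Because \(\log _2 m\) grows much more slowly than m, the inequality can only hold for \(m = O(d\log R + d\log d)\), which agrees with the stated \(O(d\log R)\) whenever \(\log R\) is at least of the order of \(\log d\). That last condition is our reading of the argument, not a statement made in the text.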
With Lemmas 2 and 4, we have proven Theorem 2. The VC dimension also gives us a generalization bound, due to Bartlett and Mendelson [4]:
Theorem 3
Let F be a class of \(\pm 1\)-valued functions defined on a set \(\mathscr {X}\). Let P be a probability distribution on \(\mathscr {X}\times \{\pm 1\}\), and suppose that \((X_1,Y_1),\dots ,(X_n,Y_n)\) and (X, Y) are chosen independently according to P. Then for any positive integer n, with probability at least \(1-\delta \) over samples of length n, every \(f\in F\) satisfies
Regularization. Theorems 2 and 3 immediately suggest three regularization techniques: First, we limit the norm of the Fourier weights with weight decay (a.k.a. \(L_2\) regularization). Alternatively, we simply cap the norm of each Fourier weight vector to some constant at each round of the training. We can further control the initial capacity by setting the variance of the initializing distribution.
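The three techniques can be sketched with plain NumPy. This is a minimal illustration, not the training code used in the experiments; the function names and the constants are hypothetical, and the weight matrix is assumed to store one Fourier weight vector per column.

```python
import numpy as np

def weight_decay_gradient(W, decay):
    """Gradient of the L2 penalty (decay / 2) * ||W||_F^2, added to the loss gradient."""
    return decay * W

def cap_column_norms(W, max_norm):
    """Rescale every column of W whose Euclidean norm exceeds max_norm."""
    norms = np.linalg.norm(W, axis=0)
    factor = np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))
    return W * factor

def init_weights(rng, d, D, variance):
    """Gaussian initialization; a smaller variance means lower initial capacity."""
    return rng.normal(0.0, np.sqrt(variance), size=(d, D))

# Columns with norms 5.0 and 0.5: only the first one is rescaled.
W_demo = np.array([[3.0, 0.3],
                   [4.0, 0.4]])
W_capped = cap_column_norms(W_demo, max_norm=1.0)
decay_grad = weight_decay_gradient(W_demo, decay=0.1)
W_init = init_weights(np.random.default_rng(0), d=2, D=8, variance=2.0 ** -4)
```

Capping acts as a hard constraint on \(\Vert {\varvec{\omega }}\Vert _2\le R\), while weight decay only discourages large norms, so the two correspond to different ways of controlling the R that appears in Theorem 2.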
4 From an Embedding to a FeedForward Network
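We now return to the single Fourier embedding
\[z_{{\varvec{\omega }},b}(\mathbf {x}) = \sqrt{2} \cos ({\varvec{\omega }}^\top \mathbf {x} + b).\]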
If we fix an input \(\mathbf {x}\), then we can view the mapping \(z_{{\varvec{\omega }},b}\) as a neuron with a cosine activation function and biases of the form \(b\in [0,2\pi )\). We call this type of neuron a cosine neuron. Such a neuron, with a cutoff to ensure zero support outside an interval, was introduced in [15]. We impose no such cutoff in this work.
Consider a layer of cosine neurons, each with associated weight vector \({\varvec{\omega }}_j\). Each of these weights can be viewed as a sample from some distribution, and therefore the entire ensemble is a (dual) representation of some shift-invariant kernel (by Bochner’s theorem). We can then write the associated classifier for such a combination. Denoting the bias vector by \(\mathbf {b}\) and the collection of all the weight vectors \({\varvec{\omega }}_j\) by W, the resulting classifier (with a softmax layer to combine the individual activations and logarithmic loss), can be written as
where \({{\mathrm{softmax}}}(\mathbf {r})_j = e^{r_j}/\sum _k{e^{r_k}}\), and \(\ell _{\log }\) is the log loss.
What we now have is a standard (shallow) 2-layer network that we can train using backpropagation and stochastic gradient descent.
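A network of this shape can be sketched in NumPy with hand-written backpropagation. The sketch makes several assumptions that are not in the text: the softmax layer has its own weights `V` and biases `c`, the toy data and all sizes are invented, and plain full-batch gradient descent stands in for stochastic gradient descent with momentum.

```python
import numpy as np

def softmax(r):
    e = np.exp(r - r.max(axis=1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=1, keepdims=True)

def forward(X, W, b, V, c):
    """Cosine layer followed by a softmax layer; returns pre-activations, features, probabilities."""
    A = X @ W + b
    H = np.sqrt(2.0 / W.shape[1]) * np.cos(A)
    return A, H, softmax(H @ V + c)

def log_loss(P, y):
    return -np.mean(np.log(P[np.arange(len(y)), y]))

def gradients(X, y, W, b, V, c):
    """Backpropagation through softmax, log loss, and the cosine activation."""
    n, scale = len(y), np.sqrt(2.0 / W.shape[1])
    A, H, P = forward(X, W, b, V, c)
    G = P.copy()
    G[np.arange(n), y] -= 1.0
    G /= n                                        # d loss / d logits
    dA = (G @ V.T) * (-scale * np.sin(A))         # chain rule through the cosine
    return X.T @ dA, dA.sum(axis=0), H.T @ G, G.sum(axis=0)

rng = np.random.default_rng(0)
n, d, D, classes = 200, 2, 64, 2                  # hypothetical sizes
X = rng.uniform(-1.0, 1.0, size=(n, d))
y = (np.linalg.norm(X, axis=1) > 0.7).astype(int) # toy labels: inside vs outside a disk
W = rng.normal(0.0, 1.0, size=(d, D))             # initial sample of the omegas
b = rng.uniform(0.0, 2.0 * np.pi, size=D)
V = rng.uniform(-0.1, 0.1, size=(D, classes))
c = np.zeros(classes)

initial_loss = log_loss(forward(X, W, b, V, c)[2], y)
lr = 0.5
for _ in range(500):
    dW, db, dV, dc = gradients(X, y, W, b, V, c)
    W, b, V, c = W - lr * dW, b - lr * db, V - lr * dV, c - lr * dc
final_loss = log_loss(forward(X, W, b, V, c)[2], y)
```

After training, the columns of `W` are the learned frequencies, and the implied kernel estimate between two inputs is the inner product of their hidden-layer activations.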
5 Experiments
We have designed our experiments to answer the following questions: (1) Does allowing the learning algorithm to pick an arbitrary kernel improve performance over standard MKL techniques that are only allowed to select from a fixed library of kernels? (2) How does the learning algorithm for CKL adapt to large datasets and higher dimensions?
5.1 MKL Vs. CKL on Small Datasets
Since CKL is proposed as an alternative to MKL, we compare CKL to two scalable MKL algorithms, namely SPG-GMKL [22] and MWUMKL [34].
Data Sets. All of the datasets used for the experiments are taken from the libsvm repository^{Footnote 4}. See Table 1 for details of the datasets.
Experimental Procedure. The data for Adult and Mushroom datasets consist of binary features (one-hot representations of categorical features), so no scaling was applied. Features were scaled to the range [−1, 1] for other datasets.
For MKL experiments, we used the scikit-learn Python package [40] for much of the testing infrastructure. For testing with MKL methods, the training data is split randomly into 75 % training and 25 % validation data. The random splits were repeated 100 times for all sets except Mushroom, Gisette, and Adult, which received 20 splits for considerations of time. The C parameter was selected through cross validation and for MWUMKL, the \(\epsilon \) parameter was chosen to be 0.005, to achieve high accuracy while allowing all of the experiments to complete (the number of iterations of the algorithm in [34] is proportional to \(1/\epsilon \)). We use two kernels: a linear kernel and a Gaussian kernel. For the Gaussian kernel, a wide range of \(\gamma \) values is tried and the best accuracy observed is used in the results.
For CKL experiments, the same test/train split was applied, and additionally, the training portion was split further into 75 % training and 25 % validation. We apply early stopping and momentum, and random searches for: the width (\(h_0\)) of the hidden layer, a parameter (\(\sigma \)) used for initializing the weights of the hidden layer, and the learning rate (\(\ell \)) hyperparameters. Training was stopped if the validation objective did not decrease within 100 epochs and was otherwise permitted to run for up to 10,000 epochs. Momentum was applied from the first epoch with a value of 0.5 that was increased to 0.99 over the course of 10 epochs.
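The stopping rule and momentum schedule can be sketched as follows. The linear shape of the momentum ramp is an assumption, since the text only gives the start value, the end value, and the number of epochs; both function names are hypothetical.

```python
import math

def momentum_at(epoch, start=0.5, end=0.99, ramp_epochs=10):
    """Momentum for a given epoch, ramped linearly (assumed) from start to end."""
    frac = min(max(epoch, 0) / ramp_epochs, 1.0)
    return start + frac * (end - start)

def stopping_epoch(val_losses, patience=100, max_epochs=10_000):
    """Index of the last epoch that is run under patience-based early stopping.

    Training stops once the validation objective has not improved on its
    best value for `patience` consecutive epochs, or after max_epochs.
    """
    best, best_epoch = math.inf, -1
    for epoch, loss in enumerate(val_losses[:max_epochs]):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return min(len(val_losses), max_epochs) - 1

# Validation loss improves for two epochs and then plateaus at a worse value.
plateau = [1.0, 0.9] + [0.95] * 200
stop = stopping_epoch(plateau)                   # best epoch is 1, so training stops at 101
```

In practice the parameters from the best validation epoch would be restored after stopping; whether the experiments did so is not stated.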
Values for \(h_0\) were selected from \(\{2^i\}\) with i sampled uniformly from [0..9], except for Gisette, where i was sampled uniformly from [0..14]. The weights of the hidden layer were sampled from a Gaussian distribution with variance \(\sigma \) selected from \(\{2^i\}\) where i was sampled uniformly from \([-6..0]\). The weights of the softmax layer were selected from \(U[-0.1, 0.1]\). Finally \(\ell \) was sampled from \(LU[10^{-5}, 0.2)\) ^{Footnote 5}. 100 models with random hyperparameters were trained, and then the one with the highest performance was chosen and validated with 100 random splits (as described in the previous paragraph).
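The random search can be sketched with NumPy's random generator. The sketch assumes that LU denotes a log-uniform distribution and uses hypothetical names (`sample_hyperparameters`, `init_layers`, and the dictionary keys); it only draws configurations and initial weights and does not train anything.

```python
import numpy as np

def sample_hyperparameters(rng, max_width_exp=9):
    """Draw one configuration; use max_width_exp=14 for the Gisette setting."""
    h0 = 2 ** int(rng.integers(0, max_width_exp + 1))       # width in {2^0, ..., 2^max}
    sigma = 2.0 ** int(rng.integers(-6, 1))                 # init variance in {2^-6, ..., 2^0}
    lr = float(np.exp(rng.uniform(np.log(1e-5), np.log(0.2))))  # log-uniform (assumed meaning of LU)
    return {"h0": h0, "sigma": sigma, "lr": lr}

def init_layers(rng, d, classes, hp):
    """Initial weights: Gaussian hidden layer with variance sigma, uniform softmax layer."""
    W = rng.normal(0.0, np.sqrt(hp["sigma"]), size=(d, hp["h0"]))
    V = rng.uniform(-0.1, 0.1, size=(hp["h0"], classes))
    return W, V

rng = np.random.default_rng(0)
configs = [sample_hyperparameters(rng) for _ in range(100)]  # 100 candidate models
W0, V0 = init_layers(rng, d=20, classes=2, hp=configs[0])    # d and classes are hypothetical
```

Sampling the exponent rather than the value spreads the candidates evenly across orders of magnitude, which is the usual reason for searching widths and learning rates on a log scale.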
Results. The results are shown in Table 2. On very small datasets, CKL is not significantly different from either SPG-GMKL or MWUMKL. Overall, letting the learning algorithm pick an arbitrary kernel improves performance over standard MKL techniques that only choose a mixture of kernels. Additionally, we see that CKL adapts to large datasets and higher dimensions better than MKL.
5.2 MKL Vs. CKL on Million Song Datasets
In this section, we compare MKL methods with CKL on the Million Song Dataset [6]. The Million Song Dataset consists of audio features and metadata of one million contemporary popular music tracks. For the experiments, we utilized three different subsets of the Million Song Dataset, all binary. The features are the average and covariance of the pitch and timbre vectors for each track:

Genre 1: The two most common genres in the Million Song Dataset: “classic pop and rock” and “folk.” Tracks that have both genres as tags are removed to avoid confusion.

Genre 2: The ten most common genres in the Million Song Dataset. Since the “classic pop and rock” genre has significantly more tracks than any other genre, “classic pop and rock” is considered as one class and everything else together as another class.

Year Prediction: Taken from the UCI Machine Learning Repository. All tracks prior to the year 2000 are considered as one class and all tracks after and including the year 2000 are considered as the other class. The dimensions of the dataset are summarized in Table 1.
Results. The results are shown in Table 2. CKL is clearly superior to the scalable MKL methods that we tested against, adding to the evidence that higher-dimensional and larger datasets can benefit from our technique.
5.3 MKL Vs. CKL on Images
We compare MKL and CKL on CIFAR-10. CIFAR-10 [27] is a labeled image dataset containing 60,000 1,024-dimensional (\(32\times 32\)) images and 10 classes used extensively for testing image classification algorithms. While image classification is an important benchmark for neural networks, we wish to point out that our objective is not to classify the CIFAR-10 dataset better than all other previous techniques. Instead, we wish to provide comparisons between the methods described in this paper on a large and very challenging task using a simple convolutional neural architecture.
Preprocessing. We first centered the CIFAR-10 training set by mean, and then used Pylearn2 [19] to apply two transformations: global contrast normalization [12] and ZCA whitening [5]^{Footnote 6}. We applied the same transformations computed for the training set to the testing set.
Feature extraction. For MKL, we used a convolutional neural network (CNN) [30] to learn a representation from the data. In total, we trained 100 models and we extracted the features from the model with the best performance. All of the models had the form \({{\mathrm{conv}}}_{{{\mathrm{ReLU}}}} \rightarrow {{\mathrm{pool}}}_{max} \rightarrow {{\mathrm{fc}}}_{{{\mathrm{ReLU}}}} \rightarrow {{\mathrm{softmax}}}\) where \({{\mathrm{conv}}}_{{{\mathrm{ReLU}}}}\) is a convolutional layer using \({{\mathrm{ReLU}}}\) nonlinearities, \({{\mathrm{pool}}}_{max}\) is a max-pool layer, \({{\mathrm{fc}}}_{{{\mathrm{ReLU}}}}\) is a fully connected layer using \({{\mathrm{ReLU}}}\) nonlinearities, and softmax is a softmax layer.
We trained the models with (1) momentum, initialized to 0.5 and increased to 0.99 over the first 100 epochs, and (2) early stopping: we set aside the last 10,000 samples of the training set as a validation set for early stopping, and trained the models for at most 5,000 epochs. We initialized the weights of all layers by selecting values uniformly at random from the range \([-0.01, 0.01]\). The parameters of the best-performing model were as follows: (1) the convolutional layer (with ReLU activations): a \(5\times 5\) kernel with \(1\times 1\) stride, 32 channels, a max kernel norm of 1.8, and cross channel normalization with \(\alpha = 3.2\times 10^{-4}\) and \(\beta = 0.75\), (2) the max pooling layer: a \(3\times 3\) kernel with \(2\times 2\) stride, (3) the fully connected layer: 1,000 rectified linear units, and (4) the softmax layer: one output for each CIFAR-10 class. Each sample of CIFAR-10 was passed through the CNN and the activations of the fully connected layer were recorded as the new representation.
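The layer sizes implied by these parameters can be worked out with a short calculation. The sketch assumes no padding and floor rounding in both the convolution and the pooling layer; the text does not state either, and a different convention would change the numbers.

```python
def out_size(size, kernel, stride, padding=0):
    """Spatial output size of a convolution or pooling layer (floor rounding assumed)."""
    return (size + 2 * padding - kernel) // stride + 1

channels = 32
conv = out_size(32, kernel=5, stride=1)          # 5x5 kernel, stride 1, no padding -> 28
pool = out_size(conv, kernel=3, stride=2)        # 3x3 pooling, stride 2 -> 13
flat = pool * pool * channels                    # inputs to the fully connected layer -> 5408

layer_shapes = [
    ("conv_ReLU", (conv, conv, channels)),
    ("pool_max", (pool, pool, channels)),
    ("fc_ReLU", (1000,)),                        # the recorded representation
    ("softmax", (10,)),                          # one output per CIFAR-10 class
]
```

Under these assumptions each image is reduced to a 1,000-dimensional feature vector, which is the input dimension seen by the MKL methods.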
CIFAR10 with MKL. For MKL experiments, the testing infrastructure and the experimental procedures are similar to the experimental procedure of Sect. 5.1 except for the following details: (1) Onevsone multiclass strategy is used for the classification task, (2) Random \(75\,\%\) of the training data is used for training and tested on the standard test data. The runs were repeated 20 times, and (3) We used two Gaussian kernels, one with \(\gamma = 1\) and the other with a range of \(\gamma \) from \(2^{7}\) to \(2^{7}\). The best accuracy observed is used in Table 3.
CIFAR10 with CKL. For comparison with MKL, we trained a network of the form \({{\mathrm{conv}}}_{{{\mathrm{ReLU}}}} \rightarrow {{\mathrm{pool}}}_{max} \rightarrow {{\mathrm{fc}}}_{{{\mathrm{ReLU}}}} \rightarrow {{\mathrm{fc}}}_{\cos } \rightarrow {{\mathrm{softmax}}}\). A CKL model of this form uses the same structure as the CNN used for the MKL/CKL experiments (defined in the paragraph “Feature Extraction”), up to and including the fully connected layer of rectified linear units. Instead of a softmax layer, the units of the fully connected layer were connected to a CKL model with 1, 000 hidden units (untuned).
The primary difference between this model and MKL trained on features extracted from a CNN (see Sect. 5.3) is that this model is trained all at once, while in the MKL experiments the CNN used for feature learning and the MKL model were trained separately. This end-to-end learning allows the features of each layer to adapt to the features that appear later in the network. It is also important to note that the MKL experiments were trained on a one-vs.-one basis, while the CKL model uses multinomial (softmax) regression with log loss.
Experimental procedure. The models in these experiments were trained using stochastic gradient descent with momentum and early stopping for a maximum of 1,000 epochs. The momentum rate was initialized to 0.5 and increased to 0.99 over the first 500 epochs of training.
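A momentum schedule of this kind might look like the sketch below. The text gives only the endpoints (0.5 and 0.99) and the ramp length, so the linear interpolation is an assumption, and `momentum_at` and `sgd_momentum_step` are hypothetical names.

```python
def momentum_at(epoch, start=0.5, final=0.99, ramp_epochs=500):
    """Momentum for a given epoch: ramp from `start` to `final`, then hold.

    The linear shape of the ramp is an assumption; the text specifies only
    the start value, the final value and the number of ramp epochs.
    """
    if epoch >= ramp_epochs:
        return final
    frac = max(epoch, 0) / ramp_epochs
    return start + frac * (final - start)

def sgd_momentum_step(weights, velocity, grads, lr, momentum):
    """One classical momentum update: v <- mu * v - lr * g, then w <- w + v."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity

if __name__ == "__main__":
    for epoch in (0, 100, 250, 500, 900):
        print(epoch, round(momentum_at(epoch), 4))
```

Passing `ramp_epochs=100` gives the shorter ramp used when training the feature-extraction CNN.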
Results. The CKL model outstrips the MKL methods by a wide margin. We conjecture that this is due to two effects: (1) the end-to-end training allows for better adaptation in the training process and (2) the search space of kernels is much larger. The first effect demonstrates that CKL is more adaptable than MKL in these settings. It is also important to note that training is a crucial component for CKL models when operating on large datasets. For CIFAR-10, evaluating any random model upon initialization yielded an accuracy of only 10.1 % with standard deviation of 0.235 %. In contrast, evaluating random models on smaller datasets frequently yields accuracies that are better than chance.
CIFAR-10 with Two-Layer Convnets. One might ask whether stacking two cosine layers has any beneficial effect, since stacking two cosine layers is similar to composing two lifting maps, which, when defined, yields a kernel. Zhuang et al. [53] construct an algorithm specifically for the composition of two kernels, essentially layering the kernels. Lu et al. [31] discuss extensions to [41] that cover products, sums, and compositions of kernels. Since these are based on the sampling methodology of [41], there is a direct analogy to composing two cosine layers (fixed, in this case). We did not observe significant improvement in accuracy when we employed combinations of two cosine layers. One possible explanation is that, since the composition of two kernels is itself a kernel, optimizing a network that contains two consecutive cosine layers accomplishes no more than optimizing one with a single cosine layer.
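For readers unfamiliar with the sampling methodology of [41], the sketch below shows a fixed (untrained) cosine layer whose inner products approximate a Gaussian kernel. It assumes the kernel \(\exp (-\gamma \Vert x-y\Vert ^2)\), for which the frequencies are drawn from a Gaussian with variance \(2\gamma \) per coordinate and the phases uniformly from \([0, 2\pi ]\); the function names, the feature count and the seed are illustrative choices, not values from the experiments.

```python
import math
import random

def sample_rff_params(dim, n_features, gamma, seed=0):
    """Sample frequencies W ~ N(0, 2*gamma*I) and phases b ~ U[0, 2*pi]."""
    rng = random.Random(seed)
    std = math.sqrt(2.0 * gamma)
    W = [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(n_features)]
    b = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    return W, b

def rff_map(x, W, b):
    """Fixed cosine layer: z(x) = sqrt(2/D) * cos(W x + b)."""
    scale = math.sqrt(2.0 / len(W))
    return [scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b_j)
            for w, b_j in zip(W, b)]

def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

if __name__ == "__main__":
    gamma = 0.5
    W, b = sample_rff_params(dim=3, n_features=4000, gamma=gamma, seed=1)
    x, y = [0.2, -0.1, 0.4], [0.0, 0.3, 0.1]
    exact = math.exp(-gamma * sum((a - c) ** 2 for a, c in zip(x, y)))
    approx = dot(rff_map(x, W, b), rff_map(y, W, b))
    print(f"exact={exact:.4f}  approx={approx:.4f}")
```

Feeding the output of one such layer into a second one corresponds to the fixed two-layer composition discussed above.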
6 Related Work
Multiple kernel learning. The general area of kernel learning was initiated by Lanckriet et al. [28], who proposed to simultaneously train an SVM and learn a convex combination of kernel functions. The key contribution was to frame the learning problem as an optimization over positive semidefinite kernel matrices, which in turn reduces to a QCQP. Soon after, Bach et al. [3] proposed a block-norm regularization method based on second order cone programming (SOCP).
For efficiency, researchers started using optimization methods that alternate between updating the classifier parameters and the kernel weights. Many authors then explored the MKL landscape, including Rakotomamonjy et al. [42], Sonnenburg et al. [43], and Xu et al. [48, 49]. However, as pointed out by Cortes [13], most of these methods do not compare favorably (in either accuracy or speed) even with the simple uniform heuristic. More recently, Moeller et al. [34] developed a multiplicative-weight-update based approach that has a much smaller memory footprint and scales far more effectively. Other kernel learning methods include [14, 33, 38, 39, 44] and notably methods using the \(\ell _p\)-norm [25, 26, 45].
Infinite-width networks. Early work on infinite-width networks was done by Neal [36], who tied infinite networks to Gaussian processes, assuming that the distribution is Gaussian. Cho and Saul [11] analyzed the case where the network consists of either step units (the output is 1 if the input is positive, 0 otherwise) or rectified linear units (ReLUs), a type of unit used frequently in deep networks (the input z is passed through the function \(\max \{0,z\}\)). They showed that if the distribution is Gaussian in these settings, the function \(\phi _{\mathbf {x}}\) output by the network is a lifting map corresponding to a kernel they dub the arc-cosine kernel. Hazan and Jaakkola [21] extended this result and analyzed the kernel corresponding to two infinite layers stacked in series. They showed that such a network, when the distribution of the first layer is Gaussian and the second layer is treated as a Gaussian process (a process is a distribution of distributions), corresponds to a kernel that can be computed explicitly. Globerson and Livni [16] give an online algorithm for infinite-layer networks that avoids the kernel trick. They show a sample complexity equal to that of methods that use the kernel trick, demonstrating that sampling can be as effective as methods that have access to kernel values.
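As an illustration of the correspondence described by Cho and Saul [11], the sketch below evaluates the closed-form arc-cosine kernel of degree 0 (step units) and degree 1 (ReLUs) and compares the degree-1 value with a Monte Carlo average over standard Gaussian weights. The closed forms are the standard ones, \(k_0 = 1 - \theta /\pi \) and \(k_1 = \frac{1}{\pi }\Vert x\Vert \Vert y\Vert (\sin \theta + (\pi - \theta )\cos \theta )\), with \(\theta \) the angle between x and y; the function names, sample count and seed are arbitrary choices for this example.

```python
import math
import random

def arc_cosine_kernel(x, y, degree=1):
    """Closed-form arc-cosine kernel for degree 0 (step) or 1 (ReLU)."""
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    cos_theta = sum(a * b for a, b in zip(x, y)) / (nx * ny)
    theta = math.acos(max(-1.0, min(1.0, cos_theta)))  # clamp rounding error
    if degree == 0:
        return 1.0 - theta / math.pi
    if degree == 1:
        return (nx * ny / math.pi) * (
            math.sin(theta) + (math.pi - theta) * math.cos(theta))
    raise ValueError("only degrees 0 and 1 are sketched here")

def monte_carlo_relu_kernel(x, y, n_samples=20000, seed=0):
    """Estimate 2 * E_w[relu(<w, x>) * relu(<w, y>)] with w ~ N(0, I)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        w = [rng.gauss(0.0, 1.0) for _ in x]
        wx = sum(w_i * x_i for w_i, x_i in zip(w, x))
        wy = sum(w_i * y_i for w_i, y_i in zip(w, y))
        total += max(0.0, wx) * max(0.0, wy)
    return 2.0 * total / n_samples

if __name__ == "__main__":
    x, y = [1.0, 0.0, 0.0], [0.6, 0.8, 0.0]
    print("closed form:", arc_cosine_kernel(x, y, degree=1))
    print("monte carlo:", monte_carlo_relu_kernel(x, y))
```

The Monte Carlo estimate is a finite-width network with random Gaussian weights; the closed form is its infinite-width limit.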
Layered kernels. Zhuang et al. [53] develop a multiple kernel learning technique where they use a layered kernel to combine the output of several other kernels. Their algorithm alternates the use of standard SVM and stochastic gradient descent. Lu et al. [31] scale up [41] by making some interesting mathematical observations about kernels and distributions. Their work relies heavily on the correspondence between distributions and kernels, a theme that we explore as well. Yu et al. [52] also seek to optimize a kernel, using alternating optimization and also based on Bochner’s theorem. Jiu and Sahbi [23, 24] exploit kernel map networks and Laplacians of nearest-neighbor graphs [24] to produce “deep” kernels for use in SVMs.
Neural networks as kernels. Yang et al. [50] exploit the correspondence between ReLUs and arc-cosine kernels [11], and the sparsity of the Fastfood transform [29], to reduce the complexity of a convolutional neural net.
Aslan et al. [2] seek to make the optimization of neural networks convex through kernels and matrix techniques. Mairal et al. [32] extend hierarchical kernel descriptors [7, 8] to act as convolutional layers. Very recently, Wilson et al. [47] combine neural networks with Gaussian processes, drawing on the infinite-width network setting, to produce “deep” kernels.
Notes
 1.
A kernel \(\kappa (x,y)\) expressible as \(\kappa (x,y) = k(x-y)\).
 2.
 3.
To avoid ambiguity, we require \(\beta >0\), to prevent \(z(w / 2^i)\) from landing on the real axis when \(2^i\) divides w.
 4.
 5.
A random variable X is drawn from LU[a, b] if \(X = e^Y\), where \(Y\sim U[\ln (a),\ln (b))\).
 6.
PCA whitening decorrelates the features of the original data and normalizes its singular values (“whitening”) by rotating the data onto its singular vectors and then rescaling each direction. ZCA whitening, in contrast, does the same but keeps the resulting data as close to the original as possible, in a least-squares sense. The ZCA transformation simply multiplies the data by the inverse square root of its covariance matrix.
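The difference between the two whitening transforms in the last note can be sketched as follows. This is a generic NumPy illustration, not the preprocessing code used in the experiments; the function name `whitening_matrices` and the small stabilizer `eps` are assumptions.

```python
import numpy as np

def whitening_matrices(X, eps=1e-5):
    """Return (W_pca, W_zca) for data X with one sample per row.

    `eps` is an assumed stabilizer that guards against division by
    near-zero eigenvalues of the covariance matrix.
    """
    Xc = X - X.mean(axis=0)                      # center the data
    cov = Xc.T @ Xc / Xc.shape[0]                # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # cov = U diag(eigvals) U^T
    inv_sqrt = np.diag(1.0 / np.sqrt(eigvals + eps))
    W_pca = inv_sqrt @ eigvecs.T                 # rotate, then rescale
    W_zca = eigvecs @ W_pca                      # rotate back: cov^(-1/2)
    return W_pca, W_zca

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mix = np.array([[2.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 0.5]])
    X = rng.normal(size=(500, 3)) @ mix          # correlated toy data
    W_pca, W_zca = whitening_matrices(X)
    Xc = X - X.mean(axis=0)
    Z = Xc @ W_zca.T                             # ZCA-whitened data
    print(np.round(Z.T @ Z / Z.shape[0], 3))     # approximately the identity
```

Both transforms produce data with (approximately) identity covariance; `W_zca` is symmetric, which is what keeps the whitened data aligned with the original axes.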
References
1. Aiolli, F., Donini, M.: EasyMKL: a scalable multiple kernel learning algorithm. Neurocomputing 169, 215–224 (2015)
2. Aslan, O., Zhang, X., Schuurmans, D.: Convex deep learning via normalized kernels. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q. (eds.) NIPS, pp. 3275–3283. Curran Associates, Inc. (2014)
3. Bach, F.R., Lanckriet, G.R.G., Jordan, M.I.: Multiple kernel learning, conic duality, and the SMO algorithm. In: ICML, Banff, Canada (2004)
4. Bartlett, P.L., Mendelson, S.: Rademacher and Gaussian complexities: risk bounds and structural results. JMLR 3, 463–482 (2003)
5. Bell, A.J., Sejnowski, T.J.: Edges are the ‘Independent Components’ of natural scenes. In: Mozer, M.C., Jordan, M.I., Petsche, T. (eds.) NIPS, pp. 831–837. MIT Press (1997)
6. Bertin-Mahieux, T., Ellis, D.P.W., Whitman, B., Lamere, P.: The million song dataset. In: ISMIR (2011)
7. Bo, L., Lai, K., Ren, X., Fox, D.: Object recognition with hierarchical kernel descriptors. In: CVPR, pp. 1729–1736, June 2011
8. Bo, L., Ren, X., Fox, D.: Kernel descriptors for visual recognition. In: Lafferty, J.D., Williams, C.K.I., Shawe-Taylor, J., Zemel, R.S., Culotta, A. (eds.) NIPS, pp. 244–252. Curran Associates, Inc. (2010)
9. Bochner, S.: Lectures on Fourier Integrals. Annals of Mathematics Studies, vol. 42. Princeton University Press, Princeton (1959)
10. Băzăvan, E.G., Li, F., Sminchisescu, C.: Fourier kernel learning. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012, Part II. LNCS, vol. 7573, pp. 459–473. Springer, Heidelberg (2012)
11. Cho, Y., Saul, L.K.: Kernel methods for deep learning. In: Bengio, Y., Schuurmans, D., Lafferty, J.D., Williams, C.K.I., Culotta, A. (eds.) NIPS, pp. 342–350. Curran Associates, Inc. (2009)
12. Coates, A., Ng, A.Y., Lee, H.: An analysis of single-layer networks in unsupervised feature learning. In: AIStats, pp. 215–223 (2011)
13. Cortes, C.: Invited talk: can learning kernels help performance? In: ICML, Montreal, Canada (2009)
14. Cortes, C., Mohri, M., Rostamizadeh, A.: Learning non-linear combinations of kernels. In: NIPS, Vancouver, Canada (2009)
15. Gallant, A., White, H.: There exists a neural network that does not make avoidable mistakes. In: ICNN, vol. 1, pp. 657–664, July 1988
16. Globerson, A., Livni, R.: Learning infinite-layer networks: beyond the kernel trick. arXiv:1606.05316 [cs], June 2016
17. Gönen, M., Alpaydın, E.: Localized multiple kernel learning. In: ICML, Helsinki, Finland (2008)
18. Gönen, M., Alpaydın, E.: Localized algorithms for multiple kernel learning. Pattern Recogn. 46(3), 795–807 (2013)
19. Goodfellow, I.J., Warde-Farley, D., Lamblin, P., Dumoulin, V., Mirza, M., Pascanu, R., Bergstra, J., Bastien, F., Bengio, Y.: Pylearn2: a machine learning research library. arXiv:1308.4214 [cs, stat], August 2013
20. Har-Peled, S.: Geometric Approximation Algorithms. American Mathematical Society, Boston (2011)
21. Hazan, T., Jaakkola, T.: Steps toward deep kernel methods from infinite neural networks. arXiv:1508.05133 [cs], August 2015
22. Jain, A., Vishwanathan, S.V.N., Varma, M.: SPG-GMKL: generalized multiple kernel learning with a million kernels. In: KDD, pp. 750–758 (2012)
23. Jiu, M., Sahbi, H.: Deep kernel map networks for image annotation. In: ICASSP, pp. 1571–1575, March 2016
24. Jiu, M., Sahbi, H.: Laplacian deep kernel learning for image annotation. In: ICASSP, pp. 1551–1555, March 2016
25. Kloft, M., Brefeld, U., Sonnenburg, S., Laskov, P., Müller, K.-R., Zien, A.: Efficient and accurate Lp-norm multiple kernel learning. In: NIPS, Vancouver, Canada (2009)
26. Kloft, M., Brefeld, U., Sonnenburg, S., Zien, A.: Lp-norm multiple kernel learning. JMLR 12, 953–997 (2011)
27. Krizhevsky, A.: Learning Multiple Layers of Features from Tiny Images. Citeseer (2009)
28. Lanckriet, G.R.G., Cristianini, N., Bartlett, P., Ghaoui, L.E., Jordan, M.I.: Learning the kernel matrix with semidefinite programming. JMLR 5, 27–72 (2004)
29. Le, Q., Sarlos, T., Smola, A.: Fastfood - computing Hilbert space expansions in loglinear time. In: ICML, pp. 244–252 (2013)
30. LeCun, Y.: Generalization and network design strategies. In: Pfeifer, R., Schreter, Z., Fogelman, F., Steels, L. (eds.) Connectionism in Perspective. Elsevier, Zurich (1989). An extended version was published as a technical report of the University of Toronto
31. Lu, Z., May, A., Liu, K., Garakani, A.B., Guo, D., Bellet, A., Fan, L., Collins, M., Kingsbury, B., Picheny, M., Sha, F.: How to scale up kernel methods to be as good as deep neural nets. arXiv:1411.4000 [cs, stat], November 2014
32. Mairal, J., Koniusz, P., Harchaoui, Z., Schmid, C.: Convolutional kernel networks. In: NIPS, pp. 2627–2635 (2014)
33. Micchelli, C.A., Pontil, M.: Learning the kernel function via regularization. JMLR 6, 1099–1125 (2005)
34. Moeller, J., Raman, P., Venkatasubramanian, S., Saha, A.: A geometric algorithm for scalable multiple kernel learning. In: AIStats, pp. 633–642 (2014)
35. Motwani, R., Raghavan, P.: Randomized Algorithms. Cambridge University Press, Cambridge (1995)
36. Neal, R.M.: Priors for infinite networks. In: Bayesian Learning for Neural Networks. Lecture Notes in Statistics, vol. 118, pp. 29–53. Springer, New York (1996)
37. Oliva, J., Dubey, A., Poczos, B., Schneider, J., Xing, E.P.: Bayesian Nonparametric Kernel-learning. arXiv:1506.08776 [stat], June 2015
38. Ong, C.S., Smola, A.J., Williamson, R.C.: Learning the kernel with hyperkernels. JMLR 6, 1043–1071 (2005)
39. Orabona, F., Luo, J.: Ultra-fast optimization algorithm for sparse multi kernel learning. In: ICML, Bellevue, USA (2011)
40. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: machine learning in Python. JMLR 12, 2825–2830 (2011)
41. Rahimi, A., Recht, B.: Random features for large-scale kernel machines. In: NIPS, pp. 1177–1184 (2007)
42. Rakotomamonjy, A., Bach, F., Canu, S., Grandvalet, Y.: More efficiency in multiple kernel learning. In: ICML, Corvallis, USA (2007)
43. Sonnenburg, S., Rätsch, G., Schäfer, C., Schölkopf, B.: Large scale multiple kernel learning. JMLR 7, 1531–1565 (2006)
44. Varma, M., Babu, B.R.: More generality in efficient multiple kernel learning. In: ICML, Montreal, Canada (2009)
45. Vishwanathan, S.V.N., Sun, Z., Ampornpunt, N., Varma, M.: Multiple kernel learning and the SMO algorithm. In: NIPS, Vancouver, Canada (2010)
46. Wilson, A., Adams, R.: Gaussian process kernels for pattern discovery and extrapolation. In: ICML, pp. 1067–1075 (2013)
47. Wilson, A.G., Hu, Z., Salakhutdinov, R., Xing, E.P.: Deep Kernel Learning. arXiv:1511.02222 [cs, stat], November 2015
48. Xu, Z., Jin, R., King, I., Lyu, M.R.: An extended level method for efficient multiple kernel learning. In: NIPS, Vancouver, Canada (2008)
49. Xu, Z., Jin, R., Yang, H., King, I., Lyu, M.R.: Simple and efficient multiple kernel learning by group lasso. In: ICML, Haifa, Israel (2010)
50. Yang, Z., Moczulski, M., Denil, M., de Freitas, N., Smola, A., Song, L., Wang, Z.: Deep Fried Convnets. arXiv:1412.7149, December 2014
51. Yang, Z., Wilson, A., Smola, A., Song, L.: À la Carte - learning fast kernels. In: AIStats, pp. 1098–1106 (2015)
52. Yu, F.X., Kumar, S., Rowley, H., Chang, S.-F.: Compact Nonlinear Maps and Circulant Extensions. arXiv:1503.03893 [cs, stat], March 2015
53. Zhuang, J., Tsang, I.W., Hoi, S.: Two-layer multiple kernel learning. In: AIStats, pp. 909–917 (2011)
Copyright information
© 2016 Springer International Publishing AG
About this paper
Cite this paper
Moeller, J., Srikumar, V., Swaminathan, S., Venkatasubramanian, S., Webb, D. (2016). Continuous Kernel Learning. In: Frasconi, P., Landwehr, N., Manco, G., Vreeken, J. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2016. Lecture Notes in Computer Science, vol 9852. Springer, Cham. https://doi.org/10.1007/9783319462271_41
DOI: https://doi.org/10.1007/9783319462271_41
Publisher Name: Springer, Cham
Print ISBN: 9783319462264
Online ISBN: 9783319462271
eBook Packages: Computer Science, Computer Science (R0)