1 Introduction

It has long been appreciated in the machine learning community that learning sparse models can bring multiple benefits, such as better interpretability, improved accuracy through mitigating the curse of dimensionality, computational efficiency at prediction time, and reduced costs for gathering and storing measurements. A plethora of sparse learning methods has been proposed for linear models [16]. However, developing similar methods in the nonlinear setting remains a challenging task.

Generalized additive models [15] can use sparsity techniques similar to those of their linear counterparts. However, the function class of linear combinations of nonlinear transformations is too limited to represent general nonlinear functions. Kernel methods [29] have long been the workhorse of nonlinear modelling. Recently, a substantial effort has been invested into developing kernel methods with feature selection capabilities [5]. The most successful approaches within the filter methods are based on mapping distributions into reproducing kernel Hilbert spaces (RKHS) [23]. Amongst the embedded methods, multiple algorithms use the feature-scaling weights proposed in [32]. The authors in [28] follow an alternative strategy based on the function and kernel partial derivatives.

All the kernel-based approaches above suffer from a common problem: they do not scale well to large data sets. Kernel methods allow for nonlinear modelling by applying high-dimensional (possibly infinite-dimensional) nonlinear transformations \(\phi : \mathcal {X}\rightarrow \mathcal {H}\) to the input data. Thanks to what is known as the kernel trick, these transformations never need to be evaluated explicitly. Instead, the kernel methods operate only over the inner products between pairs of data points, which can be calculated quickly by the use of positive definite kernel functions \(k : \mathcal {X}\times \mathcal {X}\rightarrow \mathbb {R}, \, k(\mathbf {x},\tilde{\mathbf {x}})=\langle \phi (\mathbf {x}),\phi (\tilde{\mathbf {x}}) \rangle \). Given that these inner products need to be calculated for all pairs of data points, kernel methods are generally costly, both in computation and memory, for datasets with a large number n of training points. This is further exacerbated for the kernel variable selection methods, which typically need to perform the \(\mathcal {O}(n^2)\) kernel evaluations multiple times (for each input dimension, or with each iterative update).

In this work we propose a novel kernel-based method for input variable selection in nonlinear regression that can scale to datasets with large numbers of training points. The method builds on the idea of approximating the kernel evaluations by random Fourier features [24]. Instead of fixing the distributions generating the random features a priori, it learns them together with the predictive model such that they degenerate for the irrelevant input dimensions. The method falls into the category of embedded approaches that seek to improve the predictive accuracy of the learned models through sparsity [18]. This is the first kernel-based variable selection method for general nonlinear functions that scales to large datasets of tens of thousands of training instances.

2 Background

We formulate the problem of nonlinear regression as follows: given a training set of n input-output pairs \(\mathcal {S}_n = \{ (\mathbf {x}_i,y_i) \in (\mathcal {X}\times \mathcal {Y}) \, : \, \mathcal {X}\subseteq \mathbb {R}^d, \mathcal {Y}\subseteq \mathbb {R}, \, i \in \mathbb {N}_n \}\) sampled i.i.d. according to some unknown probability measure \(\rho \), our task is to estimate the regression function \(f : \mathcal {X}\rightarrow \mathcal {Y}, \, f(\mathbf {x}) = \mathbb {E}(y|\mathbf {x})\) that minimizes the expected squared error loss \(\mathcal {L}(f) = \mathbb {E} \left( y - f(\mathbf {x}) \right) ^2 = \int \left( y - f(\mathbf {x}) \right) ^2 d\rho (\mathbf {x},y)\).

In the variable selection setting, we assume that the regression function does not depend on all the d input variables. Instead, it depends only on a subset \(\mathcal {I}\) of these of size \(l < d\), so that \(f(\mathbf {x}) = f(\tilde{\mathbf {x}})\) if \(x^s = \tilde{x}^s\) for all dimensions \(s \in \mathcal {I}\).

We follow the standard theory of regularised kernel learning and estimate the regression function as the solution to the following problem

$$\begin{aligned} \widehat{f} = \mathop {\mathrm {arg\,min}}\limits _{f \in \mathcal {H}} \, \mathcal {\widehat{L}}(f) + \lambda ||f||^2_\mathcal {H}. \end{aligned}$$
(1)

Here the function hypothesis space \(\mathcal {H}\) is a reproducing kernel Hilbert space (RKHS), \(||f||_\mathcal {H}\) is the norm induced by the inner product in that space, and \(\mathcal {\widehat{L}}(f) = \frac{1}{n} \sum _i^n \left( y_i - f(\mathbf {x}_i) \right) ^2\) is the empirical loss replacing the intractable expected loss above.

From the standard properties of the RKHS, the classical result (e.g. [29]) states that the evaluation of the minimizing function \(\widehat{f}\) at any point \(\tilde{\mathbf {x}}\in \mathcal {X}\) can be represented as a linear combination of the kernel functions k over the n training points

$$\begin{aligned} \widehat{f}(\tilde{\mathbf {x}}) = \sum _i^n c_i \, k(\mathbf {x}_i,\tilde{\mathbf {x}}) . \end{aligned}$$
(2)

The parameters \(\mathbf {c}\) are obtained by solving the linear problem

$$\begin{aligned} (\mathbf {K}+ \lambda \mathbf {I}_n) \, \mathbf {c}= \mathbf {y}, \end{aligned}$$
(3)

where \(\mathbf {K}\) is the \(n \times n\) kernel matrix with the elements \(K_{ij} = k(\mathbf {x}_i,\mathbf {x}_j)\) for all \(\mathbf {x}_i,\mathbf {x}_j \in \mathcal {S}_n\).
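For concreteness, a minimal numpy sketch of this classical kernel ridge regression estimator (Eqs. (2) and (3)); the Gaussian kernel is used here purely as an illustrative choice:

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Pairwise Gaussian kernel values k(a_i, b_j) between the rows of A and B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def kernel_ridge_fit(X, y, lam=1e-2, sigma=1.0):
    """Solve (K + lam * I_n) c = y as in Eq. (3): O(n^2) memory, O(n^3) time."""
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def kernel_ridge_predict(X_train, c, X_test, sigma=1.0):
    """Evaluate f_hat(x~) = sum_i c_i k(x_i, x~) as in Eq. (2); needs the full training set."""
    return gaussian_kernel(X_test, X_train, sigma) @ c
```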

2.1 Random Fourier Features

Equations (2) and (3) point clearly to the scaling bottlenecks of kernel regression. In principle, training requires constructing and keeping in memory the \((n \times n)\) kernel matrix and solving an n-dimensional linear system (\(\propto \mathcal {O}(n^3)\)). Furthermore, the whole training set \(\mathcal {S}_n\) needs to be stored and accessed at test time, so that the predictions are of the order \(\mathcal {O}(n)\).

To address these scaling issues, the authors in [24] proposed to map the data into a low-dimensional Euclidean space \(\mathbf {z}: \mathcal {X}\rightarrow \mathbb {R}^D\) so that the inner products in this space are close approximations of the corresponding kernel evaluation \(\langle \mathbf {z}(\mathbf {x}), \mathbf {z}(\tilde{\mathbf {x}}) \rangle _{\mathbb {R}^D} \approx \langle \phi (\mathbf {x}), \phi (\tilde{\mathbf {x}}) \rangle _{\mathcal {H}} = k(\mathbf {x},\tilde{\mathbf {x}})\). Using the nonlinear features \(\mathbf {z}(\mathbf {x}) \in \mathbb {R}^D\) the evaluations of the minimising function can be approximated by

$$\begin{aligned} \widehat{f}(\tilde{\mathbf {x}}) \approx \langle \mathbf {z}(\tilde{\mathbf {x}}), \mathbf {a}\rangle _{\mathbb {R}^D} , \end{aligned}$$
(4)

where the coefficients \(\mathbf {a}\) are obtained from solving the linear system

$$\begin{aligned} (\mathbf {Z}^T \mathbf {Z}+ \lambda \mathbf {I}_D) \, \mathbf {a}= \mathbf {Z}^T \mathbf {y}, \end{aligned}$$
(5)

where \(\mathbf {Z}\) is the \((n \times D)\) matrix of the random features for all the data points. The above approximation requires the construction of the \(\mathbf {Z}\) matrix and solving the D-dimensional linear problem, hence significantly reducing the training costs if \(D \ll n\). Moreover, access to training points is no longer needed at test time and the predictions are of the order \(\mathcal {O}(D) \ll \mathcal {O}(n)\).
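For concreteness, a minimal numpy sketch of these two operations, assuming the random feature matrices \(\mathbf {Z}\) and \(\mathbf {Z}_{\mathrm{test}}\) have already been constructed (their construction is described next):

```python
import numpy as np

def rff_fit(Z, y, lam=1e-2):
    """Solve (Z^T Z + lam * I_D) a = Z^T y as in Eq. (5); only a D x D system."""
    D = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(D), Z.T @ y)

def rff_predict(Z_test, a):
    """Evaluate f_hat(x~) ~= <z(x~), a> as in Eq. (4); O(D) per prediction."""
    return Z_test @ a
```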

To construct well-approximating features, the authors in [24] called upon Bochner’s theorem which states that a continuous function \(g : \mathbb {R}^d \rightarrow \mathbb {R}\) with \(g(\mathbf {0}) = 1\) is positive definite if and only if it is a Fourier transform of some probability measure on \(\mathbb {R}^d\). For translation-invariant positive definite kernels we thus have

$$\begin{aligned} k(\mathbf {x},\tilde{\mathbf {x}}) = g(\mathbf {x}-\tilde{\mathbf {x}}) = g(\pmb {\lambda }) = \int _{\mathbb {R}^d} e^{i \, \pmb {\omega }^T\pmb {\lambda }} \, \mathsf {d}\mu (\pmb {\omega }) , \end{aligned}$$
(6)

where \(\mu (\pmb {\omega })\) is the probability measure on \(\mathbb {R}^d\). In the above, g is the characteristic function of the multivariate random variable \(\pmb {\omega }\) defined by the expectation \(g(\pmb {\lambda }) = \mathbb {E}_{\pmb {\omega }} ( e^{i \, \pmb {\omega }^T\pmb {\lambda }} ) = \mathbb {E}_{\pmb {\omega }} ( e^{i \, \pmb {\omega }^T (\mathbf {x}- \tilde{\mathbf {x}})} ) = \mathbb {E}_{\pmb {\omega }} ( e^{i \, \pmb {\omega }^T \mathbf {x}} e^{-i \, \pmb {\omega }^T \tilde{\mathbf {x}}} ) = k(\mathbf {x},\tilde{\mathbf {x}})\).

It is straightforward to show that the expectation over the complex exponential can be decomposed into an expectation over an inner product \(\mathbb {E}_{\pmb {\omega }} (e^{i \, \pmb {\omega }^T (\mathbf {x}- \tilde{\mathbf {x}})}) = \mathbb {E}_{\pmb {\omega }} \langle \psi _{\pmb {\omega }}(\mathbf {x}), \psi _{\pmb {\omega }}(\tilde{\mathbf {x}}) \rangle \) where the nonlinear mappings are defined as \(\psi _{\pmb {\omega }}: \mathcal {X}\rightarrow \mathbb {R}^2\), \(\psi _{\pmb {\omega }}(\mathbf {x}) = [ \cos (\pmb {\omega }^T\mathbf {x}) \, \sin (\pmb {\omega }^T\mathbf {x}) ]^T\). In [24] the authors proposed an even lower-dimensional transformation \(\varphi _{\pmb {\omega },b}(\mathbf {x}): \mathcal {X}\rightarrow \mathbb {R}\)

$$\begin{aligned} \varphi _{\pmb {\omega },b}(\mathbf {x}) = \sqrt{2} \cos (\pmb {\omega }^T\mathbf {x}+ b) , \end{aligned}$$
(7)

where b is sampled uniformly from \([0,2\pi ]\); this mapping satisfies the expectation equality \(\mathbb {E}_{\pmb {\omega }} (e^{i \, \pmb {\omega }^T (\mathbf {x}- \tilde{\mathbf {x}})}) = \mathbb {E}_{\pmb {\omega },b} \langle \varphi _{\pmb {\omega },b}(\mathbf {x}), \varphi _{\pmb {\omega },b}(\tilde{\mathbf {x}}) \rangle \). We choose to work with the mapping \(\varphi \) (dropping the subscripts \(\pmb {\omega },b\) when there is no risk of confusion) in the remainder of the text. The approximating nonlinear feature \(\mathbf {z}(\mathbf {x})\) for each data point \(\mathbf {x}\) is obtained by concatenating D instances of the random mappings, \(\mathbf {z}(\mathbf {x}) = [\varphi ^1(\mathbf {x}), \ldots , \varphi ^D(\mathbf {x})]^T\), with \(\pmb {\omega }\) and b sampled according to their probability distributions so that the expectation is approximated by the sample average over the D draws.
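For concreteness, a minimal numpy sketch of this construction for the Gaussian kernel, whose spectral measure \(\mu (\pmb {\omega })\) is a Gaussian with covariance \(\sigma ^{-2}\mathbf {I}\); the \(1/D\) normalisation of the sample average is folded into the features so that \(\langle \mathbf {z}(\mathbf {x}), \mathbf {z}(\tilde{\mathbf {x}}) \rangle \approx k(\mathbf {x},\tilde{\mathbf {x}})\):

```python
import numpy as np

def sample_rff_params(d, D, sigma=1.0, seed=0):
    """Sample frequencies omega ~ N(0, sigma^-2 I) and phases b ~ U[0, 2*pi]."""
    rng = np.random.default_rng(seed)
    omega = rng.standard_normal((D, d)) / sigma
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return omega, b

def rff_features(X, omega, b):
    """Map each row x of X to sqrt(2/D) * [cos(omega_1^T x + b_1), ..., cos(omega_D^T x + b_D)]."""
    D = omega.shape[0]
    return np.sqrt(2.0 / D) * np.cos(X @ omega.T + b)

# Sanity check: the feature inner products approximate the Gaussian kernel.
X = np.random.randn(5, 3)
omega, b = sample_rff_params(d=3, D=5000, sigma=1.0)
Z = rff_features(X, omega, b)
K_approx = Z @ Z.T
K_exact = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2.0)
print(np.abs(K_approx - K_exact).max())   # small, and shrinks as D grows
```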

2.2 Variable Selection Methods

In this section we position our research with respect to other nonlinear methods for variable selection with an emphasis on kernel methods.

In the class of generalized additive models, lessons learned from linear models can be reused to construct sparse linear combinations of nonlinear functions of each variable or, taking into account also possible interactions, of all possible pairs, triplets, etc., e.g. [20, 26, 31, 34]. Closely related to these are the multiple kernel learning methods that seek to learn a sparse linear combination of kernel bases, e.g. [2, 3, 19]. While these methods have shown some encouraging results, their simplifying additive assumption and the rapidly increasing complexity when higher-order interactions are to be considered (potentially \(2^d\) additive terms for d input variables) clearly present a serious limitation.

Recognising these shortcomings, multiple alternative approaches for general nonlinear functions were explored in the literature. They can broadly be grouped into three categories [18]: filters, wrappers and embedded methods.

The filter methods consider variable selection as a preprocessing step that is then followed by an independent algorithm for learning the predictive model. Many traditional methods based on information-theoretic or statistical measures of dependency (e.g. information gain, Fisher score, etc.) fall into this category [6]. More recently, significant advances have been achieved in formulating criteria more appropriate for non-trivial nonlinear dependencies [8, 9, 12, 27, 30, 33]. These are based on the use of (conditional) cross-covariance operators arising from embedding probability measures into reproducing kernel Hilbert spaces (RKHS) [23]. However, they remain largely disconnected from the predictive model learning procedure and oblivious to the effects the variable selection has on the final predictive performance.

The wrapper methods perform variable selection on top of the learning algorithm, treating it as a black box. These are practical heuristics (such as greedy forward selection or backward elimination) for searching the space of all \(2^d\) possible subsets of the d input variables [18]. A classical example in this category is the SVM with Recursive Feature Elimination [14]. The wrappers are universal methods that can be used on top of any learning algorithm, but they can become expensive for large dimensionalities d, especially if the complexity of the underlying algorithm is high.

Finally, the embedded methods link the variable selection to the training of the predictive model with the aim of achieving higher predictive accuracy stemming from the learned sparsity. Our method falls into this category. There are essentially just two branches of kernel-based methods here: methods based on feature rescaling [1, 10, 21, 25, 32], and derivative-based methods [11, 28]. We discuss the feature rescaling methods in more detail in Sect. 3.2. The derivative-based methods use regularizers over the partial derivatives of the function and exploit the derivative reproducing property [36] to arrive at an alternative finite-dimensional representation of the function. Though theoretically intriguing, these methods scale rather badly: in addition to the \((n \times n)\) kernel matrix, they also construct the \((nd \times n)\) and \((nd \times nd)\) matrices of first and second order partial kernel derivatives and use their concatenations to formulate the sparsity-constrained optimisation problem.

There exist two other large groups of sparse nonlinear methods. These address the sparsity in either the latent feature representation, e.g. [13], or in the data instances, e.g. [7]. While their motivation partly overlaps with ours (control of overfitting, lower computational costs at prediction), their focus is on a different notion of sparsity that is out of the scope of our discussion.

3 Towards Sparsity in Input Dimensions

As stated above, our objective in this paper is learning a regression function that is sparse with respect to the input variables. Stated differently, the function shall be insensitive to the values of the inputs in the complement set \({\mathcal {I}}^{\mathsf {c}}\) of the \(d-l\) irrelevant dimensions, so that \(f(\mathbf {x}) = f(\tilde{\mathbf {x}})\) whenever \(x^s = \tilde{x}^s\) for all \(s \in \mathcal {I}\).

From Eq. (4) we observe that the function evaluation is a linear combination of the D random features \(\varphi \). The random features (7) are in turn constructed from the input \(\mathbf {x}\) through the inner product \(\pmb {\omega }^T \mathbf {x}\). Intuitively, if the function \(\widehat{f}\) is insensitive to an input dimension s, the value of the corresponding input \(x^s\) shall not enter the inner product \(\pmb {\omega }^T \mathbf {x}\) generating the D random features. Formally, we require \(\omega ^s x^s = 0\) for all \(s \in {\mathcal {I}}^{\mathsf {c}}\) which is obviously achieved by \(\omega ^s = 0\). We therefore identify the problem of learning sparse predictive models with sparsity in vectors \(\pmb {\omega }\).
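As a quick numerical illustration of this point (for intuition only; how such sparse \(\pmb {\omega }\) samples are actually obtained is the subject of the next section): if every sampled frequency vector has a zero in dimension s, perturbing \(x^s\) leaves all D random features, and hence the prediction, unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
d, D = 5, 100
omega = rng.standard_normal((D, d))
omega[:, 2] = 0.0                      # dimension s = 2 declared irrelevant
b = rng.uniform(0.0, 2.0 * np.pi, D)

def features(x):
    return np.sqrt(2.0) * np.cos(omega @ x + b)   # Eq. (7), one feature per sampled omega

x = rng.standard_normal(d)
x_perturbed = x.copy()
x_perturbed[2] += 10.0                 # change only the irrelevant coordinate

print(np.allclose(features(x), features(x_perturbed)))   # True
```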

3.1 Learning Through Random Sampling

Though in Eq. (7) \(\pmb {\omega }\) appears as a parameter of the nonlinear transformation \(\varphi \), it cannot be learned directly as it is the result of random sampling from the probability distribution \(\mu (\pmb {\omega })\). In order to ensure the same sparse pattern in the D random samples of \(\pmb {\omega }\), we use a procedure similar to what is known as the reparametrization trick in the context of variational auto-encoders [17].

We begin by expanding Eq. (6) of Bochner's theorem into the marginals across the d dimensions (this factorization holds for product measures \(\mu \), such as those associated with the Gaussian and Laplace kernels)

$$\begin{aligned} g(\pmb {\lambda }) = \int _{\mathbb {R}^d} e^{i \, \pmb {\omega }^T\pmb {\lambda }} \, \mathsf {d}\mu (\pmb {\omega }) = \int _{\mathbb {R}} e^{i \, \omega ^1 \lambda ^1} \mathsf {d}\mu (\omega ^1) \ldots \int _{\mathbb {R}} e^{i \, \omega ^d \lambda ^d} \mathsf {d}\mu (\omega ^d) . \end{aligned}$$
(8)

To ensure that \(\omega ^s = 0\) when \(s \in {\mathcal {I}}^{\mathsf {c}}\) in all the D random samples, the corresponding probability measure (distribution) \(\mu (\omega ^s)\) needs to degenerate to \(\delta (\omega ^s)\). The distribution \(\delta (\omega ^s)\) has all its mass concentrated at the point \(\omega ^s = 0\) and has the property \(\int _{\mathbb {R}} h(\omega ^s) \, \mathsf {d}\delta (\omega ^s) = h(0)\). In particular, for h the complex exponential we have \(\int _{\mathbb {R}} e^{i \, \omega ^s \lambda ^s} \, \mathsf {d}\delta (\omega ^s) = 1\), so that the value of \(\lambda ^s\) has no impact on the product in Eq. (8), and therefore no impact on \(g(\pmb {\lambda })\).

Reparametrization Trick. To ensure that all the D random samples of \(\pmb {\omega }\) have the same sparse pattern, we need to be able to optimise through the random sampling. For each element \(\omega \) of the vector \(\pmb {\omega }\), we parametrize its sampling distribution \(\mu _\gamma (\omega )\) by a scale \(\gamma \) such that \(\lim _{\gamma \rightarrow 0} \mu _\gamma (\omega ) = \delta (\omega )\). We next express each of the univariate random variables \(\omega \) as a deterministic transformation \(\omega = q_{\gamma }(\epsilon ) = \gamma \epsilon \) (scaling) of an auxiliary random variable \(\epsilon \) with a fixed probability distribution \(\mu _1(\epsilon )\) with the scale parameter \(\gamma = 1\). For example, for the Gaussian and Laplace kernels the auxiliary distributions \(\mu _1(\epsilon )\) are the standard Gaussian and the standard Cauchy, respectively.

By the above reparametrization of the random variable \(\pmb {\omega }\) we disconnect the sampling operation over \(\pmb {\epsilon }\) from the rescaling operation \(\pmb {\omega }= \mathbf {q}_{\pmb {\gamma }}(\pmb {\epsilon }) = \pmb {\epsilon }\odot \pmb {\gamma }\) with a deterministic parameter vector \(\pmb {\gamma }\). Sparsity in \(\pmb {\omega }\) (and therefore in the learned model) can now be achieved by learning a sparse parameter vector \(\pmb {\gamma }\).
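A minimal numpy sketch of this reparametrization, using the Gaussian-kernel case (standard Gaussian \(\mu _1(\epsilon )\)) as an example: the auxiliary \(\pmb {\epsilon }\) and b are sampled once and kept fixed, while \(\pmb {\gamma }\) is an ordinary deterministic parameter that can be optimised and sparsified.

```python
import numpy as np

rng = np.random.default_rng(0)
d, D = 10, 300

# Fixed auxiliary randomness: epsilon ~ mu_1 (standard Gaussian here), b ~ U[0, 2*pi].
eps = rng.standard_normal((D, d))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

def scaled_features(X, gamma):
    """Features built from omega = gamma * eps (elementwise), i.e. cos(eps^T (gamma * x) + b), cf. Eq. (9)."""
    return np.sqrt(2.0 / D) * np.cos((X * gamma) @ eps.T + b)

# A sparse gamma switches off the corresponding input dimensions in every one of the D samples.
gamma = np.zeros(d)
gamma[[0, 3]] = 1.0
Z = scaled_features(rng.standard_normal((100, d)), gamma)
```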

Though in principle it would be possible to learn the sparsity in the sampled \(\pmb {\omega }\)'s directly, this would mean sparsifying the D sampled vectors \(\pmb {\omega }\) instead of the single vector \(\pmb {\gamma }\). Moreover, the procedure would need to cater for the additional constraint that all the samples share the same sparse pattern. While theoretically possible, we find our reparametrization approach more elegant and practical.

3.2 Link to Feature Scaling

In the previous section we have built our strategy for sparse learning using the inverse Fourier transform of the kernels and the degeneracy of the associated probability measures. When we plug the rescaling operation into the random feature mapping (7)

$$\begin{aligned} \varphi (\mathbf {x}) = \sqrt{2} \cos (\pmb {\omega }^T\mathbf {x}+ b) = \sqrt{2} \cos ((\pmb {\epsilon }\odot \pmb {\gamma })^T\mathbf {x}+ b) = \sqrt{2} \cos (\pmb {\epsilon }^T (\pmb {\gamma }\odot \mathbf {x}) + b), \end{aligned}$$
(9)

we see that the parameters \(\pmb {\gamma }\) can be interpreted as weights scaling the input variables \(\mathbf {x}\). This makes a link to the variable selection methods based on feature scaling. These are rather straightforward when the kernel is simply linear, or when the nonlinear transformations \(\phi (\mathbf {x})\) can be evaluated explicitly (e.g. polynomial) [10, 32]. In essence, instead of applying the weights to the input features, they are applied to the associated model parameters and suitably constrained to approximate the zero-norm problem.

More complex kernels, for which the nonlinear features \(\phi (\mathbf {x})\) cannot be directly evaluated (they may be infinite dimensional), are considered in [1, 21, 25]. Here the scaling is applied within the kernel function \(k(\pmb {\gamma }\odot \mathbf {x},\pmb {\gamma }\odot \tilde{\mathbf {x}})\). The methods typically apply a two-step iterative procedure: they fix the rescaling parameters \(\pmb {\gamma }\) and learn the corresponding n-long model parameter vector \(\mathbf {c}\) (Eq. (2)); then fix \(\mathbf {c}\) and learn the d-long rescaling vector \(\pmb {\gamma }\) under some suitable zero-norm approximating constraint. The naive formulation for \(\pmb {\gamma }\) is a nonconvex problem that requires calculating derivatives of the kernel functions with respect to \(\pmb {\gamma }\) (which, depending on the kernel function, may become rather expensive). In [1], the author proposed a convex relaxation based on linearization of the kernel function. Nevertheless, all the existing methods applying the feature scaling within the kernel functions scale badly with the number of instances, as they need to recalculate the \((n \times n)\) kernel matrix and solve the corresponding optimisation (typically \(\mathcal {O}(n^3)\)) with every update of the weights \(\pmb {\gamma }\).
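To make this cost concrete, a schematic sketch (illustrative only, not a faithful reimplementation of any of the cited methods) of the work incurred by a single "fix \(\pmb {\gamma }\), solve for \(\mathbf {c}\)" step in such approaches:

```python
import numpy as np

def gaussian_kernel_scaled(A, B, gamma):
    """k(gamma * a, gamma * b) for a Gaussian kernel; an O(n^2) computation."""
    As, Bs = A * gamma, B * gamma
    sq = np.sum(As**2, 1)[:, None] + np.sum(Bs**2, 1)[None, :] - 2.0 * As @ Bs.T
    return np.exp(-np.maximum(sq, 0.0) / 2.0)

def scaled_kernel_ridge_step(X, y, gamma, lam=1e-2):
    """One 'fix gamma, solve for c' step: rebuild the n x n kernel matrix and solve an n x n system."""
    K = gaussian_kernel_scaled(X, X, gamma)           # O(n^2 d) time, O(n^2) memory
    return np.linalg.solve(K + lam * np.eye(len(X)), y)   # O(n^3)

# The gamma update (not shown) then needs kernel derivatives w.r.t. gamma,
# and the whole pair of steps is iterated until convergence.
```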

4 Sparse Random Fourier Features Algorithm

In this section we present our algorithm for learning with Sparse Random Fourier Features (SRFF).

[Algorithm listing: Sparse Random Fourier Features (SRFF)]

Similarly to the feature scaling methods, we propose a two-step alternating procedure to learn the model parameters \(\mathbf {a}\) and the distribution scalings \(\pmb {\gamma }\). For a fixed \(\pmb {\gamma }\) we generate the random features for all the input training points (\(\mathcal {O}(nD)\)) and solve the linear problem (5) (\(\mathcal {O}(D^3)\)) to get the D-long model parameter vector \(\mathbf {a}\). Given that in our large-sample setting we assume \(D \ll n\), this step is significantly cheaper than the corresponding step for learning the \(\mathbf {c}\) parameters in the existing kernel feature scaling methods described in Sect. 3.2.

In the second step, we fix the model parameters \(\mathbf {a}\) and learn the d-long vector of distribution scalings \(\pmb {\gamma }\). We formulate the optimisation problem as the minimisation of the empirical squared error loss with \(\pmb {\gamma }\) constrained to the probability simplex \(\varDelta \) to encourage sparsity.

$$\begin{aligned} \mathop {\mathrm {arg\,min}}\limits _{\pmb {\gamma }\in \varDelta } J(\pmb {\gamma }), \qquad \qquad J(\pmb {\gamma }) := || \mathbf {y}- \mathbf {Z}\mathbf {a}||_2^2 \end{aligned}$$
(10)

Here the \((n \times D)\) matrix \(\mathbf {Z}\) is constructed by concatenating the D random features \(\varphi \) with the \(\pmb {\gamma }\) rescaling (9).

We solve problem (10) by the projected gradient method with accelerated FISTA line search [4]. The gradient can be constructed from the partial derivatives as follows

$$\begin{aligned} \begin{aligned} \frac{\partial J(\pmb {\gamma })}{\partial \gamma ^s}&= -(\mathbf {y}- \mathbf {Z}\, \mathbf {a})^T \, \frac{\partial \mathbf {Z}}{\partial \gamma ^s} \, \mathbf {a}\qquad \qquad \forall s \in \mathbb {N}_d \\ \frac{\partial Z_{ij}}{\partial \gamma ^s}&= - \sqrt{2} \sin (\pmb {\epsilon }^T (\pmb {\gamma }\odot \mathbf {x}) + b) \, \epsilon ^s x^s, \qquad \epsilon ^s = \omega ^s / \gamma ^s . \end{aligned} \end{aligned}$$
(11)

Unlike in the other kernel feature scaling methods, the form of the gradient (11) is always the same irrespective of the choice of the kernel. The particular kernel choice is reflected only in the probability distribution from which the auxiliary variable \(\pmb {\epsilon }\) is sampled and has no impact on the gradient computations. In our implementation (https://bitbucket.org/dmmlgeneva/srff_pytorch), we leverage the automatic differentiation functionality of pytorch in order to obtain the gradient values directly from the objective formulation.
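To make the whole procedure concrete, below is a simplified sketch of the SRFF alternation in pytorch (our illustration, not the released implementation: plain projected gradient steps on \(\pmb {\gamma }\) stand in for the FISTA line search, the Gaussian kernel is assumed, and `n_outer`, `n_inner` and `lr` are illustrative knobs). Autograd supplies the gradient (11) directly from the objective (10), and a standard sorting-based Euclidean projection keeps \(\pmb {\gamma }\) on the simplex \(\varDelta \).

```python
import math
import torch

def project_simplex(v):
    """Euclidean projection of a vector v onto the probability simplex (sorting-based algorithm)."""
    u, _ = torch.sort(v, descending=True)
    css = torch.cumsum(u, dim=0)
    k = torch.arange(1, v.numel() + 1, dtype=v.dtype)
    rho = int((u - (css - 1.0) / k > 0).nonzero().max())   # last index satisfying the condition
    tau = (css[rho] - 1.0) / (rho + 1)
    return torch.clamp(v - tau, min=0.0)

def srff_fit(X, y, D=300, lam=1e-2, n_outer=20, n_inner=50, lr=1e-2, seed=0):
    """Simplified SRFF: alternate solving for a (Eq. (5)) and projected gradient on gamma (Eq. (10)).

    X: (n, d) float tensor, y: (n,) float tensor.
    """
    torch.manual_seed(seed)
    n, d = X.shape
    eps = torch.randn(D, d, dtype=X.dtype)             # auxiliary frequencies, mu_1 = standard Gaussian
    b = 2.0 * math.pi * torch.rand(D, dtype=X.dtype)   # phases, uniform on [0, 2*pi]
    gamma = torch.full((d,), 1.0 / d, dtype=X.dtype)   # start at the centre of the simplex

    def features(g):
        # Eq. (9): sqrt(2/D) * cos(eps^T (gamma * x) + b), one row per training point
        return math.sqrt(2.0 / D) * torch.cos((X * g) @ eps.T + b)

    a = torch.zeros(D, dtype=X.dtype)
    for _ in range(n_outer):
        # Step 1: fix gamma, solve the D x D ridge system (Z^T Z + lam I) a = Z^T y.
        Z = features(gamma)
        a = torch.linalg.solve(Z.T @ Z + lam * torch.eye(D, dtype=X.dtype), Z.T @ y)
        # Step 2: fix a, projected gradient descent on gamma over the simplex.
        for _ in range(n_inner):
            g = gamma.clone().requires_grad_(True)
            loss = torch.sum((y - features(g) @ a) ** 2)   # J(gamma) of Eq. (10)
            loss.backward()                                # autograd yields the gradient (11)
            with torch.no_grad():
                gamma = project_simplex(g - lr * g.grad)
    return gamma, a, eps, b
```

The learned \(\pmb {\gamma }\) is then read off as the sparsity pattern, and predictions for new points are obtained as \(\mathbf {z}(\tilde{\mathbf {x}})^T \mathbf {a}\) using the same fixed \(\pmb {\epsilon }\), b and the learned \(\pmb {\gamma }\).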

5 Empirical Evaluation

We implemented our algorithm in pytorch and made it executable on either CPUs or GPUs. All of our experiments were conducted on GPUs (a single P100). The code, including the settings of our experiments amenable to replication, is publicly available at https://bitbucket.org/dmmlgeneva/srff_pytorch.

In our empirical evaluation we compare to multiple baseline methods. We included the non-sparse random Fourier features method (RFF) of [24] as an option in our main SRFF code. For the naive mean prediction and for ridge regression we use our own matlab implementations. For the linear lasso we use the matlab PASPAL package [22]. For the nonlinear Sparse Additive Model (SPAM) [26] we use the R implementation of [35]. For the Hilbert-Schmidt Independence Criterion lasso method (HSIC) [33] and the derivative-based embedded method of [28] (Denovas) we use the authors' matlab implementations.

Except for SPAM, all of the baseline sparse learning methods use a two-step procedure to arrive at the final model. They first learn the sparsity using either predictive-model-dependent criteria (lasso, Denovas) or in a completely disconnected fashion (HSIC). In the second step (sometimes referred to as de-biasing [28]), they use a base non-sparse learning method (ridge, or kernel ridge) to learn a model over the selected variables (including hyper-parameter search and cross-validation). For HSIC, which is a filter method that does not natively predict the regression outputs, we use our implementation of RFF as the second step. It searches through the candidate sparsity patterns HSIC produces and uses the validation mean squared error as the criterion for the final model selection. In contrast to these, our SRFF method is a single-step procedure that does not necessitate this extra re-learning phase.

Experimental Protocol. In all our experiments we use the same protocol. We randomly split the data into three independent subsets: train, validation and test. We use the train subset for training the models, we use the validation subset to perform the hyper-parameter search, and we use the test set to evaluate the predictive performance. We repeat all the experiments 30 times, each with a different random train/validation/test split.

We measure the predictive performance in terms of the root mean squared error (RMSE) over the test samples, averaged over the 30 random replications of the experiments. The regularization hyper-parameter \(\lambda \) (present in ridge, lasso, Denovas, HSIC, RFF and SRFF) is searched over a data-dependent grid of 50 values (established automatically by the methods). The smoothing parameter in Denovas is fixed to 10 following the authors' default [28]. We use the Gaussian kernel for all the experiments, with the width \(\sigma \) set as the median of the Euclidean distances amongst the 20 nearest neighbour instances. We use the same kernel in all the kernel methods and the corresponding scale parameter \(\gamma = 1/\sigma \) in the random feature methods for comparability of results. We fix the number of random features to \(D = 300\) for all the experiments in both RFF and SRFF.
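For reference, one way to implement the kernel-width heuristic described above (our reading of it, sketched in numpy): collect, for every training point, the Euclidean distances to its 20 nearest neighbours and take the median of all of them.

```python
import numpy as np

def median_knn_width(X, k=20):
    """Median of the Euclidean distances to the k nearest neighbours, taken over all points."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2.0 * X @ X.T
    dists = np.sqrt(np.maximum(sq, 0.0))
    np.fill_diagonal(dists, np.inf)                 # exclude the zero self-distances
    knn = np.sort(dists, axis=1)[:, :k]             # k smallest distances per point
    return float(np.median(knn))

sigma = median_knn_width(np.random.randn(500, 18), k=20)   # used as the Gaussian kernel width
```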

We provide the results of the baseline nonlinear sparse methods (SPAM, HSIC, Denovas) only for the smallest experiments. As explained in the previous sections, the motivation for our paper is to address the poor scaling properties of the existing methods. Indeed, none of the baseline kernel sparse methods scales to the problems we consider here. HSIC [33] creates an \((n \times n)\) kernel matrix for each of the d dimensions and solves a linear lasso problem over the concatenated vectorization of these, with memory requirements \((n^2 \times d)\) and complexity \(\mathcal {O}(n^4)\). In our tests, it did not finish within 24 h running on 20 CPUs (Dual Core Intel Xeon E5-2680 v2/2.8 GHz) for the smallest training size of 1000 instances in our SE3 experiment. Within the same time span it did not arrive at a solution for any of the experiments with \(n > 1000\). Denovas constructs, stores in memory, and operates over the \((n \times n)\) kernel matrix and the \((nd \times n)\) and \((nd \times nd)\) matrices of the first and second order derivatives. In our tests the method finished with an out-of-memory error (with 32 GB RAM) for SE1 with 5k training samples and for the SE2 problem with as few as 1k training instances. SPAM finished with errors for most of the real-data experiments.

5.1 Synthetic Experiments

We begin our empirical evaluation by exploring the performance over a set of synthetic experiments. The purpose of these is to validate our method under controlled conditions when we understand the true sparsity of the generating model. We also experiment with various nonlinear functions and increasing data sizes in terms of both the sample numbers n and the dimensionality d.

Table 1. Summary of synthetic experiments

SE1: The very first of our experiments is a rather small problem with only \(d=18\) input dimensions, of which 5 are relevant for the regression function. In Table 2 we compare our SRFF method to the baselines for the smallest sample setting with \(n = 1000\). Most of the methods (linear, additive or non-sparse) do not succeed in learning a model of the complex nonlinear relationships between the inputs and outputs and fall back to predicting the simple mean.

The general nonlinear models with sparsity (HSIC, Denovas and SRFF) depart from the simple mean prediction. They all discover and use in the predictive model the same sparse pattern (see Fig. 1 for SRFF). Denovas and SRFF achieve almost identical results, which confirms that our method is competitive with the state-of-the-art methods in terms of predictive accuracy and variable selection.

Table 2. SE1 - Test RMSE for \(n = 1000\)

In Table 3 we document how the increasing training size contributes to improving the predictive performance even with several thousands of instances. The performance of the SRFF method for the largest 50k sample is about \(6\%\) better than for the 1k training size. For the other methods the problem remains out of their reach and they stick to the mean prediction even for the larger training sizes. We do not provide any comparisons with the nonlinear sparse methods because, as explained above, they do not scale to the sample sizes we consider here.

Table 3. SE1 - Test RMSE for increasing train size n

The improved predictive performance for the larger training sizes goes hand in hand with the variable selection (Fig. 1). For the smallest 1k training sample, SRFF identifies only the 7th, 8th and 9th relevant dimensions. These enter the sine in the generative function as a product and therefore have a larger combined effect on the function outcome than the squared sum of dimensions 1 and 3. These two dimensions are picked up by the method from the larger training sets, and this contributes to the increase in the predictive performance.

Fig. 1. Learned sparsity pattern \(\pmb {\gamma }\) by the SRFF method for the 1k and 50k training sizes in the SE1 experiment (the median of the 30 replications). The other nonlinear sparse methods learn the same pattern for the 1k problem but cannot solve the 50k problem.

SE2: In the second experiment we increase the dimensionality to \(d=100\) and change the nonlinear function (see Table 1). The overall outcomes are rather similar to the SE1 experiment. Again, only the nonlinear sparse models predict something other than the mean, with SPAM marginally better and HSIC marginally worse. Our SRFF method clearly outperforms all the other methods in predictive accuracy. It also correctly discovers the 5 relevant variables, with the median value of \(\gamma \) for these dimensions between 0.92 and 1.04, while the maximum for all the irrelevant variables is 0.06. The advantage of SRFF over the baselines for large sample sizes (Tables 4 and 5) is even more striking than in the SE1 experiment (Tables 2 and 3).

Table 4. SE2 - Test RMSE for \(n = 1000\)
Table 5. SE2 - Test RMSE for increasing train size n

SE3: In this final synthetic experiment we increase the dimensionality to \(d=1000\) to further stretch our SRFF method. There are only 10 relevant input variables in this problem. The first 5 were generated as random perturbations of the random variable \(z_1\), e.g. \(x_1 = z_1 + N(0,0.1)\), the second 5 by the same procedure from \(z_2\), e.g. \(x_5 = z_2 + N(0,0.1)\). The remaining 990 input variables were generated by the same process from the other 198 standard normal z’s.

We summarise the results for the 1k and 50k training instances in Table 6. As in the other synthetic experiments, the baseline methods are not able to capture the nonlinear relationships of this extremely sparse problem and instead predict the simple mean. Our SRFF method achieves significantly better accuracy for the 1k training set, and it improves considerably further with 50k samples to train on. These predictive gains are possible due to SRFF correctly discovering the set of relevant variables. In the 1k case, the medians across the 30 data resamples of the learned \(\pmb {\gamma }\) parameters are between 0.37 and 0.71 for the 10 relevant variables and at most 0.05 for the remaining 990 irrelevant variables. In the 50k case, the differences are demarcated even more clearly: 1.19 to 1.64 for the relevant, and at most 0.03 for the irrelevant (bearing in mind that the total sum over the vector \(\pmb {\gamma }\) is the same in both cases).

Table 6. SE3 - Test RMSE for increasing train size n

5.2 Real Data Experiments

We experiment on four real datasets (Table 7): three from the LIACC regression repository and one Kaggle dataset. The test RMSE results are presented in Table 8. The RFF results illustrate the advantage nonlinear modelling has over simple linear models. Our sparse nonlinear SRFF method clearly outperforms all the linear methods as well as the non-sparse nonlinear RFF method. Moreover, it is the only nonlinear sparse learning method that can handle datasets of this scale.

Table 7. Summary of real-data experiments
Table 8. Real-data experiments - Test RMSE for increasing train size n

6 Summary and Conclusions

We present here a new kernel-based method for learning a nonlinear regression function with relevant variable subset selection. The method is unique amongst the state of the art in that it can scale to tens of thousands of training instances, far beyond what any of the existing kernel-based methods can handle. For example, while none of the tested sparse methods worked over datasets with more than 1k instances, the CPU version of our SRFF finished the full validation search over the 50 values of the hyper-parameter \(\lambda \) in the 50k SE1 experiment within two hours on a laptop with a Dual Intel Core i3 (2nd Gen) 2350M/2.3 GHz and 16 GB RAM.

We focus here on nonlinear regression, but the extension to classification problems is straightforward by appropriately replacing the objective loss function. We used the Gaussian kernel for our experiments as one of the most popular kernels in practice. But the principles hold for other shift-invariant kernels as well, and the method and the algorithm can be applied to them directly once the corresponding probability measure \(\mu (\pmb {\omega })\) is recovered and the reparametrization of \(\pmb {\omega }\) explained in Sect. 3.1 is applied.