Machine Learning, Volume 87, Issue 3, pp 303–355

Learning sparse gradients for variable selection and dimension reduction

  • Gui-Bo Ye
  • Xiaohui Xie


Variable selection and dimension reduction are two commonly adopted approaches for high-dimensional data analysis, but have traditionally been treated separately. Here we propose an integrated approach, called sparse gradient learning (SGL), for variable selection and dimension reduction via learning the gradients of the prediction function directly from samples. By imposing a sparsity constraint on the gradients, variable selection is achieved by selecting variables corresponding to non-zero partial derivatives, and effective dimensions are extracted based on the eigenvectors of the derived sparse empirical gradient covariance matrix. An error analysis is given for the convergence of the estimated gradients to the true ones in both the Euclidean and the manifold setting. We also develop an efficient forward-backward splitting algorithm to solve the SGL problem, making the framework practically scalable for medium or large datasets. The utility of SGL for variable selection and feature extraction is explicitly given and illustrated on artificial data as well as real-world examples. The main advantages of our method include variable selection for both linear and nonlinear predictions, effective dimension reduction with sparse loadings, and an efficient algorithm for large p, small n problems.


Keywords: Gradient learning · Variable selection · Effective dimension reduction · Forward-backward splitting

1 Introduction

Datasets with many variables have become increasingly common in biological and physical sciences. In biology, it is nowadays a common practice to measure the expression values of tens of thousands of genes, genotypes of millions of SNPs, or epigenetic modifications at tens of millions of DNA sites in one single experiment. Variable selection and dimension reduction are increasingly viewed as a necessary step in dealing with these high-dimensional data.

Variable selection aims at selecting a subset of variables most relevant for predicting responses. Many algorithms have been proposed for variable selection (Guyon and Elisseeff 2003). They typically fall into two categories: Feature Ranking and Subset Selection. Feature Ranking scores each variable according to a metric, derived from various correlation or information theoretic criteria (Guyon and Elisseeff 2003; Weston et al. 2003; Dhillon et al. 2003), and eliminates variables below a threshold score. Because Feature Ranking methods select variables based on individual prediction power, they are ineffective in selecting a subset of variables that are marginally weak but in combination strong in prediction. Subset Selection aims to overcome this drawback by considering and evaluating the prediction power of a subset of variables as a group. One popular approach to subset selection is based on direct objective optimization, which formalizes an objective function of variable selection and selects variables by solving an optimization problem. The objective function often consists of two terms: a data fitting term accounting for prediction accuracy, and a regularization term controlling the number of selected variables. LASSO proposed by Tibshirani (1996) and elastic net by Zou and Hastie (2005) are two examples of this type of approach. The two methods are widely used because of their implementation efficiency (Efron et al. 2004; Zou and Hastie 2005) and their ability to perform variable selection and prediction simultaneously; however, a linear prediction model is assumed by both methods. The component selection and smoothing operator method (COSSO) proposed in Lin and Zhang (2006) tries to overcome this shortcoming by using a functional LASSO penalty. However, COSSO is based on the framework of smoothing spline ANOVA, which has difficulty handling high-dimensional data.

Dimension reduction is another commonly adopted approach in dealing with high-dimensional data. At the root of dimension reduction is the common belief that many real-world high-dimensional data are concentrated on a low-dimensional manifold embedded in the underlying Euclidean space. Therefore, mapping the high-dimensional data onto the underlying low-dimensional manifold should improve prediction accuracy, help visualize the data, and help construct better statistical models. A number of dimension reduction methods have been proposed, ranging from principal component analysis to manifold learning for non-linear settings (Belkin and Niyogi 2003; Zou et al. 2006; Mackey 2009; Roweis and Saul 2000; Tenenbaum et al. 2000; Donoho and Grimes 2003). However, most of these dimension reduction methods are unsupervised, and are therefore likely suboptimal with respect to predicting responses. In supervised settings, most recent work focuses on finding a subspace \(\mathcal{S}\) such that the projection of the high-dimensional data x onto \(\mathcal{S}\) captures the statistical dependency of the response y on x. The space \(\mathcal{S}\) is called the effective dimension reduction (EDR) space (Xia et al. 2002).

Several methods have been proposed to identify the EDR space. The research goes back to sliced inverse regression (SIR) proposed by Li (1991), where the covariance matrix of the inverse regression is explored for dimension reduction. The main idea is that if the conditional distribution ρ(y|x) concentrates on a subspace \(\mathcal{S}\), then the inverse regression E(x|y) should lie in that same subspace. However, SIR imposes specific modeling assumptions on the conditional distribution ρ(y|x) or the regression E(y|x). These assumptions hold in particular if the distribution of x is elliptic. In practice, however, we do not necessarily expect that x will follow an elliptic distribution, nor is it easy to assess departures from ellipticity in a high-dimensional setting. A further limitation of SIR is that it yields only a one-dimensional subspace for binary classifications. Other reverse regression based methods, including principal Hessian directions (pHd, Li 1992), sliced average variance estimation (SAVE, Cook and Yin 2001) and contour regression (Li et al. 2005), have been proposed, but they have similar limitations. To address these limitations, Xia et al. (2002) proposed a method called the (conditional) minimum average variance estimation (MAVE) to estimate the EDR directions. The assumption underlying MAVE is quite weak and only a semiparametric model is used. Under the semiparametric model, the conditional covariance is estimated by linear smoothing and the EDR directions are then estimated by minimizing the derived conditional covariance estimate. In addition, a simple outer product gradient (OPG) estimator is proposed as an initial estimator. Other related approaches include methods that estimate the derivative of the regression function (Hristache et al. 2001; Samarov 1993). Recently, Fukumizu et al. (2009) proposed a new methodology which derives EDR directly from a formulation of EDR in terms of the conditional independence of x from the response y, given the projection of x on the EDR space. The resulting estimator is shown to be consistent under weak conditions. However, none of these EDR methods can be directly applied to the large p, small n case, where p is the dimension of the underlying Euclidean space in which the data lie, and n is the number of samples. To deal with the large p, small n case, Mukherjee and co-workers (2006) introduced a gradient learning method (which will be referred to as GL) for estimating EDR by introducing a Tikhonov regularization term on the gradient functions. The EDR directions were estimated using the eigenvectors of the empirical gradient covariance matrix.

Although both variable selection and dimension reduction offer valuable tools for statistical inference in high-dimensional space and have been prominently researched, few methods are available for combining them into a single framework where both variable selection and dimension reduction can be done. One notable exception is sparse principal component analysis (SPCA), which produces modified principal components with sparse loadings (Zou et al. 2006). However, SPCA is mainly used for unsupervised linear dimension reduction, whereas our focus here is variable selection and dimension reduction in supervised and potentially nonlinear settings. To motivate why a combined approach might be interesting in a supervised setting, consider microarray gene expression data measured in both normal and tumor samples. Out of 20,000 genes measured in a microarray, only a small number of genes (e.g. oncogenes) are likely responsible for gene expression changes in tumor cells. Variable selection chooses the more relevant genes, and dimension reduction further extracts features based on the subset of selected genes. Taking a combined approach could potentially improve prediction accuracy by removing irrelevant noisy variables. Additionally, by focusing on a small number of the most relevant genes and extracting features among them, it could also provide a more interpretable and manageable model regarding the genes and biological pathways involved in carcinogenesis.

In this article, we extend the gradient learning framework introduced by Mukherjee and co-workers (2006), and propose a sparse gradient learning approach (SGL) for integrated variable selection and dimension reduction in a supervised setting. The method adopts a direct objective optimization approach to learn the gradient of the underlying prediction function with respect to variables, and imposes a regularization term to control the sparsity of the gradient. The gradient of the prediction function provides a natural interpretation of the geometric structure of the data (Guyon et al. 2002; Mukherjee and Zhou 2006; Mukherjee and Wu 2006; Mukherjee et al. 2010). If a variable is irrelevant to the prediction function, the partial derivative with respect to that variable is zero. Moreover, for non-zero partial derivatives, the larger the norm of the partial derivative with respect to a variable is, the more important the corresponding variable is likely to be for prediction. Thus the norms of partial derivatives give us a criterion for the importance of each variable and can be used for variable selection. Motivated by LASSO, we encourage the sparsity of the gradient by adding an \(\ell^{1}\) norm based regularization term to the objective function. Variable selection is automatically achieved by selecting variables with non-zero partial derivatives. The sparse empirical gradient covariance matrix (S-EGCM) constructed based on the learned sparse gradient reflects the variance of the data conditioned on the response variable. The eigenvectors of S-EGCM are then used to construct the EDR directions. A major innovation of our approach is that the variable selection and dimension reduction are achieved within a single framework. The features constructed by the eigenvectors of S-EGCM are sparse with non-zero entries corresponding only to selected variables.

The rest of this paper is organized as follows. In Sect. 2, we describe the sparse gradient learning algorithm for regression, where an automatic variable selection scheme is introduced. The derived sparse gradient is an approximation of the true gradient of the regression function under certain conditions, which we give in Sect. 2.3; the proofs are deferred to Sect. 3. We describe variable selection and feature construction using the learned sparse gradients in Sect. 2.4. As our proposed algorithm is an infinite dimensional minimization problem, it cannot be solved directly. We provide an efficient implementation for solving it in Sect. 4. In Sect. 4.1, we give a representer theorem, which reduces the infinite dimensional sparse gradient learning problem to a finite dimensional one. In Sect. 4.3, we solve the resulting finite dimensional minimization problem by a forward-backward splitting algorithm. In Sect. 5, we generalize the sparse gradient learning algorithm to a classification setting. We illustrate the effectiveness of our gradient-based variable selection and feature extraction approach in Sect. 6 using both simulated and real-world examples.

2 Sparse gradient learning for regression

2.1 Basic definitions

Let y and \(\mathbf{x}\) be respectively \(\mathbb{R}\)-valued and \(\mathbb{R}^{p}\)-valued random variables. The problem of regression is to estimate the regression function \(f_{\rho}(\mathbf{x})=\mathbb {E}(y|\mathbf{x})\) from a set of observations \(\mathcal{Z} :=\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{n}\), where \(\mathbf{x}_{i}:=(x_{i}^{1},\ldots,x_{i}^{p})^{T}\in\mathbb{R}^{p}\) is an input, and \(y_{i}\in\mathbb{R}\) is the corresponding output.

We assume the data are drawn i.i.d. from a joint distribution \(\rho(\mathbf{x},y)\), and the response variable y depends only on a few directions in \(\mathbb{R}^{p}\) as follows
$$ y=f_\rho(\mathbf{x})+\epsilon=g\bigl(b_1^T\mathbf{x},\ldots ,b_r^T\mathbf{x}\bigr)+\epsilon,$$
where ϵ is the noise, \(B=(b_{1},\ldots,b_{r})\) is a p×r orthogonal matrix with r<p, and \(\mathbb{E}(\epsilon|\mathbf{x})=0\) almost surely. We call the r dimensional subspace spanned by \(\{b_{i}\}_{i=1}^{r}\) the effective dimension reduction (EDR) space (Xia et al. 2002). For high-dimensional data, we further assume that B is a sparse matrix with many rows being zero vectors, i.e. the regression function only depends on a subset of variables in \(\mathbf{x}\).
Suppose the regression function \(f_{\rho}(\mathbf{x})\) is differentiable. The gradient of \(f_{\rho}\) with respect to variables is
$$\nabla f_\rho:= \biggl(\frac{\partial f_\rho}{\partial x^1},\ldots, \frac {\partial f_\rho}{\partial x^p}\biggr)^T.$$
A quantity of particular interest is the gradient outer product matrix \(G=(G_{ij})\), a p×p matrix with elements
$$ G_{ij} := \biggl \langle \frac{\partial f_\rho}{\partial x^i},\frac {\partial f_\rho}{\partial x^j}\biggr \rangle _{L^2_{\rho_X}},$$
where \(\rho_{X}\) is the marginal distribution of \(\mathbf{x}\). As pointed out by Li (1991) and Xia et al. (2002), under the assumption of the model in Eq. (1), the gradient outer product matrix G is at most of rank r, and the EDR space is spanned by the eigenvectors corresponding to the non-zero eigenvalues of G. This observation has motivated the development of gradient-based methods for inferring the EDR directions (Xia et al. 2002; Mukherjee and Zhou 2006; Mukherjee and Wu 2006), and also forms the basis of our approach.
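The link between the gradient outer product matrix and the EDR space can be checked numerically on a toy model. The sketch below is only an illustration under stated assumptions: the single-index function \(f(\mathbf{x})=\sin(b^{T}\mathbf{x})\), the direction b, the sample size, and the helper names `gradient_outer_product` and `edr_directions` are hypothetical choices and not part of the paper. The exact gradients are used, and the expectation over \(\rho_{X}\) is replaced by a sample average.

```python
import numpy as np

def gradient_outer_product(grads):
    """Empirical G: average of grad_i grad_i^T over an (n, p) array of gradients."""
    grads = np.asarray(grads, dtype=float)
    return grads.T @ grads / grads.shape[0]

def edr_directions(G, rel_tol=1e-10):
    """Eigenpairs of symmetric G whose eigenvalues exceed rel_tol * largest eigenvalue."""
    eigvals, eigvecs = np.linalg.eigh(G)          # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]             # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > rel_tol * eigvals[0]
    return eigvals[keep], eigvecs[:, keep]

# Toy single-index model (assumed for illustration): f(x) = sin(b^T x),
# so the exact gradient is cos(b^T x) * b and only x^1 and x^2 are relevant.
rng = np.random.default_rng(0)
n, p = 200, 5
b = np.array([0.6, 0.8, 0.0, 0.0, 0.0])
X = rng.normal(size=(n, p))
grads = np.cos(X @ b)[:, None] * b[None, :]

G = gradient_outer_product(grads)
vals, vecs = edr_directions(G)
print(vecs.shape)            # one retained direction for this rank-one G
print(np.abs(vecs[:, 0]))    # loadings vanish on the irrelevant variables
```

Because B has a single sparse column here, G is rank one and its leading eigenvector recovers b up to sign, with zero loadings on the three irrelevant coordinates.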

2.2 Regularization framework for sparse gradient learning

The optimization framework for sparse gradient learning includes a data fitting term and a regularization term. We first describe the data fitting term. Given a set of observations \(\mathcal{Z}\), a commonly used data fitting term for regression is the mean square error \(\frac{1}{n}\sum_{i=1}^{n} (y_{i} - f_{\rho}(\mathbf{x}_{i}))^{2}\). However, because our primary goal is to estimate the gradient of \(f_{\rho}\), we adopt the framework of local linear regression (Ruppert and Wand 1994), which involves using a kernel to weight the data points so that only the data points within the neighborhood of an evaluation point effectively contribute to function estimation. More specifically, we use the first order Taylor expansion to approximate \(f_{\rho}\) by \(f_{\rho}(\mathbf{x})\approx f_{\rho}(\mathbf{x}_{0})+\nabla f_{\rho}(\mathbf{x}_{0})\cdot(\mathbf{x}-\mathbf{x}_{0})\). When \(\mathbf{x}_{j}\) is close to \(\mathbf{x}_{i}\), \(f_{\rho}(\mathbf{x}_{j})\approx y_{i}+\nabla f_{\rho}(\mathbf{x}_{i})\cdot(\mathbf{x}_{j}-\mathbf{x}_{i})\). Define \(\mathbf{f}:=(f^{1},\ldots,f^{p})\), where \(f^{j}=\partial f_{\rho}/\partial x^{j}\) for j=1,…,p. The mean square error used in our algorithm is
$$ \mathcal{E}_\mathcal{Z}(\mathbf{f})=\frac{1}{n^2}\sum_{i,j=1}^n\omega_{i,j}^s \bigl(y_i-y_j+\mathbf{f}(\mathbf{x}_i)\cdot(\mathbf {x}_j-\mathbf{x}_i) \bigr)^2$$
considering Taylor expansion between all pairs of observations. Here \(\omega_{i,j}^{s}\) is a weight kernel function that ensures the locality of the approximation, i.e. \(\omega_{i,j}^{s}\to0\) when \(|\mathbf{x}_{i}-\mathbf{x}_{j}|\) is large. We can use, for example, the Gaussian with standard deviation s as a weight function. Let \(\omega^{s}(\mathbf{x})=\exp\{-\frac{|\mathbf{x}|^{2}}{2s^{2}}\}\). Then the weights are given by
$$ \omega_{i,j}^s=\omega^s(\mathbf{x}_j-\mathbf{x}_i)=\exp \biggl\{ -\frac{|\mathbf{x}_j-\mathbf{x}_i|^2}{2s^2} \biggr\},$$
for all i,j=1,…,n, with parameter s controlling the bandwidth of the weight function. In this paper, we treat s as a parameter that is fixed when implementing our algorithm, although it is possible to tune s using a greedy algorithm such as RODEO (Lafferty and Wasserman 2008).
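The weighted pairwise data fitting term can be evaluated directly once the candidate gradient is known at the sample points. The following is a minimal sketch, assuming the Gaussian weights defined above; the function names `gaussian_weights` and `empirical_error`, the array `F` holding \(\mathbf{f}(\mathbf{x}_{i})\) row by row, and the linear toy data are hypothetical choices for illustration.

```python
import numpy as np

def gaussian_weights(X, s):
    """W[i, j] = exp(-|x_j - x_i|^2 / (2 s^2)) for the rows x_i of X."""
    diff = X[None, :, :] - X[:, None, :]          # diff[i, j] = x_j - x_i
    return np.exp(-np.sum(diff ** 2, axis=2) / (2.0 * s ** 2))

def empirical_error(X, y, F, s):
    """(1/n^2) * sum_{i,j} W[i, j] * (y_i - y_j + F[i] . (x_j - x_i))^2."""
    n = X.shape[0]
    W = gaussian_weights(X, s)
    diff = X[None, :, :] - X[:, None, :]          # diff[i, j] = x_j - x_i
    lin = np.einsum("ik,ijk->ij", F, diff)        # F[i] . (x_j - x_i)
    resid = y[:, None] - y[None, :] + lin
    return np.sum(W * resid ** 2) / n ** 2

# Toy check (assumed data): for a noiseless linear response y = a . x the
# first order Taylor expansion is exact, so the true gradient gives zero error.
rng = np.random.default_rng(1)
n, p, s = 30, 4, 1.0
a = np.array([2.0, -1.0, 0.0, 0.0])
X = rng.normal(size=(n, p))
y = X @ a
F_true = np.tile(a, (n, 1))                       # f(x_i) = a at every sample

W = gaussian_weights(X, s)
err_true = empirical_error(X, y, F_true, s)
err_zero = empirical_error(X, y, np.zeros_like(F_true), s)
print(err_true, err_zero)
```

The comparison between `err_true` and `err_zero` shows how the term penalizes a gradient estimate that ignores the variation of y across nearby pairs.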
At first glance, this data fitting term might not appear very meaningful for high-dimensional data as samples are typically distributed sparsely in a high dimensional space. However, the term can also be explained in the manifold setting (Mukherjee et al. 2010), in which case the approximation is well defined as long as the data lying in the low dimensional manifold are relatively dense. More specifically, assume X is a d-dimensional connected compact \(C^{\infty}\) submanifold of \(\mathbb{R}^{p}\) which is isometrically embedded. Then X is a metric space with a metric \(d_{X}\) and the inclusion map \(\varPhi:(X,d_{X})\mapsto(\mathbb{R}^{p},\|\cdot\|_{2})\) is well defined and continuous (actually it is \(C^{\infty}\)). Note that the empirical data \(\{\mathbf{x}_{i}\}_{i=1}^{n}\) given in the Euclidean space \(\mathbb{R}^{p}\) are images of the points \(\{\mathbf{q}_{i}\}_{i=1}^{n}\subset X\) under \(\varPhi\): \(\mathbf{x}_{i}=\varPhi(\mathbf{q}_{i})\). Then the data fitting term (4) can be explained in the manifold setting. From the first order Taylor expansion, if \(\mathbf{q}_{i}\) and \(\mathbf{q}_{j}\) are close enough, we can expect that \(y_{j}\approx y_{i}+\langle\nabla_{X}f_{\rho}(\mathbf{q}_{i}),v_{ij}\rangle_{\mathbf{q}_{i}}\), where \(v_{ij}\in T_{\mathbf{q}_{i}}X\) is the tangent vector such that \(\mathbf{q}_{j}=\exp_{\mathbf{q}_{i}}(v_{ij})\). However, \(v_{ij}\) is not easy to compute, so we would like to represent the term \(\langle\nabla_{X}f_{\rho}(\mathbf{q}_{i}),v_{ij}\rangle_{\mathbf{q}_{i}}\) in the Euclidean space \(\mathbb{R}^{p}\). Suppose \(\mathbf{x}=\varPhi(\mathbf{q})\) and \(\xi=\varPhi(\exp_{\mathbf{q}}(v))\) for \(\mathbf{q}\in X\) and \(v\in T_{\mathbf{q}}X\). Since \(\varPhi\) is an isometric embedding, i.e. \(d\varPhi_{\mathbf{q}}:T_{\mathbf{q}}X\to T_{\mathbf{x}}\mathbb{R}^{p}\cong\mathbb{R}^{p}\) is an isometry for every \(\mathbf{q}\in X\), the following holds
$$\bigl\langle\nabla_Xf(\mathbf{q}),v\bigr\rangle_\mathbf{q}=\bigl\langle d\varPhi_\mathbf{q}\bigl(\nabla_Xf(\mathbf{q})\bigr),d\varPhi_\mathbf{q}(v)\bigr\rangle_{\mathbb{R}^p},$$
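where \(d\varPhi_{\mathbf{q}}(v)\approx\varPhi(\exp_{\mathbf{q}}(v))-\varPhi(\mathbf{q})=\xi-\mathbf{x}\) for \(v\approx0\). Applying these relations to the observations \(\mathcal{Z}=\{(\mathbf {x}_{i},y_{i})\}_{i=1}^{n}\) and denoting \(\mathbf{f}=d\varPhi(\nabla_{X}f)\) yields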
$$ \mathcal{E}_\mathcal{Z}(\mathbf{f})=\frac{1}{n^2}\sum_{i,j=1}^n\omega_{i,j}^s \bigl(y_i-y_j+\mathbf{f}(\mathbf{x}_i)\cdot(\mathbf {x}_j-\mathbf{x}_i) \bigr)^2.$$
This is exactly the same as the one in the Euclidean setting.

Now we turn to the regularization term on \(\nabla f_{\rho}\). As discussed above, we impose a sparsity constraint on the gradient vector \(\mathbf{f}\). The motivation for the sparsity constraint is based on the following two considerations: (1) Since most variables are assumed to be irrelevant for prediction, we expect the partial derivatives of \(f_{\rho}\) with respect to these variables to be zero; and (2) If variable \(x^{j}\) is important for prediction, we expect the function \(f_{\rho}\) to show significant variation along \(x^{j}\), and as such the norm of \(\frac{\partial f_{\rho}}{\partial x^{j}}\) should be large. Thus we will impose the sparsity constraint on the vector \((\|\frac{\partial f_{\rho}}{\partial x^{1}}\|,\ldots,\|\frac{\partial f_{\rho}}{\partial x^{p}}\|)^{T}\in \mathbb{R}^{p}\), where ∥⋅∥ is a function norm, to regularize the number of non-zero entries in the vector.

In this work, we specify the function norm ∥⋅∥ to be \(\|\cdot\|_{\mathcal{K}}\), the norm in the reproducing kernel Hilbert space (RKHS) \(\mathbb{H}_{\mathcal{K}}\) associated with a Mercer kernel \(\mathcal{K}(\cdot,\cdot)\) (see Aronszajn 1950 and Sect. 4.1). The sparsity constraint on the gradient norm vector implies that the \(\ell^{0}\) norm of the vector \((\|f^{1}\|_{\mathcal{K}},\ldots, \|f^{p}\|_{\mathcal{K}})^{T}\) should be small. However, because the \(\ell^{0}\) norm is difficult to work with during optimization, we instead use the \(\ell^{1}\) norm of the vector (Donoho 1995, 2006; Daubechies et al. 2004; Micchelli and Pontil 2007) as our regularization term
$$ \varOmega(\mathbf{f}):=\lambda\sum _{j=1}^p\|f^j\|_{\mathcal{K}},$$
where λ is a sparsity regularization parameter. This functional LASSO penalty has been used in Lin and Zhang (2006) as the COSSO penalty. However, the components we penalize here are quite different from theirs, which makes our algorithm useful for high dimensional problems.
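To make the penalty concrete, the sketch below evaluates it for functions written as finite kernel expansions. This is an illustration under assumptions that go beyond this section: each \(f^{j}\) is taken to be of the form \(\sum_{i}c_{ij}\mathcal{K}(\cdot,\mathbf{x}_{i})\), for which the standard RKHS identity gives \(\|f^{j}\|_{\mathcal{K}}^{2}=c_{j}^{T}Kc_{j}\) with K the kernel matrix on the samples, and a Gaussian Mercer kernel is chosen as one possible kernel. The names `gaussian_kernel_matrix`, `rkhs_norms` and `sparsity_penalty` are hypothetical.

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma):
    """K[i, k] = exp(-|x_i - x_k|^2 / (2 sigma^2)); one possible Mercer kernel."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))

def rkhs_norms(C, K):
    """Norms ||f^j||_K for f^j = sum_i C[i, j] K(., x_i), via sqrt(c_j^T K c_j)."""
    quad = np.einsum("ij,ik,kj->j", C, K, C)
    return np.sqrt(np.maximum(quad, 0.0))         # clip tiny negative round-off

def sparsity_penalty(C, K, lam):
    """lam * sum_j ||f^j||_K, the l1 norm of the vector of RKHS norms."""
    return lam * np.sum(rkhs_norms(C, K))

# Toy coefficients (assumed): only the second component function is non-zero.
rng = np.random.default_rng(2)
n, p = 4, 3
X = rng.normal(size=(n, p))
K = gaussian_kernel_matrix(X, sigma=1.0)
C = np.zeros((n, p))
C[0, 1] = 2.0                                     # f^2 = 2 K(., x_1)

norms = rkhs_norms(C, K)
penalty = sparsity_penalty(C, K, lam=0.5)
selected = np.flatnonzero(norms > 1e-12)          # variables with non-zero norm
print(norms, penalty, selected)
```

A component function that is identically zero contributes nothing to the penalty, which is why variables can be read off from the non-zero entries of the norm vector.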

The norm \(\|\cdot\|_{\mathcal{K}}\) is widely used in statistical inference and machine learning (see Vapnik 1998). It can ensure each approximated partial derivative \(f^{j}\in\mathbb{H}_{\mathcal{K}}\), which in turn imposes some regularity on each partial derivative. It is possible to replace the hypothesis space \(\mathbb{H}_{\mathcal{K}}^{p}\) for the vector f in (7) by some other space of vector-valued functions (Micchelli and Pontil 2005) in order to learn the gradients.

Combining the data fidelity term (4) and the regularization term (7), we propose the following optimization framework, which will be referred to as sparse gradient learning, to learn \(\nabla f_{\rho}\)
$$ \mathbf{f}_{\mathcal{Z}}:=\arg \min_{\mathbf{f}\in{\mathbb{H}}_{\mathcal{K}}^p}\ \frac{1}{n^2}\sum_{i,j=1}^n\omega_{i,j}^s \bigl(y_i-y_j+\mathbf{f}(\mathbf{x}_i)\cdot(\mathbf {x}_j-\mathbf{x}_i) \bigr)^2 + \lambda\sum _{j=1}^p\|f^j\|_{\mathcal{K}}.$$

A key difference between our framework and the one in Mukherjee and Zhou (2006) is that our regularization is based on the \(\ell^{1}\) norm, while the one in Mukherjee and Zhou (2006) is based on ridge regularization. The difference may appear minor, but makes a significant impact on the estimated \(\nabla f_{\rho}\). In particular, \(\nabla f_{\rho}\) derived from Eq. (8) is sparse with many components potentially being zero functions, in contrast to the one derived from Mukherjee and Zhou (2006), which is comprised of all non-zero functions. The sparsity property is desirable for two primary reasons: (1) In most high-dimensional real-world data, the response variable is known to depend only on a subset of the variables. Imposing sparsity constraints can help eliminate noisy variables and thus improve the accuracy for inferring the EDR directions; (2) The resulting gradient vector provides a way to automatically select and rank relevant variables.

Remark 1

The OPG method introduced by Xia et al. (2002) to learn EDR directions can be viewed as a special case of the sparse gradient learning, corresponding to the case of setting \(\mathcal{K}(\mathbf{x},\mathbf{y})=\delta_{\mathbf{x},\mathbf{y}}\) and λ=0 in Eq. (8). Thus the sparse gradient learning can be viewed as an extension of learning gradient vectors only at observed points by OPG to a vector function of gradient over the entire space. Note that OPG cannot be directly applied to the data with p>n since the problem is then underdetermined. Imposing a regularization term as in Eq. (8) removes such a limitation.

Remark 2

The sparse gradient learning reduces to a special case that is approximately LASSO (Tibshirani 1996) if we choose \(\mathcal{K}(\mathbf{x},\mathbf{y})=\delta_{\mathbf{x},\mathbf{y}}\) and additionally require \(\mathbf{f}(\mathbf{x}_{i})\) to be invariant for different i (i.e. linearity assumption). Note that LASSO assumes the regression function is linear, which can be problematic for variable selection when the prediction function is nonlinear (Efron et al. 2004). The sparse gradient learning makes no linearity assumption, and can thus be viewed as an extension of LASSO for variable selection with nonlinear prediction functions.

Remark 3

A related framework is to learn the regression function directly, but impose constraints on the sparsity of the gradient as follows
$$\min_{f\in \mathbb{H}_\mathcal{K}}\frac{1}{n}\sum_{i=1}^n\bigl(f(\mathbf {x}_i)-y_i\bigr)^2+\lambda\sum _{i=1}^p \bigg\|\frac{\partial f}{\partial x^i}\bigg\|_\mathcal{K}.$$
This framework is however difficult to solve because the regularization term \(\sum_{i=1}^{p} \|\frac{\partial f}{\partial x^{i}}\|_{\mathcal{K}}\) is both nonsmooth and inseparable, and the representer theorem introduced later to solve Eq. (8) cannot be applied here. Note that our primary goal is to select variables and identify the EDR directions. Thus we focus on learning gradient functions rather than the regression function itself.

Remark 4

We can also use the regularization term \((\sum_{j=1}^{p}\|f^{j}\|_{\mathcal{K}} )^{2}\), which is more widely used in literature (Bach 2008). Our framework is equivalent to, with a different choice of λ,
$$ \mathbf{f}_{\mathcal{Z}}:=\arg \min_{\mathbf{f}\in{\mathbb{H}}_{\mathcal{K}}^p}\ \frac{1}{n(n-1)}\sum_{i,j=1}^n\omega_{i,j}^s \bigl(y_i-y_j+\mathbf{f}(\mathbf{x}_i)\cdot(\mathbf {x}_j-\mathbf{x}_i) \bigr)^2 + \lambda \Biggl(\sum _{j=1}^p\|f^j\|_{\mathcal{K}}\Biggr)^2.$$

2.3 Error analysis

Next we investigate the statistical performance of the sparse gradient learning with a Gaussian weight in Eq. (5). Throughout the error analysis, we only consider the case where p is fixed and does not change with n. Assume that the data \(\mathcal{Z}=\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{n}\) are drawn i.i.d. from a joint distribution ρ, which can be decomposed into a marginal distribution \(\rho_{X}\) and a conditional distribution \(\rho(y|\mathbf{x})\). Denote by \(f_{\rho}\) the regression function given by
$$f_\rho(\mathbf{x})=\mathbb{E}(y|\mathbf{x})=\int_{Y}y\,d\rho(y|\mathbf{x}).$$

We show that under certain conditions, \(\mathbf{f}_{\mathcal{Z}}\rightarrow\nabla f_{\rho}\) as n→∞ for suitable choices of the parameters λ and s that go to zero as n→∞. In order to derive the learning rate for the algorithm, some regularity conditions on both the marginal distribution and \(\nabla f_{\rho}\) are required.

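Denote by ∂X the boundary of X and by \(d(\mathbf{x},\partial X)\) (\(\mathbf{x}\in X\)) the shortest Euclidean distance from \(\mathbf{x}\) to ∂X, i.e., \(d(\mathbf{x},\partial X)=\inf_{\mathbf{y}\in\partial X}d(\mathbf{x},\mathbf{y})\). Denote \(\kappa=\sup_{\mathbf{x}\in X}\sqrt{\mathcal{K}(\mathbf{x},\mathbf{x})}\).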

Theorem 1

Suppose the data \(\mathcal{Z}=\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{n}\) are i.i.d. drawn from a joint distribution ρ and \(|y_{i}|\leq M\) for all i for a positive constant M. Assume that for some constants \(c_{\rho}>0\) and 0<θ≤1, the marginal distribution \(\rho_{X}\) satisfies
$$ \rho_X\bigl(\bigl\{\mathbf{x}\in X:d(\mathbf{x},\partial X)<t\bigr\}\bigr)\leq c_\rho t$$
and the density \(p(\mathbf{x})\) of \(\rho_{X}\) satisfies
$$ \sup_{\mathbf{x}\in X}p(\mathbf{x})\leq c_\rho \quad \mbox{\textit{and}}\quad |p(\mathbf{x})-p(\mathbf{u})|\leq c_\rho |\mathbf{x}-\mathbf{u}|^\theta,\quad\forall\mathbf{u},\mathbf{x}\in X.$$
Let \(\mathbf{f}_{\mathcal{Z}}\) be the estimated gradient function given by Eq. (8) and \(\nabla f_{\rho}\) be the true gradient of the regression function \(f_{\rho}\). Suppose that \(\mathcal {K}\in C^{2}\) and \(\nabla f_{\rho}\in\mathbb{H}_{\mathcal{K}}^{p}\). Choose \(\lambda=\lambda(n)=n^{-\frac{\theta}{p+2+2\theta}}\) and \(s=s(n)=n^{-\frac{1}{2(p+2+2\theta)}}\). Then there exists a constant C>0 such that for any 0<η≤1 with confidence 1-η
$$ \|\mathbf{f}_\mathcal{Z}-\nabla f_\rho\|_{L^2_{\rho_X}}\leq C \log\frac{4}{\eta} \biggl(\frac{1}{n} \biggr)^{\frac{\theta }{4(p+2+2\theta)}}.$$

Condition (12) means the density of the marginal distribution is Hölder continuous with exponent θ. Condition (14) specifies the behavior of ρ X near the boundary ∂X of X. Both are common assumptions for error analysis. When the boundary ∂X is piecewise smooth, Eq. (12) implies Eq. (14). Here we want to emphasize that our terminology sparse gradient for the derived \(\mathbf{f}_{\mathcal{Z}}\) comes from this approximation property. Since we treat each component of the gradient separately in our estimation algorithm, \(\mathbf{f}_{\mathcal{Z}}\) does not necessarily satisfy the gradient constraint \(\frac{\partial^{2}f}{\partial x^{i}\partial x^{j}}=\frac{\partial^{2} f}{\partial x^{j}\partial x^{i}}\) for all i and j. However, we note that it is possible to add these constraints explicitly into the convex optimization framework that we will describe later.

The convergence rate in Eq. (13) can be greatly improved if we assume that the data are lying in or near a low dimensional manifold (Ye and Zhou 2008; Mukherjee et al. 2010; Bickel and Li 2007). In this case, the learning rate in the exponent of 1/n depends only on the dimension of the manifold, not the actual dimension of the Euclidean space. The improved convergence rate for local linear regression under manifold assumption appeared in Bickel and Li (2007). Here we would like to emphasize that our result is different from theirs in two points of view. First, our algorithm is different from the one discussed in Bickel and Li (2007). Second, we focus on the case where the distribution of the predictor variables is concentrated on a manifold and our criterion of performance is the integral of pointwise mean error with respect to the underlying distribution of the variables; by contrast, the discussion in Bickel and Li (2007) is more restrictive by applying only to predictors taking values in a low dimensional manifold and discussing estimation of the regression function at a point.

Denote by \(d_{X}\) the metric on X and by dV the Riemannian volume measure of X. Let ∂X be the boundary of X and \(d_{X}(\mathbf{x},\partial X)\) (\(\mathbf{x}\in X\)) be the shortest distance from \(\mathbf{x}\) to ∂X on the manifold X. Note that the inclusion map \(\varPhi:(X,d_{X})\mapsto(\mathbb{R}^{p},\|\cdot\|_{2})\) is an isometric embedding and the empirical data \(\{\mathbf{x}_{i}\}_{i=1}^{n}\) are given in the Euclidean space \(\mathbb{R}^{p}\) as images of the points \(\{\mathbf{q}_{i}\}_{i=1}^{n}\subset X\) under \(\varPhi\): \(\mathbf{x}_{i}=\varPhi(\mathbf{q}_{i})\). Denote by \((d\varPhi)^{*}_{\mathbf{q}}\) the dual of \(d\varPhi_{\mathbf{q}}\); the map \((d\varPhi)^{*}\) sends a p-dimensional vector valued function \(\mathbf{f}\) to a vector field with \((d\varPhi)^{*}\mathbf{f}(\mathbf{q})=(d\varPhi)_{\mathbf{q}}^{*}(\mathbf {f}(\mathbf{q}))\) (Do Carmo and Flaherty 1992).

Theorem 2

Let X be a connected compact \(C^{\infty}\) submanifold of \(\mathbb{R}^{p}\) which is isometrically embedded and of dimension d. Suppose the data \(\mathcal{Z}=\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{n}\) are i.i.d. drawn from a joint distribution ρ defined on X×Y and there exists a positive constant M such that \(|y_{i}|\leq M\) for all i. Assume that for some constants \(c_{\rho}>0\) and 0<θ≤1, the marginal distribution \(\rho_{X}\) satisfies
$$ \rho_X\bigl(\bigl\{\mathbf{x}\in X:d_X(\mathbf{x},\partial X)<t\bigr\}\bigr)\leq c_\rho t$$
and the density \(p(\mathbf{x})=\frac{d\rho_{X}(\mathbf{x})}{dV}\) exists and satisfies
$$ \sup_{\mathbf{x}\in X}p(\mathbf{x})\leq c_\rho\quad \mbox{\textit{and}}\quad |p(\mathbf{x})-p(\mathbf{u})|\leq c_\rho d_X(\mathbf{x},\mathbf{u})^\theta,\quad \forall\mathbf{u},\mathbf {x}\in X.$$
Let \(\mathbf{f}_{\mathcal{Z}}\) be the estimated gradient function given by Eq. (8) and \(\nabla_{X}f_{\rho}\) be the true gradient of the regression function \(f_{\rho}\). Suppose that \(\mathcal {K}\in C^{2}(X\times X)\), \(f_{\rho}\in C^{2}(X)\) and \(d\varPhi(\nabla_{X} f_{\rho})\in \mathbb{H}_{\mathcal{K}}^{p}\). Choose \(\lambda=\lambda(n)=n^{-\frac{\theta}{d+2+2\theta}}\) and \(s=s(n)=n^{-\frac{1}{2(d+2+2\theta)}}\). Then there exists a constant C>0 such that for any 0<η≤1 with confidence 1-η
$$ \|(d\varPhi)^*\mathbf{f}_\mathcal{Z}-\nabla_X f_\rho\|_{L^2_{\rho _X}}\leq C \log \frac{4}{\eta} \biggl(\frac{1}{n} \biggr)^{\frac{\theta }{4(d+2+2\theta)}}.$$

Note that the convergence rate in Theorem 2 is exactly the same as the one in Theorem 1 except that the Euclidean dimension p is replaced by the intrinsic dimension d. The constraints \(\nabla f_{\rho}\in\mathbb{H}_{\mathcal{K}}^{p}\) in Theorem 1 and \(d\varPhi(\nabla_{X}f_{\rho})\in\mathbb{H}_{\mathcal{K}}^{p}\) in Theorem 2 are somewhat restrictive, and extension to milder conditions is possible (Mukherjee et al. 2010). Here we confine ourselves to these conditions in order to avoid introducing more notation and concepts. The proofs of Theorems 1 and 2 are somewhat complicated and will be given in Sect. 3. The main idea behind the proofs is to simultaneously control the sample error and the approximation error; see Sect. 3 for details.
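The effect of replacing p by d can be seen by evaluating the exponent \(\frac{\theta}{4(\mathrm{dim}+2+2\theta)}\) that appears in both theorems. The short sketch below does only this arithmetic; the function name `rate_exponent` and the example values p=20000, d=5, θ=1 are hypothetical, and the constant C and the \(\log\frac{4}{\eta}\) factor are ignored.

```python
def rate_exponent(dim, theta):
    """Exponent a in the bound n^(-a), with a = theta / (4 * (dim + 2 + 2 * theta))."""
    if not 0.0 < theta <= 1.0:
        raise ValueError("theta must lie in (0, 1]")
    if dim < 1:
        raise ValueError("dim must be a positive integer")
    return theta / (4.0 * (dim + 2 + 2.0 * theta))

# Example values (assumed): ambient dimension of a gene expression data set
# versus a small intrinsic manifold dimension, both with theta = 1.
ambient = rate_exponent(20000, 1.0)    # uses the Euclidean dimension p
intrinsic = rate_exponent(5, 1.0)      # uses the manifold dimension d
print(ambient, intrinsic, intrinsic / ambient)
```

Under these example values the exponent based on d is several thousand times larger than the one based on p, which is the sense in which the manifold assumption improves the rate.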

Note that the main purpose of our method is variable selection. If \(f_{\rho}\) depends only on a few coordinates of X, we can further improve the convergence rate to \(n^{-\frac{\theta}{2(|\mathbf {J}|+2+3\theta)}}\), where \(\mathbf{J}=\mathbf{J}(\nabla f_{\rho})=\{j:\frac{\partial f_{\rho}}{\partial x^{j}}\neq0\}\) and \(|\mathbf{J}|\) is the number of elements in the set \(\mathbf{J}\), i.e., the number of variables relevant to \(f_{\rho}\) (Bertin and Lecué 2008). Let \(\mathbf{f}=(f^{1},f^{2},\ldots,f^{p})^{T}\) and \(\mathbf{f}_{\mathbf{J}}\) be the concatenation of the loading function vectors indexed by \(\mathbf{J}\), that is, \(\mathbf{f}_{\mathbf{J}}=(f^{j})_{j\in\mathbf{J}}\). Similarly, we define \(\mathbf{x}_{\mathbf{J}}=(x^{j})_{j\in\mathbf{J}}\) and \(\mathbf{x}_{i,\mathbf{J}}=(x_{i}^{j})_{j\in\mathbf{J}}\).

Next we show how the improved convergence rate can be derived using the framework based on (10), which is equivalent to (8) but easier to handle in proving the convergence results. To derive the results, we adopt a two-step procedure: (1) we show that (10) selects, with high probability, only the relevant variables \(x^{j}\), \(j\in\mathbf{J}\); and (2) using the selected variables \(\widehat{\mathbf{J}}\), we run the following algorithm
$$ \mathbf{f}_{\mathcal{Z},\widehat{\mathbf{J}}}:=\arg \min_{\mathbf {f}_{\widehat{\mathbf{J}}}\in{\mathbb{H}}_{\mathcal{K}}^{|\widehat{\mathbf{J}}|}}\ \frac{1}{n(n-1)}\sum_{i,j=1}^n\omega_{i,j}^s \bigl(y_i-y_j+\mathbf{f}_{\widehat{\mathbf{J}}}(\mathbf {x}_{i,\widehat{\mathbf{J}}})\cdot(\mathbf{x}_{j,\widehat{\mathbf {J}}}-\mathbf{x}_{i,\widehat{\mathbf{J}}}) \bigr)^2 + \lambda\sum _{j\in\widehat{\mathbf{J}}}\|f^j\|^2_{\mathcal{K}},$$
to obtain an improved convergence rate that depends only on |J|. We will assume that the regression function and the distributions satisfy some regularity properties slightly different from the ones in Theorem 1:

Assumption 1

Assume |y|≤M almost surely. Suppose that for some 0<θ≤2/3 and C ρ >0, the marginal distribution ρ X satisfies
$$\rho_X \Bigl( \Bigl\{ \mathbf{x}\in X: \inf_{\mathbf{u}\in\mathbb {R}^p\backslash X}|\mathbf{u}-\mathbf{x}|\leq s \Bigr\} \Bigr)\leq C_\rho^2s^{4\theta}, \quad\forall s>0.$$
and the density \(p(\mathbf{x})\) of \(d\rho_{X}(\mathbf{x})\) exists and satisfies
$$\sup_{\mathbf{x}\in X}p(\mathbf{x})\leq C_\rho,\quad|p(\mathbf {x})-p(\mathbf{u})|\leq C_\rho|\mathbf{u}-\mathbf{x}|^\theta, \quad \forall\,\mathbf{u},\mathbf{x}\in X.$$
Denote \(V_{\rho}=\int_{X}(p(\mathbf{x}))^{2}d\mathbf{x}>0\) and \(\mathbf {L}_{\mathcal{K}}\mathbf{f}=\int_{X}\mathcal{K}_{\mathbf{x}}\mathbf {f}(\mathbf{x})\frac{p(\mathbf{x})}{V_{\rho}}d\rho_{X}(\mathbf{x})\). Suppose \(\nabla f_{\rho}\) lies in the range of \(\mathbf{L}_{\mathcal{K}}^{r}\), \(r>\frac{1}{2}\) (i.e. \(\|\mathbf{L}_{\mathcal{K}}^{-r}\nabla f_{\rho}\|_{L^{2}(\rho_{X})}< \infty\)) and has a sparsity pattern \(\mathbf {J}=\mathbf{J}(\nabla f_{\rho})=\{j: \frac{\partial f_{\rho}}{\partial x^{j}}\neq0\}\). For some \(C_{\rho}^{\prime}>0\), \(f_{\rho}\) satisfies
$$|f_\rho(\mathbf{u})-f_\rho(\mathbf{x})-\nabla f_\rho(\mathbf {x})\cdot(\mathbf{u}-\mathbf{x})|\leq C_\rho^\prime|\mathbf {u}-\mathbf{x}|^2,\quad \forall\,\mathbf{u},\mathbf{x}\in X.$$

Let \(B_{1}=\{\mathbf{f}_{\mathbf{J}}:\|\mathbf{f}_{\mathbf{J}}\|_{\mathcal{K}}\leq1\}\) and \(J_{\mathcal{K}}\) be the inclusion map from \(B_{1}\) to C(X). Let \(0<\eta<\frac{1}{2}\). We define the covering number \(\mathcal{N}(J_{\mathcal{K}}(B_{1}),\eta)\) to be the minimal \(\ell\in\mathbb{N}\) such that there exist \(\ell\) disks in \(J_{\mathcal{K}}(B_{1})\) with radius η covering \(J_{\mathcal{K}}(B_{1})\). The following Theorem 3 tells us that, with probability tending to 1, Eq. (10) selects the true set of relevant variables. Theorem 4 shows the improved convergence rate, which depends only on |J|, if we use the two-step procedure to learn gradients. The proofs of Theorems 3 and 4 are postponed to Sect. 3.

Theorem 3

Suppose the data \(\mathcal{Z}=\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{n}\) are i.i.d. drawn from a joint distribution ρ defined on X×Y. Let \(\mathbf{f}_{\mathcal{Z}}\) be defined as in (10) and \(\widehat{\mathbf{J}}=\{j:f^{j}_{\mathcal{Z}}\neq0\}\). Choose \(\lambda=\widetilde{C}_{M,\theta}s^{p+2+\theta }\), where \(\widetilde{C}_{M,\theta}\) is a constant defined in (35). Then under Assumption 1, there exists a constant \(0<s_{0}\leq1\) such that, for all s,n satisfying \(0<s<s_{0}\) and \(n^{r}s^{(2p+4+\theta)(1/2-r)+1-\theta}\geq \max\{C_{D,\theta}, \widetilde{C}_{D,\theta}\}\), where \(C_{D,\theta },\widetilde{C}_{D,\theta}\) are two constants defined in Propositions 7 and 9 respectively, we have \(\widehat{\mathbf{J}}=\mathbf{J}\) with probability greater than
$$1-\widetilde{C}_1\mathcal{N} \biggl(J_\mathcal{K}(B_1),\frac {s^{p+2+\theta}}{8(\kappa\operatorname{Diam}(X))^2} \biggr)\exp\bigl\{ -\widetilde{C}_2ns^{p+4}\bigr\},$$
where \(\widetilde{C}_{1}\) and \(\widetilde{C}_{2}\) are two constants independent of n or s.

Remark 5

The estimate of the covering number \(\mathcal{N}(J_{\mathcal{K}}(B_{1}),\eta)\) depends on the smoothness of the Mercer kernel \(\mathcal{K}\) (Cucker and Zhou 2007). If \(\mathcal{K}\in C^{\beta}(X\times X)\) for some β>0 and X has piecewise smooth boundary, then there is C>0 independent of β such that \(\mathcal{N}(J_{\mathcal{K}}(B_{1}),\eta)\leq C(\frac {1}{\eta})^{2p/\beta}\). In this case, if we choose \(s>(\frac{1}{n})^{\frac{\beta}{(p+4)\beta+2p(p+2+\theta)}}\), then \(1-\widetilde{C}_{1}\mathcal{N}(J_{\mathcal{K}}(B_{1}),\frac{s^{p+2+\theta }}{8(\kappa\operatorname{Diam}(X))^{2}})\exp\{-\widetilde {C}_{2}ns^{p+4}\}\) tends to 1 as n→∞. In particular, if we choose \(s=(\frac{1}{n})^{\frac{\beta}{2(p+4)\beta+4p(p+2+\theta)}}\), then \(1-\widetilde{C}_{1}\mathcal{N}(J_{\mathcal{K}}(B_{1}),\frac {s^{p+2+\theta}}{8(\kappa\operatorname{Diam}(X))^{2}})\exp\{-\widetilde{C}_{2}ns^{p+4}\}\geq1-\widetilde{C}_{1}^{\prime}\exp\{-\widetilde{C}_{2}^{\prime}\sqrt{n}\}\), where \(\widetilde{C}_{1}^{\prime},\widetilde{C}_{2}^{\prime}\) are two constants independent of n.
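The exponential rate in the last claim follows from a short exponent count, sketched here under the polynomial covering bound stated in the remark. The shorthand D and the constant \(C^{\prime\prime}\) are introduced only for this illustration. Write \(D=2(p+4)\beta+4p(p+2+\theta)\) and \(s=n^{-\beta/D}\). Then
$$ns^{p+4}=n^{1-\frac{(p+4)\beta}{D}}\geq n^{\frac{1}{2}},\qquad \mathcal{N} \biggl(J_\mathcal{K}(B_1),\frac{s^{p+2+\theta}}{8(\kappa\operatorname{Diam}(X))^2} \biggr)\leq C^{\prime\prime}s^{-\frac{2p(p+2+\theta)}{\beta}}=C^{\prime\prime}n^{\frac{2p(p+2+\theta)}{D}}\leq C^{\prime\prime}n^{\frac{1}{2}},$$
because \(D\geq2(p+4)\beta\) and \(D\geq4p(p+2+\theta)\). The polynomial factor \(n^{1/2}\) is absorbed by the exponential after slightly decreasing the constant in the exponent, which gives a bound of the form \(\widetilde{C}_{1}^{\prime}\exp\{-\widetilde{C}_{2}^{\prime}\sqrt{n}\}\) for the failure probability.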

The following Theorem shows that the convergence rate depends on \(|\textbf{J}|\), instead of p, if we use the two-step procedure to learn the gradients.

Theorem 4

Suppose the data \(\mathcal{Z}=\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{n}\) are i.i.d. drawn from a joint distribution ρ defined on X×Y. Let \(\mathbf {f}_{\mathcal{Z},\widehat{\mathbf{J}}}\) be defined as in (17). Assume that there exist α>0 and \(C_{\alpha}>0\) such that \(\ln\mathcal{N}(J_{\mathcal{K}}(B_{1}),\frac{\epsilon}{8\kappa^{2}(\operatorname{Diam}(X))^{2}})\leq C_{\alpha}(\frac{1}{\epsilon})^{\alpha}\). Let \(\nabla_{\widehat{\mathbf{J}}} f_{\rho}=(\frac{\partial f_{\rho}}{\partial x^{j}})_{j\in\widehat{\mathbf{J}}}\). Under the same conditions as in Theorem 3, we have
$$\operatorname{Prob} \bigl\{\Vert \mathbf{f}_{\mathcal{Z},\widehat{\mathbf {J}}}-\nabla_{\widehat{\mathbf{J}}} f_\rho \Vert _{L^2(\rho _X)}\geq\epsilon \bigr \}\leq\widetilde{C}_3\exp \bigl(-\widetilde {C}_4n^{\frac{\theta}{2(|\mathbf{J}|+2+3\theta)}}\epsilon \bigr),\quad\forall\epsilon>0,$$
where \(\widetilde{C}_{3}\) and \(\widetilde{C}_{4}\) are two constants independent of n or s.

2.4 Variable selection and effective dimension reduction

Next we describe how to do variable selection and extract EDR directions based on the learned gradient \(\mathbf{f}_{\mathcal{Z}}=(f_{\mathcal{Z}}^{1},\ldots,f_{\mathcal{Z}}^{p})^{T}\).

As discussed above, because of the \(\ell^{1}\) norm used in the regularization term, we expect many of the entries in the gradient vector \(\mathbf{f}_{\mathcal{Z}}\) to be zero functions. Thus, a natural way to select variables is to identify those entries that are non-zero functions. More specifically, we select variables based on the following criterion.

Definition 1

Variable selection via sparse gradient learning is to select variables in the set
$$ \mathcal{S}:=\bigl\{j: \|f_{\mathcal{Z}}^j\|_{\mathcal{K}} \neq 0,~j=1,\ldots,p\bigr\}$$
where \(\mathbf{f}_{\mathcal{Z}}=(f_{\mathcal{Z}}^{1},\ldots,f_{\mathcal {Z}}^{p})^{T}\) is the estimated gradient vector.
To select the EDR directions, we focus on the empirical gradient covariance matrix defined below
$$ \varXi:= \bigl[\bigl\langle f^i_{\mathcal{Z}},f^j_{\mathcal{Z}}\bigr\rangle_{\mathcal{K}} \bigr]_{i,j=1}^p.$$
The inner product \(\langle f_{\mathcal{Z}}^{i},f_{\mathcal{Z}}^{j}\rangle_{\mathcal{K}}\) can be interpreted as the covariance of the gradient functions between coordinate i and j. The larger the inner product is, the more related the variables x i and x j are. Given a unit vector u∈ℝ p , the RKHS norm of the directional derivative \(\|\mathbf{u}\cdot \mathbf{f}_{\mathcal{Z}}\|_{\mathcal{K}}\) can be viewed as a measure of the variation of the data \(\mathcal{Z}\) along the direction u. Thus the direction u 1 representing the largest variation in the data is the vector that maximizes \(\|\mathbf {u}\cdot\mathbf{f}_{\mathcal{Z}}\|^{2}_{\mathcal{K}}\). Notice that
$$\|\mathbf{u}\cdot\mathbf{f}_{\mathcal{Z}}\|^2_{\mathcal{K}}= \|\sum_iu_if_{\mathcal{Z}}^i\|_{\mathcal{K}}^2=\sum_{i,j}u_iu_j\bigl\langle f_{\mathcal{Z}}^i,f_{\mathcal{Z}}^j\bigr \rangle_{\mathcal{K}}=\mathbf {u}^T\varXi\mathbf{u}.$$
So \(\mathbf{u}_{1}\) is simply the eigenvector of Ξ corresponding to the largest eigenvalue. Similarly, to construct the second most important direction \(\mathbf{u}_{2}\), we maximize \(\|\mathbf{u}\cdot\mathbf{f}_{\mathcal{Z}}\|_{\mathcal{K}}\) over unit vectors in the orthogonal complement of \(\operatorname{span}\{\mathbf{u}_{1}\}\). By the Courant-Fischer minimax theorem (Golub and Van Loan 1989), \(\mathbf{u}_{2}\) is the eigenvector corresponding to the second largest eigenvalue of Ξ. We repeat this procedure to construct other important directions. In summary, the effective dimension reduction directions are defined according to the following criterion.

Definition 2

The d EDR directions identified by the sparse gradient learning are the eigenvectors {u 1,…,u d } of Ξ corresponding to the d largest eigenvalues.

As we mentioned in Sect. 2.1, the EDR space is spanned by the eigenvectors of the gradient outer product matrix G defined in Eq. (3). However, because the distribution of the data is unknown, G cannot be calculated explicitly. The above definition provides a way to approximate the EDR directions based on the empirical gradient covariance matrix.

Because of the sparsity of the estimated gradient functions, the matrix Ξ will be block sparse. Consequently, the identified EDR directions will be sparse as well, with non-zero entries only at coordinates belonging to the set \(\mathcal{S}\). To emphasize the sparsity of both Ξ and the identified EDR directions, we will refer to Ξ as the sparse empirical gradient covariance matrix (S-EGCM), and the identified EDR directions as the sparse effective dimension reduction directions (S-EDRs).
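As an illustration of Definitions 1 and 2, the following minimal sketch computes the selected set and the sparse EDR directions from a finite kernel expansion. It assumes that each component can be written as \(f_{\mathcal{Z}}^{j}=\sum_{i}c_{ij}\mathcal{K}_{\mathbf{x}_{i}}\), so that \(\langle f_{\mathcal{Z}}^{i},f_{\mathcal{Z}}^{j}\rangle_{\mathcal{K}}\) becomes a quadratic form in the coefficients and Ξ equals \(C^{T}KC\), with K the kernel Gram matrix. The function names, the Gaussian kernel, the tolerance `tol`, and the synthetic coefficients are assumptions made for illustration only.

```python
import numpy as np


def gaussian_gram(X, sigma=1.0):
    """Gram matrix of a Gaussian kernel (an illustrative kernel choice)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))


def select_and_reduce(C, K, d, tol=1e-10):
    """Variable selection and EDR directions from expansion coefficients.

    C : (n, p) array; column j holds the coefficients of f^j in the
        assumed expansion f^j = sum_i C[i, j] * K(x_i, .).
    K : (n, n) kernel Gram matrix on the sample points.
    d : number of EDR directions to return.
    """
    Xi = C.T @ K @ C                      # Xi[i, j] = <f^i, f^j>_K
    Xi = 0.5 * (Xi + Xi.T)                # remove round-off asymmetry
    norms = np.sqrt(np.clip(np.diag(Xi), 0.0, None))  # RKHS norms ||f^j||_K
    selected = np.flatnonzero(norms > tol)             # Definition 1
    evals, evecs = np.linalg.eigh(Xi)                  # ascending eigenvalues
    top = np.argsort(evals)[::-1][:d]                  # d largest (Definition 2)
    return selected, evals[top], evecs[:, top], Xi


# Synthetic example: only variables 0 and 2 have non-zero gradient components.
rng = np.random.default_rng(0)
n, p = 20, 6
X = rng.standard_normal((n, p))
K = gaussian_gram(X)
C = np.zeros((n, p))
C[:, [0, 2]] = rng.standard_normal((n, 2))
selected, evals, U, Xi = select_and_reduce(C, K, d=2)
```

In this synthetic example the rows and columns of Ξ outside the selected set are exactly zero, so the leading eigenvectors in `U` carry non-zero loadings only on the selected coordinates, which is the sparsity property described above.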

3 Convergence analysis

In this section, we give the proofs of Theorems 1, 2, 3 and 4.

3.1 Convergence analysis in the Euclidean setting

Note that our energy functional in (8) involves a nonsmooth regularization term \(\sum_{i}\|f^{i}\|_{\mathcal{K}}\). The method for the convergence analysis used in Mukherjee and Zhou (2006) can no longer be applied, since it needs an explicit form of the solution, which is only available for \(\ell^{2}\) regularization. However, we can still simultaneously control a sample (or estimation) error term and a regularization (or approximation) error term, an approach that is widely used in statistical learning theory (Vapnik 1998; Mukherjee and Wu 2006; Zhang 2004).

3.1.1 Comparative analysis

Recall the empirical error for a vector function f:=(f 1,…,f p ),
$$\mathcal{E}_\mathcal{Z}(\mathbf{f})=\frac{1}{n^2}\sum _{i,j=1}^n \omega_{i,j}^s\bigl(y_i-y_j+\mathbf{f}(\mathbf{x}_i)\cdot(\mathbf {x}_j-\mathbf{x}_i) \bigr)^2.$$
One can similarly define the expected error
$$\mathcal{E}(\mathbf{f})=\int_Z\int_Z\omega^s(\mathbf{x}-\mathbf {u}) \bigl(y-v+\mathbf{f}(\mathbf{x})\cdot (\mathbf{u}-\mathbf{x})\bigr)^2d\rho (\mathbf{x},y)d\rho(\mathbf{u},v).$$
Denote
$$\sigma_s^2=\int_X\int _Z\omega^s(\mathbf{x}-\mathbf{u})\bigl(y-f_\rho (\mathbf{x})\bigr)^2d\rho(\mathbf{x},y)d\rho_X(\mathbf{u}).$$
Then
$$\mathcal{E}(\mathbf{f})=2\sigma_s^2+\int_X\int_X\omega^s(\mathbf {x}-\mathbf{u})[f_\rho(\mathbf{x})-f_\rho(\mathbf{u})+\mathbf{f}(\mathbf{x})\cdot(\mathbf{u}-\mathbf{x})]^2d\rho_X(\mathbf {x})d\rho_X(\mathbf{u}).$$
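To make the empirical error concrete, here is a minimal sketch that evaluates \(\mathcal{E}_{\mathcal{Z}}(\mathbf{f})\) from the values of \(\mathbf{f}\) at the sample points. The Gaussian form of the weights, \(\omega_{i,j}^{s}=\exp\{-|\mathbf{x}_{i}-\mathbf{x}_{j}|^{2}/(2s^{2})\}\), is an assumption made for illustration, since the weight definition is not restated in this section, and the function name is hypothetical.

```python
import numpy as np


def empirical_error(X, y, F, s):
    """E_Z(f) = n^{-2} sum_{i,j} w_ij (y_i - y_j + f(x_i).(x_j - x_i))^2.

    X : (n, p) sample points; y : (n,) responses;
    F : (n, p) array whose i-th row is f(x_i);
    s : bandwidth of the (assumed) Gaussian weights.
    """
    n = X.shape[0]
    diff = X[None, :, :] - X[:, None, :]            # diff[i, j] = x_j - x_i
    w = np.exp(-np.sum(diff ** 2, axis=2) / (2.0 * s ** 2))
    lin = np.einsum("ik,ijk->ij", F, diff)          # f(x_i) . (x_j - x_i)
    resid = y[:, None] - y[None, :] + lin           # first-order Taylor residual
    return np.sum(w * resid ** 2) / n ** 2


rng = np.random.default_rng(1)
X = rng.standard_normal((30, 4))
a = np.array([1.0, -2.0, 0.0, 0.5])
y = X @ a                                           # linear response with gradient a
err_true = empirical_error(X, y, np.tile(a, (30, 1)), s=1.0)
err_zero = empirical_error(X, y, np.zeros((30, 4)), s=1.0)
```

For a noise-free linear response the true gradient makes every Taylor residual vanish, so `err_true` is zero up to round-off, whereas the zero vector field gives a strictly positive error.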

Note that our goal is to bound the \(L_{\rho_{X}}^{2}\) difference of \(\mathbf{f}\) and \(\nabla f_{\rho}\). The following comparative theorem bounds this difference in terms of the excess error \(\mathcal{E}(\mathbf {f})-2\sigma_{s}^{2}\).

For r>0, denote
$$\mathcal{F}_r= \Biggl\{\mathbf{f}\in\mathcal{H}_\mathcal{K}^p:\sum_{i=1}^p\| f^i\|_\mathcal{K}\leq r \Biggr\}.$$

Theorem 5

Assume \(\rho_{X}\) satisfies the conditions (11) and (12) and \(\nabla f_{\rho}\in \mathcal{H}_{\mathcal{K}}^{p}\). For \(\mathbf{f}\in\mathcal{F}_{r}\) with some r≥1, there exists a constant \(C_{0}>0\) such that
$$\|\mathbf{f}-\nabla f_\rho\|_{L_{\rho_X}^2}^2\leq C_0\biggl(r^2s^\theta +s^{2-\theta}+\frac{1}{s^{p+2+\theta}}\bigl(\mathcal{E}(\mathbf {f})-2\sigma_s^2\bigr)\biggr).$$

The proof of Theorem 5 is given in Appendix A.

3.1.2 Error decomposition

Now we turn to bound the quantity \(\mathcal{E}(\mathbf{f}_{\mathcal{Z}})-2\sigma_{s}^{2}\). Note that unlike the standard setting of regression and classification, \(\mathcal{E}_{\mathcal{Z}}(\mathbf{f})\) and \(\mathcal{E}(\mathbf{f})\) are not respectively the empirical and expected mean of a random variable. This is due to the extra (u,v) in the expected error term. However, since
$$E_\mathcal{Z}\mathcal{E}_\mathcal{Z}(\mathbf{f})=\frac {n-1}{n}\mathcal{E}({\mathbf{f}}),$$
\(\mathcal{E}_{\mathcal{Z}}(\mathbf{f})\) and \(\mathcal{E}(\mathbf{f})\) should be close to each other if the empirical error concentrates with n increasing. Thus, we can still decompose \(\mathcal{E}(\mathbf {f}_{\mathcal{Z}})-2\sigma_{s}^{2}\) into a sample error term and an approximation error term.
Note that \(\varOmega(\mathbf{f})=\lambda\sum_{i}\|f^{i}\|_{\mathcal{K}}\) with f=(f 1,…,f p ), so the minimizer of \(\mathcal{E}(\mathbf {f})+\varOmega(\mathbf{f})\) in \(\mathbb{H}_{\mathcal{K}}^{p}\) depends on λ. Let
$$ \mathbf{f}_\lambda=\arg \min_{\mathbf{f}\in\mathbb{H}_\mathcal{K}^p}\bigl\{ \mathcal{E}(\mathbf{f})+ \varOmega(\mathbf{f})\bigr\}.$$

By a standard decomposition procedure, we have the following result.

Proposition 1

Let \(\mathbf{f}_{\lambda}\) be defined as in (22), and let
$$\varphi(\mathcal{Z})=\bigl(\mathcal{E}(\mathbf{f}_{\mathcal{Z}})-\mathcal {E}_\mathcal{Z}(\mathbf{f}_\mathcal{Z})\bigr) +\bigl(\mathcal{E}_\mathcal{Z}(\mathbf{f}_\lambda)-\mathcal{E}(\mathbf {f}_\lambda)\bigr)$$
and
$$\mathcal{A}(\lambda)=\inf_{\mathbf{f}\in \mathbb{H}_\mathcal{K}^p}\bigl\{\mathcal{E}(\mathbf{f})-2\sigma_s^2+\varOmega (\mathbf{f})\bigr\}.$$
Then, we have
$$\mathcal{E}(\mathbf{f}_\mathcal{Z})-2\sigma_s^2\leq\mathcal{E}(\mathbf {f}_\mathcal{Z})-2\sigma_s^2+\varOmega(\mathbf{f}_\mathcal{Z})\leq \varphi(\mathcal{Z})+\mathcal{A}(\lambda)$$

The quantity \(\varphi(\mathcal{Z})\) is called the sample error and \(\mathcal{A}(\lambda)\) is the approximation error.
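For completeness, the decomposition can be spelled out as follows. The sketch assumes that \(\mathbf{f}_{\mathcal{Z}}\) minimizes \(\mathcal{E}_{\mathcal{Z}}(\mathbf{f})+\varOmega(\mathbf{f})\) over \(\mathbb{H}_{\mathcal{K}}^{p}\) and that the infimum defining \(\mathcal{A}(\lambda)\) is attained at \(\mathbf{f}_{\lambda}\):
$$\begin{aligned}\mathcal{E}(\mathbf{f}_\mathcal{Z})-2\sigma_s^2+\varOmega(\mathbf{f}_\mathcal{Z})&=\bigl(\mathcal{E}(\mathbf{f}_\mathcal{Z})-\mathcal{E}_\mathcal{Z}(\mathbf{f}_\mathcal{Z})\bigr)+\bigl(\mathcal{E}_\mathcal{Z}(\mathbf{f}_\mathcal{Z})+\varOmega(\mathbf{f}_\mathcal{Z})-\mathcal{E}_\mathcal{Z}(\mathbf{f}_\lambda)-\varOmega(\mathbf{f}_\lambda)\bigr)\\&\quad+\bigl(\mathcal{E}_\mathcal{Z}(\mathbf{f}_\lambda)-\mathcal{E}(\mathbf{f}_\lambda)\bigr)+\bigl(\mathcal{E}(\mathbf{f}_\lambda)-2\sigma_s^2+\varOmega(\mathbf{f}_\lambda)\bigr)\\&\leq\varphi(\mathcal{Z})+\mathcal{A}(\lambda).\end{aligned}$$
The second bracket is non-positive by the minimizing property of \(\mathbf{f}_{\mathcal{Z}}\), the first and third brackets add up to \(\varphi(\mathcal{Z})\), and the last bracket equals \(\mathcal{A}(\lambda)\).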

3.1.3 Sample error estimation

Note that the sample error \(\varphi(\mathcal{Z})\) can be bounded by controlling
$$S(\mathcal{Z},r):=\sup_{\mathbf{f}\in\mathcal{F}_r}|\mathcal {E}_\mathcal{Z}(\mathbf{f})-\mathcal{E}(\mathbf{f})|.$$
In fact, if both \(\mathbf{f}_{\mathcal{Z}}\) and f λ are in \(\mathcal{F}_{r}\) for some r>0, then
$$ \varphi(\mathcal{Z})\leq2S(\mathcal{Z},r).$$
We use McDiarmid’s inequality (McDiarmid 1989) to bound \(S(\mathcal{Z},r)\). Denote \(\kappa=\sup_{\mathbf{x}\in X}\sqrt{\mathcal{K}(\mathbf{x},\mathbf{x})}\) and \(\operatorname{Diam}(X)=\max_{\mathbf{x},\mathbf{u}\in X}|\mathbf{x}-\mathbf{u}|\).

Lemma 1

For every r>0,
$$\operatorname{Prob}\bigl\{|S(\mathcal{Z},r)-ES(\mathcal{Z},r)|\geq\epsilon\bigr \} \leq2\exp \biggl(-\frac{n\epsilon^2}{32(M+\kappa \operatorname{Diam}(X)r)^4} \biggr).$$


Proof

Let (x′,y′) be a sample drawn independently from the distribution ρ(x,y). Denote by \(\mathcal{Z}_{i}^{\prime}\) the sample which coincides with \(\mathcal{Z}\) except that the i-th entry \((\mathbf{x}_{i},y_{i})\) is replaced by (x′,y′). A direct computation bounds \(S(\mathcal{Z},r)-S(\mathcal{Z}_{i}^{\prime},r)\), and interchanging the roles of \(\mathcal{Z}\) and \(\mathcal{Z}_{i}^{\prime}\) gives
$$|S(\mathcal{Z},r)-S\bigl(\mathcal{Z}_i^\prime,r\bigr)|\leq \frac{8(M+\kappa \operatorname{Diam}(X)r)^2}{n}.$$
By McDiarmid’s inequality, we obtain the desired estimate. □
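The constant in Lemma 1 can be checked directly from the bounded-difference estimate. Writing \(B=M+\kappa\operatorname{Diam}(X)r\) (a shorthand introduced here), each coordinate has bounded difference \(c_{i}=8B^{2}/n\), and McDiarmid's inequality in its two-sided form (assumed here to be the version used) gives
$$\operatorname{Prob}\bigl\{|S(\mathcal{Z},r)-ES(\mathcal{Z},r)|\geq\epsilon\bigr\}\leq2\exp \biggl(-\frac{2\epsilon^2}{\sum_{i=1}^nc_i^2} \biggr)=2\exp \biggl(-\frac{2\epsilon^2}{n\cdot64B^4/n^2} \biggr)=2\exp \biggl(-\frac{n\epsilon^2}{32B^4} \biggr).$$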

In order to bound \(S(\mathcal{Z},r)\) using Lemma 1, we need a bound of \(ES(\mathcal{Z},r)\).

Lemma 2

For every r>0,
$$ES(\mathcal{Z},r)\leq\frac{11(\kappa \operatorname{Diam}(X)r+M)^2}{\sqrt{n}}.$$

The proof of Lemma 2 is given in Appendix B.

Now we can derive the following proposition by using inequality (23), Lemmas 1 and 2.

Proposition 2

Assume r>1. There exists a constant C 3>0 such that with confidence at least 1-δ,
$$\varphi(\mathcal{Z})\leq C_3\frac{(\kappa \operatorname{Diam}(X)r+M)^2\log\frac{2}{\delta}}{\sqrt{n}}.$$
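One way to see where the shape of this bound comes from is to combine \(\varphi(\mathcal{Z})\leq2S(\mathcal{Z},r)\) with Lemmas 1 and 2 explicitly. The sketch below does this numerically. The helper names and the explicit constant are illustrative; the constant is one admissible choice for 0<δ≤1 and is not claimed to be the \(C_{3}\) of the proposition.

```python
import math


def sample_error_bound(n, delta, M, kappa, diam, r):
    """Bound on phi(Z) holding with confidence at least 1 - delta.

    Uses phi <= 2 S, E S <= 11 B^2 / sqrt(n) (Lemma 2), and the deviation
    eps solving 2 exp(-n eps^2 / (32 B^4)) = delta (Lemma 1),
    where B = M + kappa * diam * r.
    """
    B2 = (M + kappa * diam * r) ** 2
    eps = B2 * math.sqrt(32.0 * math.log(2.0 / delta) / n)
    return 2.0 * (11.0 * B2 / math.sqrt(n) + eps)


def proposition_form(n, delta, M, kappa, diam, r):
    """The same bound relaxed to the shape C * B^2 * log(2/delta) / sqrt(n)."""
    B2 = (M + kappa * diam * r) ** 2
    # Valid for 0 < delta <= 1, where log(2/delta) >= log 2.
    C = 2.0 * (11.0 / math.log(2.0) + math.sqrt(32.0 / math.log(2.0)))
    return C * B2 * math.log(2.0 / delta) / math.sqrt(n)


b1 = sample_error_bound(1000, 0.05, M=1.0, kappa=1.0, diam=2.0, r=1.5)
b2 = sample_error_bound(4000, 0.05, M=1.0, kappa=1.0, diam=2.0, r=1.5)
relaxed = proposition_form(1000, 0.05, M=1.0, kappa=1.0, diam=2.0, r=1.5)
```

Quadrupling n halves the bound, which reflects the \(1/\sqrt{n}\) rate, and the relaxed form dominates the explicit one.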

Note that in order to use this Proposition, we still need a bound on \(\varOmega(\mathbf{f}_{\mathcal{Z}})=\lambda\sum_{i}\|f_{\mathcal{Z}}^{i}\|_{\mathcal{K}}\). We first state a rough bound.

Lemma 3

For every s>0 and λ>0, \(\varOmega(\mathbf{f}_{\mathcal{Z}})\leq M^{2}\).


Proof

The conclusion follows from the fact
$$\varOmega(\mathbf{f}_{\mathcal{Z}})\leq\mathcal{E}_\mathcal{Z}(\mathbf {f}_{\mathcal{Z}})+ \varOmega(\mathbf{f}_\mathcal{Z})\leq\mathcal {E}_\mathcal{Z}(\mathbf{0}) \leq M^2.$$

However, with this rough bound the estimate in Theorem 5 is at least of order \(O(\frac{1}{\lambda^{2}s^{2p+4-\theta}})\), which tends to ∞ as s→0 and λ→0. A sharper bound is therefore needed; it is given in Sect. 3.1.5.

3.1.4 Approximation error estimation

We now bound the approximation error \(\mathcal{A}(\lambda)\).

Proposition 3

If \(\nabla f_{\rho}\in\mathbb{H}_{\mathcal{K}}^{p}\), then \(\mathcal{A}(\lambda)\leq C_{4}(\lambda+s^{4+p})\) for some C 4>0.


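Proof

By the definition of \(\mathcal{A}(\lambda)\) and the fact that \(\nabla f_{\rho}\in\mathbb{H}_{\mathcal{K}}^{p}\),
$$\mathcal{A}(\lambda)\leq\mathcal{E}(\nabla f_\rho)-2\sigma_s^2+\varOmega(\nabla f_\rho).$$
Since \(\mathcal{E}(\nabla f_{\rho})-2\sigma_{s}^{2}\leq(C_{\mathcal{K}})^{2}c_{\rho}M_{4}s^{4+p}\) and \(\varOmega(\nabla f_{\rho})=\lambda\sum_{i=1}^{p}\|(\nabla f_{\rho})^{i}\|_{\mathcal{K}}\), taking \(C_{4}=\max\{(C_{\mathcal{K}})^{2}c_{\rho}M_{4}, \sum_{i=1}^{p}\|(\nabla f_{\rho})^{i}\|_{\mathcal{K}}\}\), we get the desired result. □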

3.1.5 Convergence rate

Following directly from Propositions 1, 2 and 3, we get

Theorem 6

If \(\nabla f_{\rho}\in\mathbb{H}_{\mathcal{K}}^{p}\), and \(\mathbf{f}_{\mathcal{Z}}\) and \(\mathbf{f}_{\lambda}\) are in \(\mathcal{F}_{r}\) for some r≥1, then with confidence 1-δ
$$\mathcal{E}(\mathbf{f}_\mathcal{Z})-2\sigma_s^2\leq C_2 \biggl(\frac {(M+\kappa \operatorname{Diam}(X)r)^2\log\frac{2}{\delta}}{\sqrt {n}}+s^{4+p}+\lambda \biggr),$$
where C 2 is a constant independent of r,s or λ.

In order to apply Theorem 5, we need a sharp bound on \(\varOmega(\mathbf{f}_{\mathcal{Z}}):=\lambda\sum_{i}\|f^{i}_{\mathcal{Z}}\|_{\mathcal{K}}\).

Lemma 4

Under the assumptions of Theorem 1,
$$\varOmega(\mathbf{f}_\lambda)\leq C_4\bigl(\lambda+s^{4+p}\bigr),$$
where C 4 is a constant independent of λ or s.


Proof

Since \(\mathcal{E}(\mathbf{f})-2\sigma_{s}^{2}\) is non-negative for all \(\mathbf{f}\), we have
$$\varOmega(\mathbf{f}_\lambda)\leq \mathcal{E}(\mathbf{f}_\lambda)-2\sigma_s^2+\varOmega(\mathbf{f}_\lambda)=\mathcal{A}(\lambda).$$
This in conjunction with Proposition 3 implies the conclusion. □

Lemma 5

Under the assumptions of Theorem 1, with confidence at least 1-δ
$$\varOmega(\mathbf{f}_{\mathcal{Z}})\leq C_5 \biggl(\lambda+s^{4+p}+ \biggl(1+\frac{\kappa \operatorname{Diam}(X)M}{\lambda } \biggr)^2\frac{M^2\log\frac{2}{\delta}}{\sqrt{n}} \biggr)$$
for some C 5>0 independent of s or λ.


Proof

By the fact \(\mathcal{E}(\mathbf{f}_{\mathcal{Z}})-2\sigma_{s}^{2}\geq0\) and Proposition 1, we have \(\varOmega(\mathbf{f}_{\mathcal{Z}})\leq \varphi(\mathcal{Z})+\mathcal{A}(\lambda)\). Since both \(\mathbf{f}_{\mathcal{Z}}\) and \(\mathbf{f}_{\lambda}\) are in \(\mathcal{F}_{\frac{M^{2}}{\lambda}}\), using Proposition 2, we have with probability at least 1-δ,
$$\varphi(\mathcal{Z})\leq C_3 \biggl(1+\frac{\kappa \operatorname{Diam}(X)M}{\lambda }\biggr)^2 \frac{M^2\log\frac{2}{\delta}}{\sqrt{n}}.$$
Together with Proposition 3, we obtain the desired estimate with C 5=max{C 3,C 4}. □

Now we will use Theorems 5 and 6 to prove Theorem 1.

Proof of Theorem 1

By Theorems 5 and 6, we have with probability at least \(1-\frac{\delta}{2}\),
$$\|\mathbf{f}_\mathcal{Z}-\nabla f_\rho\|_{L^2_{\rho_X}}^2\leq C_0 \biggl\{r^2s^\theta +s^{2-\theta}+\frac{C_2}{s^{p+2+\theta}} \biggl(\frac{(M+\kappa \operatorname{Diam}(X)r)^2\log\frac{4}{\delta}}{\sqrt{n}}+s^{4+p}+\lambda \biggr)\biggr\},$$
if both \(\mathbf{f}_{\mathcal{Z}}\) and f λ are in \(\mathcal{F}_{r}\) for some r>1. By Lemmas 4 and 5, we can state that both \(\mathbf{f}_{\mathcal{Z}}\) and f λ are in \(\mathcal{F}_{r}\) with probability \(1-\frac{\delta}{2}\) if
$$r=\max \biggl\{1+\frac{s^{4+p}}{\lambda}, \biggl(1+\frac{\kappa \operatorname{Diam}(X)M}{\lambda }\biggr)^2\frac{M^2\log\frac{4}{\delta}}{\lambda\sqrt{n}} \biggr\}.$$
Choosing \(s= (\frac{1}{n} )^{\frac{1}{2(p+2+2\theta)}}\) and \(\lambda = (\frac{1}{n} )^{\frac{\theta}{p+2+2\theta}}\), we obtain with confidence 1-δ,
$$\|\mathbf{f}_\mathcal{Z}-\nabla f_\rho\|_{L^2_{\rho_X}}\leq C\biggl(\frac{1}{n} \biggr)^{\frac{\theta}{4(p+2+2\theta)}}.$$

3.2 Convergence analysis in the manifold setting

The convergence analysis in the manifold setting can be carried out in the same way as in the Euclidean setting. The idea behind the proof of the convergence of the gradient is again to simultaneously control a sample (or estimation) error term and a regularization (or approximation) error term.

As done in the convergence analysis in the Euclidean setting, we first use the excess error, \(\mathcal{E}(\mathbf{f})-2\sigma_{s}^{2}\), to bound the \(L_{\rho_{X}}^{2}\) difference of \(\nabla_{X}f_{\rho}\) and \((d\varPhi)^{*}(\mathbf{f})\). Denote

$$\mathcal{F}_r= \Biggl\{\mathbf{f}\in\mathcal{H}_\mathcal{K}^p:\sum_{i=1}^p\| f^i\|_\mathcal{K}\leq r \Biggr\},\quad r>0.$$

Theorem 7

Assume \(\rho_{X}\) satisfies the conditions (14) and (15) and \(\nabla_{X}f_{\rho}\in C^{2}(X)\). For \(\mathbf {f}\in\mathcal{F}_{r}\) with some r≥1, there exists a constant \(C_{0}>0\) such that
$$\|(d\varPhi)^*(\mathbf{f})-\nabla_Xf_\rho \|_{L_{\rho_X}^2}^2\leq C_0 \biggl(r^2s^\theta+\frac{1}{s^{d+2+\theta}}\bigl(\mathcal{E}(\mathbf {f})-2\sigma_s^2\bigr) \biggr).$$


Proof

The result can be derived directly from Lemma B.1 in Mukherjee et al. (2010) by using the inequality \(\sum_{i=1}^{n} |v_{i}|^{2}\leq(\sum_{i=1}^{n} |v_{i}|)^{2}\). □

3.2.1 Excess error estimation

In this subsection, we will bound \(\mathcal{E}(\mathbf{f}_{\mathcal{Z}})-2\sigma_{s}^{2}\). First, we decompose the excess error into sample error and approximation error.

Proposition 4

Let \(\mathbf{f}_{\lambda}\) be defined as in (22), and let
$$\varphi(\mathcal{Z})=\bigl(\mathcal{E}(\mathbf{f}_\mathcal{Z})-\mathcal {E}_\mathcal{Z} (\mathbf{f}_\mathcal{Z})\bigr)+\bigl(\mathcal{E}_\mathcal{Z}(\mathbf{f}_\lambda )-\mathcal{E} (\mathbf{f}_\lambda)\bigr)$$
and
$$\mathcal{A}(\lambda)=\inf_{\mathbf{f}\in\mathbb{H}_\mathcal {K}^p} \bigl\{ \mathcal{E}(\mathbf{f})-2\sigma_s^2+\varOmega(\mathbf{f}) \bigr\}.$$
Then, we have
$$\mathcal{E}(\mathbf{f}_\mathcal{Z})-2\sigma_s^2+\varOmega(\mathbf {f}_\mathcal{Z})\leq \varphi(\mathcal{Z})+\mathcal{A}(\lambda).$$

Since the proof of Proposition 2 does not use any structural information about X, the proposition remains true in the manifold setting. Thus we have the same sample error bound as in the Euclidean setting. What is left is to estimate the approximation error \(\mathcal {A}(\lambda)\) in the manifold setting.

Proposition 5

Let X be a connected compact \(C^{\infty}\) submanifold of \(\mathbb{R}^{p}\) which is isometrically embedded and of dimension d. If \(f_{\rho}\in C^{2}(X)\) and \(d\varPhi(\nabla_{X}f_{\rho})\in\mathcal{H}_{\mathcal{K}}^{p}\), then
$$\mathcal{A}(\lambda)\leq C_6\bigl(\lambda+s^{4+d}\bigr)$$
for some C 6>0.


Proof

By the definition of \(\mathcal{A}(\lambda)\) and the fact that \(d\varPhi (\nabla_{X}f_{\rho})\in\mathcal{H}_{\mathcal{K}}^{p}\),
$$\mathcal{A}(\lambda)\leq\mathcal{E}\bigl(d\varPhi(\nabla_Xf_\rho )\bigr)-2\sigma_s^2+\varOmega\bigl(d\varPhi(\nabla_Xf_\rho)\bigr).$$
Note that \(f_{\rho}\in C^{2}(X)\) and \(d\varPhi(\nabla_{X}f_{\rho})\in\mathcal {H}_{\mathcal{K}}^{p}\). By Lemma B.2 in Mukherjee et al. (2010), we have
$$\mathcal{E}\bigl(d\varPhi(\nabla_Xf_\rho)\bigr)-2\sigma_s^2\leq C_7 s^{4+d},$$
where C 7 is a constant independent of s. Taking \(C_{6}=\max\{C_{7},\sum_{i=1}^{p}\|(d\varPhi(\nabla_{X}f_{\rho}))^{i}\|_{\mathcal{K}}\}\), we get the desired result. □

Combining Propositions 4, 2 and 5, we get the estimate for the excess error.

Theorem 8

If \(d\varPhi(\nabla f_{\rho})\in\mathbb{H}_{\mathcal{K}}^{p}, \mathbf {f}_{\mathcal{Z}}\) and f λ are in \(\mathcal{F}_{r}\) for some r≥1, then with confidence 1-δ,
$$\mathcal{E}(\mathbf{f}_\mathcal{Z})-2\sigma_s^2\leq C_8 \biggl(\frac {(M+\kappa \operatorname{Diam}(X)r)^2\log\frac{2}{\delta}}{\sqrt {n}}+s^{d+4}+\lambda \biggr),$$
where C 8 is a constant independent of s,λ,δ or r.

3.2.2 Convergence rate

In order to use Theorems 7 and 8, we need sharp estimates for \(\sum_{i=1}^{p}\|(d\varPhi(\nabla_{X}f_{\rho}))^{i}\|_{\mathcal{K}}\) and \(\sum_{i=1}^{p}\|f^{i}_{\lambda}\|_{\mathcal{K}}\). These can be obtained by the same argument as in the Euclidean setting, so we omit the proof here.

Lemma 6

Under the assumptions of Theorem 2, with confidence at least 1-δ,
$$\varOmega(\mathbf{f}_{\mathcal{Z}})\leq C_9 \biggl(\lambda+s^{4+d}+ \biggl(1+\frac{\kappa \operatorname{Diam}(X)M}{\lambda} \biggr)^2\frac{M^2\log\frac {2}{\delta}}{\sqrt{n}} \biggr)$$
and
$$\varOmega(\mathbf{f}_\lambda)\leq C_9\bigl(\lambda+s^{4+d}\bigr),$$
where C 9 is a constant independent of λ or s.

Now we prove Theorem 2.

Proof of Theorem 2

By the same argument as the one in proving Theorem 1, we can derive the convergence rate using Theorems 7, 8 and Lemma 6. □

3.3 Proof of Theorems 3 and 4

In order to prove Theorem 3, we need to characterize the solution of (10).

Proposition 6

A vector function \(\mathbf{f}\in\mathbb{H}_{\mathcal{K}}^{p}\) with sparsity pattern \(\mathbf{J}=\mathbf{J}(\mathbf{f})=\{j:f^{j}\neq0\}\) is optimal for problem (10) if and only if the two conditions (25) and (26) hold.

The proof of the proposition follows the same lines as that of Proposition 10 in Bach (2008), so we omit the details here.

It is easy to see that problem (10) has a unique solution. If we can construct a solution \(\tilde {\mathbf{f}}_{\mathcal{Z}}\) that satisfies the two conditions in Proposition 6 with high probability, then Theorem 3 holds.

Let \(\mathbf{J}=\mathbf{J}(\nabla f_{\rho})=\{j: \frac{\partial f_{\rho}}{\partial x^{j}}\neq0\}\) and \(\tilde{\mathbf{f}}_{\mathcal {Z},\mathbf{J}}\) be any minimizer of
$$ \min_{\mathbf{f}_\mathbf{J}\in\mathbb{H}_\mathcal{K}^{|\mathbf {J}|}}\frac{1}{n(n-1)}\sum _{i,j=1}^n\omega_{i,j}^s \bigl(y_i-y_j+\mathbf{f}_\mathbf {J}(\mathbf{x}_i)\cdot(\mathbf{x}_{j,\mathbf{J}}-\mathbf {x}_{i,\mathbf{J}}) \bigr)^2+\lambda \biggl(\sum_{j\in\mathbf{J}}\| f^j\|_\mathcal{K} \biggr)^2.$$
We extend it by zeros on \(\mathbf{J}^{c}\) and denote the extension by \(\tilde {\mathbf{f}}_{\mathcal{Z}}\). Then \(\tilde{\mathbf{f}}_{\mathcal{Z}}\) satisfies (25) by the first optimality condition for \(\tilde{\mathbf{f}}_{\mathcal{Z},\mathbf{J}}\). So we only need to prove that \(\tilde{\mathbf{f}}_{\mathcal{Z}}\) satisfies (26) with probability tending to 1. For this, we construct five events \(\varOmega_{0},\varOmega_{1},\varOmega_{2},\varOmega_{3}\) and \(\varOmega_{4}\), whose definitions use the matrix \(\mathbf{D}_{n}=\sum_{i=1}^{p}\|\tilde{\mathbf{f}}_{\mathcal{Z}}^{i}\|_{\mathcal{K}} \operatorname{diag} (1/\|\tilde{\mathbf {f}}_{\mathcal{Z}}^{j}\|_{\mathcal{K}} )_{j\in\mathbf{J}}\). The main idea of the proof of Theorem 3 is to bound the probabilities of these five events, since the probability that (26) holds can be bounded from below in terms of the probabilities of \(\varOmega_{0},\varOmega_{1},\varOmega_{2},\varOmega_{3}\) and \(\varOmega_{4}\).
Note that, on event Ω 0,
$$\|\partial f_\rho/\partial x^j\|_\mathcal{K}-\frac{1}{2}\min_{j\in \mathbf{J}}\|\partial f_\rho/\partial x^j\|_\mathcal{K}\leq\|\tilde {\mathbf{f}}_{\mathcal{Z}}^j\|_\mathcal{K}\leq\frac{1}{2}\min_{j\in\mathbf{J}}\|\partial f_\rho/\partial x^j\|_\mathcal{K}+\| \partial f_\rho/\partial x^j\|_\mathcal{K}, \quad\forall j\in \mathbf{J}$$
holds. Then on the event Ω 0, D n is well-defined and satisfies \(\frac{1}{3}D_{\min}\mathbf{I}\preceq\mathbf {D}_{n}\preceq3 D_{\max} \mathbf{I}\), where
$$D_{\min}=\min_{j^\prime \in\mathbf{J}} \biggl(\sum_{j\in\mathbf{J}}\bigg\|\frac{\partial f_\rho }{\partial x^j}\bigg\|_\mathcal{K} \biggr) \Big/\|\partial f_\rho/\partial x^{j^\prime}\|_\mathcal{K}$$
and
$$D_{\max}=\max_{j^\prime\in \mathbf{J}} \biggl(\sum_{j\in\mathbf{J}}\bigg\|\frac{\partial f_\rho }{\partial x^j}\bigg\|_\mathcal{K} \biggr) \Big/ \|\partial f_\rho/\partial x^{j^\prime} \|_\mathcal{K}.$$
In addition,
$$\frac{1}{2}\|\nabla f_\rho\|_\mathcal{K}\leq\sum _{j=1}^p\|\tilde {\mathbf{f}}_{\mathcal{Z}}^j\|_\mathcal{K}\leq\frac{3\sqrt {|\mathbf{J}|}}{2}\|\nabla f_\rho \|_\mathcal{K}.$$
Using this notation, and according to Proposition 6 and the uniqueness of the solution, we only need to prove that \(\tilde{\mathbf {f}}_{\mathcal{Z},\mathbf{J}}\) satisfies the remaining optimality condition. By the first optimality condition for \(\tilde{\mathbf{f}}_{\mathcal {Z},\mathbf{J}}\), the condition (32) holds. Substituting this back into \(\widetilde{\mathbf{S}}_{\mathcal {Z},\mathbf{J}}\tilde{\mathbf{f}}_{\mathcal{Z},\mathbf{J}}-\tilde {Y}_{\mathbf{J}}\) gives an explicit expression for this quantity. Together with the fact that on the event \(\varOmega_{0}\), \(\frac{1}{2}\|\nabla f_{\rho}\|_{\mathcal{K}}\leq\sum_{j=1}^{p}\|\tilde{\mathbf {f}}_{\mathcal{Z}}^{j}\|_{\mathcal{K}}\leq\frac{3\sqrt{|\mathbf {J}|}}{2}\|\nabla f_{\rho}\|_{\mathcal{K}}\),
$$\varOmega_1\cap\varOmega_2\cap\varOmega_3|\mathcal{Z}\in\varOmega_0\subseteq \Biggl\{\|\widetilde{\mathbf{S}}_{\mathcal{Z},\mathbf {J}}\tilde{\mathbf{f}}_{\mathcal{Z},\mathbf{J}}-\widetilde{Y}_{\mathbf {J}}\|_\mathcal{K}\leq\lambda\sum _{j=1}^p\|\mathbf{f}_\mathcal {Z}^j\|_\mathcal{K}|\mathcal{Z}\in\varOmega_0\Biggr\}$$
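holds. Therefore, the probability that (26) holds is bounded from below in terms of the probabilities of these events, which is inequality (34).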

The following propositions provide estimates for the probabilities of these events.

Proposition 7

Let \(\mathbf{J}=\mathbf{J}(\nabla f_{\rho})\) and \(\tilde{\mathbf {f}}_{\mathcal{Z},\mathbf{J}}\) be any minimizer of (27). Assume Assumption 1 holds. Let \(\lambda =\widetilde{C}_{M,\theta}s^{p+2+\theta}\), where \(\widetilde{C}_{M,\theta}\) is the constant defined in (35). Then there exists a constant \(0<s_{0}\leq1\) such that, for all s,n satisfying \(0<s<s_{0}\) and \(n^{r}s^{(2p+4+\theta)(1/2-r)+1-\theta}\geq C_{D,\theta}:=4D_{\max}(D_{\min})^{r-\frac{3}{2}}(\widetilde {C}_{M,\theta})^{r-\frac{1}{2}}C_{\rho,r}\), we have
$$ \operatorname{Prob}(\varOmega_0) \geq1- 4\exp \bigl\{-C_\mathbf{J}ns^{p+4}\bigr\},$$
where C J is a constant independent of n or s.

Proposition 8

Let \(\mathcal{Z}=\{\mathbf{z}_{i}\}_{i=1}^{n}\) be i.i.d. draws from a probability distribution ρ on Z. Under Assumption 1,
$$\operatorname{Prob}\{\varOmega_4\}\geq1-\mathcal{N}\biggl(J_\mathcal {K}(B_1),\frac{s^{p+2+\theta}}{8(\kappa\operatorname{Diam}(X))^2} \biggr)\exp \bigl \{-C_{\tilde{\mathbf{s}}}ns^{p+2+\theta} \bigr\},$$
where \(C_{\tilde{\mathbf{s}}}\) is a constant independent of n or s.

Proposition 9

Let \(\mathcal{Z}=\{\mathbf{z}_{i}\}_{i=1}^{n}\) be i.i.d. draws from a probability distribution ρ on Z. Choose \(\lambda=\widetilde{C}_{M,\theta}s^{p+2+\theta}\). Under Assumption 1, if
$$n^rs^{(2p+4+\theta)(\frac{1}{2}-r)+p+2+\theta}\geq \widetilde{C}_{D,\theta}:=36D_{\max}(\kappa^2C_\rho M_{|\mathbf {J}|,1}M_{|\mathbf{J}^c|,1}+1)C_{\rho,r} \biggl(\frac{1}{3}D_{\min}\widetilde{C}_{M,\theta} \biggr)^{r-\frac {3}{2}} /\|\nabla f_\rho\|_\mathcal{K}$$
then there exists a constant \(C_{\varOmega_{1}}\) such that
$$\operatorname{Prob}\bigl(\varOmega_1^c|\mathcal{Z}\in \varOmega_0\cap\varOmega_4\bigr)\leq 2\exp \bigl \{-C_{\varOmega_1}ns^{p+2+\theta} \bigr\}.$$

Proposition 10

Let \(\mathcal{Z}=\{\mathbf{z}_{i}\}_{i=1}^{n}\) be i.i.d. draws from a probability distribution ρ on Z. Choose \(\lambda=\widetilde {C}_{M,\theta}s^{p+2+\theta}\) with \(\widetilde{C}_{M,\theta}\) defined in (35). Then under Assumption 1, there exists a constant \(C_{\varOmega_{2}}>0\) such that
$$\operatorname{Prob}(\varOmega_2)\geq1-2\exp \bigl \{-C_{\varOmega _2}ns^{p+2+\theta} \bigr\}.$$

Proposition 11

Let \(\mathcal{Z}=\{\mathbf{z}_{i}\}_{i=1}^{n}\) be i.i.d. draws from a probability distribution ρ on Z. Choose \(\lambda=\widetilde {C}_{M,\theta}s^{p+2+\theta}\) with \(\widetilde{C}_{M,\theta}\) defined in (35). Then under Assumption 1, there exists a constant \(C_{\varOmega_{3}}>0\) such that
$$\operatorname{Prob}\bigl(\varOmega_3^c|\mathcal{Z}\in \varOmega_0\cap\varOmega_4\bigr)\leq 4\exp \bigl \{-C_{\varOmega_3}ns^{p+2+\theta} \bigr\}.$$

Proof of Theorem 3

The result of Theorem 3 follows directly from inequality (34), Propositions 7, 8, 9, 10 and 11. □

Proof of Theorem 4

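For any ϵ>0, we have
$$\operatorname{Prob} \bigl\{\Vert \mathbf{f}_{\mathcal{Z},\widehat{\mathbf{J}}}-\nabla_{\widehat{\mathbf{J}}} f_\rho \Vert _{L^2(\rho_X)}\geq\epsilon \bigr\}\leq\operatorname{Prob}(\widehat{\mathbf{J}}\neq\mathbf{J})+\operatorname{Prob} \bigl\{\Vert \mathbf{f}_{\mathcal{Z},\mathbf{J}}-\nabla_{\mathbf{J}} f_\rho \Vert _{L^2(\rho_X)}\geq\epsilon \bigr\}.$$
Using Proposition 9 in Mukherjee and Zhou (2006), we have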
$$\operatorname{Prob} \bigl\{\Vert \mathbf{f}_{\mathcal{Z},\mathbf {J}}-\nabla_{\mathbf{J}} f_\rho \Vert _{L^2(\rho_X)}\geq\epsilon \bigr \}\leq\exp \bigl(-C_{\rho,K} n^{\frac{\theta}{2(|\mathbf {J}|+2+3\theta)}}\epsilon \bigr),$$
where C ρ,K is a constant independent of n or s. Theorem 3 together with the assumption \(\ln\mathcal{N}(J_{\mathcal{K}}(B_{1}),\frac{\epsilon}{8(\kappa \operatorname{Diam}(X))^{2}})\leq C_{\alpha}(\frac{1}{\epsilon })^{\alpha}\) implies
$$\operatorname{Prob}(\widehat{\mathbf{J}}\neq\mathbf{J})\leq C_1\exp \biggl\{ \biggl(\frac{8(\kappa \operatorname{Diam}(X))^2}{s^{p+2+\theta}} \biggr)^\alpha -C_2ns^{p+4} \biggr\}.$$
Choosing \(s= (\frac{1}{n} )^{\frac{1}{2((p+2+\theta )\alpha+p+4)}}\), the desired result follows. □

4 Algorithm for solving sparse gradient learning

In this section, we describe how to solve the optimization problem in Eq. (8). Our overall strategy is to first reduce the convex functional from an infinite dimensional space to a finite dimensional one by using the reproducing property of the RKHS, and then develop a forward-backward splitting algorithm to solve the reduced finite dimensional problem.

4.1 From infinite dimensional to finite dimensional optimization

Let \(\mathcal{K}:\mathbb{R}^{p}\times \mathbb{R}^{p}\rightarrow\mathbb{R}\) be continuous, symmetric and positive semidefinite, i.e., for any finite set of distinct points \(\{\mathbf{x}_{1},\ldots,\mathbf{x}_{n}\}\subset\mathbb{R}^{p}\), the matrix \([\mathcal{K}(\mathbf{x}_{i},\mathbf{x}_{j}) ]_{i,j=1}^{n}\) is positive semidefinite (Aronszajn 1950). Such a function is called a Mercer kernel. The RKHS \(\mathbb{H}_{\mathcal{K}}\) associated with the Mercer kernel \(\mathcal{K}\) is defined to be the completion of the linear span of the set of functions \(\{\mathcal{K}_{\mathbf{x}}:=\mathcal{K}(\mathbf{x},\cdot):\mathbf{x}\in \mathbb{R}^{p}\}\) with the inner product \(\langle \cdot,\cdot\rangle_{\mathcal{K}}\) satisfying \(\langle \mathcal{K}_{\mathbf{x}},\mathcal{K}_{\mathbf{u}}\rangle_{\mathcal{K}}=\mathcal{K}(\mathbf{x},\mathbf{u})\). The reproducing property of \(\mathbb{H}_{\mathcal{K}}\) states that
$$\langle \mathcal{K}_\mathbf{x},h\rangle_\mathcal{K}=h(\mathbf{x})\quad \forall \mathbf{x}\in\mathbb{R}^p,h\in \mathbb{H}_\mathcal{K}. $$

By the reproducing property (38), we have the following representer theorem, which states that the solution of (8) exists and lies in the finite dimensional space spanned by \(\{\mathcal{K}_{\mathbf{x}_{i}}\}_{i=1}^{n}\). Hence the sparse gradient learning problem in Eq. (8) can be converted into a finite dimensional optimization problem. The proof of the theorem is standard and follows the same lines as in Schölkopf and Smola (2002) and Mukherjee and Zhou (2006).

Theorem 9

Given a data set \(\mathcal{Z}\), the solution of Eq. (8) exists and takes the following form
$$f^j_{\mathcal{Z}}(\mathbf{x})=\sum_{i=1}^nc^j_{i,\mathcal{Z}}\mathcal{K}(\mathbf{x},\mathbf{x}_i), $$
where \(c^{j}_{i,\mathcal{Z}}\in\mathbb{R}\) for j=1,…,p and i=1,…,n.


Proof

The existence follows from the convexity of the functionals \(\mathcal{E}_{\mathcal{Z}}(\mathbf{f})\) and Ω(f). Suppose \(\mathbf{f}_{\mathcal{Z}}\) is a minimizer. We can write \(\mathbf{f}_{\mathcal{Z}}\in\mathbb{H}_{\mathcal{K}}^{p}\) as
$$\mathbf{f}_{\mathcal{Z}}=\mathbf{f}_{\|}+\mathbf{f}_{\bot},$$
where each element of \(\mathbf{f}_{\|}\) is in the span of \(\{\mathcal{K}_{\mathbf{x}_{1}},\ldots,\mathcal{K}_{\mathbf{x}_{n}}\}\) and the elements of \(\mathbf{f}_{\bot}\) are functions in the orthogonal complement. The reproducing property yields \(\mathbf{f}_{\mathcal{Z}}(\mathbf{x}_{i})=\mathbf{f}_{\|}(\mathbf{x}_{i})\) for all \(\mathbf{x}_{i}\). So the functions \(\mathbf{f}_{\bot}\) have no effect on \(\mathcal{E}_{\mathcal{Z}}(\mathbf{f})\). But \(\|\mathbf{f}_{\mathcal{Z}}\|_{\mathcal{K}}=\|\mathbf{f}_{\|}+\mathbf{f}_{\bot}\|_{\mathcal{K}}>\|\mathbf{f}_{\|}\|_{\mathcal{K}}\) unless \(\mathbf{f}_{\bot}=0\). This implies that \(\mathbf{f}_{\mathcal{Z}}=\mathbf{f}_{\|}\), which leads to the representation of \(\mathbf{f}_{\mathcal{Z}}\) in Eq. (39). □
Using Theorem 9, we can reduce the infinite dimensional minimization problem (8) to a finite dimensional one. Define the matrix \(C_{\mathcal{Z}}:=[c^{j}_{i,\mathcal{Z}}]_{j=1,i=1}^{p,n}\in\mathbb{R}^{p\times n}\). The optimization problem in (8) therefore has only p×n degrees of freedom, and is actually an optimization problem in terms of a coefficient matrix \(C:=[c_{i}^{j}]_{j=1,i=1}^{p,n}\in\mathbb{R}^{p\times n}\). Write C in terms of its column vectors as \(C:=(\mathbf{c}_{1},\ldots,\mathbf{c}_{n})\) with \(\mathbf{c}_{i}\in\mathbb{R}^{p}\) for i=1,…,n, and in terms of its row vectors as \(C:=(\mathbf{c}^{1},\ldots,\mathbf{c}^{p})^{T}\) with \(\mathbf{c}^{j}\in\mathbb{R}^{n}\) for j=1,…,p. Let the kernel matrix be \(K:=[\mathcal{K}(\mathbf{x}_{i},\mathbf{x}_{j})]_{i=1,j=1}^{n,n}\in \mathbb{R}^{n\times n}\). After expanding each component \(f^{j}\) of f in (8) as \(f^{j}(\mathbf{x})=\sum_{i=1}^{n}c^{j}_{i}\mathcal{K}(\mathbf{x},\mathbf{x}_{i})\), the objective function in Eq. (8) becomes a function of C as
$$\varPhi(C)=\frac{1}{n^2}\sum_{i,j=1}^n\omega_{i,j}^s\bigl(y_i-y_j+(\mathbf{x}_j-\mathbf{x}_i)^TC\mathbf{k}_i\bigr)^2+\lambda\sum_{j=1}^p\sqrt{\bigl(\mathbf{c}^j\bigr)^TK\mathbf{c}^j},$$
where \(\mathbf{k}_{i}\in\mathbb{R}^{n}\) is the i-th column of K, i.e., \(K=(\mathbf{k}_{1},\ldots,\mathbf{k}_{n})\). Then, by Theorem 9,
$$ C_{\mathcal{Z}}=\arg\min_{C\in\mathbb{R}^{p\times n}}\varPhi(C).$$

4.2 Change of optimization variables

The objective function Φ(C) in the reduced finite dimensional problem is convex but non-smooth. As such, most of the standard convex optimization techniques, such as gradient descent and Newton’s method, cannot be directly applied. We will instead develop a forward-backward splitting algorithm to solve the problem. For this purpose, we first convert the problem into a simpler form by changing the optimization variables.

Note that K is symmetric and positive semidefinite, so its square root \(K^{1/2}\) is also symmetric and positive semidefinite, and can be easily calculated. Denote the i-th column of \(K^{1/2}\) by \(\mathbf{k}_{i}^{1/2}\), i.e., \(K^{1/2}=(\mathbf{k}_{1}^{1/2},\ldots,\mathbf{k}_{n}^{1/2})\). Let \(\widetilde{C}=CK^{1/2}\) and write \(\widetilde{C}=(\tilde{\mathbf{c}}_{1},\ldots,\tilde{\mathbf{c}}_{n})=(\tilde{\mathbf{c}}^{1},\ldots,\tilde{\mathbf{c}}^{p})^{T}\), where \(\tilde{\mathbf{c}}_{i}\) and \(\tilde{\mathbf{c}}^{j}\) are the i-th column vector and j-th row vector respectively. Then Φ(C) in Eq. (40) can be rewritten as a function of \(\widetilde{C}\)
$$ \varPsi(\widetilde{C})=\frac{1}{n^2}\sum _{i,j=1}^n \omega_{i,j}^s\bigl(y_i-y_j+(\mathbf{x}_j-\mathbf{x}_i)^T\widetilde {C}\mathbf{k}_i^{1/2}\bigr)^2+\lambda\sum_{j=1}^p\|\tilde {\mathbf{c}}^j\|_{2},$$
where \(\|\cdot\|_{2}\) is the Euclidean norm of \(\mathbb{R}^{n}\). Thus finding a solution \(C_{\mathcal{Z}}\) of (41) is equivalent to identifying
$$ \widetilde{C}_{\mathcal{Z}}=\arg\min_{\widetilde{C}\in\mathbb {R}^{p\times n}}\varPsi(\widetilde{C}),$$
followed by setting \(C_{\mathcal{Z}}=\widetilde{C}_{\mathcal{Z}}K^{-1/2}\), where \(K^{-1/2}\) is the inverse of \(K^{1/2}\) when K is invertible, and its pseudoinverse otherwise.
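The change of variables can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes a Gaussian kernel and Gaussian weights on random toy data, computes \(K^{1/2}\) by eigendecomposition, and takes Φ(C) to be the objective obtained by substituting \(\widetilde{C}=CK^{1/2}\) back into Ψ, i.e., with data term \((\mathbf{x}_j-\mathbf{x}_i)^TC\mathbf{k}_i\) and penalty \(\sqrt{(\mathbf{c}^j)^TK\mathbf{c}^j}\). The function names `psd_sqrt`, `data_term`, `phi` and `psi` are hypothetical.

```python
import numpy as np

def psd_sqrt(K):
    """Symmetric square root of a symmetric PSD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(K)
    w = np.clip(w, 0.0, None)            # guard against tiny negative round-off
    return (V * np.sqrt(w)) @ V.T

def data_term(A, X, y, W):
    """(1/n^2) sum_ij W[i,j] (y_i - y_j + (x_j - x_i)^T A[:, i])^2."""
    n = X.shape[0]
    G = X @ A                            # G[j, i] = x_j^T A[:, i]
    R = y[:, None] - y[None, :] + G.T - np.diag(G)[:, None]
    return np.sum(W * R**2) / n**2

def phi(C, X, y, W, K, lam):
    """Objective in the original coefficients C (p x n)."""
    quad = np.maximum(np.sum((C @ K) * C, axis=1), 0.0)   # (c^j)^T K c^j
    return data_term(C @ K, X, y, W) + lam * np.sum(np.sqrt(quad))

def psi(C_tilde, X, y, W, K_half, lam):
    """Objective in the transformed coefficients C_tilde = C K^{1/2}."""
    penalty = np.sum(np.linalg.norm(C_tilde, axis=1))
    return data_term(C_tilde @ K_half, X, y, W) + lam * penalty

# Toy data; sizes, bandwidth and lambda are illustrative assumptions.
rng = np.random.default_rng(0)
n, p, s, lam = 8, 5, 1.0, 0.1
X = rng.uniform(size=(n, p))
y = rng.normal(size=n)
sq = np.sum((X[:, None, :] - X[None, :, :])**2, axis=2)
K = np.exp(-sq / (2 * s**2))             # assumed Gaussian kernel matrix
W = np.exp(-sq / (2 * s**2))             # assumed Gaussian weights
K_half = psd_sqrt(K)
C = rng.normal(size=(p, n))
print(phi(C, X, y, W, K, lam), psi(C @ K_half, X, y, W, K_half, lam))
```

The two printed values should agree up to round-off, which is the sense in which minimizing Ψ over \(\widetilde{C}\) is equivalent to minimizing Φ over C.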

Note that the problem we focus on has large p and small n, so the computation of \(K^{\frac{1}{2}}\) is inexpensive as it is an n×n matrix. If n is large, we can still solve (41) by adopting other algorithms such as the one used in Micchelli et al. (2010).

Given the matrix \(\widetilde{C}_{\mathcal{Z}}\), the set of variables selected by sparse gradient learning, as defined in Eq. (20), is simply
$$S=\bigl\{j: \| \tilde{\mathbf{c}}^j\|_2 \ne0, j=1,\ldots,p\bigr\}.$$
Similarly, the S-EDR directions can also be directly derived from \(\widetilde{C}_{\mathcal{Z}}\) by noting that the p×p sparse gradient covariance matrix is equal to
$$\varXi=C_{\mathcal{Z}}KC^T_\mathcal{Z}=\widetilde{C}_\mathcal{Z}\widetilde{C}_\mathcal{Z}^T.$$
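As an illustration of how both outputs could be read off a coefficient matrix, here is a small sketch. It assumes the p×n convention for \(\widetilde{C}\) used in this section and takes the gradient covariance matrix to be the p×p matrix with entries \(\langle f^{j},f^{k}\rangle_{\mathcal{K}}\), computed as `C_tilde @ C_tilde.T`. The helper names and the tolerance argument `tol` are hypothetical, not part of the paper.

```python
import numpy as np

def selected_variables(C_tilde, tol=0.0):
    """Indices j whose row norm ||c~^j||_2 exceeds tol (tol=0 is the exact rule)."""
    return np.flatnonzero(np.linalg.norm(C_tilde, axis=1) > tol)

def sedr_directions(C_tilde, d):
    """Top-d eigenpairs of the p x p matrix C_tilde @ C_tilde.T."""
    Xi = C_tilde @ C_tilde.T
    evals, evecs = np.linalg.eigh(Xi)        # eigenvalues in ascending order
    top = np.argsort(evals)[::-1][:d]
    return evals[top], evecs[:, top]

# Toy coefficients with p = 5 variables and n = 3 samples;
# only rows 1 and 3 are nonzero.
C_tilde = np.zeros((5, 3))
C_tilde[1] = [1.0, 2.0, 0.0]
C_tilde[3] = [0.0, 1.0, -1.0]

S = selected_variables(C_tilde)
evals, directions = sedr_directions(C_tilde, d=2)
print(S)
print(np.round(directions, 3))
```

Because the rows of unselected variables are exactly zero, the leading eigenvectors have zero loadings outside the selected set, which is what makes the resulting directions sparse.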

4.3 Forward-backward splitting algorithm

Next we propose a forward-backward splitting algorithm to solve Eq. (43). Forward-backward splitting is commonly used to solve \(\ell_{1}\)-related optimization problems in machine learning (Langford et al. 2009) and image processing (Daubechies et al. 2004; Cai et al. 2008). Our algorithm is derived from the general formulation described in Combettes and Wajs (2005).

We first split the objective function Ψ into a non-smooth term and a smooth term. Let \(\varPsi=\varPsi_{1}+\varPsi_{2}\), where
$$\varPsi_1(\widetilde{C})=\lambda\sum_{i=1}^p\|\tilde{\mathbf {c}}^i\|_{2}\quad\mbox{and}\quad \varPsi_2(\widetilde{C})=\frac{1}{n^2}\sum _{i,j=1}^n \omega_{i,j}^s\bigl(y_i-y_j+(\mathbf{x}_j-\mathbf{x}_i)^T\widetilde {C}\mathbf{k}_i^{1/2}\bigr)^2 .$$
The forward-backward splitting algorithm works by iteratively updating \(\widetilde{C}\). Given a current estimate \(\widetilde{C}^{(k)}\), the next one is updated according to
$$ \widetilde{C}^{(k+1)}=\mathrm{prox}_{\delta\varPsi_1}\bigl(\widetilde {C}^{(k)}-\delta\nabla\varPsi_2\bigl(\widetilde{C}^{(k)}\bigr)\bigr),$$
where δ>0 is the step size, and \(\mathrm{prox}_{\delta\varPsi _{1}}\) is a proximity operator defined by
$$ \mathrm{prox}_{\delta\varPsi_1}(D)=\arg\min_{\widetilde{C}\in \mathbb{R}^{p\times n}} \frac{1}{2}\|D-\widetilde{C}\|_F^2+\delta\varPsi_1(\widetilde{C}),$$
where \(\|\cdot\|_{F}\) is the Frobenius norm of \(\mathbb{R}^{p\times n}\).
To implement the algorithm (46), we need to know both \(\nabla\varPsi_{2}\) and \(\mathrm{prox}_{\delta\varPsi_{1}}(\cdot)\). The term \(\nabla\varPsi_{2}\) is relatively easy to obtain,
$$ \nabla\varPsi_2(\widetilde{C}) =\frac{2}{n^2}\sum_{i,j=1}^n\omega_{i,j}^s\bigl(y_i-y_j+(\mathbf {x}_j-\mathbf{x}_i)^T \widetilde{C}\mathbf{k}_i^{1/2}\bigr) (\mathbf{x}_j-\mathbf {x}_i) \bigl(\mathbf{k}_i^{1/2}\bigr)^T.$$
The proximity operator \(\mathrm{prox}_{\delta\varPsi_{1}}\) is given in the following lemma.

Lemma 7

Let \(T_{\lambda\delta}(D)=\mathrm{prox}_{\delta\varPsi_{1}}(D)\), where \(D=(\mathbf{d}^{1},\ldots,\mathbf{d}^{p})^{T}\) with \(\mathbf{d}^{j}\) being the j-th row vector of D. Then
$$ T_{\lambda\delta}(D)= \bigl(t_{\lambda\delta}\bigl(\mathbf{d}^1\bigr),\ldots ,t_{\lambda\delta}\bigl(\mathbf{d}^p\bigr) \bigr)^T,$$
where
$$ t_{\lambda\delta}\bigl(\mathbf{d}^j\bigr)=\begin{cases}\mathbf{0},&\mbox{\textit{if}}\ \|\mathbf{d}^j\|_2\leq \lambda\delta,\\\frac{\|\mathbf{d}^j\|_2-\lambda\delta}{\|\mathbf{d}^j\|_2}\mathbf {d}^j,&\mbox{\textit{if}}\ \|\mathbf{d}^j\|_2>\lambda\delta.\end{cases} $$


Proof

From (47), one can easily see that the row vectors \(\tilde{\mathbf{c}}^{j}\), j=1,…,p, of \(\widetilde{C}\) are independent of each other. Therefore, we have
$$ t_{\lambda\delta}\bigl(\mathbf{d}^j\bigr)=\arg \min_{\mathbf{c}\in\mathbb {R}^{n}}\frac{1}{2}\|\mathbf{d}^j-\mathbf{c}\|_2^2+\lambda\delta\| \mathbf{c}\|_2.$$
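The energy function in the above minimization problem is strongly convex, hence has a unique minimizer. Therefore, by the subdifferential calculus (cf. Hiriart-Urruty and Lemaréchal 1993), \(t_{\lambda\delta}(\mathbf{d}^{j})\) is the unique solution of the following equation with unknown \(\mathbf{c}\)
$$ \mathbf{0}\in\mathbf{c}-\mathbf{d}^j+\lambda\delta\partial(\| \mathbf{c}\|_2),$$
where
$$\partial(\|\mathbf{c}\|_2)=\bigl\{\mathbf{p}:\mathbf{p}\in\mathbb {R}^{n};~\|\mathbf{u}\|_2-\|\mathbf{c}\|_2 -(\mathbf{u}-\mathbf{c})^T\mathbf{p}\geq0,~\forall \mathbf{u}\in \mathbb{R}^{n}\bigr\}$$
is the subdifferential of the function \(\|\mathbf{c}\|_{2}\). If \(\|\mathbf{c}\|_{2}>0\), the function \(\|\mathbf{c}\|_{2}\) is differentiable, and its subdifferential contains only its gradient, i.e., \(\partial(\|\mathbf{c}\|_{2})=\{\frac{\mathbf{c}}{\|\mathbf{c}\|_{2}}\}\). If \(\|\mathbf{c}\|_{2}=0\), then \(\partial(\|\mathbf{c}\|_{2})=\{\mathbf{p}:\mathbf{p}\in\mathbb{R}^{n};~\|\mathbf{u}\|_{2}-\mathbf{u}^{T}\mathbf{p}\geq0,~\forall\mathbf{u}\in\mathbb{R}^{n}\}\). One can check that \(\partial(\|\mathbf{c}\|_{2})=\{\mathbf{p}:\mathbf{p}\in\mathbb{R}^{n};~\|\mathbf{p}\|_{2}\leq1\}\) for this case. Indeed, for any vector \(\mathbf{p}\in\mathbb{R}^{n}\) with \(\|\mathbf{p}\|_{2}\leq1\), \(\|\mathbf{u}\|_{2}-\mathbf{u}^{T}\mathbf{p}\geq0\) by the Cauchy-Schwarz inequality. On the other hand, if there is an element \(\mathbf{p}\) of \(\partial(\|\mathbf{c}\|_{2})\) such that \(\|\mathbf{p}\|_{2}>1\), then, by setting \(\mathbf{u}=\mathbf{p}\), we get \(\|\mathbf{p}\|_{2}-\mathbf{p}^{T}\mathbf{p}=\|\mathbf{p}\|_{2}(1-\|\mathbf{p}\|_{2})<0\), which contradicts the definition of \(\partial(\|\mathbf{c}\|_{2})\). In summary,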
$$ \partial(\|\mathbf{c}\|_2)= \begin{cases}\{\frac{\mathbf{c}}{\|\mathbf{c}\|_2}\},&\mbox{if}\ \|\mathbf{c}\|_2>0,\\\{\mathbf{p}: \mathbf{p}\in\mathbb{R}^{n};\|\mathbf{p}\|_2\leq1\},&\mbox{if}\ \|\mathbf{c}\|_2=0.\end{cases}$$
With (53), we see that \(t_{\lambda\delta}(\mathbf{d}^{j})\) in (50) is a solution of (52); hence (49) is verified. □
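The row-wise thresholding of Lemma 7 is short to write down. The sketch below is illustrative only; the function name `row_soft_threshold` is hypothetical, and `tau` stands for the product λδ.

```python
import numpy as np

def row_soft_threshold(D, tau):
    """Apply t_tau to every row of D: shrink the row norm by tau, or zero the row."""
    D = np.asarray(D, dtype=float)
    norms = np.linalg.norm(D, axis=1)
    scale = np.zeros_like(norms)
    keep = norms > tau                   # rows that survive the threshold
    scale[keep] = (norms[keep] - tau) / norms[keep]
    return scale[:, None] * D

# Rows with norms 5, 0.5 and 0; with tau = 1 only the first row survives.
D = np.array([[3.0, 4.0], [0.3, 0.4], [0.0, 0.0]])
T = row_soft_threshold(D, tau=1.0)
print(T)
```

Each surviving row keeps its direction and loses exactly `tau` from its norm, whereas the usual entry-wise soft-thresholding would shrink every coordinate separately.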
Now, we obtain the following forward-backward splitting algorithm to find the optimal \(\widetilde{C}\) in Eq. (43). After choosing a random initialization, we update \(\widetilde{C}\) iteratively until convergence according to
$$ \begin{cases}D^{(k+1)}=\widetilde{C}^{(k)}-\frac{2\delta}{n^2}\sum_{i,j=1}^n\omega_{i,j}^s(y_i-y_j+(\mathbf{x}_j-\mathbf{x}_i)^T\widetilde{C}^{(k)}\mathbf {k}_i^{1/2} )(\mathbf{x}_j-\mathbf{x}_i)(\mathbf{k}_i^{1/2})^T,\\\widetilde{C}^{(k+1)}=T_{\lambda\delta}(D^{(k+1)}).\end{cases} $$

The iteration alternates between two steps: (1) an empirical error minimization step, which minimizes the empirical error \(\mathcal{E}_{\mathcal{Z}}(\mathbf{f})\) along gradient descent directions; and (2) a variable selection step, implemented by the proximity operator \(T_{\lambda\delta}\) defined in (49). If the norm of the j-th row of \(D^{(k)}\), or correspondingly the norm \(\|f^{j}\|_{\mathcal{K}}\) of the j-th partial derivative, is smaller than the threshold λδ, the j-th row of \(D^{(k)}\) will be set to 0, i.e., the j-th variable is not selected. Otherwise, the j-th row of \(D^{(k)}\) keeps its direction, and its norm is reduced by the threshold λδ.

Since \(\varPsi_{2}(\widetilde{C})\) is a quadratic function of the entries of \(\widetilde{C}\), the operator norm of its Hessian \(\|\nabla^{2}\varPsi_{2}\|\) is a constant. Furthermore, since the function Ψ is coercive, i.e., \(\|\widetilde{C}\|_{F}\to\infty\) implies that \(\varPsi(\widetilde{C})\to\infty\), there exists at least one solution of (43). By applying the convergence theory for the forward-backward splitting algorithm in Combettes and Wajs (2005), we obtain the following theorem.

Theorem 10

If \(0<\delta<\frac{2}{\|\nabla^{2}\varPsi_{2}\|}\), then the iteration (54) is guaranteed to converge to a solution of Eq. (43) for any initialization \(\widetilde{C}^{(0)}\).

The regularization parameter λ controls the sparsity of the optimal solution. When λ=0, no sparsity constraint is imposed, and all variables will be selected. On the other extreme, when λ is sufficiently large, the optimal solution will be \(\tilde{C}=0\), and correspondingly none of the variables will be selected. The following theorem provides an upper bound of λ above which no variables will be selected. In practice, we choose λ to be a number between 0 and the upper bound usually through cross-validation.

Theorem 11

Consider the sparse gradient learning in Eq. (43). Let
$$\lambda_{\max} = \max_{1\le k\le p} \frac{2}{n^2} \Biggl \Vert \sum_{i,j=1}^n\omega_{i,j}^s(y_i-y_j)\bigl(x_i^k-x_j^k\bigr)\mathbf{k}_i^{1/2} \Biggr \Vert _2$$
Then the optimal solution is \(\tilde{C}=0\) for all \(\lambda\geq\lambda_{\max}\), that is, none of the variables will be selected.


Proof

Obviously, if λ=∞, the minimizer of Eq. (42) is a p×n zero matrix.

When λ<∞, the minimizer of Eq. (42) could also be a p×n zero matrix as long as λ is large enough. Indeed, from iteration (54), if we choose \(\widetilde{C}^{(0)}=\mathbf{0}\), then
$$D^{(1)}=-\frac{2\delta}{n^2}\sum_{i,j=1}^n\omega_{i,j}^s(y_i-y_j) (\mathbf{x}_j-\mathbf{x}_i) \bigl(\mathbf{k}_i^{\frac{1}{2}}\bigr)^T$$
and \(\widetilde{C}^{(1)}=T_{\lambda\delta}(D^{(1)})\). Let
$$\lambda_{\max}=\max_{1\leq k\leq p}\frac{2}{n^2}\Biggl \Vert \sum _{i,j=1}^n\omega_{i,j}^s(y_i-y_j)\bigl(\mathbf{x}_j^k-\mathbf {x}_i^k\bigr) \bigl(\mathbf{k}_i^{\frac{1}{2}}\bigr)^T\Biggr \Vert _2.$$
Then for any \(\lambda\geq\lambda_{\max}\), we have \(\widetilde {C}^{(1)}=\mathbf{0}_{p\times n}\) by the definition of \(T_{\lambda\delta}\). By induction, \(\widetilde{C}^{(k)}=\mathbf{0}_{p\times n}\) for all k, and the algorithm converges to \(\widetilde{C}^{(\infty)}=\mathbf{0}_{p\times n}\), which is a minimizer of Eq. (42) when \(0<\delta<\frac{2}{\|\nabla^{2} \varPsi_{2}\|}\). We get the desired result. □
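The bound is simply the largest row norm of the gradient of the data-fitting term at \(\widetilde{C}=\mathbf{0}\), so it can be computed with a few matrix products. The sketch below is a hedged illustration on random toy data with an assumed Gaussian kernel and Gaussian weights. All function names, the bandwidth `s` and the step size `delta` are assumptions made for the example.

```python
import numpy as np

def psd_sqrt(K):
    """Symmetric square root of a symmetric PSD matrix."""
    w, V = np.linalg.eigh(K)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def grad_at_zero(X, y, W, K_half):
    """Gradient of the data-fitting term at C_tilde = 0 (a p x n matrix)."""
    n = X.shape[0]
    R = W * (y[:, None] - y[None, :])            # R[i, j] = w_ij (y_i - y_j)
    M = X.T @ R.T - X.T * R.sum(axis=1)          # column i: sum_j R[i, j] (x_j - x_i)
    return (2.0 / n**2) * (M @ K_half)

def lambda_max(X, y, W, K_half):
    """Largest row norm of the gradient at zero."""
    return np.max(np.linalg.norm(grad_at_zero(X, y, W, K_half), axis=1))

def row_soft_threshold(D, tau):
    norms = np.linalg.norm(D, axis=1)
    scale = np.zeros_like(norms)
    keep = norms > tau
    scale[keep] = (norms[keep] - tau) / norms[keep]
    return scale[:, None] * D

rng = np.random.default_rng(0)
n, p, s, delta = 15, 40, 2.0, 0.1                # illustrative values
X = rng.uniform(size=(n, p))
y = np.sin(X[:, 0]) + X[:, 1]
sq = np.sum((X[:, None, :] - X[None, :, :])**2, axis=2)
W = K = np.exp(-sq / (2 * s**2))
K_half = psd_sqrt(K)

lam_max = lambda_max(X, y, W, K_half)
D1 = -delta * grad_at_zero(X, y, W, K_half)      # first forward step from zero
above = row_soft_threshold(D1, 1.001 * lam_max * delta)
below = row_soft_threshold(D1, 0.5 * lam_max * delta)
print(lam_max, np.count_nonzero(np.linalg.norm(above, axis=1)),
      np.count_nonzero(np.linalg.norm(below, axis=1)))
```

Slightly above the bound the first iterate is thresholded to zero, while at half the bound at least one row survives. In practice this gives a natural upper end for a grid of λ values.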

Remark 6

In the proof of Theorem 11, we choose \(\widetilde{C}^{(0)}=\mathbf{0}_{p\times n}\) as the initial value of iteration (54) for simplicity. Actually, our argument is true for any initial value as long as \(0<\delta<\frac {2}{\|\nabla^{2} \varPsi_{2}\|}\), since the algorithm converges to the minimizer of Eq. (42) when \(0<\delta<\frac{2}{\|\nabla^{2}\varPsi_{2}\|}\). Note that the convergence is independent of the choice of the initial value.

This is not the first time an iterative algorithm has been combined with a thresholding step to derive sparse solutions (see, e.g., Daubechies et al. 2004). However, unlike the previous work, the sparsity we focus on here is a block sparsity, that is, the row vectors of C (corresponding to the partial derivatives \(f^{j}\)) are zero or nonzero vector-wise. As such, the thresholding step in (49) is performed row-vector-wise, not entry-wise as in the usual soft-thresholding operator (Donoho 1995).

4.4 Matrix size reduction

The iteration in Eq. (54) involves a weighted summation of \(n^{2}\) matrices of size p×n, each of the form \((\mathbf{x}_{j}-\mathbf{x}_{i})(\mathbf{k}_{i}^{1/2})^{T}\). When the dimension of the data is large, these matrices are big, and could greatly reduce the efficiency of the algorithm. However, if the number of samples is small, that is, when n≪p, we can improve the efficiency of the algorithm by introducing a transformation to reduce the size of these matrices.

The main motivation is to note that the matrix
$$\mathbf{M}_{\mathbf{x}}:=(\mathbf{x}_1-\mathbf{x}_n,\mathbf {x}_2-\mathbf{x}_n,\ldots,\mathbf{x}_{n-1}-\mathbf{x}_n, \mathbf{x}_n-\mathbf{x}_n)\in \mathbb{R}^{p\times n}$$
is of low rank when n is small. Suppose the rank of \(\mathbf{M}_{\mathbf{x}}\) is t, which is no higher than min(n-1,p).
We apply the economy-size singular value decomposition to the matrix \(\mathbf{M}_{\mathbf{x}}\). That is, \(\mathbf{M}_{\mathbf{x}}=U\varSigma V^{T}\), where U is a p×n matrix with orthonormal columns, V is an n×n unitary matrix, and \(\varSigma=\operatorname {diag}(\sigma_{1},\ldots,\sigma_{t},0,\ldots,0)\in\mathbb{R}^{n\times n}\). Let \(\beta=\varSigma V^{T}\), then
$$ \mathbf{M}_{\mathbf{x}}=U\beta.$$
Denote \(\beta=(\beta_{1},\ldots,\beta_{n})\). Then \(\mathbf{x}_{j}-\mathbf{x}_{i}=U(\beta_{j}-\beta_{i})\). Using these notations, Eq. (54) is equivalent to
$$ \begin{cases}D^{(k+1)}=\widetilde{C}^{(k)}-\frac{2\delta U}{n^2}\sum_{i,j=1}^n\omega_{i,j}^s(y_i-y_j+(\mathbf{x}_j-\mathbf{x}_i)^T\widetilde{C}^{(k)}\mathbf{k}_i^{1/2})(\beta_j-\beta_i)(\mathbf{k}_i^{1/2})^T,\\\widetilde{C}^{(k+1)}=T_{\lambda\delta}(D^{(k+1)}).\end{cases} $$
Note that the second term on the right-hand side of the first equation in (57) now involves the summation of \(n^{2}\) matrices of size n×n rather than p×n. More specifically, \(\omega_{i,j}^{s}(y_{i}-y_{j}+(\mathbf{x}_{j}-\mathbf{x}_{i})^{T}\widetilde{C}^{(k)}\mathbf{k}_{i}^{\frac{1}{2}})\) is a scalar in both (54) and (57). So the first equation in (54) involves the summation of \(n^{2}\) matrices \((\mathbf{x}_{j}-\mathbf{x}_{i})(\mathbf{k}_{i}^{\frac{1}{2}})^{T}\), each of size p×n, while the matrices in (57) are \((\beta_{j}-\beta_{i})(\mathbf{k}_{i}^{\frac{1}{2}})^{T}\), each of size n×n. Furthermore, we calculate the first equation of (57) in two steps: (1) we calculate \(y_{i}-y_{j}+(\mathbf{x}_{j}-\mathbf{x}_{i})^{T}\widetilde{C}^{(k)}\mathbf{k}_{i}^{1/2}\) and store it in an n×n matrix r; (2) we calculate the first equation of (57) using the values r(i,j). These two strategies greatly improve the efficiency of the algorithm when p≫n. More specifically, we reduce the complexity of the update for \(D^{(k)}\) in Eq. (54) from \(O(n^{3}p)\) to \(O(n^{2}p+n^{4})\). A detailed implementation of the algorithm is shown in Algorithm 1.
Algorithm 1

Forward-backward splitting algorithm to solve sparse gradient learning for regression
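The following is a minimal NumPy sketch of the iteration described in this section, combining the gradient step, the row-wise thresholding and the size reduction \(\mathbf{x}_{j}-\mathbf{x}_{i}=U(\beta_{j}-\beta_{i})\). It is not a transcription of Algorithm 1. Several choices are assumptions made for the example: the sums over (i,j) are vectorized as matrix products, the step size is set from a power-iteration estimate of \(\|\nabla^{2}\varPsi_{2}\|\), the iteration starts from zero rather than a random matrix, and a fixed number of iterations replaces a convergence test. All function and parameter names are hypothetical.

```python
import numpy as np

def psd_sqrt(K):
    w, V = np.linalg.eigh(K)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def row_soft_threshold(D, tau):
    norms = np.linalg.norm(D, axis=1)
    scale = np.zeros_like(norms)
    keep = norms > tau
    scale[keep] = (norms[keep] - tau) / norms[keep]
    return scale[:, None] * D

def sgl_regression(X, y, W, K, lam, n_iter=300, power_iters=50):
    """Forward-backward splitting sketch; returns C_tilde of shape (p, n)."""
    n, p = X.shape
    K_half = psd_sqrt(K)
    # Economy SVD of M_x = (x_1 - x_n, ..., x_n - x_n), so that M_x = U beta.
    U, sig, Vt = np.linalg.svd((X - X[-1]).T, full_matrices=False)
    B = sig[:, None] * Vt                        # beta = Sigma V^T, columns beta_i

    def grad(C_tilde):
        A = (U.T @ C_tilde) @ K_half             # reduced coefficients
        G = B.T @ A                              # G[j, i] = beta_j^T A[:, i]
        R = W * (y[:, None] - y[None, :] + G.T - np.diag(G)[:, None])
        M = B @ R.T - B * R.sum(axis=1)          # column i: sum_j R[i,j] (beta_j - beta_i)
        return (2.0 / n**2) * (U @ (M @ K_half))

    # grad is affine in C_tilde, so grad(V) - grad(0) is a Hessian-vector product.
    g0 = grad(np.zeros((p, n)))
    V = np.random.default_rng(0).normal(size=(p, n))
    V /= np.linalg.norm(V)
    L = 0.0
    for _ in range(power_iters):
        HV = grad(V) - g0
        L = np.linalg.norm(HV)
        if L == 0.0:
            break
        V = HV / L
    delta = 1.0 / L if L > 0.0 else 1.0          # intended to satisfy delta < 2 / ||Hessian||

    C_tilde = np.zeros((p, n))
    for _ in range(n_iter):
        D = C_tilde - delta * grad(C_tilde)              # forward (gradient) step
        C_tilde = row_soft_threshold(D, lam * delta)     # backward (proximal) step
    return C_tilde

# Toy problem: the response depends on the first variable only.
rng = np.random.default_rng(0)
n, p = 30, 50
X = rng.uniform(size=(n, p))
y = 3.0 * X[:, 0] + 0.01 * rng.normal(size=n)
sq = np.sum((X[:, None, :] - X[None, :, :])**2, axis=2)
s = 0.5 * np.median(np.sqrt(sq[np.triu_indices(n, k=1)]))
W = np.exp(-sq / (2 * s**2))
K = 1.0 + X @ X.T                                # linear kernel
C_hat = sgl_regression(X, y, W, K, lam=0.05)
print(np.flatnonzero(np.linalg.norm(C_hat, axis=1) > 0))
```

Which variables end up selected depends on λ and the bandwidth, so the printed index set should be read as an illustration of the output format rather than a reproduction of the reported experiments.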

Remark 7

Each update in Eq. (54) involves the summation of \(n^{2}\) terms, which could be inefficient for datasets with a large number of samples. A strategy to reduce the number of computations is to use a truncated weight function, e.g.,
$$ \omega_{ij}^s=\left \{\begin{array}{l@{\quad}l}\exp(-\frac{2\|\mathbf{x}_i-\mathbf{x}_j\|^2}{s^2}),&\mathbf{x}_j\in\mathcal{N}_i^k,\\0,&\mbox{otherwise},\end{array} \right .$$
where \(\mathcal{N}_{i}^{k}=\{\mathbf{x}_{j}:\mathbf{x}_{j} \ \mbox{is\ in\ the\ $k$\ nearest\ neighborhood\ of\ }\mathbf{x}_{i}\}\). This can reduce the number of summations from \(n^{2}\) to kn.
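A truncated weight matrix of this kind can be built as follows. This is a sketch under stated assumptions: neighbours are found by brute-force pairwise distances, and a point is not counted among its own neighbours. The latter choice does not change the objective, because the i=j term vanishes. The function name is hypothetical.

```python
import numpy as np

def truncated_gaussian_weights(X, s, k):
    """W[i, j] = exp(-2 ||x_i - x_j||^2 / s^2) if x_j is one of the k nearest
    neighbours of x_i, and 0 otherwise."""
    n = X.shape[0]
    sq = np.sum((X[:, None, :] - X[None, :, :])**2, axis=2)
    masked = sq.copy()
    np.fill_diagonal(masked, np.inf)             # assumption: x_i is not its own neighbour
    nbrs = np.argsort(masked, axis=1)[:, :k]     # k nearest neighbours of each x_i
    rows = np.repeat(np.arange(n), k)
    cols = nbrs.ravel()
    W = np.zeros((n, n))
    W[rows, cols] = np.exp(-2.0 * sq[rows, cols] / s**2)
    return W

rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 10))
W = truncated_gaussian_weights(X, s=1.0, k=10)
print(np.count_nonzero(W), "nonzero weights out of", W.size)
```

The resulting matrix is generally not symmetric, since being among the k nearest neighbours is not a symmetric relation. For large n a sparse matrix format and a tree-based neighbour search would be the natural next step.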

5 Sparse gradient learning for classification

In this section, we extend the sparse gradient learning algorithm from regression to classification problems. We will also briefly introduce an implementation.

5.1 Defining objective function

Let \(\mathbf{x}\) and y∈{-1,1} be respectively \(\mathbb{R}^{p}\)-valued and binary random variables. The problem of classification is to estimate a classification function \(f_{C}(\mathbf{x})\) from a set of observations \(\mathcal{Z}:=\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{n}\), where \(\mathbf{x}_{i}:=(x_{i}^{1},\ldots,x_{i}^{p})^{T}\in\mathbb{R}^{p}\) is an input, and \(y_{i}\in\{-1,1\}\) is the corresponding output. A real valued function \(f_{\rho}^{\phi}:X\mapsto\mathbb{R}\) can be used to generate a classifier \(f_{C}(\mathbf{x})=\operatorname{sgn}(f_{\rho}^{\phi}(\mathbf{x}))\), where
$$\operatorname{sgn}\bigl(f_\rho^\phi(\mathbf{x})\bigr)=\left \{ \begin{array}{l@{\quad}l}1,&\mbox{if\ } f_\rho^\phi(\mathbf{x})>0,\\-1,&\mbox {otherwise}.\end{array} \right .$$

Similar to regression, we also define an objective function, including a data fitting term and a regularization term, to learn the gradient of \(f_{\rho}^{\phi}\). For classical binary classification, we commonly use the convex loss function \(\phi(t)=\log(1+e^{-t})\) to learn \(f^{\phi}_{\rho}\) and define the data fitting term to be \(\frac{1}{n}\sum_{i=1}^{n}\phi (y_{i} f_{\rho}^{\phi}(\mathbf{x}_{i}))\). The use of the loss function ϕ(t) is mainly motivated by the fact that the optimal \(f_{\rho}^{\phi}(\mathbf{x})=\log [P(y=1|\mathbf{x})/P(y=-1|\mathbf{x})]\), representing the log odds ratio between the two posterior probabilities. Note that the gradient of \(f_{\rho}^{\phi}\) exists under very mild conditions.

As in the case of regression, we use the first order Taylor expansion to approximate the classification function \(f_{\rho}^{\phi}\) by \(f_{\rho}^{\phi}(\mathbf{x})\approx f_{\rho}^{\phi}(\mathbf{x}_{0})+\nabla f_{\rho}^{\phi}(\mathbf{x}_{0})\cdot(\mathbf{x}-\mathbf{x}_{0})\). When \(\mathbf{x}_{j}\) is close to \(\mathbf{x}_{i}\), \(f_{\rho}^{\phi}(\mathbf{x}_{j})\approx f^{0}(\mathbf{x}_{i})+\mathbf {f}(\mathbf{x}_{i})\cdot(\mathbf{x}_{j}-\mathbf{x}_{i})\), where \(\mathbf{f}:=(f^{1},\ldots,f^{p})\) with \(f^{j}=\partial f_{\rho}^{\phi}/\partial x^{j}\) for j=1,…,p, and \(f^{0}\) is a new function introduced to approximate \(f_{\rho}^{\phi}(\mathbf{x}_{i})\). The introduction of \(f^{0}\) is unavoidable since \(y_{i}\) takes the values -1 or 1 and is not a good approximation of \(f_{\rho}^{\phi}\) at all. After considering the Taylor expansion between all pairs of samples, we define the following empirical error term for classification
$$ \mathcal{E}^\phi_\mathcal{Z}\bigl(f^0,\mathbf{f}\bigr):=\frac{1}{n^2}\sum _{i,j=1}^n\omega_{i,j}^s\phi \bigl(y_j\bigl(f^0(\mathbf{x}_i)+\mathbf{f}(\mathbf{x}_i) \cdot(\mathbf{x}_j-\mathbf{x}_i)\bigr)\bigr),$$
where \(\omega_{i,j}^{s}\) is the weight function as in (5).
For the regularization term, we introduce
$$\varOmega\bigl(f^0,\mathbf{f}\bigr)=\lambda_1\|f^0\|_\mathcal{K}^2+\lambda_2\sum _{i=1}^p\|f^i\|_\mathcal{K}.$$
Compared with the regularization term for regression, we have included an extra term \(\lambda_{1}\|f^{0}\|_{\mathcal{K}}^{2}\) to control the smoothness of the function \(f^{0}\). We use two regularization parameters \(\lambda_{1}\) and \(\lambda_{2}\) for the trade-off between \(\|f^{0}\|_{\mathcal{K}}^{2}\) and \(\sum_{i=1}^{p}\|f^{i}\|_{\mathcal{K}}\).
Combining the data fidelity term and regularization term, we formulate the sparse gradient learning for classification as follows
$$ \bigl(f_\mathcal{Z}^\phi,\mathbf{f}_\mathcal{Z}^\phi\bigr)=\arg\min_{(f^0,\mathbf{f})\in\mathbb{H}^{p+1}_\mathcal{K}}\mathcal{E}_\mathcal{Z}^\phi\bigl(f^0,\mathbf{f}\bigr)+\varOmega\bigl(f^0,\mathbf{f}\bigr).$$

5.2 Forward-backward splitting for classification

Using the representer theorem, the minimizer of the infinite dimensional optimization problem in Eq. (61) has the following finite dimensional representation
$$f^\phi_\mathcal{Z}=\sum_{i=1}^n\alpha_{i,\mathcal{Z}}\mathcal {K}(\mathbf{x},\mathbf{x}_i),\qquad \bigl(\mathbf{f}^\phi_\mathcal{Z}\bigr)^j=\sum _{i=1}^nc_{i,\mathcal{Z}}^j\mathcal{K}(\mathbf{x},\mathbf{x}_i)$$
where \(\alpha_{i,\mathcal{Z}},c_{i,\mathcal{Z}}^{j}\in\mathbb{R}\) for i=1,…,n and j=1,…,p.
Then, using the same technique as in the regression setting, the objective functional in the minimization problem (61) can be reformulated as a finite dimensional convex function of the vector \(\alpha=(\alpha_{1},\ldots,\alpha_{n})^{T}\) and the matrix \(\widetilde{C}=(\tilde{\mathbf{c}}_{i}^{j})_{i=1,j=1}^{n,p}\). That is,
$$\varPsi(\alpha,\widetilde{C})=\frac{1}{n^2}\sum _{i,j=1}^n\omega_{i,j}^s\phi \bigl(y_j\bigl(\alpha^T\mathbf{k}_i+(\mathbf{x}_j- \mathbf{x}_i)^T\widetilde{C}\mathbf{k}_i^{\frac{1}{2}}\bigr)\bigr)+\lambda_1\alpha^TK\alpha+\lambda_2\sum _{j=1}^p\|\tilde{\mathbf {c}}^j\|_2.$$
Then the corresponding finite dimensional convex optimization problem
$$ \bigl(\widetilde{\alpha}_\mathcal{Z}^\phi,\widetilde{C}_\mathcal {Z}^\phi\bigr)=\arg\min_{\alpha\in\mathbb{R}^n,\widetilde{C}\in\mathbb{R}^{p\times n}}\varPsi(\alpha,\widetilde{C})$$
can be solved by the forward-backward splitting algorithm.
We split \(\varPsi(\alpha,\widetilde{C})=\varPsi_{1}+\varPsi_{2}\) with
$$\varPsi_1=\lambda_2\sum_{j=1}^p\|\tilde{\mathbf{c}}^j\|_2\quad\mbox{and}\quad\varPsi_2=\frac{1}{n^2}\sum_{i,j=1}^n\omega_{i,j}^s\phi(y_j(\alpha^T\mathbf {k}_i+(\mathbf{x}_j-\mathbf{x}_i)^T\widetilde{C}\mathbf{k}_i^{\frac{1}{2}}))+\lambda_1\alpha^TK\alpha.$$
Then the forward-backward splitting algorithm for solving (62) becomes
$$ \begin{cases}\alpha^{(k+1)}=\alpha^{(k)}-\delta \Bigl(\frac{1}{n^2}\sum_{i,j=1}^n\frac{-\omega_{i,j}^sy_j\mathbf{k}_i}{1+\exp(y_j((\alpha^{(k)})^T\mathbf{k}_i+(\mathbf{x}_j-\mathbf{x}_i)^T\widetilde{C}^{(k)}\mathbf{k}_i^{\frac {1}{2}}))}+2\lambda_1K\alpha^{(k)} \Bigr),\\[12pt]D^{(k+1)}=\widetilde{C}^{(k)}-\frac{\delta U}{n^2}\sum_{i,j=1}^n\frac{-\omega_{i,j}^sy_j(\beta_j-\beta_i)(\mathbf{k}_i^{1/2})^T}{1+\exp(y_j((\alpha^{(k)})^T\mathbf{k}_i+(\mathbf{x}_j-\mathbf{x}_i)^T\widetilde{C}^{(k)}\mathbf{k}_i^{\frac{1}{2}}))},\\\widetilde{C}^{(k+1)}=T_{\lambda_2\delta}(D^{(k+1)}),\end{cases} $$
where U and β satisfy Eq. (56), with U being a p×n matrix with orthonormal columns.
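The two gradients that drive these updates can be written compactly and checked against finite differences. The sketch below is illustrative: it works with \(\mathbf{x}_{j}-\mathbf{x}_{i}\) directly instead of the reduced form \(U(\beta_{j}-\beta_{i})\), uses an assumed Gaussian kernel and Gaussian weights on random toy data, and evaluates \(1/(1+e^{z})\) through `tanh` for numerical stability. All names are hypothetical.

```python
import numpy as np

def psd_sqrt(K):
    w, V = np.linalg.eigh(K)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def linear_part(alpha, C_tilde, X, K, K_half):
    """T[i, j] = alpha^T k_i + (x_j - x_i)^T C_tilde k_i^{1/2}."""
    G = X @ (C_tilde @ K_half)                   # G[j, i] = x_j^T C_tilde k_i^{1/2}
    return (K @ alpha)[:, None] + G.T - np.diag(G)[:, None]

def psi2(alpha, C_tilde, X, y, W, K, K_half, lam1):
    """Smooth part: weighted logistic loss plus lam1 * alpha^T K alpha."""
    n = X.shape[0]
    Z = y[None, :] * linear_part(alpha, C_tilde, X, K, K_half)
    return np.sum(W * np.logaddexp(0.0, -Z)) / n**2 + lam1 * alpha @ K @ alpha

def grad_psi2(alpha, C_tilde, X, y, W, K, K_half, lam1):
    """Gradients of psi2 with respect to alpha (n,) and C_tilde (p, n)."""
    n = X.shape[0]
    Z = y[None, :] * linear_part(alpha, C_tilde, X, K, K_half)
    S = -W * y[None, :] * 0.5 * (1.0 - np.tanh(0.5 * Z))   # -w_ij y_j / (1 + exp(Z_ij))
    g_alpha = K @ S.sum(axis=1) / n**2 + 2.0 * lam1 * (K @ alpha)
    M = X.T @ S.T - X.T * S.sum(axis=1)          # column i: sum_j S[i, j] (x_j - x_i)
    g_C = (M @ K_half) / n**2
    return g_alpha, g_C

rng = np.random.default_rng(0)
n, p, s, lam1 = 10, 6, 1.5, 0.1                  # illustrative values
X = rng.normal(size=(n, p))
y = np.where(rng.uniform(size=n) < 0.5, -1.0, 1.0)
sq = np.sum((X[:, None, :] - X[None, :, :])**2, axis=2)
W = K = np.exp(-sq / (2 * s**2))
K_half = psd_sqrt(K)
alpha = rng.normal(size=n)
C_tilde = rng.normal(size=(p, n))
g_alpha, g_C = grad_psi2(alpha, C_tilde, X, y, W, K, K_half, lam1)
print(np.linalg.norm(g_alpha), np.linalg.norm(g_C))
```

A full iteration would take a gradient step in both α and \(\widetilde{C}\) and then apply the row-wise thresholding with threshold \(\lambda_{2}\delta\) to the \(\widetilde{C}\) block only, since α is penalized by the smooth quadratic term.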

With the derived \(\widetilde{C}_{\mathcal{Z}}^{\phi}\), we can do variable selection and dimension reduction as done for the regression setting. We omit the details here.

6 Examples

Next we illustrate the effectiveness of variable selection and dimension reduction by sparse gradient learning algorithm (SGL) on both artificial datasets and a gene expression dataset. As our method is a kernel-based method, known to be effective for nonlinear problems, we focus our experiments on nonlinear settings for the artificial datasets, although the method can be equally well applied to linear problems.

Before we report the detailed results, we would like to mention that our forward-backward splitting algorithm is very efficient for solving the sparse gradient learning problem. For the simulation studies, it takes only a few minutes to obtain the results to be described next. For the gene expression data involving 7129 variables, it takes less than two minutes to learn the optimal gradient functions on an Intel Core 2 Duo desktop PC (E7500, 2.93 GHz).

6.1 Simulated data for regression

In this example, we illustrate the utility of sparse gradient learning for variable selection by comparing it to the popular variable selection method LASSO. We pointed out in Sect. 2 that LASSO, assuming the prediction function is linear, can be viewed as a special case of sparse gradient learning. Because sparse gradient learning makes no assumption on the linearity of the prediction function, we expect it to be better equipped than LASSO for selecting variables with nonlinear responses.

We simulate 100 observations from the model
$$y=\bigl(2x^{1}-1\bigr)^{2}+x^{2}+x^{3}+x^{4}+x^{5}+\epsilon,$$
where \(x^{i}\), i=1,…,5, are i.i.d. draws from the uniform distribution on [0,1] and ϵ is drawn from a zero-mean normal distribution with variance 0.05. Let \(x^{i}\), i=6,…,10, be five additional noisy variables, which are also i.i.d. draws from the uniform distribution on [0,1]. We assume the observation dataset is given in the form of \(\mathcal{Z}:=\{\mathbf{x}_{i},y_{i}\}_{i=1}^{100}\), where \(\mathbf{x}_{i}=(x_{i}^{1},x_{i}^{2},\ldots,x_{i}^{10})\) and \(y_{i}=(2x_{i}^{1}-1)^{2}+x_{i}^{2}+x_{i}^{3}+x_{i}^{4}+x_{i}^{5}+\epsilon\). It is easy to see that only the first 5 variables contribute to the value of y.

This is a well-known example, pointed out by B. Turlach (2004), that shows the deficiency of LASSO. As the ten variables are uncorrelated, LASSO will select variables based on their correlation with the response variable y. However, because \((2x^{1}-1)^{2}\) is symmetric about the axis \(x^{1}=\frac{1}{2}\) and the variable \(x^{1}\) is drawn from a uniform distribution on [0,1], the correlation between \(x^{1}\) and y is 0. Consequently, \(x^{1}\) will not be selected by LASSO. Because SGL selects variables based on the norm of the gradient functions, it has no such limitation.
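The vanishing correlation is easy to confirm by simulation. The sketch below draws from the model used in this example, but with a much larger sample than the 100 observations of the experiment, so that the sample correlations are stable. The sample size and the random seed are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 100_000                      # far more than the 100 observations in the text
X = rng.uniform(0.0, 1.0, size=(n_samples, 10))
eps = rng.normal(0.0, np.sqrt(0.05), size=n_samples)   # noise with variance 0.05
y = (2.0 * X[:, 0] - 1.0)**2 + X[:, 1:5].sum(axis=1) + eps

# Sample correlation between each variable and the response.
corr = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(10)])
print(np.round(corr, 3))
```

The first entry should be close to zero, like those of the five noise variables, even though the first variable clearly influences y. The four linear variables show a clearly positive correlation.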

To run the SGL algorithm in this example, we use the truncated Gaussian in Eq. (58) with 10 neighbors as our weight function. The bandwidth parameter s is chosen to be half of the median of the pairwise distances of the sampling points. As the gradients of the regression function with respect to different variables are all linear, we choose \(\mathcal{K}(\mathbf{x},\mathbf {y})=1+\mathbf{x}\mathbf{y}\).

Figure 1 shows the variables selected by SGL and LASSO for the same dataset when the regularization parameter varies. Both methods are able to successfully select the four linear variables (i.e., \(x^{2},\ldots,x^{5}\)). However, LASSO failed to select \(x^{1}\) and treated \(x^{1}\) as if it were one of the five noisy variables \(x^{6},\ldots,x^{10}\) (Fig. 1(b)). In contrast, SGL is clearly able to differentiate \(x^{1}\) from the group of five noisy variables (Fig. 1(a)).
Fig. 1

(Color online) Regularization path for SGL and LASSO. Red line represents the variable \(x^{1}\), blue lines represent the variables \(x^{2},x^{3},x^{4},x^{5}\) and green lines represent noisy variables \(x^{6},x^{7},x^{8},x^{9},x^{10}\). (a) \(\mathbb{H}_{\mathcal{K}}\) norm of each partial derivative derived by SGL with respect to the regularization parameter, where the regularization parameter is scaled to be \(-\log_{10}\lambda\). (b) LASSO shrinkage of coefficients with respect to the LASSO parameter t

To summarize how often each variable is selected, we repeat the simulation 100 times. For each simulation, we choose a regularization parameter so that each algorithm returns exactly five variables. Table 1 shows the frequencies of variables \(x^{1},x^{2},\ldots,x^{10}\) selected by SGL and LASSO in 100 repeats. Both methods are able to select the four linear variables, \(x^{2},x^{3},x^{4},x^{5}\), correctly. However, LASSO fails to select \(x^{1}\) and treats it the same as the noisy variables \(x^{6},x^{7},x^{8},x^{9},x^{10}\). This is in contrast to SGL, which correctly selects \(x^{1}\) 78% of the time, much more often than it selects the noisy variables (median 5%). This example illustrates the advantage of SGL for variable selection in nonlinear settings.
Table 1

Frequencies of variables \(x^{1},x^{2},\ldots,x^{10}\) selected by SGL and LASSO in 100 repeats

\(x^{1}\)  \(x^{2}\)  \(x^{3}\)  \(x^{4}\)  \(x^{5}\)  \(x^{6}\)  \(x^{7}\)  \(x^{8}\)  \(x^{9}\)  \(x^{10}\)

6.2 Simulated data for classification

Next we apply SGL to an artificial dataset that has been commonly used to test the efficiency of dimension reduction methods in the literature. We consider a binary classification problem in which the sample data lie in a 200 dimensional space, with only the first 2 dimensions being relevant for classification and the remaining variables being noise. More specifically, we generate 40 samples, half from the +1 class and the other half from the -1 class. For the samples from the +1 class, the first 2 dimensions of the sample data correspond to points drawn uniformly from a 2-dimensional spherical surface with radius 3. The remaining 198 dimensions are noisy variables, each drawn i.i.d. from the Gaussian distribution N(0,σ). That is,
$$ x^j\sim N(0,\sigma),\quad\mbox{for\ } j=3,4,\ldots,200.$$
For the samples from the -1 class, the first 2 dimensions of the sample data correspond to points drawn uniformly from a 2-dimensional spherical surface with radius 3×2.5, and the remaining 198 dimensions are noisy variables, with each variable \(x^{j}\) drawn i.i.d. from N(0,σ) as in (64). Obviously, this data set can be easily separated by a spherical surface if we project the data onto the Euclidean space spanned by the first two dimensions.
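A dataset of this form could be generated as sketched below. Two readings of the description are assumptions here: the "2-dimensional spherical surface" is taken to be a circle in the plane of the first two coordinates, and σ in N(0,σ) is taken to be the standard deviation. The function name and the seed are hypothetical.

```python
import numpy as np

def make_two_circles(n_per_class=20, p=200, sigma=3.0, seed=0):
    """Two concentric circles in the first two coordinates plus Gaussian noise
    in the remaining p - 2 coordinates."""
    rng = np.random.default_rng(seed)
    n = 2 * n_per_class
    labels = np.repeat([1, -1], n_per_class)
    radii = np.repeat([3.0, 3.0 * 2.5], n_per_class)    # radius 3 for +1, 7.5 for -1
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)       # uniform angle on the circle
    X = rng.normal(0.0, sigma, size=(n, p))             # noise in all coordinates
    X[:, 0] = radii * np.cos(theta)                      # overwrite the two
    X[:, 1] = radii * np.sin(theta)                      # relevant coordinates
    return X, labels

X, labels = make_two_circles()
r = np.hypot(X[:, 0], X[:, 1])
print(X.shape, r[labels == 1].max(), r[labels == -1].min())
```

In the first two coordinates the classes are separated by any circle with radius between 3 and 7.5. The remaining 198 coordinates carry no class information.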

In what follows, we illustrate the effectiveness of SGL on this data set for both variable selection and dimension reduction. In implementing SGL, the weight function and the kernel are both chosen to be \(\exp(-\frac{\|\mathbf{x}-\mathbf{u}\|^{2}}{2s^{2}})\), with s being half of the median of the pairwise distances of the sampling points.

We generated several datasets with different noise levels by varying σ from 0.1 to 3. SGL correctly selected \(x^{1}\) and \(x^{2}\) as the important variables in all cases we tested. Furthermore, SGL also generated two S-EDRs that captured the underlying data structure in all these cases (Fig. 2). It is important to emphasize that the two S-EDRs generated by SGL are the only two features the algorithm can possibly obtain, since the derived S-EGCM is supported on a 2×2 matrix. As a result, both of the derived S-EDRs are linear combinations of the first two variables. By contrast, using the gradient learning method (GL) reported in Mukherjee et al. (2010), the first two returned dimension reduction directions (called ESFs) are shown to capture the correct underlying structure only when σ<0.7. In addition, the derived ESFs are linear combinations of all 200 original variables instead of only two variables as in S-EDRs. Figure 2(b), (e) shows the training data and the test data projected on the two derived S-EDRs for a dataset with large noise (σ=3). Compared with the data projected on the first two dimensions (Fig. 2(a), (d)), the derived S-EDRs preserve the structure of the original data. In contrast, the gradient learning algorithm without the sparsity constraint performed much worse (Fig. 2(c), (f)).
Fig. 2

(Color online) Nonlinear classification simulation with σ=3. (a) Training data projected on the first two dimensions. (b) Training data projected on two S-EDRs derived by SGL. (c) Training data projected on first two ESFs derived by GL. (d) Test data projected on the first two dimensions. (e) Test data projected on two S-EDRs derived by SGL. (f) Test data projected on first two ESFs derived by GL

To explain why SGL performed better than GL without sparsity constraint, we plotted the norms of the derived empirical gradients from both methods in Fig. 3. Note that although the norms of partial derivatives of unimportant variables derived from the method without sparsity constraint are small, they are not exactly zero. As a result, all variables contributed and, consequently, introduced noise to the empirical gradient covariance matrix (Fig. 3(e), (f)).
Fig. 3

(Color online) Nonlinear classification simulation with σ=3 (continued). (a) RKHS norm of empirical gradient derived by SGL. (b) S-EGCM for first 10 dimensions. (c) Eigenvalues of S-EGCM. (d) RKHS norm of empirical gradient derived by GL. (e) EGCM for first 10 dimensions. (f) Eigenvalues of EGCM
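The effect of exact zeros on the gradient covariance matrix can be illustrated with a small sketch. It assumes each partial derivative is represented as \(f^{i}=\sum_{j} c^{i}_{j}\mathcal{K}(\cdot,\mathbf{x}_{j})\), so that the covariance matrix has entries \(\langle f^{i},f^{l}\rangle_{K}=\mathbf{c}^{i}K(\mathbf{c}^{l})^{T}\). The coefficient values below are made up, and power iteration is used only as a dependency-free stand-in for a full eigendecomposition.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def gradient_covariance(C, K):
    """Xi = C K C^T, where row i of C holds the kernel coefficients of f^i."""
    C_T = [list(col) for col in zip(*C)]
    return matmul(matmul(C, K), C_T)

def leading_eigenvector(M, iters=500):
    """Power iteration for the dominant eigenvector of a symmetric PSD matrix."""
    v = [1.0] * len(M)
    for _ in range(iters):
        w = [sum(m * x for m, x in zip(row, v)) for row in M]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v

# Hypothetical toy problem: p = 4 variables, n = 3 samples.
K = [[1.0, 0.5, 0.2], [0.5, 1.0, 0.5], [0.2, 0.5, 1.0]]
# Sparse estimate: variables 2 and 3 have exactly zero coefficient rows.
C_sparse = [[1.0, -0.5, 0.3], [0.2, 0.8, -0.4],
            [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
# Non-sparse estimate: the same rows are small but not zero.
C_dense = [[1.0, -0.5, 0.3], [0.2, 0.8, -0.4],
           [0.05, -0.02, 0.03], [0.01, 0.04, -0.02]]

Xi_sparse = gradient_covariance(C_sparse, K)
Xi_dense = gradient_covariance(C_dense, K)
v_sparse = leading_eigenvector(Xi_sparse)   # loadings on variables 2, 3 are 0
v_dense = leading_eigenvector(Xi_dense)     # every variable gets a loading
```

In the sparse case the rows and columns of the unselected variables vanish, so the leading direction loads only on the selected variables. In the non-sparse case every variable contributes a small loading.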

We also tested LASSO for this artificial data set, and not surprisingly it failed to identify the right variables in all cases we tested. We omit the details here.

6.3 Leukemia classification

Next we apply SGL to variable selection and dimension reduction on gene expression data. A gene expression dataset typically consists of the expression values of tens of thousands of mRNAs from a small number of samples, as measured by microarrays. Because of the large number of genes involved, the variable selection step becomes especially important, both for generating better prediction models and for elucidating the biological mechanisms underlying the data.

The gene expression data we will use is a widely studied dataset, consisting of the measurements of 7129 genes from 72 acute leukemia samples (Golub et al. 1999). The samples are labeled with two leukemia types according to the precursor of the tumor cells—one is called acute lymphoblastic leukemia (ALL), and the other one is called acute myelogenous leukemia (AML). The two tumor types are difficult to distinguish morphologically, and the gene expression data is used to build a classifier to classify these two types.

Among the 72 samples, 38 are training data and 34 are test data. We coded the type of leukemia as a binary response variable y, with 1 and -1 representing ALL and AML respectively. The variables in the training samples \(\{\mathbf{x}_{i}\}_{i=1}^{38}\) are normalized to have zero mean and unit length for each gene. The test data are normalized in the same way, but using only the empirical mean and variance of the training data.
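The normalization step can be sketched as below. This is one possible reading rather than the authors' code: "unit length" is taken to mean that each centered training column has Euclidean norm 1, the test data reuse the training statistics, and the guard for constant genes is an added assumption. The function names and toy values are hypothetical.

```python
import math

def fit_gene_scaler(X_train):
    """Per-gene mean and Euclidean norm of the centered training column."""
    n = len(X_train)
    cols = list(zip(*X_train))
    means = [sum(c) / n for c in cols]
    # Assumption: a constant gene (zero norm) is left unscaled.
    norms = [math.sqrt(sum((v - m) ** 2 for v in c)) or 1.0
             for c, m in zip(cols, means)]
    return means, norms

def apply_gene_scaler(X, means, norms):
    """Center and scale rows of X with statistics fitted on training data."""
    return [[(v - m) / s for v, m, s in zip(row, means, norms)] for row in X]

# Toy data: 3 training samples, 1 test sample, 2 genes (hypothetical values).
X_train = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
X_test = [[2.0, 11.0]]
means, norms = fit_gene_scaler(X_train)
Z_train = apply_gene_scaler(X_train, means, norms)
Z_test = apply_gene_scaler(X_test, means, norms)
```

Fitting the statistics on the training samples only keeps information from the test samples out of the preprocessing.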

We applied three methods (SGL, GL and LASSO) to the dataset to select variables and extract the dimension reduction directions. To compare the performance of the three methods, we used a linear SVM to build a classifier based on the variables or features returned by each method, and evaluated the classification performance using both the leave-one-out (LOO) error on the training data and the testing error. To implement SGL, the bandwidth parameter s is chosen to be half of the median of the pairwise distances of the sampling points, and the kernel is linear, \(\mathcal{K}(\mathbf{x},\mathbf{y})=\mathbf{x}\cdot\mathbf{y}\). The regularization parameters for the three methods are all chosen according to their prediction power as measured by the leave-one-out error.
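The leave-one-out protocol can be sketched with a pluggable classifier. To stay dependency-free, the sketch uses a nearest-centroid classifier as a stand-in; it is not the linear SVM used in the paper, and all names and data below are hypothetical.

```python
def loo_error(X, y, fit, predict):
    """Fraction of samples misclassified when each is held out in turn."""
    errors = 0
    for i in range(len(X)):
        X_tr = X[:i] + X[i + 1:]          # training set without sample i
        y_tr = y[:i] + y[i + 1:]
        model = fit(X_tr, y_tr)
        errors += predict(model, X[i]) != y[i]
    return errors / len(X)

def fit_centroids(X, y):
    """Stand-in classifier (NOT an SVM): class means for labels 1 and -1."""
    model = {}
    for label in (1, -1):
        rows = [x for x, t in zip(X, y) if t == label]
        model[label] = [sum(c) / len(rows) for c in zip(*rows)]
    return model

def predict_centroid(model, x):
    """Assign x to the class with the nearest centroid."""
    dist = {label: sum((a - b) ** 2 for a, b in zip(x, c))
            for label, c in model.items()}
    return min(dist, key=dist.get)

# Toy data with one feature (hypothetical values).
X_clean = [[0.0], [0.2], [0.1], [1.0], [1.2], [1.1]]
y_clean = [1, 1, 1, -1, -1, -1]
err_clean = loo_error(X_clean, y_clean, fit_centroids, predict_centroid)

# Same idea with one sample that sits among the other class.
X_noisy = [[0.0], [0.1], [1.0], [1.1], [0.05]]
y_noisy = [1, 1, -1, -1, -1]
err_noisy = loo_error(X_noisy, y_noisy, fit_centroids, predict_centroid)
```

Any classifier with the same `fit` and `predict` shape, including a linear SVM from a library, could be passed in instead. The same loop can also be used to pick a regularization parameter by comparing LOO errors.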

Table 2 shows the results of the three methods. We implemented two SVM classifiers for SGL using either only the variables or the features returned by SGL. Both classifiers are able to achieve perfect classification for both leave-one-out and testing samples. The performance of SGL is better than both GL and LASSO, although only slightly. All three methods performed significantly better than the SVM classifier built directly from the raw data in terms of LOO error and test error.
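Table 2

Summary of the Leukemia classification results

[Table body not recoverable. Surviving column headings: SGL (variable selection), Linear SVM. Surviving row headings: Number of variables or features (7129 (all) for Linear SVM), Leave one out error (LOO), Test errors.]
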
In addition to the differences in prediction performance, we note a few other observations. First, SGL selects more genes than LASSO, which likely reflects the failure of LASSO to choose genes with nonlinear relationships with the response variable, as we illustrated in our first example. Second, the S-EDRs derived by SGL are linear combinations of the 106 selected variables rather than of all original variables, as is the case for the ESFs derived by GL. This is a desirable property, since an important goal of gene expression analysis is to identify regulatory pathways underlying the data, e.g. those distinguishing the two types of tumors. By involving only a small number of genes, S-EDRs provide better and more manageable candidate pathways for further experimental testing.

7 Discussion

Variable selection and dimension reduction are two common strategies for high-dimensional data analysis. Although many methods have been proposed before for variable selection or dimension reduction, few methods are currently available for simultaneous variable selection and dimension reduction. In this work, we described a sparse gradient learning algorithm that integrates automatic variable selection and dimension reduction into the same optimization framework. The algorithm can be viewed as a generalization of LASSO from linear to non-linear variable selection, and a generalization of the OPG method for learning EDR directions from a non-regularized to regularized estimation. We showed that the integrated framework offers several advantages over the previous methods by using both simulated and real-world examples.

The SGL method can be refined by using an adaptive weight function rather than a fixed one as in our current implementation. The weight function \(\omega_{i,j}^{s}\) is used to measure the distance between two sample points. If the data lie in a lower dimensional space, the distance would be more accurately captured by using only the variables related to the lower dimensional space rather than all variables. One way to implement this is to calculate the distance using only selected variables. Note that the forward-backward splitting algorithm eliminates variables at each step of the iteration. We can thus use an adaptive weight function that calculates the distances based only on the selected variables returned after each iteration. More specifically, let \(\mathcal{S}^{(k)}=\{i:\|(\tilde{\mathbf {c}}^{i})^{(k)}\|_{2}\neq0\}\) represent the variables selected after iteration k. An adaptive approach is to use \(\sum_{l \in\mathcal {S}^{(k)}} (x_{i}^{l} - x_{j}^{l})^{2}\) to measure the distance \(\|\mathbf{x}_{i}-\mathbf{x}_{j}\|^{2}\) after iteration k.
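The adaptive distance described above can be sketched as follows. It assumes the coefficients after iteration k are available as one row \((\tilde{\mathbf{c}}^{i})^{(k)}\) per variable, and that the restricted distance is plugged into the same Gaussian weight used in Sect. 6. The function names and the toy numbers are hypothetical.

```python
import math

def selected_variables(C_tilde):
    """S^(k): indices of variables whose coefficient row has nonzero 2-norm."""
    return [i for i, row in enumerate(C_tilde)
            if math.sqrt(sum(v * v for v in row)) > 0.0]

def adaptive_sq_distance(x_i, x_j, selected):
    """Squared distance computed over the selected variables only."""
    return sum((x_i[l] - x_j[l]) ** 2 for l in selected)

def adaptive_weight(x_i, x_j, selected, s):
    """Gaussian weight with the restricted distance (assumed form)."""
    return math.exp(-adaptive_sq_distance(x_i, x_j, selected) / (2.0 * s ** 2))

# Toy iterate: variable 1 has been eliminated (hypothetical values).
C_tilde = [[0.3, -0.1], [0.0, 0.0], [0.2, 0.5]]
S = selected_variables(C_tilde)

x_i = (1.0, 5.0, 2.0)
x_j = (0.0, -5.0, 0.0)
d_adaptive = adaptive_sq_distance(x_i, x_j, S)        # ignores variable 1
d_full = adaptive_sq_distance(x_i, x_j, range(3))     # uses all variables
w = adaptive_weight(x_i, x_j, S, s=1.0)
```

In this toy case the eliminated variable dominates the full distance, so dropping it changes which points count as neighbors.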

An interesting area for future research is to extend SGL to semi-supervised learning. In many applications, it is much easier to obtain unlabeled data, with a larger sample size \(u \gg n\). Most natural (human or animal) learning seems to occur in semi-supervised settings (Belkin et al. 2006). SGL can be extended to semi-supervised learning along several directions. One way is to use the unlabeled data \(\mathcal{X}=\{\mathbf{x}_{i}\}_{i=n+1}^{n+u}\) to control the approximate norm of f in some Sobolev spaces, and to introduce a semi-supervised learning algorithm that adds the graph-based regularization term discussed in the next paragraph, weighted by μ, to the SGL objective, where \(\|\mathbf{f}\|_{K}=\sum_{i=1}^{p}\|f^{i}\|_{K}\), \(W_{i,j}\) are edge weights in the data adjacency graph, and μ is another regularization parameter that often satisfies λ=o(μ). To make the algorithm efficient, we can use truncated weights in the implementation, as done in Sect. 6.1.

The regularization term \(\sum_{i,j=1}^{n+u}W_{i,j}\|\mathbf{f}(\mathbf {x}_{i})-\mathbf{f}(\mathbf{x}_{j})\|^{2}_{\ell^{2}(\mathbb{R}^{p})}\) is mainly motivated by the work of Belkin et al. (2006). That paper introduced a regularization term \(\sum_{i,j=1}^{n+u}W_{i,j}(f(\mathbf{x}_{i})-f(\mathbf{x}_{j}))^{2}\) for semi-supervised regression and classification problems. This term is well known to be related to the graph Laplacian operator. It is used to approximate \(\int_{\mathbf{x}\in\mathcal{M}}\|\nabla_{\mathcal{M}}f\|^{2}d\rho_{X}(\mathbf{x})\), where \(\mathcal{M}\) is a compact submanifold that is the support of the marginal distribution \(\rho_{X}(\mathbf{x})\), and \(\nabla_{\mathcal{M}}\) is the gradient of f defined on \(\mathcal{M}\) (Do Carmo and Flaherty 1992). Intuitively, \(\int_{\mathbf{x}\in\mathcal{M}}\|\nabla_{\mathcal{M}}f\|^{2}d\rho_{X}(\mathbf{x})\) is a smoothness penalty corresponding to the probability distribution, and it reflects the intrinsic structure of \(\rho_{X}(\mathbf{x})\). Our regularization term \(\sum_{i,j=1}^{n+u}W_{i,j}\|\mathbf{f}(\mathbf {x}_{i})-\mathbf{f}(\mathbf{x}_{j})\|^{2}_{\ell^{2}(\mathbb{R}^{p})}\) is the corresponding vector form of \(\sum_{i,j=1}^{n+u}W_{i,j}(f(\mathbf {x}_{i})-f(\mathbf{x}_{j}))^{2}\) in Belkin et al. (2006). The regularization framework of SGL for semi-supervised learning can thus be viewed as a generalization of this previous work.
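The link to the graph Laplacian can be checked numerically. For symmetric weights W, the pairwise penalty equals \(2\,\mathrm{tr}(F^{T}LF)\), where L = D - W, D is the diagonal matrix of row sums of W, and row i of F holds \(\mathbf{f}(\mathbf{x}_{i})\). The sketch below uses made-up weights and function values and assumes W is symmetric.

```python
def pairwise_penalty(W, F):
    """sum_{i,j} W[i][j] * ||F[i] - F[j]||^2, with F[i] the vector f(x_i)."""
    n = len(W)
    return sum(W[i][j] * sum((a - b) ** 2 for a, b in zip(F[i], F[j]))
               for i in range(n) for j in range(n))

def laplacian_quadratic_form(W, F):
    """2 * trace(F^T L F) with L = D - W and D = diag(row sums of W)."""
    n = len(W)
    total = 0.0
    for col in zip(*F):                  # one coordinate function at a time
        for i in range(n):
            degree = sum(W[i])
            cross = sum(W[i][j] * col[i] * col[j] for j in range(n))
            total += degree * col[i] ** 2 - cross
    return 2.0 * total

# Hypothetical symmetric edge weights on 3 points and values of f in R^2.
W = [[0.0, 1.0, 0.5],
     [1.0, 0.0, 2.0],
     [0.5, 2.0, 0.0]]
F = [[1.0, 0.0],
     [0.0, 1.0],
     [2.0, 2.0]]

penalty = pairwise_penalty(W, F)
quadratic = laplacian_quadratic_form(W, F)
```

The two quantities agree, which is why the penalty can be evaluated through a single quadratic form in L per coordinate of f instead of a double sum over sample pairs.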



This work was partially supported by a grant from the National Science Foundation and a grant from the University of California.


  1. Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68, 337–404.
  2. Bach, F. R. (2008). Consistency of the group Lasso and multiple kernel learning. The Journal of Machine Learning Research, 9, 1179–1225.
  3. Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6), 1373–1396.
  4. Belkin, M., Niyogi, P., & Sindhwani, V. (2006). Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7, 2399–2434.
  5. Bertin, K., & Lecué, G. (2008). Selection of variables and dimension reduction in high-dimensional non-parametric regression. Electronic Journal of Statistics, 2, 1224–1241.
  6. Bickel, P., & Li, B. (2007). Local polynomial regression on unknown manifolds. In IMS lecture notes-monograph series: Vol. 54. Complex datasets and inverse problems: tomography, networks and beyond (pp. 177–186).
  7. Cai, J. F., Chan, R. H., & Shen, Z. (2008). A framelet-based image inpainting algorithm. Applied and Computational Harmonic Analysis, 24(2), 131–149.
  8. Combettes, P. L., & Wajs, V. R. (2005). Signal recovery by proximal forward-backward splitting. Multiscale Modeling & Simulation, 4(4), 1168–1200 (electronic). doi: 10.1137/050626090.
  9. Cook, R. D., & Yin, X. (2001). Dimension reduction and visualization in discriminant analysis. Australian & New Zealand Journal of Statistics, 43(2), 147–199. doi: 10.1111/1467-842X.00164. With a discussion by A. H. Welsh, Trevor Hastie, Mu Zhu, S. J. Sheather, J. W. McKean, Xuming He and Wing-Kam Fung and a rejoinder by the authors.
  10. Cucker, F., & Zhou, D. X. (2007). Learning theory: an approximation theory viewpoint (Vol. 24). Cambridge: Cambridge University Press.
  11. Daubechies, I., Defrise, M., & De Mol, C. (2004). An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57(11), 1413–1457. doi: 10.1002/cpa.20042.
  12. Dhillon, I. S., Mallela, S., & Kumar, R. (2003). A divisive information theoretic feature clustering algorithm for text classification. Journal of Machine Learning Research, 3, 1265–1287.
  13. Do Carmo, M., & Flaherty, F. (1992). Riemannian geometry. Basel: Birkhauser.
  14. Donoho, D. L. (1995). De-noising by soft-thresholding. IEEE Transactions on Information Theory, 41(3), 613–627.
  15. Donoho, D. L. (2006). Compressed sensing. IEEE Transactions on Information Theory, 52(4), 1289–1306.
  16. Donoho, D., & Grimes, C. (2003). Hessian eigenmaps: locally linear embedding techniques for high-dimensional data. Proceedings of the National Academy of Sciences, 100(10), 5591–5596.
  17. Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. Annals of Statistics, 32(2), 407–499. With discussion, and a rejoinder by the authors.
  18. Fukumizu, K., Bach, F. R., & Jordan, M. I. (2009). Kernel dimension reduction in regression. Annals of Statistics, 37(4), 1871–1905. doi: 10.1214/08-AOS637.
  19. Golub, G. H., & Van Loan, C. F. (1989). Matrix computations. Baltimore: Johns Hopkins University Press.
  20. Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D., & Lander, E. S. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439), 531–537.
  21. Guyon, I., Weston, J., Barnhill, S., & Vapnik, V. (2002). Gene selection for cancer classification using support vector machines. Machine Learning, 46(1), 389–422.
  22. Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157–1182.
  23. Hiriart-Urruty, J., & Lemaréchal, C. (1993). Convex analysis and minimization algorithms. Berlin: Springer.
  24. Hristache, M., Juditsky, A., & Spokoiny, V. (2001). Structure adaptive approach for dimension reduction. Annals of Statistics, 29(6), 1537–1566.
  25. Lafferty, J., & Wasserman, L. (2008). Rodeo: sparse, greedy nonparametric regression. Annals of Statistics, 36(1), 28–63. doi: 10.1214/009053607000000811.
  26. Langford, J., Li, L., & Zhang, T. (2009). Sparse online learning via truncated gradient. In D. Koller, D. Schuurmans, Y. Bengio, & L. Bottou (Eds.), Advances in neural information processing systems (Vol. 21, pp. 905–912). Cambridge: MIT.
  27. Li, K. C. (1991). Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414), 316–342. With discussion and a rejoinder by the author.
  28. Li, K. C. (1992). On principal Hessian directions for data visualization and dimension reduction: another application of Stein’s lemma. Journal of the American Statistical Association, 87(420), 1025–1039.
  29. Li, B., Zha, H., & Chiaromonte, F. (2005). Contour regression: a general approach to dimension reduction. Annals of Statistics, 33(4), 1580–1616.
  30. Lin, Y., & Zhang, H. H. (2006). Component selection and smoothing in multivariate nonparametric regression. Annals of Statistics, 34(5), 2272–2297. doi: 10.1214/009053606000000722.
  31. Mackey, L. (2009). Deflation methods for sparse PCA. Advances in Neural Information Processing Systems, 21, 1017–1024.
  32. McDiarmid, C. (1989). On the method of bounded differences. Surveys in Combinatorics, 141, 148–188.
  33. Micchelli, C. A., & Pontil, M. (2005). On learning vector-valued functions. Neural Computation, 17(1), 177–204. doi: 10.1162/0899766052530802.
  34. Micchelli, C. A., & Pontil, M. (2007). Feature space perspectives for learning the kernel. Machine Learning, 66, 297–319.
  35. Micchelli, C. A., Morales, J. M., & Pontil, M. (2010). A family of penalty functions for structured sparsity. Advances in Neural Information Processing Systems, 23, 1612–1623.
  36. Mukherjee, S., & Zhou, D. X. (2006). Learning coordinate covariances via gradients. Journal of Machine Learning Research, 7, 519–549.
  37. Mukherjee, S., & Wu, Q. (2006). Estimation of gradients and coordinate covariation in classification. Journal of Machine Learning Research, 7, 2481–2514.
  38. Mukherjee, S., Wu, Q., & Zhou, D. (2010). Learning gradients on manifolds. Bernoulli, 16(1), 181–207.
  39. Roweis, S., & Saul, L. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500), 2323–2326.
  40. Ruppert, D., & Wand, M. P. (1994). Multivariate locally weighted least squares regression. Annals of Statistics, 22(3), 1346–1370.
  41. Samarov, A. M. (1993). Exploring regression structure using nonparametric functional estimation. Journal of the American Statistical Association, 88(423), 836–847.
  42. Schölkopf, B., & Smola, A. (2002). Learning with kernels: Support vector machines, regularization, optimization, and beyond. Cambridge: MIT.
  43. Tenenbaum, J., Silva, V., & Langford, J. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500), 2319–2323.
  44. Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B, Statistical Methodology, 58(1), 267–288.
  45. van der Vaart, A. W., & Wellner, J. A. (1996). Weak convergence and empirical processes: with applications to statistics. Springer Series in Statistics. New York: Springer.
  46. Vapnik, V. N. (1998). Statistical learning theory. Adaptive and learning systems for signal processing, communications, and control. New York: Wiley.
  47. Weston, J., Elisseeff, A., Schölkopf, B., & Tipping, M. (2003). Use of the zero norm with linear models and kernel methods. Journal of Machine Learning Research, 3, 1439–1461.
  48. Xia, Y., Tong, H., Li, W. K., & Zhu, L. X. (2002). An adaptive estimation of dimension reduction space. Journal of the Royal Statistical Society. Series B, Statistical Methodology, 64(3), 363–410. doi: 10.1111/1467-9868.03411.
  49. Ye, G. B., & Zhou, D. X. (2008). Learning and approximation by Gaussians on Riemannian manifolds. Advances in Computational Mathematics, 29(3), 291–310.
  50. Zhang, T. (2004). Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, 32(1), 56–85. doi: 10.1214/aos/1079120130.
  51. Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society. Series B, Statistical Methodology, 67(2), 301–320.
  52. Zou, H., Hastie, T., & Tibshirani, R. (2006). Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2), 265–286.

Copyright information

© The Author(s) 2012

Authors and Affiliations

  1. School of Information and Computer Science, University of California, Irvine, USA
