1 Introduction

Numerous statistical applications are confronted today with the so-called curse of dimensionality (cf. Hastie et al. 2001; Wasserman 2006). Using high-dimensional datasets implies an exponential increase of computational effort for many statistical routines, while the data thin out in the local neighborhood of any given point and classical statistical methods become unreliable. When a random phenomenon is observed in the high-dimensional space \(\mathbb {R}^{d} \), the “intrinsic dimension” m, covering the degrees of freedom associated with the essential features of the data, may be much smaller than d. Introducing structural assumptions then allows one to reduce the problem complexity without sacrificing any statistical information (Mizuta 2004; Roweis and Saul 2000). This study focuses on structural analysis of high-dimensional unlabeled data. It is well known that the projection of high-dimensional data on a randomly selected low-dimensional subspace is nearly normal. This fact can be intuitively explained by the central limit theorem. At the same time, interesting data features such as multimodality, limited support, or heavy tails are essentially non-Gaussian. These “deviations from normality” are easily observable in low-dimensional data (say, up to dimension 3), but they are hardly detectable when the data is high-dimensional (when the data dimension exceeds 10). Our structural assumption is that the informative non-Gaussian part of the data is (approximately) located in a low-dimensional linear subspace \(\mathcal {I}\), while the non-informative noisy Gaussian part of the data distribution is full dimensional. If such a non-Gaussian subspace were known, one could reduce data dimension and complexity by projecting the data onto it without losing any substantial information. This assumption naturally leads to the problem of recovering the low-dimensional informative non-Gaussian subspace from the original data. In what follows we focus on the problem of estimating the projector Π onto an m-dimensional non-Gaussian subspace from d-dimensional data when m≪d. This is a semiparametric statistical problem with a target parameter of dimension O(dm) (the projection matrix) and a general m-dimensional nonparametric nuisance (the non-Gaussian data distribution). An important feature of the problem is that the target space—the set of m-dimensional projectors Π in \(\mathbb {R}^{d} \)—is not convex, so efficient methods of convex optimization cannot be applied to this problem directly, and the recovery of the parameter requires application of computationally expensive relaxation techniques. Another challenging task is the design of an estimation routine which is robust with respect to the unknown covariance matrix of the Gaussian component of the data. These characteristics of the problem make it hard even in the case of moderate dimension d.

Modern statistical literature offers two types of methods for solving a similar dimension reduction problem. Principal component analysis (PCA) looks for the subspace of the highest data variability, but it is limited to Gaussian data and is inefficient when recovering non-Gaussian features. Another well-known statistical technique, independent component analysis (ICA), is based on the rather restrictive assumption that the full-dimensional data distribution is a superposition of independent one-dimensional non-Gaussian sources; see e.g. Hyvärinen (1999) and references therein.

A novel approach to recovering the non-Gaussian subspace has been proposed recently in Blanchard et al. (2006), Kawanabe et al. (2007). The procedure, called NGCA (for Non-Gaussian Component Analysis), decomposes the original problem of dimension reduction into two tasks: the first one is to extract from the data a number of candidate vectors \(\widehat {\beta}_{j}\) which are “close” to \(\mathcal {I}\). The second is to recover an estimate \(\widehat {\varPi} \) of the projector Π on \(\mathcal {I}\) from the collection \(\{\widehat {\beta}_{j}\} \). An essential disadvantage of the original NGCA estimation procedure is that it uses the ICA algorithm to construct the “initial guess” for \(\mathcal {I}\). As a result, it shares with ICA its drawbacks. In particular, NGCA (same as the projection pursuit of Hyvärinen 1999) is extremely sensitive to the condition number of the covariance matrix of the Gaussian component. To overcome this problem, the NGCA approach has then been developed into SNGCA (for Sparse NGCA) in Diederichs et al. (2009). The SNGCA algorithm does not rely upon ICA and appears to be more stable with respect to the parameters of the distribution of the Gaussian component. However, that procedure relies upon solving a convex optimization problem per candidate vector \(\widehat {\beta}_{j} \), and its numerical cost becomes prohibitive for problem dimensions d≥50. Furthermore, the iterative nature of SNGCA makes the theoretical study of the resulting estimation procedure extremely involved. In this paper we discuss a new method of NGCA which represents the problem of recovering the non-Gaussian subspace as a minmax problem. Special relaxation arguments allow us to reduce this problem to a semidefinite convex-concave saddle-point optimization which can be effectively solved by modern methods of convex optimization. When compared to previous implementations of SNGCA in Blanchard et al. (2006), Kawanabe et al. (2007), Diederichs et al. (2009), the new approach “shortcuts” the intermediary stages and makes better use of the available information for the estimation of \(\mathcal {I}\). Furthermore, it allows us to treat in a transparent way the case of unknown dimension m of the target space \(\mathcal {I}\). Finally, the representation of the estimator \(\widehat {\varPi} \) as a solution of a semidefinite optimization problem allows us to provide a rigorous theoretical study of the accuracy of the proposed method. In particular, our theoretical results in Sect. 3 claim root-n consistency of \(\widehat {\varPi} \) (up to a “log-factor”) and hence its near optimality.

The paper is organized as follows: in Sect. 2 we present the setup of SNGCA and briefly review some existing techniques. Then in Sect. 3 we introduce the new approach to recovery of the non-Gaussian subspace and analyze its accuracy. Further in Sect. 4 we describe the use of the dual extrapolation scheme for linear matrix games for solving the semidefinite optimization problem. Finally we provide a simulation study in Sect. 5, where we compare the performance of the proposed algorithm SNGCA to that of some known projective methods of feature extraction.

2 Sparse non-Gaussian component analysis

2.1 The setup

The Non-Gaussian Component Analysis (NGCA) approach is based on the assumption that a high-dimensional distribution tends to be normal in almost any randomly selected direction. This intuitive fact can be justified by the central limit theorem when the data dimension tends to infinity. It leads to the NGCA assumption: the data distribution is a superposition of a full-dimensional Gaussian distribution and a low-dimensional non-Gaussian component. In many practical problems, like clustering or classification, the Gaussian component is uninformative and is treated as noise. The approach suggests identifying the non-Gaussian component and using it for further analysis.

The NGCA set-up can be formalized as follows; cf. Blanchard et al. (2006). Let \(X_{1},\ldots ,X_{N}\) be i.i.d. from a distribution P in \(\mathbb {R}^{d} \) describing the random phenomenon of interest. We suppose that P possesses a density ρ w.r.t. the Lebesgue measure on \(\mathbb {R}^{d} \), which can be decomposed as follows:

$$ \rho(x)=\phi_{\mu,\varSigma}(x)q(Tx). $$
(1)

Here \(\phi_{\mu,\varSigma}\) denotes the density of the multivariate normal distribution \(\mathcal {N}(\mu,\varSigma) \) with parameters \(\mu\in \mathbb {R}^{d} \) (expectation) and \(\varSigma\in \mathbb {R}^{d\times d} \) positive definite (covariance matrix). The function \(q:\mathbb {R}^{m}\to \mathbb {R}\) with m≤d is positive and bounded. \(T\in \mathbb {R}^{m\times d} \) is an unknown linear mapping. We refer to \(\mathcal {I}= \mathrm{range}\, T^{T} \) as the target or non-Gaussian subspace. Note that though T is not uniquely defined, \(\mathcal {I}\) is well defined, as is the Euclidean projector Π on \(\mathcal {I}\). In what follows, unless it is explicitly specified otherwise, we assume that the effective dimension m of \(\mathcal {I}\) is known a priori. For the sake of simplicity we assume that the expectation of X vanishes: E[X]=0.

The model (1) allows for the following interpretation (cf. Sect. 2 of Blanchard et al. 2006): suppose that the observation \(X\in \mathbb {R}^{d} \) can be decomposed into X=Z+ξ, where Z is an “informative low-dimensional signal” such that \(Z\in \mathcal {I}\), \(\mathcal {I}\) being an m-dimensional subspace of \(\mathbb {R}^{d} \), and ξ is a Gaussian vector independent of Z. One can easily show (see, e.g., Lemma 1 of Blanchard et al. 2006) that in this case the density of X can be represented as in (1).
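To make this interpretation concrete, the following sketch (with parameter choices of our own, not taken from the paper) simulates data from the model X=Z+ξ with a two-dimensional bimodal signal and isotropic Gaussian noise, and checks that a projection on a random direction looks nearly normal while the projection on an informative direction does not:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, N = 10, 2, 1000

# Informative signal Z: the first m coordinates follow a symmetric two-component
# Gaussian mixture (bimodal, hence non-Gaussian); the remaining coordinates are zero.
Z = np.zeros((N, d))
Z[:, :m] = rng.choice([-3.0, 3.0], size=(N, m)) + rng.standard_normal((N, m))

# Full-dimensional Gaussian noise xi, independent of Z, and the observations X = Z + xi.
xi = rng.standard_normal((N, d))
X = Z + xi

def excess_kurtosis(v):
    v = (v - v.mean()) / v.std()
    return float(np.mean(v ** 4) - 3.0)

# A projection on the informative direction e_1 is clearly non-Gaussian, while a
# projection on a random direction looks nearly normal (excess kurtosis close to 0).
w = rng.standard_normal(d)
w /= np.linalg.norm(w)
print("excess kurtosis along e_1:             ", excess_kurtosis(X[:, 0]))
print("excess kurtosis along random direction:", excess_kurtosis(X @ w))
```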

2.2 Basics of SNGCA estimation procedure

The estimation of \(\mathcal {I}\) relies upon the following result, proved in Blanchard et al. (2006): suppose that the function q is smooth; then for any smooth function \(\psi:\,\mathbb {R}^{d}\to \mathbb {R}\) the assumptions (1) and E[X]=0 ensure that, for

$$ \beta(\psi) \stackrel {\mathrm {def}}{=}\boldsymbol {E}\bigl[\nabla \psi(X)\bigr] = \int \nabla \psi(x) \rho(x)\, dx , $$
(2)

there is a vector \(\beta \in \mathcal {I}\) such that

$$\bigl|\beta(\psi)-\beta\bigr|_{2} \le \biggl| \varSigma^{-1} \int x \psi(x) \rho(x)\, dx \biggr|_{2},$$

where ∇ψ denotes the gradient of ψ and |⋅|_p is the standard ℓ_p-norm on \(\mathbb {R}^{d} \). In particular, if ψ satisfies E[Xψ(X)]=0, then \(\beta(\psi)\in \mathcal {I}\). Consequently

$$ \bigl|(I - \varPi^{*}) \beta(\psi) \bigr|_{2} \le \biggl| \varSigma^{-1} \int x \psi(x) \rho(x) \, dx \biggr|_{2}, $$
(3)

where I is the d-dimensional identity matrix and Π* is the Euclidean projector on \(\mathcal {I}\).

The above result suggests the following two-stage estimation procedure: first compute a set of estimates \(\{\widehat {\beta}_{\ell}\} \) of elements \(\{\beta_{\ell}\}\) of \(\mathcal {I}\), then recover an estimate of \(\mathcal {I}\) from \(\{\widehat {\beta}_{\ell}\} \). This heuristic was first used to estimate \(\mathcal {I}\) in Blanchard et al. (2006). To be more precise, the construction implemented in Blanchard et al. (2006) can be summarized as follows: for a family \(\{h_{\ell}\}\), ℓ=1,…,L, of smooth bounded (test) functions on \(\mathbb {R}^{d} \) let

$$ \gamma_{\ell} \stackrel {\mathrm {def}}{=}\boldsymbol {E}\bigl[X h_{\ell}(X)\bigr],\qquad \eta_{\ell} \stackrel {\mathrm {def}}{=}\boldsymbol {E}\bigl[\nabla h_{\ell}(X)\bigr], $$
(4)

and let

$$ \widehat {\gamma}_{\ell} \stackrel {\mathrm {def}}{=}N^{-1}\sum_{i=1}^N X_ih_{\ell}(X_i),\qquad \widehat {\eta}_\ell \stackrel {\mathrm {def}}{=}N^{-1}\sum_{i=1}^N \nabla h_{\ell}(X_i) $$
(5)

be their “empirical counterparts”. The set of “approximating vectors” \(\{\widehat {\beta}_{\ell}\} \) used in Blanchard et al. (2006) is as follows: \(\widehat {\beta}_{\ell}=\widehat {\eta}_{\ell}-\widehat {\varSigma}^{-1}\widehat {\gamma}_{\ell}\), ℓ=1,…,L, where \(\widehat {\varSigma} \) is an estimate of the covariance matrix Σ. The projector estimate at the second stage is \(\widehat {\varPi}=\sum_{j=1}^{m} e_{j}e_{j}^{T} \), where \(e_{j}\), j=1,…,m, are the m principal eigenvectors of the matrix \(\sum_{\ell=1}^{L}\widehat {\beta}_{\ell} \widehat {\beta}_{\ell}^{T} \). A numerical study provided in Blanchard et al. (2006) has shown that the above procedure can be used successfully to recover \(\mathcal {I}\). On the other hand, this implementation of the two-stage procedure possesses two important drawbacks. First, it relies upon the estimation of the covariance matrix Σ of the Gaussian component, which can be hard even for moderate dimensions d; a poor estimate of Σ then results in badly estimated vectors \(\widehat {\beta}_{\ell}\) and, as a consequence, in a poorly estimated \(\mathcal {I}\). Second, using the eigenvalue decomposition of the matrix \(\sum_{\ell=1}^{L}\widehat {\beta}_{\ell} \widehat {\beta}_{\ell}^{T} \) entails that the variance of the estimate \(\widehat {\varPi} \) of the projector Π on \(\mathcal {I}\) is proportional to the number L of test functions. As a result, the estimation procedure is restricted to relatively small families \(\{h_{\ell}\}\) and is sensitive to the initial selection of “informative” test functions.
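For concreteness, a minimal numpy sketch of this first implementation (the empirical moments (5), the vectors β̂_ℓ = η̂_ℓ − Σ̂^{-1}γ̂_ℓ and the eigenvector-based projector) could look as follows; the particular test functions h_ℓ(x)=tanh(ω_ℓ^T x) and the use of the sample covariance as Σ̂ are our own illustrative choices:

```python
import numpy as np

def ngca_first_implementation(X, omegas, m):
    """Sketch of the two-stage procedure of Blanchard et al. (2006) summarized above.

    X      : (N, d) centered data matrix
    omegas : (L, d) directions; here h_l(x) = tanh(omega_l^T x) is an illustrative choice
    m      : dimension of the non-Gaussian subspace
    """
    N, d = X.shape
    Sigma_hat = X.T @ X / N                       # sample covariance as the estimate of Sigma

    betas = []
    for omega in omegas:
        t = np.tanh(X @ omega)                    # h_l(X_i), shape (N,)
        grad = (1.0 - t ** 2)[:, None] * omega    # gradient of h_l at X_i, shape (N, d)
        gamma_hat = (X * t[:, None]).mean(axis=0) # empirical E[X h_l(X)], cf. (5)
        eta_hat = grad.mean(axis=0)               # empirical E[grad h_l(X)], cf. (5)
        betas.append(eta_hat - np.linalg.solve(Sigma_hat, gamma_hat))
    B = np.array(betas)                           # rows are the vectors beta_hat_l

    # second stage: m principal eigenvectors of sum_l beta_hat_l beta_hat_l^T
    eigval, eigvec = np.linalg.eigh(B.T @ B)
    E = eigvec[:, -m:]
    return E @ E.T                                # estimated projector on the subspace I
```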

To circumvent the above limitations of the approach of Blanchard et al. (2006), a different estimation procedure has been proposed in Diederichs et al. (2009). In that procedure the estimates \(\widehat {\beta} \) of vectors from the target space are obtained by a method which was referred to as convex projection. Let \(c \in \mathbb {R}^{L} \) and let

$$\gamma(c) \stackrel {\mathrm {def}}{=}\sum_{\ell}c^{\ell} \gamma_{\ell},\qquad \beta(c) \stackrel {\mathrm {def}}{=}\eta(c) \stackrel {\mathrm {def}}{=}\sum_{\ell}c^{\ell} \eta_{\ell}. $$

Observe that \(\beta(c) \in \mathcal {I}\) provided that γ(c)=0. Indeed, if \(\psi(x)=\sum_{\ell} c^{\ell} h_{\ell}(x)\), then \(\sum_{\ell} c^{\ell} \boldsymbol {E}[Xh_{\ell}(X)]=0\), and by (3),

$$\eta(c) = \sum_{\ell}c^{\ell} \boldsymbol {E}\bigl[\nabla h_{\ell}(X)\bigr] \in \mathcal {I}. $$

Therefore, the task of estimating \(\beta\in \mathcal {I}\) reduces to that of finding a “good” corresponding coefficient vector. In Diederichs et al. (2009) the vectors \(\{\widehat {c}_{j}\} \) are computed as follows: let

$$\widehat {\gamma}(c) \stackrel {\mathrm {def}}{=}\sum_{\ell}c^{\ell} \widehat {\gamma}_{\ell},\qquad \widehat {\eta}(c) \stackrel {\mathrm {def}}{=}\sum_{\ell}c^{\ell} \widehat {\eta}_{\ell}, $$

and let \(\xi_{j}\in \mathbb {R}^{d}\), j=1,…,J, constitute a set of probe unit vectors. Then the coefficient vectors are defined by

$$ \widehat {c}_j = \mathop {\mathrm {arg\,min}}\limits_{c} \bigl\{\bigl|\xi_j - \widehat {\eta}(c) \bigr|_{2}\, \mid \widehat {\gamma}(c)=0,\ |c |_{1} \le 1 \bigr\}, $$
(6)

and we set \(\widehat {\beta}_{j}=\widehat {\beta}(\widehat {c}_{j})= \sum_{\ell} \widehat {c}_{j}^{\ell} \widehat {\eta}_{\ell} \). Then \(\mathcal {I}\) is recovered by computing m principal axes of the minimal volume ellipsoid (Fritz-John ellipsoid) containing the estimated points \(\{\pm \widehat {\beta}_{j}\}_{j=1}^{J} \).
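Each problem (6) is a small convex program; a sketch of this convex projection step with a generic convex modelling tool (CVXPY, used here purely for illustration) might read:

```python
import numpy as np
import cvxpy as cp

def convex_projection(xi_j, Uhat, Ghat):
    """Sketch of the convex projection step (6) of Diederichs et al. (2009).

    xi_j : (d,) probe unit vector
    Uhat : (d, L) matrix whose columns are the vectors eta_hat_l
    Ghat : (d, L) matrix whose columns are the vectors gamma_hat_l
    """
    L = Uhat.shape[1]
    c = cp.Variable(L)
    problem = cp.Problem(
        cp.Minimize(cp.norm(xi_j - Uhat @ c, 2)),   # |xi_j - eta_hat(c)|_2
        [Ghat @ c == 0, cp.norm(c, 1) <= 1],        # gamma_hat(c) = 0, |c|_1 <= 1
    )
    problem.solve()
    return Uhat @ c.value                           # beta_hat_j = sum_l c^l eta_hat_l
```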

The recovery of \(\widehat {\mathcal {I}} \) through the Fritz-John ellipsoid (instead of the eigenvalue decomposition of the matrix \(\sum_{j} \widehat {\beta}_{j}\widehat {\beta}_{j}^{T} \)) makes it possible to bound the estimation error of \(\mathcal {I}\) by the maximal error of the estimates \(\widehat {\beta} \) of elements of the target space (cf. Theorem 3 of Diederichs et al. 2009), while the ℓ_1-constraint on the coefficients \(\widehat {c}_{j} \) allows one to control efficiently the maximal stochastic error of the estimates \(\widehat {\beta}_{j} \) (cf. Theorem 1 of Diederichs et al. 2009; Spokoiny 2009). On the other hand, that construction heavily relies upon the choice of the probe vectors \(\xi_{j}\). Indeed, in order to recover the projector on \(\mathcal {I}\), the collection of \(\widehat {\beta}_{j} \) should comprise at least m vectors with non-vanishing projection on the target space. To cope with this problem a multi-stage procedure has been used in Diederichs et al. (2009): given a set \(\{\xi_{j}\}_{k=0} \) of probe vectors, an estimate \(\widehat {\mathcal {I}}_{k=0} \) is computed, which is used to draw new probe vectors \(\{\xi_{j}\}_{k=1} \) from the vicinity of \(\widehat {\mathcal {I}}_{k=0} \); these vectors are employed to compute a new estimate \(\widehat {\mathcal {I}}_{k=1} \), and so on. The iterative procedure significantly improves the accuracy of the recovery of \(\mathcal {I}\). Nevertheless, the choice of “informative” probe vectors at the first iteration k=0 remains a challenging task and hitherto a weak point of the procedure.

3 Structural analysis by semidefinite programming

In the present paper we discuss a new choice of vectors β which solves the initialization problem of probe vectors for the SNGCA procedure in quite a particular way. Namely, the estimation procedure we are to present below does not require any probe vectors at all.

3.1 Informative vectors in the target space

Further developments are based on the following simple observation. Let \(\eta_{\ell}\) and \(\gamma_{\ell}\) be defined as in (4), and let \({U}=[\eta_{1},\ldots ,\eta_{L}]\in \mathbb {R}^{d\times L} \), \({G}=[\gamma_{1},\ldots ,\gamma_{L} ]\in \mathbb {R}^{d\times L} \). Using the observation in the previous section we conclude that if \(c\in \mathbb {R}^{L} \) satisfies \(Gc=\sum_{\ell=1}^{L}c^{\ell}\gamma_{\ell}=0 \) then \(Uc=\sum_{\ell=1}^{L}c^{\ell}\eta_{\ell}\) belongs to \(\mathcal {I}\). In other words, if Π* is the Euclidean projector on \(\mathcal {I}\), then

$$(I-\varPi^*)Uc=0. $$

Suppose now that the set \(\{h_{\ell}\}\) of test functions is rich enough in the sense that the vectors Uc span \(\mathcal {I}\) when c spans the subspace Gc=0. Recall that we assume the dimension m of the target space to be known. Then the projector Π* on \(\mathcal {I}\) is fully identified as the optimal solution to the problem

$$ \varPi^*=\arg\min_\varPi\max_c \left\{ \bigl|(I-\varPi)Uc\bigr|^2_2 \Biggm| \begin{array}{c} \mbox{$ \varPi $ is a projector on an }\\ \mbox{$ m $-dimensional subspace of $ \mathbb {R}^{d} $}\\ c\in \mathbb {R}^L,\ Gc=0 \end{array} \right\}. $$
(7)

In practice the vectors \(\gamma_{\ell}\) and \(\eta_{\ell}\) are not available, but we can suppose that their “empirical counterparts”, the vectors \(\widehat {\gamma}_{\ell}\), \(\widehat {\eta}_{\ell}\), ℓ=1,…,L, can be computed, such that for a set A of probability at least 1−ε,

$$ |\widehat {\eta}_\ell-\eta_\ell|_2\le \delta_N,\qquad |\widehat {\gamma}_\ell-\gamma_\ell|_2\le \nu_N, \quad \ell=1,\ldots ,L. $$
(8)

Indeed, it is well known (cf., e.g., Lemma 1 in Diederichs et al. 2009 or van der Vaart and Wellner 1996) that if functions \(h_{\ell}(x)=f(x,\omega_{\ell})\), ℓ=1,…,L, are used, where f is continuously differentiable, \(\omega_{\ell}\in \mathbb {R}^{d} \) are vectors on the unit sphere and f and \(\nabla_{x} f\) are bounded, then (8) holds with

$$ \everymath{\displaystyle} \begin{array}{rcl} \delta_N&=&C_1\max\limits_{x\in \mathbb {R}^d,\,|\omega|_2=1} \bigl|\nabla_x f(x,\omega)\bigr|_2 N^{-1/2}\sqrt{\min\{d,\ln L\} +\ln\varepsilon^{-1}},\\[11pt] \nu_N&=&C_2 \max\limits_{x\in \mathbb {R}^d,\,|\omega|_2=1} \bigl|xf(x,\omega)\bigr|_2 N^{-1/2}\sqrt{\min\{d,\ln L\}+\ln\varepsilon^{-1}}, \end{array} $$
(9)

where \(C_{1}\), \(C_{2}\) are constants depending on the smoothness properties and the second moments of the underlying density.

Then for any \(c\in \mathbb {R}^{L} \) such that |c|_1≤1 we can control the error of approximating \(\sum_{\ell} c^{\ell}\gamma_{\ell}\) and \(\sum_{\ell} c^{\ell}\eta_{\ell}\) by their empirical versions. Namely, we have on A:

$$\max_{|c|_1\le 1} \biggl|\sum_{\ell}c^{\ell}(\widehat { \eta}_\ell-\eta_\ell) \biggr|_2\le \delta_N \quad \mbox{and}\quad \max_{|c|_1\le 1} \biggl\lvert \sum _{\ell}c^{\ell}(\widehat {\gamma}_\ell- \gamma_\ell)\biggr\rvert_2\le \nu_N. $$

Let now \(\widehat {U}=[\widehat {\eta}_{1},\ldots ,\widehat {\eta}_{L}] \), \(\widehat {G}=[\widehat {\gamma}_{1},\ldots ,\widehat {\gamma}_{L} ] \). When substituting \(\widehat {U} \) and \(\widehat {G} \) for U and G into (7) we come to the following minmax problem:

$$ \min_\varPi\max_c \left\{\bigl|(I-\varPi)\widehat {U}c\bigr|^2_2 \biggm| \begin{array}{c} \mbox{$ \varPi $ is a projector on an $ m $-dimensional}\\ \mbox{subspace of $ \mathbb {R}^{d} $}\\ c\in \mathbb {R}^L,\ |c|_1\le 1,\ |\widehat {G}c|_2\le \varrho \end{array}\right\}. $$
(10)

Here we have substituted the constraint Gc=0 with the inequality constraint \(|\widehat {G}c|_{2}\le \varrho \) for some ϱ>0 in order to keep the optimal solution c to (7) feasible for the modified problem (10) (this will be the case with probability at least 1−ε if ϱ≥ν_N).

As we will see in a moment, when c runs over the ν_N-neighborhood of the intersection C_N of the standard hyperoctahedron \(\{c\in \mathbb {R}^{L},\ |c|_{1}\le 1\} \) with the subspace \(\widehat {G}c=0 \), the vectors \(\widehat {U}c \) span a close vicinity of the target space \(\mathcal {I}\).

3.2 Solution by semidefinite relaxation

Note that (10) is a hard optimization problem. Namely, the candidate maximizers \(c_{i}\) of (10) are the extreme points of the set \(C_{N}=\{c\in \mathbb {R}^{L},\,|c|_{1}\le 1,\, |\widehat {G}c|_{2}\le \nu_{N}\} \), and there are \(O(L^{d})\) such points. In order to be efficiently solvable, the problem (10) is to be “reduced” to a convex-concave saddle-point problem, which is, to the best of our knowledge, the only class of minmax problems which can be solved efficiently (cf. Nemirovski and Yudin 1983).

Thus the next step is to transform the problem (10) into a convex-concave minmax problem using the Semidefinite Relaxation (or SDP-relaxation) technique (see, e.g., Ben-Tal and Nemirovski 2001, Chap. 4). We obtain the relaxed version of (10) in two steps. First, let us rewrite the objective function (recall that I−Π is also a projector, and thus an idempotent matrix):

$$\bigl|(I-\varPi)\widehat {U}c \bigr|^2_2 = c^T \widehat {U}^T(I-\varPi)^2\widehat {U}c= c^T \widehat {U}^T(I-\varPi)\widehat {U}c=\mathop{\mathrm{trace}}\bigl[\widehat {U}^T(I-\varPi) \widehat {U}X \bigr], $$

where the positive semidefinite matrix X=cc T is the “new variable”. The constraints on c can be easily rewritten for X:

  1.

    the constraint |c|_1≤1 is equivalent to |X|_1≤1 (we use the notation \(|X|_{1}=\sum_{i,j=1}^{L} |X_{ij}| \));

  2.

    because X is positive semidefinite, the constraint \(|\widehat {G}c|_{2}\le \varrho \) translates into \(\mathop{\mathrm{trace}}[\widehat {G}X\widehat {G}^{T}]\le \varrho^{2} \).

The only “bad” constraint on X is the rank constraint rank X=1, and we simply remove it. Now we are done with the variable c and we arrive at

$$\min_\varPi\max_X \left\{\mathop{\mathrm{trace}}\bigl[\widehat {U}^T(I-\varPi)\widehat {U}X\bigr] \biggm| \begin{array}{c} \mbox{$ \varPi $ is a projector on an $ m $-dimensional}\\ \mbox{subspace of $ \mathbb {R}^{d} $}\\ X\succeq 0,\ |X|_1\le 1,\ \mathop{\mathrm{trace}}[\widehat {G}X\widehat {G}^T]\le \varrho^2 \end{array}\right\}. $$

Let us recall that an m-dimensional projector Π is exactly a symmetric d×d matrix with rank Π=m and trace Π=m, and with eigenvalues \(0\le \lambda_{i}(\varPi)\le 1\), i=1,…,d. Once again we remove the “difficult” rank constraint rank Π=m and finish with

$$ \min_P \max_X \left\{\mathop{\mathrm{trace}}\bigl[\widehat {U}^T(I-P)\widehat {U}X\bigr] \biggm| \begin{array}{c} 0\preceq P\preceq I,\ \mathop{\mathrm{trace}}P=m, \\[1pt] X\succeq 0,\ |X|_1\le 1,\ \mathop{\mathrm{trace}}[\widehat {G}X\widehat {G}^T]\le \varrho^2\end{array} \right\} $$
(11)

(we write P⪯Q if the matrix Q−P is positive semidefinite). There is no reason for an optimal solution \(\widehat {P} \) of (11) to be a projector matrix. If an estimate of Π* which is itself a projector is needed, one can use instead the projector \(\widehat {\varPi}\) onto the subspace spanned by the m principal eigenvectors of \(\widehat {P} \).
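The rounding of \(\widehat {P} \) to a projector and the Frobenius-norm error measure used later in Sect. 5 are straightforward; a short numpy sketch:

```python
import numpy as np

def projector_from_P(P_hat, m):
    """Round the SDP solution P_hat to the projector on its m principal eigenvectors."""
    eigval, eigvec = np.linalg.eigh(P_hat)   # eigenvalues in ascending order
    E = eigvec[:, -m:]                       # m principal eigenvectors
    return E @ E.T

def projector_error(Pi_hat, Pi_star):
    """Frobenius-norm error ||Pi_hat - Pi_star||_2 used in the experiments of Sect. 5."""
    return np.linalg.norm(Pi_hat - Pi_star, ord="fro")
```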

Note that (11) is a linear matrix game with bounded convex domains of its arguments—positive semidefinite matrices \(P\in \mathbb {R}^{d\times d} \) and \(X\in \mathbb {R}^{L\times L} \).

We are about to describe the accuracy of the estimate \(\widehat {\varPi} \) of Π*. To this end we need the following identifiability assumption on the system \(\{h_{\ell}\}\) of test functions:

Assumption 1

Suppose that there are vectors \(c_{1},\ldots ,c_{\overline {m}} \), \(m\le \overline {m}\le L \), such that \(|c_{k}|_{1}\le 1\) and \(Gc_{k}=0\), \(k=1,\ldots ,\overline {m} \), and non-negative constants \(\mu^{1},\ldots ,\mu^{\overline {m}} \) such that

$$ \varPi^{*} \preceq \sum_{k=1}^{\overline {m}} \mu^{k} Uc_kc_k^TU^T. $$
(12)

We denote \(\mu^{*}=\mu^{1}+\cdots+\mu^{\overline {m}} \).

In other words, if Assumption 1 holds, then the true projector Π* on \(\mathcal {I}\) is dominated, in the semidefinite order, by μ* times a convex combination of rank-one matrices \(Ucc^{T}U^{T}\) with c satisfying the constraint Gc=0 and |c|_1≤1.

Theorem 1

Suppose that the true dimension m of the subspace \(\mathcal {I}\) is known and that ϱ≥ν_N with ν_N as in (8). Let \(\widehat {P} \) be an optimal solution to (11) and let \(\widehat {\varPi} \) be the projector onto the subspace spanned by the m principal eigenvectors of \(\widehat {P} \). Then with probability ≥1−ε:

  (i)

    for any c such that |c|_1≤1 and Gc=0,

    $$\bigl|(I-\widehat {\varPi})Uc\bigr|_2\le \sqrt{m+1} \bigl((\varrho+\nu_N)\lambda^{-1}_{\min}(\varSigma)+2\delta_N\bigr); $$
  (ii)

    further, if Assumption 1 holds then

    $$ \mathop{\mathrm{trace}}\bigl[(I-\widehat {P})\varPi^* \bigr] \le \mu^*\bigl((\varrho+\nu_N) \lambda^{-1}_{\min}(\varSigma)+2\delta_N \bigr)^2, $$
    (13)

    and

    $$ \everymath{\displaystyle} \begin{array}{rcl} \|\widehat {\varPi}- \varPi^*\|_2^2&\le& {2 \mu^*\bigl(\lambda^{-1}_{\min}( \varSigma) (\varrho+\nu_N)+2\delta_N\bigr)^2} \tau, \\[3pt] \tau&=&(m+1)\wedge \bigl(1- \mu^*\bigl(\lambda^{-1}_{\min}( \varSigma) (\varrho+\nu_N)+2\delta_N\bigr)^2 \bigr)^{-1} \end{array} $$
    (14)

(here \(\|A\|_{2}=(\sum_{i,j}A^{2}_{ij})^{1/2}=(\mathop{\mathrm{trace}}[A^{T}A])^{1/2} \) is the Frobenius norm of A).

Note that if we were able to solve the minmax problem in (10), we could expect its solution, let us call it \(\widetilde {\varPi}\), to satisfy with high probability, for any c such that |c|_1≤1 and Gc=0,

$$\bigl|(I-\widetilde {\varPi})Uc\bigr|_2\le (\varrho+\nu_N)\lambda^{-1}_{\min}(\varSigma)+2\delta_N $$

(cf. the proof of Lemma 1 in the Appendix). If we compare this bound to that of statement (i) of Theorem 1, we conclude that the loss of accuracy resulting from the substitution of (10) by its tractable approximation (11) is bounded by the factor \(\sqrt{m+1} \). In other words, the “price” of the SDP-relaxation in our case is \(\sqrt{m+1} \) and does not depend on the problem dimensions d and L. Furthermore, when Assumption 1 holds true, we are able to provide a bound on the accuracy of recovery of the projector Π* which is seemingly as good as if we were using, instead of \(\widehat {\varPi}\), the solution \(\widetilde {\varPi}\) of (10).

Suppose now that the test functions \(h_{\ell}(x)=f(x,\omega_{\ell})\) are used, with \(\omega_{\ell}\) on the unit sphere of \(\mathbb {R}^{d} \), that ϱ=ν_N is chosen, and that Assumption 1 holds with a “not too large” μ*, e.g., such that \(\mu^{*}((\varrho+\nu_{N})\lambda^{-1}_{\min}(\varSigma)+2\delta_{N})^{2}\le {1\over 2} \). When substituting the bounds of (9) for δ_N and ν_N into (14) we obtain the following bound on the accuracy of the estimate \(\widehat {\varPi}\) (with probability 1−ϵ):

$$\|\widehat {\varPi}-\varPi^*\|_2^2\le C(f) \mu^*N^{-1} \bigl(\min(d,\ln L)+\ln \epsilon^{-1} \bigr) $$

where C(f) depends only on f. This bound establishes root-N consistency of the estimation of the non-Gaussian subspace, with a logarithmic price for the relaxation and the estimation error.

3.3 Case of unknown dimension m

The problem (11) may be modified to allow the treatment of the case when the dimension m of the target space is unknown a priori. Namely, consider for ρ≥0 the following problem

$$ \everymath{\displaystyle} \min_{P,t} \left\{t \Bigm| \begin{array}{c} \mathop{\mathrm{trace}}P\le t,\ \max_X\mathop{\mathrm{trace}}\bigl[\widehat {U}^T(I-P)\widehat {U}X \bigr]\le \rho^2,\ 0\preceq P\preceq I, \\[2pt] X\succeq 0,\ |X|_1\le 1,\ \mathop{\mathrm{trace}}\bigl[\widehat {G}X\widehat {G}^T \bigr]\le \varrho^2. \end{array} \right\} $$
(15)

The problem (15) is closely related to the ℓ_1-recovery estimator of sparse signals (see, e.g., the tutorial Candès 2006 and the references therein) and to the trace minimization heuristics widely used in Sparse Principal Component Analysis (SPCA) (cf. d’Aspremont et al. 2007, 2008). As we will see in an instant, when the parameter ρ of the problem is “properly chosen”, the optimal solution \(\widehat {P} \) of (15) possesses essentially the same properties as that of the problem (11).

A result analogous to that in Theorem 1 holds:

Theorem 2

Let \(\widehat {P}\), \(\widehat {X} \) and  \(\widehat {t}=\mathop{\mathrm{trace}}\widehat {P} \) be an optimal solution to (15) (note that (15) is clearly solvable), let \(\widehat {m}=\ \rfloor \widehat {t}\lfloor \), and let \(\widehat {\varPi} \) be the projector onto the subspace spanned by \(\widehat {m} \) principal eigenvectors of \(\widehat {P} \). Suppose that ϱ≥ν_N with ν_N as in (8) and that

$$ \rho\ge \lambda^{-1}_{\min}(\varSigma) (\varrho+ \nu_N)+\delta_N. $$
(16)

Then with probability at least 1−ε:

  (i)
    $$\widehat {t}\le m \quad \mathit{and}\quad \bigl|(I-\widehat {\varPi})Uc\bigr|_2\le \sqrt{m+1}( \rho+2\delta_N); $$
  (ii)

    furthermore, if Assumption 1 holds then

    $$\mathop{\mathrm{trace}}\bigl[(I-\widehat {P})\varPi^* \bigr]\le \mu^*(\rho+\delta_N)^2, $$

    and

    $$ \|\widehat {\varPi}-\varPi^*\|_2^2\le {2\mu^*(\rho+ \delta_N)^2} \bigl[(m+1)\wedge \bigl(1- \mu^*(\rho+ \delta_N)^2 \bigr)^{-1} \bigr] $$
    (17)

(here \(\|A\|_{2}=(\sum_{i,j} A_{ij}^{2})^{1/2} \) is the Frobenius norm of  A).

The proofs of the theorems are postponed to the Appendix.

The estimation procedure based on solving (15) makes it possible to infer the target subspace \(\mathcal {I}\) without a priori knowledge of its dimension m. When the constraint parameter ρ is close to the right-hand side of (16), the accuracy of the estimation will be close to that obtained in the situation when the dimension m is known. However, the accuracy of the estimation heavily depends on the precision of the available (lower) bound for \(\lambda_{\min}(\varSigma)\). In the high-dimensional situation this information is hard to acquire, and the necessity to compute this quantity may be considered a serious drawback of the proposed procedure.

4 Solving the saddle-point problem (11)

We start with the following simple observation: by using bisection or Newton search in ρ (note that the objective of (15) is obviously convex in ρ^2) we can reduce (15) to a small sequence of feasibility problems, closely related to (11): given t_0, report, if it exists, a matrix P such that

$$\max_X \left\{ \begin{array}{c} \mathop{\mathrm{trace}}\bigl[ \widehat {U}^T(I-P)\widehat {U}X \bigr]\le \rho^2,\ 0\preceq P \preceq I,\ \mathop{\mathrm{trace}}P\le t_0, \\[4pt] X\succeq 0,\ |X|_1\le 1,\ \mathop{\mathrm{trace}}\bigl[\widehat {G}X\widehat {G}^T \bigr]\le \varrho^2 \end{array} \right\}. $$

In other words, we can easily solve (15) if for a given m we are able to find an optimal solution to (11). Therefore, in the sequel we concentrate on the optimization technique for solving (11).
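One way to organize this reduction is an outer bisection over the trace budget t_0, given a feasibility oracle for the displayed problem; the sketch below assumes such an oracle `feasible(t0)` (to be implemented with the saddle-point machinery discussed next) and uses a tolerance of our own choosing:

```python
def smallest_feasible_trace(feasible, d, tol=1e-2):
    """Bisection over the trace budget t0 in the reduction of (15) to feasibility problems.

    feasible(t0) is an assumed oracle: it returns True iff there is a matrix P with
    0 <= P <= I and trace P <= t0 passing the worst-case test displayed above.
    """
    lo, hi = 0.0, float(d)            # the trace of a feasible P lies between 0 and d
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if feasible(mid):
            hi = mid                  # feasible: try a smaller trace budget
        else:
            lo = mid                  # infeasible: the budget must be increased
    return hi
```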

4.1 Dual extrapolation algorithm

In what follows we discuss the dual extrapolation algorithm of Nesterov (2007) for solving a version of (11) in which, with a certain abuse, we substitute the inequality constraint \(\mathop{\mathrm{trace}}\widehat {G} X\widehat {G}^{T}\le \varrho^{2} \) with the equality constraint \(\mathop{\mathrm{trace}}[\widehat {G} X\widehat {G}^{T}]=0 \). This way we come to the problem:

$$ \min_{P\in \mathcal{P}}\max_{X\in \mathcal{X}} \mathop{\mathrm{trace}}\bigl[\widehat {U}^T(I-P) \widehat {U}X \bigr] $$
(18)

where

$$\mathcal {X}\stackrel {\mathrm {def}}{=}\bigl\{X\in S^L:\ X\succeq 0,\ |X|_1\le 1,\ \mathop{\mathrm{trace}}\bigl[\widehat {G}X\widehat {G}^T\bigr]=0 \bigr\} $$

(here S^L stands for the space of L×L symmetric matrices) and

$$\mathcal {P}\stackrel {\mathrm {def}}{=}\bigl\{P\in S^d:\ 0\preceq P\preceq I,\ \mathop{\mathrm{trace}}P=m \bigr\}. $$

Observe first that (18) is a matrix game over two convex subsets (of the cone) of positive semidefinite matrices. If we use a large number of test functions, say \(L^{2}\sim 10^{6}\), the size of the variable X rules out the possibility of using interior-point methods. The methodology which appears to be adequate in this case is that behind dual extrapolation methods, recently introduced in Nemirovski (2004), Lu et al. (2007), Nesterov (2005, 2007). The algorithm we use belongs to the family of subgradient descent-ascent methods for solving convex-concave games. Though the rate of convergence of such methods is slow (their precision is only \(\mathcal {O}( 1/k )\), where k is the iteration count), their iteration is relatively cheap, which makes methods of this type appropriate for high-dimensional problems when high accuracy is not required.

We start with the general dual extrapolation scheme of Nesterov (2007) for linear matrix games. Let \(\mathbb {E}^{n} \) and \(\mathbb {E}^{m} \) be two Euclidean spaces of dimension n and m respectively, and let \(\mathcal {A}\subset \mathbb {E}^{n} \) and \(\mathcal {B}\subset \mathbb {E}^{m} \) be closed and convex sets. We consider the problem

$$ \min_{x\in \mathcal {A}}\max_{y\in \mathcal {B}}\langle x,Ay\rangle+\langle a,x\rangle+ \langle b,y\rangle. $$
(19)

Let \(\|\cdot\|_{x}\) and \(\|\cdot\|_{y}\) be some norms on \(\mathbb {E}^{n} \) and \(\mathbb {E}^{m} \) respectively. We say that \(d_{x}\) (resp., \(d_{y}\)) is a distance-generating function of \(\mathcal {A}\) (resp., of \(\mathcal {B}\)) if \(d_{x}\) (resp., \(d_{y}\)) is strongly convex with modulus \(\alpha_{x}\) (resp., \(\alpha_{y}\)) and differentiable on \(\mathcal {A}\) (resp., on \(\mathcal {B}\)). For z=(x,y) let \(d(z)=d_{x}(x)+d_{y}(y)\) (note that d is differentiable and strongly convex on \(\mathcal {A}\times \mathcal {B}\) with respect to the norm defined on \(\mathcal {A}\times \mathcal {B}\) by, e.g., \(\|z\|=\|x\|_{x}+\|y\|_{y}\)). We define the prox-function V of \(\mathcal {A}\times \mathcal {B}\) as follows: for \(z_{0}=(x_{0},y_{0})\) and z=(x,y) in \(\mathcal {A}\times \mathcal {B}\) we set

$$ V(z_0,z)\stackrel {\mathrm {def}}{=}d(z)-d(z_0)- \bigl\langle \nabla d(z_0),z-z_0 \bigr\rangle. $$
(20)

Next, for \(s=(s_{x},s_{y})\) we define the prox-transform \(T(z_{0},s)\) of s:

$$ T(z_0,s)\stackrel {\mathrm {def}}{=}\arg\min_{z\in \mathcal {A}\times \mathcal {B}} \bigl[\langle s,\ z-z_0\rangle- V(z_0,z) \bigr]. $$
(21)

Let us denote by \(F(z)=(-A^{T}y-a,\ Ax+b)\) the vector field of descent-ascent directions of (19) at z=(x,y), and let \(\overline {z} \) be the minimizer of d over \(\mathcal {A}\times \mathcal {B}\). Given vectors \(z_{k},z^{+}_{k}\in \mathcal {A}\times \mathcal {B}\) and \(s_{k}\in \mathbb {E}^{n}\times \mathbb {E}^{m}\) at the k-th iteration, we define the updates \(z_{k+1},\,z^{+}_{k+1} \) and \(s_{k+1} \) according to

where \(\lambda_{k}>0\) is the current stepsize. Finally, the current approximate solution \(\widehat {z}_{k+1} \) is defined as

$$\widehat {z}_{k+1}={1\over k+1}\sum_{i=1}^{k+1} z^+_{i}. $$

The key element of the above construction is the choice of the distance-generating function d in the definition of the prox-function. It should satisfy two requirements:

  • let D be the variation of V over \(\mathcal {A}\times \mathcal {B}\) and let α be the parameter of strong convexity of V with respect to ∥⋅∥. The complexity of the algorithm is proportional to D/α, so this ratio should be as small as possible;

  • one should be able to compute efficiently the solution to the auxiliary problem (21) which is to be solved twice at each iteration of the algorithm.

Note that the prox-transform preserves the additive structure of the distance-generating function. Thus, in order to compute the prox-transform on the feasible domain \(\mathcal {P}\times \mathcal {X}\) of (18), we need to compute its “P and X components”—the corresponding prox-transforms on \(\mathcal {P}\) and \(\mathcal {X}\). There are several evident choices of the prox-functions \(d_{P}\) and \(d_{X}\) of the domains \(\mathcal {P}\) and \(\mathcal {X}\) of (18) which satisfy the first requirement above and allow one to attain the optimal value \(O(\sqrt{m\ln d\ln L}) \) of the ratio D/α for the prox-function V of (18). However, for such distance-generating functions there is no known way to compute efficiently the X-component of the prox-transform T in (21) for the set \(\mathcal{ X} \). This is why, in order to admit an efficient solution, the problem (18) has to be modified one more time.
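For readers who want a concrete picture of a cheap descent-ascent iteration with averaging on a game of the form (19), the following purely illustrative sketch runs the closely related Euclidean extragradient method on unit balls, where the prox-step reduces to a plain projection; this is not the dual extrapolation scheme with adapted prox-functions used in the paper, only a minimal stand-in for the structure of such first-order methods:

```python
import numpy as np

def project_ball(z, radius=1.0):
    """Euclidean projection on the ball of the given radius (the prox-step in this toy setting)."""
    nrm = np.linalg.norm(z)
    return z if nrm <= radius else z * (radius / nrm)

def extragradient_bilinear(A, a, b, steps=500, lam=0.1):
    """Averaged extragradient iteration for min_x max_y <x, A y> + <a, x> + <b, y>
    over Euclidean unit balls; it illustrates the O(1/k) accuracy of averaged iterates,
    but it is not the dual extrapolation method of the paper."""
    n, m = A.shape
    x, y = np.zeros(n), np.zeros(m)
    x_sum, y_sum = np.zeros(n), np.zeros(m)
    for _ in range(steps):
        # extrapolation (half) step along the descent-ascent directions
        x_half = project_ball(x - lam * (A @ y + a))
        y_half = project_ball(y + lam * (A.T @ x + b))
        # main step, with the field evaluated at the half-step point
        x = project_ball(x - lam * (A @ y_half + a))
        y = project_ball(y + lam * (A.T @ x_half + b))
        x_sum += x
        y_sum += y
    return x_sum / steps, y_sum / steps
```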

4.2 Modified problem

We act as follows: first we eliminate the linear equality constraint which, taken along with X⪰0, says that X=Q^TZQ with Z⪰0 and a certain Q; assuming that the d rows of \(\widehat {G} \) are linearly independent, we can choose Q as an appropriate (L−d)×L matrix satisfying QQ^T=I (its rows form an orthonormal basis of the kernel of \(\widehat {G} \)). Note that from the constraints on X it follows that trace[X]≤1, whence

$$\mathop{\mathrm{trace}}[Z]=\mathop{\mathrm{trace}}\bigl[ZQQ^T\bigr]=\mathop{\mathrm{trace}}\bigl[Q^TZQ\bigr]=\mathop{\mathrm{trace}}[X]\le 1. $$

Thus, although there are additional constraints on Z as well, Z belongs to the standard spectahedron

$$\mathcal {Z}\stackrel {\mathrm {def}}{=}\bigl\{Z\in S^{L-d}:\ Z\succeq 0,\ \mathop{\mathrm{trace}}Z\le 1 \bigr\}. $$

Now we can rewrite our problem equivalently as follows:

$$ \min_{P\in \mathcal {P}} \max_{Z\in \mathcal {Z},\ |Q^TZQ|_1\le 1} \mathop{\mathrm{trace}}\bigl[\widehat {U}^T(I-P) \widehat {U} \bigl(Q^TZQ\bigr)\bigr]. $$
(22)
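The matrix Q (an orthonormal basis of the kernel of \(\widehat {G} \)) and the lifting Z↦X=Q^TZQ can be computed, e.g., from an SVD; a short numpy sketch, assuming \(\widehat {G} \) has full row rank d as in the text:

```python
import numpy as np

def kernel_basis(G_hat):
    """Rows of Q form an orthonormal basis of ker(G_hat), so Q Q^T = I and every X >= 0
    with G_hat X G_hat^T = 0 can be written as X = Q^T Z Q with Z >= 0 (assuming the
    d rows of G_hat are linearly independent)."""
    d, L = G_hat.shape
    _, _, Vt = np.linalg.svd(G_hat)   # Vt is L x L; its last L - d rows span ker(G_hat)
    return Vt[d:, :]                  # Q of shape (L - d, L)

def lift(Z, Q):
    """Map a spectahedron variable Z of size (L - d) x (L - d) back to X = Q^T Z Q."""
    return Q.T @ Z @ Q
```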

Let, further,

We claim that the problem (22) can be reduced to the saddle point problem

$$ \min_{(P,W)\in \mathcal {P}\times \mathcal {W}} \max_{(Z,Y)\in \mathcal {Z}\times \mathcal {Y}} \underbrace{ \bigl\{ \mathop{\mathrm{trace}}\bigl[\widehat {U}^T (I-P)\widehat {U}Y\bigr] +\lambda \mathop{\mathrm{trace}}\bigl[W\bigl(Q^TZQ-Y \bigr)\bigr] \bigr\} }_{F(P,W;\,Z,Y)} $$
(23)

provided that λ is not too small.

Now, “can be reduced to” means exactly the following:

Proposition 1

Suppose that \(\lambda>L|\widehat {U}|_{2}^{2} \), where \(|\widehat {U}|_{2}\) is the maximal Euclidean norm of the columns of \(\widehat {U}\). Let \((\widehat {P}, \widehat {W};\,\widehat {Z},\widehat {Y}) \) be a feasible ϵ-solution to (23), that is

where

Then setting

the pair \((\widehat {P}, \widetilde {Z}) \) is a feasible ϵ-solution to (22). Specifically, we have \((\widehat {P}, \widetilde {Z})\in \mathcal {P}\times \mathcal {Z}\) with \(|Q^{T}\widetilde {Z} Q|_{1}\le 1 \), and

where

The proof of the proposition is given in Appendix A.3.

Note that the feasible domains of (23) admit evident distance-generating functions. We provide the detailed computation of the corresponding prox-transforms in Appendix A.4.

5 Numerical experiments

In this section we compare the numerical performance of the presented approach, which we refer to as SNGCA(SDP), with that of other statistical methods of dimension reduction on simulated data.

5.1 Structural adaptation algorithm

We start with some implementation details of the estimation procedure. The test functions \(h_{\ell}(x)=f(x,\omega_{\ell})\) for the SNGCA algorithm are chosen as follows:

$$f(x,\omega) = \tanh\bigl(\omega^{T}x\bigr) e^{-\alpha\|x\|^{2}_{2}/2}, $$

where \(\omega_{\ell}\), ℓ=1,…,L, are unit vectors in \(\mathbb {R}^{d} \).

We implement here a multi-stage variant of the SNGCA (cf. Diederichs et al. 2009). At the first stage of the SNGCA(SDP) algorithm (Algorithm 1) the directions \(\omega_{\ell}\) are drawn randomly from the unit sphere of \(\mathbb {R}^{d} \). At each of the following stages we use the current estimate of the target subspace to “improve” the choice of the directions \(\omega_{\ell}\) as follows: we draw a fixed fraction of the ω’s from the estimated subspace and draw the remaining ω’s randomly from the unit sphere. The simulation results below are presented for the estimation procedure with three stages. The size of the family of test functions is set to L=10d, and the target accuracy of solving the problem (11) is set to 10^{−4}.
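A sketch of these implementation choices is given below; the gradient formula follows from the product rule, and the fraction of directions drawn from the current subspace estimate is a free parameter of our sketch (the paper does not specify it here):

```python
import numpy as np

def test_function_and_gradient(X, omega, alpha=1.0):
    """h(x) = tanh(omega^T x) * exp(-alpha * ||x||^2 / 2) and its gradient,
    evaluated row-wise on the data matrix X of shape (N, d)."""
    u = X @ omega
    t = np.tanh(u)
    g = np.exp(-0.5 * alpha * np.sum(X ** 2, axis=1))
    h = t * g
    # product rule: (1 - tanh(u)^2) * omega * g  -  alpha * x * tanh(u) * g
    grad = (1.0 - t ** 2)[:, None] * omega * g[:, None] - alpha * X * h[:, None]
    return h, grad

def draw_directions(L, d, Pi_hat=None, informed_fraction=0.5, rng=None):
    """Stage-wise choice of the omegas: random unit vectors at the first stage and,
    at later stages, a fraction of directions drawn from the currently estimated subspace."""
    rng = np.random.default_rng() if rng is None else rng
    W = rng.standard_normal((L, d))
    if Pi_hat is not None:
        k = int(informed_fraction * L)
        W[:k] = W[:k] @ Pi_hat        # project a fraction of the directions on the estimate
    W /= np.linalg.norm(W, axis=1, keepdims=True)
    return W
```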

Algorithm 1 SNGCA(SDP)

5.2 Experiment description

Each simulated data set \(X^{N}=[X_{1},\ldots ,X_{N}]\) of size N=1000 represents N independent and identically distributed (i.i.d.) realizations of a random vector X of dimension d. Each simulation is repeated 100 times, and we report the Frobenius norm of the error of estimation of the projector on the target space, averaged over the 100 simulations. In the examples below only m=2 components of X are non-Gaussian with unit variance; the other d−2 components of X are independent standard normal random variables. The densities of the non-Gaussian components are chosen as follows:

(A):

Gaussian mixture: 2-dimensional independent Gaussian mixtures with the density of each component given by \(0.5\phi_{-3,1}(x)+0.5\phi_{3,1}(x)\).

(B):

Dependent super-Gaussian: 2-dimensional isotropic distribution with density proportional to exp(−∥x∥).

(C):

Dependent sub-Gaussian: 2-dimensional isotropic uniform with constant positive density for \(\|x\|_{2}\le 1\) and 0 otherwise.

(D):

Dependent super- and sub-Gaussian: one component of X, say \(X_{1}\), follows the Laplace distribution \(\mathcal {L}(1) \) and the other is a dependent uniform \(\mathcal {U}(c,c+1)\), where c=0 for \(|X_{1}|\le \ln 2\) and c=−1 otherwise.

(E):

Dependent sub-Gaussian: 2-dimensional isotropic Cauchy distribution with density proportional to \(\lambda(\lambda^{2}+x^{2})^{-1}\) where λ=1.

We provide the 2-d density plots of the non-Gaussian components in Fig. 1. We start with comparing the presented algorithm with the Projection Pursuit (PP) method (Hyvärinen 1999) and with NGCA for d=10. The results are presented in Fig. 2 (the corresponding results for PP and NGCA have already been reported in Diederichs et al. 2009 and Blanchard et al. 2006). Since the minimization procedure of PP tends to be trapped in a local minimum, in each of the 100 simulations the PP algorithm is restarted 10 times with random starting points. The best result is reported for each PP simulation. We observe that SNGCA(SDP) outperforms NGCA and PP in all tests.

Fig. 1 (A) independent Gaussian mixtures, (B) isotropic super-Gaussian, (C) isotropic uniform and (D) dependent 1d Laplacian with additive 1d uniform, (E) isotropic sub-Gaussian

Fig. 2 Comparison of Projection Pursuit (PP), NGCA and the new approach SNGCA(SDP)

In the next simulation we study the dependence of the accuracy of SNGCA(SDP) on the noise level and compare it to the corresponding data for PP and NGCA. We present in Fig. 3 the results of experiments in which the non-Gaussian coordinates have unit variance, but the standard deviations of the components of the 8-dimensional Gaussian distribution follow the geometric progression \(10^{-r},10^{-r+2r/7},\ldots ,10^{r}\), where r=1,…,8. The conditioning of the covariance matrix heavily influences the estimation error of PP(tanh) and NGCA, but not that of SNGCA(SDP). The latter method appears to be insensitive to the differences in the noise variance along different directions in all test cases.
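For reference, this geometric progression of noise standard deviations can be generated as follows (variable names are ours):

```python
import numpy as np

def gaussian_noise_stds(r, k=8):
    """Standard deviations 10**(-r), 10**(-r + 2r/7), ..., 10**r of the k = 8 Gaussian
    coordinates; the condition number of the noise covariance grows with r."""
    return np.logspace(-r, r, num=k)
```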

Fig. 3 Estimation error with respect to the standard deviation of Gaussian components following a geometric progression on \([10^{-r},10^{r}]\) where r is the parameter on the abscissa

Next we compare the behavior of SNGCA(SDP), PP and NGCA as the dimension of the Gaussian component increases. In Fig. 4 we plot the mean error of estimation against the problem dimension d.

Fig. 4 Mean-square estimation error vs problem dimension d

For the PP and NGCA methods we observe that the estimation becomes meaningless (the estimation error explodes) already for d=30–40 for the models (A), (C) and for d=20–30 for the model (D). In the case of the models (B) and (E) we observe a progressive increase of the error for the methods PP and NGCA. The proposed method SNGCA(SDP) behaves robustly with respect to the increasing dimension of the Gaussian component for all test models.

5.3 Application to geometric analysis of metastability

Some biologically active molecules exhibit distinct large-scale geometric structures at scales much larger than the diameter of the atoms. If there is more than one such structure, with a life span much larger than the time scale of the local atomic vibrations, these structures are called metastable conformations (Schütte and Huisinga 2003). In other words, metastable conformations of biomolecules can be seen as connected subsets of the state space. When compared to the fluctuations within each conformation, the transitions between different conformations of a molecule are rare statistical events. Such multi-scale dynamic behavior of biomolecules stems from a decomposition of the free energy landscape into particularly deep wells, each containing many local minima (Pillardy and Piela 1995; Frauenfelder and McMahon 2000). Such wells represent different almost invariant geometric large-scale structures (Amadei et al. 1993). The macroscopic dynamics is assumed to be a Markov jump process, hopping between the metastable sets of the state space, while the microscopic dynamics within these sets mixes on much shorter time scales (Horenko and Schütte 2008). Since the shape of the energy landscape and the invariant density of the Markov process are unknown, the “essential degrees of freedom”, in which the rare conformational changes occur, are of importance.

We will now illustrate that SNGCA(SDP) is able to detect a multimodal component of the data density as a special case of non-Gaussian subspace in high-dimensional data obtained from molecular dynamics simulation of oligopeptides.

Clustering of 8-alanine

The first example is a time series generated by an equilibrium molecular dynamics simulation of 8-alanine. We only consider the backbone dihedral angles in order to determine different conformations.

The time series of dimension d=14 consists of the cyclic data set of all backbone torsion angles. The simulation using CHARMM was done at T=300 K with implicit water by means of the solvent model ACE2 (Schaefer and Karplus 1996). A symplectic Verlet integrator with an integration step of 1 fs was used; the total trajectory length was 4 μs and every τ=50 fs a set of coordinates was recorded.

The dimension reduction reported in Fig. 5 was obtained using SNGCA(SDP) for a given dimension m=5 of the target space containing the multimodal component. A concentration of the clustered data in the target space of SNGCA may be clearly observed. In comparison, the complement of the target space is almost completely filled with Gaussian noise (Fig. 6).

Fig. 5 Low dimensional multimodal component of 8-alanine

Fig. 6 Gaussian noise in the complement of the SNGCA target space

Clustering of a 3-peptide molecule

In the next example we investigate the Phenylalanyl-Glycyl-Glycine tripeptide, which is assumed to realize all of the most important folding mechanisms of polypeptides (Reha et al. 2005). The simulation is done using GROMACS at T=300 K with implicit water. The integration step of the symplectic Verlet integrator is set to 2 fs, and every τ=50 fs a set of d=31 dihedral angles was recorded. As in the previous experiment, the dimension of the target space is set to m=5.

Figure 7 shows that the clustered data can be primarily found in the target space of SNGCA(SDP).

Fig. 7 Low dimensional multimodal component of 3-peptide

6 Conclusion

We have studied a new approach to non-Gaussian component analysis. The suggested method, like the techniques proposed in Blanchard et al. (2006), Diederichs et al. (2009), has two stages: at the first stage certain linear functionals of the unknown distribution are estimated by sampling the data space, and afterwards this information is used to recover the non-Gaussian subspace. The novelty of the proposed approach resides in the new method of non-Gaussian subspace identification, based upon semidefinite relaxation. The new procedure allows us to overcome the main drawbacks of the previous implementations of NGCA and seems to improve significantly the accuracy of estimation. Nevertheless, the proposed algorithm is computationally demanding. While the first-order optimization algorithm we propose allows us to treat efficiently problems which are far beyond the reach of classical SDP-optimization techniques, the numerical difficulty seems to be one of the main practical limitations of the proposed approach. Furthermore, the new approach lacks the ability to adapt itself to the underlying type of non-Gaussian component. Both aspects, including new rates of convergence for an iterative adaptation of SNGCA to a preliminary estimation result, are the subject of work currently in preparation.