Abstract
Robust measures are introduced for methods to determine statistically uncorrelated or also statistically independent components spanning data measured in a way that does not permit direct separation of these underlying components. Because of the nonlinear nature of the proposed methods, iterative methods are presented for the optimization of merit functions, and local convergence of these methods is proved. Illustrative examples are presented to demonstrate the benefits of the robust approaches, including an application to the processing of dynamic medical imaging.
Similar content being viewed by others
Avoid common mistakes on your manuscript.
1 Introduction
The topics of this work focus on the low-dimensional representation of complex measured data. The lowest dimensional representation is a type of average. More accurate representations add dimensions beyond the average based upon subspaces in which the data vary the most. Choosing a basis for such subspaces is driven by the priority that data coordinates with respect to this basis be statistically uncorrelated or even statistically independent. The particular interest here is to present methods for performing these tasks which are robust against outliers in the measured data.
The most common type of average is the mean, which may be formulated variationally as the point minimizing the sum of squared distances to data points. As discussed in [14], a more robust method involves minimizing a merit function which does not grow as rapidly with respect to the data and would therefore apply less weight to erroneous data points far from a natural average. Various notions of an average based upon \(\ell _1\) measures are discussed in [11]. Based upon examples presented in Sect. 3, the type of average selected for this work is the geometric median, which may be formulated variationally as the point minimizing the sum of distances (not squared) to data points. The problem of determining the geometric median has a long history. In the 1937 paper by Weiszfeld [21], three proofs concerning the uniqueness of the geometric median are given, and one of these supplies an algorithm for its computation. We also refer to the recent annotated translation of that paper [19]. See also [3]. A shorter proof of uniqueness is given in Sect. 6 which is based upon a strict convexity argument. Moreover, a possibly novel characterization of a solution is provided in case data points are collinear. A regularized Weiszfeld iteration for computing the geometric median is proposed in Sect. 3, and the local convergence of this scheme is proved in Sect. 6. Alternative approaches, based upon the Chambolle–Pock algorithm [8] and projected dual ascent [4], are compared to our approach.
Given a natural average or center of the measured data, one may then wish to determine the direction in which data points vary the most from the center. This direction is the most significant principal component of the data. Principal components of lesser significance are sought similarly but within the orthogonal complement of the more significant ones. Determining and analyzing such components is the subject of principal component analysis (PCA) [15]. The most common way of determining these components is to select them as the eigenvectors of the covariance matrix of the data. The more significant components correspond to the larger eigenvalues of the covariance matrix since each eigenvalue gives the variance of the data projected onto the corresponding eigenvector. As discussed in Sect. 4, determining each eigenvector can be formulated variationally in terms of finding a best fit line through the data, where the line minimizes the sum of squared distances to data points. See [5] and [10] for \(\ell _1\)-based alternatives to this criterion. Based upon examples presented in Sect. 4, this line is determined here more robustly by minimizing the sum of distances (not squared) to data points. In other words, analogous to defining an average as a geometric median point, a principal component is defined here as a geometric median line. In another context, the terms robust PCA have been associated with the separation of sparse and low-rank components in the given data. This separation was first proposed in [7], where a low-rank component is measured with the spectral norm and a sparse component with the \(\ell _1\) norm. See also the related approaches of [23] and [18], where a low-rank component is separated from a column sparse component using similar norms. As in the works of [10] and [17], we do not assume here that the given data may be decomposed into sparse and low-rank components. Our task is instead to decompose (maximal rank) data into components which may all have significant energy in the data. The sum of distances (not squared) between data points and best lines is considered in [10] to obtain robust principal components all at once, and a convex relaxation of this approach is analyzed in [17]. In our approach, geometric median lines are obtained sequentially in decreasing orthogonal subspaces. In Sect. 4, an iterative scheme is proposed for computing these lines, and the scheme is based upon that used for computing the geometric median point. However, since the merit function is not convex, uniqueness of minimizers cannot be expected. Nevertheless, local convergence of the scheme to a minimizer is proved in Sect. 7. For approaches to PCA involving maximization of an \(\ell _1\) norm we refer to [16] and [18].
Suppose that the data are rotated to an axis system aligned with principal components and that they are then scaled along each new axis to normalize the respective variances to unity. When this rotation and scaling is carried out by standard methods using \(\ell _2\) measures, the transformed data have a covariance matrix equal to the identity. Then the data are said to have been sphered. In particular, the new data coordinates are statistically uncorrelated. However, they are not necessarily statistically independent [14]. (See, e.g., the example of Fig. 5 with \(m=1\) in (5.1) so that the data are sphered but the coordinates do not satisfy the independence criterion (5.4).) It might then be postulated that the data can be represented in a rotated axis system with respect to which coordinates are statistically independent. Determining and analyzing such a system is the subject of independent component analysis (ICA). In case the postulate holds, coordinates of the sphered data represent weighted sums of statistically independent variables, and by the Central Limit Theorem [1] histograms of such coordinates tend to be bell shaped. In order to identify the postulated rotation, it is standard to minimize the Gaussianity of histograms of coordinates in the desired rotated system. The approach proposed by [14] is to determine this rotation by maximizing a merit function which is known to be minimized by data with a Gaussian distribution. It is also argued in [14] that one such merit function is more robust to data outliers than another when it does not grow as rapidly with respect to the data. Such candidate merit functions are considered in Sect. 5. The optimization method of [24] is robust against local extrema. Here we propose an approach for determining the desired rotation by first targeting independence directly instead of using the indirect measure of Gaussianity. The merit function proposed in Sect. 5 is motivated by the observation that while sphered axes tend to be aligned with data clusters, independent axes tend to separate clusters. See the examples presented in Sect. 5 for details. A fixed point iteration scheme based on the optimality condition is proposed in Sect. 5 for computing robust independent components, and the local convergence of this scheme is proved in Sect. 8.
The paper is outlined as follows. In Sect. 2, standard \(\ell _2\) approaches to PCA and ICA are summarized, particularly to establish the background used later for the presentation of more robust methods. In Sect. 3, a robust method of data centering is proposed using the geometric median. In Sect. 4, a robust method for determining principal components is proposed using lines which are best fit in the sense that the sum of distances (not squared) to the data points is minimized within the subspace orthogonal to other components. In Sect. 5, a robust method for determining independent components is proposed which maximizes separations among sphered data clusters. Due to the nonlinearity of the respective optimality conditions, iterative schemes are proposed in Sects. 3–5 to solve the respective optimization problems. Local convergence of these schemes is proved in Sects. 6–8. In Sect. 9, the proposed methods are applied to a magnetic resonance image sequence to separate intensity changes due to physiological motion from those due to contrast agent, and benefits of the robust methods are demonstrated with respect to this realistic example. See also [20] and [22]. The paper ends with a summary in Sect. 10.
2 Summary of \(\ell _2\) Approaches to PCA and ICA
Let an unknown random vector \(\mathrm{z} \in \mathbb {R}^m\) be given with components \(\{\mathrm{z}_i\}_{i=1}^m\) which will be called sources. For example, the sources could be random variables associated with sounds produced independently at a cocktail party. The sources are assumed to satisfy the following:
-
1.
For \(1\le i\ne j\le m\), \(\mathrm{z}_i\) and \({z}_j\) are statistically independent.
-
2.
No \({z}_i\) is normally distributed.
-
3.
For \(1\le i\le m\), the variance \(\sigma _i^2 = {E}[({z}_i-{E}[{z}_i])^2]\) of \({z}_i\) is positive.
Here, E denotes the expectation. Since the sources are statistically independent, they are uncorrelated [14]. Let their positive definite diagonal covariance matrix be denoted by
which is unknown. Let a random vector \({y} \in \mathbb {R}^m\) be defined through a measurement process
modeled in terms of the mixing matrix \(A \in \mathbb {R}^{m\times m}\). The components \(\{{y}_i\}_{i=1}^m\) of y will be called measurements. For example, the measurements could be random variables associated with sounds recorded by separate microphones at the cocktail party mentioned above. In this case, it may be assumed naturally that each microphone records a weighted sum of sources, where a weight is stronger when the source is nearer, and the set of vectors of such weights for the respective microphones is linearly independent. Under the assumption that the mixing matrix is invertible, the goal is to determine a matrix \(W\in \mathbb {R}^{m\times m}\) such that the components \(\{{x}_i\}_{i=1}^m\) of the random vector
estimate the sources in the following sense. First, normalizing \({z}=A^{-1}{y}\) according to \(C({z})^{-\frac{1}{2}}{ z}\) removes the ambiguity of unknown variances by setting the covariance matrix to the identity. Secondly, since the order and sign of components in \(C({z})^{-\frac{1}{2}}{z}\) are unknown, the alternative \(PC({z})^{-\frac{1}{2}}{z}\) also satisfies the source assumptions when \(P\in \mathbb {R}^{m\times m}\) is any matrix satisfying \((P_{q_i,j})^2=\delta _{i,j}\) with \(\{q_i\}_{i=1}^m\) being a permutation of \(\{i\}_{i=1}^m\). Thus, W estimates a product \(PC({z})^{-\frac{1}{2}}A^{-1}\), and the covariance matrix of x in (2.2) is the identity.
Suppose that each random measurement variable \({y}_i\) is sampled directly to obtain n samples \(\{y_{ij}\}_{j=1}^n\). Implicitly underlying these are samples \(\{z_{ij}\}_{j=1}^n\) of each random source variable \({z}_i\). Define the sample vectors \({{\varvec{y}}}_i = \{y_{ij}\}_{j=1}^n\), \({{\varvec{z}}}_i = \{z_{ij}\}_{j=1}^n\), \(i=1,\ldots ,m\). According to the linear model in (2.1), the matrices \(Y=\{{{\varvec{y}}}_1,\ldots ,{{\varvec{y}}}_m\}^\mathrm{T} \in \mathbb {R}^{m\times n}\) and \(Z=\{{{\varvec{z}}}_1,\ldots ,{{\varvec{z}}}_m\}^\mathrm{T} \in \mathbb {R}^{m\times n}\) are related by
By (2.2), the estimation \(X = \{{{\varvec{x}}}_1,\ldots ,{{\varvec{x}}}_m\}^\mathrm{T}\in \mathbb {R}^{m\times n}\) of the sources satisfies
The matrix W is determined stepwise in terms of its singular value decomposition
where \(U,V \in \mathbb {R}^{m\times m}\) are orthogonal and \(\Lambda \in \mathbb {R}^{m\times m}\) is positive definite and diagonal. Specifically, after the data are centered
with
the product \(V^\mathrm{T}Y_\mathrm{c}\) should rotate the data so that the new coordinate axes are aligned with the visually natural axes of the cluster of data points \(\{Y{\hat{{\varvec{e}}}}_j\}_{j=1}^n\), \({\hat{{\varvec{e}}}}_j \in \mathbb {R}^n\), \(({\hat{{\varvec{e}}}}_j)_i = \delta _{i,j}\). After this rotation, the product
should scale the data so that the variance along each new coordinate axis is unity. For this reason, the data \(Y_\mathrm{s}\) are said to be sphered. The final orthogonal matrix U in (2.5) is chosen so that the components of the random variable x in (2.2) are maximally independent in a sense made precise below.
To determine the transformations V and \(\Lambda \), the covariance matrix of the sphered data is required to be the identity,
which is accomplished by determining the matrices V and \(\Lambda \) from the eigenspace decomposition of the centered data,
The columns of V are the so-called principal components of the data Y. Analyzing this decomposition is the subject of principal component analysis (PCA). For instance, the sampled data may be filtered by projecting these data onto subspaces spanned by principal components. For this, assume that the entries of \(\Lambda = \text{ diag }\{\lambda _i\}_{i=1}^m\) and \(V=\{{{\varvec{v}}}_1,\ldots ,{{\varvec{v}}}_m\}\) are ordered according to \(\lambda _1 \ge \lambda _2 \ge \cdots \ge \lambda _m\). This means that the variance \(\lambda _i = \frac{1}{n}\Vert Y_\mathrm{c}^\mathrm{T}{{\varvec{v}}}_i\Vert _{\ell _2}^2\) of the data \(Y_\mathrm{c}\) along the axis \({{\varvec{v}}}_i\) is larger than the variance \(\lambda _j\) along the axis \({{\varvec{v}}}_j\) for \(i<j\). To select only the \(r<m\) components with respect to which the data have the most variation, define the projected data \(Y_P \approx Y\) by
where the projector \(P \in \mathbb {R}^{r\times m}\) is defined with entries \(P_{i,j} = \delta _{i,j}\). Note that with (2.6), (2.8), and (2.10), this result can be rewritten as \(Y_P = {\bar{Y}} + {\textstyle \frac{1}{n}}Y_\mathrm{c}(PY_\mathrm{s})^\mathrm{T}(PY_\mathrm{s})\).
Next, the transformation U in (2.1) is determined so that the components of the random variable x in (2.2) are independent. While the rows of the sphered data \(Y_\mathrm{s}\) are statistically uncorrelated, they are not necessarily statistically independent [14]. A criterion is now sought for a final rotation of axes which gives the desired independence. Since as seen in (2.1) measurements are sums of independent random variables, the Central Limit Theorem suggests why the measurements tend to be normally distributed [1]. The matrix U is often chosen to reverse this effect, i.e., to make the components of x depart from being normally distributed as much as possible. Here, the significance of the assumption that no component \(z_i\) be normally distributed can be seen, as otherwise the proposed measure of independence would not bring a separation of sources in the following. For the required statistical constructions, let E[x] denote the expectation of a random variable x. Since a normally distributed random variable n with mean 0 and variance \(\sigma ^2\) has moments
it follows that the Kurtosis \(K = K_4\),
of n satisfies \(K({n}) = {E}[|{n}|^4]-3{E}[|{n}|^2]^2 = 0\). Hence, a parameter-dependent random variable may be made to depart maximally from being normally distributed by maximizing the square of its Kurtosis with respect to parameters. Applying this criterion to the rows of
the rows of \(U = \{{{\varvec{u}}}_i^\mathrm{T}\}_{i=1}^m\) are determined as follows. Define
and the projected data
whose columns lie in \(T_l \subset \mathbb {R}^m\) defined as the range of \(Y_l\). Note that
Given \(T_l\), let the lth column of \(U^\mathrm{T}\) be determined inductively by
By (2.9) and (2.17), the second moment of \(\mathbf{u}^\mathrm{T}Y_l/\Vert \mathbf{u}\Vert _{\ell _2}\) is
Thus, F in (2.18) is computed according to
In this way, the rows of U are determined sequentially so that the earlier components of x depart from being normally distributed more than later components. Alternatively, all components of x may be estimated with roughly equal quality by determining all rows of U simultaneously through maximizing \(\sum _{l=1}^m F({{\varvec{u}}}_l^\mathrm{T} Y_\mathrm{s})\) under the conditions \({{\varvec{u}}}_i^\mathrm{T}{{\varvec{u}}}_j = \delta _{i,j}\), \(1\le i,j\le m\). Once matrices U, \(\Lambda \), and V are determined, the source samples are estimated according to (2.4) and (2.5).
The columns of \(V\Lambda ^{\frac{1}{2}}U^\mathrm{T}\) are the so-called independent components of the data Y. Analyzing this decomposition is the subject of independent component analysis (ICA). For instance, the sampled data may be filtered by projecting these data onto subspaces spanned by independent components. Specifically, to select the \(r<m\) desired independent components \(\{q_1,\ldots ,q_r\}\) \(\subset \) \(\{1,\ldots ,m\}\), define the projected data \(Y_Q \approx Y\) by
where the projector \(Q \in \mathbb {R}^{r\times m}\) is defined with entries \(Q_{ij} = \delta _{q_i,j}\). Note that with (2.6), (2.8), (2.10) and (2.14), this result can be rewritten as \(Y_Q = {\bar{Y}} + {\textstyle \frac{1}{n}}Y_\mathrm{c}(QX_\mathrm{c})^\mathrm{T}(QX_\mathrm{c})\).
In the calculations above, it is implicitly assumed that the number of samples n is at least as large as the number of sources m. Otherwise, the rank n of the covariance matrix \(\frac{1}{n} Y_\mathrm{s} Y_\mathrm{s}^\mathrm{T}\) would be less than its dimension m, and the diagonal matrix \(\Lambda \) in (2.10) would not be positive definite. In case \(n<m\) does in fact hold, because so few samples have been collected, one might be inclined simply to replace Y with \(Y^\mathrm{T}\) and therefore reverse the roles of time and space in the data. However, the data must possess an ergodicity property for the results with transposed data to be roughly equivalent to those without transposed data. Since such a property may not generally hold, the matrices above are determined here as follows; see also [6]. With \({\bar{Y}}\) and \(Y_\mathrm{c}\) given by (2.6), define the singular value decomposition \(Y_\mathrm{c}/\sqrt{n} = V{\hat{\Sigma }}{\hat{Y}}_\mathrm{s}/\sqrt{n}\) in terms of rotation matrices \(V\in \mathbb {R}^{m\times m}\) and \({\hat{Y}}_\mathrm{s}/\sqrt{n}\in \mathbb {R}^{n\times n}\) and a rectangular matrix \({\hat{\Sigma }} \in \mathbb {R}^{m\times n}\) for which \({\hat{\Lambda }} = {\hat{\Sigma }}^\mathrm{T}{\hat{\Sigma }} \in \mathbb {R}^{n\times n}\) is diagonal and positive definite. The matrices \({\hat{\Lambda }}\) and \({\hat{Y}}_\mathrm{s}\) are determined from the eigenspace decomposition \(Y_\mathrm{c}^\mathrm{T}Y_\mathrm{c} = {\hat{Y}}_\mathrm{s}^\mathrm{T} {\hat{\Lambda }} {\hat{Y}}_\mathrm{s}\), \({\hat{Y}}_\mathrm{s} {\hat{Y}}_\mathrm{s}^\mathrm{T} = nI\). Since the last \(m-n\) rows of \({\hat{\Sigma }}\) are zero, the last \(m-n\) columns \(\tilde{V}\in \mathbb {R}^{m\times (m-n)}\) of \(V=[{\hat{V}},\tilde{V}]\) may be neglected to obtain \(Y_\mathrm{c}= {\hat{V}}{\hat{\Lambda }}^{\frac{1}{2}}{\hat{Y}}_\mathrm{s}\). The matrix \({\hat{V}} \in \mathbb {R}^{m\times n}\) is determined from \({\hat{V}} = Y_\mathrm{c}{\hat{Y}}_\mathrm{s}^\mathrm{T}{\hat{\Lambda }}^{-\frac{1}{2}}/n\). The sphered data \({\hat{Y}}_\mathrm{s}\) are transformed by the rotation matrix \({\hat{U}} \in \mathbb {R}^{n\times n}\) maximizing independence of the rows of \({\hat{X}}_\mathrm{c} = {\hat{U}}{\hat{Y}}_\mathrm{s} \in \mathbb {R}^{n\times n}\). Note that with \(\Lambda = {\hat{\Sigma }}{\hat{\Sigma }}^\mathrm{T}\) it also follows that \(Y_\mathrm{c}Y_\mathrm{c}^\mathrm{T} = V{\hat{\Sigma }}[{\hat{Y}}_\mathrm{s}{\hat{Y}}_\mathrm{s}^\mathrm{T}] {\hat{\Sigma }}^\mathrm{T}V^\mathrm{T} = nV\Lambda V^\mathrm{T}\) holds, giving (2.10). Let \({\hat{Y}}_\mathrm{s}\) be padded with \(m-n\) zero rows to obtain \(Y_\mathrm{s} = [{\hat{Y}}_\mathrm{s};0] \in \mathbb {R}^{m\times n}\). With \(\Sigma = \Lambda ^{\frac{1}{2}}\), it follows from the singular value decomposition of \(Y_\mathrm{c}\) that \(Y_\mathrm{c} = V\Sigma Y_\mathrm{s}\) holds, and hence the counterpart \(Y_\mathrm{s} = (\Lambda ^\dagger )^{\frac{1}{2}} V^\mathrm{T}Y_\mathrm{c}\) to (2.8) holds, where \(\Lambda ^\dagger \) denotes the pseudo-inverse of \(\Lambda \). The rotation matrix \(U = [[{\hat{U}},0];[0,\tilde{U}]] \in \mathbb {R}^{m\times m}\) can be defined by supplementing \({\hat{U}}\) with the rotation matrix \(\tilde{U}\in \mathbb {R}^{(m-n)\times (m-n)}\) and otherwise padding with zeros, and \(X_\mathrm{c} = UY_\mathrm{s} \in \mathbb {R}^{m\times n}\) can be defined to give (2.14). However, according to \(X_\mathrm{c} = UY_\mathrm{s} = [[{\hat{U}},0];[0,\tilde{U}]][{\hat{Y}}_\mathrm{s};0] = [{\hat{X}}_\mathrm{c};0]\), the last \(m-n\) rows of \(X_\mathrm{c}\) are zero, contrary to the objective that the rows be independent. Thus, \({\hat{X}}_\mathrm{c}\) marks the end of the calculation and \({\hat{X}} = {\hat{U}}{\hat{\Lambda }}^{-\frac{1}{2}}{\hat{V}}^\mathrm{T}Y\) gives the maximum number of independent components which can be determined from the undersampled data. Finally, for projectors \(P,Q \in \mathbb {R}^{n\times n}\), (2.11) and (2.21) become \(Y_P = {\bar{Y}} + \frac{1}{n}Y_\mathrm{c}(P{\hat{Y}}_\mathrm{s})^\mathrm{T} (P{\hat{Y}}_\mathrm{s})\) and \(Y_Q = {\bar{Y}} + \frac{1}{n}Y_\mathrm{c} (Q{\hat{X}}_\mathrm{s})^\mathrm{T}(Q{\hat{X}}_\mathrm{s})\), respectively.
3 \(\ell _1\) Approach to Centering
That \(\ell _1\) measures lead to statistically robust results may be highlighted by the following simple example. Suppose samples \(Y = \{y_j\}_{j=1}^n = \{0,1,\ldots ,1\}\in \mathbb {R}^n\), \(n>2\), have been collected, where the first measurement is clearly an outlier. The \(\ell _2\) mean of these data is given by
which is clearly influenced by the outlier. On the other hand, the \(\ell _1\) mean is given by
which is insensitive to the outlier. Here, the median is defined as the middle entry in the sorted list of values, if n is odd, or else the average of two middle values, if n is even.
For a generalization of this robust scalar mean to its counterpart for vectors, let the data \(Y = \{{{\varvec{y}}}_1,\ldots ,{{\varvec{y}}}_m\}^\mathrm{T} \in \mathbb {R}^{m\times n}\), \({{\varvec{y}}}_i = \{y_{ij}\}_{j=1}^n\), with columns \(Y{\hat{{\varvec{e}}}}_j\), \({\hat{{\varvec{e}}}}_j \in \mathbb {R}^n\), \(({\hat{{\varvec{e}}}}_j)_i = \delta _{i,j}\), be given as in Sect. 2. Note that if the \(\ell _1\) mean were defined according to
then the solution would be unnaturally determined componentwise through decoupled minimizations. By contrast, the following \(\ell _1\) mean for vectors [11], i.e., the geometric median [19],
minimizes, in a natural way, the \(\ell _1\) norm of Euclidean distances between the data points and the selected mean. The robustness of this measure in relation to the mean or median can be highlighted by the following simple example, which is illustrated in Fig. 1a. Here the data are given by
marked with \(\cdot \) in Fig. 1a, and the point (0, 1) may be regarded as an outlier from points otherwise lying on the line between (0, 0) and (1, 1). The componentwise mean of the data gives (0.375, 0.625) marked with \(\times \) in Fig. 1a. Then the componentwise median gives (0.25, 0.75) as marked with \(+\) in Fig. 1a. Finally, the measure defined in (3.4) gives the geometric median \({\bar{Y}} = (0.4996,0.5004)\) marked with \(\Box \) in Fig. 1a, where the smooth landscape for the merit function M is shown in Fig. 1b. Because of the natural result obtained by the geometric median in Fig. 1a, (3.4) will be used here for the \(\ell _1\) vector mean.
To compute the \(\ell _1\) mean of (3.4), the following iteration is used. For \(\tau >0\), compute iteratively \(D_l\in \mathbb {R}^{m\times n}\) and \({{\varvec{\mu }}}_l\in \mathbb {R}^m\) by
The motivation for this iteration is based upon the optimality conditions (6.9) derived later in Sect. 6. A continuum level steepest descent approach to solving (6.9) is given formally by
where (3.9) implicitly defines \({{\varvec{\mu }}}(t)\) so that the sum over j in (3.8) and hence \(\sum _{j=1}^n D(t){\hat{{\varvec{e}}}}_j\) remains 0. A semi-implicit time-stepping approach to solving (3.8)–(3.9) is given by
where (3.11) defines \({{\varvec{\mu }}}_{l+1}\) to correct the departure of \(\sum _{j=1}^n D_{l+1}{\hat{{\varvec{e}}}}_j\) in (3.10) from 0, as required by the last equation in (6.9). After rearranging terms, the scheme (3.6)–(3.7) is seen to be equivalent to (3.10)–(3.11). This derivation of the scheme (3.6)–(3.7) offers a heuristic explanation for the advantage of choosing \(\tau \) large so that the desired steady state of (3.8)–(3.9) is achieved rapidly. Further details of this iteration and its convergence analysis are given in Sect. 6. The \(\ell _1\)-mean is given by taking the limit,
After these calculations have been completed, the centered data are given by \(Y_\mathrm{c} = Y-{\bar{Y}}\), the counterpart to (2.6) with (3.4) replacing (2.7).
Note that the iteration (3.6)–(3.7) agrees with the Weiszfeld Algorithm for \(\tau \rightarrow \infty \) [21]. Since this limit may lead to undefined terms, the proposed scheme may be regarded as a regularized Weiszfeld iteration. This scheme is compared in Figs. 2 and 3 with two alternatives which are described next.
The first alternative approach is given by the Chambolle–Pock Algorithm [8], which can be written for \({{\varvec{\mu }}}_l,{{\varvec{\nu }}}_l \in \mathbb {R}^m\), \(D_l \in \mathbb {R}^{m\times n}\) as
In (3.13), \(P_B\) is a projection onto the set \(B=\{{{\varvec{v}}}\in \mathbb {R}^m: \Vert {{\varvec{\nu }}}\Vert _2 \le 1\}\) and the theoretical requirements are that \(\theta =1\) and \(0<\varsigma ,\tau < 1\) hold. Note that, as (3.5)–(3.6) can be seen as a semi-implicit time-stepping scheme for solving (3.8)–(3.9), (3.13) can be seen as an alternative explicit time-stepping scheme.
The second alternative approach is given by solving the dual problem \(\sup _{D\in A\cap C} \sum _{j=1}^n ({\hat{{\varvec{e}}}}_jD)^\top Y{\hat{{\varvec{e}}}}_j\), where \(A=\{D\in \mathbb {R}^{m\times n}: \Vert D{\hat{{\varvec{e}}}}_j\Vert _{\ell _2}\le 1,1\le j\le n\}\) und \(C = \{D\in \mathbb {R}^{m\times n}: \sum _{j=1}^n D{\hat{{\varvec{e}}}}_j=0\}\). The solution \(D^*\) is computed by a dual ascent iteration
where \(\tau >0\) is a stepsize and \(P_{A\cap C}\) is a projection onto \(A\cap C\), computed according to [4]. This ascent method can also be viewed as an explicit time-stepping scheme for solving the dual problem. After \(D^*\) is obtained, the fixed point iteration \({{\varvec{\mu }}}_{l+1} = (1/n)\sum _{j=1}^n[Y{\hat{{\varvec{e}}}}_j-\Vert {{\varvec{\mu }}}_l-Y{\hat{{\varvec{e}}}}_j\Vert _{\ell _2} D^*{\hat{{\varvec{e}}}}_j]\) is used to compute a solution \({{\varvec{\mu }}}^*\) to the optimality condition in (6.9).
The convergence of the three schemes, regularized Weiszfeld (3.6)–(3.7), Chambolle–Pock (3.13), and dual ascent (3.14), is compared in Fig. 2. All parameters have been chosen manually to minimize the number of iterations required for convergence. For these comparisons, the data \(Y\in \mathbb {R}^{m\times n}\), \(m=2\), \(n=1000\), are chosen so that the points \(\{Y{\hat{{\varvec{e}}}}_j\}_{j=1}^n\) are subgaussian (uniformly, Y = rand(m,n)) distributed in Fig. 2a and supergaussian (Laplacian, Y = log(rand(m,n)./rand(m,n))) distributed in Fig. 2b. Clearly, the regularized Weiszfeld iteration converges much more rapidly. Since additional work is required to determine \({{\varvec{\mu }}}^*\) from the result of the dual ascent method, only the regularized Weiszfeld and the Chambolle–Pock iterations are compared in Fig. 3 for the realistic computation of \({\bar{Y}}\) for the example of Sect. 9. Again, the regularized Weiszfeld iteration converges much more rapidly. This convergence performance combined with the popularity of the Weiszfeld Algorithm has motivated the focus on the scheme (3.6)–(3.7) and its analysis for this work.
4 \(\ell _1\) Approach to PCA
To present our approach, we start with some preliminaries. Unless otherwise specified, it is assumed that \(m\le n\). With V and \(\Lambda \) in (2.10) given in terms of components as \(V = \{{\hat{{{\varvec{v}}}}}_i\}_{i=1}^m\) and \(\Lambda = \text{ diag }\{\lambda _i\}_{i=1}^m\), respectively, define
and the projected data
Let \(S_k = {\mathcal R}(Y_k)\) where \({\mathcal R}\) denotes the range. For convenience, it is assumed here that the data \(Y_\mathrm{c}\) have maximal rank so that \(S_1 = \mathbb {R}^m\). Then \({{\varvec{0}}} = {{\varvec{v}}}^\mathrm{T}(I-V_{k-1}V_{k-1}^\mathrm{T})Y_\mathrm{c} = {{\varvec{v}}}^\mathrm{T}Y_k\) is equivalent to \({{\varvec{v}}} = V_{k-1}V_{k-1}^\mathrm{T}{{\varvec{v}}}\), which is equivalent to \({{\varvec{v}}} = {{\varvec{0}}}\) exactly when \(V_{k-1}^\mathrm{T}{{\varvec{v}}} = {{\varvec{0}}}\). Hence,
Before presenting the proposed robust measure for determining visually natural data axes, a motivation is given by reformulating the \(\ell _2\) eigenspace decomposition in (2.10) in terms of a least squares fit of an axis system to the cloud of data points. Given \(S_k\), let the kth column of V and the kth diagonal entry of \(\Lambda \) be determined inductively by the regression
where
Since minimizing
over \({\hat{{{\varvec{v}}}}} \in S_k\) with \( \Vert {\hat{{{\varvec{v}}}}}\Vert _{\ell _2} = 1\) is equivalent to maximizing the Rayleigh quotient \({{\varvec{v}}}^\mathrm{T}Y_kY_k^\mathrm{T}{{\varvec{v}}}/{{\varvec{v}}}^\mathrm{T}{{\varvec{v}}} = \Vert Y_k^\mathrm{T}{{\varvec{v}}}\Vert _{\ell _2}^2/\Vert {{\varvec{v}}}\Vert _{\ell _2}^2\) over the same set, \({\hat{{{\varvec{v}}}}}_k\) in (4.4) is the eigenvector of \(\frac{1}{n}Y_\mathrm{c}Y_\mathrm{c}^\mathrm{T}\) with the kth largest eigenvalue \(\lambda _k=\Vert {\hat{{{\varvec{v}}}}}_k^\mathrm{T}Y_\mathrm{c}/\sqrt{n}\Vert _{\ell _2}^2= \Vert {\hat{{{\varvec{v}}}}}_k^\mathrm{T}Y_k/\sqrt{n}\Vert _{\ell _2}^2\) as shown in (4.4).
We now aim for an appropriate \(\ell _1\)-variant of (4.4)–(4.5). The proposed approach to determine the orthogonal matrix V and the diagonal matrix \(\Lambda \) in (2.5) is to replace the sum of squared norms in (4.5) with a sum of norms in (4.8) below. The approach is reminiscent of (3.4) in the sense that while a geometric median point is selected by (3.4), a geometric median line is determined by (4.7). As for (4.5), let \(V_k\) be given by (4.1) and \(Y_k\) by (4.2). Then given \(S_k\) according to (4.3), let \({{\varvec{v}}}_k\) and \(\lambda _k\) be determined inductively by
where
The robustness of the \(\ell _1\) measure in \(H_k\) in relation to the \(\ell _2\) measure in \({\tilde{H}}_k\) is highlighted by the following simple example, which is illustrated in Fig. 4. Here the data are given by
marked with \(\cdot \) in Fig. 4a. The points \((0,\frac{1}{2}),(1,\frac{1}{2})\) may be regarded as outliers from points otherwise lying on the line between (0, 0) and (1, 1). The \(\ell _2\) data axes are given by (4.4) and are shown in Fig. 4a as dotted line segments. The \(\ell _1\) data axes are given by (4.7) and are shown in Fig. 4a as solid line segments. The landscape for the merit function \(H_1(\theta )= H_1({\hat{{{\varvec{v}}}}}(\theta ))\) of (4.8) with \({\hat{{{\varvec{v}}}}}(\theta ) = (\cos (\theta ),\sin (\theta )\) is shown in Fig. 4b, and the minimum is marked with \(\Box \); see Remark 2 concerning the regularity of the merit function.
The data axes defined by (4.7) are computed by the following scheme. For \(\tau >0\) and \(\rho >0\), compute iteratively \(D_l{\hat{{\varvec{e}}}}_j \in S_k\), \(j=1,\ldots ,n\), and \(\hat{\mathbf{v}}_l\in S_k\) with \(\Vert D_0{\hat{{\varvec{e}}}}_j\Vert _{\ell _2} \le 1, j=1,\ldots ,n\), and \(\Vert \hat{\mathbf{v}}_0\Vert _{\ell _2}=1\),
The motivation for this iteration and its convergence analysis are given in Sect. 7. Then the kth components of V and \(\Lambda \) are given, respectively, by taking the limit
and setting
After these calculations have been completed for \(k=1,\ldots ,m\), the \(\ell _1\) sphered data are given by \(Y_\mathrm{s} = \Lambda ^{-\frac{1}{2}}V^\mathrm{T}Y_\mathrm{c}\), the counterpart to (2.8).
Since the minimization problems (4.7) are performed over progressively smaller subspaces \(S_k\), it follows that \(\lambda _1\ge \lambda _2 \ge \cdots \ge \lambda _m\). Thus, as in Sect. 2, the \(\ell _1\)-variation \(\lambda _i\) of the data \(Y_\mathrm{c}\) along the axis \({{\varvec{v}}}_i\) is larger than the \(\ell _1\)-variation \(\lambda _j\) along the axis \({{\varvec{v}}}_j\) for \(i<j\). To select only the \(r<m\) components with respect to which the data have the most \(\ell _1\)-variation, define the projected data \(Y_P \approx Y\) by the counterpart to (2.11) where the matrices \({\bar{Y}}\), V, and \(\Lambda \) are now determined by \(\ell _1\) measures while the projector \(P \in \mathbb {R}^{r\times m}\) is defined as before with entries \(P_{i,j} = \delta _{i,j}\).
As the columns of V are determined sequentially, the earlier components of y are more strongly separated from other later components. Alternatively, all components of y may be estimated with roughly equal quality by determining all columns of V simultaneously through minimizing a sum of functionals of the form (4.8) for each column under the constraint that V be orthogonal [10, 17].
In case the data are undersampled and \(n<m\), the steps outlined in this section must be modified as follows. Let \({\bar{Y}}\) and \(Y_\mathrm{c}\) be determined as described following (3.4). Then \({\hat{V}} \in \mathbb {R}^{m\times n}\) is determined by solving (4.7) but for \(k=1,\ldots ,n\) where \(S_1\) is the range of \(Y_\mathrm{c}\). With \(\lambda _k = R_k({{\varvec{v}}}_k)\), set \({\hat{\Lambda }} = \text{ diag }\{\lambda _k\}_{k=1}^n\). The sphered data are then given by \({\hat{Y}}_\mathrm{s} = {\hat{\Lambda }}^{-\frac{1}{2}}{\hat{V}}^\mathrm{T}Y_\mathrm{c} \in \mathbb {R}^{n\times n}\). Finally, for a projector \(P \in \mathbb {R}^{n\times n}\), (2.11) becomes \(Y_P = {\bar{Y}} + {\hat{V}}{\hat{\Lambda }}^{\frac{1}{2}} P^\mathrm{T}P{\hat{\Lambda }}^{-\frac{1}{2}}{\hat{V}}^\mathrm{T}(Y-{\bar{Y}})\).
5 \(\ell _1\) Approach to ICA
While the Kurtosis has been used as a measure of Gaussianity in (2.18) to determine independent components, an alternative measure of independence is proposed here which is more robust in the presence of outliers. This approach targets independence directly in a manner which can be illustrated in terms of the example shown below in Fig. 5. Here the data are given by
where \(\sigma (t) = \text{ sign }(t)\). Let these data represent a realization of a random vector \({y} \in \mathbb {R}^2\) satisfying
where P denotes the probability. In order that the rotation dependent random vector
satisfy the independence condition,
a rotation angle \(\theta =\pi /4+k\pi /2\), \(k\in \mathbb {Z}\), must be chosen. For the determination of the proper rotation, four different measures of independence are compared in Fig. 5,
where \(K_m\) is given by (2.13) and (see [14])
The data axes obtained by maximizing the last three measures in (5.5) are identical and are shown in Fig. 9a as dotted line segments. The data axes obtained by maximizing the first measure in (5.5) are shown in Fig. 9a as solid line segments. The landscape for the merit function \(\Vert U(\theta )Y\Vert _{\ell _1}\) of (5.5) is shown in Fig. 9b and the maximum is marked with \(\Box \). Only the first measure in (5.5) is maximized at a desired angle as shown in the landscape of Fig. 9b. All other measures are maximized at a multiple of \(\pi \). The correct rotation is also obtained using (5.8) for the data presented in Sect. 9. While a more detailed statistical analysis is intended for a future work, it is presently on the basis of these examples that the measure shown below in (5.8) is proposed to determine the rotation matrix U of (2.14).
To achieve maximally independent rows of \(X_\mathrm{c} = UY_\mathrm{s}\), the rows of the orthogonal matrix \(U = \{\hat{{{\varvec{u}}}}_i^\mathrm{T}\}_{i=1}^m\) are determined as follows. Define \(U_l = \{\hat{{{\varvec{u}}}}_1,\ldots ,{\hat{{{\varvec{u}}}}}_l\}^\mathrm{T}\) as in (2.15) and the projected data \(Y_l = (I-U_{l-1}^\mathrm{T}U_{l-1})Y_\mathrm{s}\), \(l=2,\ldots ,m\), \(Y_1 = Y_\mathrm{s}\), as in (2.16). Let \(T_l = {\mathcal R}(Y_l)\) where \({\mathcal R}\) denotes the range. For convenience, it is assumed here that the data \(Y_\mathrm{s}\) have maximal rank so that \(T_1 = \mathbb {R}^m\). Then \({{\varvec{0}}} = {{\varvec{u}}}^\mathrm{T}(I-U_{l-1}^\mathrm{T}U_{l-1})Y_\mathrm{s} = {{\varvec{u}}}^\mathrm{T}Y_l\) is equivalent to \({{\varvec{u}}} = U_{l-1}^\mathrm{T}U_{l-1}{{\varvec{u}}}\), which is equivalent to \({{\varvec{u}}} = {{\varvec{0}}}\) exactly when \(U_{l-1}{{\varvec{u}}} = {{\varvec{0}}}\). Hence,
Given \(T_l\), let the lth column of \(U^\mathrm{T}\) be determined inductively by
where
In this way, the rows of U are determined sequentially so that the earlier components of x are more strongly separated from other later components. Alternatively, all components of x may be estimated with roughly equal quality by determining all rows of U simultaneously through maximizing a sum of functionals of the form (5.9) for each row under the constraint that U be orthogonal [14]. Once matrices U, \(\Lambda \), and V are determined, the source samples are estimated according to (2.4) and (2.5).
The robustness of the measure \(G_l\) in (5.9) in relation to the measure F in (2.20) is highlighted by the following simple example, which is illustrated in Fig. 6. Here the data Y are given by
marked with \(\cdot \) in Fig. 6a, and the points \((\pm 3,0)\) may be regarded as outliers from points otherwise lying at the diamond vertices \(\{(0,\pm 1),(\pm 1,0)\}\). The \(\ell _2\) data axes are given by (2.18) and are shown in Fig. 6a as dotted line segments. The \(\ell _1\) data axes are given by (5.8) and are shown in Fig. 6a as solid line segments, where the landscape for the merit function \(G_1\) of (5.9) is shown in Fig. 6b and the maximum is marked with \(\Box \).
The vectors defined by (5.8) are computed by the following scheme. For \(\tau > 0\), compute iteratively \(\hat{\mathbf{u}}_k \in T_l\) with \(\Vert \hat{\mathbf{u}}_0\Vert _{\ell _2} = 1\),
where
The motivation for this iteration and its convergence analysis are given in Sect. 8. The lth column of \(U^\mathrm{T}\) is given by taking the limit,
After these calculations have been completed for \(l=1,\ldots ,m\), the \(\ell _1\) maximally independent data are given by \(X_\mathrm{c} = UY_\mathrm{s}\), the counterpart to (2.14).
To select the \(r<m\) desired independent components \(\{q_1,\ldots ,q_r\}\) \(\subset \) \(\{1,\ldots ,m\}\), define the projected data \(Y_Q \approx Y\) by the counterpart to (2.21) where the matrices \({\bar{Y}}\), V, \(\Lambda \), and U are now determined by \(\ell _1\) measures, while the projector \(Q \in \mathbb {R}^{r\times m}\) is defined as before with entries \(Q_{ij} = \delta _{q_i,j}\).
In case the data are undersampled and \(n<m\), let \({\hat{V}} \in \mathbb {R}^{m\times n}\), \({\hat{Y}}_\mathrm{s},{\hat{\Lambda }} \in \mathbb {R}^{n\times n}\) be given as described at the end of Sect. 3. Then, the sphered data \({\hat{Y}}_\mathrm{s}\) are transformed by the rotation matrix \({\hat{U}} \in \mathbb {R}^{n\times n}\) maximizing independence of the rows of \({\hat{X}}_\mathrm{c} = {\hat{U}}{\hat{Y}}_\mathrm{s} \in \mathbb {R}^{n\times n}\). Thus, \({\hat{X}} = {\hat{U}}{\hat{\Lambda }}^{-\frac{1}{2}}{\hat{V}}^\mathrm{T}Y\) gives the maximum number of independent components which can be determined from the undersampled data. Finally, for a projector \(Q \in \mathbb {R}^{n\times n}\), (2.21) becomes \(Y_Q = {\bar{Y}} + {\hat{V}}{\hat{\Lambda }}^{\frac{1}{2}}{\hat{U}}^\mathrm{T} Q^\mathrm{T}Q{\hat{U}}{\hat{\Lambda }}^{-\frac{1}{2}}{\hat{V}}^\mathrm{T}(Y-{\bar{Y}})\).
6 Convergence of the Iterative Scheme for the \(\ell _1\) Mean
The analysis of the scheme (3.6)–(3.7) begins by establishing basic properties for the minimization problem (3.4), i.e., the determination of the geometric median. As indicated in Sect. 1, a proof of uniqueness of the geometric median is provided here which is shorter than that found in [21] or [19] and is based on strict convexity of the functional M. Also, a possibly novel characterization of a solution is provided in case the columns of Y are collinear, which means that Y can be expressed in the form
Lemma 1
If the columns of \(Y \in \mathbb {R}^{m\times n}\) are not collinear, then M is strictly convex.
Proof
The mapping M is the sum of convex mappings and hence convex itself. If M is not strictly convex, then there are vectors \({{{\varvec{\mu }}}_\mathbf{1}},{{{\varvec{\mu }}}_\mathbf{2}} \in \mathbb {R}^m\), and \({\varvec{\overline{\mu }}}= \overline{\alpha } {{{\varvec{\mu }}}_\mathbf{1}} + (1 - \overline{\alpha }) {{{\varvec{\mu }}}_\mathbf{2}}, \text { with } \overline{\alpha } \in (0,1)\) such that
and thus the mapping \(\alpha \rightarrow M (\alpha {{{\varvec{\mu }}}_\mathbf{1}} + (1- \alpha ){{{\varvec{\mu }}}_\mathbf{2}}),\, \alpha \in (0,1),\) is affine. Let \(\hat{\alpha } \in (0,1)\) be such that \(\hat{{{\varvec{\mu }}}} = \hat{\alpha } {{{\varvec{\mu }}}_\mathbf{1}} + (1-\hat{\alpha }){{{\varvec{\mu }}}_\mathbf{2}}\) does not coincide with any of the columns \(Y {\hat{{\varvec{e}}}}_i\). Then we find the Hessian
and hence for any \({{\varvec{x}}} \in \mathbb {R}^m\)
Note that \(|({\hat{{\varvec{\mu }}}} - Y {\hat{{\varvec{e}}}}_i)^\mathrm{T} {{\varvec{x}}}| \le \Vert {\hat{{\varvec{\mu }}}} - Y {\hat{{\varvec{e}}}}_i\Vert _{\ell _2} \, \Vert {{\varvec{x}}}\Vert _{\ell _2}.\) If \(|({\hat{{\varvec{\mu }}}} - Y {\hat{{\varvec{e}}}}_i)^\mathrm{T} {{\varvec{x}}}| = \Vert {\hat{{\varvec{\mu }}}} - Y {\hat{{\varvec{e}}}}_i\Vert \, \Vert {{\varvec{x}}}\Vert \) for all \(i = 1,\ldots ,n,\) then there exist \(b_i\in \mathbb {R}\) such that \({\hat{{\varvec{\mu }}}} - Y {\hat{{\varvec{e}}}}_i = b_i{{\varvec{x}}},\) for \(i=1,\ldots ,n\). Thus there exists \({{\varvec{b}}} \in \mathbb {R}^n\) such that \(Y = {{\varvec{\mu }}} {{\varvec{e}}}^\mathrm{T} - {{\varvec{b}}}{{\varvec{x}}}^\mathrm{T}\) which contradicts the assumption. Hence there exists at least one index i such that \(|({\hat{{\varvec{\mu }}}} - Y {\hat{{\varvec{e}}}}_i)^\mathrm{T} {{\varvec{x}}}| < \Vert {\hat{{\varvec{\mu }}}} - Y {\hat{{\varvec{e}}}}_i\Vert _{\ell _2} \, \Vert {{\varvec{x}}}\Vert _{\ell _2}\) and thus \({{\varvec{x}}}^\mathrm{T} \nabla ^2 M({\hat{{\varvec{\mu }}}}) {{\varvec{x}}} > 0.\) This contradicts that \(\alpha \rightarrow M (\alpha {{{\varvec{\mu }}}_\mathbf{1}} + (1 -\alpha ) {{{\varvec{\mu }}}_\mathbf{2}})\) is affine at \(\hat{\alpha }\). \(\square \)
Lemma 2
If the columns of \(Y \in \mathbb {R}^{m\times n}\) are collinear, then M is minimized (not necessarily uniquely) by \({{\varvec{\mu }}}^\star = {{\varvec{a}}} + {{\varvec{b}}} \cdot \text { median } ({{\varvec{y}}})\). If the columns of Y are not collinear, there exists a unique \({{\varvec{\mu }}}^\star \in \mathbb {R}^m\) minimizing M.
Proof
Existence of a solution \({{\varvec{\mu }}}^* \in \mathbb {R}^m\) follows by standard subsequential limit arguments. If the columns of Y are not collinear, uniqueness of the solution \({{\varvec{\mu }}}^\star \) follows from strict convexity of \({{\varvec{\mu }}} \rightarrow M ({{\varvec{\mu }}})\).
Suppose next that the columns of Y are collinear so that \(Y = {{\varvec{a}}}{{\varvec{e}}}^\mathrm{T} + {{\varvec{b}}}{{\varvec{y}}}^\mathrm{T}\). Set \(\nu = \Vert {{\varvec{b}}}\Vert _{\ell _2}\) and \({\varvec{w}} = {{\varvec{b}}} + \nu {\hat{{\varvec{e}}}}_1\) where \(({\hat{{\varvec{e}}}}_1)_i = \delta _{i1}\). Then the Householder transformation
is orthogonal and satisfies
as well as \(\Vert U{{\varvec{x}}}\Vert _{\ell _2} = \Vert {{\varvec{x}}}\Vert _{\ell _2}\), \(\forall {{\varvec{x}}} \in \mathbb {R}^m\). Let an arbitrary \({{\varvec{\mu }}} \in \mathbb {R}^m\) be represented as
and note that
With (6.5)–(6.7), the merit function can be written as
Once the merit function has been reduced to this one-dimensional form, it is readily seen that \(M({{\varvec{\mu }}}) \ge M({{\varvec{a}}}+\gamma {\varvec{b)}}\) for \(\gamma = \text{ median }({{\varvec{y}}})\), and hence M is minimized at \({{\varvec{a}}}+\gamma {{\varvec{b}}}\). That the minimizer is not necessarily unique can be seen from the case that n is even and the components of \({{\varvec{y}}}\) are distinctly ascending, so M has the same value \(M({{\varvec{a}}}+\gamma {{\varvec{b}}})\) at all points \({{\varvec{a}}}+{{\varvec{b}}}[ty_{n/2}+(1-t)y_{n/2+1}]\), \(t\in [0,1]\). \(\square \)
Lemma 3
The first-order necessary optimality condition for a minimizer \({{\varvec{\mu }}}^\star \) of M over \(\mathbb {R}^m\) is that there is \(D^\star \in \mathbb {R}^{m\times n}\) satisfying,
Proof
The necessary optimality condition for a minimizer \({{\varvec{\mu }}}^\star \) is that \(0 \in \partial M({{\varvec{\mu }}}^\star )\). By the chain rule (see, e.g., [2], p. 233), the subdifferential of M is given by the sum of the respective subdifferentials,
Thus, there exist \({\varvec{d}}_j^\star \in \partial \Vert {{\varvec{\mu }}}^\star -Y{\hat{{\varvec{e}}}}_j\Vert _{\ell _2}\), \(j=1,\ldots ,n\), satisfying
and
where B(0, 1) is the unit ball. Combining these facts, we have
The claim (6.9) follows with \(D^\star = \{{\varvec{d}}_1^\star ,\ldots ,{\varvec{d}}_n^\star \}\). \(\square \)
Turning to the iteration (3.6)–(3.7), we observe that if convergence to a fixed point \(\{D^\star ,{{\varvec{\mu }}}^\star \}\) can be guaranteed, then from (3.6) we have that
From (3.7) we have
Combining (6.14) and (6.15) gives
Moreover, if \(\Vert D_0{\hat{{\varvec{e}}}}_j\Vert _{\ell _2} \le 1, j=1,\ldots ,n,\) then the iterates \(\{D_l\}\) also satisfy this bound, and hence \(\Vert D^\star {\hat{{\varvec{e}}}}_j\Vert _{\ell _2} \le 1, j=1,\ldots ,n\). Thus \(\{D^\star ,{{\varvec{\mu }}}^\star \}\) must satisfy the necessary optimality condition (6.9). The following theorem provides sufficient conditions under which convergence of our algorithm to a fixed point can be guaranteed.
Theorem 1
Suppose that the columns of \(Y \in \mathbb {R}^{m\times n}\) are not collinear so that (6.1) does not hold. Let \(\{D^\star ,{{\varvec{\mu }}}^\star \}\) satisfy (6.9) with \({{\varvec{\mu }}}^\star \not \in \{Y{\hat{{\varvec{e}}}}_j\}_{j=1}^n\). Then \(\{D^\star ,{{\varvec{\mu }}}^\star \}\) is a fixed point for the iteration (3.6)–(3.7), and for \(\tau \) sufficiently large, iterates \(\{D_l,{{\varvec{\mu }}}_l\}\) converge to this fixed point when \(\{D_0,{{\varvec{\mu }}}_0\}\) starts the iteration close enough to \(\{D^\star ,{{\varvec{\mu }}}^\star \}\).
Proof
It will first be shown that \(\{D^\star ,{{\varvec{\mu }}}^\star \}\) is a fixed point for the iteration (3.6)–(3.7), which is locally asymptotically stable for \(\tau \) sufficiently large. Using (6.9) and substituting \(D_l = D^\star \) on the right side of (3.6),
gives \(D_{l+1} = D^\star \). Using this result together with (6.9) and setting \({{\varvec{\mu }}}_l = {{\varvec{\mu }}}^\star \) on the right side of (3.7),
gives \({{\varvec{\mu }}}_{l+1}={{\varvec{\mu }}}^\star \). Thus, \(\{D^\star ,{{\varvec{\mu }}}^\star \}\) is a fixed point of the iteration (3.6)–(3.7). To establish the stability of the fixed point, define
and
so that (3.6)–(3.7) is given by
The claimed stability will follow once it is shown that the Jacobian of this mapping evaluated at \(\{D^\star ,{{\varvec{\mu }}}^\star \} = \{{\varvec{d}}_1^\star ,\ldots ,{\varvec{d}}_n^\star ,{{\varvec{\mu }}}^\star \}\) has spectral radius less than 1 when \(\tau \) is sufficiently large. For (6.19),
and
For (6.20),
and
Thus, the Jacobian satisfies
The matrix \(B \in \mathbb {R}^{m\times m}\) is clearly symmetric positive semi-definite, and it will now be shown that its spectrum lies in [0, 1). Suppose there is an \({{\varvec{x}}} \in \mathbb {R}^m\) satisfying
Then there exists \({\varvec{\alpha }} = \{\alpha _i\}_{i=1}^n\) such that
violating the assumption that the columns of Y are not collinear. Thus, there can be no \({{\varvec{x}}}\) satisfying (6.27). This result implies strict inequality in the following estimate:
Thus, the spectral radius of B is less than 1. Let a rotation matrix P be chosen so that \(P^\mathrm{T}BP = \Lambda =\text{ diag }\{\lambda _i\}_{i=1}^m\) where \(\lambda _i \in [0,1)\). Then the matrix J in (6.26) satisfies
proving that the spectrum of J lies in [0, 1). Since the Jacobian in (6.26) is arbitrarily well approximated by J when \(\tau \) is sufficiently large, the spectral radius of the Jacobian must be less than 1 for \(\tau \) sufficiently large. Because of the assumption \({{\varvec{\mu }}}^\star \not \in \{Y{\hat{{\varvec{e}}}}_j\}_{j=1}^n\), the mapping (6.19)–(6.20) is smooth in a neighborhood of the fixed point \(\{D^\star ,{{\varvec{\mu }}}^\star \}\), and hence iterates \(\{D_l,{{\varvec{\mu }}}_l\}\) converge to the fixed point when started sufficiently close to it.
Remark 1
Computations demonstrate that the iteration (3.6) – (3.7) converges to a minimizer of M for all \(\tau >0\) and even when the condition \({{\varvec{\mu }}}^\star \not \in \{Y{\hat{{\varvec{e}}}}_j\}_{j=1}^n\) is violated. Furthermore, while the uniqueness of the minimizer is not guaranteed when the non-collinearity condition is violated, the iteration is found to converge to the median shown in Lemma 1 when \(\tau \) is sufficiently small.
7 Convergence of the Iterative Scheme for \(\ell _1\) PCA
The analysis of the scheme (4.10)–(4.11) begins with establishing the existence of a minimizer for \(H_k\) in (4.8). Recall the assumption in Sect. 4 that \(S_1 = {\mathcal R}(Y_\mathrm{c}) = \mathbb {R}^m\) so that \(S_k = {\mathcal R}(Y_k) = {\mathcal R}(V_{k-1})^\perp \), \(k=2,\ldots ,m\), as seen in (4.3). Also, define
Lemma 4
For \(H_k\) in (4.8), there exists a minimizer \({\hat{{{\varvec{v}}}}}^\star \) over \(S_k\) which satisfies \({\hat{{{\varvec{v}}}}}^\star \in \mathbb {S}_k\). Moreover, \(\sum _{j=1}^n \Vert Y_k{\hat{{\varvec{e}}}}_j\Vert _{\ell _2}\) gives an upper bound for the minimum.
Proof
Because of the properties of projections, it follows that \(\Vert ({\hat{{{\varvec{v}}}}}{\hat{{{\varvec{v}}}}}^\mathrm{T}-I)Y_k{\hat{{\varvec{e}}}}_j\Vert _{\ell _2} \le \Vert Y_k{\hat{{\varvec{e}}}}_j\Vert _{\ell _2}\) holds \(\forall {\hat{{{\varvec{v}}}}} \in \mathbb {S}_k\) and \(\forall j\). By (4.8), \(H_k({{\varvec{v}}})\le H_k({{\varvec{0}}})\) holds \(\forall {{\varvec{v}}} \in S_k\). Thus, \(H_k\) is not strictly minimized at \({{\varvec{v}}}={{\varvec{0}}}\). According to (4.8), \(H_k\) is constant along rays outside the origin. Therefore, the minimization can as well be carried out over \(\mathbb {S}_k\). The claim follows since \(\mathbb {S}_k\) is compact and \(H_k\) is continuous on \(\mathbb {S}_k\). \(\square \)
Lemma 5
The first-order necessary optimality condition for a minimizer \({\hat{{{\varvec{v}}}}}^\star \) of \(H_k\) over \(S_k\) satisfying \({\hat{{{\varvec{v}}}}}^\star \in \mathbb {S}_k\) as given by Lemma 4 is that there exists \(D^\star \in \mathbb {R}^{m\times n}\) satisfying
Proof
The necessary optimality condition for a minimizer \({{\varvec{v}}}^\star \) is
where \(\partial \) denotes the Clarke derivative. Thus, there exist \({{\varvec{b}}}_j^\star \in \partial _{{{\varvec{v}}}}\Vert ({{\varvec{v}}}{{\varvec{v}}}^\mathrm{T}/\Vert {{\varvec{v}}}\Vert _{\ell _2}^2-I) Y_k{\hat{{\varvec{e}}}}_j\Vert _{\ell _2}|_{{{\varvec{v}}}={\varvec{v}}^\star }\), \(j=1,\ldots ,n\), satisfying
By the chain rule (see, e.g., [9], Propositions 2.2.7, 2.3.2 and Theorem 2.3.9), the respective Clarke derivatives are given according to
where
and
with the unit ball B(0, 1). Let \({\varvec{c}}_j^\star \) be chosen so that
and
According to (7.8),
With (7.9), it follows for \(Y_k{\hat{{\varvec{e}}}}_j \ne {\hat{{{\varvec{v}}}}}^\star {\hat{{{\varvec{v}}}}}^\star {}^\mathrm{T}Y_k{\hat{{\varvec{e}}}}_j\),
and for \(Y_k{\hat{{\varvec{e}}}}_j = {\hat{{{\varvec{v}}}}}^\star {\hat{{{\varvec{v}}}}}^\star {}^\mathrm{T}Y_k{\hat{{\varvec{e}}}}_j\),
Define
Combining (7.4), (7.11), (7.12), and (7.13) gives
The claim (7.2) follows with \(D^\star = \{{\varvec{d}}_1^\star ,\ldots ,{\varvec{d}}_n^\star \}\). \(\square \)
Turning to the iteration (4.10)–(4.11), we observe that if convergence to a fixed point \(\{D^\star ,{\hat{{{\varvec{v}}}}}^\star \}\) with \(\Vert {\hat{{{\varvec{v}}}}}^\star \Vert _{\ell _2}=1\) can be guaranteed, then from (4.10) we have that \({\hat{{{\varvec{v}}}}}^\star {}^\mathrm{T}D_j^\star {\hat{{\varvec{e}}}}_j = 0\) and
According to (4.11), the fixed point \({\hat{{{\varvec{v}}}}}^\star \) satisfies
where \({{\varvec{v}}}^\star \) is defined by the right side of (7.16). Applying \(({\hat{{{\varvec{v}}}}}^\star {\hat{{{\varvec{v}}}}}^\star {}^\mathrm{T}-I)\) to both sides of (7.16) gives
Combining (7.15) and (7.17) gives
Moreover, if \(\Vert D_0{\hat{{\varvec{e}}}}_j\Vert _{\ell _2} \le 1, j=1,\ldots ,n,\) then the iterates \(\{D_l\}\) also satisfy this bound, and hence \(\Vert D^\star {\hat{{\varvec{e}}}}_j\Vert _{\ell _2} \le 1, j=1,\ldots ,n\). Thus, \(\{D^\star ,{\hat{{{\varvec{v}}}}}^\star \}\) must satisfy the necessary optimality condition (7.2). The following theorem provides sufficient conditions under which convergence of our algorithm to a fixed point can be guaranteed.
Theorem 2
Let \(\{D^\star ,{\hat{{{\varvec{v}}}}}^\star \}\) satisfy (7.2) with \({\hat{{{\varvec{v}}}}}^\star \in \mathbb {S}_k\) and suppose
Then \(\{D^\star ,{\hat{{{\varvec{v}}}}}^\star \}\) is a fixed point of the iteration (4.10)–(4.11), and for \(\tau \) sufficiently large and for \(\rho \) sufficiently small, the iterates \(\{D_l,{\hat{{{\varvec{v}}}}}_l\}\) converge to this fixed point when \(\{D_0,{\hat{{{\varvec{v}}}}}_0\}\) starts the iteration close enough to \(\{D^\star ,{\hat{{{\varvec{v}}}}}^\star \}\).
Proof
It will first be shown that \(\{D^\star ,{\hat{{{\varvec{v}}}}}^\star \}\) is a fixed point for the iteration (4.10)–(4.11), which is locally asymptotically stable for \(\tau \) sufficiently large and for \(\rho \) sufficiently small. Using (7.2) and substituting \(D_l = D^\star \) and \(\hat{\mathbf{v}}_l = {\hat{{{\varvec{v}}}}}^\star \) on the right side of (4.10),
gives \(D_{l+1} = D^\star \). Also by (7.2),
and hence
satisfies \({{\varvec{v}}}^\star = {\hat{{{\varvec{v}}}}}^\star \Vert {{\varvec{v}}}^\star \Vert _{\ell _2}\). Thus, setting \(D_l = D^\star \) and \(\hat{\mathbf{v}}_l = {\hat{{{\varvec{v}}}}}^\star \) on the right side of (4.11) gives \(\hat{\mathbf{v}}_{l+1} = {\hat{{{\varvec{v}}}}}^\star \). Therefore, \(\{D^\star ,{\hat{{{\varvec{v}}}}}^\star \}\) is a fixed point of the iteration (4.10)–(4.11).
To establish the stability of the fixed point, define
and
so that (4.10)–(4.11) is given by
The claimed stability will follow once it is shown that the Jacobian of this mapping from the \((n+1)\)-fold Cartesian product \((S_k)^{n+1}\) into \((S_k)^{n+1}\) evaluated at \(\{D^\star ,{\hat{{{\varvec{v}}}}}^\star \} = \{{\varvec{d}}_1^\star ,\ldots ,{\varvec{d}}_n^\star ,{\hat{{{\varvec{v}}}}}^\star \}\) has only eigenvalues with magnitude less than 1 when \(\tau \) is sufficiently large and \(\rho \) is sufficiently small. For (7.23),
and
For (7.24),
and with \(\hat{{{\varvec{g}}}}^\star = {{\varvec{g}}}({\varvec{d}}_1^\star ,\ldots , {\varvec{d}}_n^\star ,{\hat{{{\varvec{v}}}}}^\star ) = {\hat{{{\varvec{v}}}}}^\star \),
Also
and
Thus, the Jacobian satisfies
By the definition \(S_k = {\mathcal R}(Y_k)\) prior to (4.3), it follows from (7.27) that \(A_j:S_k\rightarrow S_k\), \(j=1,\ldots ,n\), and from (7.31) that \(B:S_k\rightarrow S_k\). In the same way, it follows from (7.32) that \(J:(S_k)^{n+1}\rightarrow (S_k)^{n+1}\). It will be shown that \(B:S_k\rightarrow S_k\) has spectrum in \((-1,1)\). Let B be expressed as \(B = (I-{\hat{{{\varvec{v}}}}}^\star {\hat{{{\varvec{v}}}}}^\star {}^\mathrm{T}) - \rho \hat{C}\) where
Suppose there were \({{\varvec{v}}} \in S_k\) for which \({{\varvec{v}}}^\mathrm{T}C{{\varvec{v}}} = 0\). Then by (4.2) and (4.3), \({{\varvec{v}}}^\mathrm{T}Y_k={{\varvec{v}}}^\mathrm{T}(I-V_{k-1}V_{k-1}^\mathrm{T})Y_\mathrm{c} = {{\varvec{v}}}^\mathrm{T}Y_\mathrm{c}\) and with (7.19),
and \({{\varvec{v}}}^\mathrm{T}Y_\mathrm{c}=0\) would contradict the assumption that \(S_1 = {\mathcal R}(Y_\mathrm{c}) = \mathbb {R}^m\). Thus, there exist \(0<\lambda _\mathrm{min}\le \lambda _\mathrm{max}\) such that C satisfies
Let \(S_k = U_k \oplus W_k\) where \(W_k = \text{ span }\{{\hat{{{\varvec{v}}}}}^\star \}\). Clearly, \(\lambda = 0\) is an eigenvalue of B associated with the eigenvector \({\hat{{{\varvec{v}}}}}^\star \in W_k\). Eigenvalues for eigenvectors in \(U_k\) will now be estimated. Let \(\rho \) be small enough that \(-1 < 1-\rho \lambda _\mathrm{max} < 1-\rho \lambda _\mathrm{min} < 1\). Then for \({{\varvec{x}}} \in U_k\) it follows with the definition of \(\hat{C}\) and C that \({{\varvec{x}}}^\mathrm{T}B{{\varvec{x}}} = \Vert {{\varvec{x}}}\Vert _{\ell _2}^2 - \rho {{\varvec{x}}}^\mathrm{T}\hat{C}{{\varvec{x}}} = \Vert {{\varvec{x}}}\Vert _{\ell _2}^2 - \rho {{\varvec{x}}}^\mathrm{T}C{{\varvec{x}}}\) and
Thus, the spectrum of \(B:S_k\rightarrow S_k\) lies in \((-1,1)\). Recalling that the dimension of \(S_k\) is \(m-k+1\), let \(P \in \mathbb {R}^{m\times (m-k+1)}\) be chosen with orthonormal columns such that \(P^\mathrm{T}B P= \Lambda = \text{ diag }\{\lambda _i\}_{i=1}^{m-k+1}\) where \(\lambda _i \in (-1,1)\). Also set \(Z \in \mathbb {R}^{m\times (m-k+1)}\) with all zero entries. Further let \(I\in \mathbb {R}^{m\times m}\) and \(0\in \mathbb {R}^{m\times m}\) in (7.37) below denote the identity and the zero matrix, respectively. Then the matrix J satisfies
proving that the spectrum of J lies in \((-1,1)\). Since the Jacobian in (6.26) is arbitrarily well approximated by J when \(\tau \) is sufficiently large, its spectrum lies strictly within the ball of radius 1 for \(\tau \) sufficiently large and for \(\rho \) sufficiently small. Because of the assumption (7.19), the mapping (7.23)–(7.24) is smooth in a neighborhood of the fixed point \(\{D^\star ,{\hat{{{\varvec{v}}}}}^\star \}\), and hence the iterates \(\{D_l,{\hat{{{\varvec{v}}}}}_l\}\) converge to the fixed point when started sufficiently close to it. \(\square \)
Remark 2
Computations demonstrate that the iteration (4.10)–(4.11) can converge to a minimizer for \(H_k\) even when the condition (7.19) is violated. Such a case is illustrated in Fig. 4. Variations of the example illustrated in Fig. 4, in which data points lie on the boundary of a rectangle instead of a square, indicate an advantage to having sphered the data by \(\ell _2\) means according to (2.8) and (2.10) before proceeding with the methods of Sect. 4.
8 Convergence of the Iterative Scheme for \(\ell _1\) ICA
The analysis of the scheme (5.11) begins with establishing existence of a maximizer for \(G_l\) in (5.9). Recall the assumption in Sect. 5 that \(T_1 = {\mathcal R}(Y_\mathrm{s}) = \mathbb {R}^m\) so that \(T_l = {\mathcal R}(Y_l) = {\mathcal R}(U_{l-1}^\mathrm{T})^\perp \), \(l=2,\ldots ,m\), as seen in (5.7). Also, define
Lemma 6
For \(G_l\) in (5.9), there exists a maximizer \({\hat{{{\varvec{u}}}}}^\star \) over \(T_l\) satisfying \({\hat{{{\varvec{u}}}}}^\star \in \mathbb {T}_l\).
Proof
Since \({\hat{{{\varvec{u}}}}}^\mathrm{T}Y_l = {\hat{{{\varvec{u}}}}}^\mathrm{T} (I-U_{l-1}^\mathrm{T}U_{k-1})Y_\mathrm{s} = {\hat{{{\varvec{u}}}}}^\mathrm{T}Y_\mathrm{s}\) holds for each l, the assumption \(T_1 = {\mathcal R}(Y_\mathrm{s}) = \mathbb {R}^m\) implies that there is a \({\hat{{{\varvec{u}}}}} \in \mathbb {T}_l\) satisfying \(G_l({\hat{{{\varvec{u}}}}})\ne 0\). Thus, \(G_l\) is not maximized at \({{\varvec{u}}}={{\varvec{0}}}\). According to (5.9), \(G_l\) is constant along rays outside the origin. Thus, the maximization can as well be carried out on \(\mathbb {T}_l\). The claim follows since \(\mathbb {T}_l\) is compact and \(G_l\) is continuous on \(\mathbb {T}_l\). \(\square \)
Lemma 7
For any \({{\varvec{u}}} \in T_l \backslash \{0\}\), the directional derivative of \(G_l\) in the direction of \({\varvec{w}} \in T_l\) is given by
where
Proof
For \({{\varvec{u}}} \in T_l\backslash \{0\}\), \({\varvec{w}} \in T_l\), and \(t>0\) sufficiently small that \({{\varvec{u}}} +t{\varvec{w}} \ne 0\),
For \({\hat{{\varvec{e}}}}_j^\mathrm{T}Y_l^\mathrm{T}{{\varvec{u}}} \ne 0\), the terms of the first sum in (8.4) satisfy
and for \({\hat{{\varvec{e}}}}_j^\mathrm{T}Y_l^\mathrm{T}{{\varvec{u}}} = 0\),
The terms of the second sum in (8.4) satisfy
Combining these calculations gives (8.2). \(\square \)
Lemma 7 is now used to prove Lemma 8. For the following, let \(\sigma ({{\varvec{v}}}) = \{\sigma (v_i)\}\) where \({{\varvec{v}}} = \{v_i\}\) and \(\sigma (t) = \text{ sign }(t)\).
Lemma 8
The first-order necessary optimality condition for a maximizer \({\hat{{{\varvec{u}}}}}^\star \) of \(G_l\) over \(T_l\) satisfying \({\hat{{{\varvec{u}}}}}^\star \in \mathbb {T}_l\) as given by Lemma 6 is
and the sets \({\mathcal Y} = \{j: Y_l{\hat{{\varvec{e}}}}_j={{\varvec{0}}}\}\) and \({\mathcal S} = \{j: {{\varvec{u}}}^\star {}^\mathrm{T}Y_l {\hat{{\varvec{e}}}}_j=0\}\) agree. As a consequence, \(\exists \epsilon > 0\) such that
Proof
Let \({\hat{{{\varvec{u}}}}}^\star \in \mathbb {T}_l\) be a maximizer for \(G_l\) guaranteed by Lemma 6. By (8.2),
By decomposing sums into indices in \({\mathcal S}\) and \({\mathcal S}^\mathrm{c}\),
where
Combining (8.10) and (8.11), it follows
This implies the existence of \(\{\gamma _j\}_{j \in {\mathcal S}}\) (possibly zero) such that
Now with \({\varvec{w}} = {{\varvec{v}}}^\star \), combining (8.10), (8.11), and (8.14) gives
or \({{\varvec{v}}}^\star = {{\varvec{0}}}\). With (8.12), the optimality condition (8.8) follows.
According to (8.14), \({\mathcal S} \subset {\mathcal Y}\) holds. Since \({\mathcal Y} \subset {\mathcal S}\) always holds, it follows that \({\mathcal Y} = {\mathcal S}\). Note that
is smooth for \({{\varvec{u}}} \in B({\hat{{{\varvec{u}}}}}^\star ,\epsilon )\) for some \(\epsilon > 0\). As a result, \(G_l\) is smooth in \(B({\hat{{{\varvec{u}}}}}^\star ,\epsilon )\) and can be differentiated directly to obtain (8.9). \(\square \)
Turning to the iteration (5.11), we observe that if convergence to a fixed point \({\hat{{{\varvec{u}}}}}^\star \) can be guaranteed, then multiplying \({\hat{{{\varvec{u}}}}}^\star {}^\mathrm{T}\) by \(\Vert {{\varvec{u}}}^\star \Vert {\hat{{{\varvec{u}}}}}^\star = {{\varvec{u}}}^\star = {\hat{{{\varvec{u}}}}}^\star + \tau [Y_l\sigma (Y_l^\mathrm{T}{\hat{{{\varvec{u}}}}}^\star ) - \hat{{{\varvec{u}}}}^\star \Vert {\hat{{{\varvec{u}}}}}^\star {}^\mathrm{T}Y_l\Vert _{\ell _1}]\) gives
where the last equality follows with \((Y_l^\mathrm{T}{\hat{{{\varvec{u}}}}}^\star )^\mathrm{T}\sigma (Y_l^\mathrm{T}{\hat{{{\varvec{u}}}}}^\star ) = \Vert {\hat{{{\varvec{u}}}}}^\star {}^\mathrm{T}Y_l\Vert _{\ell _1}\). Since \(\Vert {{\varvec{u}}}^\star \Vert _{\ell _2} = \Vert {\hat{{{\varvec{u}}}}}^\star \Vert _{\ell _2} = 1\) holds, it follows that \({{\varvec{u}}}^\star = {\hat{{{\varvec{u}}}}}^\star \) and hence \({\hat{{{\varvec{u}}}}}^\star \) satisfies (8.8). The following theorem provides sufficient conditions under which convergence of our algorithm to a fixed point can be guaranteed.
Theorem 3
Let \({\hat{{{\varvec{u}}}}}^\star \in \mathbb {T}_l\) satisfy (8.8) where the set \({\mathcal S} = \{j: {\hat{{\varvec{e}}}}_j^\mathrm{T}Y_l^\mathrm{T}{\hat{{{\varvec{u}}}}}^\star = 0\}\) is empty. Then \({\hat{{{\varvec{u}}}}}^\star \) is a fixed point of the iteration (5.11), and for \(\tau \) sufficiently small, the iterates \(\{{\hat{{{\varvec{u}}}}}_k\}\) converge to this fixed point when \(\{{\hat{{{\varvec{u}}}}}_0\}\) starts the iteration close enough to \(\{{\hat{{{\varvec{u}}}}}^\star \}\).
Proof
It will first be shown that \(\{{\hat{{{\varvec{u}}}}}^\star \}\) is a fixed point for the iteration (5.11) which is locally asymptotically stable for \(\tau \) sufficiently small. Setting \(\hat{\mathbf{u}}_{k+1} = {\hat{{{\varvec{u}}}}}^\star \) on the right side of (5.11) and using (8.8) shows that \(\mathbf{u}_{k+1} = {\hat{{{\varvec{u}}}}}^\star \). Hence with \(\Vert \mathbf{u}_{k+1}\Vert _{\ell _2} = \Vert {\hat{{{\varvec{u}}}}}^\star \Vert _{\ell _2} = 1\) it follows that \(\hat{\mathbf{u}}_{k+1} = \mathbf{u}_{k+1} = {\hat{{{\varvec{u}}}}}^\star \), and thus \({\hat{{{\varvec{u}}}}}^\star \) is a fixed point. To establish the stability of the fixed point, define
so that (5.11) is given by
The claimed stability will follow once it is shown that the Jacobian of this mapping evaluated at \({\hat{{{\varvec{u}}}}}^\star \) has spectral radius less than 1 when \(\tau \) is sufficiently small. The Jacobian is given by
where with \(\{j: {\hat{{\varvec{e}}}}_j^\mathrm{T}Y_l^\mathrm{T}{{\varvec{u}}}=0\} = \emptyset \),
It follows with (8.8) that \({{\varvec{g}}}({\hat{{{\varvec{u}}}}}^\star ) = \hat{{{\varvec{u}}}}^\star \) and thus \(\Vert {{\varvec{g}}}({\hat{{{\varvec{u}}}}}^\star )\Vert _{\ell _2} = \Vert {\hat{{{\varvec{u}}}}}^\star \Vert _{\ell _2} = 1\). The Jacobian of \({{\varvec{G}}}\) at \({\hat{{{\varvec{u}}}}}^\star \) is symmetric and is given by
For \({{\varvec{v}}} \in \mathbb {R}^m\), it follows with \(0\le \Vert (I-{\hat{{{\varvec{u}}}}}^\star \hat{{{\varvec{u}}}}^\star {}^\mathrm{T}){{\varvec{v}}}\Vert _{\ell _2}^2 = {{\varvec{v}}}^\mathrm{T}(I-{\hat{{{\varvec{u}}}}}^\star \hat{{{\varvec{u}}}}^\star {}^\mathrm{T}){{\varvec{v}}} \le \Vert {{\varvec{v}}}\Vert _{\ell _2}^2\) that the Jacobian satisfies
provided that \(1>\tau \Vert {\hat{{{\varvec{u}}}}}^\star {}^\mathrm{T}Y_l\Vert _{\ell _1}\). For any such \(\tau \), the spectral radius of the Jacobian in (8.22) is less than 1. Because of the assumption that \({\mathcal S}\) is empty, the mapping (8.18) is smooth in a neighborhood of the fixed point \(\{{\hat{{{\varvec{u}}}}}^\star \}\), and hence the iterates \(\{{\hat{{{\varvec{u}}}}}_k\}\) converge to the fixed point when started sufficiently close to it. \(\square \)
Remark 3
Computations demonstrate that the iteration (5.11) converges to a maximizer for \(G_l\) even when the condition \({\mathcal S} = \{j: {\hat{{\varvec{e}}}}_j^\mathrm{T}Y_l^\mathrm{T}{\hat{{{\varvec{u}}}}}^\star = 0\} = \emptyset \) is violated.
9 Application to DCE-MRI Sequences
In this section, the proposed methods are applied to the dynamic contrast-enhanced magnetic resonance image (DCE-MRI) sequence http://math.uni-graz.at/keeling/manuskripten/dcemri.mpg to separate intensity changes due to physiological motion from those due to contrast agent. With such a separation, unavoidable physiological motion may be removed in order to investigate tissues in a stationary state. See also [20] and [22]. To focus on the period in which contrast agent arrives in the imaged tissues, only the first 40 of 134 frames are used for the following decompositions. Each frame consists of an image with \(400 \times 400\) pixels. Thus, in the notation of Sect. 2, the data are \(Y \in \mathbb {R}^{m\times n}\) with \(m=400^2 \gg 40 = n\). For a static display of the DCE-MRI sequence, representative stages are shown in Fig. 7: exhale and inhale, with and without contrast agent. Specifically, with \({\bar{Y}},V,\Lambda \) given by (2.7) and (2.10), the images of Fig. 7 are given by
and the images \({\bar{Y}}\) and
are shown in Fig. 8. Brightness changes in relation to the background are seen throughout organs in Fig. 8b, and this suggests that \({\hat{{{\varvec{v}}}}}_1\) represents intensity changes in the DCE-MRI sequence due to contrast agent. On the other hand, brightness changes are seen mainly on the edges of organs in Fig. 8c, and this suggests that \({\hat{{{\varvec{v}}}}}_2\) represents intensity changes in the DCE-MRI sequence due to physiological motion. The image sequence is shown more dynamically in Fig. 9. The graphs in the left column are the time courses \({\hat{{{\varvec{v}}}}}_i^\mathrm{T}Y\), \({\hat{{{\varvec{v}}}}}_i^\mathrm{T}Y_\mathrm{s}\), \({\hat{{{\varvec{v}}}}}_i^\mathrm{T}X_\mathrm{c}\), \(i=1,2\), for the raw, sphered and independent data, respectively, where \(Y_\mathrm{s}\) and \(X_\mathrm{c}\) are given by (2.8) and (2.14), respectively. The graphs in the right column are corresponding plots in a phase plane. To determine the most significant independent components, all but the top two principal components were discarded. Then V was replaced by its first two columns, \(Y_\mathrm{s}\) by its first two rows and \(\Lambda \) by a diagonal matrix containing the largest two eigenvalues. Also, the independent images
are shown in Fig. 9g, h. As explained in connection with Fig. 8, these can be associated respectively with intensity changes in the DCE-MRI sequence due to contrast agent and to physiological motion. Note that there are only small differences between Figs. 9c and e, between Figs. 9d and f, between Figs. 8b and 9g and between Figs. 8c and 9h. Thus, the separation of intensity changes due to physiological motion from those due to contrast agent is achieved here already with the sphered data. Hence the transformation to independent data had little effect for this particular example. Recall from Sect. 2 that the order and the sign of ICA components are not uniquely determined.
This separation will now be considered in the presence of outliers. As seen in the full DCE-MRI sequence, an excessively bright frame may appear suddenly. To simulate this effect, intensities of the final frame of the sequence are increased by a constant factor. Then the same methods used for Fig. 9 are applied to the corrupted data, and the results are shown in Fig. 10 with the same format as that used in Fig. 9. The corrupted data may be seen at the final time shown in the graphs of the first column of Fig. 10. Also the outlier is conspicuous in the phase plane graphs in the right column of Fig. 10. Finally, the images defined by (9.3) with the corrupted data are shown in Figs. 10g and h. Since these clearly differ from their counterparts in Figs. 9g and h, the presence of the single outlier has corrupted the separation of intensity changes due to physiological motion from those due to contrast agent. We would like to report that we obtained results comparable to those seen in Fig. 10 by applying the FastICA code [13] using optional robust merit functions.
For comparison, the \(\ell _1\)-based methods of Sects. 3–5 are now applied to the corrupted data, and the results are shown in Fig. 11 with the same format as that used in Figs. 9 and 10. Now the matrices \(\overline{V}\), \(Y_\mathrm{c}\), V, \(\Lambda \), \(Y_\mathrm{s}\), and U are understood as explained in Sects. 3–5 as well as in Remark 2. As in the previous cases, all but the top two principal components are discarded and the respective matrices are reduced correspondingly to have only two rows or columns or both. Then (9.2) and (9.3) apply with these \(\ell _1\)-based matrices. As before, the corrupted data may be seen at the final time shown in the graphs of the first column of Fig. 11. Also the outlier is conspicuous in the phase plane graphs in the right column of Fig. 11. Finally, the images defined by (9.3) with the corrupted data are shown in Figs. 11g and h. Note the similarities between Figs. 11e–h and their counterparts in Figs. 9e–h. On this basis, the \(\ell _1\) methods can be seen to have successfully separated intensity changes due to physiological motion from those due to contrast agent in spite of the outlier. With this separation, the data are projected onto the single independent component of Fig. 11g using (2.21) to produce the following transformed DCE-MRI sequence manifesting contrast changes free of physiological motion,
http://math.uni-graz.at/keeling/manuskripten/dcemri_ica.mpg.
10 Conclusion
In this work, robust measures have been introduced for centering complex data in the presence of outliers and for determining principal and independent components of such data. The approach to centering is to use the geometric median. The approach for determining principal components is to find best fit lines through the data, where each line minimizes the sum of distances (not squared) to data points in the subspace orthogonal to other components. The approach for determining independent components is first to sphere the data so that the corresponding axes are aligned with clusters, and then to determine independent axes as those which separate sphered clusters as much as possible. This separation is accomplished by maximizing an \(\ell _1\) counterpart to Rayleigh quotients. To optimize the respective merit functions, iterative methods were proposed and their local convergence was proved. Illustrative examples were presented to demonstrate the benefits of the robust approaches. Finally, the proposed methods were applied to a DCE-MRI sequence to separate intensity changes due to physiological motion from those due to contrast agent, and benefits of the robust methods have been demonstrated with respect to this realistic example.
References
Ash, R.B.: Real Analysis and Probability. Academic Press, New York (1972)
Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, New York (2010)
Bose, P., Maheshwari, A., Morin, P.: Fast approximations for sums of distances, clustering and the Fermat Weber problem. Computational Geometry: Theory and Applications 24(3), 135–146 (2003)
Boyle, J.P., Dykstra, R.L.: A Method for Finding Projections onto the Intersection of Convex Sets in Hilbert Spaces. Lecture Notes in Statistics 37, 28–47 (1986)
Brooks, J.P., Dulá, J.H., Boone, E.L.: A pure \(L_1\)-norm principal comoponent analysis. Computational Statistics and Data Analysis 61, 83–98 (2013)
Cai, T.T., Ma, Z., Wu, Y.: Sparse PCA: Optimal rates and adaptive estimation. Annals of Statistics 41(6), 3074–3110 (2013)
Candés, E.J., Li, X., Ma, Y. and Wright, J.: Robust Principal Component Analysis?, JACM, 58(3), Art. 11, May (2011)
Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging. JMIV 40(1), 120–145 (2011)
Clarke, F.H.: Optimization and Nonsmooth Analysis, Wiley Interscience, (1983)
Ding, C. and Zhou, D. and He, X. and Zha, H.: \(R_1\)-pca: Rotation invariant \(L_1\)-norm principal component analysis for robust subspace factorization, Proceedings of the 23rd International Coference on Machine Learning, pp. 281 – 288, (2006)
Dodge, Y., Rousson, V.: Multivariate \(L_1\) Mean. Metrika 49(2), 127–134 (1999)
Ekeland, I., Témam, R.: Convex Analysis and Variational Problems. North-Holland, Amsterdam-Oxford (1976)
Gävert, H., Hurri, J., Särelä, J. and Hyvärinen, A.: http://research.ics.aalto.fi/ica/fastica/, Copyright (C) 1996–2005
Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. Wiley, New York (2001)
Jolliffe, I.T.: Principal Component Analysis. Springer, New York (1986)
Kwak, N.: Principal component analysis based on \(L1\)-norm maximization. IEEE Transactions on Pattern Analysis and Machine Intelligence 30, 1672–1680 (2008)
Lerman, G., McCoy, M., Tropp, J.A., Zhang, T.: Robust Computation of Linear Models by Convex Relaxation. Foundations of Computational Mathematics 15(2), 363–410 (2015)
McCoy, M., Tropp, J.A.: Two Proposals for Robust PCA via Semidefinite Programming. Journal of Statistics 5, 1123–1160 (2011)
Weiszfeld, E., Plastria, F.: On the point for which the sum of the distances to \(n\) given points is minimum. Ann. Oper. Res. 167, 7–41 (2009)
Reishofer, G., Fazekas, F., Keeling, S.L., Enzinger, C., Payer, F., Simbrunner, J., Stollberger, R.: Minimizing Macrovessel Signal in Cerebral Perfusion Imaging using Independent Component Analysis. Magnetic Resonance in Medicine 57(2), 278–288 (2007)
Weiszfeld, E.: Sur le point pour lequel la somme des distances de \(n\) points donnés est minimum. Tôhoku Mathematical Journal 43, 355–386 (1937)
Wollny, G., Kellman, P., Santos, A., Ledesma-Carbayo, M.J.: Automatic motion compensation of free breathing acquired myocardial perfusion data by using independent component analysis. Medical Image Analysis 16, 1015–1028 (2012)
Xu, H., Caramanis, C., Sanghavi, S.: Robust PCA via Outliear Pursuit. IEEE Transactions on Information Theory 58(5), 3047–3064 (2012)
Zarzoso, V.: Robust Independent Component Analysis by Iterative Maximization of the Kurtosis Contrast with Algebraic Optimal Step Size. IEEE Trans. on Neural Networks 21(2), 248–261 (2010)
Acknowledgments
Open access funding provided by University of Graz. The authors wish to express their appreciation for many fruitful discussions with Gernot Reishofer of the Radiology Clinic of the Graz Medical University, who also provided data for this work. Also, the authors gratefully acknowledge the support from the Austrian Science Fund Fonds zur Förderung der Wissenschaftlichen Forschung (FWF) under Grant SFB F032. The authors are supported by the Austrian Science Fund Fonds zur Förderung der Wissenschaftlichen Forschung (FWF) under Grant SFB F032 and are linked on the SFB webpage http://math.uni-graz.at/mobis.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Keeling, S.L., Kunisch, K. Robust \(\ell _1\) Approaches to Computing the Geometric Median and Principal and Independent Components. J Math Imaging Vis 56, 99–124 (2016). https://doi.org/10.1007/s10851-016-0637-9
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10851-016-0637-9