Robust \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\ell _1$$\end{document}ℓ1 Approaches to Computing the Geometric

Robust measures are introduced for methods to determine statistically uncorrelated or also statistically independent components spanning data measured in a way that does not permit direct separation of these underlying components. Because of the nonlinear nature of the proposed methods, iterative methods are presented for the optimization of merit functions, and local convergence of these methods is proved. Illustrative examples are presented to demonstrate the benefits of the robust approaches, including an application to the processing of dynamic medical imaging.


Introduction
The topics of this work focus on the low-dimensional representation of complex measured data. The lowest dimensional representation is a type of average. More accurate representations add dimensions beyond the average based upon subspaces in which the data vary the most. Choosing a basis for such subspaces is driven by the priority that data coordinates with respect to this basis be statistically uncorrelated or even statistically independent. The particular interest here is to present methods for performing these tasks which are robust against outliers in the measured data. The most common type of average is the mean, which may be formulated variationally as the point minimizing the sum of squared distances to data points. As discussed in [14], a more robust method involves minimizing a merit function which does not grow as rapidly with respect to the data and would therefore apply less weight to erroneous data points far from a natural average. Various notions of an average based upon 1 measures are discussed in [11]. Based upon examples presented in Sect. 3, the type of average selected for this work is the geometric median, which may be formulated variationally as the point minimizing the sum of distances (not squared) to data points. The problem of determining the geometric median has a long history. In the 1937 paper by Weiszfeld [21], three proofs concerning the uniqueness of the geometric median are given, and one of these supplies an algorithm for its computation. We also refer to the recent annotated translation of that paper [19]. See also [3]. A shorter proof of uniqueness is given in Sect. 6 which is based upon a strict convexity argument. Moreover, a possibly novel characterization of a solution is provided in case data points are collinear. A regularized Weiszfeld iteration for computing the geometric median is proposed in Sect. 3, and the local convergence of this scheme is proved in Sect. 6. Alternative approaches, based upon the Chambolle-Pock algorithm [8] and projected dual ascent [4], are compared to our approach. Given a natural average or center of the measured data, one may then wish to determine the direction in which data points vary the most from the center. This direction is the most significant principal component of the data. Principal components of lesser significance are sought similarly but within the orthogonal complement of the more significant ones. Determining and analyzing such components is the subject of principal component analysis (PCA) [15]. The most common way of determining these components is to select them as the eigenvectors of the covariance matrix of the data. The more significant components correspond to the larger eigenvalues of the covariance matrix since each eigenvalue gives the variance of the data projected onto the corresponding eigenvector. As discussed in Sect. 4, determining each eigenvector can be formulated variationally in terms of finding a best fit line through the data, where the line minimizes the sum of squared distances to data points. See [5] and [10] for 1 -based alternatives to this criterion.
Based upon examples presented in Sect. 4, this line is determined here more robustly by minimizing the sum of distances (not squared) to data points. In other words, analogous to defining an average as a geometric median point, a principal component is defined here as a geometric median line. In another context, the terms robust PCA have been associated with the separation of sparse and low-rank components in the given data. This separation was first proposed in [7], where a low-rank component is measured with the spectral norm and a sparse component with the 1 norm. See also the related approaches of [23] and [18], where a low-rank component is separated from a column sparse component using similar norms. As in the works of [10] and [17], we do not assume here that the given data may be decomposed into sparse and low-rank components. Our task is instead to decompose (maximal rank) data into components which may all have significant energy in the data. The sum of distances (not squared) between data points and best lines is considered in [10] to obtain robust principal components all at once, and a convex relaxation of this approach is analyzed in [17]. In our approach, geometric median lines are obtained sequentially in decreasing orthogonal subspaces. In Sect. 4, an iterative scheme is proposed for computing these lines, and the scheme is based upon that used for computing the geometric median point. However, since the merit function is not convex, uniqueness of minimizers cannot be expected. Nevertheless, local convergence of the scheme to a minimizer is proved in Sect. 7. For approaches to PCA involving maximization of an 1 norm we refer to [16] and [18].
Suppose that the data are rotated to an axis system aligned with principal components and that they are then scaled along each new axis to normalize the respective variances to unity. When this rotation and scaling is carried out by standard methods using 2 measures, the transformed data have a covariance matrix equal to the identity. Then the data are said to have been sphered. In particular, the new data coordinates are statistically uncorrelated. However, they are not necessarily statistically independent [14]. (See, e.g., the example of Fig. 5 with m = 1 in (5.1) so that the data are sphered but the coordinates do not satisfy the independence criterion (5.4).) It might then be postulated that the data can be represented in a rotated axis system with respect to which coordinates are statistically independent. Determining and analyzing such a system is the subject of independent compo-nent analysis (ICA). In case the postulate holds, coordinates of the sphered data represent weighted sums of statistically independent variables, and by the Central Limit Theorem [1] histograms of such coordinates tend to be bell shaped. In order to identify the postulated rotation, it is standard to minimize the Gaussianity of histograms of coordinates in the desired rotated system. The approach proposed by [14] is to determine this rotation by maximizing a merit function which is known to be minimized by data with a Gaussian distribution. It is also argued in [14] that one such merit function is more robust to data outliers than another when it does not grow as rapidly with respect to the data. Such candidate merit functions are considered in Sect. 5. The optimization method of [24] is robust against local extrema. Here we propose an approach for determining the desired rotation by first targeting independence directly instead of using the indirect measure of Gaussianity. The merit function proposed in Sect. 5 is motivated by the observation that while sphered axes tend to be aligned with data clusters, independent axes tend to separate clusters. See the examples presented in Sect. 5 for details. A fixed point iteration scheme based on the optimality condition is proposed in Sect. 5 for computing robust independent components, and the local convergence of this scheme is proved in Sect. 8.
The paper is outlined as follows. In Sect. 2, standard 2 approaches to PCA and ICA are summarized, particularly to establish the background used later for the presentation of more robust methods. In Sect. 3, a robust method of data centering is proposed using the geometric median. In Sect. 4, a robust method for determining principal components is proposed using lines which are best fit in the sense that the sum of distances (not squared) to the data points is minimized within the subspace orthogonal to other components. In Sect. 5, a robust method for determining independent components is proposed which maximizes separations among sphered data clusters. Due to the nonlinearity of the respective optimality conditions, iterative schemes are proposed in Sects. 3-5 to solve the respective optimization problems. Local convergence of these schemes is proved in Sects. 6-8. In Sect. 9, the proposed methods are applied to a magnetic resonance image sequence to separate intensity changes due to physiological motion from those due to contrast agent, and benefits of the robust methods are demonstrated with respect to this realistic example. See also [20] and [22]. The paper ends with a summary in Sect. 10. produced independently at a cocktail party. The sources are assumed to satisfy the following: 1. For 1 ≤ i = j ≤ m, z i and z j are statistically independent.
Here, E denotes the expectation. Since the sources are statistically independent, they are uncorrelated [14]. Let their positive definite diagonal covariance matrix be denoted by which is unknown. Let a random vector y ∈ R m be defined through a measurement process modeled in terms of the mixing matrix A ∈ R m×m . The components {y i } m i=1 of y will be called measurements. For example, the measurements could be random variables associated with sounds recorded by separate microphones at the cocktail party mentioned above. In this case, it may be assumed naturally that each microphone records a weighted sum of sources, where a weight is stronger when the source is nearer, and the set of vectors of such weights for the respective microphones is linearly independent. Under the assumption that the mixing matrix is invertible, the goal is to determine a matrix W ∈ R m×m such that the components estimate the sources in the following sense. First, normalizing z = A −1 y according to C(z) − 1 2 z removes the ambiguity of unknown variances by setting the covariance matrix to the identity. Secondly, since the order and sign of components in C(z) − 1 2 z are unknown, the alternative PC(z) − 1 2 z also satisfies the source assumptions when P ∈ R m×m is any matrix Thus, W estimates a product PC(z) − 1 2 A −1 , and the covariance matrix of x in (2.2) is the identity.
Suppose that each random measurement variable y i is sampled directly to obtain n samples {y i j } n j=1 . Implicitly underlying these are samples {z i j } n j=1 of each random source variable z i . Define the sample vectors y i = {y i j } n j=1 , According to the linear model in (2.1), the matrices Y = {y 1 , . . . , y m } T ∈ R m×n and Z = {z 1 , . . . , z m } T ∈ R m×n are related by By (2.2), the estimation X = {x 1 , . . . , x m } T ∈ R m×n of the sources satisfies The matrix W is determined stepwise in terms of its singular value decomposition where U, V ∈ R m×m are orthogonal and ∈ R m×m is positive definite and diagonal. Specifically, after the data are centered the product V T Y c should rotate the data so that the new coordinate axes are aligned with the visually natural axes of the cluster of data points {Yê j } n j=1 ,ê j ∈ R n , (ê j ) i = δ i, j . After this rotation, the product should scale the data so that the variance along each new coordinate axis is unity. For this reason, the data Y s are said to be sphered. The final orthogonal matrix U in (2.5) is chosen so that the components of the random variable x in (2.2) are maximally independent in a sense made precise below.
To determine the transformations V and , the covariance matrix of the sphered data is required to be the identity, which is accomplished by determining the matrices V and from the eigenspace decomposition of the centered data, The columns of V are the so-called principal components of the data Y . Analyzing this decomposition is the subject of principal component analysis (PCA). For instance, the sampled data may be filtered by projecting these data onto subspaces spanned by principal components. For this, assume that the entries of = diag{λ i } m i=1 and V = {v 1 , . . . , v m } are ordered according to λ 1 ≥ λ 2 ≥ · · · ≥ λ m . This means that the variance To select only the r < m components with respect to which the data have the most variation, define the projected data Y P ≈ Y by (2.11) where the projector P ∈ R r ×m is defined with entries P i, j = δ i, j . Note that with (2.6), (2.8), and (2.10), this result can be rewritten as Y P =Ȳ + 1 n Y c (PY s ) T (PY s ). Next, the transformation U in (2.1) is determined so that the components of the random variable x in (2.2) are independent. While the rows of the sphered data Y s are statistically uncorrelated, they are not necessarily statistically independent [14]. A criterion is now sought for a final rotation of axes which gives the desired independence. Since as seen in (2.1) measurements are sums of independent random variables, the Central Limit Theorem suggests why the measurements tend to be normally distributed [1]. The matrix U is often chosen to reverse this effect, i.e., to make the components of x depart from being normally distributed as much as possible. Here, the significance of the assumption that no component z i be normally distributed can be seen, as otherwise the proposed measure of independence would not bring a separation of sources in the following. For the required statistical constructions, let E[x] denote the expectation of a random variable x. Since a normally distributed random variable n with mean 0 and variance σ 2 has moments (2.12) it follows that the Kurtosis K = K 4 , Hence, a parameter-dependent random variable may be made to depart maximally from being normally distributed by maximizing the square of its Kurtosis with respect to parameters. Applying this criterion to the rows of whose columns lie in T l ⊂ R m defined as the range of Y l . Note that Given T l , let the lth column of U T be determined inductively by (2.18) By (2.9) and (2.17), the second moment of u T Y l / u 2 is In this way, the rows of U are determined sequentially so that the earlier components of x depart from being normally distributed more than later components. Alternatively, all components of x may be estimated with roughly equal quality by determining all rows of U simultaneously through maximizing m l=1 F(u T l Y s ) under the conditions u T i u j = δ i, j , 1 ≤ i, j ≤ m. Once matrices U , , and V are determined, the source samples are estimated according to (2.4) and (2.5).
The columns of V 1 2 U T are the so-called independent components of the data Y . Analyzing this decomposition is the subject of independent component analysis (ICA). For instance, the sampled data may be filtered by projecting these data onto subspaces spanned by independent components. Specifically, to select the r < m desired independent components where the projector Q ∈ R r ×m is defined with entries Q i j = δ q i , j . Note that with (2.6), (2.8), (2.10) and (2.14), this result can be rewritten as Y Q =Ȳ + 1 n Y c (Q X c ) T (Q X c ). In the calculations above, it is implicitly assumed that the number of samples n is at least as large as the number of sources m. Otherwise, the rank n of the covariance matrix 1 n Y s Y T s would be less than its dimension m, and the diagonal matrix in (2.10) would not be positive definite. In case n < m does in fact hold, because so few samples have been collected, one might be inclined simply to replace Y with Y T and therefore reverse the roles of time and space in the data. However, the data must possess an ergodicity property for the results with transposed data to be roughly equivalent to those without transposed data. Since such a property may not generally hold, the matrices above are determined here as follows; see also [6]. WithȲ and Y c given by (2.6), define the singular value decomposition Y c / √ n = Vˆ Ŷ s / √ n in terms of rotation matrices V ∈ R m×m andŶ s / √ n ∈ R n×n and a rectangular matrix ∈ R m×n for whichˆ =ˆ Tˆ ∈ R n×n is diagonal and positive definite. The matricesˆ andŶ s are determined from the eigenspace decomposition Y T c Y c =Ŷ T sˆ Ŷ s ,Ŷ sŶ T s = n I . Since the last m −n rows ofˆ are zero, the last m −n columns The sphered dataŶ s are transformed by the rotation matrixÛ ∈ R n×n maximizing independence of the rows ofX c =ÛŶ s ∈ R n×n . Note that with ∈ R m×m can be defined by supplementingÛ with the rotation matrix U ∈ R (m−n)×(m−n) and otherwise padding with zeros, and X c = U Y s ∈ R m×n can be defined to give (2.14). However, , the last m − n rows of X c are zero, contrary to the objective that the rows be independent. Thus,X c marks the end of the calculation andX =Ûˆ − 1 2V T Y gives the maximum number of independent components which can be determined from the undersampled data. Finally, for projectors P, Q ∈ R n×n , (2.11) and (2.21)

1 Approach to Centering
That 1 measures lead to statistically robust results may be highlighted by the following simple example. Suppose samples Y = {y j } n j=1 = {0, 1, . . . , 1} ∈ R n , n > 2, have been collected, where the first measurement is clearly an outlier. The 2 mean of these data is given by which is clearly influenced by the outlier. On the other hand, the 1 mean is given by which is insensitive to the outlier. Here, the median is defined as the middle entry in the sorted list of values, if n is odd, or else the average of two middle values, if n is even. For a generalization of this robust scalar mean to its counterpart for vectors, let the data then the solution would be unnaturally determined componentwise through decoupled minimizations. By contrast, the following 1 mean for vectors [11], i.e., the geometric median [19], minimizes, in a natural way, the 1 norm of Euclidean distances between the data points and the selected mean. The robustness of this measure in relation to the mean or median can be highlighted by the following simple example, which is illustrated in Fig. 1a. Here the data are given by Fig. 1a, and the point (0, 1) may be regarded as an outlier from points otherwise lying on the line between (0, 0) and (1, 1). The componentwise mean of the data gives (0.375, 0.625) marked with × in Fig. 1a. Then the componentwise median gives (0.25, 0.75) as marked with + in Fig. 1a. Finally, the measure defined in (3.4) gives the geometric medianȲ = (0.4996, 0.5004) marked with in Fig. 1a, where the smooth landscape for the merit function M is shown in Fig. 1b. Because of the natural result obtained by the geometric median in Fig. 1a, (3.4) will be used here for the 1 vector mean. Fig. 1 The mean, median, and geometric median are compared in a, where the data of (3.5) are shown with ·, the mean with ×, the median with +, and the geometric median with . Shown in b is the landscape of the merit function M in (3.4) To compute the 1 mean of (3.4), the following iteration is used. For τ > 0, compute iteratively D l ∈ R m×n and μ l ∈ R m by (3.7) The motivation for this iteration is based upon the optimality conditions (6.9) derived later in Sect. 6. A continuum level steepest descent approach to solving (6.9) is given formally by where (3.9) implicitly defines μ(t) so that the sum over j in (3.8) and hence n j=1 D(t)ê j remains 0. A semi-implicit time-stepping approach to solving (3.8)-(3.9) is given by where (3.11) defines μ l+1 to correct the departure of n j=1 D l+1ê j in (3.10) from 0, as required by the last equation in (6.9). After rearranging terms, the scheme (3.6)-(3.7) is seen to be equivalent to (3.10)- (3.11). This derivation of the scheme (3.6)-(3.7) offers a heuristic explanation for the advantage of choosing τ large so that the desired steady state of (3.8)-(3.9) is achieved rapidly. Further details of this iteration and its convergence analysis are given in Sect. 6. The 1 -mean is given by taking the limit, (3.12) After these calculations have been completed, the centered data are given by Y c = Y −Ȳ , the counterpart to (2.6) with (3.4) replacing (2.7). Note that the iteration (3.6)-(3.7) agrees with the Weiszfeld Algorithm for τ → ∞ [21]. Since this limit may lead to undefined terms, the proposed scheme may be regarded as a regularized Weiszfeld iteration. This scheme is compared in Figs. 2 and 3 with two alternatives which are described next.
The first alternative approach is given by the Chambolle-Pock Algorithm [8], which can be written for μ l , In (3.13), P B is a projection onto the set B = {v ∈ R m : ν 2 ≤ 1} and the theoretical requirements are that θ = 1 and 0 < ς, τ < 1 hold. Note that, as (3.5)-(3.6) can be seen as a semi-implicit time-stepping scheme for solving (3.8)-(3.9), (3.13) can be seen as an alternative explicit timestepping scheme. The second alternative approach is given by solving the dual problem sup D∈A∩C The solution D * is computed by a dual ascent iteration where τ > 0 is a stepsize and P A∩C is a projection onto A∩C, computed according to [4]. This ascent method can also be viewed as an explicit time-stepping scheme for solving the dual problem. After D * is obtained, the fixed point iteration is used to compute a solution μ * to the optimality condition in (6.9).
The convergence of the three schemes, regularized Weiszfeld (3.6)-(3.7), Chambolle-Pock (3.13), and dual ascent (3.14), is compared in Fig. 2. All parameters have been chosen manually to minimize the number of iterations required for convergence. For these comparisons, the data Y ∈ R m×n , m = 2, n = 1000, are chosen so that the points {Yê j } n j=1 are subgaussian (uniformly, Y = rand(m,n)) distributed in Fig. 2a and supergaussian (Laplacian, Y = log(rand(m,n)./rand(m,n))) distributed in Fig. 2b. Clearly, the regularized Weiszfeld iteration converges much more rapidly. Since additional work is required to determine μ * from the result of the dual ascent method, only the regularized Weiszfeld and the Chambolle-Pock iterations are compared in Fig. 3 for the realistic computation ofȲ for the example of Sect. 9. Again, the regularized Weiszfeld iteration converges much more rapidly. This convergence performance combined with the popularity of the Weiszfeld Algorithm has motivated the focus on the scheme (3.6)-(3.7) and its analysis for this work.

1 Approach to PCA
To present our approach, we start with some preliminaries. Unless otherwise specified, it is assumed that m ≤ n. With V and in (2.10) given in terms of components as and the projected data where R denotes the range. For convenience, it is assumed here that the data Y c have maximal rank so that Before presenting the proposed robust measure for determining visually natural data axes, a motivation is given by reformulating the 2 eigenspace decomposition in (2.10) in terms of a least squares fit of an axis system to the cloud of data points. Given S k , let the kth column of V and the kth diagonal entry of be determined inductively by the regression The 1 (solid) and 2 (dotted) data axes are compared in a, where the data of (4.9) are shown with ·. Shown in b is the landscape of the merit function Since minimizing We now aim for an appropriate 1 -variant of (4.4)-(4.5). The proposed approach to determine the orthogonal matrix V and the diagonal matrix in (2.5) is to replace the sum of squared norms in (4.5) with a sum of norms in (4.8) below. The approach is reminiscent of (3.4) in the sense that while a geometric median point is selected by (3.4), a geometric median line is determined by (4.7). As for (4.5), let V k be given by (4.1) and Y k by (4.2). Then given S k according to (4.3), let v k and λ k be determined inductively byv where The robustness of the 1 measure in H k in relation to the 2 measure inH k is highlighted by the following simple example, which is illustrated in Fig. 4. Here the data are given by marked with · in Fig. 4a. The points (0, 1 2 ), (1, 1 2 ) may be regarded as outliers from points otherwise lying on the line between (0, 0) and (1, 1). The 2 data axes are given by (4.4) and are shown in Fig. 4a as dotted line segments. The 1 data axes are given by (4.7) and are shown in Fig. 4a as solid line segments. The landscape for the merit function Fig. 4b, and the minimum is marked with ; see Remark 2 concerning the regularity of the merit function.
The data axes defined by (4.7) are computed by the following scheme. For τ > 0 and ρ > 0, compute iteratively D lê j ∈ S k , j = 1, . . . , n, andv l ∈ S k with D 0ê j 2 ≤ 1, j = 1, . . . , n, and v 0 2 = 1, , j = 1, . . . , n (4.10) Fig. 5 The 1 (solid) and K 2 , K 4 , and K e (dotted, i.e., identical) data axes were obtained by maximizing the measures in (5.5). The results are compared in a, where the data of (5.10) are shown with ·. Shown in b is the landscape of the merit function U (4.11) The motivation for this iteration and its convergence analysis are given in Sect. 7. Then the kth components of V and are given, respectively, by taking the limit v k = lim l→∞v l (4.12) and setting After these calculations have been completed for k = 1, . . . , m, the 1 sphered data are given by Y s = − 1 2 V T Y c , the counterpart to (2.8).
Since the minimization problems (4.7) are performed over progressively smaller subspaces S k , it follows that λ 1 ≥ λ 2 ≥ · · · ≥ λ m . Thus, as in Sect. 2, the 1 -variation λ i of the data Y c along the axis v i is larger than the 1 -variation λ j along the axis v j for i < j. To select only the r < m components with respect to which the data have the most 1 -variation, define the projected data Y P ≈ Y by the counterpart to (2.11) where the matricesȲ , V , and are now determined by 1 measures while the projector P ∈ R r ×m is defined as before with entries P i, j = δ i, j .
As the columns of V are determined sequentially, the earlier components of y are more strongly separated from other later components. Alternatively, all components of y may be estimated with roughly equal quality by determining all columns of V simultaneously through minimizing a sum of functionals of the form (4.8) for each column under the constraint that V be orthogonal [10,17].
In case the data are undersampled and n < m, the steps outlined in this section must be modified as follows. LetȲ and Y c be determined as described following (3.4). ThenV ∈ R m×n is determined by solving (4.7) but for The sphered data are then given bŷ

1 Approach to ICA
While the Kurtosis has been used as a measure of Gaussianity in (2.18) to determine independent components, an alternative measure of independence is proposed here which is more robust in the presence of outliers. This approach targets independence directly in a manner which can be illustrated in terms of the example shown below in Fig. 5. Here the data are given by where σ (t) = sign(t). Let these data represent a realization of a random vector y ∈ R 2 satisfying where P denotes the probability. In order that the rotation dependent random vector satisfy the independence condition, Fig. 6 The 1 (solid) and 2 (dotted) data axes are compared in a, where the data of (5.10) are shown with ·. Shown in b is the landscape of the merit function G 1 in (5.9) a rotation angle θ = π/4 + kπ/2, k ∈ Z, must be chosen. For the determination of the proper rotation, four different measures of independence are compared in Fig. 5,

4)
where K m is given by (2.13) and (see [14]) The data axes obtained by maximizing the last three measures in (5.5) are identical and are shown in Fig. 9a as dotted line segments. The data axes obtained by maximizing the first measure in (5.5) are shown in Fig. 9a as solid line segments. The landscape for the merit function U (θ )Y 1 of (5.5) is shown in Fig. 9b and the maximum is marked with . Only the first measure in (5.5) is maximized at a desired angle as shown in the landscape of Fig. 9b. All other measures are maximized at a multiple of π . The correct rotation is also obtained using (5.8) for the data presented in Sect. 9. While a more detailed statistical analysis is intended for a future work, it is presently on the basis of these examples that the measure shown below in (5.8) is proposed to determine the rotation matrix U of (2.14). To achieve maximally independent rows of X c = U Y s , the rows of the orthogonal matrix U = {û T i } m i=1 are determined as follows. Define U l = {û 1 , . . . ,û l } T as in (2.15) and the projected data For convenience, it is assumed here that the data Y s have maximal rank so that which is equivalent to u = 0 exactly when U l−1 u = 0. Hence, Given T l , let the lth column of U T be determined inductively bŷ In this way, the rows of U are determined sequentially so that the earlier components of x are more strongly separated from other later components. Alternatively, all components of x may be estimated with roughly equal quality by determining all rows of U simultaneously through maximizing a sum of functionals of the form (5.9) for each row under the constraint that U be orthogonal [14]. Once matrices U , , and V are determined, the source samples are estimated according to (2.4) and (2.5). The robustness of the measure G l in (5.9) in relation to the measure F in (2.20) is highlighted by the following simple example, which is illustrated in Fig. 6. Here the data Y are given by Fig. 6a, and the points (±3, 0) may be regarded as outliers from points otherwise lying at the diamond vertices {(0, ±1), (±1, 0)}. The 2 data axes are given by (2.18) and are shown in Fig. 6a as dotted line segments. The 1 data axes are given by (5.8) and are shown in Fig. 6a as solid line segments, where the landscape for the merit function G 1 of (5.9) is shown in Fig. 6b and the maximum is marked with .
The vectors defined by (5.8) are computed by the following scheme. For τ > 0, compute iterativelyû k ∈ T l with û 0 2 = 1, where The motivation for this iteration and its convergence analysis are given in Sect. 8. The lth column of U T is given by taking the limit, After these calculations have been completed for l = 1, . . . , m, the 1 maximally independent data are given by X c = U Y s , the counterpart to (2.14).
To select the r < m desired independent components {q 1 , . . . , q r } ⊂ {1, . . . , m}, define the projected data Y Q ≈ Y by the counterpart to (2.21) where the matricesȲ , V , , and U are now determined by 1 measures, while the projector Q ∈ R r ×m is defined as before with entries Q i j = δ q i , j .
In case the data are undersampled and n < m, let V ∈ R m×n ,Ŷ s ,ˆ ∈ R n×n be given as described at the end of Sect. 3. Then, the sphered dataŶ s are transformed by the rotation matrixÛ ∈ R n×n maximizing independence of the rows ofX c =ÛŶ s ∈ R n×n . Thus,X = Uˆ − 1 2V T Y gives the maximum number of independent components which can be determined from the undersampled data. Finally, for a projector Q ∈ R n×n , (2.21) becomes

Convergence of the Iterative Scheme for the 1 Mean
The analysis of the scheme (3.6)-(3.7) begins by establishing basic properties for the minimization problem (3.4), i.e., the determination of the geometric median. As indicated in Sect. 1, a proof of uniqueness of the geometric median is provided here which is shorter than that found in [21] or [19] and is based on strict convexity of the functional M. Also, a possibly novel characterization of a solution is provided in case the columns of Y are collinear, which means that Y can be expressed in the form Y = ae T + b y T , where e = (1, . . . , 1) T , a, b ∈ R m , e, y ∈ R n . (6.1)

Lemma 2 If the columns of Y ∈ R m×n are collinear, then M is minimized (not necessarily uniquely) by μ = a + b · median ( y). If the columns of Y are not collinear, there exists a unique μ ∈ R m minimizing M.
Proof Existence of a solution μ * ∈ R m follows by standard subsequential limit arguments. If the columns of Y are not collinear, uniqueness of the solution μ follows from strict convexity of μ → M(μ).
Suppose next that the columns of Y are collinear so that Y = ae T + b y T . Set ν = b 2 and w = b + νê 1 where (ê 1 ) i = δ i1 . Then the Householder transformation is orthogonal and satisfies U b = −νê 1 (6.5) as well as U x 2 = x 2 , ∀x ∈ R m . Let an arbitrary μ ∈ R m be represented as and note that With (6.5)-(6.7), the merit function can be written as Once the merit function has been reduced to this onedimensional form, it is readily seen that M(μ) ≥ M(a + γ b) for γ = median( y), and hence M is minimized at a + γ b. That the minimizer is not necessarily unique can be seen from the case that n is even and the components of y are distinctly ascending, so M has the same value M(a + γ b) at all points a + b[t y n/2 + (1 − t)y n/2+1 ], t ∈ [0, 1].

Lemma 3
The first-order necessary optimality condition for a minimizer μ of M over R m is that there is D ∈ R m×n satisfying, Proof The necessary optimality condition for a minimizer μ is that 0 ∈ ∂ M(μ ). By the chain rule (see, e.g., [2], p. 233), the subdifferential of M is given by the sum of the respective subdifferentials, Thus, there exist d j ∈ ∂ μ − Yê j 2 , j = 1, . . . , n, satisfying n j=1 d j = 0, (6.11) and where B(0, 1) is the unit ball. Combining these facts, we have The claim (6.9) follows with D = {d 1 , . . . , d n }.
The claimed stability will follow once it is shown that the Jacobian of this mapping evaluated at {D , μ } = {d 1 , . . . , d n , μ } has spectral radius less than 1 when τ is sufficiently large. For (6.19), For (6.20), Thus, the Jacobian satisfies The matrix B ∈ R m×m is clearly symmetric positive semidefinite, and it will now be shown that its spectrum lies in [0, 1). Suppose there is an x ∈ R m satisfying (6.27) violating the assumption that the columns of Y are not collinear. Thus, there can be no x satisfying (6.27). This result implies strict inequality in the following estimate: Thus, the spectral radius of B is less than 1. Let a rotation matrix P be chosen so that where λ i ∈ [0, 1). Then the matrix J in (6.26) satisfies proving that the spectrum of J lies in [0, 1). Since the Jacobian in (6.26) is arbitrarily well approximated by J when τ is sufficiently large, the spectral radius of the Jacobian must be less than 1 for τ sufficiently large. Because of the assumption μ / ∈ {Yê j } n j=1 , the mapping (6.19)-(6.20) is smooth in a neighborhood of the fixed point {D , μ }, and hence iterates {D l , μ l } converge to the fixed point when started sufficiently close to it.
Remark 1 Computations demonstrate that the iteration (3.6) -(3.7) converges to a minimizer of M for all τ > 0 and even when the condition μ / ∈ {Yê j } n j=1 is violated. Furthermore, while the uniqueness of the minimizer is not guaranteed when the non-collinearity condition is violated, the iteration is found to converge to the median shown in Lemma 1 when τ is sufficiently small.

Convergence of the Iterative Scheme for 1 PCA
The analysis of the scheme (4.10)-(4.11) begins with establishing the existence of a minimizer for H k in (4.8). Recall the assumption in Sect. 4 k = 2, . . . , m, as seen in (4.3). Also, define (4.8), there exists a minimizerv over S k which satisfiesv ∈ S k . Moreover, n j=1 Y kê j 2 gives an upper bound for the minimum.

Lemma 4 For H k in
Proof Because of the properties of projections, it follows that (vv T − I )Y kê j 2 ≤ Y kê j 2 holds ∀v ∈ S k and ∀ j. By (4.8), H k (v) ≤ H k (0) holds ∀v ∈ S k . Thus, H k is not strictly minimized at v = 0. According to (4.8), H k is constant along rays outside the origin. Therefore, the minimization can as well be carried out over S k . The claim follows since S k is compact and H k is continuous on S k .

Lemma 5 The first-order necessary optimality condition for a minimizerv of H k over S k satisfyingv ∈ S k as given by Lemma 4 is that there exists D ∈ R m×n satisfying
Proof The necessary optimality condition for a minimizer v is where ∂ denotes the Clarke derivative.
Turning to the iteration (4.10)-(4.11), we observe that if convergence to a fixed point {D ,v } with v 2 = 1 can be guaranteed, then from (4.10) we have thatv T D jê j = 0 and According to (4.11), the fixed pointv satisfieŝ where v is defined by the right side of (7.16). Applying (v v T − I ) to both sides of (7.16) gives Combining (7.15) and (7.17) gives Moreover, if D 0ê j 2 ≤ 1, j = 1, . . . , n, then the iterates {D l } also satisfy this bound, and hence D ê j 2 ≤ 1, j = 1, . . . , n. Thus, {D ,v } must satisfy the necessary optimality condition (7.2). The following theorem provides sufficient conditions under which convergence of our algorithm to a fixed point can be guaranteed.

(7.19)
Then {D ,v } is a fixed point of the iteration (4.10)-(4.11), and for τ sufficiently large and for ρ sufficiently small, the iterates {D l ,v l } converge to this fixed point when {D 0 ,v 0 } starts the iteration close enough to {D ,v }.
Proof It will first be shown that {D ,v } is a fixed point for the iteration (4.10)-(4.11), which is locally asymptotically stable for τ sufficiently large and for ρ sufficiently small. Using (7.2) and substituting D l = D andv l =v on the right side of (4.10), gives D l+1 = D . Also by (7.2), satisfies v =v v 2 . Thus, setting D l = D andv l = v on the right side of (4.11) givesv l+1 =v . Therefore, {D ,v } is a fixed point of the iteration (4.10)-(4.11).
To establish the stability of the fixed point, define and so that (4.10)-(4.11) is given by The claimed stability will follow once it is shown that the Jacobian of this mapping from the (n + 1)-fold Carte- . . , d n ,v } has only eigenvalues with magnitude less than 1 when τ is sufficiently large and ρ is sufficiently small. For (7.23), τ →∞ −→ 0 (7.28) and withĝ = g(d 1 , . . . , d n ,v ) =v , Thus, the Jacobian satisfies By the definition S k = R(Y k ) prior to (4.3), it follows from (7.27) that A j : S k → S k , j = 1, . . . , n, and from (7.31) that B : S k → S k . In the same way, it follows from (7.32) that J : (S k ) n+1 → (S k ) n+1 . It will be shown that B : S k → S k has spectrum in (−1, 1). Let B be expressed as , (7.34) and v T Y c = 0 would contradict the assumption that S 1 = R(Y c ) = R m . Thus, there exist 0 < λ min ≤ λ max such that C satisfies Clearly, λ = 0 is an eigenvalue of B associated with the eigenvectorv ∈ W k . Eigenvalues for eigenvectors in U k will now be estimated. Let ρ be small enough that −1 < 1 − ρλ max < 1 − ρλ min < 1.
Then for x ∈ U k it follows with the definition ofĈ and C that Thus, the spectrum of B : S k → S k lies in (−1, 1). Recalling that the dimension of S k is m − k + 1, let P ∈ R m×(m−k+1) be chosen with orthonormal columns such that where λ i ∈ (−1, 1). Also set Z ∈ R m×(m−k+1) with all zero entries. Further let I ∈ R m×m and 0 ∈ R m×m in (7.37) below denote the identity and the zero matrix, respectively. Then the matrix J satisfies proving that the spectrum of J lies in (−1, 1). Since the Jacobian in (6.26) is arbitrarily well approximated by J when τ is sufficiently large, its spectrum lies strictly within the ball of radius 1 for τ sufficiently large and for ρ sufficiently small. Because of the assumption (7.19), the mapping (7.23)-(7.24) is smooth in a neighborhood of the fixed point {D ,v }, and hence the iterates {D l ,v l } converge to the fixed point when started sufficiently close to it.
Remark 2 Computations demonstrate that the iteration (4.10)-(4.11) can converge to a minimizer for H k even when the condition (7.19) is violated. Such a case is illustrated in Fig. 4. Variations of the example illustrated in Fig. 4, in which data points lie on the boundary of a rectangle instead of a square, indicate an advantage to having sphered the data by 2 means according to (2.8) and (2.10) before proceeding with the methods of Sect. 4.

Convergence of the Iterative Scheme for 1 ICA
The analysis of the scheme (5.11) begins with establishing existence of a maximizer for G l in (5.9). Recall the assumption in Sect. 5 that . . , m, as seen in (5.7). Also, define Lemma 7 is now used to prove Lemma 8. For the following, let σ (v) = {σ (v i )} where v = {v i } and σ (t) = sign(t).

Lemma 8
The first-order necessary optimality condition for a maximizerû of G l over T l satisfyingû ∈ T l as given by Lemma 6 is (8.8) and the sets Y = { j : Y lê j = 0} and S = { j : u T Y lê j = 0} agree. As a consequence, ∃ > 0 such that Proof Letû ∈ T l be a maximizer for G l guaranteed by Lemma 6. By (8.2), By decomposing sums into indices in S and S c , (8.11) where Combining (8.10) and (8.11), it follows This implies the existence of {γ j } j∈S (possibly zero) such that Now with w = v , combining (8.10), (8.11), and (8.14) gives or v = 0. With (8.12), the optimality condition (8.8) follows. According to (8.14), S ⊂ Y holds. Since Y ⊂ S always holds, it follows that Y = S. Note that is smooth for u ∈ B(û , ) for some > 0. As a result, G l is smooth in B(û , ) and can be differentiated directly to obtain (8.9).
Turning to the iteration (5.11), we observe that if convergence to a fixed pointû can be guaranteed, then multiplyinĝ where the last equality follows with (Y T lû ) T σ (Y T lû ) = û T Y l 1 . Since u 2 = û 2 = 1 holds, it follows that u =û and henceû satisfies (8.8). The following theorem provides sufficient conditions under which convergence of our algorithm to a fixed point can be guaranteed.
Thenû is a fixed point of the iteration (5.11), and for τ sufficiently small, the iterates {û k } converge to this fixed point when {û 0 } starts the iteration close enough to {û }.
Proof It will first be shown that {û } is a fixed point for the iteration (5.11) which is locally asymptotically stable for τ sufficiently small. Settingû k+1 =û on the right side of (5.11) and using (8.8) shows that u k+1 =û . Hence with u k+1 2 = û 2 = 1 it follows thatû k+1 = u k+1 =û , and thusû is a fixed point. To establish the stability of the fixed point, define G(u) = g(u) g(u) 2 , so that (5.11) is given bŷ The claimed stability will follow once it is shown that the Jacobian of this mapping evaluated atû has spectral radius less than 1 when τ is sufficiently small. The Jacobian is given by It follows with (8.8) that g(û ) =û and thus g(û ) 2 = û 2 = 1. The Jacobian of G atû is symmetric and is given by provided that 1 > τ û T Y l 1 . For any such τ , the spectral radius of the Jacobian in (8.22) is less than 1. Because of the assumption that S is empty, the mapping (8.18) is smooth in a neighborhood of the fixed point {û }, and hence the iterates {û k } converge to the fixed point when started sufficiently close to it.
Remark 3 Computations demonstrate that the iteration (5.11) converges to a maximizer for G l even when the condition S = { j :ê T j Y T lû = 0} = ∅ is violated.

Application to DCE-MRI Sequences
In this section, the proposed methods are applied to the dynamic contrast-enhanced magnetic resonance image (DCE-MRI) sequence http://math.uni-graz.at/keeling/manus kripten/dcemri.mpg to separate intensity changes due to physiological motion from those due to contrast agent. With such a separation, unavoidable physiological motion may be removed in order to investigate tissues in a stationary state. See also [20] and [22]. To focus on the period in which contrast agent arrives in the imaged tissues, only the first 40 of 134 frames are used for the following decompositions. Each frame consists of an image with 400 × 400 pixels. Thus, in the notation of Sect. 2, the data are Y ∈ R m×n with m = 400 2 40 = n. For a static display of the DCE-MRI sequence, representative stages are shown in Fig. 7: exhale and inhale, with and without contrast agent. Specifically, with Y , V, given by (2.7)   are shown in Fig. 8. Brightness changes in relation to the background are seen throughout organs in Fig. 8b, and this suggests thatv 1 represents intensity changes in the DCE-MRI sequence due to contrast agent. On the other hand, brightness changes are seen mainly on the edges of organs in Fig. 8c, and this suggests thatv 2 represents intensity changes in the DCE-MRI sequence due to physiological motion. The image sequence is shown more dynamically in Fig. 9. The graphs in the left column are the time coursesv T i Y ,v T i Y s , v T i X c , i = 1, 2, for the raw, sphered and independent data, respectively, where Y s and X c are given by (2.8) and (2.14), respectively. The graphs in the right column are corresponding plots in a phase plane. To determine the most significant independent components, all but the top two principal components were discarded. Then V was replaced by its first two columns, Y s by its first two rows and by a diagonal matrix containing the largest two eigenvalues. Also, the independent images V 1 2 U T Q i X c , i = 1, 2, Q i = diag{ê i }, (ê i ) j = δ i j , (9.3) are shown in Fig. 9g, h. As explained in connection with Fig.  8, these can be associated respectively with intensity changes in the DCE-MRI sequence due to contrast agent and to physiological motion. Note that there are only small differences between Figs. 9c and e, between Figs. 9d and f, between Figs. 8b and 9g and between Figs. 8c and 9h. Thus, the separation of intensity changes due to physiological motion from those due to contrast agent is achieved here already with the sphered data. Hence the transformation to independent data had little effect for this particular example. Recall from Sect. 2 that the order and the sign of ICA components are not uniquely determined. This separation will now be considered in the presence of outliers. As seen in the full DCE-MRI sequence, an excessively bright frame may appear suddenly. To simulate this effect, intensities of the final frame of the sequence are increased by a constant factor. Then the same methods used for Fig. 9 are applied to the corrupted data, and the results are shown in Fig. 10 with the same format as that used in Fig. 9. The corrupted data may be seen at the final time shown in the graphs of the first column of Fig. 10. Also the outlier is conspicuous in the phase plane graphs in the right column of Fig. 10. Finally, the images defined by (9.3) with the corrupted data are shown in Figs. 10g and h. Since these clearly differ from their counterparts in Figs. 9g and h, the presence of the single outlier has corrupted the separation of intensity changes due to physiological motion from those due to contrast agent. We would like to report that we obtained results comparable to those seen in Fig. 10 by applying the FastICA code [13] using optional robust merit functions.
For comparison, the 1 -based methods of Sects. 3-5 are now applied to the corrupted data, and the results are shown in Fig. 11 with the same format as that used in Figs. 9 and 10. Now the matrices V , Y c , V , , Y s , and U are understood as explained in Sects. 3-5 as well as in Remark 2. As in the previous cases, all but the top two principal components are discarded and the respective matrices are reduced correspondingly to have only two rows or columns or both. Then (9.2) and (9.3) apply with these 1 -based matrices. As before, the corrupted data may be seen at the final time shown in the graphs of the first column of Fig. 11. Also the outlier is conspicuous in the phase plane graphs in the right column of Fig. 11. Finally, the images defined by (9.3) with the corrupted data are shown in Figs. 11g and h. Note the similarities between Figs. 11e-h and their counterparts in Figs. 9e-h. On this basis, the 1 methods can be seen to have successfully separated intensity changes due to physiological motion from those due to contrast agent in spite of the outlier. With this separation, the data are projected onto the single independent component of Fig. 11g using (2.21) to produce Fig. 9 Representation of components of raw, sphered and independent data for the DCE-MRI sequence. These have been determined by 2 -based methods Fig. 10 Representation of components of raw, sphered and independent data for the DCE-MRI sequence with a single outlier introduced at the final time. These have been determined by 2 -based methods Fig. 11 Representation of components of raw, sphered and independent data for the DCE-MRI sequence with a single outlier introduced at the final time. These have been determined by 1 -based methods the following transformed DCE-MRI sequence manifesting contrast changes free of physiological motion, http://math.uni-graz.at/keeling/manuskripten/dcemri_ ica.mpg.

Conclusion
In this work, robust measures have been introduced for centering complex data in the presence of outliers and for determining principal and independent components of such data. The approach to centering is to use the geometric median. The approach for determining principal components is to find best fit lines through the data, where each line minimizes the sum of distances (not squared) to data points in the subspace orthogonal to other components. The approach for determining independent components is first to sphere the data so that the corresponding axes are aligned with clusters, and then to determine independent axes as those which separate sphered clusters as much as possible. This separation is accomplished by maximizing an 1 counterpart to Rayleigh quotients. To optimize the respective merit functions, iterative methods were proposed and their local convergence was proved. Illustrative examples were presented to demonstrate the benefits of the robust approaches. Finally, the proposed methods were applied to a DCE-MRI sequence to separate intensity changes due to physiological motion from those due to contrast agent, and benefits of the robust methods have been demonstrated with respect to this realistic example.