Hybrid Linear Modeling via Local Best-Fit Flats
DOI: 10.1007/s11263-012-0535-6
Cite this article as:
Zhang, T., Szlam, A., Wang, Y. et al. Int J Comput Vis (2012) 100: 217. doi:10.1007/s11263-012-0535-6
Abstract
We present a simple and fast geometric method for modeling data by a union of affine subspaces. The method begins by forming a collection of local best-fit affine subspaces, i.e., subspaces approximating the data in local neighborhoods. The correct sizes of the local neighborhoods are determined automatically by Jones' \(\beta_2\) numbers (we prove under certain geometric conditions that our method finds the optimal local neighborhoods). The collection of subspaces is further processed by a greedy selection procedure or a spectral method to generate the final model. We discuss applications to tracking-based motion segmentation and clustering of faces under different illumination conditions. We give extensive experimental evidence demonstrating the state-of-the-art accuracy and speed of the suggested algorithms on these problems, on synthetic hybrid linear data, and on the MNIST handwritten digits data. We also demonstrate how to use our algorithms for fast determination of the number of affine subspaces.
Keywords
Hybrid linear modeling · Subspace clustering · Spectral clustering · Local PCA · High-dimensional data · Motion segmentation · Face clustering
1 Introduction
Several problems from computer vision, such as motion segmentation and face clustering, give rise to modeling data by multiple subspaces. This is referred to as Hybrid Linear Modeling (HLM) or alternatively as “subspace clustering”. In tracking-based motion segmentation, extracted feature points (tracked in all frames) are clustered according to the different moving objects. Under the affine camera model, the vectors of coordinates of feature points corresponding to a moving rigid object lie on an affine subspace of dimension at most 3 (see Costeira and Kanade 1998). Thus clustering different moving objects is equivalent to clustering different affine subspaces. Similarly, in face clustering, it has been proved that the set of all images of a Lambertian object under a variety of lighting conditions forms a convex polyhedral cone in the image space, and this cone can be accurately approximated by a low-dimensional linear subspace (of dimension at most 9) (Epstein et al. 1995; Ho et al. 2003; Basri and Jacobs 2003). One may thus cluster certain images of faces by HLM algorithms.
The mathematical formulation of HLM assumes a data set \(\mathrm{X}=\{\mathbf{x}_{i}\}_{i=1}^{N} \subseteq \mathbb{R}^{D}\), where each \(\mathbf{x}_{i}\) lies on (or near) one of K flats (i.e., affine subspaces), and asks for the partition of X corresponding to the flats. We would like to be able to do this when the data has been corrupted by additive noise and outliers;^{1} in this case we may also want to determine the flats themselves. We first assume here that all flats have the same known dimension d (i.e., they are d-flats) and that their number K is known. In Sect. 3 we address to some extent the cases of unknown K and mixed dimensions.
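To make the formulation concrete, the following minimal Python sketch computes the distance from each data point to a d-flat and assigns every point to its nearest flat. It is an illustration only and not one of the algorithms of this paper: the function names, the representation of a flat as a pair (point on the flat, D×d orthonormal basis), and the use of NumPy are assumptions made here.

```python
import numpy as np

def distance_to_flat(X, center, basis):
    """Distances from the rows of X (N x D) to the flat {center + basis @ a}.

    `basis` is assumed to be a D x d matrix with orthonormal columns.
    """
    Y = X - center
    residual = Y - (Y @ basis) @ basis.T  # remove the component inside the flat
    return np.linalg.norm(residual, axis=1)

def assign_to_flats(X, flats):
    """Label each point by the index of its nearest flat (hypothetical helper)."""
    dists = np.column_stack([distance_to_flat(X, c, B) for c, B in flats])
    return dists.argmin(axis=1)

# Two parallel 1-flats (lines) in the plane: y = 0 and y = 5.
direction = np.array([[1.0], [0.0]])
flats = [(np.zeros(2), direction), (np.array([0.0, 5.0]), direction)]
X = np.array([[3.0, 0.1], [-2.0, 4.8]])
labels = assign_to_flats(X, flats)
```

When the flats are known, the HLM partition is exactly this nearest-flat assignment; the difficulty addressed in the paper is that the flats themselves are unknown.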
Several algorithms have been suggested for solving the HLM problem (or even the more general problem of clustering manifolds), for example the K-flats (KF) algorithm or any of its variants (Tipping and Bishop 1999; Bradley and Mangasarian 2000; Tseng 2000; Ho et al. 2003; Zhang et al. 2009), methods based on direct matrix factorization (Boult and Brown 1991; Costeira and Kanade 1998; Kanatani 2001; Kanatani 2002), Generalized Principal Component Analysis (GPCA) (Vidal et al. 2005), Local Subspace Affinity (LSA) (Yan and Pollefeys 2006), RANSAC (for HLM) (Yang et al. 2006), Locally Linear Manifold Clustering (LLMC) (Goh and Vidal 2007), Agglomerative Lossy Compression (ALC) (Ma et al. 2007), Spectral Curvature Clustering (SCC) (Chen and Lerman 2009) and Sparse Subspace Clustering (SSC) (Elhamifar and Vidal 2009). Some theoretical guarantees for particular HLM algorithms appear in Chen and Lerman (2009), Arias-Castro et al. (2011), Lerman and Zhang (2011) and Soltanolkotabi and Candès (2011). We recommend a recent review on HLM by Vidal (2011).
Many of the algorithms described above require an initial guess of the subspaces. For example, the K-flats algorithm is an iterative method that requires an initialization, and in SCC, one needs to carefully choose collections of d+1 data points that lie close to each of the underlying d-flats. Other algorithms require some information about the suspected deviations from the hybrid linear model; for example, both RANSAC (for HLM) and ALC ask for a model parameter corresponding to the level of noise.
Here we propose a straightforward geometric method for the estimation of local subspaces, which is inspired by Jones (1990), David and Semmes (1991) and Lerman (2003) as well as Fukunaga and Olsen (1971) and Little et al. (2009a, 2009b). These local subspace estimates can be used to set the model parameters for or initialize an HLM algorithm. The basic idea is that for a data set X sampled from a hybrid linear model (perhaps with some noise), there are many points x such that the principal components of an appropriately sized neighborhood of x give a good approximation to the subspace x belongs to. Using local subspaces to infer the global hybrid linear model was suggested in Yan and Pollefeys (2006) for linear subspaces; however, there they use very small neighborhoods that are not adaptive to the structure of the data (e.g., amount of noise etc.). An “appropriately sized neighborhood” needs to be larger than the noise, so that the subspace is recognized. However, the neighborhood cannot be so large that it contains points from multiple subspaces. The correct choice of this size is carefully quantified in Sect. 2.1. We refer to such “appropriately sized neighborhoods” as “optimal neighborhoods”.
In addition to studying how to estimate local subspaces, we describe two complete HLM algorithms that are natural extensions of the local estimation: LBF (Local Best-fit Flat) and SLBF (Spectral LBF). On many data sets, LBF obtains state-of-the-art speed with nearly state-of-the-art accuracy (it can also deal with very large data), and SLBF obtains state-of-the-art accuracy with reasonable run times (it also seems able to deal, to some extent, with nonlinear structures such as those arising in motion segmentation data). We test accuracy in various scenarios, in particular with intersecting subspaces and with outliers. While in this work we only theoretically justify our choice of “optimal” neighborhoods, we are hopeful about developing a more complete theory justifying our algorithms.
In particular, we believe that such a theory can be valid in the setting suggested by Soltanolkotabi and Candès (2011) for analyzing the SSC algorithm, while having additional noise and restricting the fraction of outliers (or modifying our algorithms so they are even more robust to outliers). We are also interested in rigorously quantifying the limitations of our algorithms (as conjectured in Sect. 4).

We make precise the local best-fit heuristic, using the \(\beta_2\) numbers (Jones 1990; David and Semmes 1991; Lerman 2003). We give an algorithm that approximately finds optimal neighborhoods in the above sense; in fact, we prove this under certain geometric conditions.

Using the local best-fit heuristic, we introduce the LBF and SLBF algorithms for HLM. At each point of a randomly chosen subset of the data, they use the best-fit flats of the “optimal” neighborhoods to build a global model by different methods (LBF is based on energy minimization and SLBF is a spectral method).

We perform extensive experiments on motion segmentation data (the Hopkins 155 benchmark of Tron and Vidal (2007)), face clustering (the extended Yale face database B), handwritten digits (the MNIST database), and artificial data, showing that both algorithms, in particular SLBF, are accurate on real and synthetic HLM problems, while LBF runs extremely fast (often on the order of ten times faster than most of the previously mentioned methods). For the cropped face data we identify a fundamental problem of local methods like LBF and SLBF, and we suggest a workaround that works for this particular data.

We demonstrate how the local best-fit heuristic can be used with other algorithms. In particular, we give experimental evidence to show that the K-flats algorithm (Ho et al. 2003) is improved by an initialization that is based on the local best-fit heuristic. We also use this heuristic to estimate the main parameters of both RANSAC (for HLM) (Yang et al. 2006) and ALC (Ma et al. 2007).

We show how the combination of LBF and the elbow method can quickly determine the number of subspaces.
The rest of this paper is organized as follows. In Sect. 2 we describe the LBF and SLBF algorithms and state a theorem giving conditions that guarantee that good neighborhoods can be found. Section 3 carefully tests the LBF and SLBF algorithms (while comparing them to other common HLM algorithms) on both artificial data of synthetic hybrid linear models and real data of motion segmentation in video sequences, face clustering and handwritten digits recognition. It also demonstrates how to determine the number of clusters by applying the fast algorithm of this paper together with the straightforward elbow method. Section 4 concludes with a brief discussion and mentions possibilities for future work.
2 The Local Best-Fit Flats Heuristic and the LBF and SLBF Algorithms
We describe two methods, LBF and SLBF, which have at their heart an estimation of local flats capturing the global structures of the data (or part of it). Both methods first find a set of candidate flats (the number is an input parameter for LBF). These are best-fit flats for local “optimal” neighborhoods (we describe an algorithm for approximately finding such neighborhoods and justify it in Sect. 2.1). The two algorithms process the candidates in different ways: LBF uses energy minimization and SLBF uses a spectral approach.
2.1 Choosing the Optimal Neighborhood
We choose the candidate flats that capture the global structure of the data by fitting them to “optimal” local neighborhoods of data points. For a point \(\mathbf{x}\in\mathbb{R}^{D}\), we define an optimal neighborhood as the largest ball B(x,r) (centered at x and with radius r) that contains only points sampled from the same cluster as x. Indeed, neighborhoods smaller than the optimal one (around x) may contain mainly the noise around an underlying subspace (of the hybrid linear model); consequently their local best-fit flats may not match the underlying flat. On the other hand, a neighborhood larger than the optimal one (around x) will contain points from more than one underlying flat, and the resulting best-fit flat will again not match any of the underlying flats. We note that the choice of neighborhood B(x,r) is equivalent to the choice of radius r, which we refer to as scale (even though it is also common to refer to log(r) or −log(r) as scale). While it is possible to supply a guess of the optimal scale as a parameter (e.g., as commonly done by fixing the number of nearest neighbors of x), we have found that it is possible to choose the optimal scale reasonably well automatically, while adapting it to the given point x.
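The scale selection described above can be sketched in Python as follows. This is a hedged illustration and not a transcription of Algorithm 1: it assumes that the \(\beta_2\) number of a neighborhood is the root-mean-square distance of its points to their best-fit d-flat divided by the neighborhood radius, that neighborhoods grow from S nearest neighbors in steps of T, and that the search stops at the first local minimum with index k≥1. The function names are hypothetical.

```python
import numpy as np

def beta2(points, center, d):
    """Scale-normalized error of the best-fit d-flat (assumed definition)."""
    centered = points - points.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)
    # The best-fit d-flat passes through the mean; its squared error is the
    # sum of the squared singular values beyond the first d.
    rms = np.sqrt(np.sum(s[d:] ** 2) / len(points))
    radius = np.max(np.linalg.norm(points - center, axis=1))
    return rms / radius

def first_local_min_scale(X, x, d, S, T):
    """Number of nearest neighbors of x at the first local minimum of beta2."""
    order = np.argsort(np.linalg.norm(X - x, axis=1))
    sizes = list(range(S, len(X) + 1, T))
    betas = [beta2(X[order[:n]], x, d) for n in sizes]
    for k in range(1, len(betas) - 1):  # k = 0 is excluded
        if betas[k] <= betas[k - 1] and betas[k] <= betas[k + 1]:
            return sizes[k]
    return sizes[-1]  # fallback when no interior local minimum exists

# A square has equal spread in both directions, so a line fits it poorly.
square = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
beta_square = beta2(square, np.zeros(2), d=1)
```

Dividing by the radius is what makes the quantity comparable across scales: small neighborhoods dominated by noise and large neighborhoods spanning several flats both give large values.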
2.1.1 Theoretical Justification
The following theorem aims to justify our strategy of estimating the optimal scale around each point. It shows that, in the continuous setting, the first local minimizer of \(\beta_2(\mathbf{x},r):=\beta_2(B(\mathbf{x},r))\) is approximately the distance from x to the nearest cluster that does not contain x (here the underlying model is a mixture of Lebesgue measures in strips around several subspaces and x is an arbitrary point on one of these subspaces). Therefore, if we choose the size of the neighborhood following Algorithm 1 (adapted to the continuous setting), then we approximately obtain the optimal neighborhood. It is rather standard to extend such estimates for measures to a probabilistic setting, where i.i.d. data is sampled from the continuous distribution. The theorem then holds with high probability for a sufficiently large sample size (we omit these details because of technicalities, which also require truncating the support of our continuous measure). The proof of this theorem is in Appendix A.
Theorem 1
The proof of this theorem indicates a weaker condition than (4), which is less intuitive. It also shows that \(r^{*}\to r_{0}\) as \(w/r_{0}\to 0\) and clarifies by example why the first local minimum of \(\beta_2(\mathbf{x}^{*},r_{0})\) is often bigger than \(r_{0}\) (see Remark 1).
2.1.2 The Complexity of Algorithm 1
2.2 The LBF Algorithm
The LBF algorithm is closely related to RANSAC, since both of them use candidate subspaces to fit the data set. However, Algorithm 1 gives LBF an advantage in choosing good candidates, whereas RANSAC fits a d-flat to d+1 arbitrarily chosen points.
2.2.1 The Complexity and Storage of Algorithm 2
For step 2 of this algorithm we need to run Algorithm 1 C times, and thus its complexity is of order O((d⋅D+log N)⋅C⋅N). The log N term comes from a full sort of N distances; if we restrict to a fixed number of scales, it can be replaced by a constant. Step 3 of Algorithm 2 requires C SVDs of matrices of size at most N×D in order to obtain the first d singular vectors in \(\mathbb{R}^{D}\). It thus has a complexity of at most O(C⋅d⋅D⋅N).
Step 2 of Algorithm 2 requires the evaluation of the N×C matrix of distances \(\|\mathbf{x}_{i}-P_{L_{j}}\mathbf{x}_{i}\|\) between the points of \(\mathrm{X}=\{\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{N}\}\) and the flats \(L_{1},L_{2},\ldots,L_{C}\). This costs O(C⋅d⋅D⋅N) operations, since each distance from a point to a subspace costs O(d⋅D). Moreover, the p passes have a complexity of order O(p⋅(C−K)⋅N). Therefore, step 2 of Algorithm 2 has a complexity of order O(C⋅N⋅(d⋅D+p)). Finally, step 7 of Algorithm 2 has a complexity of order O(K⋅d⋅D⋅N), which comes from the construction of the N×K matrix of distances from N points to K subspaces. Combining these complexities, we have an overall complexity of O(C⋅N⋅(d⋅D+p+log N)) for the LBF algorithm; as before, if we fix the number of scales independently of N, the log term can be replaced by a constant.
To compute the storage requirements of LBF, we note that the data set is saved in an N×D matrix, the candidate subspaces are organized in C projection matrices of size D×d and in addition the algorithm stores an N×C matrix of distances between the data points and the C candidate subspaces. Therefore, the storage of LBF is in the order of O(D⋅N+C⋅D⋅d+N⋅C).
2.3 The SLBF Algorithm
In the terminology of Vidal (2011), SLBF is a “spectral clustering-based method”, similar to SCC, LSA and SSC. These algorithms construct an N×N affinity matrix, whose (i,j)th entry represents the similarity between points i and j, and then apply spectral clustering using this affinity matrix. Ideally, the affinities of points from the same cluster are of order 1 and the affinities of points from different clusters are of order 0. Indeed, for the affinity \(\hat{\mathbf{S}}\) of SLBF, if \(\mathbf{x}_{i}\) and \(\mathbf{x}_{j}\) are in the same cluster, then we expect that \(\mathbf{x}_{i}\) is close to \(L_{j}\) and \(\mathbf{x}_{j}\) is close to \(L_{i}\), which means that \(\mathbf{S}_{i,j}\) is close to 0 and thus \(\hat{\mathbf{S}}_{i,j}\) is close to 1 (we assume here that \(L_{i}\) and \(L_{j}\) are good estimators for the underlying subspace of the cluster shared by \(\mathbf{x}_{i}\) and \(\mathbf{x}_{j}\), as suggested by Theorem 1). Otherwise, if \(\mathbf{x}_{i}\) and \(\mathbf{x}_{j}\) are not in the same cluster, then we expect that \(\mathbf{x}_{i}\) is sufficiently far from \(L_{j}\) and \(\mathbf{x}_{j}\) is sufficiently far from \(L_{i}\), which implies that \(\hat{\mathbf{S}}_{i,j}\) is close to 0. The choice of \(\sigma_{j}\) clearly affects this heuristic argument on the size of \(\hat{\mathbf{S}}_{i,j}\). In theory, \(\sigma_{j}\) should be larger than the noise, so that \(\hat{\mathbf{S}}_{i,j}\) is close to 1 when \(\mathbf{x}_{i}\) and \(\mathbf{x}_{j}\) are in the same cluster, but not so large that \(\hat{\mathbf{S}}_{i,j}\) is close to 1 when \(\mathbf{x}_{i}\) and \(\mathbf{x}_{j}\) are not in the same cluster. Therefore we use (9), where \(\sqrt{\min_{d\text{-flats } L} \sum_{\mathbf{x}\in\mathcal{N}_{j}}\|\mathbf{x}-P_{L}\mathbf{x}\|^{2}/|\mathcal{N}_{j}|}\) is the estimated noise of the data set around the point \(\mathbf{x}_{j}\) and λ is a parameter. Following the strategy in Chen and Lerman (2009), we choose different values of λ (our fixed default values are \([2,2e,2e^{2},\ldots,2e^{6}]\)) and consequently obtain several segmentation results (7 results when using our default values).
We then choose the segmentation with the smallest error in (10).
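The local noise estimate and the default grid of λ values mentioned above can be sketched in Python as follows. The sketch only evaluates the root-mean-square distance of a neighborhood to its best-fit d-flat (via PCA) and lists the seven default values of λ; how the two are combined into \(\sigma_{j}\) and into the affinity is specified by the equations of the paper and is not reproduced here. The function name `local_noise` and the constant `LAMBDAS` are hypothetical.

```python
import math
import numpy as np

def local_noise(neighborhood, d):
    """RMS distance of the rows of `neighborhood` to their best-fit d-flat."""
    centered = neighborhood - neighborhood.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)
    # Minimizing over d-flats leaves the squared singular values beyond d.
    return math.sqrt(float(np.sum(s[d:] ** 2)) / len(neighborhood))

# Default grid [2, 2e, 2e^2, ..., 2e^6]: one segmentation per value.
LAMBDAS = [2 * math.e ** k for k in range(7)]

square = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
noise_line = local_noise(square, d=1)   # error of the best-fit line
noise_plane = local_noise(square, d=2)  # a 2-flat contains all points
```

Running the spectral step once per value of λ and keeping the segmentation with the smallest error avoids hand-tuning the scale of the affinity for each data set.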
We remark that the robustness of SLBF to outliers can be partly explained by the robustness of spectral-type methods to outliers. Furthermore, it is possible to initially remove some outliers according to very small values of the corresponding diagonal elements of D (see e.g., Chen and Lerman (2009), Arias-Castro et al. (2011)).
Similar to SLBF, LSA (Yan and Pollefeys 2006) is also based on fitting local subspaces. However, LSA fits subspaces to local neighborhoods with a fixed number of points and is not adaptive. Moreover, the local subspaces of LSA are forced to be linear (since the affinity of LSA is based on principal angles between such subspaces), and this further restricts the applicability of LSA. There is also some similarity between the idea of SLBF and that of SCC (Chen and Lerman 2009). Indeed, we may view SCC as fitting candidate subspaces based on d+1 data points (the iterative procedure tries to force the points to be from the same cluster). However, in practice they operate very differently; in particular, SCC is not based just on local information (though a local version of SCC follows from Arias-Castro et al. (2011)). The SSC algorithm is also a spectral method, but similar to SCC its affinities are global (they are based on sparse representation of data points).
2.3.1 Complexity and Storage of the SLBF Algorithm
Step 1 of Algorithm 3 has a complexity of order O((d⋅D+log N)⋅N^2), since it applies Algorithm 1 to every point in the set X. The most expensive calculation of steps 2–4 in Algorithm 3 is the construction of S, which has a complexity of order O(d⋅D⋅N^2). The eigenvalue decomposition in step 5 has a complexity of order O(K⋅N^2) and the K-means algorithm in step 6 has a complexity of order O(\(n_{s}\)⋅N⋅K^2), where \(n_{s}\) is the number of K-means iterations.
Combining these complexities, we have an overall complexity of order O((d⋅D+log N)⋅N^2+\(n_{s}\)⋅N⋅K^2) for SLBF. As before, limiting to a constant number of scales replaces the log term with a constant.
We note that SLBF stores the data set in a D×N matrix, the candidate subspaces in N matrices of size D×d (recall that in SLBF every data point is assigned a subspace and thus C=N), and it also uses the N×N matrix S. Therefore, the storage of SLBF is of order O(N⋅D⋅d+N^2).
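As a worked example of the storage estimates for LBF and SLBF, the following Python sketch counts the stored matrix entries for a small problem. It is an illustration only: the function names are hypothetical, the count ignores constants and temporary buffers, and the choice C=70⋅K is the value used in the experiments of Sect. 3.

```python
def lbf_storage(N, D, d, K, candidates_per_flat=70):
    """Entries stored by LBF: data, C candidate bases and the N x C distances."""
    C = candidates_per_flat * K
    return D * N + C * D * d + N * C

def slbf_storage(N, D, d):
    """Entries stored by SLBF: data, N candidate bases and the N x N matrix S."""
    return D * N + N * D * d + N * N

# Two 2-flats in R^4 with 250 points each (N = 500, no outliers).
lbf_entries = lbf_storage(N=500, D=4, d=2, K=2)   # 2000 + 1120 + 70000
slbf_entries = slbf_storage(N=500, D=4, d=2)      # 2000 + 4000 + 250000
```

In this example the N×C distance matrix dominates the storage of LBF and the N×N matrix S dominates that of SLBF, which is why LBF scales to much larger N.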
2.4 Adaptation of the Proposed Algorithms to Motion Segmentation Data
Note that the first minimum in Theorem 1 excludes the left endpoint, and thus k=0 is excluded in Algorithm 1. In our experiments, we noticed that on data without too much noise it is useful to let the first scale count as a local minimum, i.e., to allow k=0 in Algorithm 1. We refer to the implementations of LBF and SLBF with those two techniques tailored for motion segmentation data as LBFMS and SLBFMS.
3 Experimental Results
In this section, we conduct experiments on artificial and real data sets to verify the effectiveness of the proposed algorithm in comparison to other HLM algorithms. We will see that in many situations, the methods we propose are fast and accurate; however, in Sect. 3.3 we will show a failure mode of our method, and discuss how this can be corrected.
In all the experiments below, the number C in Algorithm 2 is 70⋅K, where K is the number of subspaces, the number p in Algorithm 2 is 5⋅K, and the numbers S and T in Algorithm 1 are 2⋅d and 2 respectively, where d is the dimension of the subspace. In our experience, LBF and SLBF are very robust to changes in parameters, but unsurprisingly, there is a general trade-off between accuracy (higher C, higher p, smaller T) and run time (lower C, lower p, larger T). We have chosen these parameters for a balance between run time and accuracy. Nevertheless, we have insisted on using the same parameters for all data sets and experiments, even though particular parameters could obtain even better or near perfect results for some of the data sets. The experiments in Sects. 3.1 and 3.2 were run on a computer with an Intel Core 2 CPU at 2.66 GHz and 2 GB memory, and the experiments in Sects. 3.3 and 3.4 were run on a machine with an Intel Core 2 Quad Q6600 at 2.4 GHz and 8 GB memory.
3.1 Clustering Results on Artificial Data
We compare our algorithms with the following algorithms: Mixtures of PPCA (MoPPCA) (Tipping and Bishop 1999), K-flats (KF) (Ho et al. 2003), Local Subspace Affinity (LSA) (Yan and Pollefeys 2006), Spectral Curvature Clustering (SCC) (Chen and Lerman 2009), Random Sample Consensus (RANSAC) for HLM (Yang et al. 2006), Agglomerative Lossy Compression (ALC) (Ma et al. 2007) and GPCA with voting/robust GPCA (GPCA) (Ma et al. 2008; Yang et al. 2006). Throughout the rest of the paper, we use the Matlab code of the GPCA, MoPPCA and KF algorithms from http://perception.csl.uiuc.edu/gpca, the LSA algorithm from http://www.vision.jhu.edu/db, the SCC algorithm from http://www.math.umn.edu/~lerman/scc, the ALC algorithm from http://perception.csl.uiuc.edu/coding/motion/, the RANSAC algorithm from http://www.vision.jhu.edu/code/ and the SSC algorithm from http://www.cis.jhu.edu/~ehsan/ssc.htm.
For the SCC algorithm, we also try a slightly modified version tailored for motion segmentation, as in step 6 of Algorithm 3, which we refer to as SCCMS (SCC for motion segmentation). Following the notation of Chen and Lerman (2009, Algorithm 2), we let U be the N×K matrix whose columns are the top K left singular vectors of \(\mathbf{A}_{C}^{*}\), and we denote by Σ the diagonal K×K matrix whose elements are the top K singular values of \(\mathbf{A}_{C}^{*}\). The K-means step of SCCMS is then applied directly to the rows of the N×K matrix \(\mathbf{U}\mathbf{\Sigma}^{1/2}\) (as opposed to applying it to the rows of U, or of its row-wise normalization to norm 1, as in SCC).
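The difference between the two embeddings can be sketched in Python as follows. This is a minimal illustration under the assumption that the matrix playing the role of \(\mathbf{A}_{C}^{*}\) is available as a NumPy array `A`; the function name is hypothetical and the subsequent K-means step is omitted.

```python
import numpy as np

def sccms_embedding(A, K):
    """Rows of U * Sigma^(1/2) for the top K singular pairs of A."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)  # s is sorted descending
    return U[:, :K] * np.sqrt(s[:K])                 # scale column k by sqrt(s_k)

# Toy example: a diagonal matrix with singular values 4, 1 and 0.25.
A = np.diag([4.0, 1.0, 0.25])
E = sccms_embedding(A, K=2)
```

Weighting the singular vectors by the square roots of the singular values lets the more significant directions contribute more to the distances used by K-means, whereas clustering the rows of U alone treats all K directions equally.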
The MoPPCA algorithm is always initialized with a random guess of the membership of the data points. The LSCC algorithm is initialized by randomly picking 100⋅K (d+1)-tuples (following Chen and Lerman 2009), and KF is initialized with a random guess. Since algorithms like KF tend to converge to a local minimum, we use 10 restarts for MoPPCA and 30 restarts for KF, and record the misclassification rate of the restart with the smallest \(\ell_{2}\) error for both of these algorithms. The number of restarts was restricted by the running time and accuracy. In the SSC algorithm, we set the value of λ to 0.01, as suggested in the code.
The RANSAC for HLM and ALC algorithms (Yang et al. 2006; Ma et al. 2007) depend on a user-supplied inlier threshold. RANSAC (oracle) and ALC (oracle) use the oracle inlier bound given by the true noise variance of the model and thus clearly have an advantage over the other algorithms listed. RANSAC (ϵ from LBF) and ALC (ϵ from LBF) estimate the inlier threshold by the local best-fit flats heuristic of this paper. That is, they find best-fit neighborhoods for all N points using the latter heuristic and estimate the least error of approximation by d-flats in these N neighborhoods. The inlier bound ϵ is then the average of these errors. If the number of clusters resulting from ALC (ϵ from LBF or oracle) is larger than K, then we choose the K largest clusters and identify the points in the rest of the clusters as outliers. In some cases the RANSAC algorithm breaks down, and we then report it as N/A. The reason for this is that RANSAC (for HLM) (Yang et al. 2006) is very sensitive to the estimate of ϵ, and an overestimate can result in the removal of points belonging to more than one subspace, so that the algorithm may exhaust all points before detecting K subspaces.
We remark that GPCA cannot naturally deal with outliers, therefore we use robust GPCA with Multivariate Trimming (Yang et al. 2006) and the parameters ‘angleTolerance’ and ‘boundarythreshold’ are set to be 0.3 and 0.4 respectively.
The artificial data represents various instances of K linear subspaces in \(\mathbb{R}^{D}\). If their dimensions are fixed and equal to d, we follow Chen and Lerman (2009) and refer to the setting as \(d^{K}\in\mathbb{R}^{D}\). If they are mixed, then we follow Ma et al. (2008) and refer to the setting as \((d_{1},\ldots,d_{K})\in\mathbb{R}^{D}\). Fixing K and d (or \(d_{1},\ldots,d_{K}\)), we randomly generate 100 different instances of corresponding hybrid linear models according to the code in http://perception.csl.uiuc.edu/gpca. More precisely, for each of the 100 experiments, K linear subspaces of the corresponding dimensions in \(\mathbb{R}^{D}\) are randomly generated. The random variables sampled within each subspace are sums of two other variables. One of them is sampled from a uniform distribution in a d-dimensional ball of radius 1 of that subspace (centered at the origin for the case of linear subspaces). The other is sampled from a D-dimensional multivariate normal distribution with mean 0 and covariance matrix \(0.05^{2}\cdot\mathbf{I}_{D\times D}\). Then, for each subspace, 250 samples are generated according to the distribution just described. Next, the data is further corrupted with 5 % or 30 % uniformly distributed outliers in a cube of sidelength determined by the maximal distance of the former 250 samples to the origin (using the same code).
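The sampling scheme just described can be sketched in Python as follows. This is not the Matlab code used for the experiments, and several details are assumptions made for illustration: random subspaces are obtained from the QR factorization of a Gaussian matrix, the outlier fraction is taken relative to the number of inliers, and the outlier cube is \([-m,m]^{D}\) with m the maximal norm of the inliers. The function name and the label −1 for outliers are hypothetical.

```python
import numpy as np

def sample_hybrid_linear(K, d, D, n_per=250, noise=0.05, outlier_frac=0.3, seed=0):
    """Sample K noisy d-dimensional linear subspaces in R^D plus outliers."""
    rng = np.random.default_rng(seed)
    points, labels = [], []
    for k in range(K):
        # Orthonormal basis (D x d) of a random d-dimensional subspace.
        basis, _ = np.linalg.qr(rng.standard_normal((D, d)))
        # Uniform samples in the unit d-ball: random direction times r^(1/d).
        g = rng.standard_normal((n_per, d))
        g /= np.linalg.norm(g, axis=1, keepdims=True)
        coords = g * rng.random((n_per, 1)) ** (1.0 / d)
        # Embed in R^D and add isotropic Gaussian noise.
        points.append(coords @ basis.T + noise * rng.standard_normal((n_per, D)))
        labels.append(np.full(n_per, k))
    X = np.vstack(points)
    y = np.concatenate(labels)
    # Outliers: uniform in a cube scaled by the largest inlier norm (assumed rule).
    m = np.linalg.norm(X, axis=1).max()
    n_out = int(round(outlier_frac * len(X)))
    outliers = rng.uniform(-m, m, size=(n_out, D))
    return np.vstack([X, outliers]), np.concatenate([y, np.full(n_out, -1)])

# The setting 2^2 in R^4 with 30 % outliers.
X, y = sample_hybrid_linear(K=2, d=2, D=4)
```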
Since most algorithms (SCC, LSA, MoPPCA, LBF, SLBF, RANSAC, SSC) do not support mixed dimensions natively, we assume each subspace has the maximum dimension in the experiment. GPCA and ALC support mixed dimensions natively, and GPCA is the only algorithm for which we specify the dimension of each subspace in the mixed-dimension case (note that knowledge of the dimensions is unnecessary in the ALC algorithm).
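Mean percentage of misclassified points (e %) and mean running time in seconds (t(s)) in artificial data for linear-subspace cases or affine-subspace case

| Linear | | \(2^{2}\in\mathbb{R}^{4}\) | \(2^{2}\in\mathbb{R}^{4}\) | \(4^{2}\in\mathbb{R}^{6}\) | \(4^{2}\in\mathbb{R}^{6}\) | \(2^{4}\in\mathbb{R}^{4}\) | \(2^{4}\in\mathbb{R}^{4}\) | \(10^{2}\in\mathbb{R}^{15}\) | \(10^{2}\in\mathbb{R}^{15}\) | \((4,5,6)\in\mathbb{R}^{10}\) | \((4,5,6)\in\mathbb{R}^{10}\) | \((1,5)\in\mathbb{R}^{6}\) | \((1,5)\in\mathbb{R}^{6}\) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Outl. % | | 0 | 30 | 0 | 30 | 0 | 30 | 0 | 30 | 0 | 30 | 0 | 30 |
| LSCC | e % | 2.6 | 6.9 | 0.0 | 2.6 | 0.1 | 22.4 | 0.5 | 3.8 | 1.8 | 28.2 | N/A | 34.6 |
| | t(s) | 1.1 | 0.8 | 1.0 | 1.8 | 1.5 | 2.0 | 13.3 | 5.7 | 5.1 | 8.4 | N/A | 1.9 |
| LSCCMS | e % | 2.7 | 10.0 | 0.0 | 4.1 | 0.1 | 36.7 | 0.7 | 31.9 | 1.4 | 19.8 | N/A | 32.9 |
| | t(s) | 1.1 | 0.5 | 1.1 | 1.4 | 1.7 | 1.5 | 5.1 | 5.6 | 4.0 | 4.6 | N/A | 2.0 |
| LSA | e % | 18.4 | 19.6 | 0.1 | 12.7 | 0.4 | 21.0 | 0.1 | 9.9 | 5.9 | 6.6 | 27.4 | 35.4 |
| | t(s) | 6.8 | 16.0 | 7.1 | 20.8 | 23.8 | 54.4 | 11.7 | 31.5 | 20.1 | 54.4 | 6.6 | 13.8 |
| KF | e % | 2.5 | 15.8 | 2.5 | 18.4 | 0.1 | 34.3 | 0.0 | 33.8 | 1.0 | 30.6 | 20.2 | 23.5 |
| | t(s) | 0.5 | 0.6 | 0.2 | 0.8 | 0.7 | 1.8 | 0.4 | 1.0 | 0.7 | 2.8 | 0.3 | 0.5 |
| MoPPCA | e % | 2.5 | 14.2 | 0.0 | 17.7 | 0.1 | 34.2 | 0.0 | 38.8 | 1.6 | 34.7 | 23.4 | 24.0 |
| | t(s) | 0.3 | 0.5 | 0.2 | 0.7 | 0.7 | 2.0 | 0.2 | 1.1 | 1.1 | 3.3 | 0.5 | 0.5 |
| GPCA | e % | 6.0 | 2.5 | 0.0 | 2.0 | 0.1 | 6.3 | 0.0 | 14.6 | 14.6 | N/A | 5.9 | N/A |
| | t(s) | 2.1 | 38.0 | 1.9 | 85.2 | 10.8 | 136.2 | 11.2 | 546.0 | 73.8 | N/A | 0.7 | N/A |
| LBF | e % | 2.8 | 3.7 | 0.0 | 2.3 | 0.1 | 11.5 | 0.0 | 1.9 | 1.5 | 1.5 | 18.8 | 14.1 |
| | t(s) | 0.6 | 0.5 | 0.5 | 0.5 | 1.8 | 2.7 | 0.6 | 0.8 | 1.1 | 1.4 | 0.5 | 0.5 |
| LBFMS | e % | 2.7 | 3.0 | 0.0 | 2.6 | 0.1 | 11.7 | 0.0 | 2.2 | 1.3 | 1.5 | 19.5 | 13.7 |
| | t(s) | 0.6 | 0.5 | 0.4 | 0.5 | 1.7 | 2.6 | 0.4 | 0.6 | 0.9 | 1.3 | 0.4 | 0.4 |
| SLBF | e % | 5.2 | 6.3 | 0.1 | 7.0 | 0.1 | 23.9 | 0.0 | 6.2 | 2.0 | 2.4 | 11.1 | 13.5 |
| | t(s) | 11.2 | 20.7 | 9.4 | 21.7 | 65.0 | 174.9 | 9.5 | 23.3 | 23.2 | 64.2 | 9.3 | 15.3 |
| SLBFMS | e % | 7.8 | 11.7 | 0.1 | 6.6 | 0.2 | 46.6 | 0.0 | 4.8 | 1.9 | 2.6 | 19.7 | 22.1 |
| | t(s) | 12.0 | 24.0 | 8.8 | 24.4 | 68.1 | 202.0 | 8.4 | 23.5 | 22.0 | 72.4 | 9.8 | 16.3 |
| RANSAC (oracle) | e % | 2.7 | 2.6 | 2.9 | 2.1 | 8.0 | 9.4 | 0.5 | 5.8 | 1.7 | 1.5 | N/A | 31.6 |
| | t(s) | 0.1 | 0.1 | 0.1 | 0.2 | 0.1 | 0.2 | 5.9 | 6.7 | 1.5 | 7.1 | N/A | 0.2 |
| RANSAC (ϵ from LBF) | e % | 3.2 | 2.6 | 2.1 | 2.4 | 7.7 | 9.8 | 0.4 | 6.7 | 1.8 | 1.5 | N/A | 30.6 |
| | t(s) | 0.1 | 0.1 | 0.1 | 0.2 | 0.1 | 0.2 | 5.9 | 6.7 | 1.5 | 7.0 | N/A | 0.3 |
| ALC (oracle) | e % | 4.1 | 3.4 | 0.1 | 16.3 | 0.1 | 30.1 | 50.0 | 50.0 | 5.4 | 36.1 | 0.3 | 0.4 |
| | t(s) | 7.3 | 23.2 | 7.7 | 33.6 | 28.4 | 136.3 | 13.9 | 172.6 | 23.0 | 180.1 | 7.8 | 17.3 |
| ALC (ϵ from LBF) | e % | 4.5 | 5.7 | 0.1 | 10.0 | 0.1 | 14.0 | 50.0 | 50.0 | 2.5 | 1.8 | 0.4 | 0.3 |
| | t(s) | 8.0 | 28.0 | 8.1 | 37.9 | 29.6 | 121.9 | 16.6 | 152.4 | 24.0 | 151.6 | 8.3 | 18.1 |
| SSC | e % | 19.5 | 34.3 | 0.2 | 43.5 | 0.4 | 52.8 | 47.0 | 44.9 | 11.5 | 54.0 | 9.4 | 15.9 |
| | t(s) | 114.8 | 236.2 | 97.6 | 247.9 | 227.7 | 591.3 | 106.0 | 276.6 | 185.5 | 437.9 | 94.1 | 142.1 |

| Affine | | | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SCC | e % | 0.0 | 0.6 | 0.0 | 0.0 | 0.0 | 0.5 | 0.0 | 0.7 | 0.0 | 5.8 | N/A | N/A |
| | t(s) | 1.2 | 1.0 | 1.1 | 2.0 | 1.4 | 2.5 | 6.1 | 13.7 | 5.6 | 6.0 | N/A | N/A |
| SCCMS | e % | 0.0 | 2.2 | 0.0 | 0.5 | 0.0 | 5.8 | 0.0 | 0.0 | 0.0 | 3.1 | N/A | N/A |
| | t(s) | 1.2 | 0.7 | 1.2 | 1.6 | 1.7 | 2.2 | 5.4 | 6.0 | 4.6 | 4.8 | N/A | N/A |
| LSA | e % | 0.1 | 11.0 | 0.0 | 4.7 | 0.4 | 41.7 | 0.0 | 0.0 | 0.0 | 1.1 | 37.5 | 37.9 |
| | t(s) | 6.7 | 16.1 | 7.1 | 20.8 | 22.2 | 54.0 | 11.7 | 32.2 | 38.3 | 54.0 | 6.6 | 13.9 |
| KF | e % | 0.2 | 15.1 | 0.1 | 26.0 | 0.3 | 37.1 | 0.0 | 24.9 | 0.0 | 23.5 | 24.8 | 27.1 |
| | t(s) | 0.8 | 0.6 | 0.4 | 0.7 | 1.0 | 1.4 | 0.6 | 1.7 | 1.0 | 1.4 | 0.5 | 0.5 |
| MoPPCA | e % | 0.2 | 23.7 | 0.1 | 38.3 | 0.5 | 39.8 | 0.0 | 45.2 | 0.0 | 46.8 | 30.8 | 30.4 |
| | t(s) | 0.9 | 0.5 | 0.5 | 0.6 | 1.1 | 1.4 | 0.9 | 1.9 | 1.9 | 2.0 | 0.5 | 0.5 |
| GPCA | e % | 0.2 | 18.4 | 0.2 | 22.2 | 0.4 | 38.1 | 0.0 | 27.9 | 0.3 | N/A | N/A | N/A |
| | t(s) | 1.8 | 43.7 | 3.3 | 104.0 | 8.3 | 209.3 | 11.8 | 501.1 | 69.1 | N/A | N/A | N/A |
| LBF | e % | 0.0 | 2.0 | 0.0 | 0.7 | 0.0 | 4.5 | 0.0 | 0.3 | 0.0 | 0.0 | 4.7 | 11.2 |
| | t(s) | 0.7 | 0.6 | 0.5 | 0.6 | 1.9 | 2.8 | 0.6 | 0.8 | 1.2 | 1.5 | 0.4 | 0.5 |
| LBFMS | e % | 0.0 | 2.7 | 0.0 | 1.5 | 0.0 | 5.2 | 0.0 | 0.5 | 0.0 | 0.0 | 3.9 | 10.5 |
| | t(s) | 0.6 | 0.5 | 0.4 | 0.5 | 1.7 | 2.7 | 0.4 | 0.6 | 1.0 | 1.3 | 0.4 | 0.4 |
| SLBF | e % | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| | t(s) | 9.3 | 19.1 | 5.8 | 19.0 | 37.7 | 143.1 | 6.3 | 19.4 | 35.1 | 61.4 | 5.9 | 14.8 |
| SLBFMS | e % | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| | t(s) | 8.8 | 21.7 | 5.6 | 21.9 | 38.0 | 175.5 | 5.9 | 21.1 | 40.1 | 66.7 | 5.9 | 14.3 |
| RANSAC (oracle) | e % | 13.8 | 11.6 | 9.8 | 9.6 | 30.9 | 27.0 | 1.9 | 8.3 | 1.2 | 3.4 | N/A | 23.6 |
| | t(s) | 0.1 | 0.2 | 0.4 | 1.8 | 0.4 | 0.8 | 6.4 | 6.8 | 3.7 | 7.4 | N/A | 0.5 |
| RANSAC (ϵ from LBF) | e % | 13.6 | 11.6 | 11.6 | 10.4 | 29.9 | 28.5 | 1.4 | 9.6 | 1.2 | 2.4 | N/A | 23.1 |
| | t(s) | 0.1 | 0.2 | 0.4 | 1.9 | 0.4 | 0.8 | 6.4 | 6.7 | 3.7 | 7.4 | N/A | 0.5 |
| ALC (oracle) | e % | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 25.1 | 0.0 | 40.0 | 0.0 | 65.0 | 0.0 | 0.0 |
| | t(s) | 17.6 | 25.2 | 16.6 | 39.1 | 64.2 | 119.3 | 20.0 | 43.0 | 39.7 | 92.7 | 18.3 | 36.8 |
| ALC (ϵ from LBF) | e % | 0.0 | 0.4 | 0.0 | 0.0 | 0.0 | 0.3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| | t(s) | 18.7 | 26.8 | 17.2 | 29.8 | 65.2 | 113.6 | 24.4 | 55.5 | 47.9 | 85.2 | 18.8 | 38.9 |
| SSC | e % | 0.0 | 1.9 | 0.0 | 0.1 | 0.1 | 6.4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| | t(s) | 135.9 | 226.8 | 176.0 | 134.7 | 283.8 | 592.4 | 187.0 | 311.9 | 338.6 | 504.1 | 127.1 | 183.9 |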
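The e % entries above are percentages of misclassified points. Since cluster labels are only defined up to a permutation, such a percentage is commonly computed by minimizing the number of disagreements over all relabelings of the clusters. The following Python sketch follows that convention; it is an assumption about the evaluation protocol rather than a description of the code used for the experiments, and the function name is hypothetical. The exhaustive search over permutations is practical only for small K.

```python
from itertools import permutations

def misclassification_percent(true_labels, pred_labels, K):
    """Smallest percentage of disagreements over all relabelings of K clusters.

    Labels are assumed to be integers in range(K).
    """
    n = len(true_labels)
    best_hits = 0
    for perm in permutations(range(K)):
        # perm[p] is the true cluster matched to predicted cluster p.
        hits = sum(1 for t, p in zip(true_labels, pred_labels) if perm[p] == t)
        best_hits = max(best_hits, hits)
    return 100.0 * (n - best_hits) / n

# A pure relabeling gives 0 %, one wrong point out of four gives 25 %.
e_perfect = misclassification_percent([0, 0, 1, 1], [1, 1, 0, 0], K=2)
e_one_off = misclassification_percent([0, 0, 1, 1], [1, 1, 0, 1], K=2)
```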
3.2 Clustering Results on Motion Segmentation Data
We test the proposed algorithms on the Hopkins 155 database of motion segmentation, which is available at http://www.vision.jhu.edu/data/hopkins155. This data set contains 155 video sequences along with the coordinates of certain features extracted and tracked for each sequence in all its frames. The main task is to cluster the feature vectors (across all frames) according to the different moving objects and background in each video. It consists of three types of videos: checker, traffic and articulated (see Fig. 2 for demonstration of frames of such videos).
More formally, for a given video sequence, we denote the number of frames by F. In each sequence, we have either one or two independently moving objects, and the background can also move due to the motion of the camera. We let K be the number of moving objects plus the background, so that K is 2 or 3 (and distinguish accordingly between two-motions and three-motions). For each sequence, there are also N feature points \(\mathbf{y}_{1},\mathbf{y}_{2},\ldots,\mathbf{y}_{N}\in\mathbb{R}^{3}\) that are detected on the objects and the background. Let \(\mathbf{z}_{ij}\in\mathbb{R}^{2}\) be the coordinates of the feature point \(\mathbf{y}_{j}\) in the ith image frame for every 1≤i≤F and 1≤j≤N. Then \(\mathbf{z}_{j}=[\mathbf{z}_{1j},\mathbf{z}_{2j},\ldots,\mathbf{z}_{Fj}]\in\mathbb{R}^{2F}\) is the trajectory of the jth feature point across the F frames. The actual task of motion segmentation is to separate these trajectory vectors \(\mathbf{z}_{1},\mathbf{z}_{2},\ldots,\mathbf{z}_{N}\) into K clusters representing the K underlying motions.
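The construction of the trajectory vectors can be sketched in Python as follows, assuming that the tracked coordinates are stored in an array `tracks` of shape (F, N, 2), where `tracks[i, j]` holds the image coordinates of the jth feature point in the ith frame. The array layout and the function name are assumptions made for illustration; the benchmark files may store the data differently.

```python
import numpy as np

def trajectory_vectors(tracks):
    """Stack per-frame 2D coordinates into one row of length 2F per feature."""
    F, N, _ = tracks.shape
    # Row j becomes [z_1j, z_2j, ..., z_Fj], each z_ij being a pair (u, v).
    return tracks.transpose(1, 0, 2).reshape(N, 2 * F)

# Toy example with F = 2 frames and N = 3 feature points.
tracks = np.arange(12, dtype=float).reshape(2, 3, 2)
Z = trajectory_vectors(tracks)
```

The rows of `Z` are the points in \(\mathbb{R}^{2F}\) that a subspace clustering algorithm would receive as input.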
It has been shown (Costeira and Kanade 1998) that under the affine camera model, the trajectory vectors corresponding to different moving objects and the background across the $F$ image frames live in distinct affine subspaces of dimension at most three in $\mathbb{R}^{2F}$. Following this theory, we implement our algorithm with $d=3$.
We compare our algorithm with the following ones: improved GPCA for motion segmentation (GPCA) (Vidal et al. 2008), K-flats (KF) (Ho et al. 2003) (implemented for linear subspaces), Local Linear Manifold Clustering (LLMC) (Goh and Vidal 2007), Local Subspace Analysis (LSA) (Yan and Pollefeys 2006), Multi Stage Learning (MSL) (Sugaya and Kanatani 2004), Spectral Curvature Clustering (SCC) (Chen and Lerman 2009) and SCCMS (see description earlier), Sparse Subspace Clustering (SSC) (Elhamifar and Vidal 2009), and RANSAC for HLM (Yang et al. 2006).
Our misclassification rates for SCC differ from those reported in Chen and Lerman (2009) and Lauer and Schnorr (2009), and our misclassification rates for SSC differ from those reported in Elhamifar and Vidal (2009); the differences exceed twice the standard deviations of the misclassification rates obtained here. This may be explained by changes to the codes since those publications (at least for SSC). We remark, though, that the misclassification rates of SCCMS here are even slightly better than the misclassification rates of SCC in Chen and Lerman (2009).
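Table 2: The mean and median percentage of misclassified points for two-motions and three-motions in Hopkins 155 database

| Motions | Method | Checker Mean | Checker Median | Traffic Mean | Traffic Median | Articulated Mean | Articulated Median | All Mean | All Median |
|---|---|---|---|---|---|---|---|---|---|
| 2-motion | GPCA | 6.09 | 1.03 | 1.41 | 0.00 | 2.88 | 0.00 | 4.59 | 0.38 |
| 2-motion | LLMC 5 | 4.37 | 0.00 | 0.84 | 0.00 | 6.16 | 1.37 | 3.62 | 0.00 |
| 2-motion | LSA 4K | 2.57 | 0.27 | 5.43 | 1.48 | 4.10 | 1.22 | 3.45 | 0.59 |
| 2-motion | LBF(4K,3) | 3.65 | 0.00 | 3.89 | 0.00 | 4.43 | 0.15 | 3.78 | 0.00 |
| 2-motion | LBFMS(4K,3) | 2.90 | 0.00 | 1.64 | 0.00 | 2.51 | 0.06 | 2.54 | 0.00 |
| 2-motion | SLBF(2F,3) | 1.59 | 0.00 | 0.20 | 0.00 | 0.80 | 0.00 | 1.16 | 0.00 |
| 2-motion | SLBFMS(2F,3) | 1.28 | 0.00 | 0.21 | 0.00 | 0.94 | 0.00 | 0.98 | 0.00 |
| 2-motion | SCC(4K,3) | 2.42 | 0.00 | 4.44 | 0.00 | 9.51 | 2.04 | 3.60 | 0.00 |
| 2-motion | SCCMS(4K,3) | 2.00 | 0.00 | 0.35 | 0.00 | 4.11 | 1.12 | 1.77 | 0.00 |
| 2-motion | SSCN(4K,3) | 1.29 | 0.00 | 0.29 | 0.00 | 0.97 | 0.00 | 1.00 | 0.00 |
| 2-motion | MSL | 4.46 | 0.00 | 2.23 | 0.00 | 7.23 | 0.00 | 4.14 | 0.00 |
| 2-motion | RANSAC | 6.52 | 1.75 | 2.55 | 0.21 | 7.25 | 2.64 | 5.56 | 1.18 |
| 3-motion | GPCA | 31.95 | 32.93 | 19.83 | 19.55 | 16.85 | 28.66 | 28.66 | 28.26 |
| 3-motion | LLMC 4K | 12.01 | 9.22 | 7.79 | 5.47 | 9.38 | 9.38 | 11.02 | 6.81 |
| 3-motion | LLMC 5 | 10.70 | 9.21 | 2.91 | 0.00 | 5.60 | 5.60 | 8.85 | 3.19 |
| 3-motion | LSA 4K | 5.80 | 1.77 | 25.07 | 23.79 | 7.25 | 7.25 | 9.73 | 2.33 |
| 3-motion | LSA 5 | 30.37 | 31.98 | 27.02 | 34.01 | 23.11 | 23.11 | 29.28 | 31.63 |
| 3-motion | LBF(4K,3) | 8.50 | 1.26 | 16.31 | 13.52 | 20.75 | 20.75 | 10.77 | 1.70 |
| 3-motion | LBFMS(4K,3) | 6.97 | 1.15 | 7.06 | 0.62 | 21.38 | 21.38 | 7.81 | 0.98 |
| 3-motion | SLBF(2F,3) | 4.57 | 0.94 | 0.38 | 0.00 | 2.66 | 2.66 | 3.63 | 0.64 |
| 3-motion | SLBFMS(2F,3) | 3.33 | 0.39 | 0.24 | 0.00 | 2.13 | 2.13 | 2.64 | 0.22 |
| 3-motion | SCC(4K,3) | 7.80 | 1.04 | 8.05 | 2.37 | 7.07 | 7.07 | 7.81 | 0.67 |
| 3-motion | SCCMS(4K,3) | 6.28 | 0.80 | 4.09 | 0.58 | 7.22 | 7.22 | 5.89 | 0.68 |
| 3-motion | SSCN(4K,3) | 3.22 | 0.29 | 0.53 | 0.00 | 2.13 | 2.13 | 2.62 | 0.22 |
| 3-motion | MSL | 10.38 | 4.61 | 1.80 | 0.00 | 2.71 | 2.71 | 8.23 | 1.76 |
| 3-motion | RANSAC | 25.78 | 26.01 | 12.83 | 11.45 | 21.38 | 21.38 | 22.94 | 22.03 |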
By adapting the parameters of SLBFMS (or alternatively, SLBF, LBF, LBFMS), we can further improve its misclassification rates on Hopkins 155 (e.g., total 0.81 % for two-motions and 2.12 % for three-motions by SLBFMS). However, we have fixed all parameters in advance and insisted on using the same parameters on all four kinds of data (see the third paragraph of Sect. 3).
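Table 3: The standard deviation to the mean and median percentage of misclassified points for two-motions and three-motions in Hopkins 155 database

| Motions | Method | Checker Mean | Checker Median | Traffic Mean | Traffic Median | Articulated Mean | Articulated Median | All Mean | All Median |
|---|---|---|---|---|---|---|---|---|---|
| 2-motion | LBF(4K,3) | 0.71 | 0.00 | 1.22 | 0.00 | 1.04 | 0.66 | 0.50 | 0.00 |
| 2-motion | LBFMS(4K,3) | 0.53 | 0.00 | 1.06 | 0.00 | 1.14 | 0.28 | 0.47 | 0.00 |
| 2-motion | WLBF(4K,3) | 0.53 | 0.00 | 0.98 | 0.00 | 1.35 | 0.00 | 0.47 | 0.00 |
| 2-motion | SLBFMS(4K,3) | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 2-motion | SCC(4K,3) | 0.27 | 0.00 | 1.51 | 0.00 | 2.34 | 1.52 | 0.38 | 0.00 |
| 2-motion | SCCMS(4K,3) | 0.33 | 0.00 | 0.25 | 0.00 | 1.03 | 0.46 | 0.25 | 0.00 |
| 2-motion | SSCN(4K,3) | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 3-motion | LBF(4K,3) | 1.52 | 0.58 | 3.71 | 9.69 | 7.37 | 7.37 | 1.43 | 0.65 |
| 3-motion | LBFMS(4K,3) | 1.48 | 0.45 | 3.81 | 2.35 | 6.59 | 6.59 | 1.42 | 0.40 |
| 3-motion | SLBF(4K,3) | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 3-motion | SLBFMS(4K,3) | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 3-motion | SCC(4K,3) | 1.20 | 0.53 | 5.70 | 7.00 | 1.77 | 1.77 | 1.43 | 0.49 |
| 3-motion | SCCMS(4K,3) | 0.94 | 0.50 | 3.25 | 0.54 | 2.54 | 2.54 | 0.92 | 0.33 |
| 3-motion | SSCN(4K,3) | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |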
Table 4: Average total computation times for all 155 sequences

| RANSAC | LBFMS(4K,3) | LBF(4K,3) | SCCMS(4K,3) | SLBFMS(2F,3) | SLBF(2F,3) | SSCN(4K,3) |
|---|---|---|---|---|---|---|
| 60 s | 73 s | 91 s | 196 s | 28 min | 31 min | 427 min |
3.3 Clustering Results on the Extended Yale Face Database B
We test LBF, LBFMS, SLBF and SLBFMS and compare them with ALC, K-flats, and SSC on the extended Yale face database B (Lee et al. 2005), which is available at http://vision.ucsd.edu/leekc/ExtYaleDatabase/ExtYaleB.html. We will see that this data set shows a failure mode of our algorithms; and we will show how we can engineer a workaround.
We use subsets of the extended Yale face database B consisting of face images of K=2,3,…,10 persons under 64 varying lighting conditions. Our objective is to cluster these images according to the persons. In our implementation, for any fixed K we repeat each algorithm on 100 randomly chosen subsets of K persons. The HLM model is applicable to this database because the images of a face under variable lighting lie in a three-dimensional linear subspace if shadows are not considered (Lee et al. 2005), and in a nine-dimensional subspace if shadows are considered (Basri and Jacobs 2003). In our experiments, we found that the images of a person in this database lie roughly in a 5-dimensional subspace, and therefore we first reduce the dimension of the data to 5K (recall that K is the number of persons). We do not include the GPCA algorithm since it is slow and does not work well on this database. We also do not include SCC and RANSAC since the code provided returns errors for some examples. The setting of ALC (voting with K) follows Rao et al. (2010, Sect. 2.2) exactly: it chooses ε from 101 values in the range $10^{-5}$ to $10^{3}$ (see the code at http://perception.csl.uiuc.edu/coding/motion/#Software).
Table 5: Mean percentage of misclassified points and mean running time on clustering the extended Yale face database

| K | | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| LBF (without whitening) | e % | 32.49 | 54.42 | 57.45 | 56.00 | 56.24 | 56.94 | 59.53 | 59.66 | 60.74 |
| LBF (without whitening) | t(s) | 0.24 | 0.48 | 0.82 | 1.26 | 1.93 | 2.97 | 4.18 | 5.81 | 8.05 |
| LBFMS (without whitening) | e % | 18.27 | 36.22 | 48.24 | 50.18 | 49.99 | 50.68 | 53.08 | 54.06 | 54.73 |
| LBFMS (without whitening) | t(s) | 0.12 | 0.21 | 0.36 | 0.57 | 0.89 | 1.41 | 2.06 | 2.98 | 4.13 |
| LBF | e % | 7.94 | 8.33 | 12.89 | 17.83 | 27.40 | 31.89 | 35.04 | 38.53 | 38.95 |
| LBF | t(s) | 0.24 | 0.50 | 0.87 | 1.38 | 2.09 | 3.28 | 4.58 | 6.38 | 8.57 |
| LBFMS | e % | 8.40 | 9.51 | 12.18 | 15.57 | 19.18 | 22.88 | 27.20 | 30.39 | 33.17 |
| LBFMS | t(s) | 0.12 | 0.22 | 0.37 | 0.58 | 0.89 | 1.41 | 2.07 | 2.94 | 4.02 |
| SLBF | e % | 11.12 | 14.78 | 20.42 | 26.52 | 32.96 | 36.91 | 40.49 | 42.99 | 46.63 |
| SLBF | t(s) | 4.17 | 12.72 | 25.70 | 44.89 | 72.99 | 111.58 | 165.47 | 226.56 | 310.30 |
| SLBFMS | e % | 9.12 | 12.48 | 18.61 | 25.27 | 30.50 | 33.97 | 36.22 | 38.66 | 41.44 |
| SLBFMS | t(s) | 3.84 | 12.20 | 23.88 | 41.24 | 64.10 | 95.73 | 142.09 | 192.34 | 262.40 |
| ALC (voting with K) | e % | 3.46 | 6.08 | 14.59 | 29.59 | 67.06 | 69.04 | 76.00 | 73.94 | 77.16 |
| ALC (voting with K) | t(s) | 42.99 | 122.29 | 258.20 | 451.07 | 699.52 | 1090.96 | 1625.10 | 2384.69 | 3343.93 |
| ALC (ϵ from LBF) | e % | 10.43 | 15.23 | 32.20 | 42.15 | 58.10 | 62.54 | 70.84 | 81.14 | 84.25 |
| ALC (ϵ from LBF) | t(s) | 0.95 | 2.49 | 5.54 | 11.54 | 24.38 | 45.27 | 78.05 | 132.35 | 211.15 |
| SCC | e % | 5.39 | 11.82 | 29.39 | 41.96 | 49.56 | 54.51 | 55.49 | 57.24 | 58.94 |
| SCC | t(s) | 1.62 | 3.85 | 9.52 | 15.37 | 22.71 | 32.45 | 54.54 | 56.91 | 75.92 |
| SCCMS | e % | 4.51 | 15.05 | 36.00 | 51.68 | 59.66 | 64.15 | 68.71 | 71.18 | 74.01 |
| SCCMS | t(s) | 1.62 | 4.20 | 9.28 | 14.49 | 22.08 | 31.71 | 54.21 | 56.99 | 73.10 |
| SSC | e % | 6.45 | 8.10 | 10.04 | 10.32 | 11.02 | 11.85 | 12.47 | 13.41 | 15.44 |
| SSC | t(s) | 28.36 | 46.45 | 67.11 | 92.75 | 128.46 | 182.65 | 259.66 | 340.12 | 612.21 |
| K-flats | e % | 7.20 | 12.12 | 19.06 | 26.77 | 32.59 | 35.18 | 38.58 | 42.00 | 44.40 |
| K-flats | t(s) | 0.16 | 0.37 | 0.76 | 1.29 | 2.14 | 3.25 | 5.18 | 6.91 | 9.60 |
Table 6: The standard deviation (%) to the mean percentage of misclassified points on the extended Yale face database

| Real K | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|
| LBF (without whitening) | 20.46 | 14.73 | 4.87 | 3.89 | 5.54 | 4.99 | 4.63 | 3.95 | 3.22 |
| LBFMS (without whitening) | 18.23 | 18.77 | 13.40 | 6.74 | 4.52 | 5.51 | 5.31 | 4.90 | 4.14 |
| LBF | 5.27 | 3.73 | 7.97 | 9.86 | 11.21 | 10.38 | 8.27 | 6.52 | 6.20 |
| LBFMS | 4.25 | 3.08 | 5.33 | 6.24 | 7.73 | 8.02 | 8.29 | 8.05 | 7.25 |
| SLBF | 4.76 | 5.37 | 5.08 | 5.25 | 5.48 | 5.42 | 4.57 | 4.74 | 3.79 |
| SLBFMS | 4.77 | 5.37 | 5.84 | 4.91 | 3.75 | 3.76 | 2.87 | 3.01 | 3.22 |
| ALC (voting with K) | 2.21 | 6.93 | 13.87 | 14.89 | 16.84 | 24.71 | 18.05 | 21.62 | 16.98 |
| ALC (ϵ from LBF) | 13.14 | 12.96 | 14.91 | 16.40 | 15.22 | 12.22 | 10.89 | 6.76 | 6.10 |
| SCC | 5.21 | 11.71 | 14.65 | 10.60 | 6.68 | 5.14 | 4.67 | 4.32 | 5.03 |
| SCCMS | 2.84 | 13.66 | 14.66 | 10.41 | 8.29 | 6.72 | 5.61 | 5.93 | 5.46 |
| SSC | 4.57 | 3.76 | 4.52 | 3.82 | 3.59 | 2.87 | 3.18 | 3.45 | 5.21 |
| K-flats | 4.67 | 6.86 | 8.53 | 8.89 | 7.29 | 6.41 | 6.67 | 4.84 | 5.43 |
However, most points are actually closer to the subspace spanned by the same face than to the subspace spanned by the other face, if only by a little, and a global method such as SSC is still able to find and discriminate between the two affine clusters. The problem of data having large variance in directions irrelevant to a classification task is not unusual. A standard method of dealing with this situation is to “whiten” the data, i.e., to reduce the large singular values. A very crude whitening is obtained by simply removing the first few principal components. If we exclude the first two principal components after reducing the dimension to 5K for LBF/SLBF algorithms, we see in Table 5 that the results are greatly improved and become competitive.^{2} With more sophisticated whitening, the results can be further improved.
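The crude whitening described above (reduce to 5K dimensions, then discard the leading principal components) can be sketched as follows. This is a minimal NumPy illustration rather than the code used in the experiments: the function name `reduce_and_drop_top`, the mean-centering step, and the use of an SVD to carry out the PCA reduction are assumptions, and the random matrix merely stands in for image data.

```python
import numpy as np

def reduce_and_drop_top(X, target_dim, n_drop=2):
    """Project rows of X onto principal components n_drop .. target_dim-1.

    X: (N, D) data matrix, one point per row.
    target_dim: dimension after PCA reduction (5K in the text).
    n_drop: number of leading principal components to discard.
    Returns an (N, target_dim - n_drop) array of coordinates.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)          # centering is an assumption of this sketch
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    if not 0 <= n_drop < target_dim <= Vt.shape[0]:
        raise ValueError("need 0 <= n_drop < target_dim <= min(N, D)")
    kept = Vt[n_drop:target_dim]
    return Xc @ kept.T

# Stand-in data: two directions with very large variance, as in the failure mode.
rng = np.random.default_rng(0)
K = 2
X = rng.standard_normal((128, 50))
X[:, 0] *= 100.0
X[:, 1] *= 50.0
Y = reduce_and_drop_top(X, target_dim=5 * K, n_drop=2)
```

In this toy example the two inflated directions are removed, so the remaining coordinates in `Y` all have comparable scale.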
3.4 Clustering Results on MNIST Data Set
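Table 7: Mean percentage of misclassified points and mean running time on clustering MNIST data set (D=5 for GPCA, D=10 for other algorithms)

| Subsets | | [1 2] | [1 3] | [1 7] | [4 7] | [2 4 8] | [3 6 8] | [1 2 3] |
|---|---|---|---|---|---|---|---|---|
| K | | 2 | 2 | 2 | 2 | 3 | 3 | 3 |
| LBF | e % | 8.0 | 8.5 | 12.9 | 25.5 | 28.8 | 28.1 | 20.2 |
| LBF | t(s) | 0.4 | 0.4 | 0.3 | 0.4 | 0.7 | 0.7 | 0.7 |
| LBFMS | e % | 9.7 | 7.8 | 8.8 | 24.0 | 40.2 | 33.5 | 21.5 |
| LBFMS | t(s) | 0.2 | 0.2 | 0.2 | 0.2 | 0.5 | 0.4 | 0.4 |
| SLBF | e % | 0.5 | 1.0 | 2.0 | 3.0 | 3.8 | 19.7 | 17.3 |
| SLBF | t(s) | 13.9 | 13.7 | 13.5 | 14.5 | 41.9 | 41.0 | 42.7 |
| SLBFMS | e % | 0.5 | 1.0 | 2.0 | 3.0 | 3.8 | 19.7 | 17.3 |
| SLBFMS | t(s) | 12.8 | 13.7 | 13.0 | 14.6 | 38.6 | 46.3 | 39.0 |
| ALC (voting with K) | e % | 0.2 | 2.2 | 3.5 | 48.5 | 4.2 | 42.7 | 45.3 |
| ALC (voting with K) | t(s) | 830.5 | 823.3 | 813.3 | 753.2 | 1789.5 | 1871.8 | 1987.7 |
| ALC (ϵ from LBF) | e % | 20.3 | 32.0 | 51.8 | 27.5 | 4.0 | 30.3 | 14.5 |
| ALC (ϵ from LBF) | t(s) | 23.2 | 22.5 | 21.6 | 23.0 | 55.6 | 54.7 | 54.0 |
| SCC | e % | 7.0 | 6.4 | 11.4 | 23.4 | 22.8 | 26.7 | 39.2 |
| SCC | t(s) | 1.2 | 1.5 | 1.4 | 1.3 | 2.5 | 2.7 | 2.3 |
| SCCMS | e % | 6.3 | 7.9 | 10.5 | 23.2 | 23.3 | 26.9 | 32.8 |
| SCCMS | t(s) | 0.9 | 0.8 | 1.1 | 1.0 | 1.9 | 1.9 | 1.5 |
| GPCA | e % | 22.3 | 30.8 | 32.5 | 47.0 | 48.2 | 33.8 | 31.0 |
| GPCA | t(s) | 8.7 | 9.2 | 9.4 | 10.8 | 24.9 | 24.5 | 22.5 |
| K-flats | e % | 11.1 | 6.8 | 6.3 | 29.1 | 43.9 | 40.7 | 29.2 |
| K-flats | t(s) | 0.4 | 0.4 | 0.4 | 0.4 | 0.9 | 0.8 | 0.6 |
| SSC | e % | 4.5 | 3.5 | 9.0 | 21.0 | 19.5 | 24.5 | 49.3 |
| SSC | t(s) | 220.6 | 196.6 | 200.8 | 203.2 | 322.6 | 333.0 | 338.2 |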
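Table 8: Mean percentage of misclassified points and mean running time on clustering MNIST data set (D=50)

| Subsets | | [1 2] | [1 3] | [1 7] | [4 7] | [2 4 8] | [3 6 8] | [1 2 3] |
|---|---|---|---|---|---|---|---|---|
| K | | 2 | 2 | 2 | 2 | 3 | 3 | 3 |
| LBF | e % | 20.5 | 13.1 | 18.2 | 30.2 | 26.3 | 24.1 | 19.2 |
| LBF | t(s) | 2.8 | 2.8 | 2.6 | 3.1 | 5.2 | 5.1 | 4.7 |
| LBFMS | e % | 12.5 | 16.9 | 10.7 | 19.1 | 23.5 | 27.3 | 24.3 |
| LBFMS | t(s) | 1.3 | 1.3 | 1.3 | 1.3 | 2.3 | 2.3 | 2.3 |
| SLBF | e % | 8.3 | 4.3 | 2.3 | 13.8 | 4.3 | 17.5 | 21.7 |
| SLBF | t(s) | 15.1 | 15.0 | 14.6 | 16.8 | 37.5 | 38.5 | 39.5 |
| SLBFMS | e % | 5.5 | 3.3 | 5.0 | 5.5 | 3.2 | 18.5 | 21.7 |
| SLBFMS | t(s) | 11.8 | 12.3 | 12.3 | 12.5 | 34.3 | 36.9 | 34.4 |
| ALC (voting with K) | e % | 47.0 | 46.0 | 48.8 | 100.0 | 100.0 | 100.0 | 65.3 |
| ALC (voting with K) | t(s) | 1469.2 | 1445.6 | 1489.2 | 679.0 | 1530.1 | 1528.5 | 3032.4 |
| ALC (ϵ from LBF) | e % | 50.5 | 50.8 | 50.5 | 99.8 | 99.8 | 99.8 | 67.0 |
| ALC (ϵ from LBF) | t(s) | 93.0 | 93.6 | 91.0 | 9.4 | 18.2 | 17.9 | 163.5 |
| SCC | e % | 5.8 | 4.9 | 5.3 | 17.1 | 23.0 | 29.7 | 33.6 |
| SCC | t(s) | 0.9 | 1.0 | 1.1 | 0.9 | 1.6 | 2.0 | 1.7 |
| SCCMS | e % | 5.1 | 5.4 | 5.1 | 26.2 | 28.6 | 41.7 | 33.0 |
| SCCMS | t(s) | 0.9 | 1.0 | 1.2 | 1.0 | 1.8 | 1.9 | 2.0 |
| GPCA | e % | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| GPCA | t(s) | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| K-flats | e % | 10.9 | 14.9 | 13.5 | 30.4 | 45.3 | 41.6 | 26.9 |
| K-flats | t(s) | 2.8 | 2.9 | 2.9 | 3.1 | 6.2 | 5.6 | 5.1 |
| SSC | e % | 16.8 | 2.0 | 3.2 | 20.0 | 11.3 | 17.8 | 45.5 |
| SSC | t(s) | 411.8 | 402.7 | 395.1 | 396.0 | 760.9 | 763.1 | 777.0 |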
Table 9: The standard deviation to the mean percentage of misclassified points on clustering MNIST data set (D=5 for GPCA, D=10 for other algorithms)

| Subsets | [1 2] | [1 3] | [1 7] | [4 7] | [2 4 8] | [3 6 8] | [1 2 3] |
|---|---|---|---|---|---|---|---|
| K | 2 | 2 | 2 | 2 | 3 | 3 | 3 |
| LBF | 3.5 | 4.1 | 10.0 | 11.4 | 11.6 | 8.3 | 9.5 |
| LBFMS | 5.9 | 3.8 | 10.0 | 10.0 | 10.3 | 7.2 | 7.8 |
| SLBF | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| SLBFMS | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ALC (voting with K) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ALC (ϵ from LBF) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| SCC | 2.3 | 2.7 | 4.6 | 9.9 | 9.4 | 7.5 | 11.9 |
| SCCMS | 2.0 | 3.7 | 5.2 | 10.2 | 8.3 | 8.5 | 9.2 |
| GPCA | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| K-flats | 7.6 | 8.5 | 7.8 | 5.7 | 7.4 | 7.5 | 5.9 |
| SSC | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Table 10: The standard deviation of the mean percentage of misclassified points on clustering MNIST data set (D=50)

| Subsets | [1 2] | [1 3] | [1 7] | [4 7] | [2 4 8] | [3 6 8] | [1 2 3] |
|---|---|---|---|---|---|---|---|
| K | 2 | 2 | 2 | 2 | 3 | 3 | 3 |
| LBF | 5.6 | 8.0 | 8.3 | 10.6 | 11.0 | 6.0 | 6.0 |
| LBFMS | 8.7 | 10.5 | 11.4 | 11.2 | 12.3 | 8.9 | 9.1 |
| SLBF | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| SLBFMS | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ALC (voting with K) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ALC (ϵ from LBF) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| SCC | 0.6 | 1.0 | 0.9 | 10.3 | 3.7 | 4.3 | 13.9 |
| SCCMS | 0.4 | 0.7 | 0.9 | 15.5 | 5.4 | 4.5 | 5.8 |
| GPCA | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| K-flats | 7.2 | 11.3 | 11.1 | 7.5 | 7.3 | 8.1 | 7.7 |
| SSC | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
From Tables 7 and 8, SLBF and SLBFMS are the best among all the methods in terms of misclassification rates, although these misclassification rates are larger when K=3. SCC, SCCMS, SSC, LBF and LBFMS also perform well on this data set. ALC is almost as good as SLBF and SLBFMS when K=2, but it fails when K=3. LBF, LBFMS and K-flats are the fastest algorithms on the MNIST data set.
3.5 Automatic Determination of the Number of Flats
In the following sections, we compare SOD (LBF), i.e., SOD applying LBF, SOD (LBFMS), SOD (SLBF), SOD (SLBFMS), SOD (SCC), SOD (SCCMS) and SOD (SSC) with ALC (Ma et al. 2007) and part of GPCA (Vidal et al. 2005) on a number of artificial data sets and real data sets. These experiments were run on a machine with an Intel Core 2 Quad Q6600 at 2.4 GHz and 8 GB memory.
3.5.1 Finding the Number of Clusters on Artificial Data
We test SOD with LBF and SLBF on artificial data (where the number of clusters is not provided to the user) and compare them with some other methods (three variations of ALC, number of clusters by GPCA and SOD with SSC and SCC). The artificial data sets are generated by the Matlab code borrowed from the GPCA (Vidal et al. 2005) package at http://perception.csl.uiuc.edu/gpca. For each subspace, $100d$ initial data points are uniformly sampled in a unit cube in this subspace centered around the origin and then corrupted with Gaussian noise in $\mathbb{R}^D$ of standard deviation 0.05. For the last four experiments, we restrict the angle between subspaces to be at least $\pi/8$ for separation. The dimension $d$ is given and we let $K_{\max}=10$ in SOD.
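The sampling procedure described above can be sketched in NumPy as follows. This is not the GPCA package's Matlab generator: the function name `sample_hybrid_linear`, the use of a QR factorization to draw a random orthonormal basis, and the restriction to linear subspaces through the origin are assumptions of this illustration, and the minimum-angle constraint of the last four experiments is not enforced.

```python
import numpy as np

def sample_hybrid_linear(dims, D, points_per_dim=100, noise_std=0.05, seed=0):
    """Sample noisy points from random linear subspaces of R^D.

    dims: list of subspace dimensions, e.g. [2, 2, 2, 2] for four planes.
    For a d-dimensional subspace, points_per_dim * d points are drawn
    uniformly from the unit cube [-0.5, 0.5]^d inside that subspace and
    then perturbed by Gaussian noise in R^D.
    Returns (X, labels) with one point per row of X.
    """
    rng = np.random.default_rng(seed)
    X, labels = [], []
    for k, d in enumerate(dims):
        n = points_per_dim * d
        # Orthonormal basis (D x d) of a random d-dimensional subspace.
        basis, _ = np.linalg.qr(rng.standard_normal((D, d)))
        coords = rng.uniform(-0.5, 0.5, size=(n, d))
        pts = coords @ basis.T + noise_std * rng.standard_normal((n, D))
        X.append(pts)
        labels.append(np.full(n, k))
    return np.vstack(X), np.concatenate(labels)

# Four 2-dimensional subspaces in R^3, 200 points each.
X, labels = sample_hybrid_linear([2, 2, 2, 2], D=3)
```

The returned `labels` are the ground-truth cluster assignments, so the true number of clusters is simply the number of distinct labels.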
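Table 11: The mean percentage of incorrectness (e %) for finding the number of clusters K and the computation time in seconds t(s) on artificial data. The first seven data columns have no minimum angle; the last four have minimum angle = π/8.

| Method | | $1^6 \in \mathbb{R}^5$ | $2^4 \in \mathbb{R}^5$ | $3^3 \in \mathbb{R}^5$ | $1^6 \in \mathbb{R}^3$ | $2^4 \in \mathbb{R}^3$ | $3^3 \in \mathbb{R}^4$ | $10^2 \in \mathbb{R}^{15}$ | $1^6 \in \mathbb{R}^3$ | $2^4 \in \mathbb{R}^3$ | $3^3 \in \mathbb{R}^4$ | $10^2 \in \mathbb{R}^{15}$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SOD (LBF) | e % | 22 | 2 | 0 | 58 | 32 | 12 | 0 | 2 | 6 | 5 | 0 |
| SOD (LBF) | t(s) | 10.43 | 13.76 | 14.83 | 9.84 | 13.08 | 14.49 | 34.16 | 9.95 | 13.22 | 14.47 | 34.04 |
| SOD (LBFMS) | e % | 13 | 1 | 3 | 67 | 33 | 9 | 0 | 3 | 8 | 6 | 0 |
| SOD (LBFMS) | t(s) | 8.70 | 11.90 | 12.92 | 8.37 | 11.54 | 12.84 | 27.56 | 8.42 | 11.60 | 12.84 | 27.69 |
| SOD (SLBF) | e % | 75 | 10 | 5 | 0 | 90 | 95 | 55 | 0 | 85 | 90 | 55 |
| SOD (SLBF) | t(s) | 1097.19 | 2148.06 | 2895.85 | 1076.24 | 1774.74 | 2629.26 | 16124.50 | 1224.96 | 2387.70 | 2830.83 | 16510.13 |
| SOD (SLBFMS) | e % | 90 | 95 | 70 | 0 | 90 | 85 | 85 | 0 | 75 | 80 | 80 |
| SOD (SLBFMS) | t(s) | 908.76 | 2094.68 | 3141.77 | 927.25 | 1740.03 | 2695.59 | 15754.05 | 990.88 | 2302.66 | 3010.64 | 16493.95 |
| ALC (voting) | e %(K) | 24 | 12 | 11 | 32 | 30 | 17 | 100 | 5 | 9 | 9 | 100 |
| ALC (voting) | t(s) | 2094.75 | 2700.07 | 3530.26 | 1207.54 | 2346.69 | 3628.24 | 119584.04 | 1184.08 | 2354.19 | 3956.05 | 117353.17 |
| ALC (ϵ from LBF) | e %(K) | 1 | 0 | 1 | 20 | 20 | 3 | 58 | 0 | 3 | 1 | 63 |
| ALC (ϵ from LBF) | t(s) | 23.72 | 43.50 | 57.50 | 19.76 | 36.67 | 53.25 | 1516.02 | 19.81 | 36.60 | 53.01 | 1770.77 |
| ALC (oracle) | e %(K) | 1 | 0 | 0 | 34 | 31 | 1 | 16 | 0 | 10 | 1 | 13 |
| ALC (oracle) | t(s) | 23.74 | 43.44 | 59.14 | 20.49 | 37.49 | 53.59 | 1370.92 | 20.22 | 37.41 | 54.11 | 1354.11 |
| GPCA | e %(K) | 88 | 100 | 100 | 27 | 100 | 100 | 100 | 13 | 100 | 100 | 100 |
| GPCA | t(s) | 0.03 | 0.09 | 0.12 | 0.06 | 0.09 | 0.12 | 1.30 | 0.04 | 0.09 | 0.12 | 1.30 |
| SOD (SCC) | e %(K) | 35 | 21 | 1 | 63 | 39 | 17 | 0 | 9 | 32 | 11 | 1 |
| SOD (SCC) | t(s) | 32.09 | 61.26 | 95.79 | 25.83 | 59.41 | 76.13 | 475.45 | 26.74 | 41.95 | 61.53 | 466.79 |
| SOD (SCCMS) | e %(K) | 71 | 32 | 2 | 80 | 50 | 12 | 0 | 46 | 33 | 3 | 0 |
| SOD (SCCMS) | t(s) | 31.78 | 67.77 | 111.15 | 22.29 | 55.25 | 74.07 | 475.50 | 24.53 | 51.98 | 75.03 | 471.31 |
| SOD (SSC) | e %(K) | 10 | 80 | 70 | 100 | 70 | 70 | 100 | 50 | 80 | 80 | 100 |
| SOD (SSC) | t(s) | 39.88 | 2634.80 | 3039.55 | 1708.37 | 2447.01 | 2925.27 | 14918.10 | 1452.43 | 2101.84 | 2641.68 | 14227.32 |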
As shown in Table 11, ALC (oracle) and ALC (ϵ from LBF) work best for low dimensions (d=1,2,3), but in real problems the appropriate choice of ϵ (the noise level) is usually unknown. The local best-fit flat heuristic provides a good estimate of the distortion rate and helps ALC reduce its running time. ALC (voting) is not as good as SOD (LBF) on artificial data. All options of ALC suffer from high computational complexity, especially ALC (voting). SOD (LBF) and SOD (LBFMS) produce reasonable outputs and have an obvious advantage in computing time. GPCA is very fast, but does not work well.
3.5.2 Finding the Number of Clusters on the Extended Yale Face Database B
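Table 12: The mean percentage of incorrectness (e %) for finding the correct number of clusters K and the computation time in seconds t(s) on the extended Yale face database

| Real K | | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| SOD (LBF) | e %(K) | 62 | 61 | 69 | 78 | 84 |
| SOD (LBF) | t(s) | 1.30 | 3.60 | 5.69 | 11.30 | 15.84 |
| SOD (LBFMS) | e %(K) | 65 | 75 | 78 | 81 | 80 |
| SOD (LBFMS) | t(s) | 0.67 | 1.65 | 2.49 | 4.90 | 6.83 |
| SOD (SLBF) | e %(K) | 24 | 60 | 70 | 86 | 98 |
| SOD (SLBF) | t(s) | 115.97 | 303.02 | 338.35 | 729.74 | 811.40 |
| SOD (SLBFMS) | e %(K) | 20 | 60 | 76 | 92 | 96 |
| SOD (SLBFMS) | t(s) | 106.87 | 272.88 | 306.22 | 649.50 | 721.42 |
| ALC (voting) | e %(K) | 100 | 100 | 100 | 100 | 100 |
| ALC (voting) | t(s) | 42.99 | 122.29 | 258.20 | 451.07 | 699.52 |
| ALC (ϵ from LBF) | e %(K) | 42 | 36 | 76 | 86 | 100 |
| ALC (ϵ from LBF) | t(s) | 0.95 | 2.49 | 5.54 | 11.54 | 24.38 |
| GPCA | e %(K) | 100 | 100 | 100 | 100 | 100 |
| GPCA | t(s) | 0.07 | 0.13 | 0.52 | 0.71 | 1.02 |
| SOD (SSC) | e %(K) | 100 | 8 | 12 | 28 | 38 |
| SOD (SSC) | t(s) | 172.50 | 389.66 | 567.39 | 1015.99 | 1336.57 |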
We see from Table 12 that SOD only performs well with SSC with K smaller than 4. We note that this is due to the difficulty of the database. Indeed for a simpler database such as Yale Face database B (Georghiades et al. 2001) of uncropped faces, SOD (SLBF), SOD (SLBFMS), ALC (ϵ from LBF) and ALC (voting) have perfect detection for K≤10 (whitening is not applied then).
3.5.3 Finding the Number of Clusters on MNIST Data Set
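Table 13: The mean percentage of incorrectness (e %) for finding the correct number of clusters K and the computation time in seconds t(s) on MNIST data set (D=10)

| Subsets | | [1 2] | [1 3] | [1 7] | [4 7] | [2 4 8] | [3 6 8] | [1 2 3] |
|---|---|---|---|---|---|---|---|---|
| K | | 2 | 2 | 2 | 2 | 3 | 3 | 3 |
| SOD (LBF) | e % | 16.8 | 3.8 | 50.8 | 50.4 | 75.6 | 70.0 | 54.8 |
| SOD (LBF) | t(s) | 3.5 | 3.2 | 3.0 | 3.3 | 7.7 | 7.5 | 7.3 |
| SOD (LBFMS) | e % | 9.6 | 6.6 | 33.4 | 68.2 | 80.4 | 76.6 | 44.2 |
| SOD (LBFMS) | t(s) | 1.9 | 1.9 | 1.9 | 1.8 | 4.6 | 4.6 | 4.7 |
| SOD (SLBF) | e % | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 100.0 | 0.0 |
| SOD (SLBF) | t(s) | 173.9 | 164.6 | 160.3 | 248.6 | 710.1 | 610.9 | 548.5 |
| SOD (SLBFMS) | e % | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 100.0 | 0.0 |
| SOD (SLBFMS) | t(s) | 164.6 | 159.9 | 150.1 | 228.5 | 676.6 | 586.4 | 556.2 |
| ALC (voting) | e % | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| ALC (voting) | t(s) | 830.4 | 823.2 | 813.2 | 753.2 | 1789.5 | 1871.8 | 1987.5 |
| ALC (ϵ from LBF) | e % | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 0.0 | 100.0 |
| ALC (ϵ from LBF) | t(s) | 23.2 | 22.5 | 21.5 | 22.9 | 55.6 | 54.7 | 54.0 |
| GPCA | e % | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| GPCA | t(s) | 1.0 | 1.0 | 1.0 | 1.1 | 2.8 | 2.8 | 2.7 |
| SOD (SCC) | e %(K) | 3.8 | 7.8 | 66.4 | 81.8 | 64.4 | 47.6 | 82.6 |
| SOD (SCC) | t(s) | 14.5 | 13.3 | 14.7 | 16.9 | 37.5 | 34.1 | 35.0 |
| SOD (SCCMS) | e %(K) | 2.4 | 16.4 | 53.0 | 77.4 | 70.4 | 49.6 | 77.8 |
| SOD (SCCMS) | t(s) | 13.7 | 13.8 | 13.5 | 16.4 | 38.0 | 35.6 | 29.4 |
| SOD (SSC) | e %(K) | 0.0 | 0.0 | 0.0 | 100.0 | 0.0 | 100.0 | 100.0 |
| SOD (SSC) | t(s) | 233.6 | 210.3 | 213.3 | 218.4 | 380.0 | 386.4 | 390.5 |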
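Table 14: The mean percentage of incorrectness (e %) for finding the correct number of clusters K and the computation time in seconds t(s) on MNIST data set (D=50)

| Subsets | | [1 2] | [1 3] | [1 7] | [4 7] | [2 4 8] | [3 6 8] | [1 2 3] |
|---|---|---|---|---|---|---|---|---|
| K | | 2 | 2 | 2 | 2 | 3 | 3 | 3 |
| SOD (LBF) | e % | 45.0 | 35.0 | 54.0 | 79.0 | 72.0 | 67.0 | 60.0 |
| SOD (LBF) | t(s) | 22.9 | 23.5 | 22.2 | 24.9 | 56.2 | 54.6 | 51.1 |
| SOD (LBFMS) | e % | 32.0 | 22.0 | 38.0 | 66.0 | 44.0 | 82.0 | 58.0 |
| SOD (LBFMS) | t(s) | 12.2 | 12.2 | 12.2 | 12.2 | 29.3 | 29.4 | 29.4 |
| SOD (SLBF) | e % | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 100.0 | 100.0 |
| SOD (SLBF) | t(s) | 204.2 | 198.1 | 207.8 | 295.8 | 864.5 | 766.5 | 706.1 |
| SOD (SLBFMS) | e % | 0.0 | 0.0 | 100.0 | 0.0 | 0.0 | 100.0 | 100.0 |
| SOD (SLBFMS) | t(s) | 213.7 | 201.7 | 176.6 | 259.9 | 748.1 | 640.0 | 681.1 |
| ALC (voting) | e % | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| ALC (voting) | t(s) | 1469.2 | 1445.6 | 1489.2 | 679.0 | 1530.1 | 1528.5 | 3032.4 |
| ALC (ϵ from LBF) | e % | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| ALC (ϵ from LBF) | t(s) | 93.0 | 93.6 | 91.0 | 9.4 | 18.2 | 17.9 | 163.5 |
| GPCA | e % | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| GPCA | t(s) | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| SOD (SCC) | e %(K) | 0.0 | 4.0 | 1.0 | 50.5 | 78.8 | 30.3 | 83.8 |
| SOD (SCC) | t(s) | 14.9 | 10.6 | 11.6 | 11.6 | 24.7 | 26.2 | 25.4 |
| SOD (SCCMS) | e %(K) | 0.0 | 0.0 | 0.0 | 42.4 | 89.9 | 97.0 | 93.9 |
| SOD (SCCMS) | t(s) | 12.6 | 13.0 | 14.7 | 13.9 | 34.0 | 36.8 | 30.7 |
| SOD (SSC) | e %(K) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 100.0 | 100.0 |
| SOD (SSC) | t(s) | 426.4 | 417.6 | 409.3 | 413.5 | 823.8 | 821.2 | 836.8 |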
For all the methods, determining the number K of clusters becomes very difficult when the real K is larger than 3. For real K≤3, we see from Table 13 that when we project data to 10-dimensional space, ALC and GPCA fail in most cases, except for ALC (ϵ from LBF) on digits [3 6 8]. SOD (SLBF), SOD (SLBFMS) and SOD (SSC) outperform all others although they are not very efficient.
3.6 Initializing K-Flats with the Local Best-Fit Heuristic
Here we demonstrate that our choice of neighborhoods in Algorithm 1 can be used to get a more robust initialization of K-flats. We work with geometric farthest insertion. For fixed neighborhood sizes, say of $m$ neighbors, this goes as follows: we pick a random point $x_0$ and then find the best-fit flat $F_0$ for the $m$-point neighborhood of $x_0$. Then we find the point $x_1$ in our data farthest from $F_0$, find the best-fit flat $F_1$ of the $m$-point neighborhood of $x_1$, and then choose the point $x_2$ farthest from $F_0$ and $F_1$ to continue. We stop when we have $K$ flats; we use these as an initialization for K-flats.
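The farthest-insertion procedure with a fixed neighborhood size m can be sketched as below. This is an illustrative NumPy version under stated assumptions, not the paper's implementation: the function names are hypothetical, best-fit flats are computed by an SVD of the centered neighborhood, and "farthest from F_0 and F_1" is interpreted as maximizing the distance to the nearest flat chosen so far. The adaptive choice of neighborhood size from Algorithm 1 is not included.

```python
import numpy as np

def best_fit_flat(P, d):
    """Return (center, basis) of the least-squares d-dimensional flat for rows of P."""
    center = P.mean(axis=0)
    _, _, Vt = np.linalg.svd(P - center, full_matrices=False)
    return center, Vt[:d]                      # basis rows are orthonormal

def dist_to_flat(X, flat):
    """Euclidean distance from each row of X to the flat."""
    center, basis = flat
    R = X - center
    return np.linalg.norm(R - (R @ basis.T) @ basis, axis=1)

def farthest_insertion_init(X, K, d, m, seed=0):
    """Pick K local best-fit flats by geometric farthest insertion."""
    rng = np.random.default_rng(seed)
    idx = int(rng.integers(len(X)))            # random starting point x_0
    flats = []
    min_dist = np.full(len(X), np.inf)
    for _ in range(K):
        # m nearest neighbors of the current seed point (the point itself included).
        nbrs = np.argsort(np.linalg.norm(X - X[idx], axis=1))[:m]
        flats.append(best_fit_flat(X[nbrs], d))
        # Distance of every point to its closest flat so far.
        min_dist = np.minimum(min_dist, dist_to_flat(X, flats[-1]))
        idx = int(np.argmax(min_dist))         # next seed: farthest from all flats
    return flats

# Toy data: two parallel lines in the plane, 20 points each.
t = np.linspace(-1.0, 1.0, 20)
X = np.vstack([np.c_[t, np.zeros(20)], np.c_[t, np.full(20, 5.0)]])
flats = farthest_insertion_init(X, K=2, d=1, m=5)
```

The returned flats would then serve as the starting point for the alternating assignment and refitting steps of K-flats.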
In Fig. 3 we plot the number of neighbors picked by our algorithm for each point of a realization of data set #3.
4 Conclusions and Future Work
We presented a very simple geometric method for HLM based on selecting a set of local best-fit flats. The size of the local neighborhoods is determined automatically using the $\ell_2$ $\beta$ numbers; it is proven under certain geometric conditions that our method approximately finds the optimal local neighborhoods. We give extensive experimental evidence demonstrating the state of the art accuracy and speed of the algorithm on synthetic and real hybrid linear data.
We believe that one promising next step is to adapt the method for multi-manifold clustering. As it is, our method, while quite good at unions of flats, cannot successfully handle unions of curved manifolds. We expect that by gluing together groups of local best-fit flats related by some smoothness conditions, we will be able to approach the problem of clustering data which lies on unions of smooth manifolds.
1. Most points are roughly as close to an affine set they don’t belong to as they are to their nearest O(d) neighbors;
2. A large fraction of the points have optimal neighborhoods contained in only one of the affine clusters, the principal components of these neighborhoods are good approximations to the clusters; and LBF and SLBF recover good approximations to the two affine clusters, or
3. The data looks locally lower than d-dimensional, even though each cluster is globally d-dimensional, and has high curvature; in this case, there are pure optimal neighborhoods, but the local estimation does not accurately represent the affine clusters.
Throughout the paper, outliers are corrupted data points, i.e., points generated by a distribution that assigns sufficiently small probability to small neighborhoods around the underlying subspaces. This is different from corrupting selected entries of data points.
Acknowledgements
This work was supported by NSF grants DMS0612608, DMS0811203, DMS0915064 and DMS0956072. Thanks to the action editor and the reviewers for the careful reading and comments; Peter Jones, Mauro Maggioni and Amit Singer for discussions that motivated our exploration for a multiscale SVD-based HLM algorithm; Ehsan Elhamifar and René Vidal for answering various questions regarding the SSC code and providing us an initial version before the code was available to the public; Allen Yang for clarifying the estimation of the number of clusters in GPCA; and the IMA for a stimulating multi-manifold modeling workshop.
Open Access
This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.