Hybrid Linear Modeling via Local Best-Fit Flats

We present a simple and fast geometric method for modeling data by a union of affine subspaces. The method begins by forming a collection of local best-fit affine subspaces, i.e., subspaces approximating the data in local neighborhoods. The correct sizes of the local neighborhoods are determined automatically by the Jones’ β2 numbers (we prove under certain geometric conditions that our method finds the optimal local neighborhoods). The collection of subspaces is further processed by a greedy selection procedure or a spectral method to generate the final model. We discuss applications to tracking-based motion segmentation and clustering of faces under different illuminating conditions. We give extensive experimental evidence demonstrating the state of the art accuracy and speed of the suggested algorithms on these problems and also on synthetic hybrid linear data as well as the MNIST handwritten digits data; and we demonstrate how to use our algorithms for fast determination of the number of affine subspaces.


Introduction
Several problems from computer vision, such as motion segmentation and face clustering, give rise to modeling data by multiple subspaces.This is referred to as Hybrid Linear Modeling (HLM) or alternatively as "subspace clustering".In tracking-based motion segmentation, extracted feature points (tracked in all frames) are clustered according to the different moving objects.Under the affine camera model, the vectors of coordinates of feature points corresponding to a moving rigid object lie on an affine subspace of dimension at most 3 (see [8]).Thus clustering different moving objects is equivalent to clustering different affine subspaces.Similarly, in face clustering, it has been proved that the set of all images of a Lambertian object under a variety of lighting conditions form a convex polyhedral cone in the image space, and this cone can be accurately approximated by a low-dimensional linear subspace (of dimension at most 9) [12,16,2].One may thus cluster certain images of faces by HLM algorithms.
The mathematical formulation of HLM assumes a data set X = {x i } N i=1 ⊆ R D where each x i lies on (or around) one of K flats (i.e., affine subspaces) and requires to find the partition of X corresponding to the flats.We would like to be able to do this when the data has been corrupted by additive noise and outliers 1 ; in this case we may also want to determine the flats themselves.We first assume here that all flats have the same known dimension d (i.e., they are dflats) and that their number K is known.In Section 3 we address to some extent the cases of unknown K and mixed dimensions.
Many of the algorithms described above require an initial guess of the subspaces.For example, the K-flats algorithm is an iterative method that requires an initialization, and in SCC, one needs to carefully choose collections of d + 1 data points that lie close to each of the underlying dflats.Other algorithms require some information about the suspected deviations from the hybrid linear model; for example both RANSAC (for HLM) and ALC ask for a model parameter corresponding to the level of noise.
Here we propose a straightforward geometric method for the estimation of local subspaces, which is inspired by [17,10,22] and [13,26,25].These local subspace estimates can be used to set the model parameters for or initialize an HLM algorithm.The basic idea is that for a data set X sampled from a hybrid linear model (perhaps with some noise), there are many points x such that the principal components of an appropriately sized neighborhood of x give a good approximation to the subspace x belongs to.Using local subspaces to infer the global hybrid linear model was suggested in [40] for linear subspaces; however, there they use very small neighborhoods that are not adaptive to the structure of the data (e.g., amount of noise etc.).An "appropriately sized neighborhood" needs to be larger than the noise, so that the subspace is recognized.However, the neighborhood cannot be so large that it contains points from multiple subspaces.The correct choice of this size is carefully quantified in Section 2.1.
In addition to studying how to estimate local subspaces, we describe two complete HLM algorithms which are natural extensions of the local estimation: LBF (Local Best-fit Flat) and SLBF (spectral LBF).On many data sets, the first obtains state of the art speed with nearly state of the art accuracy (it can also deal with very large data), and the second obtains state of the art accuracy (SLBF) with reasonable run times (it seems to be able to deal to some extent with some nonlinear structures as the ones arising in motion segmentation data).We remark that we test accuracy in various scenarios, but in particular, with intersecting subspaces and with outliers.While in this work we only theoretically justify our choice of initializer, we are hopeful about developing a more complete theory justifying our algorithms.
In particular, we believe that such a theory can be valid in the setting suggested by Soltanolkotabi and Candès [31] for analyzing the SSC algorithm, while having additional noise and restricting the fraction of outliers (or modifying our algorithms so they are even more robust to outliers).We are also interested in rigorously quantifying the limitations of our algorithms (as conjectured in Section 4).
We summarize the main contributions of this work, which is the full length version of [45], as follows.
• We make precise the local best-fit heuristic, using the β 2 numbers [17,10,22].We give an algorithm to approximately find optimal neighborhoods in the above sense, in fact, we prove this under certain geometric conditions.• Using the local best-fit heuristic, we introduce the LBF and SLBF algorithms for HLM.At each point of a randomly chosen subset of the data, they use the best-fit flats of the "optimal" neighborhoods to build a global model with different methods (LBF is based on energy minimization and SLBF is a spectral method).• We perform extensive experiments on motion segmentation data (the Hopkins 155 benchmark of [35]), face clustering (the extended Yale face database B), handwritten digits (the MNIST database), and artificial data, showing that both algorithms, in particular SLBF, are accurate on real and synthetic HLM problems, while LBF runs extremely fast (often on the order of ten times faster than most of the previously mentioned methods).For the cropped face data we actually indicate a fundamental problem of local methods like LBF and SLBF, though suggest a workaround that works for this particular data.• We demonstrate how the local best-fit heuristic can be used with other algorithms.In particular, we give experimental evidence to show that the K-flats algorithm [16] is improved by initialization that is based on the local best-fit heuristic.We also use this heuristic to estimate the main parameters of both RANSAC (for HLM) [42] and ALC [27].• We show how the combination of LBF and the elbow method can quickly determine the number of subspaces.
The rest of this paper is organized as follows.In Section 2 we describe the LBF and SLBF algorithms and state a theorem giving conditions that guarantee that good neighborhoods can be found.Section 3 carefully tests the LBF and SLBF algorithms (while comparing them to other common HLM algorithms) on both artificial data of synthetic hybrid linear models and real data of motion segmentation in video sequences, face clustering and handwritten digits recognition.It also demonstrates how to determine the number of clusters by applying the fast algorithm of this paper together with the straightforward elbow method.Section 4 concludes with a brief discussion and mentions possibilities for future work.

The local best-fit flats heuristic and the LBF and SLBF algorithms
We describe two methods, LBF and SLBF, which have at their heart an estimation of local flats capturing the global structures of the data (or part of it).Both methods first find a set of candidate flats (the number is an input parameter for LBF).These are best-fit flats for local "optimal" neighborhoods (we describe an algorithm for approximately finding such neighborhoods and justify it in Section 2.1).The two algorithms process the candidates in different ways: LBF uses energy minimization and SLBF uses a spectral approach.

Choosing the optimal neighborhood
We choose the candidate flats that capture the global structure of the data by fitting them to 'optimal' local neighborhoods of data points.For a point x ∈ R D , we define an optimal neighborhood as the largest ball B(x, r) (centered at x and with radius r) that only contains points sampled from the same cluster as x.Indeed, neighborhoods smaller than the optimal one (around x) can mainly contain the noise around an underlying subspace (of the hybrid linear model); consequently their local best-fit flats may not match the underlying flat.On the other hand larger neighborhood than the optimal one (around x) will contain points from more than one underlying flat, and the resulting best-fit flat will again not match any of the underlying flats.We note that the choice of neighborhood B(x, r) is equivalent to the choice of radius r, which we refer to as scale (even though it is also common to refer to log(r) or -log(r) as scale).While it is possible to take a guess at the optimal scale as a parameter (e.g., as commonly done by fixing the number of nearest neighbors to x), we have found that it is possible to choose the optimal scale reasonably well automatically, while adapting it to the given point x.
We will start at the smallest scale (i.e., smallest radius containing only d + 1 points) and look at larger and larger neighborhoods of a given point x 0 .At the smallest scale, any noise may cause the local neighborhood to have higher dimension than d.As we add points to the neighborhood, it becomes better and better approximated in a scale-invariant sense (e.g., by scaling the neighborhood to have radius 1 and computing the error of approximation by best-fit flat then) until points belonging to other flats enter the neighborhood.
To be more precise, we define the scale-invariant error for a neighborhood N ≡ B(x 0 , r) of x 0 by the formula: where |N | denotes the number of points in N , P L denotes the projection onto the flat L and the minimization is over all d-flats in R D .We note that the numerator is the approximation error by best-fit ℓ 2 flat at scale r and the denominator is the scale r.The notion of scale-invariant error was introduced and utilized in [17,10,22].Using this scale-invariant error we can reformulate our criterion for choosing the optimal neighborhood more precisely.That is, we start with the smallest neighborhood containing S nearest neighbors of x, increase the number of nearest neighbors by T in each iteration and check the β 2 number of each neighborhood using (1).We estimate the optimal neighborhood as the last one for which β 2 is smaller than β 2 of the previous neighborhood (that is, we search for the first local minimizer of β 2 (N )).This procedure is summarized in Algorithm 1.It is experimentally robust to outliers if x is an inlier, since the nearest neighbors of inliers also tend to be inliers.

Algorithm 1 Neighborhood size selection for HLM by randomized local best-fit flats
X, S: start size, T : step size Output: N (x): a neighborhood of x. Steps:

Theoretical justification
The following theorem tries to justify our strategy of estimating the optimal scale around each point by showing that in the continuous setting the first local minimizer of β 2 (x, r) := β 2 (B(x, r)) is approximately the distance from x to the nearest cluster that does not contain x (here the underlying model is a mixture of Lebesgue measures in strips around several subspaces and x is an arbitrary point on one of these subspaces).Therefore, if we choose the size of neighborhood following Algorithm 1 (adapted to the continuous setting), then we will approximately obtain the optimal neighborhood.It is rather standard to extend such estimates for measures to a probabilistic setting, where i.i.d.data is sampled from the continuous distribution.The theorem will then hold with high probability for sufficiently large sample size (due to technicalities, which also require truncating the support of our continuous measure we avoid these details).The proof of this theorem is in the Appendix.
Our theorem uses the following analog of the discrete β 2 of (1) for a measure µ and a ball B(x, r) (see also [22]): where throughout the paper dist(•, •) denotes the Euclidean distance, for example in this case dist(x, L) := x − P L x .The theorem also assumes that the underlying measure is supported on a union of then there exists r 0 < r * < 1.09 r 0 such that That is, the first local minimum of β 2 (x * , r) (as a function of r) occurs in (r 0 , 1.09 r 0 ).
The proof of this theorem indicates a weaker condition than (4), which is less intuitive.It also shows that r * → r 0 as w/r 0 → 0 and clarifies by example why the first local minimum of β 2 (x * , r 0 ) is often bigger than r 0 (see Remark 1).

The complexity of Algorithm 1
Algorithm 1 requires sorting the neighbors of x according to their distance to x; the computational cost of this preprocessing step is O(D We thus note that if T is in the order of N , e.g., T = max (N/300, 2), the total complexity of Algorithm 1 is O((d • D + log N ) • N ).Note that if we limit the number of scales that we search, then the two log terms above can be replaced by a constant.

The LBF algorithm
The LBF algorithm searches for a good set of flats from the candidates (described above) in a greedy fashion.A measure of goodness of a K tuple of flats G is chosen; here, it will be the average l 1 distance of each point to its nearest flat, i.e., After randomly initializing K flats from the list of candidates, p passes are made through the data points.In each of the passes, we replace a random current flat with the candidate that minimizers the value of G.We then move to the next pass, picking a random flat, etc. Algorithm 2 sketches this procedure (where the greedy minimization of G is described in step 5).
The simplest choice of G is the sum of the squared distances of each point in X to its nearest flat, i.e., having the power 2 in (6).However, in some scenarios the l 1 energy of ( 6) is more robust to outliers than the mean squared error (see [23,24] for theoretical support and [44] for experimental support).This method also allows using energy functions, which are hard to minimize (even heuristically).Indeed, it only requires evaluating the energy on the candidate configurations.For example, when the data set requires stronger robustness to outliers, one may use the following energy: The LBF algorithm is closely related to RANSAC, since both of them use candidate subspaces to fit the data set.However Algorithm 1 gives LBF an advantage in choosing good candidates, while RANSAC fits a d-flat by arbitrarily chosen d + 1 points.

Algorithm 2 LBF: energy minimization over randomized local best-fit flats
Step 5 of Algorithm 2 requires the evaluation of the N × C matrix representing the distances To compute the storage requirements of LBF, we note that the data set is saved in an N × D matrix, the candidate subspaces are organized in C projection matrices of size D × d and in addition the algorithm stores an N × C matrix of distances between the data points and the C candidate subspaces.Therefore, the storage of LBF is in the order of O(D

The SLBF algorithm
The SLBF algorithm (which is sketched in Algorithm 3) processes the candidate subspaces via a spectral clustering method.It first finds the neighborhoods {N i } N i=1 for all points {x i } N i=1 via Algorithm 1 and fits d-flats {L i } N i=1 (via PCA) in these neighborhoods.It then forms the N × N matrices S and Ŝ as follows: and where (we explain the choice of λ below, when we clarify ( 9)).Finally, it applies spectral clustering with the matrix Ŝ.More precisely, SLBF follows the main algorithm of [29, Section 2], replacing the matrix A there by Ŝ, multiplying the unit eigenvectors of Step 3 (of [29, Section 2]) by the corresponding square roots of eigenvalues and skipping step 4. We remark that the two last changes to [29, Section 2] are commonly used so that the similarity matrix Ŝ can be considered as a Gram matrix, see e.g., Euclidean MDS [9] and ISOMAP [33].As discussed in [37], SLBF is a "spectral clusteringbased method", similar to SCC, LSA and SSC.These algorithms construct an N × N affinity matrix, whose ij-th entry represents the similarity between points i and j, and then apply spectral clustering using this affinity matrix.Ideally, the affinities of points from the same cluster are of order 1 and the affinities of points from different clusters are of order 0. Indeed, for the affinity Ŝ of SLBF, if x i and x j are in the same cluster, then we expect that x i is close to L j and x j is close to L i , which means S i,j is close to 0 and thus Ŝi,j is close to 1 (we assume here that L i and L j are good estimators for the underlying subspace of the cluster shared by x i and x j as suggested by Theorem 1).Otherwise, if x i and x j are not in the same cluster, then we expect that x i is sufficiently far from L j and x j is sufficiently far from L i , which implies that Ŝi,j is close to 0. The choice of σ j clearly affects this heuristic argument on the size of Ŝi,j .Theoretically σ j should be larger than the noise, such that Ŝi,j is close to 1 when x i and x j are in the same cluster,

Algorithm 3 SLBF: spectral clustering based on best-fit flats
, each approximated by a single flat.

Steps:
1.For each point x i , fit a subspace L i by Algorithm 1 2. Construct the N × N matrix S and Ŝ by ( 7), ( 8) and ( 9) 5. Let U be the N × K matrix whose columns are the top K eigenvectors of Ŝ, and Σ be the K ×K matrix representing the top K eigenvalues of Ŝ 6. Apply K-means to the rows of N × K matrix U Σ 1/2 and partition X accordingly 7. Repeat steps 2-6 with the default values of λ (see input) to obtain several segmentations and choose the segmentation minimizing the error: but σ j cannot be too large so that Ŝi,j is close to 1 when x i and x j are not in the same cluster.Therefore we use (9), where min d-flats L x∈Nj x − P L x 2 /|N j | is the estimated noise of the data set around the point x j and λ is a parameter.Following the strategy in [7], we choose different values of λ (our fixed default values are and consequently obtain several segmentation results (7 results when using our default values).We then choose the segmentation with the smallest error in (10).
We remark that SLBF is robust to outliers.Indeed, the original data points are embedded as the rows of UΣ 1 2 (see step 6 of Algorithm 3) and have magnitude smaller than 1.Therefore the subsequent application of K-means does not suffer from points of arbitrarily large values.To verify that the magnitude of the embedded points is smaller than 1, we note that the diagonal elements in Ŝ are smaller than 1.Since UΣU T ≤ Ŝ the diagonal elements of UΣU T are also smaller than 1.Therefore, the norm of the rows of UΣ are also smaller than 1.
Similar to SLBF, LSA [40] is also based on fitting local subspaces.However, LSA fits subspace by local neighborhoods of fixed number of points and is not adaptive.Moreover, the local subspaces of LSA are forced to be linear (since the affinity of LSA is based on principal angles be-tween such subspaces) and this further restricts the applicability of LSA.There is also some similarity between the idea of SLBF and that of SCC [5,7].Indeed, we may view SCC as fitting candidate subspaces based on d+1 data points (the iterative procedure tries to enforce the points to be from the same cluster).However, in practice they operate very differently, in particular, SCC is not based just on local information (though a local version of SCC follows from [1]).The SSC algorithm is also a spectral method, but similar to SCC its affinities are global (they are based on sparse representation of data points).

Complexity and storage of the SLBF algorithm
Step 1 of Algorithm 3 has a complexity of order O((d , since it applies Algorithm 1 to every point in the set X.The most expensive calculation of steps 2-4 in Algorithm 3 is the construction of S, which requires a complexity of order O(d • D • N 2 ).The eigenvalue decomposition in step 5 has a complexity of order O(K • N 2 ) and the K-means algorithm in step 6 has a complexity of order Combining these complexities together, we have an overall complexity of order O((d for SLBF.As before, limiting to a constant number of scales replaces the log term with a constant.
We note that SLBF stores the data set in a D ×N matrix, the candidate subspaces in N D × d matrices (recall that in SLBF every data point is assigned a subspace and thus C = N ) and it also uses the N × N matrix S. Therefore, the storage of SLBF is in the order of O(N

Adaptation of the proposed algorithms to motion segmentation data
Note that the first minimum in the Theorem 1 excludes the left endpoint, and thus k = 0 is excluded in Algorithm 1).In our experiments, we noticed that on data without too much noise, it is useful to allow the first scale to count as a local minimum and allow k = 0 in Algorithm 1).We refer to the implementation of LBF and SLBF with those two techniques tailored for motion segmentation data as LBF-MS and SLBF-MS.

Experimental results
In this section, we conduct experiments on artificial and real data sets to verify the effectiveness of the proposed algorithm in comparison to other HLM algorithms.We will see that in many situations, the methods we propose are fast and accurate; however, in Section 3.3 we will show a failure mode of our method, and discuss how this can be corrected.
Table 1 Mean percentage of misclassified points in artificial data for linear-subspace cases or affine-subspace case.(4,5,6) (1, 5) Linear Linear We measure the accuracy of those algorithms by the rate of misclassified points with outliers excluded, that is error% = # of misclassified inliers # of total inliers × 100% .
In all the experiments below, the number C in Algorithm 2 is 70 • K, where K is the number of subspaces, the number p in Algorithm 2 is 5 • K, and the numbers S and T in Algorithm 1 are 2 • d and 2 respectively, where d is the dimension of the subspace.According to our experience, LBF and SLBF are very robust to changes in parameters, but unsurprisingly, there is a general trade off between accuracy (higher C, higher p, smaller T ), and run time (lower C, lower p, larger T ).We have chosen these parameters for a balance between run time and accuracy.Nevertheless, we have insisted to use the same parameters for all data sets and experiments, even though particular parameters could obtain even better or near perfect results for some of the data sets.The experiments in Sections 3.1 and 3.2 run on a computer with Intel Core 2 CPU at 2.66GHz and 2 GB memory, and experiments in Sections 3.3 and 3.4 run on a machine with Intel Core 2 Quad Q6600 at 2.4GHz and 8 GB memory.
For the SCC algorithm, we also try a slightly modified version tailored for motion segmentation as in step 6 of Algorithm 3, which we refer to as SCC-MS (SCC for motion segmentation): Following the notation of [Algorithm 2] [7] we let the matrix U be the N × K matrix whose columns are the top K left singular vectors of A * C and also denote by Σ the diagonal K × K matrix whose elements are the top K left singular values of A * C .Then the K-means step of SCC-MS is applied directly to the rows of the N × K matrix U Σ 1/2 (as opposed to applying it to U (or its row-wise normalization by 1) in SCC).
The MoPPCA algorithm is always initialized with a random guess of the membership of the data points.The LSCC algorithm is initialized by randomly picking 100×K (d+1)tuples (following [7]) and KF is initialized with a random guess.Since algorithms like KF tend to converge to local minimum, we use 10 restarts for MoPPCA, 30 restarts for KF, and recorded the misclassification rate of the one with the smallest ℓ 2 error for both of these algorithms.The number of restarts was restricted by the running time and accuracy.In SSC algorithm, we set the value λ to be 0.01, as suggested in the code.
The RANSAC for HLM and ALC algorithms [42,27] depend on a user supplied inlier threshold.RANSAC (oracle) and ALC (oracle) use the oracle inlier bound given by the true noise variance of the model and thus clearly have an advantage over the other algorithms listed.RANSAC (ǫ from LBF) and ALC (ǫ from LBF) estimate the inlier threshold by the local best-fit flats heuristic of this paper.That is, they fit best-fit neighborhoods for all N points using the latter heuristic and estimate the least error of approximation by d-flats in these N neighborhoods.The inlier bound ǫ is then the average of these errors.If the number of clusters resulting from ALC (ǫ from LBF or oracle) is larger than K, then we choose the K largest clusters and identify the points in the rest of clusters as outliers.For some cases the RANSAC algorithm breaks down and we then report it as N/A.The reason for this is that RANSAC (for HLM) [42] is very sensitive to the estimate of ε and an overestimate can result in removal of points belonging to more than one subspace, so that the algorithm may exhaust all points before detecting K subspaces.
We remark that GPCA cannot naturally deal with outliers, therefore we use robust GPCA with Multivariate Trimming [41] and the parameters 'angleTolerance' and 'boundarythreshold' are set to be 0.3 and 0.4 respectively.
The artificial data represents various instances of K linear subspaces in R D .If their dimensions are fixed and equal d, we follow [7] and refer to the setting as d K ∈ R D .If they are mixed, then we follow [28] and refer to the setting as (d 1 , . . ., d K ) ∈ R D .Fixing K and d (or d 1 , . . ., d K ), we randomly generate 100 different instances of corresponding hybrid linear models according to the code in http://perception.csl.uiuc.edu/gMore precisely, for each of the 100 experiments, K linear subspaces of the corresponding dimensions in R D are randomly generated.The random variables sampled within each subspace are sums of two other variables.One of them is sampled from a uniform distribution in a d-dimensional ball of radius 1 of that subspace (centered at the origin for the case of linear subspaces).The other is sampled from a D-dimensional multivariate normal distribution with mean 0 and covariance matrix 0.05 2 • I D×D .Then, for each subspace 250 samples are generated according to the distribution just described.Next, the data is further corrupted with 5% or 30% uniformly distributed outliers in a cube of side-  length determined by the maximal distance of the former 250 samples to the origin (using the same code).Since most algorithms (SCC, LSA, MoPPCA, LBF, SLBF, RANSAC, SSC) do not support mixed dimensions natively, we assume each subspace has the maximum dimension in the experiment.GPCA and ALC support mixed dimensions natively, and GPCA is the only algorithm for which we specify the dimensions for each subspace in mixed-dimension case (note that the knowledge of dimensions are unnecessary in ALC algorithm).
The mean (over 100 instances) misclassification rates and the mean running time of the various algorithms are recorded in Table 1.From Table 1 we can see that our algorithms, i.e., LBF, LBF-MS, SLBF, SLBF-MS, perform well in various artificial instances of hybrid linear modeling (with both linear subspaces and affine subspaces), and their advantage is especially obvious with many outliers and affine subspaces.The robustness to outliers is a result of our use of both ℓ 1 loss function (see e.g., [23,24]) and random sampling.The SLBF and SLBF-MS are better at the affine cases because of their use of spectral clustering.Also unlike many other methods, the proposed methods natively support affine subspace models (the particular data has non-intersecting subspaces, which makes advantageous to some other algorithms, e.g., SSC).The results of RANSAC (ǫ from LBF) and ALC (ǫ from LBF) show that the local best-fit heuristic can be effectively used to estimate the main parameter of RANSAC and ALC, i.e., to estimate the local noise.Table  1 The misclassification rate of some algorithms for the Hopkins 155 database.The y-axis represent the percentage of data sets that have misclassification rates (under corresponding algorithms) lower than that of x-axis.
1 also shows that the running time of LBF/LBF-MS is less than the running time of most other algorithms, especially GPCA, LSA, RANSAC, ALC and SSC.The difference is large enough that we can also use the proposed algorithm as an initialization for the others.LBF and LBF-MS algorithms are slower than a single run of K-flats, but it usually takes many restarts of K-flats to get a decent result.Notice that the choice of C and p in our algorithm function in a similar manner to the number of restarts in KF.SLBF and SLBF-MS cost more time when N is large, because of the construction of the N × N matrix in spectral clustering, but it still has a comparable speed to LSA and is faster than SSC, which are two spectral-clustering based algorithms.

Clustering results on motion segmentation data
We test the proposed algorithms on the Hopkins 155 database of motion segmentation, which is available at http://www.vision.jhu.edu/data/hopkins155.This data set contains 155 video sequences along with the coordinates of certain features extracted and tracked for each sequence in all its frames.The main task is to cluster the feature vectors (across all frames) according to the different moving objects and background in each video.It consists of three types of videos: checker, traffic and articulated (see Figure 2 for demonstration of frames of such videos).
More formally, for a given video sequence, we denote the number of frames by F .In each sequence, we have either one or two independently moving objects, and the background can also move due to the motion of the camera.We let K be the number of moving objects plus the background, so that K is 2 or 3 (and distinguish accordingly between two-motions and three-motions).For each sequence, there are also N feature points y 1 , y 2 , • • • , y N ∈ R 3 that are detected on the objects and the background.Let z ij ∈ R 2 be the coordinates of the feature point y j in the i th image frame for every 1 ≤ i ≤ F and 1 ≤ j ≤ N .Then It has been shown [8] that under the affine camera model, the trajectory vectors corresponding to different moving objects and the background across the F image frames live in distinct affine subspaces of dimension at most three in R 2F .Following this theory, we implement our algorithm with d = 3.
Our misclassification rates for SCC are different than [6] and [20] and our misclassification rates for SSC are different than [11] (the difference between our and their results differ more than twice the standard deviations of misclassification rates obtained here).This can be explained by possible evolutions of the codes since then (at least for SSC).We remark though that the misclassification rates of SCC-MS here are even slightly better than the misclassification rates of SCC in [6].
From Table 2 and Figure 1 we can see that our algorithms work well for the Hopkins database.Of all the methods tested, SLBF-MS and SSC-N are the most accurate algorithms.Besides SLBF/SLBF-MS and SSC-N, only SCC-MS is better than LBF-MS.However, From Table 4, LBF-MS ran more than 100 times faster than SSC-N and SLBF-MS is also more than 10 times faster than SSC.In many of the cases, the ℓ 1 energy (as well as the ℓ 2 energy) was lower for the labels obtained by LBF than the true labels.We thus suspect that the reason SLBF/SLBF-MS works better than LBF/LBF-MS is that good clustering of the Hopkins data requires additional type of information (e.g., spectral information) to be combined with subspace clustering (i.e., hybrid linear modeling).
By adapting the parameters of SLBF-MS (or alternatively, SLBF, LBF, LBF-MS), we can further improve its misclassification rates on Hopkins 155 (e.g., total 0.81% for two-motions and 2.12% for three-motions by SLBF-MS).However, we have fixed in advance all parameters and insisted using the same parameters on all four kinds of data (see the third paragraph of Section 3).
From Table 3 we can see that SLBF-MS, SLBF and SSC-N have negligible randomness.Indeed, their only randomness come from the K-means step, but this randomness is effectively reduced because of the restarting strategy.LBF and LBF-MS are more random, but still have comparable standard deviations with other good algorithms on Hopkins 155 database such as SCC/SCC-MS.

Clustering results on the extended Yale face database B
We test LBF, LBF-MS, SLBF and SLBF-MS and compare them with ALC, K-flats, and SSC on the extended Yale face database B [21], which is available on http://vision.ucsd.edu/leekc/ExtYaleDatabase/ExtYaleB.html.We will see that this data set shows a failure mode of our algorithms; and we will show how we can engineer a workaround.
We use subsets of the extended Yale face database B consisting of face images of K = 2, 3, • • • , 10 persons under 64 varying lighting conditions.Our objective is to cluster these images according to the persons.In implementation, for any fixed K we repeat each algorithm on 100 randomly chosen subsets of K persons.The HLM model is applicable to this database, because the images of a face under variable lighting lies in a three-dimensional linear subspace if shadow is not considered [21], and a nine-dimensional subspace with shadow considered [2].In our experiments, we found that the images of a person in this database lie roughly in a 5-dimensional subspace, and therefore we first reduce the dimension of the data to 5K (recall that K is the number of persons).We do not include the GPCA algorithm since it is slow and does not work well on this database.We also do not include SCC and RANSAC since the code provided returns errors for some examples.The setting of ALC (voting with K) follows [30, sec.(2.2)] exactly: it chooses ε from 101 values in the range 10 −5 − 10 3 (see the code in http://perception.csl.uiuc.edu/coding/motion/#Software).
We can see from the first row of Table 5 that LBF does a poor job discriminating the linear clusters in this data set.The failure occurs because of a combination of two factors: the first is the relatively sparse sampling of the data, with only 64 points per 5-dimensional cluster, and the second, the relative nearness of the underlying subspaces to each other.In particular, almost any neighborhood of any given point (even very small neighborhood) has points from the other affine clusters and consequently there is no "optimal" scale.For example, in the 128 face images from persons 1 and 2, more than a fifth of the points are closer to the subspace spanned by the first 5 principal components of the points in the other cluster than to their second nearest neighbors, and more than two thirds of the points are closer to the other subspace than to their 4th nearest neighbors.In some sense, this is a single 5-dimensional set, rather than two 5-dimensional sets.For example, the average distance of a point to the 5-dimensional best fit subspace by the points in the same cluster is 2.7 × 10 3 , and the average distance to the 5-dimensional best fit subspace of the whole data set is 3.3×10 3 , whereas the average norm of a point in the data set is 1.1 × 10 4 .Thus one loses little in terms of relative fitting error by considering the set as spanned by a single subspace.
However, most points are actually closer to the subspace spanned by the same face than to the subspace spanned by the other face, if only by a little, and a global method such as SSC is still able to find and discriminate between the two affine clusters.The problem of data having large variance in directions irrelevant to a classification task is not unusual.A standard method of dealing with this situation is to "whiten" the data; i.e. reduce the value of the large singular values.A very crude whitening is obtained by simply removing the first few principal components.If we exclude the first two principal components after reducing the dimension to 5K for LBF/SLBF algorithms, we see in Table 5 that the results are greatly improved and become competitve 3 .With more sophisticated whitening, the results can be further improved.

Clustering results on MNIST data set
Finally, we work on the MNIST data set (available at http://ya nn.lecun.com/exdb/mnist/).This data set consists of several thousand 28 × 28 images of the digits 0 through 9. We work on some subsets of the data which contain 2 or 3 digits and choose 200 images for each digit at random.We apply PCA to reduce the dimension to D = 5 for GPCA and to both D = 10 and D = 50 for the rest of algorithms.The choice of both D = 10 and D = 50 provide richer testing opportunities, this is however unavailable for GPCA, which cannot handle D = 50 and often get stuck with D = 10.We process the data the same way as in Section 3.3.We run each experiment 500 times, using d = 3 and the correct number of clusters, and record the misclassification rates, the standard deviation and running time in Tables 7, 9, 8 and 10.
From Table 7 and 8, SLBF and SLBF-MS are the best algorithms among all the methods in terms of misclassification rates, although these misclassification rates are larger when K = 3. SCC, SCC-MS, SSC, LBF and LBF-MS are also good algorithms for this data set.ALC is almost as good as SLBF and SLBF-MS when K = 2, but it fails when K = 3. LBF, LBF-MS and K-flats are the fastest algorithms in MNIST data set.

Automatic determination of the number of flats
We explain how to use the elbow method to determine the number of affine clusters in any HLM algorithm, in particular LBF and SLBF.Fixing an arbitrary HLM algorithm with the correct input of number of clusters K, let F j , j = 3 Removing principal components harms the performance of the other algorithms.
1, . . ., K be the K flats returned by that algorithm and W K be the sum of squared distances of all data points to the flat, among these K flats, corresponding to their clusters.That is, We note that W K decreases as K increases.
A classical method for determining the number of clusters is to find the "elbow", or the K past which adding more clusters does not significantly decrease the error.We search for the elbow by finding the maximum of the Second Order Difference (SOD) of the logarithm of W K [43]: The optimal K is thus found by where K = 2, . . ., K max .

Finding the number of clusters on artificial data
We test SOD with LBF and SLBF on artificial data (where the number of clusters is not provided to the user) and compare them with some other methods (three variations of ALC, number of clusters by GPCA and SOD with SSC and SCC).The artificial data sets are generated by the Matlab code borrowed from the GPCA [38] package on http://perception.csl.uiuc.edu/gpca.For each subspace 100d initial data points are uniformly sampled in a unit cube in this subspace centered around the origin and then corrupted with Gaussian noise in R D of standard deviation 0.05.For the last four experiments, we restrict the angle between subspaces to be at least π/8 for separation.The dimension d is given and we let K max = 10 in SOD.
In ALC (voting), we try 101 different values from 10 −5 to 10 3 for ǫ (as in [30]) and choose the estimated K by majority.In ALC (ǫ from LBF), we choose the average noise in the neighborhood using the local best-fit heuristic as the distortion rate ǫ.In ALC (oracle), we input the true noise level (ǫ = 0.05) as the distortion rate.For GPCA, we use the original idea of [38, eqs. (26), (28)] to find the number of clusters (see our implementation in the supplemental webpage).We project the data onto a d + 1-dimensional subspace by PCA and let the tolerance of rank detection be   Table 9 The standard deviation to the mean percentage of misclassified points on clustering MNIST data set (D=5 for GPCA, D=10 for other algorithms).   .This algorithm is independent of other parts of the GPCA algorithm and is thus extremely fast and can perform in high ambient dimensions.We even tried other ideas of [28, eqs.(3.28), (3.29)] (for the same given dimension d), while applying them to several HLM algorithms (even though they were originally presented for GPCA).Nevertheless, they did not work well and we thus did not report them.Each experiment is repeated 100 times (except for SOD(SSC), which is repeated 10 times due to its low speed) and the error rates of finding the number of clusters K and the computation time (in seconds) are recorded in Table 11.
As in Table 11, ALC (oracle) and ALC (ǫ from LBF) work the best for low dimensions (d = 1, 2, 3), but in real problems this choice (the noise level) for ǫ is usually unknown.The local best-fit flat heuristic provides a good estimation for the distortion rate and helps ALC reduce its running time.ALC (voting) is not as good as SOD (LBF) for artificial data.All options of ALC suffer from the computa-Table 11 The mean percentage of incorrectness (e%) for finding the number of clusters K and the computation time in seconds t(s) on artificial data.

Finding the number of clusters on the extended Yale face database B
We use the extended Yale face database B as in Section 3.3 for testing the above algorithms for detecting the number of clusters.The ambient dimension is reduced to D = 5K by PCA for all of the methods and the intrinsic dimension of subspaces is set as d = 5 (see Section 3.3).For SOD with different clustering algorithms, we let K max = 6, 8, 8, 10 and 10 respectively for 2 to 6 clusters.For GPCA, we let tolerance be 0.05 which does not affect the performance in this experiment.Each experiment is repeated 500 times (except for SOD(SSC), which is repeated 10 times due to its low speed).Following Section 3.3, we apply LBF, LBF-MS, SLBF and SLBF-MS with whitening.The error rates of finding the correct number of clusters and the computation time are recorded in Table 12.
We see from Table 12 that SOD only performs well with SSC with K smaller than 4. We note that this is due to the difficulty of the database.Indeed for a simpler database such as Yale Face database B [14] of uncropped faces, SOD (SLBF), SOD (SLBF-MS), ALC (ǫ from LBF) and ALC (voting) have perfect detection for K ≤ 10 (whitening is not applied then).

Finding the number of clusters on MNIST data set
We preprocess MNIST data set exactly the same way as we did in Section 3.4.The ambient dimension is reduced to both D = 10 and D = 50 by PCA for all of the methods including GPCA and 3 is given as the intrinsic subspace dimension.For SOD with different clustering algorithms, we let K max = 6, and 8 respectively for 2 and 3 clusters.For GPCA, we let the tolerance be 0.05 which does not affect the performance in this experiment.Each experiment is repeated 500 times (except for SOD(SSC), which is repeated 10 times due to its low speed).The error rates of finding the correct number of clusters and the computation time are recorded in Tables 13 and 14.
For all the methods, determining the number K of clusters becomes very difficult when the real K is larger than 3.For real K ≤ 3, we see from Table 13 that when we project data to 10-dimensional space, ALC and GPCA fail in most cases, except for ALC (ǫ from LBF) on digits [3 6 8].SOD (SLBF), SOD (SLBF-MS) and SOD (SSC) outperform all others although they are not very efficient.

Initializing K-flats with the local best-fit heuristic
Here we demonstrate that our choice of neighborhoods in Algorithm 1 can be used to get a more robust initialization of K-flats.We work with geometric farthest insertion.For fixed neighborhood sizes, say of m neighbors, this goes as follows: we pick a random point x 0 and then find the best-fit flat F 0 for the m point neighborhood of x 0 .Then we find the point x 1 in our data farthest from F 0 , find the best-fit flat F 1 of the m neighborhood of x 1 , and then choose the point x 2 farthest from F 0 and F 1 to continue.We stop when we have K flats; we use these as an initialization for K-flats.
We work on three data sets.Data set #1 consists of 1500 points on three parallel 2-planes in R 3 .500 points are drawn from the unit square in x, y plane, and then 500 more from    the x, y, z +.2 plane, and then 500 more from the x, y, z +.4 plane.This data set is designed to favor the use of small neighborhoods.The next data set is three random flats with 15% Gaussian noise and 5% outliers, generated using the Matlab code from GPCA, as in Section 3.1.This data set is designed to favor large neighborhood choices.Finally, we work on a data set with 1500 points sampled from 3 planes in R 2 as in Figure 3.The error rates of K-flats with farthest insertion initialization with fixed neighborhoods of size 10, 20, ..., 160 are plotted against the error rates for farthest insertion with adapted neighborhoods (searched over the same range), averaged over 400 runs in Figure 4.Although our method did not always beat the best fixed neighborhood, it was quite close; and it always significantly better than the wrong fixed neighborhood size.Both methods did significantly better than a random initialization.
In Figure 3 we plot the number of neighbors picked by our algorithm for each point of a realization of data set #3.

Conclusions and future work
We presented a very simple geometric method for HLM based on selecting a set of local best-fit flats.The size of the local neighborhoods is determined automatically using the ℓ 2 β numbers; it is proven under certain geometric conditions that our method approximately finds the optimal local neighborhoods.We give extensive experimental evidence demonstrating the state of the art accuracy and speed of the algorithm on synthetic and real hybrid linear data.
We believe that one promising next step is to adapt the method for multi-manifold clustering.As it is, our method, while quite good at unions of flats, cannot successfully handle unions of curved manifolds.We expect that by gluing together groups of local best-fit flats related by some smooth-ness conditions, we will be able to approach the problem of clustering data which lies on unions of smooth manifolds.
We also believe that it will be possible to provide a theoretical framework for performance guarantees with noise for LBF and SLBF.Specifically, we hope to prove a quantitative form of the following alternative: suppose the data lies on the union of d-dimensional affine sets, perhaps with additive noise and outliers.Then either 1.Most points are roughly as close to an affine set they don't belong to as they are to their nearest O(d) neighbors; 2. A large fraction of the points have optimal neighborhoods contained in only one of the affine clusters, the principal components of these neighborhoods are good approximations to the clusters; and LBF and SLBF recover good approximations to the two affine clusters, or 3.The data looks locally lower than d-dimensional, even though each cluster is globally d-dimensional, and has high curvature; in this case, there are pure optimal neighborhoods, but the local estimation does not accurately represent the affine clusters.

A Proof of Theorem 1
Assume without loss of generality that i * = 1.Note that when r ≤ r 0 , B(x * , r) ∩ T (L 1 , w) = B(x * , r) ∩ supp(µ) and that L 1 is the minimizer of the RHS of (2).Combining these observations with (2) and the fact that β 2 (x * , r) is invariant to scaling of r and w, we immediately obtain that for r < r 0 : In particular, β 2 (x * , r) is constant for all 0 ≤ r ≤ w.
Next, we will prove (5) with a weaker requirement on r * .More precisely, we define r * = max(r * 1 , r * 2 ), where and Fig. 4 Using our neighborhood choice to improve initialization of k-flats: the first row is the visualization of three data sets, and the seconds row shows the corresponding figures such that the vertical axis is accuracy, and the horizontal axis is fixed neighborhood size in geometric farthest insertion for initialization of K flats.The red line is the result of using adapted neighborhoods.The data sets are #1,#2, and #3 as described in Section 3.6.Random initialization leads to misclassification rates of .4 or greater for all three data sets.
We further assume that w < r 0 and  We will later show that (4) implies (19) and we will also verify that r 0 ≤ r * < 1.09 r 0 .
Without loss of generality we assume that argmin i>1 dist(x * , L i ) = 2, and let x 0 be the center of mass of µ 1 + µ 2 in B(0, r).Then for any r > r 0 ( We claim that when r = r * , the minimizer in the second expression in the RHS of (20) (denoted by L 0 ) satisfies that dim(L 0 ∩ L 2 ) = d − 1 and (L 0 ∩ L ⊥ 2 ) ⊥ L 2 .We denote the orthonormal vector passes through x * and x 0 by u 1 , one of the d orthonormal vectors that span L 2 by u 2 , and one of the D − d − 1 orthonormal vectors that span (span(L 2 )) ⊥ by u 2 .We will prove that u 1 is the top eigenvector of We note that Denote the center of mass of B(x * , r * ) ∩ T (L 2 , w) by x 1 , notice that x 0 − x * < r 0 + w, the center of mass of B(x * , r * ) ∩ T (L 1 , w) is x * , and x * , x 0 and x 1 satisfies Combining ( 22), ( 23), ( 24) and ( 25) we have the estimation Therefore for any point x 1 in B(x * , r * ) ∩ T (L 2 , w), using ( 19) and (26) we have ( Since any points in B(x * , r * ) ∩ T (L 2 , w) has a distance to x 0 smaller than r * , we have Combining ( 27) and ( 28), the first inequality in ( 21) is proved.By direct integration one obtains that the average of (u T 2 x 1 ) 2 for We also have ( Using the fact that r * ≥ r * 2 , we have (31) Combining ( 29), ( 30) and ( 31), the second inequality in ( 21) is also proved.

),
If w < r 0 , then β 2 (x * , r) (as a function of r) is constant on [0, w] and decreases on [w, r 0 ].If also when d > 1, since each distance from a point to a subspace costs O(d • D).Moreover, the p passes have complexity of order O(p • (C − K) • N ).Therefore, step 5 of Algorithm 2 has a complexity of order O(C • N • (d • D + p)).At last, Step 7 of Algorithm 2 has a complexity of order O(K • d • D • N ), which comes from the construction of the N × K matrix of distances from N points to K subspaces.Combining these complexities together, we have an overall complexity of O(C • N • (d • D + p + log N )) for the LBF Algorithm; as before, if we fix the number of scales independently from N , the log terms can be replaced by a constant.
Fig.1The misclassification rate of some algorithms for the Hopkins 155 database.The y-axis represent the percentage of data sets that have misclassification rates (under corresponding algorithms) lower than that of x-axis.

Fig. 2
Fig. 2 Frames of the traffic, articulated and checker (from left to right) videos in Hopkins 155 database.
is the trajectory of the j th feature point across the F frames.The actual task of motion segmentation is to separate these trajectory vectors z 1 , z 2 , • • • , z N into K clusters representing the K underlying motions.
(chosen by trying different values and picking the one obtaining the lowest error)

Fig. 3
Fig.3 Color map of neighborhood size obtained by the local bestfit flat heuristic.The color value represents the number of neighbors chosen at that point.Note that the algorithm chooses smaller neighborhoods for points closer to the intersection of the planes.
• N + N • log N ).In order to obtain β 2 (N k ), we need to obtain the top d singular values of the |N k | × D data matrix representing the |N k | points, which requires a complexity of O(d • D • |N k |).To find N (x), we need to generate β 2 (N k ) for any |N k | = S + kT , where k = 1, 2, • • • , (N − S)/T , hence the complexity for obtaining N (x) is of order: data, λ: a parameter (or several parameters if we use step 7, with default values [2, 2e, 2e 2 , • • • , 2e 6 ]), other parameters used by Algorithm 1.Output

Table 2
The mean and median percentage of misclassified points for two-motions and three-motions in Hopkins 155 database.

Table 3
The standard deviation to the mean and median percentage of misclassified points for two-motions and three-motions in Hopkins 155 database.

Table 4
Average total computation times for all 155 sequences.

Table 5
Mean percentage of misclassified points and mean running time on clustering the extended Yale face database B.

Table 6
The standard deviation(%) to the mean percentage of misclassified points on the extended Yale face database B.

Table 7
Mean percentage of misclassified points and mean running time on clustering MNIST data set (D=5 for GPCA, D=10 for other algorithms).

Table 8
Mean percentage of misclassified points and mean running time on clustering MNIST data set (D=50).

Table 10
The standard deviation of the mean percentage of misclassified points on clustering MNIST data set (D=50).

Table 12
The mean percentage of incorrectness (e%) for finding the correct number of clusters K and the computation time in seconds t(s) on the extended Yale face database B.

Table 13
The mean percentage of incorrectness (e%) for finding the correct number of clusters K and the computation time in seconds t(s) on MNIST data set (D=10).

Table 14
The mean percentage of incorrectness (e%) for finding the correct number of clusters K and the computation time in seconds t(s) on MNIST data set (D=50).