1 Introduction

Several problems from computer vision, such as motion segmentation and face clustering, give rise to modeling data by multiple subspaces. This is referred to as Hybrid Linear Modeling (HLM) or alternatively as “subspace clustering”. In tracking-based motion segmentation, extracted feature points (tracked in all frames) are clustered according to the different moving objects. Under the affine camera model, the vectors of coordinates of feature points corresponding to a moving rigid object lie on an affine subspace of dimension at most 3 (see Costeira and Kanade, 1998). Thus clustering different moving objects is equivalent to clustering different affine subspaces. Similarly, in face clustering, it has been proved that the set of all images of a Lambertian object under a variety of lighting conditions forms a convex polyhedral cone in the image space, and this cone can be accurately approximated by a low-dimensional linear subspace (of dimension at most 9) (Epstein et al. 1995; Ho et al. 2003; Basri and Jacobs 2003). One may thus cluster certain images of faces by HLM algorithms.

The mathematical formulation of HLM assumes a data set \(\mathrm{X}=\{\mathbf{x}_{i}\}_{i=1}^{N} \subseteq \mathbb{R}^{D}\), where each \(\mathbf{x}_i\) lies on (or around) one of K flats (i.e., affine subspaces), and requires finding the partition of X corresponding to the flats. We would like to be able to do this when the data has been corrupted by additive noise and outliers; in this case we may also want to determine the flats themselves. We first assume here that all flats have the same known dimension d (i.e., they are d-flats) and that their number K is known. In Sect. 3 we address to some extent the cases of unknown K and mixed dimensions.

Several algorithms have been suggested for solving the HLM problem (or even the more general problem of clustering manifolds), for example the K-flats (KF) algorithm or any of its variants (Tipping and Bishop 1999; Bradley and Mangasarian 2000; Tseng 2000; Ho et al. 2003; Zhang et al. 2009), methods based on direct matrix factorization (Boult and Brown 1991; Costeira and Kanade 1998; Kanatani 2001; Kanatani 2002), Generalized Principal Component Analysis (GPCA) (Vidal et al. 2005), Local Subspace Affinity (LSA) (Yan and Pollefeys 2006), RANSAC (for HLM) (Yang et al. 2006), Locally Linear Manifold Clustering (LLMC) (Goh and Vidal 2007), Agglomerative Lossy Compression (ALC) (Ma et al. 2007), Spectral Curvature Clustering (SCC) (Chen and Lerman 2009) and Sparse Subspace Clustering (SSC) (Elhamifar and Vidal 2009). Some theoretical guarantees for particular HLM algorithms appear in Chen and Lerman (2009), Arias-Castro et al. (2011), Lerman and Zhang (2011) and Soltanolkotabi and Candès (2011). We recommend a recent review on HLM by Vidal (2011).

Many of the algorithms described above require an initial guess of the subspaces. For example, the K-flats algorithm is an iterative method that requires an initialization, and in SCC, one needs to carefully choose collections of d+1 data points that lie close to each of the underlying d-flats. Other algorithms require some information about the suspected deviations from the hybrid linear model; for example both RANSAC (for HLM) and ALC ask for a model parameter corresponding to the level of noise.

Here we propose a straightforward geometric method for the estimation of local subspaces, which is inspired by Jones (1990), David and Semmes (1991) and Lerman (2003) as well as Fukunaga and Olsen (1971) and Little et al. (2009a, 2009b). These local subspace estimates can be used to set the model parameters for, or to initialize, an HLM algorithm. The basic idea is that for a data set X sampled from a hybrid linear model (perhaps with some noise), there are many points x such that the principal components of an appropriately sized neighborhood of x give a good approximation to the subspace x belongs to. Using local subspaces to infer the global hybrid linear model was suggested in Yan and Pollefeys (2006) for linear subspaces; however, there they use very small neighborhoods that are not adaptive to the structure of the data (e.g., the amount of noise). An “appropriately sized neighborhood” needs to be larger than the noise, so that the subspace is recognized. However, the neighborhood cannot be so large that it contains points from multiple subspaces. The correct choice of this size is carefully quantified in Sect. 2.1. We refer to such “appropriately sized neighborhoods” as “optimal neighborhoods”.

In addition to studying how to estimate local subspaces, we describe two complete HLM algorithms which are natural extensions of the local estimation: LBF (Local Best-fit Flat) and SLBF (spectral LBF). On many data sets, the first obtains state-of-the-art speed with nearly state-of-the-art accuracy (it can also deal with very large data), and the second obtains state-of-the-art accuracy with reasonable run times (it also seems able to deal, to some extent, with nonlinear structures such as the ones arising in motion segmentation data). We remark that we test accuracy in various scenarios, in particular with intersecting subspaces and with outliers. While in this work we only theoretically justify our choice of “optimal” neighborhoods, we are hopeful about developing a more complete theory justifying our algorithms.

In particular, we believe that such a theory can be valid in the setting suggested by Soltanolkotabi and Candès (2011) for analyzing the SSC algorithm, while having additional noise and restricting the fraction of outliers (or modifying our algorithms so they are even more robust to outliers). We are also interested in rigorously quantifying the limitations of our algorithms (as conjectured in Sect. 4).

We summarize the main contributions of this work, which is the full length version of Zhang et al. (2010), as follows.

  • We make precise the local best-fit heuristic, using the \(\beta_2\) numbers (Jones 1990; David and Semmes 1991; Lerman 2003). We give an algorithm to approximately find optimal neighborhoods in the above sense; in fact, we prove that it does so under certain geometric conditions.

  • Using the local best-fit heuristic, we introduce the LBF and SLBF algorithms for HLM. At each point of a randomly chosen subset of the data, they use the best-fit flats of the “optimal” neighborhoods to build a global model with different methods (LBF is based on energy minimization and SLBF is a spectral method).

  • We perform extensive experiments on motion segmentation data (the Hopkins 155 benchmark of Tron and Vidal (2007)), face clustering (the extended Yale face database B), handwritten digits (the MNIST database), and artificial data, showing that both algorithms, in particular SLBF, are accurate on real and synthetic HLM problems, while LBF runs extremely fast (often on the order of ten times faster than most of the previously mentioned methods). For the cropped face data we point out a fundamental problem of local methods like LBF and SLBF, though we suggest a workaround that works for this particular data set.

  • We demonstrate how the local best-fit heuristic can be used with other algorithms. In particular, we give experimental evidence to show that the K-flats algorithm (Ho et al. 2003) is improved by initialization that is based on the local best-fit heuristic. We also use this heuristic to estimate the main parameters of both RANSAC (for HLM) (Yang et al. 2006) and ALC (Ma et al. 2007).

  • We show how the combination of LBF and the elbow method can quickly determine the number of subspaces.

The rest of this paper is organized as follows. In Sect. 2 we describe the LBF and SLBF algorithms and state a theorem giving conditions that guarantee that good neighborhoods can be found. Section 3 carefully tests the LBF and SLBF algorithms (while comparing them to other common HLM algorithms) on both artificial data of synthetic hybrid linear models and real data of motion segmentation in video sequences, face clustering and handwritten digits recognition. It also demonstrates how to determine the number of clusters by applying the fast algorithm of this paper together with the straightforward elbow method. Section 4 concludes with a brief discussion and mentions possibilities for future work.

2 The Local Best-Fit Flats Heuristic and the LBF and SLBF Algorithms

We describe two methods, LBF and SLBF, which have at their heart an estimation of local flats capturing the global structures of the data (or part of it). Both methods first find a set of candidate flats (the number is an input parameter for LBF). These are best-fit flats for local “optimal” neighborhoods (we describe an algorithm for approximately finding such neighborhoods and justify it in Sect. 2.1). The two algorithms process the candidates in different ways: LBF uses energy minimization and SLBF uses a spectral approach.

2.1 Choosing the Optimal Neighborhood

We choose the candidate flats that capture the global structure of the data by fitting them to ‘optimal’ local neighborhoods of data points. For a point \(\mathbf{x}\in\mathbb{R}^{D}\), we define an optimal neighborhood as the largest ball B(x,r) (centered at x and with radius r) that only contains points sampled from the same cluster as x. Indeed, neighborhoods smaller than the optimal one (around x) may mainly contain the noise around an underlying subspace (of the hybrid linear model); consequently their local best-fit flats may not match the underlying flat. On the other hand, a neighborhood larger than the optimal one (around x) will contain points from more than one underlying flat, and the resulting best-fit flat will again not match any of the underlying flats. We note that the choice of neighborhood B(x,r) is equivalent to the choice of radius r, which we refer to as scale (even though it is also common to refer to log(r) or −log(r) as scale). While it is possible to take a guess at the optimal scale as a parameter (e.g., as commonly done by fixing the number of nearest neighbors to x), we have found that it is possible to choose the optimal scale reasonably well automatically, while adapting it to the given point x.

We will start at the smallest scale (i.e., the smallest radius containing only d+1 points) and look at larger and larger neighborhoods of a given point \(\mathbf{x}_0\). At the smallest scale, any noise may cause the local neighborhood to have higher dimension than d. As we add points to the neighborhood, it becomes better and better approximated in a scale-invariant sense (e.g., by scaling the neighborhood to have radius 1 and then computing the error of approximation by the best-fit flat), until points belonging to other flats enter the neighborhood. To be more precise, we define the scale-invariant error for a neighborhood \(\mathcal {N}\equiv B(\mathbf{x}_{0},r)\) of \(\mathbf{x}_0\) by the formula:

$$ \beta_2(\mathcal {N}) = \frac{\min_{\text{$d$-flats $L$}} \sqrt{\sum_{\mathbf{y}\in \mathcal {N}} \|\mathbf{y}-P_{L}\mathbf{y}\|^2/|\mathcal {N}|}}{\max_{\mathbf{x}\in \mathcal {N}}\|\mathbf{x}-\mathbf{x}_0\|}, $$
(1)

where \(|\mathcal {N}|\) denotes the number of points in \(\mathcal {N}\), \(P_{L}\) denotes the orthogonal projection onto the flat L and the minimization is over all d-flats in \(\mathbb{R}^{D}\). We note that the numerator is the \(\ell_2\) error of approximation by the best-fit flat at scale r and the denominator is the scale r. The notion of scale-invariant error was introduced and utilized in Jones (1990), David and Semmes (1991) and Lerman (2003).

Using this scale-invariant error we can reformulate our criterion for choosing the optimal neighborhood more precisely. That is, we start with the smallest neighborhood containing the S nearest neighbors of x, increase the number of nearest neighbors by T in each iteration and check the \(\beta_2\) number of each neighborhood using (1). We estimate the optimal neighborhood as the last one for which \(\beta_2\) is smaller than the \(\beta_2\) of the previous neighborhood (that is, we search for the first local minimizer of \(\beta_{2}(\mathcal {N})\)). This procedure is summarized in Algorithm 1. It is experimentally robust to outliers if x is an inlier, since the nearest neighbors of inliers also tend to be inliers.

Algorithm 1 Neighborhood size selection for HLM by randomized local best-fit flats
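
To make the preceding description concrete, here is a minimal sketch of Algorithm 1 in Python/NumPy under our own naming (the functions beta2 and optimal_neighborhood and their defaults are ours, not the authors' released code); beta2 evaluates (1) via the singular values of the centered neighborhood, and the loop returns the neighborhood at the first local minimum of \(\beta_2\).

```python
import numpy as np

def beta2(nbhd, x0, d):
    """Scale-invariant error of Eq. (1): l2 residual of the best d-flat divided by the radius."""
    centered = nbhd - nbhd.mean(axis=0)                  # best-fit flats pass through the centroid
    sv = np.linalg.svd(centered, compute_uv=False)
    residual = np.sqrt(np.sum(sv[d:] ** 2) / len(nbhd))  # rms distance to the best-fit d-flat
    radius = np.max(np.linalg.norm(nbhd - x0, axis=1))   # scale r of the neighborhood
    return residual / radius

def optimal_neighborhood(X, x0, d, S=None, T=2):
    """Indices of the estimated optimal neighborhood of x0 (first local minimizer of beta2)."""
    S = 2 * d if S is None else S
    order = np.argsort(np.linalg.norm(X - x0, axis=1))   # neighbors of x0 sorted by distance
    prev_idx, prev_beta = None, np.inf
    for k in range(S, len(X) + 1, T):                    # grow the neighborhood by T points at a time
        idx = order[:k]
        b = beta2(X[idx], x0, d)
        if b >= prev_beta:                               # beta2 stopped decreasing: first local minimum
            return prev_idx
        prev_idx, prev_beta = idx, b
    return prev_idx                                      # no local minimum found: keep the largest scale
```

The defaults S=2d and T=2 match the parameter choices reported in Sect. 3.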

2.1.1 Theoretical Justification

The following theorem tries to justify our strategy of estimating the optimal scale around each point by showing that in the continuous setting the first local minimizer of \(\beta_2(\mathbf{x},r):=\beta_2(B(\mathbf{x},r))\) is approximately the distance from x to the nearest cluster that does not contain x (here the underlying model is a mixture of Lebesgue measures in strips around several subspaces and x is an arbitrary point on one of these subspaces). Therefore, if we choose the size of the neighborhood following Algorithm 1 (adapted to the continuous setting), then we will approximately obtain the optimal neighborhood. It is rather standard to extend such estimates for measures to a probabilistic setting, where i.i.d. data is sampled from the continuous distribution. The theorem will then hold with high probability for sufficiently large sample size (due to technicalities, which also require truncating the support of our continuous measure, we avoid these details). The proof of this theorem is in Appendix A.

Our theorem uses the following analog of the discrete \(\beta_2\) of (1) for a measure μ and a ball B(x,r) (see also Lerman 2003):

$$ \beta_2\bigl(B(\mathbf{x},r)\bigr) = \frac{\min_{\text{$d$-flats $L$}} \sqrt{\int_{B(\mathbf{x},r)} {\rm dist}(\mathbf{y},L)^2 \, d\mu(\mathbf{y}) \big/ \mu\bigl(B(\mathbf{x},r)\bigr)}}{r}, $$
(2)

where throughout the paper \({\rm dist}(\cdot, \cdot)\) denotes the Euclidean distance, for example in this case \({\rm dist}(\mathbf{x},L):=\|\mathbf{x}-P_{L}\mathbf{x}\|\). The theorem also assumes that the underlying measure is supported on a union of K tubes \(T(L_{i},w_{i}):=\{\mathbf{x}\in \mathbb{R}^{D}: \|\mathbf{x}-P_{L_{i}}\mathbf{x}\|<w_{i}\}\), i=1,…,K (centered around the d-flats \(L_1,\ldots,L_K\), respectively).

Theorem 1

Let K≥2, d<D, \(\{L_{i}\}_{i=1}^{K}\) be K d-flats in \(\mathbb{R}^{D}\), \(\{\mu_{i}\}_{i=1}^{K}\) be K Lebesgue measures on the tubes \(\{T(L_{i},w)\}_{i=1}^{K}\subset \mathbb{R}^{D}\) respectively and let \(\mu=\sum_{i=1}^{K}\mu_{i}\). For fixed \(1\le i^{*}\le K\) and fixed \(\mathbf{x}^{*}\in L_{i^{*}}\), let

$$ r_0 = r_0(\mathbf{x}^*) = {\rm dist}\biggl(\mathbf{x}^*,\bigcup_{i\neq i^*}T(L_i,w)\biggr). $$
(3)

If \(w<r_0\), then \(\beta_2(\mathbf{x}^*,r)\) (as a function of r) is constant on \([0,w]\) and decreases on \([w,r_0]\). If also

$$ \frac{w}{r_0} < \begin{cases} \min\bigl(0.02,\sqrt{\frac{(D+1)}{150\sqrt{2}(D-1)K}}\bigr), &\hbox{when\ $d=1$;} \\[8pt] \min\bigl(0.02,\sqrt{\frac{(D-d+2)}{6(50)^{\frac{d}{2}}(D-d)K}}\bigr), &\hbox{when\ $d>1$,} \end{cases} $$
(4)

then there exists \(r_0<r^*<1.09\,r_0\) such that

$$ \beta_2\bigl(\mathbf{x}^*,r^*\bigr) > \beta_2\bigl(\mathbf{x}^*, r_0\bigr). $$
(5)

That is, the first local minimum of \(\beta_2(\mathbf{x}^*,r)\) (as a function of r) occurs in \((r_0,1.09\,r_0)\).

The proof of this theorem indicates a weaker, though less intuitive, condition than (4). It also shows that \(r^*\to r_0\) as \(w/r_0\to 0\) and clarifies by example why the first local minimizer of \(\beta_2(\mathbf{x}^*,r)\) is often bigger than \(r_0\) (see Remark 1).

2.1.2 The Complexity of Algorithm 1

Algorithm 1 requires sorting the neighbors of x according to their distance to x; the computational cost of this preprocessing step is O(DN+N⋅logN). In order to obtain \(\beta_{2}(\mathcal {N}_{k})\), we need the top d singular values of the \(|\mathcal {N}_{k}|\times D\) data matrix representing the \(|\mathcal {N}_{k}|\) points, which requires a complexity of \(O(d\cdot D\cdot |\mathcal {N}_{k}|)\). To find \(\mathcal {N}(x)\), we need to generate \(\beta_{2}(\mathcal {N}_{k})\) for each \(|\mathcal {N}_{k}|=S+kT\), where k=1,2,…,(N−S)/T; hence the complexity for obtaining \(\mathcal {N}(x)\) is of order:

$$O\Biggl(d\cdot D\cdot \sum_{k=1}^{(N-S)/T}(S+kT)\Biggr)\leq O \bigl(d\cdot D\cdot N^2/2T\bigr). $$

We thus note that if T is on the order of N, e.g., T=max(N/300,2), the total complexity of Algorithm 1 is O((dD+logN)⋅N). Note that if we limit the number of scales that we search, then the two log terms above can be replaced by a constant.

2.2 The LBF Algorithm

The LBF algorithm searches for a good set of flats from the candidates (described above) in a greedy fashion. A measure G of the goodness of K flats is chosen; here, it will be the sum of the distances of the points to their nearest flats, i.e.,

$$ G=G_{\mathrm{X}}\bigl(\{L_1,\ldots,L_K\}\bigr) = \sum_{\mathbf{x}\in \mathrm{X}} {\rm dist}\Biggl(\mathbf{x},\bigcup_{i=1}^{K} L_i\Biggr). $$
(6)

After randomly initializing K flats from the list of candidates, p passes are made through the data points. In each of the passes, we replace a random current flat with the candidate that minimizes the value of G. We then move to the next pass, picking a random flat, etc. Algorithm 2 sketches this procedure (where the greedy minimization of G is described in step 2).

Algorithm 2 LBF: energy minimization over randomized local best-fit flats
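
The following Python/NumPy sketch illustrates the greedy search of Algorithm 2 under stated assumptions: the candidate flats are assumed to be given as (centroid, orthonormal basis) pairs already fit by PCA to the neighborhoods of Algorithm 1, and all function and parameter names are ours rather than the authors' implementation.

```python
import numpy as np

def dist_to_flat(X, center, basis):
    """Distances of the rows of X to the affine flat center + row-span(basis)."""
    Y = X - center
    return np.linalg.norm(Y - (Y @ basis.T) @ basis, axis=1)

def lbf_greedy(X, candidates, K, p, rng=None):
    """Greedily minimize the energy G of Eq. (6) over K flats chosen from the C candidates."""
    rng = np.random.default_rng(rng)
    C = len(candidates)
    D = np.column_stack([dist_to_flat(X, c, B) for c, B in candidates])  # N x C distance matrix
    chosen = list(rng.choice(C, size=K, replace=False))                  # random initial K flats
    for _ in range(p):                                       # p passes, each replacing one random flat
        j = rng.integers(K)
        others = [chosen[i] for i in range(K) if i != j]
        energies = [D[:, others + [c]].min(axis=1).sum() for c in range(C)]  # G for each candidate
        chosen[j] = int(np.argmin(energies))                 # keep the candidate minimizing G
    labels = np.argmin(D[:, chosen], axis=1)                 # assign each point to its nearest chosen flat
    return chosen, labels
```

Replacing the .sum() by a median yields the more outlier-robust energy G′ discussed below.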

The simplest choice of G is the sum of the squared distances of each point in X to its nearest flat, i.e., raising the distances in (6) to the power 2. However, in some scenarios the \(\ell_1\) energy of (6) is more robust to outliers than the mean squared error (see Lerman and Zhang (2011, 2012) for theoretical support and Zhang et al. (2009) for experimental support). This method also allows using energy functions that are hard to minimize (even heuristically). Indeed, it only requires evaluating the energy on the candidate configurations. For example, when the data set requires stronger robustness to outliers, one may use the following energy:

$$G'=G'_{\mathrm{X}}\bigl(\{L_1,\ldots,L_K\}\bigr) = \mathop{\mathrm{Median}}\limits_{\mathbf{x}\in \mathrm{X}} {\rm dist}\Biggl(\mathbf{x}, \bigcup_{i=1}^K L_i\Biggr). $$

The LBF algorithm is closely related to RANSAC, since both of them use candidate subspaces to fit the data set. However, Algorithm 1 gives LBF an advantage in choosing good candidates, whereas RANSAC fits a d-flat to d+1 arbitrarily chosen points.

2.2.1 The Complexity and Storage of Algorithm 2

For step 2 of this algorithm we need to run Algorithm 1 C times and thus its complexity is of order O((dD+logN)⋅CN). Note that the logN comes from a full sort of N distances, and if we restrict to a fixed number of scales, this can be replaced by a constant. Step 3 of Algorithm 2 requires C SVD decompositions for C matrices of size at most N×D, in order to obtain the first d vectors in \(\mathbb{R}^{D}\). It thus also has a complexity of at most O(CdDN).

Step 2 of Algorithm 2 requires the evaluation of the N×C matrix representing the distances \(\|\mathbf{x}_{i}-P_{L_{j}}\mathbf{x}_{i}\|\) between \(\mathrm{X}=\{\mathbf{x}_1,\mathbf{x}_2,\ldots,\mathbf{x}_N\}\) and \(L_1,L_2,\ldots,L_C\). This costs O(CdDN) operations, since each distance from a point to a subspace costs O(dD). Moreover, the p passes have complexity of order O(p⋅(C−K)⋅N). Therefore, step 2 of Algorithm 2 has a complexity of order O(CN⋅(dD+p)). Finally, step 7 of Algorithm 2 has a complexity of order O(KdDN), which comes from the construction of the N×K matrix of distances from the N points to the K subspaces. Combining these complexities, we have an overall complexity of O(CN⋅(dD+p+logN)) for the LBF algorithm; as before, if we fix the number of scales independently of N, the log terms can be replaced by a constant.

To compute the storage requirements of LBF, we note that the data set is saved in an N×D matrix, the candidate subspaces are organized in C projection matrices of size D×d, and in addition the algorithm stores an N×C matrix of distances between the data points and the C candidate subspaces. Therefore, the storage of LBF is on the order of O(DN+CDd+NC).

2.3 The SLBF Algorithm

The SLBF algorithm (which is sketched in Algorithm 3) processes the candidate subspaces via a spectral clustering method. It first finds the neighborhoods \(\{\mathcal{N}_{i}\}_{i=1}^{N}\) for all points \(\{\mathbf{x}_{i}\}_{i=1}^{N}\) via Algorithm 1 and fits d-flats \(\{L_{i}\}_{i=1}^{N}\) (via PCA) in these neighborhoods. It then forms the N×N matrices S and \(\hat{\mathbf{S}}\) as follows:

$$ \mathbf{S}_{i,j}=\sqrt{{\rm dist}(\mathbf{x}_i,L_j)\,{\rm dist}(\mathbf{x}_j,L_i)}, $$
(7)

and

$$ \hat{\mathbf{S}}_{i,j}=\exp \bigl(-\mathbf{S}_{i,j}/2\sigma_j^2\bigr)+\exp\bigl(-\mathbf{S}_{i,j}/2\sigma_i^2\bigr), $$
(8)

where

$$ \sigma_j=\lambda\sqrt{\min_{\text{$d$-flats\,\,L}} \sum_{\mathbf{x}\in\mathcal{N}_j}\|\mathbf{x}-P_{L}\mathbf{x}\|^2/|\mathcal {N}_j|} $$
(9)

(we explain the choice of λ below, when we clarify (9)). Finally, it applies spectral clustering with the matrix \(\hat{\mathbf{S}}\). More precisely, SLBF follows the main algorithm of Ng et al. (2001, Sect. 2), replacing the matrix A there by \(\hat{\mathbf{S}}\), multiplying the unit eigenvectors of Step 3 (of Ng et al. (2001, Sect. 2)) by the corresponding square roots of eigenvalues and skipping step 4. We remark that the last two changes to Ng et al. (2001, Sect. 2) are commonly used so that the normalized similarity matrix (defined in step 4 of Algorithm 3) can be considered as a Gram matrix; see e.g., Euclidean MDS (Cox and Cox 2001) and ISOMAP (Tenenbaum 2000).

Algorithm 3 SLBF: spectral clustering based on best-fit flats
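
Below is a minimal Python/NumPy sketch of the SLBF affinity and embedding steps (Eqs. (7)–(9)) for a single value of λ; it assumes the local flats and local noise levels have been obtained by PCA on the neighborhoods of Algorithm 1, and the spectral step loosely follows the modified Ng et al. (2001) procedure described above. The names and interfaces are ours, not the authors' code.

```python
import numpy as np

def dist_to_flat(X, center, basis):
    Y = X - center
    return np.linalg.norm(Y - (Y @ basis.T) @ basis, axis=1)

def slbf_embedding(X, flats, local_noise, lam, K):
    """flats[i]: (center, basis) fit to the neighborhood of x_i;
    local_noise[i]: rms PCA residual on that neighborhood (the square root in Eq. (9))."""
    Dmat = np.column_stack([dist_to_flat(X, c, B) for c, B in flats])  # Dmat[i, j] = dist(x_i, L_j)
    S = np.sqrt(Dmat * Dmat.T)                                         # Eq. (7)
    sigma = lam * np.asarray(local_noise)                              # Eq. (9)
    S_hat = (np.exp(-S / (2 * sigma[None, :] ** 2))
             + np.exp(-S / (2 * sigma[:, None] ** 2)))                 # Eq. (8)
    d_inv_sqrt = 1.0 / np.sqrt(S_hat.sum(axis=1))
    W = d_inv_sqrt[:, None] * S_hat * d_inv_sqrt[None, :]              # normalized similarity matrix
    vals, vecs = np.linalg.eigh(W)
    top = np.argsort(vals)[-K:]
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0))            # eigenvectors scaled by sqrt(eigenvalues)
```

K-means on the rows of the returned N×K matrix then gives the K clusters; in practice this is repeated for each default value of λ and the best segmentation is kept, as described below.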

As discussed in Vidal (2011), SLBF is a “spectral clustering-based method”, similar to SCC, LSA and SSC. These algorithms construct an N×N affinity matrix, whose ij-th entry represents the similarity between points i and j, and then apply spectral clustering using this affinity matrix. Ideally, the affinities of points from the same cluster are of order 1 and the affinities of points from different clusters are of order 0. Indeed, for the affinity \(\hat{\mathbf{S}}\) of SLBF, if \(\mathbf{x}_i\) and \(\mathbf{x}_j\) are in the same cluster, then we expect that \(\mathbf{x}_i\) is close to \(L_j\) and \(\mathbf{x}_j\) is close to \(L_i\), which means \(\mathbf{S}_{i,j}\) is close to 0 and thus \(\hat{\mathbf{S}}_{i,j}\) is close to 1 (we assume here that \(L_i\) and \(L_j\) are good estimators for the underlying subspace of the cluster shared by \(\mathbf{x}_i\) and \(\mathbf{x}_j\), as suggested by Theorem 1). Otherwise, if \(\mathbf{x}_i\) and \(\mathbf{x}_j\) are not in the same cluster, then we expect that \(\mathbf{x}_i\) is sufficiently far from \(L_j\) and \(\mathbf{x}_j\) is sufficiently far from \(L_i\), which implies that \(\hat{\mathbf{S}}_{i,j}\) is close to 0. The choice of \(\sigma_j\) clearly affects this heuristic argument on the size of \(\hat{\mathbf{S}}_{i,j}\). Theoretically, \(\sigma_j\) should be larger than the noise, so that \(\hat{\mathbf{S}}_{i,j}\) is close to 1 when \(\mathbf{x}_i\) and \(\mathbf{x}_j\) are in the same cluster, but \(\sigma_j\) cannot be so large that \(\hat{\mathbf{S}}_{i,j}\) is close to 1 when \(\mathbf{x}_i\) and \(\mathbf{x}_j\) are not in the same cluster. Therefore we use (9), where \(\sqrt{\min_{\text{$d$-flats\,\,\textit{L}}} \sum_{\mathbf{x}\in\mathcal{N}_{j}}\|\mathbf{x}-P_{L}\mathbf{x}\|^{2}/|\mathcal {N}_{j}|}\) is the estimated noise of the data set around the point \(\mathbf{x}_j\) and λ is a parameter. Following the strategy in Chen and Lerman (2009), we choose different values of λ (our fixed default values are \([2, 2e, 2e^2, \ldots, 2e^6]\)) and consequently obtain several segmentation results (7 results when using our default values). We then choose the segmentation with the smallest error in (10).

We remark that the robustness of SLBF to outliers can be partly explained by the robustness of spectral-type methods to outliers. Furthermore, it is possible to initially remove some outliers according to very small values of the corresponding diagonal elements of D (see e.g., Chen and Lerman (2009), Arias-Castro et al. (2011)).

Similar to SLBF, LSA (Yan and Pollefeys 2006) is also based on fitting local subspaces. However, LSA fits subspaces to local neighborhoods with a fixed number of points and is thus not adaptive. Moreover, the local subspaces of LSA are forced to be linear (since the affinity of LSA is based on principal angles between such subspaces) and this further restricts the applicability of LSA. There is also some similarity between the idea of SLBF and that of SCC (Chen and Lerman 2009). Indeed, we may view SCC as fitting candidate subspaces based on d+1 data points (the iterative procedure tries to enforce that the points come from the same cluster). However, in practice they operate very differently; in particular, SCC is not based just on local information (though a local version of SCC follows from Arias-Castro et al. (2011)). The SSC algorithm is also a spectral method, but similar to SCC its affinities are global (they are based on sparse representations of data points).

2.3.1 Complexity and Storage of the SLBF Algorithm

Step 1 of Algorithm 3 has a complexity of order O((dD+logN)⋅N^2), since it applies Algorithm 1 to every point in the set X. The most expensive calculation of steps 2–4 in Algorithm 3 is the construction of S, which requires a complexity of order O(dDN^2). The eigenvalue decomposition in step 5 has a complexity of order O(KN^2) and the K-means algorithm in step 6 has a complexity of order O(n_s NK^2), where n_s is the number of iterations in K-means.

Combining these complexities, we have an overall complexity of order O((dD+logN)⋅N^2+n_s NK^2) for SLBF. As before, limiting to a constant number of scales replaces the log term with a constant.

We note that SLBF stores the data set in a D×N matrix, the candidate subspaces in N matrices of size D×d (recall that in SLBF every data point is assigned a subspace and thus C=N), and it also uses the N×N matrix S. Therefore, the storage of SLBF is on the order of O(NDd+N^2).

2.4 Adaptation of the Proposed Algorithms to Motion Segmentation Data

Note that the first minimum in Theorem 1 excludes the left endpoint, and thus k=0 is excluded in Algorithm 1. In our experiments, we noticed that on data without too much noise, it is useful to allow the first scale to count as a local minimum and allow k=0 in Algorithm 1. We refer to the implementation of LBF and SLBF with those two techniques tailored for motion segmentation data as LBF-MS and SLBF-MS.

3 Experimental Results

In this section, we conduct experiments on artificial and real data sets to verify the effectiveness of the proposed algorithms in comparison to other HLM algorithms. We will see that in many situations the methods we propose are fast and accurate; however, in Sect. 3.3 we will show a failure mode of our method and discuss how it can be corrected.

We measure the accuracy of those algorithms by the rate of misclassified points with outliers excluded, that is

$$ \mathrm{error}\% = \frac{\mathrm{\#\ of\ misclassified\ inliers}}{\mathrm{\#\ of\ total\ inliers}}\times 100\ \%. $$
(11)

In all the experiments below, the number C in Algorithm 2 is 70⋅K, where K is the number of subspaces, the number p in Algorithm 2 is 5⋅K, and the numbers S and T in Algorithm 1 are 2⋅d and 2 respectively, where d is the dimension of the subspaces. According to our experience, LBF and SLBF are very robust to changes in parameters, but unsurprisingly, there is a general trade-off between accuracy (higher C, higher p, smaller T) and run time (lower C, lower p, larger T). We have chosen these parameters for a balance between run time and accuracy. Nevertheless, we have insisted on using the same parameters for all data sets and experiments, even though particular parameters could obtain even better or near perfect results for some of the data sets. The experiments in Sects. 3.1 and 3.2 run on a computer with an Intel Core 2 CPU at 2.66 GHz and 2 GB memory, and the experiments in Sects. 3.3 and 3.4 run on a machine with an Intel Core 2 Quad Q6600 at 2.4 GHz and 8 GB memory.

3.1 Clustering Results on Artificial Data

We compare our algorithms with the following algorithms: Mixtures of PPCA (MoPPCA) (Tipping and Bishop 1999), K-flats (KF) (Ho et al. 2003), Local Subspace Analysis (LSA) (Yan and Pollefeys 2006), Spectral Curvature Clustering (SCC) (Chen and Lerman 2009), Random Sample Consensus (RANSAC) for HLM (Yang et al. 2006), Agglomerative Lossy Compression (ALC) (Ma et al. 2007) and GPCA with voting/robust GPCA (GPCA) (Ma et al. 2008; Yang et al. 2006). Throughout the rest of the paper, we use the Matlab codes of the GPCA, MoPPCA and KF algorithms from http://perception.csl.uiuc.edu/gpca, the LSA algorithm from http://www.vision.jhu.edu/db, the SCC algorithm from http://www.math.umn.edu/~lerman/scc, the ALC algorithm from http://perception.csl.uiuc.edu/coding/motion/, the RANSAC algorithm from http://www.vision.jhu.edu/code/ and the SSC algorithm from http://www.cis.jhu.edu/~ehsan/ssc.htm.

For the SCC algorithm, we also try a slightly modified version tailored for motion segmentation as in step 6 of Algorithm 3, which we refer to as SCC-MS (SCC for motion segmentation): following the notation of Chen and Lerman (2009, Algorithm 2), we let U be the N×K matrix whose columns are the top K left singular vectors of \(\mathbf{A}_{C}^{*}\) and denote by Σ the diagonal K×K matrix whose elements are the corresponding top K singular values of \(\mathbf{A}_{C}^{*}\). Then the K-means step of SCC-MS is applied directly to the rows of the N×K matrix \(\mathbf{U}\boldsymbol{\Sigma}^{1/2}\) (as opposed to applying it to U, or to its row-wise normalization, in SCC).

The MoPPCA algorithm is always initialized with a random guess of the membership of the data points. The LSCC algorithm is initialized by randomly picking 100×K (d+1)-tuples (following Chen and Lerman 2009) and KF is initialized with a random guess. Since algorithms like KF tend to converge to a local minimum, we use 10 restarts for MoPPCA and 30 restarts for KF, and record the misclassification rate of the run with the smallest \(\ell_2\) error for both of these algorithms. The number of restarts was restricted by the running time and accuracy. In the SSC algorithm, we set the value of λ to 0.01, as suggested in the code.

The RANSAC for HLM and ALC algorithms (Yang et al. 2006; Ma et al. 2007) depend on a user-supplied inlier threshold. RANSAC (oracle) and ALC (oracle) use the oracle inlier bound given by the true noise variance of the model and thus clearly have an advantage over the other algorithms listed. RANSAC (ϵ from LBF) and ALC (ϵ from LBF) estimate the inlier threshold by the local best-fit flats heuristic of this paper. That is, they find best-fit neighborhoods for all N points using the latter heuristic and compute the least error of approximation by d-flats in these N neighborhoods. The inlier bound ϵ is then the average of these errors. If the number of clusters resulting from ALC (ϵ from LBF or oracle) is larger than K, then we choose the K largest clusters and identify the points in the remaining clusters as outliers. For some cases the RANSAC algorithm breaks down and we then report it as N/A. The reason for this is that RANSAC (for HLM) (Yang et al. 2006) is very sensitive to the estimate of ϵ and an overestimate can result in the removal of points belonging to more than one subspace, so that the algorithm may exhaust all points before detecting K subspaces.

We remark that GPCA cannot naturally deal with outliers; therefore we use robust GPCA with Multivariate Trimming (Yang et al. 2006), with the parameters ‘angleTolerance’ and ‘boundarythreshold’ set to 0.3 and 0.4, respectively.

The artificial data represents various instances of K linear subspaces in \(\mathbb{R}^{D}\). If their dimensions are fixed and equal d, we follow Chen and Lerman (2009) and refer to the setting as \(d^{K}\in \mathbb{R}^{D}\). If they are mixed, then we follow Ma et al. (2008) and refer to the setting as \((d_{1},\ldots,d_{K})\in \mathbb{R}^{D}\). Fixing K and d (or \(d_1,\ldots,d_K\)), we randomly generate 100 different instances of corresponding hybrid linear models according to the code in http://perception.csl.uiuc.edu/gpca. More precisely, for each of the 100 experiments, K linear subspaces of the corresponding dimensions in \(\mathbb{R}^{D}\) are randomly generated. Each sample within a subspace is the sum of two random vectors. The first is drawn from the uniform distribution in a d-dimensional ball of radius 1 of that subspace (centered at the origin in the case of linear subspaces). The second is drawn from a D-dimensional multivariate normal distribution with mean 0 and covariance matrix \(0.05^{2}\mathbf{I}_{D\times D}\). Then, for each subspace 250 samples are generated according to the distribution just described. Next, the data is further corrupted with 5 % or 30 % uniformly distributed outliers in a cube of sidelength determined by the maximal distance of the former 250 samples to the origin (using the same code).
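
For concreteness, here is a sketch in Python/NumPy of one synthetic instance (linear-subspace case) following the description above; this is our own reimplementation, not the referenced GPCA package code, and details such as the exact outlier count and cube convention are assumptions.

```python
import numpy as np

def sample_instance(K, d, D, n_per=250, noise=0.05, outlier_frac=0.05, seed=None):
    """One instance of K random d-dimensional linear subspaces in R^D with noise and outliers."""
    rng = np.random.default_rng(seed)
    points, labels = [], []
    for k in range(K):
        basis, _ = np.linalg.qr(rng.standard_normal((D, d)))       # random d-dimensional subspace
        directions = rng.standard_normal((n_per, d))
        directions /= np.linalg.norm(directions, axis=1, keepdims=True)
        radii = rng.random(n_per) ** (1.0 / d)                     # uniform sampling in the unit d-ball
        samples = (directions * radii[:, None]) @ basis.T          # points on the subspace
        samples += noise * rng.standard_normal((n_per, D))         # additive Gaussian noise (std 0.05)
        points.append(samples)
        labels.append(np.full(n_per, k))
    X = np.vstack(points)
    side = np.max(np.linalg.norm(X, axis=1))                       # cube size from the inlier norms
    n_out = int(outlier_frac * len(X) / (1 - outlier_frac))        # outliers as a fraction of the final set
    outliers = rng.uniform(-side, side, size=(n_out, D))
    return np.vstack([X, outliers]), np.concatenate(labels + [np.full(n_out, -1)])
```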

Since most algorithms (SCC, LSA, MoPPCA, LBF, SLBF, RANSAC, SSC) do not support mixed dimensions natively, we assume each subspace has the maximum dimension in the experiment. GPCA and ALC support mixed dimensions natively, and GPCA is the only algorithm for which we specify the dimension of each subspace in the mixed-dimension case (note that knowledge of the dimensions is unnecessary for the ALC algorithm).

The mean (over 100 instances) misclassification rates and the mean running times of the various algorithms are recorded in Table 1. From Table 1 we can see that our algorithms, i.e., LBF, LBF-MS, SLBF, SLBF-MS, perform well in various artificial instances of hybrid linear modeling (with both linear subspaces and affine subspaces), and their advantage is especially obvious with many outliers and affine subspaces. The robustness to outliers is a result of our use of both the \(\ell_1\) loss function (see e.g., Lerman and Zhang 2011, 2012) and random sampling. SLBF and SLBF-MS are better in the affine cases because of their use of spectral clustering. Also, unlike many other methods, the proposed methods natively support affine subspace models (this particular data has non-intersecting subspaces, which is advantageous for some other algorithms, e.g., SSC). The results of RANSAC (ϵ from LBF) and ALC (ϵ from LBF) show that the local best-fit heuristic can be effectively used to estimate the main parameter of RANSAC and ALC, i.e., to estimate the local noise. Table 1 also shows that the running time of LBF/LBF-MS is less than the running time of most other algorithms, especially GPCA, LSA, RANSAC, ALC and SSC. The difference is large enough that we can also use the proposed algorithm as an initialization for the others. The LBF and LBF-MS algorithms are slower than a single run of K-flats, but it usually takes many restarts of K-flats to get a decent result. Notice that the parameters C and p in our algorithm function in a manner similar to the number of restarts in KF. SLBF and SLBF-MS cost more time when N is large, because of the construction of the N×N matrix in spectral clustering, but they still have speed comparable to LSA and are faster than SSC, two other spectral-clustering-based algorithms.

Table 1 Mean percentage of misclassified points in artificial data for linear-subspace cases or affine-subspace case

3.2 Clustering Results on Motion Segmentation Data

We test the proposed algorithms on the Hopkins 155 database of motion segmentation, which is available at http://www.vision.jhu.edu/data/hopkins155. This data set contains 155 video sequences along with the coordinates of certain features extracted and tracked for each sequence in all its frames. The main task is to cluster the feature vectors (across all frames) according to the different moving objects and background in each video. It consists of three types of videos: checker, traffic and articulated (see Fig. 2 for sample frames of such videos).

More formally, for a given video sequence, we denote the number of frames by F. In each sequence, we have either one or two independently moving objects, and the background can also move due to the motion of the camera. We let K be the number of moving objects plus the background, so that K is 2 or 3 (and distinguish accordingly between two-motions and three-motions). For each sequence, there are also N feature points \(\mathbf{y}_1,\mathbf{y}_2,\ldots,\mathbf{y}_N\in\mathbb{R}^{3}\) that are detected on the objects and the background. Let \(\mathbf{z}_{ij}\in\mathbb{R}^{2}\) be the coordinates of the feature point \(\mathbf{y}_j\) in the ith image frame for every 1≤i≤F and 1≤j≤N. Then \(\mathbf{z}_{j}=[\mathbf{z}_{1j},\mathbf{z}_{2j},\ldots,\mathbf{z}_{Fj}]\in\mathbb{R}^{2F}\) is the trajectory of the jth feature point across the F frames. The actual task of motion segmentation is to separate these trajectory vectors \(\mathbf{z}_1,\mathbf{z}_2,\ldots,\mathbf{z}_N\) into K clusters representing the K underlying motions.
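
As an illustration of the data structure, the following sketch stacks the tracked coordinates into the trajectory vectors described above; it assumes the coordinates are given as an array z of shape (F, N, 2) with z[i, j] the location of feature j in frame i (our own convention).

```python
import numpy as np

def trajectory_matrix(z):
    """Return the N x 2F matrix whose j-th row is z_j = [z_{1j}, z_{2j}, ..., z_{Fj}]."""
    F, N, _ = z.shape
    return z.transpose(1, 0, 2).reshape(N, 2 * F)   # concatenate the F frame coordinates per feature
```

The rows of this matrix are the vectors that the HLM algorithms cluster (with d=3, as explained next).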

It has been shown (Costeira and Kanade 1998) that under the affine camera model, the trajectory vectors corresponding to different moving objects and the background across the F image frames live in distinct affine subspaces of dimension at most three in \(\mathbb{R}^{2F}\). Following this theory, we implement our algorithm with d=3.

We compare our algorithm with the following ones: improved GPCA for motion segmentation (GPCA) (Vidal et al. 2008), K-flats (KF) (Ho et al. 2003) (implemented for linear subspaces), Local Linear Manifold Clustering (LLMC) (Goh and Vidal 2007), Local Subspace Analysis (LSA) (Yan and Pollefeys 2006), Multi Stage Learning (MSL) (Sugaya and Kanatani 2004), Spectral Curvature Clustering (SCC) (Chen and Lerman 2009) and SCC-MS (see description earlier), Sparse Subspace Clustering (SSC) (Elhamifar and Vidal 2009), and RANSAC for HLM (Yang et al. 2006).

For GPCA (improved for motion segmentation), LLMC, LSA, MSL and RANSAC (for HLM), we copy the results from http://www.vision.jhu.edu/data/hopkins155 (they are based on experiments reported in Tron and Vidal (2007) and Goh and Vidal (2007)). For SCC, SCC-MS, SSC-N (SSC-B is not reported since it did not perform as well as SSC-N), LBF, LBF-MS, SLBF and SLBF-MS, we perform the experiments ourselves and record the mean and median misclassification rates of each algorithm for each fixed K (two-motions or three-motions) and for the different types of motion (“checker”, “traffic” and “articulated”). Each experiment (testing the latter set of algorithms) was repeated 500 times. The average misclassification rates, standard deviations and running times are recorded in Tables 2, 3 and 4 and demonstrated in Fig. 1.

Fig. 1 The misclassification rates of some algorithms on the Hopkins 155 database. The y-axis represents the percentage of data sets that have misclassification rates (under the corresponding algorithms) lower than the value on the x-axis

Fig. 2 Frames of the traffic, articulated and checker (from left to right) videos in the Hopkins 155 database

Our misclassification rates for SCC differ from those in Chen and Lerman (2009) and Lauer and Schnorr (2009), and our misclassification rates for SSC differ from those in Elhamifar and Vidal (2009) (the differences between our results and theirs are more than twice the standard deviations of the misclassification rates obtained here). This can be explained by possible evolutions of the codes since then (at least for SSC). We remark though that the misclassification rates of SCC-MS here are even slightly better than the misclassification rates of SCC in Chen and Lerman (2009).

From Table 2 and Fig. 1 we can see that our algorithms work well for the Hopkins database. Of all the methods tested, SLBF-MS and SSC-N are the most accurate. Besides SLBF/SLBF-MS and SSC-N, only SCC-MS is better than LBF-MS. However, from Table 4, LBF-MS ran more than 100 times faster than SSC-N, and SLBF-MS is also more than 10 times faster than SSC. In many of the cases, the \(\ell_1\) energy (as well as the \(\ell_2\) energy) was lower for the labels obtained by LBF than for the true labels. We thus suspect that the reason SLBF/SLBF-MS works better than LBF/LBF-MS is that good clustering of the Hopkins data requires an additional type of information (e.g., spectral information) to be combined with subspace clustering (i.e., hybrid linear modeling).

Table 2 The mean and median percentage of misclassified points for two-motions and three-motions in Hopkins 155 database

By adapting the parameters of SLBF-MS (or alternatively, SLBF, LBF, LBF-MS), we can further improve its misclassification rates on Hopkins 155 (e.g., a total of 0.81 % for two-motions and 2.12 % for three-motions by SLBF-MS). However, we have fixed all parameters in advance and insisted on using the same parameters on all four kinds of data (see the third paragraph of Sect. 3).

From Table 3 we can see that SLBF-MS, SLBF and SSC-N have negligible randomness. Indeed, their only randomness comes from the K-means step, but this randomness is effectively reduced by the restarting strategy. LBF and LBF-MS are more random, but still have standard deviations comparable with other good algorithms on the Hopkins 155 database such as SCC/SCC-MS.

Table 3 The standard deviation of the mean and median percentage of misclassified points for two-motions and three-motions in the Hopkins 155 database
Table 4 Average total computation times for all 155 sequences

3.3 Clustering Results on the Extended Yale Face Database B

We test LBF, LBF-MS, SLBF and SLBF-MS and compare them with ALC, K-flats, and SSC on the extended Yale face database B (Lee et al. 2005), which is available at http://vision.ucsd.edu/leekc/ExtYaleDatabase/ExtYaleB.html. We will see that this data set exposes a failure mode of our algorithms, and we will show how to engineer a workaround.

We use subsets of the extended Yale face database B consisting of face images of K=2,3,…,10 persons under 64 varying lighting conditions. Our objective is to cluster these images according to the persons. In the implementation, for any fixed K we repeat each algorithm on 100 randomly chosen subsets of K persons. The HLM model is applicable to this database, because the images of a face under variable lighting lie in a three-dimensional linear subspace if shadow is not considered (Lee et al. 2005), and in a nine-dimensional subspace with shadow considered (Basri and Jacobs 2003). In our experiments, we found that the images of a person in this database lie roughly in a 5-dimensional subspace, and therefore we first reduce the dimension of the data to 5K (recall that K is the number of persons). We do not include the GPCA algorithm since it is slow and does not work well on this database. We also do not include SCC and RANSAC since the code provided returns errors for some examples. The setting of ALC (voting with K) follows Rao et al. (2010, Sect. 2.2) exactly: it chooses ε from 101 values in the range \(10^{-5}\)–\(10^{3}\) (see the code in http://perception.csl.uiuc.edu/coding/motion/#Software).

We can see from the first row of Table 5 that LBF does a poor job discriminating the linear clusters in this data set. The failure occurs because of a combination of two factors: the first is the relatively sparse sampling of the data, with only 64 points per 5-dimensional cluster, and the second is the relative nearness of the underlying subspaces to each other. In particular, almost any neighborhood of any given point (even a very small neighborhood) has points from the other affine clusters, and consequently there is no “optimal” scale. For example, in the 128 face images from persons 1 and 2, more than a fifth of the points are closer to the subspace spanned by the first 5 principal components of the points in the other cluster than to their second nearest neighbors, and more than two thirds of the points are closer to the other subspace than to their 4th nearest neighbors. In some sense, this is a single 5-dimensional set, rather than two 5-dimensional sets. For example, the average distance of a point to the 5-dimensional best-fit subspace of the points in the same cluster is \(2.7\times 10^{3}\), and the average distance to the 5-dimensional best-fit subspace of the whole data set is \(3.3\times 10^{3}\), whereas the average norm of a point in the data set is \(1.1\times 10^{4}\). Thus one loses little in terms of relative fitting error by considering the set as spanned by a single subspace.

Table 5 Mean percentage of misclassified points and mean running time on clustering the extended Yale face database
Table 6 The standard deviation (%) of the mean percentage of misclassified points on the extended Yale face database

However, most points are actually closer to the subspace spanned by the same face than to the subspace spanned by the other face, if only by a little, and a global method such as SSC is still able to find and discriminate between the two affine clusters. The problem of data having large variance in directions irrelevant to a classification task is not unusual. A standard method of dealing with this situation is to “whiten” the data, i.e., to reduce the largest singular values. A very crude whitening is obtained by simply removing the first few principal components. If we exclude the first two principal components after reducing the dimension to 5K for the LBF/SLBF algorithms, we see in Table 5 that the results are greatly improved and become competitive. With more sophisticated whitening, the results can be further improved.
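
As an illustration, here is a minimal sketch of this crude whitening in Python/NumPy; the function and its defaults are ours, while the reduction to 5K dimensions and the removal of two components follow the text.

```python
import numpy as np

def reduce_and_whiten(X, K, n_keep=5, n_drop=2):
    """X: N x D matrix of vectorized face images; K: number of persons.
    Project onto the top n_keep*K principal components, then drop the first n_drop of them."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    proj = Xc @ Vt[:n_keep * K].T      # PCA dimension reduction to 5K
    return proj[:, n_drop:]            # crude whitening: discard the largest principal directions
```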

3.4 Clustering Results on MNIST Data Set

Finally, we work on the MNIST data set (available at http://yann.lecun.com/exdb/mnist/). This data set consists of several thousand 28×28 images of the digits 0 through 9. We work on subsets of the data which contain 2 or 3 digits and choose 200 images for each digit at random. We apply PCA to reduce the dimension to D=5 for GPCA and to both D=10 and D=50 for the rest of the algorithms. The choice of both D=10 and D=50 provides richer testing opportunities; this is, however, unavailable for GPCA, which cannot handle D=50 and often gets stuck with D=10. We process the data the same way as in Sect. 3.3. We run each experiment 500 times, using d=3 and the correct number of clusters, and record the misclassification rates, the standard deviations and the running times in Tables 7, 8, 9 and 10.

Table 7 Mean percentage of misclassified points and mean running time on clustering MNIST data set (D=5 for GPCA, D=10 for other algorithms)
Table 8 Mean percentage of misclassified points and mean running time on clustering MNIST data set (D=50)
Table 9 The standard deviation of the mean percentage of misclassified points on clustering MNIST data set (D=5 for GPCA, D=10 for other algorithms)
Table 10 The standard deviation of the mean percentage of misclassified points on clustering MNIST data set (D=50)

From Tables 7 and 8, SLBF and SLBF-MS are the best algorithms among all the methods in terms of misclassification rates, although these misclassification rates are larger when K=3. SCC, SCC-MS, SSC, LBF and LBF-MS are also good algorithms for this data set. ALC is almost as good as SLBF and SLBF-MS when K=2, but it fails when K=3. LBF, LBF-MS and K-flats are the fastest algorithms on the MNIST data set.

3.5 Automatic Determination of the Number of Flats

We explain how to use the elbow method to determine the number of affine clusters in any HLM algorithm, in particular LBF and SLBF. Fixing an arbitrary HLM algorithm with the correct number of clusters K as input, let \(F_j\), j=1,…,K, be the K flats returned by that algorithm and let \(W_K\) be the sum of squared distances of all data points to the flats (among these K flats) corresponding to their clusters. That is,

$$ W_K = \sum_{j=1}^K \sum_{{\mathbf{x}}\in C_j} {\rm dist}^2({\mathbf{x}}, F_j). $$
(12)

We note that \(W_K\) decreases as K increases.

A classical method for determining the number of clusters is to find the “elbow”, or the K past which adding more clusters does not significantly decrease the error. We search for the elbow by finding the maximum of the Second Order Difference (SOD) of the logarithm of \(W_K\) (Yue et al. 2008):

$$ \mathrm{SOD}(\ln W_K) = \ln W_{K-1} + \ln W_{K+1} - 2\ln W_K. $$
(13)

The optimal K is thus found by

$$ K_{\mathrm{SOD}} = \arg \max_K \mathrm{SOD}(\ln W_K), $$
(14)

where \(K=2,\ldots,K_{\max}\).
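
A minimal sketch of this criterion (Eqs. (12)–(14)) in Python/NumPy; the clustering routine cluster_flats, which returns labels and fitted flats as (centroid, basis) pairs, is a stand-in for LBF, SLBF or any other HLM algorithm, and all names are ours.

```python
import numpy as np

def W_K(X, labels, flats):
    """Eq. (12): sum of squared distances of the points to the flats of their clusters."""
    total = 0.0
    for j, (center, basis) in enumerate(flats):
        Y = X[labels == j] - center
        total += np.sum(np.linalg.norm(Y - (Y @ basis.T) @ basis, axis=1) ** 2)
    return total

def choose_K(X, d, K_max, cluster_flats):
    """Return K_SOD of Eq. (14), the maximizer of the second-order difference of ln W_K."""
    lnW = {}
    for K in range(1, K_max + 2):                       # W_{K-1} and W_{K+1} are needed at the ends
        labels, flats = cluster_flats(X, K, d)
        lnW[K] = np.log(W_K(X, labels, flats))
    sod = {K: lnW[K - 1] + lnW[K + 1] - 2 * lnW[K] for K in range(2, K_max + 1)}
    return max(sod, key=sod.get)
```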

In the following sections, we compare SOD (LBF), i.e., SOD applied with LBF, SOD (LBF-MS), SOD (SLBF), SOD (SLBF-MS), SOD (SCC), SOD (SCC-MS) and SOD (SSC) with ALC (Ma et al. 2007) and part of GPCA (Vidal et al. 2005) on a number of artificial data sets and real data sets. These experiments run on a machine with an Intel Core 2 Quad Q6600 at 2.4 GHz and 8 GB memory.

3.5.1 Finding the Number of Clusters on Artificial Data

We test SOD with LBF and SLBF on artificial data (where the number of clusters is not provided to the user) and compare them with some other methods (three variations of ALC, the number-of-clusters estimate of GPCA, and SOD with SSC and SCC). The artificial data sets are generated by the Matlab code borrowed from the GPCA (Vidal et al. 2005) package on http://perception.csl.uiuc.edu/gpca. For each subspace, 100d initial data points are uniformly sampled in a unit cube in this subspace centered around the origin and then corrupted with Gaussian noise in \(\mathbb{R}^{D}\) of standard deviation 0.05. For the last four experiments, we restrict the angle between subspaces to be at least π/8 for separation. The dimension d is given and we let \(K_{\max}=10\) in SOD.

In ALC (voting), we try 101 different values from \(10^{-5}\) to \(10^{3}\) for ϵ (as in Rao et al. 2010) and choose the estimated K by majority. In ALC (ϵ from LBF), we choose the average noise in the neighborhoods obtained by the local best-fit heuristic as the distortion rate ϵ. In ALC (oracle), we input the true noise level (ϵ=0.05) as the distortion rate. For GPCA, we use the original idea of Vidal et al. (2005, Eqs. (26), (28)) to find the number of clusters (see our implementation in the supplemental webpage). We project the data onto a (d+1)-dimensional subspace by PCA and let the tolerance of rank detection be 0.05 (chosen by trying different values and picking the one obtaining the lowest error). This algorithm is independent of the other parts of the GPCA algorithm and is thus extremely fast and can operate in high ambient dimensions. We also tried the ideas of Ma et al. (2008, Eqs. (3.28), (3.29)) (for the same given dimension d), while applying them to several HLM algorithms (even though they were originally presented for GPCA). Nevertheless, they did not work well and we thus do not report them. Each experiment is repeated 100 times (except for SOD (SSC), which is repeated 10 times due to its low speed) and the error rates of finding the number of clusters K and the computation times (in seconds) are recorded in Table 11.

Table 11 The mean percentage of incorrectness (e %) for finding the number of clusters K and the computation time in seconds t(s) on artificial data

As shown in Table 11, ALC (oracle) and ALC (ϵ from LBF) work best for low dimensions (d=1,2,3), but in real problems this choice of ϵ (the noise level) is usually unknown. The local best-fit flat heuristic provides a good estimate of the distortion rate and helps ALC reduce its running time. ALC (voting) is not as good as SOD (LBF) for artificial data. All options of ALC suffer from high computational complexity, especially ALC (voting). SOD (LBF) and SOD (LBF-MS) produce reasonable outputs and have a clear advantage in computing time. GPCA is very fast, but does not work well.

3.5.2 Finding the Number of Clusters on the Extended Yale Face Database B

We use the extended Yale face database B as in Sect. 3.3 for testing the above algorithms for detecting the number of clusters. The ambient dimension is reduced to D=5K by PCA for all of the methods and the intrinsic dimension of the subspaces is set as d=5 (see Sect. 3.3). For SOD with the different clustering algorithms, we let \(K_{\max}\)=6, 8, 8, 10 and 10, respectively, for 2 to 6 clusters. For GPCA, we set the tolerance to 0.05, which does not affect the performance in this experiment. Each experiment is repeated 500 times (except for SOD (SSC), which is repeated 10 times due to its low speed). Following Sect. 3.3, we apply LBF, LBF-MS, SLBF and SLBF-MS with whitening. The error rates of finding the correct number of clusters and the computation times are recorded in Table 12.

Table 12 The mean percentage of incorrectness (e %) for finding the correct number of clusters K and the computation time in seconds t(s) on the extended Yale face database

We see from Table 12 that SOD performs well only with SSC and only for K smaller than 4. We note that this is due to the difficulty of the database. Indeed, for a simpler database such as the Yale Face database B (Georghiades et al. 2001) of uncropped faces, SOD (SLBF), SOD (SLBF-MS), ALC (ϵ from LBF) and ALC (voting) achieve perfect detection for K≤10 (whitening is not applied then).

3.5.3 Finding the Number of Clusters on MNIST Data Set

We preprocess the MNIST data set exactly the same way as in Sect. 3.4. The ambient dimension is reduced to both D=10 and D=50 by PCA for all of the methods including GPCA, and d=3 is given as the intrinsic subspace dimension. For SOD with the different clustering algorithms, we let \(K_{\max}\)=6 and 8, respectively, for 2 and 3 clusters. For GPCA, we set the tolerance to 0.05, which does not affect the performance in this experiment. Each experiment is repeated 500 times (except for SOD (SSC), which is repeated 10 times due to its low speed). The error rates of finding the correct number of clusters and the computation times are recorded in Tables 13 and 14.

Table 13 The mean percentage of incorrectness (e %) for finding the correct number of clusters K and the computation time in seconds t(s) on MNIST data set (D=10)
Table 14 The mean percentage of incorrectness (e %) for finding the correct number of clusters K and the computation time in seconds t(s) on MNIST data set (D=50)

For all the methods, determining the number K of clusters becomes very difficult when the true K is larger than 3. For true K≤3, we see from Table 13 that when we project the data onto a 10-dimensional space, ALC and GPCA fail in most cases, except for ALC (ϵ from LBF) on digits [3 6 8]. SOD (SLBF), SOD (SLBF-MS) and SOD (SSC) outperform all others, although they are not very efficient.

3.6 Initializing K-Flats with the Local Best-Fit Heuristic

Here we demonstrate that our choice of neighborhoods in Algorithm 1 can be used to get a more robust initialization of K-flats. We work with geometric farthest insertion. For a fixed neighborhood size, say of m neighbors, this goes as follows: we pick a random point \(\mathbf{x}_0\) and then find the best-fit flat \(F_0\) for the m-point neighborhood of \(\mathbf{x}_0\). Then we find the point \(\mathbf{x}_1\) in our data farthest from \(F_0\), find the best-fit flat \(F_1\) of the m-point neighborhood of \(\mathbf{x}_1\), and then choose the point \(\mathbf{x}_2\) farthest from \(F_0\) and \(F_1\) to continue. We stop when we have K flats; we use these as an initialization for K-flats.
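
A minimal sketch of geometric farthest insertion in Python/NumPy; the argument neighborhood(X, x, d) returns the indices of a local neighborhood of x (e.g., the adapted choice of Algorithm 1 or simply the m nearest neighbors), and all names are ours.

```python
import numpy as np

def best_fit_flat(points, d):
    center = points.mean(axis=0)
    _, _, Vt = np.linalg.svd(points - center, full_matrices=False)
    return center, Vt[:d]                                # centroid and top-d principal directions

def farthest_insertion(X, K, d, neighborhood, seed=None):
    """Return K local best-fit flats to be used as an initialization for K-flats."""
    rng = np.random.default_rng(seed)
    x = X[rng.integers(len(X))]                          # start from a random point
    flats = []
    for _ in range(K):
        flats.append(best_fit_flat(X[neighborhood(X, x, d)], d))
        dists = np.min(np.column_stack(                  # distance of each point to its nearest flat
            [np.linalg.norm((X - c) - ((X - c) @ B.T) @ B, axis=1) for c, B in flats]), axis=1)
        x = X[np.argmax(dists)]                          # next seed: the point farthest from all current flats
    return flats
```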

We work on three data sets. Data set #1 consists of 1500 points on three parallel 2-planes in \(\mathbb{R}^{3}\): 500 points are drawn from the unit square in the x,y-plane, then 500 more from the same square shifted by 0.2 in the z-direction, and then 500 more shifted by 0.4 in the z-direction. This data set is designed to favor the use of small neighborhoods. The next data set is three random flats with 15 % Gaussian noise and 5 % outliers, generated using the Matlab code from GPCA, as in Sect. 3.1. This data set is designed to favor large neighborhood choices. Finally, we work on a data set with 1500 points sampled from 3 planes in \(\mathbb{R}^{3}\) as in Fig. 3. The error rates of K-flats with farthest insertion initialization with fixed neighborhoods of size 10,20,…,160 are plotted against the error rates for farthest insertion with adapted neighborhoods (searched over the same range), averaged over 400 runs, in Fig. 4. Although our method did not always beat the best fixed neighborhood, it was quite close, and it was always significantly better than the wrong fixed neighborhood size. Both methods did significantly better than a random initialization.

Fig. 3 Color map of the neighborhood size obtained by the local best-fit flat heuristic. The color value represents the number of neighbors chosen at that point. Note that the algorithm chooses smaller neighborhoods for points closer to the intersection of the planes (Color figure online)

Fig. 4 Using our neighborhood choice to improve the initialization of K-flats: the first row is the visualization of the three data sets, and the second row shows the corresponding plots in which the vertical axis is accuracy and the horizontal axis is the fixed neighborhood size used in geometric farthest insertion for the initialization of K-flats. The red line is the result of using adapted neighborhoods. The data sets are #1, #2, and #3 as described in Sect. 3.6. Random initialization leads to misclassification rates of 0.4 or greater for all three data sets (Color figure online)

In Fig. 3 we plot the number of neighbors picked by our algorithm for each point of a realization of data set #3.

4 Conclusions and Future Work

We presented a very simple geometric method for HLM based on selecting a set of local best-fit flats. The size of the local neighborhoods is determined automatically using the \(\beta_2\) numbers; it is proven under certain geometric conditions that our method approximately finds the optimal local neighborhoods. We give extensive experimental evidence demonstrating the state-of-the-art accuracy and speed of the algorithms on synthetic and real hybrid linear data.

We believe that one promising next step is to adapt the method for multi-manifold clustering. As it is, our method, while quite good at unions of flats, cannot successfully handle unions of curved manifolds. We expect that by gluing together groups of local best-fit flats related by some smoothness conditions, we will be able to approach the problem of clustering data which lies on unions of smooth manifolds.

We also believe that it will be possible to provide a theoretical framework for performance guarantees with noise for LBF and SLBF. Specifically, we hope to prove a quantitative form of the following alternative: suppose the data lies on the union of d-dimensional affine sets, perhaps with additive noise and outliers. Then either

  1. Most points are roughly as close to an affine set they don’t belong to as they are to their nearest O(d) neighbors;

  2. A large fraction of the points have optimal neighborhoods contained in only one of the affine clusters, the principal components of these neighborhoods are good approximations to the clusters, and LBF and SLBF recover good approximations to the two affine clusters; or

  3. The data looks locally lower than d-dimensional, even though each cluster is globally d-dimensional, and has high curvature; in this case, there are pure optimal neighborhoods, but the local estimation does not accurately represent the affine clusters.