A New Approach to Two-View Motion Segmentation Using Global Dimension Minimization

Abstract

We present a new approach to rigid-body motion segmentation from two views. We use a previously developed nonlinear embedding of two-view point correspondences into a 9-dimensional space and identify the different motions by segmenting lower-dimensional subspaces. In order to overcome nonuniform distributions along the subspaces, whose dimensions are unknown, we suggest the novel concept of global dimension and, with some theoretical motivation, its minimization for clustering subspaces. We propose a fast projected gradient algorithm for minimizing global dimension and thus segmenting motions from two views. We develop an outlier detection framework around the proposed method, and we present state-of-the-art results on outlier-free and outlier-corrupted two-view data for motion segmentation.

Notes

  1. A surface \(S\) is ruled if through every point of \(S\) there exists a straight line that lies on \(S\).

  2. In a noiseless case, this would return the dimension of the linear span of the set of vectors.

  3. “Effective rank” is sometimes defined differently. See Roy and Vetterli (2007).

  4. A measure is spherically symmetric within a \(d\)-subspace if it is supported on this subspace and invariant to rotations within this subspace.

  5. A measure is non-degenerate on a subspace if it does not concentrate mass on any proper subspace. In our setting the measure is also assumed to be spherically symmetric, and the non-degeneracy assumption is equivalent to assuming that the measure does not concentrate at the origin.

  6. This is assuming that we will not require more iterations to get close enough to the minimum that we can apply thresholding. In our experiments the number of needed iterations does not appear to grow with \(N\), but we do not have any results to guarantee this.

  7. We could skip this step and segment directly from the fuzzy assignment that we already have. Refining the membership matrix after removing the outliers is done to repair whatever damage the outliers may have done to the membership matrix before thresholding.

  8. “True inliers” are points that are inliers according to ground truth.

  9. \(1_{\varvec{\sigma }}\) has a \(1\) in each coordinate where \(\varvec{\sigma }\) has a non-zero element, and \(0\)’s in all other coordinates.

References

  • Aldroubi, A. (2013). A review of subspace segmentation: Problem, nonlinear approximations, and applications to motion segmentation. ISRN Signal Processing, 2013, 1–13. doi:10.1155/2013/417492.

  • Arias-Castro, E., Chen, G., & Lerman, G. (2011). Spectral clustering based on local linear approximations. Electronic Journal of Statistics, 5, 1537–1587.

  • Arias-Castro, E., Lerman, G., & Zhang, T. (2013). Spectral clustering based on local PCA. ArXiv e-prints.

  • Baker, S., & Matthews, I. (2004). Lucas-Kanade 20 years on: A unifying framework. International Journal of Computer Vision, 56(1), 221–255.

  • Barbará, D., & Chen, P. (2000). Using the fractal dimension to cluster datasets. In KDD (pp. 260–264).

  • Bertsekas, D. (1995). Nonlinear programming. Optimization and neural computation series. Belmont, MA: Athena Scientific.

  • Bhatia, R. (1997). Matrix analysis. Graduate texts in mathematics series. New York: Springer.

  • Boult, T. E., & Brown, L. G. (1991). Factorization-based segmentation of motions. In Proceedings of the IEEE workshop on visual motion (pp. 179–186).

  • Bradley, P., & Mangasarian, O. (2000). k-Plane clustering. Journal of Global Optimization, 16(1), 23–32.

  • Chen, G., Atev, S., & Lerman, G. (2009). Kernel spectral curvature clustering (KSCC). In IEEE 12th international conference on computer vision, (ICCV workshops), Kyoto (pp. 765–772). doi:10.1109/ICCVW.2009.5457627.

  • Chen, G., & Lerman, G. (2009a). Foundations of a multi-way spectral clustering framework for hybrid linear modeling. Foundations of Computational Mathematics, 9(5), 517–558. doi:10.1007/s10208-009-9043-7.

  • Chen, G., & Lerman, G. (2009b). Spectral curvature clustering (SCC). International Journal of Computer Vision, 81(3), 317–330.

  • Chen, G., & Maggioni, M. (2011). Multiscale geometric and spectral analysis of plane arrangements. In IEEE conference on computer vision and pattern recognition (CVPR).

  • Costeira, J., & Kanade, T. (1998). A multibody factorization method for independently moving objects. International Journal of Computer Vision, 29(3), 159–179.

  • Elhamifar, E., & Vidal, R. (2009). Sparse subspace clustering. In Proceedings of the 2009 IEEE computer society conference on computer vision and pattern recognition (CVPR 09) (pp. 2790–2797).

  • Elhamifar, E., & Vidal, R. (2013). Sparse subspace clustering: Algorithm, theory, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35, 2765–2781.

  • Feng, X., & Perona, P. (1998). Scene segmentation from 3d motion. In Proceedings of the IEEE computer society conference on computer vision and pattern recognition (pp. 225–231). doi:10.1109/CVPR.1998.698613.

  • Gionis, A., Hinneburg, A., Papadimitriou, S., & Tsaparas, P. (2005). Dimension induced clustering. In KDD (pp. 51–60).

  • Grafakos, L. (2004). Classical and modern Fourier analysis. London: Pearson/Prentice Hall.

  • Haro, G., Randall, G., & Sapiro, G. (2006). Stratification learning: Detecting mixed density and dimensionality in high dimensional point clouds. In Neural information processing systems.

  • Haro, G., Randall, G., & Sapiro, G. (2008). Translated poisson mixture model for stratification learning. International Journal of Computer Vision, 80(3), 358–374.

  • Hartley, R. I., & Zisserman, A. (2000). Multiple view geometry in computer vision. Cambridge: Cambridge University Press. ISBN:0521623049.

  • Ho, J., Yang, M., Lim, J., Lee, K., & Kriegman, D. (2003). Clustering appearances of objects under varying illumination conditions. In Proceedings of international conference on computer vision and pattern recognition (vol. 1, pp. 11–18).

  • Kanatani, K. (2001). Motion segmentation by subspace separation and model selection. In Proceedings of 8th ICCV, Vancouver (vol. 3, pp. 586–591)

  • Kanatani, K. (2002). Evaluation and selection of models for motion segmentation. In 7th ECCV (vol. 3, pp. 335–349).

  • Lerman, G., & Zhang, T. (2011). Robust recovery of multiple subspaces by geometric \({{l_p}}\) minimization. Annals of Statistics, 39(5), 2686–2715. doi: 10.1214/11-AOS914.

  • Levina, E., & Bickel, P. J. (2005). Maximum likelihood estimation of intrinsic dimension. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems 17 (pp. 777–784). Cambridge, MA: MIT Press.

  • Liu, G., Lin, Z., & Yu, Y. (2010). Robust subspace segmentation by low-rank representation. In ICML.

  • Liu, G., Lin, Z., Yan, S., Sun, J., Yu, Y., & Ma, Y. (2013). Robust recovery of subspace structures by low-rank representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1), 171–184. doi:10.1109/TPAMI.2012.88.

  • Ma, Y. (2004). An invitation to 3-D vision: From images to geometric models. Interdisciplinary applied mathematics: Imaging, vision, and graphics. New York: Springer.

  • Ma, Y., Derksen, H., Hong, W., & Wright, J. (2007). Segmentation of multivariate mixed data via lossy coding and compression. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(9), 1546–1562.

  • Ma, Y., Yang, A. Y., Derksen, H., & Fossum, R. (2008). Estimation of subspace arrangements with applications in modeling and segmenting mixed data. SIAM Review, 50(3), 413–458.

  • Ozay, N., Sznaier, M., Lagoa, C., & Camps, O. (2010). GPCA with denoising: A moments-based convex approach. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3209–3216). doi:10.1109/CVPR.2010.5540075.

  • Papadopoulo, T., & Lourakis, M. I. A. (2000). Estimating the Jacobian of the singular value decomposition: Theory and applications. In Proceedings of the European conference on computer vision, ECCV 00 (pp. 554–570). New York: Springer.

  • Rao, S. R., Yang, A. Y., Sastry, S. S., & Ma, Y. (2010). Robust algebraic segmentation of mixed rigid-body and planar motions from two views. International Journal of Computer Vision, 88(3), 425–446. doi:10.1007/s11263-009-0314-1.

  • Roy, O., & Vetterli, M. (2007). The effective rank: A measure of effective dimensionality. In European signal processing conference (EUSIPCO) (pp. 606–610).

  • Soltanolkotabi, M., & Candès, E. J. (2012). A geometric analysis of subspace clustering with outliers. Annals of Statistics, 40(4), 2195–2238. doi:10.1214/12-AOS1034.

  • Soltanolkotabi, M., Elhamifar, E., & Candès, E. J. (2013). Robust subspace clustering. ArXiv e-prints.

  • Tipping, M., & Bishop, C. (1999). Mixtures of probabilistic principal component analysers. Neural Computation, 11(2), 443–482.

  • Torr, P. H. S. (1998). Geometric motion segmentation and model selection. Philosophical Transactions of the Royal Society of London A, 356, 1321–1340.

  • Tron, R., & Vidal, R. (2007). A benchmark for the comparison of 3-d motion segmentation algorithms. In IEEE conference on computer vision and pattern recognition, CVPR ’07 (pp. 1–8). doi:10.1109/CVPR.2007.382974.

  • Tseng, P. (2000). Nearest \(q\)-flat to \(m\) points. Journal of Optimization Theory and Applications, 105, 249–252. doi: 10.1023/A:1004678431677.

  • Vershynin, R. (2012). Introduction to the non-asymptotic analysis of random matrices. In Compressed sensing (pp. 210–268). Cambridge: Cambridge University Press.

  • Vidal, R. (2011). Subspace clustering. IEEE Signal Processing Magazine, 28(2), 52–68. doi:10.1109/MSP.2010.939739.

  • Vidal, R., Ma, Y., & Sastry, S. (2005). Generalized principal component analysis (GPCA). IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(12), 1945–1959.

  • Vidal, R., Ma, Y., Soatto, S., & Sastry, S. (2006). Two-view multibody structure from motion. International Journal of Computer Vision, 68(1), 7–25.

  • Yan, J., & Pollefeys, M. (2006). A general framework for motion segmentation: Independent, articulated, rigid, non-rigid, degenerate and nondegenerate. In ECCV (vol. 4, pp. 94–106).

  • Yang, A. Y., Rao, S. R., & Ma, Y. (2006). Robust statistical estimation and segmentation of multiple subspaces. In CVPRW ’06: Proceedings of the 2006 conference on computer vision and pattern recognition workshop (p. 99). Washington, DC: IEEE Computer Society. doi:10.1109/CVPRW.2006.178.

  • Zhang, T., Szlam, A., & Lerman, G. (2009). Median \(K\)-flats for hybrid linear modeling with many outliers. In IEEE 12th international conference on computer vision workshops (ICCV workshops), Kyoto (pp. 234–241). doi:10.1109/ICCVW.2009.5457695.

  • Zhang, T., Szlam, A., Wang, Y., & Lerman, G. (2010). Randomized hybrid linear modeling by local best-fit flats. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1927–1934). doi:10.1109/CVPR.2010.5539866.

  • Zhang, T., Szlam, A., Wang, Y., & Lerman, G. (2012). Hybrid linear modeling via local best-fit flats. International Journal of Computer Vision, 100, 217–240. doi:10.1007/s11263-012-0535-6.

Acknowledgments

This work was supported by NSF Grants DMS-09-15064 and DMS-09-56072. GL was partially supported by the IMA during its annual program on the mathematics of information (2011–2012), and BP benefited from participating in parts of this program and presented an initial version of this work at an IMA seminar in Spring 2012. We thank the anonymous reviewers for their thoughtful comments, Shankar Rao for sharing with us the RAS database, and Tom Lou for his helpful suggestions regarding our algorithm for minimizing global dimension. A very preliminary version of this work was submitted to CVPR 2012; we thank one of the anonymous reviewers for insightful comments that led us to modify the GDM algorithm and its theoretical support.

Author information

Corresponding author

Correspondence to Gilad Lerman.

Additional information

Supplementary webpage: http://math.umn.edu/~lerman/gdm.

Appendix

1.1 Proof of Theorem 1

We prove the four properties of the statement of the theorem. For simplicity we assume that \(D<N\). That is, the number of data points is greater than the dimension of the ambient space. This is the usual case in many applications.

Proof of Property 1

Clearly, scaling all data vectors by \(\alpha \ne 0\) results in scaling all the singular values of the corresponding data matrix by \(\alpha \). Furthermore, this results in scaling by \(\alpha \) both the numerator and denominator of the expression for the empirical dimension for any \(\epsilon >0\). Therefore, the empirical dimension is invariant to this scaling.

Proof of Property 2

The singular values of a matrix (in particular the data matrix) are invariant to any orthogonal transformation of this matrix, and thus the empirical dimension is invariant to such transformations.
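
To make Properties 1 and 2 concrete, here is a small numerical check (our own sketch, not code from the paper; the value \(\epsilon =0.4\) is an arbitrary illustrative choice):

```python
# Illustrative check (ours) of Properties 1 and 2: the empirical dimension is invariant to
# scaling the data by any alpha != 0 and to orthogonal transformations of the ambient space.
import numpy as np

def empirical_dim(A, eps=0.4):                     # eps = 0.4 is an arbitrary illustrative value
    s = np.linalg.svd(A, compute_uv=False)
    delta = eps / (1.0 - eps)
    return np.sum(s**eps)**(1.0 / eps) / np.sum(s**delta)**(1.0 / delta)

rng = np.random.default_rng(0)
A = rng.standard_normal((9, 200))                  # a D = 9 by N = 200 data matrix
Q, _ = np.linalg.qr(rng.standard_normal((9, 9)))   # a random orthogonal transformation
print(empirical_dim(A), empirical_dim(3.7 * A), empirical_dim(Q @ A))   # all three agree
```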

Proof of Property 3

If \(\{\varvec{v}_i\}_{i=1}^N\) are contained in a \(d\)-subspace, then, since these form the columns of \(\varvec{A}\), \(rank(\varvec{A}) \le d\). Since \(\varvec{U}\) and \(\varvec{V}\) are orthogonal, \(rank(\varvec{A})=rank(\varvec{\varSigma })\). In particular, \(\varvec{A}\) has at most \(d\) non-zero singular values. Let \(\varvec{\sigma }\) be the vector of singular values of \(\varvec{A}\), and let \(1_{\varvec{\sigma }}\) be the indicator vector of \(\varvec{\sigma }\) (see Note 9).

The generalized Hölder’s Inequality (Grafakos 2004 p. 10) states that if:

$$\begin{aligned} p_1,p_2 \in (0,\infty ] \;\; \text { and } \;\; {1 \over p_1} + {1 \over p_2} = {1 \over r} \end{aligned}$$
(13)

then

$$\begin{aligned} \Vert f_1 f_2 \Vert _r \le \Vert f_1 \Vert _{p_1} \Vert f_2 \Vert _{p_2} \; \text { for any functions } f_1 \text { and } f_2.\nonumber \\ \end{aligned}$$
(14)

To apply this result to vectors, we view them as functions over the set \(\{ 1,2,\ldots ,D \}\) with counting measure.

Let \(p_1=1\), \(p_2={\epsilon \over {1-\epsilon }}\), \(r=\epsilon \). Also let \(f_1=1_{\varvec{\sigma }}\), \(f_2=\varvec{\sigma }\). These values satisfy (13). We therefore get:

$$\begin{aligned} {{\Vert \varvec{\sigma }\Vert _{\epsilon }} \over {\Vert \varvec{\sigma }\Vert _{\epsilon \over {1-\epsilon }}}} \le \Vert 1_{\varvec{\sigma }} \Vert _1 = (\text {number of non-zero singular values of } \varvec{A}) \le d. \end{aligned}$$
(15)

Proof of Property 4

By hypothesis, the data vectors \(\{\varvec{v}_i\}_{i=1}^N\) are i.i.d. and sampled according to a probability measure \(\mu \), where \(\mu \) is sub-Gaussian, non-degenerate, and spherically symmetric in a \(d\)-subspace of \(\mathbb {R}^D\). We define the \(n\)th data matrix:

$$\begin{aligned} \varvec{A}_n = \left[ \begin{array}{ccccc} \uparrow &{}\quad \uparrow &{}\quad \uparrow &{} \quad &{}\quad \uparrow \\ \varvec{v}_1 &{} \quad \varvec{v}_2 &{} \quad \varvec{v}_3 &{} \quad \cdots &{} \quad \varvec{v}_n \\ \downarrow &{}\quad \downarrow &{}\quad \downarrow &{} \quad &{}\quad \downarrow \\ \end{array} \right] . \end{aligned}$$

Then \(\varvec{\varSigma }_n := ({1 \over n}) \varvec{A}_n \varvec{A}_n^T\) is the \(n\)th sample covariance matrix of our data set. Also, let \(\varvec{v}\) be a random variable with probability measure \(\mu \). Then \(\varvec{\varSigma }:= E[\varvec{v}\varvec{v}^T]\) is the covariance matrix of the distribution. A consequence of \(\mu \) being spherically symmetric in a \(d\)-subspace is that after an appropriate rotation of space, \(\varvec{\varSigma }\) is diagonal with a fixed constant in \(d\) of its diagonal entries and \(0\) in all other locations. We are trying to prove a result about empirical dimension, which is scale invariant and invariant under rotations of space. Because of these two properties we can assume that the appropriate rotation and scaling has been done so that \(\varvec{\varSigma }\) is diagonal with value \(1\) in \(d\) diagonal entries and \(0\) in all others. Without any loss of generality, we assume that the first \(d\) diagonal entries are the non-zero ones.

Let \(\varvec{\sigma }_n = \left( \sigma _{n,1}, \sigma _{n,2},\ldots , \sigma _{n,D} \right) ^T\), \(n \ge D\), denote the vector of singular values of the matrix \(\varvec{A}_n\). Our first task will be to show that \({\varvec{\sigma }_n \over \sqrt{n}}\) converges in probability (as \(n \rightarrow \infty \)) to the vector:

$$\begin{aligned} ( \underbrace{ 1, 1,\ldots ,1}_d,0,\ldots ,0 )^T. \end{aligned}$$
(16)

To accomplish our task, we will first relate \(\varvec{\sigma }_n\) to the vector of singular values of \(\varvec{\varSigma }_n\), and then use a result showing that \(\varvec{\varSigma }_n\) converges to \(\varvec{\varSigma }\) as \(n \rightarrow \infty \).

It is clear that the vector of singular values of \(\varvec{\varSigma }_n\), which we will denote by \(\varvec{\psi }\), is given by:

$$\begin{aligned} \varvec{\psi }= {1 \over n} \left( \sigma _{n,1}^2, \sigma _{n,2}^2, \ldots , \sigma _{n,D}^2 \right) ^T. \end{aligned}$$
(17)

Next, we will need the following result regarding covariance estimation. This is Corollary 5.50 of Vershynin (2012), adapted to be consistent with our notation.

Lemma 1 (Covariance Estimation): Consider a sub-Gaussian distribution in \(\mathbb {R}^D\) with covariance matrix \(\varvec{\varSigma }\). Let \(\gamma \in (0,1)\), and \(t \ge 1\). If \(n > C(t/ \gamma )^2 D\), then with probability at least \(1-2e^{-t^2 D}\), \(\Vert \varvec{\varSigma }_n - \varvec{\varSigma }\Vert _2 \le \gamma \), where \(\Vert \cdot \Vert _2\) denotes the spectral norm (i.e., largest singular value of the matrix). The constant \(C\) depends only on the sub-Gaussian norm of the distribution.

In our problem, we are applying this lemma to the distribution \(\mu \). Let \(\gamma \in (0,1)\) be given. If

$$\begin{aligned} n > C(t/ \gamma )^2 D, \end{aligned}$$
(18)

then \(\Vert \varvec{\varSigma }_n-\varvec{\varSigma }\Vert _2 \le \gamma \) with probability at least \(1-2e^{-t^2 D}\). The \(2\)-norm of the difference of two matrices bounds the differences of their individual singular values. We will use the following result to make this precise:

Lemma 2 (Bhatia 1997): Let \(\sigma _i(\bullet )\) denote the \(i\)th largest singular value of an arbitrary \(m\)-by-\(n\) matrix. Then \(| \sigma _i(\mathbf {B}+\mathbf {E})-\sigma _i(\mathbf {B})| \le \Vert \mathbf {E} \Vert _2\) for each \(i\).
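
Lemma 2 is easy to sanity-check numerically; the snippet below (ours, purely illustrative) verifies the bound for a random matrix and a random perturbation:

```python
# Numerical check (ours) of Lemma 2: each singular value moves by at most the spectral norm
# of the perturbation.
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((9, 9))
E = 1e-3 * rng.standard_normal((9, 9))
sB = np.linalg.svd(B, compute_uv=False)
sBE = np.linalg.svd(B + E, compute_uv=False)
assert np.all(np.abs(sBE - sB) <= np.linalg.norm(E, 2) + 1e-12)   # Lemma 2 holds
```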

Because \(\varvec{\varSigma }\) is diagonal with only values \(1\) and \(0\) on the diagonal, the singular values of \(\varvec{\varSigma }\) are simply these diagonal values. We will use \(\mathbf {1}_{i \in 1:d}\) to denote the \(i\)’th singular value of \(\varvec{\varSigma }\).

Setting \(\mathbf {B} = \varvec{\varSigma }_n\) and \(\mathbf {E} = \varvec{\varSigma }- \varvec{\varSigma }_n\), in Lemma 2 we get: \(\Vert \varvec{\varSigma }_n-\varvec{\varSigma }\Vert _2 \le \gamma \Rightarrow |(1/n)\sigma _{n,i}^2 - \mathbf {1}_{i \in 1:d}| \le \Vert \varvec{\varSigma }_n-\varvec{\varSigma }\Vert _2 \le \gamma \), for each \(i\). This implies that:

$$\begin{aligned} {\sigma _{n,i} \over \sqrt{n}} \in \left\{ \begin{array}{ll} \left[ \sqrt{{1} - \gamma }, \sqrt{{1} + \gamma } \right] , &{}\quad \text{ if } i \le d; \\ \left[ 0, \sqrt{\gamma } \right] , &{} \quad \text{ if } i > d. \end{array} \right. \end{aligned}$$
(19)

Notice that as \(\gamma \rightarrow 0\), \({\sigma _{n,i} \over \sqrt{n}}\) approaches \(\mathbf {1}_{i \in 1:d}\). Specifically, for any desired tolerance, \(\eta >0\), and any desired certainty, \(\xi \), \(n\) can be chosen large enough that with probability greater than \(\xi \), \(\left| \mathbf {1}_{i \in 1:d}-{\sigma _{n,i} \over \sqrt{n}} \right| < \eta \), simultaneously for each \(i\). It follows from this that the vector \({\varvec{\sigma }_n \over \sqrt{n}}\) converges in probability to (16) as \(n \rightarrow \infty \).

Finally, \(\hat{d}_{\epsilon ,n} = {{\Vert \varvec{\sigma }_n \Vert _{\epsilon }} \over {\Vert \varvec{\sigma }_n \Vert _{\epsilon \over {1-\epsilon }}}} = {\left( 1 \over \sqrt{n}\right) {\Vert \varvec{\sigma }_n \Vert _{\epsilon }} \over {\left( 1 \over \sqrt{n}\right) \Vert \varvec{\sigma }_n \Vert _{\epsilon \over {1-\epsilon }}}} = {{\Vert {\varvec{\sigma }_n \over \sqrt{n}} \Vert _{\epsilon }} \over {\Vert {\varvec{\sigma }_n \over \sqrt{n}} \Vert _{\epsilon \over {1-\epsilon }}}}\). Thus, \(\hat{d}_{\epsilon ,n}\) is a continuous function of the vector \({\varvec{\sigma }_n \over \sqrt{n}}\). Hence, since \({\varvec{\sigma }_n \over \sqrt{n}}\) converges in probability to the vector in (16) as \(n \rightarrow \infty \), \(\hat{d}_{\epsilon ,n}\) converges in probability to

$$\begin{aligned}&\left( {{\Vert (1,1,\ldots ,1,0,\ldots ,0) \Vert _\epsilon } \over {\Vert (1,1,\ldots ,1,0,\ldots ,0) \Vert _{\left( {{\epsilon } \over {1-\epsilon }}\right) }}} \right) = {{d^{{1} \over {\epsilon }}} \over {d^{{1-\epsilon } \over {\epsilon }}}}\nonumber \\&\quad = d^{{{1} \over {\epsilon }} - {{1-\epsilon } \over {\epsilon }}} = d. \end{aligned}$$
(20)
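
Property 4 can also be observed empirically. In the sketch below (ours; \(\epsilon =0.4\) is again an arbitrary illustrative value), i.i.d. Gaussian samples supported on a random \(d\)-subspace of \(\mathbb {R}^D\), which form a sub-Gaussian, spherically symmetric and non-degenerate distribution, yield empirical dimensions that approach \(d\) as \(n\) grows:

```python
# Simulation (ours) of Property 4: the empirical dimension of n i.i.d. spherically symmetric
# samples from a d-subspace of R^D tends to d as n grows.
import numpy as np

def empirical_dim(A, eps=0.4):                     # illustrative eps
    s = np.linalg.svd(A, compute_uv=False)
    delta = eps / (1.0 - eps)
    return np.sum(s**eps)**(1.0 / eps) / np.sum(s**delta)**(1.0 / delta)

rng = np.random.default_rng(0)
D, d = 9, 3
B, _ = np.linalg.qr(rng.standard_normal((D, d)))   # orthonormal basis of a random d-subspace
for n in (50, 500, 5000):
    A_n = B @ rng.standard_normal((d, n))          # Gaussian samples supported on the subspace
    print(n, empirical_dim(A_n))                   # approaches d = 3
```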

1.2 Proof of Theorem 2

Recall that \(\varPi _{Nat}\) denotes the natural partition of the data set. First, we notice that \(GD(\varPi _{Nat}) = \Vert (d_1,d_2,\ldots ,d_K) \Vert _p\), where \(d_k\) is the true dimension of set \(k\) of the partition. Notice that \(d_k\) cannot exceed \(d\), since \(\mu _k\) is supported by \(L_k\), a \(d\)-subspace. Furthermore, since \(\mu _k\) does not concentrate mass on subspaces, it is a probability-0 event that all \(N_k\) points from \(L_k\) lie in a proper subspace of \(L_k\). Thus, for the natural partition, \(d_k\) is almost surely \(d\) for each \(k\). Hence, \(GD(\varPi _{Nat})\) is almost surely \(\Vert (d,d,\ldots ,d) \Vert _p = \left( K d^p\right) ^{1/p} = K^{1/p} d\).

Next, we will find a lower bound for the global dimension of any non-natural partition of the data, and show that if \(p\) meets the hypothesis criteria, the lower bound we get is greater than \(K^{1/p} d = GD(\varPi _{Nat})\). To accomplish this we need the following lemma.

Lemma 1

If \(\varPi \ne \varPi _{Nat}\) then \(\varPi \) almost surely has one set with dimension at least \(d+1\).

Before proving the lemma, observe that a consequence is that if \(\varPi \ne \varPi _{Nat}\), then with probability 1:

$$\begin{aligned} GD(\varPi ) \ge \Vert (?,\ldots ,?,d+1,?,\ldots ,?) \Vert _p \ge d+1. \end{aligned}$$
(21)

Then, from our hypothesis:

$$\begin{aligned} p >&\ln (K)/(\ln (d+1)-\ln (d)) \nonumber \\ \Longrightarrow&\; \left( {{d+1} \over d} \right) ^p > K \nonumber \\ \Longrightarrow&\; d+1 > K^{1/p} d. \end{aligned}$$
(22)

Hence,

$$\begin{aligned} GD(\varPi ) \ge d+1 > K^{1/p} d = GD(\varPi _{Nat}). \end{aligned}$$
(23)
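
As a concrete numerical illustration of this step (our numbers, not from the paper): for \(K=2\) clusters of \(d=3\)-dimensional subspaces, the hypothesis requires \(p > \ln (2)/(\ln (4)-\ln (3)) \approx 2.41\); taking \(p=3\) gives \(\left( {4 \over 3} \right) ^3 = {64 \over 27} > 2\), and hence \(d+1 = 4 > 2^{1/3} \cdot 3 \approx 3.78 = K^{1/p} d\).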

Thus, if we show Lemma 1, the proof of the theorem follows. To prove Lemma 1 we require a simpler lemma:

Lemma 2

If a set \(Q\) in \(\varPi \) has fewer than \(d\) points from a subspace \(L_i\), then either \(Q\) has dimension at least \(d+1\) or adding another point from \(L_i\) to \(Q\) (an R.V. \(X\) with probability measure \(\mu _i\), independent from all other samples) will almost surely increase the dimension of \(Q\) by 1.

Proof

If \({{\mathrm{dim}}}(Q) \le d\) then \(Q\) has dimension strictly less than that of the ambient space (\(\mathbb {R}^D\)). Observe that \({{\mathrm{span}}}(Q)\) is a linear subspace of \(\mathbb {R}^D\), which a.s. does not contain \(L_i\). We cannot have proper containment since \({{\mathrm{dim}}}(L_i)=d \ge {{\mathrm{dim}}}(Q)\). Also, we have fewer than \(d\) points from \(L_i\) in \(Q\), and each other point in \(Q\) lies in \(L_i\) with probability 0 (none of the \(\mu _i\) concentrates mass on subspaces). Thus, \({{\mathrm{span}}}(Q)\) a.s. does not equal \(L_i\).

Therefore, if we intersect \(L_i\) with \({{\mathrm{span}}}(Q)\) we get a proper subspace of \(L_i\); call it \(\bar{L}\). We note that \(\mu _i(\bar{L})=0\) since \(\mu _i\) does not concentrate on subspaces. Thus, since \(X\) has probability measure \(\mu _i\), \(X\) a.s. lies outside the intersection of \(L_i\) and \({{\mathrm{span}}}(Q)\). It follows that if we add \(X\) to \(Q\), the dimension of \(Q\) a.s. increases by 1. \(\square \)

Now we prove Lemma 1. We will assume all sets in \(\varPi \) have dimension less than \(d+1\) and pursue a contradiction. By hypothesis, our set \(\{\varvec{v}_n\}_{n=1}^N\) contains at least \(d+1\) points from each subspace \(L_i\). Since \(\varPi \ne \varPi _{Nat}\), there is some subspace \(L^*\) whose points are assigned to \(2\) or more distinct sets in \(\varPi \). Let \(\varvec{v}^*\) be a point from \(L^*\). Now, choose \(d\) points from each \(L_i\) and denote this collection of \(Kd\) points \(\{y_1,y_2,\ldots ,y_{Kd}\}\). When making this selection, ensure that \(\varvec{v}^*\) is not chosen and that, of the points selected from \(L^*\), not all of them are assigned to the same set in \(\varPi \) as \(\varvec{v}^*\). Notice that \(\varPi \) induces a partition on \(\{y_1,y_2,\ldots ,y_{Kd}\}\).

Select any point \(y_i\) and remove it from the set \(\{y_1,y_2,\ldots , y_{Kd}\}\). Since we are assuming that each set in \(\varPi \) has dimension less than \(d+1\), Lemma 2 implies that the set in \(\varPi \) to which \(y_i\) belongs will have its dimension decrease by \(1\). Now select another point \(y_j\) and remove it. Lemma 2 still applies and so the set to which \(y_j\) belonged will have its dimension decrease by 1. We can repeat this until all \(Kd\) points have been removed. Since each removal decreases the dimension of some set in \(\varPi \) by \(1\) it follows that before any removals the sum of the dimensions of all sets in \(\varPi \) was at least \(Kd\). Since each of the \(K\) sets in \(\varPi \) had dimension \(d\) or less, we conclude that in fact each set must have had dimension exactly \(d\).

Now, consider our set \(\{y_1,y_2,\ldots ,y_{Kd}\}\) and add in \(\varvec{v}^*\). By our choice of \(\varvec{v}^*\), Lemma 2 implies that its addition a.s. increases the dimension of its target set in \(\varPi \) by \(1\) (to \(d+1\)). Adding in all remaining points from \(\{\varvec{v}_n\}_{n=1}^N\) can only increase the dimensions of the sets in \(\varPi \). Thus, we almost surely have a set of dimension at least \(d+1\) in \(\varPi \), contradicting our hypothesis.\(\square \)
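
The conclusion of Theorem 2 can also be illustrated numerically. The sketch below (ours, not the authors' code; \(\epsilon =0.4\) and the sampling model are illustrative choices) draws points from \(K=2\) random \(3\)-subspaces of \(\mathbb {R}^9\), picks \(p\) according to the hypothesis, and compares the global dimension of the natural partition with that of a partition that mixes points across the subspaces:

```python
# Illustrative experiment (ours) for Theorem 2: with p satisfying the stated bound, the natural
# partition has smaller global dimension than a partition that mixes points across subspaces.
import numpy as np

def empirical_dim(A, eps=0.4):                          # illustrative eps
    s = np.linalg.svd(A, compute_uv=False)
    delta = eps / (1.0 - eps)
    return np.sum(s**eps)**(1.0 / eps) / np.sum(s**delta)**(1.0 / delta)

def global_dim(clusters, p):
    return np.linalg.norm([empirical_dim(A) for A in clusters], ord=p)

rng = np.random.default_rng(0)
D, d, K, N_k = 9, 3, 2, 100
bases = [np.linalg.qr(rng.standard_normal((D, d)))[0] for _ in range(K)]
data = [B @ rng.standard_normal((d, N_k)) for B in bases]        # N_k points on each subspace

p = np.floor(np.log(K) / (np.log(d + 1) - np.log(d))) + 1        # p > ln(K)/(ln(d+1)-ln(d))
natural = data                                                   # one cluster per subspace
mixed = [np.hstack([data[0][:, :50], data[1][:, :50]]),          # swap half of the points
         np.hstack([data[0][:, 50:], data[1][:, 50:]])]
print(global_dim(natural, p), global_dim(mixed, p))              # the natural partition wins
```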

1.3 Proof of Theorem 3

Recall that the soft partition is stored in a membership matrix \(\varvec{M}\). Specifically, the \((k,n)\)’th element of \(\varvec{M}\), denoted \(m_k^n\), holds the “probability” that vector \(\varvec{v}_n\) belongs to cluster \(k\). Thus, each column of \(\varvec{M}\) forms a probability vector.

Hence, global dimension is a real-valued function of the matrix \(\varvec{M}\). We will think of the membership matrix as being vectorized, so that the domain of optimization can be thought of as a subset of \(\mathbb {R}^{NK}\). However, we will not explicitly vectorize the membership matrix. Thus, when we talk about the gradient of global dimension, we are referring to another \(K\)-by-\(N\) matrix, where the \((k,n)\)’th element is the derivative of global dimension w.r.t. \(m_k^n\).

To differentiate global dimension we must be able to differentiate the singular values of a matrix w.r.t. each element of that matrix. A treatment of this is available in Papadopoulo and Lourakis (2000).

To begin, recall the definition of \(GD\):

$$\begin{aligned} GD = \left\| \begin{array}{c} \hat{d}_\epsilon ^1 \\ \hat{d}_\epsilon ^2 \\ \vdots \\ \hat{d}_\epsilon ^K \end{array} \right\| _p = \left( (\hat{d}_\epsilon ^1)^p + (\hat{d}_\epsilon ^2)^p + \ldots + (\hat{d}_\epsilon ^K)^p \right) ^{1/p}.\nonumber \\ \end{aligned}$$
(24)

We will denote the thin SVD (only \(D\) columns of \(\varvec{U}\) and \(\varvec{V}\) are used) of \(\varvec{A}_k\):

$$\begin{aligned} \varvec{A}_k = \varvec{U}_k \varvec{\varSigma }_k {\varvec{V}_k}^T. \end{aligned}$$
(25)

Also, we will let \(\sigma _j^i\) refer to the \((j,j)\)’th element of \(\varvec{\varSigma }_i\). Then, using the chain rule:

$$\begin{aligned} \frac{\partial GD}{\partial m_k^n} = \frac{\partial GD}{\partial \hat{d}_\epsilon ^1} \frac{\partial \hat{d}_\epsilon ^1}{\partial m_k^n} + \frac{\partial GD}{\partial \hat{d}_\epsilon ^2} \frac{\partial \hat{d}_\epsilon ^2}{\partial m_k^n} + \ldots + \frac{\partial GD}{\partial \hat{d}_\epsilon ^K} \frac{\partial \hat{d}_\epsilon ^K}{\partial m_k^n}.\nonumber \\ \end{aligned}$$
(26)

From (24) we can compute \(\frac{\partial GD}{\partial \hat{d}_\epsilon ^i}\) rather easily:

$$\begin{aligned} \frac{\partial GD}{\partial \hat{d}_\epsilon ^i} =&\frac{1}{p} \left( (\hat{d}_\epsilon ^1)^p + (\hat{d}_\epsilon ^2)^p + \ldots + (\hat{d}_\epsilon ^K)^p \right) ^{\frac{1}{p}-1} p (\hat{d}_\epsilon ^i)^{p-1}\nonumber \\ =&(\hat{d}_\epsilon ^i)^{p-1} \left( (\hat{d}_\epsilon ^1)^p + (\hat{d}_\epsilon ^2)^p + \ldots + (\hat{d}_\epsilon ^K)^p \right) ^{\frac{1}{p}-1}. \end{aligned}$$
(27)

Next, we expand the other components of (26):

$$\begin{aligned} \frac{\partial \hat{d}_\epsilon ^i}{\partial m_k^n} = \sum _{j=1}^D \frac{\partial \hat{d}_\epsilon ^i}{\partial \sigma _j^i} \frac{\partial \sigma _j^i}{\partial m_k^n}. \end{aligned}$$
(28)

We now use the definition of \(\hat{d}_\epsilon ^i\) to compute the first factor of each term (writing \(\delta := \epsilon /(1-\epsilon )\) for brevity) as follows:

$$\begin{aligned} \frac{\partial \hat{d}_\epsilon ^i}{\partial \sigma _j^i} =&{{\Vert \varvec{\sigma }_i \Vert _\delta {{\partial } \over {\partial \sigma _j^i}} \left( \left( (\sigma _1^i)^\epsilon + \ldots + (\sigma _D^i)^\epsilon \right) ^{1/\epsilon } \right) } \over {\Vert \varvec{\sigma }_i \Vert _\delta ^2}} \nonumber \\&- {{\Vert \varvec{\sigma }_i \Vert _\epsilon {{\partial } \over {\partial \sigma _j^i}} \left( \left( (\sigma _1^i)^\delta + \ldots + (\sigma _D^i)^\delta \right) ^{1/\delta } \right) } \over {\Vert \varvec{\sigma }_i \Vert _\delta ^2}} \nonumber \\ =&{{\Vert \varvec{\sigma }_i \Vert _\delta \left( (\sigma _1^i)^\epsilon + \ldots + (\sigma _D^i)^\epsilon \right) ^{{1-\epsilon } \over {\epsilon }} \left( \sigma _j^i \right) ^{\epsilon -1}} \over {\Vert \varvec{\sigma }_i \Vert _\delta ^2}} \nonumber \\&- {{\Vert \varvec{\sigma }_i \Vert _\epsilon \left( (\sigma _1^i)^\delta + \ldots + (\sigma _D^i)^\delta \right) ^{{1-\delta } \over {\delta }} \left( \sigma _j^i \right) ^{\delta -1}} \over {\Vert \varvec{\sigma }_i \Vert _\delta ^2}} \nonumber \\ =&\left( {{1} \over {\Vert \varvec{\sigma }_i \Vert _\delta ^2}} \right) \Vert \varvec{\sigma }_i \Vert _\delta \Vert \varvec{\sigma }_i \Vert _\epsilon ^{1-\epsilon } \left( \sigma _j^i \right) ^{\epsilon -1} \nonumber \\&- \left( {{1} \over {\Vert \varvec{\sigma }_i \Vert _\delta ^2}} \right) \Vert \varvec{\sigma }_i \Vert _\epsilon \Vert \varvec{\sigma }_i \Vert _\delta ^{1-\delta } \left( \sigma _j^i \right) ^{\delta -1} \nonumber \\ =&C_1^i \left( \sigma _j^i \right) ^{\epsilon -1} - C_2^i \left( \sigma _j^i \right) ^{\delta -1}, \end{aligned}$$
(29)

where

$$\begin{aligned} C_1^i \!=\! \left( {{\Vert \varvec{\sigma }_i \Vert _\epsilon ^{1-\epsilon } \Vert \varvec{\sigma }_i \Vert _\delta } \over {\Vert \varvec{\sigma }_i \Vert _\delta ^2}} \right) , \quad C_2^i \!=\! \left( {{\Vert \varvec{\sigma }_i \Vert _\epsilon \Vert \varvec{\sigma }_i \Vert _\delta ^{1-\delta }} \over {\Vert \varvec{\sigma }_i \Vert _\delta ^2}} \right) .\qquad \end{aligned}$$
(30)

Next, we must evaluate the second factor in each term of (28). Recall that \(\sigma _j^i\) is the \(j\)’th largest singular value of the matrix \(\varvec{A}_i\). To achieve the next step, we must observe that each singular value of \(\varvec{A}_i\) depends, in general, on each element of the matrix \(\varvec{A}_i\). We can then compute the derivative of each element of the matrix \(\varvec{A}_i\) w.r.t. each membership variable, \(m_k^n\). We will denote the \((\alpha ,\beta )\)’th element of the matrix \(\varvec{A}_i\) by \({\varvec{A}_i}_{(\alpha ,\beta )}\). Using the chain rule:

$$\begin{aligned} \frac{\partial \sigma _j^i}{\partial m_k^n} = \sum _{\beta = 1}^N \sum _{\alpha =1}^D \frac{\partial \sigma _j^i}{\partial {\varvec{A}_i}_{(\alpha ,\beta )}} \frac{\partial {\varvec{A}_i}_{(\alpha ,\beta )}}{\partial m_k^n}. \end{aligned}$$
(31)

A powerful result (Papadopoulo and Lourakis 2000, Eq. 7) allows us to express the partial derivative of each singular value, \(\sigma _j^i\), w.r.t. a given matrix element in terms of the already-known SVD of \(\varvec{A}_i\):

$$\begin{aligned} \frac{\partial \sigma _j^i}{\partial {\varvec{A}_i}_{(\alpha ,\beta )}} = \varvec{U}_{i(\alpha ,j)} \varvec{V}_{i(\beta ,j)}. \end{aligned}$$
(32)
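
Equation (32) is easy to check by finite differences. The snippet below (ours) perturbs a single entry of a random matrix and compares the resulting change in one singular value with the predicted rate; this assumes the singular value is simple, which holds almost surely for a random matrix:

```python
# Finite-difference check (ours) of Eq. (32): d(sigma_j)/d(A[alpha, beta]) = U[alpha, j] * V[beta, j].
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 8))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
alpha, beta, j, h = 2, 6, 1, 1e-6
A_pert = A.copy()
A_pert[alpha, beta] += h                          # perturb one matrix entry
s_pert = np.linalg.svd(A_pert, compute_uv=False)
fd = (s_pert[j] - s[j]) / h                       # finite-difference derivative
analytic = U[alpha, j] * Vt[j, beta]              # Eq. (32); note numpy returns V^T
print(fd, analytic)                               # agree up to O(h)
```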

The second factor in each term of (31) can be evaluated directly from the definition of \(\varvec{A}_k\):

$$\begin{aligned} \frac{\partial {\varvec{A}_i}_{(\alpha ,\beta )}}{\partial m_k^n} = \left\{ \begin{array}{ll} 0, &{}\quad \text { if } n \ne \beta \text { or if } i \ne k; \\ \varvec{v}_n \cdot \hat{\mathbf {e}_{\alpha }}, &{}\quad \text { if } n = \beta \text { and } i = k, \end{array}\right. \end{aligned}$$
(33)

where \(\hat{\mathbf {e}_{\alpha }}\) denotes the \(\alpha \)’th standard basis vector (1 in position \(\alpha \) and \(0\)’s everywhere else).

We are now in a position to work backwards and construct the partial derivative of \(GD\) w.r.t. \(m_k^n\). In what follows, \(\delta _{ik}\) is equal to \(1\) if \(i=k\) and is \(0\) otherwise (this is not to be confused with the un-subscripted \(\delta \), which is shorthand for \(\epsilon /(1-\epsilon )\)). Also, for notational convenience, we use Matlab notation to represent a row or column of a matrix (\(B_{(w,:)}\) and \(B_{(:,w)}\), respectively). We first compute \(\partial \sigma _j^i / \partial m_k^n \) as follows:

$$\begin{aligned} \frac{\partial \sigma _j^i}{\partial m_k^n}&= \sum _{\alpha =1}^D \frac{\partial \sigma _j^i}{\partial {\varvec{A}_i}_{(\alpha ,n)}} \frac{\partial {\varvec{A}_i}_{(\alpha ,n)}}{\partial m_k^n} \nonumber \\&= \sum _{\alpha =1}^D \varvec{U}_{i(\alpha ,j)} \varvec{V}_{i(n,j)} \left( \varvec{v}_n \cdot \hat{\mathbf {e}_{\alpha }} \right) \delta _{ik} \nonumber \\&= \left[ \begin{array}{c} \varvec{U}_{i(1,j)} \varvec{V}_{i(n,j)} \\ \varvec{U}_{i(2,j)} \varvec{V}_{i(n,j)} \\ \vdots \\ \varvec{U}_{i(D,j)} \varvec{V}_{i(n,j)} \end{array} \right] \cdot \varvec{v}_n \delta _{ik}\nonumber \\&= \varvec{V}_{i(n,j)} \left[ \begin{array}{c} \varvec{U}_{i(1,j)} \\ \varvec{U}_{i(2,j)} \\ \vdots \\ \varvec{U}_{i(D,j)} \end{array} \right] \cdot \varvec{v}_n \delta _{ik} \nonumber \\&= \varvec{V}_{i(n,j)} \left( \varvec{U}_{i(:,j)} \cdot \varvec{v}_n \right) \delta _{ik}. \end{aligned}$$
(34)

Then from (28), we get

$$\begin{aligned} \frac{\partial \hat{d}_\epsilon ^i}{\partial m_k^n}&= \sum _{j=1}^D \frac{\partial \hat{d}_\epsilon ^i}{\partial \sigma _j^i} \frac{\partial \sigma _j^i}{\partial m_k^n} = \sum _{j=1}^D \left( C_1^i \left( \sigma _j^i \right) ^{\epsilon -1}\right. \nonumber \\&\left. \quad - C_2^i \left( \sigma _j^i \right) ^{\delta -1} \right) \varvec{V}_{i(n,j)} \left( \varvec{U}_{i(:,j)} \cdot \varvec{v}_n \right) \delta _{ik}. \end{aligned}$$
(35)

Now we can write:

$$\begin{aligned}&\frac{\partial \hat{d}_\epsilon ^i}{\partial m_k^n} = C_1^i \left( \sum _{j=1}^D \left( \sigma _j^i \right) ^{\epsilon -1} \varvec{V}_{i(n,j)} \left( \varvec{U}_{i(:,j)} \cdot \varvec{v}_n \right) \delta _{ik} \right) \nonumber \\&\quad - C_2^i \left( \sum _{j=1}^D \left( \sigma _j^i \right) ^{\delta -1} \varvec{V}_{i(n,j)} \left( \varvec{U}_{i(:,j)} \cdot \varvec{v}_n \right) \delta _{ik} \right) . \end{aligned}$$
(36)

We now simplify the components of (36). After some manipulation, and using the notation

$$\begin{aligned} \left( \varvec{\varSigma }_i \right) ^{\epsilon -1} = \left[ \begin{array}{c c c} (\sigma _1^i)^{\epsilon -1} &{}\quad 0 &{}\quad 0 \\ 0 &{} \quad \ddots &{}\quad 0 \\ 0 &{}\quad 0 &{}\quad (\sigma _D^i)^{\epsilon -1} \end{array}\right] , \end{aligned}$$
(37)

we can write

$$\begin{aligned}&\sum _{j=1}^D \left( \sigma _j^i \right) ^{\epsilon -1} \varvec{V}_{i(n,j)} \left( \varvec{U}_{i(:,j)} \cdot \varvec{v}_n \right) \nonumber \\&\quad = \left[ \left( \sigma _1^i \right) ^{\epsilon -1} \varvec{V}_{i(n,1)}, \ldots , \left( \sigma _D^i \right) ^{\epsilon -1} \varvec{V}_{i(n,D)} \right] \left[ \begin{array}{c} \varvec{U}_{i(:,1)} \cdot \varvec{v}_n \\ \varvec{U}_{i(:,2)} \cdot \varvec{v}_n \\ \vdots \\ \varvec{U}_{i(:,D)} \cdot \varvec{v}_n \end{array}\right] \nonumber \\&\quad = \varvec{V}_{i(n,:)} \left( \varvec{\varSigma }_i \right) ^{\epsilon -1} \left( \varvec{U}_i \right) ^T \varvec{v}_n. \end{aligned}$$
(38)

Similarly, we can simplify part of the second term of (36):

$$\begin{aligned}&\sum _{j=1}^D \left( \sigma _j^i \right) ^{\delta -1} \varvec{V}_{i(n,j)} \left( \varvec{U}_{i(:,j)} \cdot \varvec{v}_n \right) \nonumber \\&\quad = \varvec{V}_{i(n,:)} \left( \varvec{\varSigma }_i \right) ^{\delta -1} \left( \varvec{U}_i \right) ^T \varvec{v}_n. \end{aligned}$$
(39)

Substituting (38) and (39) into (36) we get

$$\begin{aligned} \frac{\partial \hat{d}_\epsilon ^i}{\partial m_k^n} =&\left[ C_1^i \left( \varvec{V}_{i(n,:)} \left( \varvec{\varSigma }_i \right) ^{\epsilon -1} \left( \varvec{U}_i \right) ^T \varvec{v}_n \right) \right. \\&\left. -C_2^i \left( \varvec{V}_{i(n,:)} \left( \varvec{\varSigma }_i \right) ^{\delta -1} \left( \varvec{U}_i \right) ^T \varvec{v}_n \right) \right] \delta _{ik}. \end{aligned}$$

With this expression we are ready to evaluate (26) as follows:

$$\begin{aligned} {{\partial GD} \over {\partial m_k^n}}&= \sum _{i=1}^K {{\partial GD} \over {\partial \hat{d}_\epsilon ^i}} {{\partial \hat{d}_\epsilon ^i} \over {\partial m_k^n}}\nonumber \\&= \sum _{i=1}^K (\hat{d}_\epsilon ^i)^{p-1} \left( (\hat{d}_\epsilon ^1)^p + \ldots + (\hat{d}_\epsilon ^K)^p \right) ^{\frac{1}{p}-1} \delta _{ik} \cdot \nonumber \\&\; \left[ C_1^i \left( \varvec{V}_{i(n,:)} \left( \varvec{\varSigma }_i \right) ^{\epsilon -1} \left( \varvec{U}_i \right) ^T \varvec{v}_n \right) \right. \nonumber \\&\left. - C_2^i \left( \varvec{V}_{i(n,:)} \left( \varvec{\varSigma }_i \right) ^{\delta -1} \left( \varvec{U}_i \right) ^T \varvec{v}_n \right) \right] \nonumber \\&= (\hat{d}_\epsilon ^k)^{p-1} \left( (\hat{d}_\epsilon ^1)^p + \ldots + (\hat{d}_\epsilon ^K)^p \right) ^{\frac{1}{p}-1} \cdot \nonumber \\&\; \left[ C_1^k \varvec{V}_{k(n,:)} \left( \varvec{\varSigma }_k \right) ^{\epsilon -1} \left( \varvec{U}_k \right) ^T \varvec{v}_n\right. \nonumber \\&\left. - C_2^k \varvec{V}_{k(n,:)} \left( \varvec{\varSigma }_k \right) ^{\delta -1} \left( \varvec{U}_k \right) ^T \varvec{v}_n \right] \nonumber \\&= (\hat{d}_\epsilon ^k)^{p-1} \Vert \left( \hat{d}_\epsilon ^1,\ldots , \hat{d}_\epsilon ^K \right) \Vert _p^{1-p} \cdot \nonumber \\&\; \left[ \varvec{V}_{k(n,:)} \left( C_1^k \left( \varvec{\varSigma }_k \right) ^{\epsilon -1} - C_2^k \left( \varvec{\varSigma }_k \right) ^{\delta -1} \right) \left( \varvec{U}_k \right) ^T \varvec{v}_n \right] \nonumber \\&= (\hat{d}_\epsilon ^k)^{p-1} \Vert \left( \hat{d}_\epsilon ^1,\ldots , \hat{d}_\epsilon ^K \right) \Vert _p^{1-p} \varvec{V}_{k(n,:)} \varvec{D}_k \left( \varvec{U}_k \right) ^T \varvec{v}_n, \nonumber \\ \end{aligned}$$
(40)

where

$$\begin{aligned} \varvec{D}_k = \left( C_1^k \left( \varvec{\varSigma }_k \right) ^{\epsilon -1} - C_2^k \left( \varvec{\varSigma }_k \right) ^{\delta -1} \right) . \end{aligned}$$
(41)

We re-write (40) as follows:

$$\begin{aligned}&{{\partial GD} \over {\partial m_k^n}} = \varvec{V}_{k(n,:)} \nonumber \\&\quad \times \left( (\hat{d}_\epsilon ^k)^{p-1} \Vert \left( \hat{d}_\epsilon ^1, \ldots , \hat{d}_\epsilon ^K \right) \Vert _p^{1-p} \varvec{D}_k \left( \varvec{U}_k \right) ^T \right) \varvec{A}_{(:,n)}. \nonumber \\ \end{aligned}$$
(42)
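
For concreteness, the following sketch (our own illustrative implementation, not the authors' released code) assembles the gradient of \(GD\) from (30), (41) and (42). The weighted data matrices are formed as \(\varvec{A}_k = [m_k^1 \varvec{v}_1, \ldots , m_k^N \varvec{v}_N]\), matching the dependence used in (33); the values of \(\epsilon \) and \(p\) are left as arguments with arbitrary defaults, and a small floor on the singular values is added purely as a numerical safeguard:

```python
# Illustrative implementation (ours) of global dimension and its gradient, Eqs. (24), (30), (41), (42).
import numpy as np

def gd_and_gradient(A, M, eps=0.4, p=8.0):
    """A: D-by-N data matrix; M: K-by-N membership matrix (each column a probability vector).
    Returns GD(M) and the K-by-N matrix of partial derivatives dGD/dm_k^n."""
    D, N = A.shape
    K = M.shape[0]
    delta = eps / (1.0 - eps)
    d_hat, grad, svds = np.zeros(K), np.zeros((K, N)), []
    for k in range(K):
        A_k = A * M[k, :]                                   # column n is m_k^n * v_n, cf. Eq. (33)
        U, s, Vt = np.linalg.svd(A_k, full_matrices=False)  # thin SVD, as in Eq. (25)
        s = np.maximum(s, 1e-12)                            # numerical floor (our safeguard)
        svds.append((U, s, Vt))
        d_hat[k] = np.sum(s**eps)**(1.0 / eps) / np.sum(s**delta)**(1.0 / delta)  # empirical dim
    GD = np.sum(d_hat**p)**(1.0 / p)                        # Eq. (24)
    for k in range(K):
        U, s, Vt = svds[k]
        norm_e = np.sum(s**eps)**(1.0 / eps)                # ||sigma_k||_eps
        norm_d = np.sum(s**delta)**(1.0 / delta)            # ||sigma_k||_delta
        C1 = norm_e**(1.0 - eps) / norm_d                   # Eq. (30)
        C2 = norm_e * norm_d**(-1.0 - delta)
        D_k = C1 * s**(eps - 1.0) - C2 * s**(delta - 1.0)   # diagonal of D_k, Eq. (41)
        # Eq. (42): dGD/dm_k^n = d_hat_k^{p-1} ||d_hat||_p^{1-p} V_k(n,:) D_k U_k^T v_n
        grad[k, :] = d_hat[k]**(p - 1.0) * GD**(1.0 - p) * np.sum(Vt * D_k[:, None] * (U.T @ A), axis=0)
    return GD, grad
```

A quick finite-difference check of a few entries of the returned gradient against \((GD(\varvec{M}+h E_{kn})-GD(\varvec{M}))/h\), where \(E_{kn}\) is \(1\) in entry \((k,n)\) and \(0\) elsewhere, is a convenient sanity test; in the projected gradient scheme of the paper, this gradient would additionally be projected so that the columns of \(\varvec{M}\) remain probability vectors.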

1.4 Experiment Setup

For our comparison on the outlier-free RAS database, we include the following methods: GDM, SCC (Chen and Lerman 2009b) from http://www.math.umn.edu/~lerman/scc, MAPA (Chen and Maggioni 2011) from http://www.math.duke.edu/~glchen/mapa.html, SSC (Elhamifar and Vidal 2009) (version 1.0 based on CVX) from http://www.vision.jhu.edu/code, SLBF (& SLBF-MS) (Zhang et al. 2012) from http://www.math.umn.edu/~lerman/lbf, LRR (Liu et al. 2010) from https://sites.google.com/site/guangcanliu, RAS (Rao et al. 2010) from http://perception.csl.illinois.edu/ras, and HOSC (Arias-Castro et al. 2011) from http://www.math.duke.edu/~glchen/hosc.html. For each method in our comparisons (outlier-free and our tests with outliers) we used the original authors' implementation. Most of these codes were found on the respective authors' websites, although some were obtained from the authors directly when they could not be found online.

We ran each method 10 times on each file in order to avoid capturing fluke occurrences of any method and instead report its "usual case" behavior (this is important for repeatability of the results). Of the 10 runs for a given file and method, the median error is reported. Deterministic methods return exactly the same result on each run. GDM involves randomness, but the average standard deviation of its misclassification errors was \(0.73~\%\), meaning that it behaved very consistently in the experiment.

GDM was run with \(n_1=10\); the other parameters (\(\epsilon \) and \(p\)) are fixed throughout all experiments and are addressed earlier. SCC was run with \(d=3\) for the linearly embedded data (this was found to give the best results) and \(d=7\) for the nonlinearly embedded data (as recommended in Chen et al. 2009). MAPA was run without any special parameters. SSC was run with no data projection (because of the low ambient dimension to start with), the affine constraint enabled (we tried it both ways and this gave better results), optimization method "Lasso" (the default for the authors' code), and parameter lambda = 0.001 (found through trial and error). SLBF was run with \(d=3\) for the linearly embedded data and \(d=6\) for the nonlinearly embedded data, and \(\sigma \) was set to 20,000 for both cases (\(d\) and \(\sigma \) were selected by trial and error to give the best results). LRR was run with \(\lambda =100\) for the linear case and \(\lambda =\) 10,000 for the non-linear case (these seemed to give the best results). RAS proved rather sensitive to its main parameter ("angleTolerance"), and no single value gave good across-the-board results; we ran with all default parameters and many other combinations. The results presented were generated using \(\text {angleTolerance}=0.22\) and \(\text {boundaryThreshold}=5\), as this combination gave the best results in our tests (better than the algorithm's defaults). HOSC was run with \(\eta \) automatically selected by the algorithm from the range \([0.0001, 0.1]\), the parameter "knn" set to 20, and the default "heat" kernel. The algorithm was tried with \(d\) set to 2 and 3; both cases are presented. \(d=2\) gave better results, but the authors of HOSC argue for using \(d=3\) in this setting.

For our comparison on the outlier-free Hopkins 155 database, the algorithms that were selected for the comparison were run once on each of the 155 data files. The mean and median performance for each category is reported. GDM was run with \(n_1=30\) to improve reliability. All other parameters were left fixed, and (as before) the non-linearly embedded data was used. Each competing 2-view method was run on the non-linearly embedded data with the same parameters that gave the best performance on the RAS database. The competing n-view methods have their parameters given in the results tables.

For our outlier comparison on the corrupted RAS database, we ran GDM with \(n_1=30\) (same as for the Hopkins 155 database). For the naive approach, we used \(\alpha =0.02\). For "GDM-Known Fraction" we rejected 20 % of the dataset. For "GDM-Model Reassign" we used \(\kappa =0.05\). "GDM-Classic" was the same algorithm as in the outlier-free comparisons and so had no extra parameters. RAS was run with \(\text {angleTolerance}=0.22\) and \(\text {boundaryThreshold}=5\) (same as in the outlier-free tests). We ran LRR with \(\lambda =0.1\) and \(\text {outlierThreshold}=0.138\) (these gave the best results of the combinations we tried). HOSC was run with \(d=2\) (which gave the best results in the outlier-free case) and \(\alpha =0.11\).

The code for GDM can be found on our supplemental webpage.

Cite this article

Poling, B., Lerman, G. A New Approach to Two-View Motion Segmentation Using Global Dimension Minimization. Int J Comput Vis 108, 165–185 (2014). https://doi.org/10.1007/s11263-013-0694-0
