Heuristic Framework for Multiscale Testing of the Multi-Manifold Hypothesis
When analyzing empirical data, we often find that global linear models overestimate the number of parameters required. In such cases, we may ask whether the data lies on or near a manifold, or a set of manifolds (a multi-manifold), of lower dimension than the ambient space. This question can be phrased as a (multi-)manifold hypothesis. The identification of such intrinsic multiscale features is a cornerstone of data analysis and representation, and has given rise to a large body of work on manifold learning. In this work, we review key results on multiscale data analysis and intrinsic dimension, and then introduce a heuristic, multiscale framework for testing the multi-manifold hypothesis. Our method implements a hypothesis test on a set of spline-interpolated manifolds constructed from variance-based intrinsic dimensions. The workflow is suitable for empirical data analysis, as we demonstrate on two use cases.
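The variance-based notion of intrinsic dimension mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the variance threshold, and the test data are illustrative assumptions. The idea is that, within a neighborhood of a given radius, the local dimension is estimated as the number of singular values needed to explain most of the local variance.

```python
# Hedged sketch of a variance-based local intrinsic dimension estimate
# (in the spirit of multiscale SVD approaches); names and threshold are
# illustrative assumptions, not the paper's exact procedure.
import numpy as np

def local_intrinsic_dimension(points, center, radius, var_threshold=0.95):
    """Smallest k such that the top-k singular values of the centered
    r-neighborhood explain var_threshold of the local variance."""
    nbrs = points[np.linalg.norm(points - center, axis=1) <= radius]
    if len(nbrs) < 2:
        return 0  # too few neighbors at this scale to estimate anything
    centered = nbrs - nbrs.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)
    explained = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(explained, var_threshold) + 1)

# Example: points near a 1-D curve embedded in R^3, with small noise.
rng = np.random.default_rng(0)
t = rng.uniform(0, 1, 500)
curve = np.column_stack([t, t**2, np.zeros_like(t)])
curve += 1e-4 * rng.normal(size=curve.shape)
d_hat = local_intrinsic_dimension(curve, curve[0], radius=0.2)
```

Repeating this estimate over a range of radii gives the multiscale picture: at scales where noise dominates, the estimate inflates toward the ambient dimension, while at intermediate scales it stabilizes near the intrinsic dimension.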
This research started at the Women in Data Science and Mathematics Research Collaboration Workshop (WiSDM), July 17–21, 2017, at the Institute for Computational and Experimental Research in Mathematics (ICERM). The workshop was partially supported by grant number NSF-HRD 1500481-AWM ADVANCE and co-sponsored by Brown’s Data Science Initiative.
Additional support for some participant travel was provided by DIMACS in association with and through its Special Focus on Information Sharing and Dynamic Data Analysis. Linda Ness worked on this project during a visit to DIMACS, partially supported by the National Science Foundation under grant number CCF-1445755. F. Patricia Medina received partial travel funding from the Mathematical Science Department at Worcester Polytechnic Institute.
We thank Brie Finegold and Katherine M. Kinnaird for their participation in the workshop and in early stage experiments. In addition, we thank Anna Little for helpful discussions on intrinsic dimensions and Jason Stoker for sharing material on LiDAR data.