The Remarkable Simplicity of Very High Dimensional Data: Application of Model-Based Clustering
An ultrametric topology formalizes the notion of hierarchical structure. An ultrametric embedding, referred to here as ultrametricity, is implied by a hierarchical embedding. Such hierarchical structure can be global in the data set, or local. By quantifying the extent or degree of ultrametricity in a data set, we show that ultrametricity becomes pervasive as dimensionality and/or spatial sparsity increases. This leads us to assert that very high dimensional data are of simple structure. We exemplify this finding through a range of simulated data cases. We also discuss application to very high frequency time series segmentation and modeling.
Keywords: Multivariate data analysis; Cluster analysis; Hierarchy; Ultrametric; p-Adic; Dimensionality
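The abstract's central claim, that ultrametricity becomes pervasive as dimensionality grows, can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions and is not the coefficient used in the paper: it relies on the standard fact that in an ultrametric space every triangle is isosceles with the two largest sides equal, and it estimates the fraction of randomly sampled triangles that satisfy this within a relative tolerance. The function names, the tolerance `rel_tol`, and the choice of uniformly distributed points are all assumptions made for the example.

```python
import math
import random


def ultrametric_fraction(points, n_triplets=2000, rel_tol=0.05, seed=0):
    """Estimate the share of sampled triangles that are approximately ultrametric.

    A triangle counts as ultrametric when its two largest sides differ by at
    most rel_tol times the largest side (an assumed, illustrative criterion).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_triplets):
        # Draw three distinct points and compute the Euclidean side lengths.
        a, b, c = rng.sample(points, 3)
        sides = sorted([math.dist(a, b), math.dist(b, c), math.dist(a, c)])
        if sides[2] - sides[1] <= rel_tol * sides[2]:
            hits += 1
    return hits / n_triplets


def uniform_cloud(n_points, dim, seed=0):
    """Generate n_points points drawn uniformly from the unit hypercube [0, 1]^dim."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(dim)] for _ in range(n_points)]


if __name__ == "__main__":
    # Compare a low-dimensional and a high-dimensional uniform cloud.
    for dim in (2, 1000):
        frac = ultrametric_fraction(uniform_cloud(100, dim))
        print(f"dim={dim}: approximately ultrametric triangles = {frac:.3f}")
```

Under these assumptions, the fraction for the high-dimensional cloud should come out well above the fraction for the two-dimensional cloud, which is consistent with the trend the abstract describes. The exact values depend on the tolerance and on the sampling seed.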