Journal of Classification, Volume 26, Issue 3, pp 249–277

The Remarkable Simplicity of Very High Dimensional Data: Application of Model-Based Clustering

Article

Abstract

An ultrametric topology formalizes the notion of hierarchical structure. An ultrametric embedding, referred to here as ultrametricity, is implied by a hierarchical embedding. Such hierarchical structure can be global in the data set, or local. By quantifying the extent or degree of ultrametricity in a data set, we show that ultrametricity becomes pervasive as dimensionality and/or spatial sparsity increases. This leads us to assert that very high dimensional data are of simple structure. We exemplify this finding through a range of simulated data cases. We also discuss application to very high frequency time series segmentation and modeling.
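
As a rough illustration of what quantifying ultrametricity can look like, the sketch below samples random triplets of points and counts the share whose triangle is approximately isosceles with the two longest sides equal, which is what the ultrametric inequality d(x, z) ≤ max(d(x, y), d(y, z)) requires of every triangle. This is a simplified stand-in and not necessarily the measure used in the article; the function name `ultrametric_fraction`, the relative tolerance `tol`, and the use of Euclidean distance are assumptions made for the example.

```python
import numpy as np


def ultrametric_fraction(points, n_triplets=2000, tol=0.05, seed=0):
    """Estimate the share of sampled triangles that are nearly ultrametric.

    A triangle counts as nearly ultrametric when its two longest sides
    differ by at most `tol` times the longest side. Both the criterion
    and the default tolerance are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    hits = 0
    for _ in range(n_triplets):
        # Draw three distinct row indices to form one triangle.
        i, j, k = rng.choice(n, size=3, replace=False)
        # Euclidean side lengths, sorted in ascending order.
        sides = np.sort([
            np.linalg.norm(points[i] - points[j]),
            np.linalg.norm(points[j] - points[k]),
            np.linalg.norm(points[i] - points[k]),
        ])
        # Ultrametric triangles have their two longest sides equal.
        if sides[2] - sides[1] <= tol * sides[2]:
            hits += 1
    return hits / n_triplets
```

Under these assumptions, applying the function to uniformly distributed points in, say, 2 and 2000 dimensions would be expected to give a much larger fraction in the higher dimensional case, in line with the claim that ultrametricity becomes pervasive as dimensionality increases.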

Keywords

Multivariate data analysis; Cluster analysis; Hierarchy; Ultrametric; p-Adic; Dimensionality

Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  1. Science Foundation Ireland, Dublin 4, Ireland
  2. Department of Computer Science, Royal Holloway, University of London, Egham, England