The Remarkable Simplicity of Very High Dimensional Data: Application of Model-Based Clustering

Journal of Classification

Abstract

An ultrametric topology formalizes the notion of hierarchical structure. An ultrametric embedding, referred to here as ultrametricity, is implied by a hierarchical embedding. Such hierarchical structure can be global in the data set, or local. By quantifying the extent or degree of ultrametricity in a data set, we show that ultrametricity becomes pervasive as dimensionality and/or spatial sparsity increases. This leads us to assert that very high dimensional data are of simple structure. We exemplify this finding through a range of simulated data cases. We also discuss application to very high frequency time series segmentation and modeling.
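
The idea of quantifying a degree of ultrametricity can be illustrated with a small simulation. The sketch below is an assumption made for illustration and is not the measure defined in the article: it samples random triplets of points and counts the fraction whose two longest sides are nearly equal, which is the triangle shape required by the strong triangle inequality d(x, z) <= max(d(x, y), d(y, z)). The function names `ultrametric_fraction` and `demo`, the 5% tolerance, and the uniform test data are all hypothetical choices.

```python
import numpy as np

def ultrametric_fraction(points, n_triplets=2000, tol=0.05, seed=0):
    """Fraction of sampled triangles whose two longest sides agree within tol.

    Illustrative proxy only (an assumption, not the article's coefficient):
    in an ultrametric space every triangle is isosceles with a small base,
    or equilateral, so the two longest sides are equal.
    """
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    hits = 0
    for _ in range(n_triplets):
        # Pick three distinct points and measure the triangle they form.
        i, j, k = rng.choice(n, size=3, replace=False)
        sides = sorted([
            np.linalg.norm(points[i] - points[j]),
            np.linalg.norm(points[i] - points[k]),
            np.linalg.norm(points[j] - points[k]),
        ])
        # Relative gap between the longest and second-longest side.
        if sides[2] - sides[1] <= tol * sides[2]:
            hits += 1
    return hits / n_triplets

def demo(dims=(2, 20, 2000), n_points=100, seed=0):
    """Apply the proxy to uniformly distributed points of increasing dimension."""
    rng = np.random.default_rng(seed)
    return {
        d: ultrametric_fraction(rng.uniform(size=(n_points, d)), seed=seed)
        for d in dims
    }

if __name__ == "__main__":
    for d, frac in demo().items():
        print(f"dimension {d:5d}: fraction of near-ultrametric triangles = {frac:.3f}")
```

Under these assumptions the fraction should rise toward 1 as the dimension grows, because pairwise distances between uniformly distributed points concentrate around a common value and triangles become close to equilateral. The tolerance controls how strictly "nearly equal" is read, so the absolute values depend on that choice.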

Author information

Corresponding author

Correspondence to Fionn Murtagh.

About this article

Cite this article

Murtagh, F. The Remarkable Simplicity of Very High Dimensional Data: Application of Model-Based Clustering. J Classif 26, 249–277 (2009). https://doi.org/10.1007/s00357-009-9037-9

  • DOI: https://doi.org/10.1007/s00357-009-9037-9
