Abstract
An ultrametric topology formalizes the notion of hierarchical structure. An ultrametric embedding, referred to here as ultrametricity, is implied by a hierarchical embedding. Such hierarchical structure can be global in the data set, or local. By quantifying extent or degree of ultrametricity in a data set, we show that ultrametricity becomes pervasive as dimensionality and/or spatial sparsity increases. This leads us to assert that very high dimensional data are of simple structure. We exemplify this finding through a range of simulated data cases. We discuss also application to very high frequency time series segmentation and modeling.
Similar content being viewed by others
References
AGGARWAL, C.C., HINNEBURG, A., and KEIM, D.A. (2001), “On the Surprising Behavior of Distance Metrics in High Dimensional Spaces”, Proceedings of the 8th International Conference on Database Theory, January 04-06, pp. 420–434.
AHN, J., MARRON, J.S., MULLER, K.E., and CHI, Y.-Y. (2007), “The High Dimension, Low Sample Size Geometric Representation Holds Under Mild Conditions”, Biometrika, 94, 760–766.
AHN, J., and MARRON, J.S. (2005), “Maximal Data Piling in Discrimination”, Biometrika, submitted; and “The Direction of Maximal Data Piling in High Dimensional Space”.
BELLMAN, R. (1961), Adaptive Control Processes: A Guided Tour, Princeton NJ: Princeton University Press.
BÉNASSÉNI, J., BENNANI DOSSE, M., and JOLY, S. (2007), On a General Transformation Making a Dissimilarity Matrix Euclidean, Journal of Classification, 24, 33–51.
BENZÉCRI, J.P. (1979), L’Analyse des Donn´ees, Tome I Taxinomie, Tome II Correspondances (2nd ed.), Paris: Dunod.
BREUEL, T.M. (2007), “A Note on Approximate Nearest Neighbor Methods”, http://arxiv.org/pdf/cs/0703101
CAILLIEZ, F., and PAG`ES, J.P. (1976), Introduction `a l’Analyse de Donn´ees, SMASH (Soci´et´e de Math´ematiques Appliqu´ees et de Sciences Humaines), Paris.
CAILLIEZ, F. (1983), “The Analytical Solution of the Additive Constant Problem”, Psychometrika, 48, 305–308.
CH ÁVEZ, E., NAVARRO, G., BAEZA-YATES,R., andMARROQUÍN, J.L. (2001), “Proximity Searching in Metric Spaces”, ACM Computing Surveys, 33, 273–321.
CRITCHLEY, F., and HEISER, W. (1988), “Hierarchical Trees Can Be Perfectly Scaled in One Dimension”, Journal of Classification, 5, 5–20.
DE SOETE, G. (1986), “A Least Squares Algorithm for Fitting an Ultrametric Tree to a Dissimilarity Matrix”, Pattern Recognition Letters, 2, 133–137.
DONOHO, D.L., and TANNER, J. (2005), “Neighborliness of Randomly-Projected Simplices in High Dimensions”, Proceedings of the National Academy of Sciences, 102, 9452–9457.
HALL, P., MARRON, J.S. and NEEMAN, A. (2005), “Geometric Representation of High Dimension Low Sample Size Data”, Journal of the Royal Statistical Society B, 67, 427–444.
HEISER, W.J. (2004), “Geometric Representation of Association Between Categories”, Psychometrika, 69, 513–545.
HINNEBURG, A., AGGARWAL, C., and KEIM, D. (2000), “What is the Nearest Neighbor in High Dimensional Spaces?”, VLDB 2000, Proceedings of 26th International Conference on Very Large Data Bases, September 10-14, 2000, Cairo, Egypt: Morgan Kaufmann, pp. 506–515.
HORNIK, K. (2005), “A CLUE for CLUster Ensembles”, Journal of Statistical Software, 14 (12).
KASS, R.E., and RAFTERY, A.E. (1995), “Bayes Factors and Model Uncertainty”, Journal of the American Statistical Association, 90, 773–795.
KHRENNIKOV, A. (1997), Non-Archimedean Analysis: Quantum Paradoxes, Dynamical Systems and Biological Models, Dordrecht: Kluwer.
LERMAN, I.C. (1981), Classification et Analyse Ordinale des Donn´ees, Paris: Dunod.
MURTAGH, F. (1985), Multidimensional Clustering Algorithms, Vienna: Physica-Verlag.
MURTAGH, F. (2004), “On Ultrametricity, Data Coding, and Computation”, Journal of Classification, 21, 167–184.
MURTAGH, F. (2005), “Identifying the Ultrametricity of Time Series”, European Physical Journal B, 43, 573–579.
MURTAGH, F. (2007), “A Note on Local Ultrametricity in Text”, http://arxiv.org/pdf/cs.CL/0701181
MURTAGH, F. (2005), Correspondence Analysis and Data Coding with R and Java, Boca Raton FL: Chapman & Hall/CRC.
MURTAGH, F. (2006), “From Data to the Physics using Ultrametrics: New Results in High Dimensional Data Analysis”, in p-Adic Mathematical Physics, eds. A.Yu. Khrennikov, Z. Raki´c, and I.V. Volovich, American Institute of Physics Conference Proceedings Vol. 826, pp. 151–161.
MURTAGH, F., DOWNS, G., and CONTRERAS, P. (2008), “Hierarchical Clustering of Massive, High Dimensional Data Sets by Exploiting Ultrametric Embedding”, SIAM Journal on Scientific Computing, 30, 707–730.
MURTAGH, F., and STARCK, J.L. (2003), “Quantization from Bayes Factors with Application to Multilevel Thresholding”, Pattern Recognition Letters, 24, 2001–2007.
NEUWIRTH, E., and REISINGER, L. (1982), “Dissimilarity and Distance Coefficients in Automation-Supported Thesauri”, Information Systems, 7, 47–52.
RAMMAL, R., ANGLES D’AURIAC, J.C., and DOUCOT, B. (1985), “On the Degree of Ultrametricity”, Le Journal de Physique – Lettres, 46, L-945–L-952.
RAMMAL, R., TOULOUSE,G., and VIRASORO,M.A. (1986), “Ultrametricity for Physicists”, Reviews of Modern Physics, 58, 765–788.
ROHLF, F.J., and FISHER, D.R. (1968), “Tests for Hierarchical Structure in Random Data Sets”, Systematic Zoology, 17, 407–412.
SCHWARZ, G. (1978), “Estimating the Dimension of a Model”, Annals of Statistics, 6, 461–464.
TORGERSON,W.S. (1958), Theory and Methods of Scaling, New York: Wiley.
TREVES, A. (1997), “On the Perceptual Structure of Face Space”, BioSystems, 40, 189–196.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Murtagh, F. The Remarkable Simplicity of Very High Dimensional Data: Application of Model-Based Clustering. J Classif 26, 249–277 (2009). https://doi.org/10.1007/s00357-009-9037-9
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00357-009-9037-9