The Remarkable Simplicity of Very High Dimensional Data: Application of Model-Based Clustering

Murtagh, Fionn

doi:10.1007/s00357-009-9037-9

Published: 14 January 2010

Journal of Classification, Volume 26, pages 249–277 (2009)

Fionn Murtagh^1,2

Abstract

An ultrametric topology formalizes the notion of hierarchical structure. An ultrametric embedding, referred to here as ultrametricity, is implied by a hierarchical embedding. Such hierarchical structure can be global in the data set, or local. By quantifying the extent or degree of ultrametricity in a data set, we show that ultrametricity becomes pervasive as dimensionality and/or spatial sparsity increases. This leads us to assert that very high dimensional data are of simple structure. We exemplify this finding through a range of simulated data cases. We also discuss application to very high frequency time series segmentation and modeling.
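
To make the idea of "quantifying the degree of ultrametricity" concrete, here is a minimal sketch. It does not reproduce the coefficient used in the article, which this preview does not state. It assumes a simple stand-in measure based on the strong triangle inequality d(x, z) ≤ max(d(x, y), d(y, z)), which holds for a triplet exactly when its two largest pairwise distances are equal. The function names `ultrametric_fraction` and `uniform_cloud`, the tolerance `tol`, and the choice of uniformly distributed points are all illustrative assumptions.

```python
import math
import random


def ultrametric_fraction(points, n_triplets=2000, tol=0.1, seed=0):
    """Fraction of sampled triplets that approximately satisfy the
    ultrametric (strong triangle) inequality under Euclidean distance.

    A triplet counts as a hit when its two largest pairwise distances
    differ by at most `tol` times the largest one (assumed tolerance).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_triplets):
        # Draw three distinct points from the data set.
        a, b, c = rng.sample(points, 3)
        d = sorted([math.dist(a, b), math.dist(b, c), math.dist(a, c)])
        # Ultrametric triangles are isosceles with a small base, or
        # equilateral: the two largest sides are (nearly) equal.
        if d[2] - d[1] <= tol * d[2]:
            hits += 1
    return hits / n_triplets


def uniform_cloud(n_points, dim, seed=0):
    """Hypothetical test data: points drawn uniformly from the unit cube."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(dim)] for _ in range(n_points)]


if __name__ == "__main__":
    # Compare the stand-in measure across increasing dimensionality.
    for dim in (2, 20, 200, 2000):
        frac = ultrametric_fraction(uniform_cloud(100, dim), n_triplets=1000)
        print(dim, round(frac, 3))
```

Running the script prints the fraction of approximately ultrametric triplets for each dimensionality. Under these assumptions the fraction would be expected to rise with dimension, because pairwise distances between uniformly scattered points concentrate around a common value. The tolerance is arbitrary, so the absolute values carry little meaning on their own; only the trend across dimensions is of interest.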

Author information

Authors and Affiliations

Science Foundation Ireland, Wilton Park House, Wilton Place, Dublin 4, Ireland
Fionn Murtagh
Department of Computer Science, Royal Holloway, University of London, Egham, TW20 0EX, England
Fionn Murtagh

Corresponding author

Correspondence to Fionn Murtagh.

About this article

Cite this article

Murtagh, F. The Remarkable Simplicity of Very High Dimensional Data: Application of Model-Based Clustering. J Classif 26, 249–277 (2009). https://doi.org/10.1007/s00357-009-9037-9

Published: 14 January 2010
Issue Date: December 2009
DOI: https://doi.org/10.1007/s00357-009-9037-9
