Journal of Classification, Volume 29, Issue 2, pp 118–143

Fast, Linear Time Hierarchical Clustering using the Baire Metric

  • Pedro Contreras
  • Fionn Murtagh

Abstract

The Baire metric induces an ultrametric on a dataset and is of linear computational complexity, in contrast to the standard agglomerative hierarchical clustering algorithm, which runs in quadratic time. In this work we empirically evaluate this new approach to hierarchical clustering. We compare hierarchical clustering based on the Baire metric with (i) agglomerative hierarchical clustering, in terms of algorithm properties; (ii) generalized ultrametrics, in terms of definition; and (iii) fast clustering through k-means partitioning, in terms of quality of results. For the last of these, we carry out an in-depth astronomical study. We apply the Baire distance to spectrometric and photometric redshifts from the Sloan Digital Sky Survey, using about half a million astronomical objects in this work. We want to know how well the (more costly to determine) spectrometric redshifts can predict the (more easily obtained) photometric redshifts, i.e. we seek to regress the spectrometric on the photometric redshifts, and we use clusterwise regression for this.
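
To make the longest-common-prefix idea concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes each value has already been written as a fixed-length string of decimal digits, and it assumes the distance takes the form base^(-k), where k is the length of the longest common prefix and the base is 10. The function names (common_prefix_length, baire_distance, prefix_hierarchy) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def common_prefix_length(x, y):
    """Count the leading characters shared by the digit strings x and y."""
    k = 0
    for a, b in zip(x, y):
        if a != b:
            break
        k += 1
    return k


def baire_distance(x, y, base=10.0):
    """Assumed form of the distance: base**(-k), with k the shared prefix length.

    Identical strings are given distance 0; strings that differ in their
    first digit are given distance 1.
    """
    if x == y:
        return 0.0
    return base ** -common_prefix_length(x, y)


def prefix_hierarchy(digit_strings, depth):
    """Group item indices by their first k digits, for k = 1..depth.

    Each level is built in a single pass over the data, so the cost grows
    linearly with the number of items (times the fixed depth). No pairwise
    distance is ever computed.
    """
    levels = []
    for k in range(1, depth + 1):
        buckets = defaultdict(list)
        for i, s in enumerate(digit_strings):
            buckets[s[:k]].append(i)
        levels.append(dict(buckets))
    return levels
```

Calling prefix_hierarchy on a list of digit strings returns one dictionary per precision level, and the clusters at level k + 1 are nested inside those at level k. This nesting is what makes the result a hierarchy, and it follows from the ultrametric property of the distance: for any three strings, the largest two of the three pairwise distances are equal.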

Keywords

Hierarchical clustering; Ultrametric; Redshift; k-means; p-adic; m-adic; Baire; Longest common prefix

Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  1. Department of Computer Science, Royal Holloway, University of London, Egham, England
  2. ThinkingSafe Ltd., Egham, England
  3. Science Foundation Ireland, Dublin, Ireland