Part of the book series: Studies in Computational Intelligence ((SCI,volume 410))

Abstract

This chapter provides a tutorial overview of hierarchical clustering. Several data visualization methods based on hierarchical clustering are demonstrated, and the scaling of hierarchical clustering in time and memory is discussed. A new method for speeding up hierarchical clustering with cluster seeding is introduced, and this method is compared with a traditional agglomerative, average-link hierarchical clustering algorithm using several internal and external cluster validation indices. A benchmark study compares the clustering performance of both approaches on a wide variety of real-world and artificial benchmark data sets.
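As a rough illustration of the traditional baseline named in the abstract, the following is a minimal sketch of naive agglomerative average-link clustering. It is not the chapter's seeding method, and the function name `average_link`, its parameters, and the toy data are hypothetical choices made for this example. The sketch stores every pairwise distance, which is why memory grows quadratically with the number of points and why large data sets become a problem.

```python
from itertools import combinations
from math import dist


def average_link(points, k):
    """Naive agglomerative average-link clustering, stopped at k clusters.

    Illustrative sketch only: the name and signature are assumptions,
    not taken from the chapter.
    """
    n = len(points)
    # Full table of pairwise distances: O(n^2) memory.
    d = {(i, j): dist(points[i], points[j])
         for i, j in combinations(range(n), 2)}
    # Start with every point in its own cluster.
    clusters = [[i] for i in range(n)]

    def avg(a, b):
        # Average-link distance: mean of all cross-cluster point distances.
        total = sum(d[min(i, j), max(i, j)] for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        # Find the closest pair of clusters (a < b by construction).
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: avg(clusters[p[0]], clusters[p[1]]))
        # Merge cluster b into cluster a.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


toy = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(average_link(toy, 2))  # [[0, 1], [2, 3]]
```

Each merge step rescans all cluster pairs, so this naive version is far slower than optimized implementations; it is meant only to make the time and memory scaling concrete.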

Author information

Corresponding author

Correspondence to Mark J. Embrechts.

Copyright information

© 2013 Springer Berlin Heidelberg

About this chapter

Cite this chapter

Embrechts, M.J., Gatti, C.J., Linton, J., Roysam, B. (2013). Hierarchical Clustering for Large Data Sets. In: Georgieva, P., Mihaylova, L., Jain, L. (eds) Advances in Intelligent Signal Processing and Data Mining. Studies in Computational Intelligence, vol 410. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28696-4_8

  • DOI: https://doi.org/10.1007/978-3-642-28696-4_8

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-28695-7

  • Online ISBN: 978-3-642-28696-4

  • eBook Packages: Engineering (R0)
