Hierarchical Clustering for Large Data Sets

  • Mark J. Embrechts
  • Christopher J. Gatti
  • Jonathan Linton
  • Badrinath Roysam
Part of the Studies in Computational Intelligence book series (SCI, volume 410)

Abstract

This chapter provides a tutorial overview of hierarchical clustering. Several data visualization methods based on hierarchical clustering are demonstrated, and the scaling of hierarchical clustering in time and memory is discussed. A new method for speeding up hierarchical clustering with cluster seeding is introduced and compared with a traditional agglomerative, average-link hierarchical clustering algorithm using several internal and external cluster validation indices. A benchmark study compares the clustering performance of both approaches on a wide variety of real-world and artificial benchmark data sets.
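As a rough illustration of the baseline named in the abstract, the sketch below implements a naive agglomerative, average-link clustering and scores it with the Rand index, one example of an external validation index. It is not the chapter's code and does not show the cluster-seeding speed-up. The function names `average_link` and `rand_index` and the toy data are assumptions made for this example.

```python
from itertools import combinations
from math import dist


def average_link(points, n_clusters):
    """Naive agglomerative clustering with average linkage (illustrative only).

    Starts with one cluster per point and repeatedly merges the pair of
    clusters with the smallest mean pairwise distance between their members.
    """
    clusters = [[i] for i in range(len(points))]

    def mean_distance(a, b):
        total = sum(dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters; combinations guarantees p < q.
        p, q = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: mean_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]

    # Convert the list of clusters into one label per point.
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two labelings agree."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Toy data, made up for illustration: two well-separated groups in the plane.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
truth = [0, 0, 0, 1, 1, 1]
labels = average_link(points, n_clusters=2)
print(labels, rand_index(labels, truth))
```

This version recomputes every inter-cluster distance at each merge, so its running time grows roughly cubically with the number of points. That kind of cost is why scaling in time and memory matters for large data sets.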

Keywords

  • Hierarchical Cluster
  • Rand Index
  • Hierarchical Cluster Algorithm
  • Cluster Distance
  • Adjust Rand Index
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer Berlin Heidelberg 2013

Authors and Affiliations

  • Mark J. Embrechts (1)
  • Christopher J. Gatti (1)
  • Jonathan Linton (2)
  • Badrinath Roysam (3)

  1. Rensselaer Polytechnic Institute, Troy, USA
  2. University of Ottawa, Ottawa, Canada
  3. University of Houston, Houston, USA
