Hierarchical Clustering for Large Data Sets

Embrechts, Mark J.; Gatti, Christopher J.; Linton, Jonathan; Roysam, Badrinath

doi:10.1007/978-3-642-28696-4_8

Part of the book series: Studies in Computational Intelligence (SCI, volume 410)

Abstract

This chapter provides a tutorial overview of hierarchical clustering. Several data visualization methods based on hierarchical clustering are demonstrated, and the scaling of hierarchical clustering in time and memory is discussed. A new method for speeding up hierarchical clustering with cluster seeding is introduced and compared with traditional agglomerative hierarchical clustering with average linkage, using several internal and external cluster validation indices. A benchmark study compares the clustering performance of both approaches on a wide variety of real-world and artificial benchmark data sets.
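As a rough illustration of the baseline named in the abstract, the sketch below runs a naive agglomerative clustering with average linkage and scores the result with the Rand index, one example of an external validation index. It is not the chapter's seeded method, which the abstract does not describe. The function names `average_link` and `rand_index`, the toy points, and the reference labels are all assumptions made for this example.

```python
from itertools import combinations
from math import dist


def average_link(points, k):
    """Naive agglomerative clustering with average linkage; stops at k clusters.

    Illustrative only: linkages are recomputed from scratch at every merge,
    so this does not scale to large data sets.
    """
    clusters = [[i] for i in range(len(points))]  # start from singletons

    def linkage(a, b):
        # Average pairwise Euclidean distance between members of a and b.
        return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        # Find the closest pair of clusters (p < q) and merge q into p.
        p, q = min(combinations(range(len(clusters)), 2),
                   key=lambda pq: linkage(clusters[pq[0]], clusters[pq[1]]))
        clusters[p] += clusters[q]
        del clusters[q]

    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two partitions agree."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i, j in pairs)
    return agree / len(pairs)


# Hypothetical toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
truth = [0, 0, 0, 1, 1, 1]
labels = average_link(points, k=2)
print(rand_index(labels, truth))
```

Recomputing every pairwise linkage at each merge keeps the sketch short but makes it far slower than a practical implementation. Cost of this kind is what the abstract refers to when it mentions the scaling of hierarchical clustering in time and memory.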

Author information

Authors and Affiliations

Rensselaer Polytechnic Institute, Troy, NY, USA
Mark J. Embrechts & Christopher J. Gatti
University of Ottawa, Ottawa, Canada
Jonathan Linton
University of Houston, Houston, TX, USA
Badrinath Roysam

Corresponding author

Correspondence to Mark J. Embrechts.

Editor information

Editors and Affiliations

Dep. of Electronics Telecommunications, University of Aveiro, Aveiro, 3800-193, Portugal
Petia Georgieva
School of Computing and Communications, Lancaster University, InfoLab21, South Drive, Lancaster, LA1 4WA, UK
Lyudmila Mihaylova
School of Electrical and Information, University of South Australia, Adelaide, SA 5095, Australia
Lakhmi C Jain

About this chapter

Cite this chapter

Embrechts, M.J., Gatti, C.J., Linton, J., Roysam, B. (2013). Hierarchical Clustering for Large Data Sets. In: Georgieva, P., Mihaylova, L., Jain, L. (eds) Advances in Intelligent Signal Processing and Data Mining. Studies in Computational Intelligence, vol 410. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28696-4_8

DOI: https://doi.org/10.1007/978-3-642-28696-4_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28695-7
Online ISBN: 978-3-642-28696-4
eBook Packages: Engineering, Engineering (R0)