Abstract
This chapter provides a tutorial overview of hierarchical clustering. It demonstrates several data visualization methods based on hierarchical clustering and discusses how hierarchical clustering scales in time and memory. It introduces a new method that speeds up hierarchical clustering through cluster seeding and compares this method with traditional agglomerative hierarchical clustering with average linkage, using several internal and external cluster validation indices. A benchmark study compares the clustering performance of both approaches on a wide variety of real-world and artificial benchmark data sets.
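As a rough illustration of the traditional baseline named in the abstract, the following is a minimal sketch of agglomerative clustering with average linkage in pure Python. It is not the chapter's implementation and does not include the cluster seeding method; the function name `average_link_clustering`, its parameters, and the toy data are assumptions made for this example. The naive version stores the full n × n distance matrix and rescans every pair of clusters at each merge, which hints at why time and memory scaling become a concern for large data sets.

```python
from itertools import combinations
from math import dist


def average_link_clustering(points, n_clusters):
    """Naive agglomerative clustering with average linkage (illustrative only).

    Starts with one cluster per point and repeatedly merges the two clusters
    with the smallest mean pairwise distance until n_clusters remain.
    Returns a sorted list of clusters, each a sorted list of point indices.
    """
    # Full pairwise Euclidean distance matrix: quadratic memory in len(points).
    d = [[dist(p, q) for q in points] for p in points]
    clusters = [[i] for i in range(len(points))]

    def average_distance(a, b):
        # Average linkage: mean distance over all cross-cluster pairs.
        return sum(d[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Scan every pair of current clusters and pick the closest one.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return sorted(clusters)


# Hypothetical toy data: two well-separated groups of three points each.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_clusters = average_link_clustering(toy_points, n_clusters=2)
```

On the toy data, the two groups of nearby points end up in separate clusters. Practical implementations avoid the repeated full rescan, for example by updating inter-cluster distances incrementally after each merge.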
Copyright information
© 2013 Springer Berlin Heidelberg
About this chapter
Cite this chapter
Embrechts, M.J., Gatti, C.J., Linton, J., Roysam, B. (2013). Hierarchical Clustering for Large Data Sets. In: Georgieva, P., Mihaylova, L., Jain, L. (eds) Advances in Intelligent Signal Processing and Data Mining. Studies in Computational Intelligence, vol 410. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28696-4_8
DOI: https://doi.org/10.1007/978-3-642-28696-4_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28695-7
Online ISBN: 978-3-642-28696-4
eBook Packages: Engineering, Engineering (R0)