Overview on Techniques in Cluster Analysis

  • Itziar Frades
  • Rune Matthiesen
Part of the Methods in Molecular Biology book series (MIMB, volume 593)


Clustering is the classification of patterns into groups, whether unsupervised, semisupervised, or supervised. The clustering problem has been addressed in many contexts and disciplines. Cluster analysis encompasses a range of methods and algorithms for grouping similar objects into categories. In this chapter, we describe a number of these methods and algorithms within a stepwise framework. A typical clustering analysis proceeds through the following steps in order: pattern representation, choice of the similarity measure, choice of the clustering algorithm, assessment of the output, and representation of the clusters.
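The five steps listed above can be illustrated with a minimal sketch in plain Python. Everything in it is an assumption made for illustration rather than the chapter's prescribed method: the toy data, the function names, the use of z-score standardization for pattern representation, Euclidean distance as the similarity measure, k-means with a farthest-first initialization as the clustering algorithm, and the mean silhouette width as the assessment.

```python
import math

def standardize(rows):
    """Step 1, pattern representation: rescale each feature to zero mean, unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def euclidean(a, b):
    """Step 2, similarity measure: Euclidean distance (smaller means more similar)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iterations=100):
    """Step 3, clustering algorithm: k-means with a deterministic farthest-first start."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(euclidean(p, c) for c in centroids))
        centroids.append(list(far))
    labels = None
    for _ in range(iterations):
        new_labels = [min(range(k), key=lambda c: euclidean(p, centroids[c]))
                      for p in points]
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

def mean_silhouette(points, labels):
    """Step 4, assessment: average silhouette width (values near 1 suggest compact, separated clusters)."""
    scores = []
    for i, p in enumerate(points):
        dists = {}
        for j, q in enumerate(points):
            if i != j:
                dists.setdefault(labels[j], []).append(euclidean(p, q))
        own = dists.pop(labels[i], None)
        if not own or not dists:  # singleton cluster, or only one cluster overall
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                            # mean distance within own cluster
        b = min(sum(d) / len(d) for d in dists.values())   # mean distance to nearest other cluster
        scores.append((b - a) / (max(a, b) or 1.0))
    return sum(scores) / len(scores)

def summarize(labels):
    """Step 5, cluster representation: indices of the patterns in each cluster."""
    groups = {}
    for i, l in enumerate(labels):
        groups.setdefault(l, []).append(i)
    return sorted(groups.values())

# Hypothetical toy data: two visibly separated groups of two-feature patterns.
raw = [[1.0, 2.0], [1.2, 1.9], [0.9, 2.1], [8.0, 9.0], [8.2, 9.1], [7.9, 8.8]]
patterns = standardize(raw)
labels, centroids = kmeans(patterns, k=2)
print(summarize(labels))
print(round(mean_silhouette(patterns, labels), 3))
```

Each step is a separate function so that any one choice (for example, swapping Euclidean distance for a correlation-based measure, or k-means for a hierarchical method) can be changed without touching the others.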

Key words

Clustering algorithm, feature selection, feature extraction, similarity measure, cluster tendency, cluster validity, cluster stability, relevance networks, dendrogram



Copyright information

© Humana Press, a part of Springer Science+Business Media, LLC 2010

Authors and Affiliations

  • Itziar Frades (1)
  • Rune Matthiesen (2)
  1. Bioinformatics, Parque Tecnológico de Bizkaia, Derio, Spain
  2. Instituto de Patologia e Imunologia Molecular da Universidade do Porto (IPATIMUP), Porto, Portugal
