A Systematic Comparison of Genome Scale Clustering Algorithms

(Extended Abstract)
  • Jeremy J. Jay
  • John D. Eblen
  • Yun Zhang
  • Mikael Benson
  • Andy D. Perkins
  • Arnold M. Saxton
  • Brynn H. Voy
  • Elissa J. Chesler
  • Michael A. Langston
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6674)


A wealth of clustering algorithms has been applied to gene co-expression experiments. These algorithms cover a broad array of approaches, from conventional techniques such as k-means and hierarchical clustering to graph-based approaches such as k-clique communities, weighted gene co-expression network analysis (WGCNA) and paraclique. Comparing these methods to evaluate their relative effectiveness provides guidance for algorithm selection, development and implementation. Most prior work on comparative clustering evaluation has focused on parametric methods. Graph theoretical methods are recent additions to the tool set for the global analysis and decomposition of microarray data and have not generally been included in earlier methodological comparisons. In the present study, a variety of parametric and graph theoretical clustering algorithms are compared using well-characterized, genome-scale transcriptomic data from Saccharomyces cerevisiae. Clusters are scored using Jaccard similarity coefficients to assess the positive match of clusters to known pathways. This produces a readily interpretable ranking of the relative effectiveness of clustering on the genes. Validation of clusters against known gene classifications demonstrates that, for these data, graph-based techniques outperform conventional clustering approaches, suggesting that further development and application of combinatorial strategies are warranted.
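The Jaccard scoring mentioned above can be illustrated with a minimal sketch. The abstract does not spell out the exact scoring procedure, so the best-match rule used here (each cluster is credited with its highest Jaccard coefficient over all reference gene sets) is an assumption, as are the function names and the toy gene and pathway identifiers. The coefficient itself is the standard J(A, B) = |A ∩ B| / |A ∪ B|.

```python
def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two collections of gene IDs."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here: two empty sets score 0
    return len(a & b) / len(union)


def best_match_scores(clusters, pathways):
    """Map each cluster name to (best-matching pathway name, Jaccard score).

    Assumption for illustration: a cluster is scored by its single best
    match among the reference pathways; ties go to the first pathway seen.
    """
    scores = {}
    for cluster_name, genes in clusters.items():
        best_name, best_score = None, 0.0
        for pathway_name, members in pathways.items():
            score = jaccard(genes, members)
            if best_name is None or score > best_score:
                best_name, best_score = pathway_name, score
        scores[cluster_name] = (best_name, best_score)
    return scores


# Toy data (hypothetical identifiers, not taken from the study).
clusters = {"c1": {"g1", "g2", "g3"}, "c2": {"g4", "g5"}}
pathways = {"p1": {"g1", "g2", "g3", "g6"}, "p2": {"g5", "g7"}}

if __name__ == "__main__":
    for name, (pathway, score) in best_match_scores(clusters, pathways).items():
        print(f"{name}: best match {pathway}, Jaccard = {score:.3f}")
```

Averaging such per-cluster scores for each algorithm would give one simple way to rank methods against each other, although the ranking procedure actually used in the study may differ.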

ISBRA Topics of Interest: gene expression analysis, software tools and applications.


Keywords: Gene Ontology; Cluster Algorithm; Gene Expression Data; Maximal Clique; Rand Index





Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Jeremy J. Jay (1)
  • John D. Eblen (2)
  • Yun Zhang (2)
  • Mikael Benson (3)
  • Andy D. Perkins (4)
  • Arnold M. Saxton (2)
  • Brynn H. Voy (2)
  • Elissa J. Chesler (1)
  • Michael A. Langston (2)

  1. The Jackson Laboratory, Bar Harbor, USA
  2. University of Tennessee, Knoxville, USA
  3. University of Göteborg, Göteborg, Sweden
  4. Mississippi State University, USA
