FPF-SB: A Scalable Algorithm for Microarray Gene Expression Data Clustering

  • Filippo Geraci
  • Mauro Leoncini
  • Manuela Montangero
  • Marco Pellegrini
  • M. Elena Renda
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4561)


Efficient and effective analysis of large microarray gene expression datasets is one of the keys to time-critical personalized medicine. The issue we address here is the scalability of the data processing software that clusters gene expression data into groups with homogeneous expression profiles. In this paper we propose FPF-SB, a novel clustering algorithm that combines the Furthest-Point-First (FPF) heuristic for solving the k-center problem with a stability-based method for determining the number of clusters k. Our algorithm improves the state of the art: it scales to large datasets without sacrificing output quality.
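
To make the FPF ingredient concrete, the following is a minimal sketch of the classical Furthest-Point-First greedy heuristic for the k-center problem. It is not the authors' FPF-SB implementation: the function name `fpf_centers`, the choice of the first point as the initial center, and the use of Euclidean distance are assumptions made for illustration, and the stability-based selection of k is not shown.

```python
import math


def fpf_centers(points, k, dist=math.dist):
    """Greedy Furthest-Point-First heuristic for k-center (illustrative sketch).

    Returns (centers, assign): indices of the chosen centers, and for each
    point the position in `centers` of its closest center.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")

    # Assumption of this sketch: the first center is simply point 0.
    centers = [0]
    # nearest[i] is the distance from point i to its closest chosen center.
    nearest = [dist(p, points[0]) for p in points]
    assign = [0] * len(points)

    while len(centers) < k:
        # The next center is the point furthest from all current centers.
        far = max(range(len(points)), key=nearest.__getitem__)
        centers.append(far)
        # Reassign every point that is closer to the new center.
        for i, p in enumerate(points):
            d = dist(p, points[far])
            if d < nearest[i]:
                nearest[i] = d
                assign[i] = len(centers) - 1
    return centers, assign


if __name__ == "__main__":
    # Toy 2-D data standing in for expression profiles (hypothetical values).
    data = [(0, 0), (0, 1), (10, 0), (10, 1), (5, 20)]
    print(fpf_centers(data, 3))
```

Each round costs one pass over the points, so choosing k centers takes O(nk) distance computations. A correlation-based distance could be passed through the `dist` parameter in place of the Euclidean default.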





Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Filippo Geraci (1, 3)
  • Mauro Leoncini (1, 2)
  • Manuela Montangero (1, 2)
  • Marco Pellegrini (1)
  • M. Elena Renda (1)

  1. CNR, Istituto di Informatica e Telematica, Via Moruzzi 1, 56124 Pisa, Italy
  2. Dipartimento di Ingegneria dell’Informazione, Università di Modena e Reggio Emilia, Via Vignolese 905, 41100 Modena, Italy
  3. Dipartimento di Ingegneria dell’Informazione, Università di Siena, Via Roma 56, 53100 Siena, Italy
