Clustering Metagenome Short Reads Using Weighted Proteins

  • Gianluigi Folino
  • Fabio Gori
  • Mike S. M. Jetten
  • Elena Marchiori
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5483)


This paper proposes a new knowledge-based method for clustering metagenome short reads. The method incorporates biological knowledge into the clustering process by means of a list of proteins associated with each read. These proteins are chosen from a reference proteome database according to their similarity to the given read, as evaluated by BLAST. We introduce a scoring function for weighting the resulting proteins and use the weighted proteins to cluster the reads. The resulting clustering algorithm automatically selects the number of clusters and generates possibly overlapping clusters of reads. Experiments on real-life benchmark datasets show that the method is effective at reducing the size of a metagenome dataset while maintaining high accuracy of organism content.
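The abstract does not spell out the scoring function or the clustering procedure, so the following Python sketch only illustrates the general idea under stated assumptions: each read comes with a list of BLAST protein hits, each hit carries a weight (here simply the bit score, which is an assumption), and proteins are picked greedily until every read is covered. The function name `cluster_reads`, the `(read_id, protein_id, bit_score)` tuple layout, and the greedy weighted-cover rule are all hypothetical choices made for illustration, not the authors' method.

```python
from collections import defaultdict


def cluster_reads(hits):
    """Group reads by the proteins they hit, using a greedy weighted cover.

    hits: iterable of (read_id, protein_id, bit_score) tuples, e.g. parsed
    from tabular BLAST output. The bit score stands in for the paper's
    scoring function, which the abstract does not define (assumption).
    Returns {protein_id: set of read_ids}; clusters may overlap.
    """
    # protein -> {read: weight}, keeping the best weight per (protein, read)
    scores = defaultdict(dict)
    for read_id, protein_id, bit_score in hits:
        previous = scores[protein_id].get(read_id, 0.0)
        scores[protein_id][read_id] = max(previous, bit_score)

    uncovered = {read for per_read in scores.values() for read in per_read}
    clusters = {}

    while uncovered:
        # Only proteins that still cover a new read can make progress.
        candidates = [
            p for p in sorted(scores)
            if any(read in uncovered for read in scores[p])
        ]

        def gain(protein):
            # Total weight of the not-yet-covered reads this protein explains.
            return sum(w for r, w in scores[protein].items() if r in uncovered)

        best = max(candidates, key=gain)
        # Keep every read that hit the protein, so clusters can overlap.
        clusters[best] = set(scores[best])
        uncovered -= clusters[best]

    return clusters


example_hits = [
    ("r1", "P1", 50.0),
    ("r2", "P1", 40.0),
    ("r2", "P2", 60.0),
    ("r3", "P2", 30.0),
    ("r4", "P3", 20.0),
]
clusters = cluster_reads(example_hits)
for protein, reads in clusters.items():
    print(protein, sorted(reads))
```

In this sketch the number of clusters is not fixed in advance: the loop stops as soon as every read belongs to at least one selected protein, and a read that hits several selected proteins (such as `r2` in the example) appears in more than one cluster.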





Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Gianluigi Folino (1)
  • Fabio Gori (2)
  • Mike S. M. Jetten (2)
  • Elena Marchiori (2)

  1. ICAR-CNR, Rende, Italy
  2. Radboud University, Nijmegen, The Netherlands
