Better Identification of Repeats in Metagenomic Scaffolding

  • Jay Ghurye
  • Mihai Pop
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9838)


Genomic repeats are the most important challenge in genomic assembly. While for single genomes the effect of repeats is largely addressed by modern long-read sequencing technologies, in metagenomic data intra-genome and, more importantly, inter-genome repeats continue to be a significant impediment to effective genome reconstruction. Detecting repeats in metagenomic samples is complicated by characteristic features of these data, primarily uneven depths of coverage and the presence of genomic polymorphisms. The scaffolder Bambus 2 introduced a new strategy for repeat detection based on the betweenness centrality measure – a concept originally used in social network analysis. The exact computation of the betweenness centrality measure is, however, computationally intensive and impractical in large metagenomic datasets. Here we explore the effectiveness of approximate algorithms for network centrality to accurately detect genomic repeats within metagenomic samples. We show that an approximate measure of centrality achieves much higher computational efficiencies with a minimal loss in the accuracy of detecting repeats in metagenomic data. We also show that the combination of multiple features of the scaffold graph provides a more effective strategy for identifying metagenomic repeats, significantly outperforming all other commonly used approaches.


Metagenomics Random forest Betweenness centrality Scaffolding Algorithms Graph 



We thank Chris Hill for helping us with generating Fig. 1 and experiments. We also thank Todd Treangen for helping us to improve the manuscript and design experiments.


  1. 1.
    Brandes, U.: A faster algorithm for betweenness centrality*. J. Math. Sociol. 25(2), 163–177 (2001)CrossRefzbMATHGoogle Scholar
  2. 2.
    Dayarian, A., Michael, T.P., Sengupta, A.M.: SOPRA: scaffolding algorithm for paired reads via statistical optimization. BMC Bioinform. 11(1), 1 (2010)CrossRefGoogle Scholar
  3. 3.
    Delcher, A.L., Salzberg, S.L., Phillippy, A.M.: Using MUMmer to identify similar regions in large sequence sets. Curr. Protocols Bioinform. 10.3.1–10.3.18 (2003). Chapter 10:Unit 10.3Google Scholar
  4. 4.
    Gao, S., Sung, W.-K., Nagarajan, N.: Opera: reconstructing optimal genomic scaffolds with high-throughput paired-end sequences. J. Comput. Biol. 18(11), 1681–1691 (2011)MathSciNetCrossRefGoogle Scholar
  5. 5.
    Garey, M., Johnson, D.: Computers and Intractability - A Guide to NP-Completeness. W.H. Freeman & Co., New York (1979)zbMATHGoogle Scholar
  6. 6.
    Geisberger, R., Sanders, P., Schultes, D.: Better approximation of betweenness centrality. In: ALENEX, pp. 90–100. SIAM (2008)Google Scholar
  7. 7.
    Huson, D.H., Reinert, K., Myers, E.W.: The greedy path-merging algorithm for contig scaffolding. J. ACM (JACM) 49(5), 603–615 (2002)MathSciNetCrossRefzbMATHGoogle Scholar
  8. 8.
    Fass, J.N., Joshi, N.A.: Sickle: a sliding-window, adaptive, quality-based trimming tool for FastQ files (version 1.33)Google Scholar
  9. 9.
    Kececioglu, J.D., Myers, E.W.: Combinatorial algorithms for DNA sequence assembly. Algorithmica 13(1–2), 7–51 (1995)MathSciNetCrossRefzbMATHGoogle Scholar
  10. 10.
    Kingsford, C., Schatz, M.C., Pop, M.: Assembly complexity of prokaryotic genomes using short reads. BMC Bioinform. 11(1), 21 (2010)CrossRefGoogle Scholar
  11. 11.
    Koren, S., Phillippy, A.M.: One chromosome, one contig: complete microbial genomes from long-read sequencing and assembly. Curr. Opin. Microbiol. 23, 110–120 (2015)CrossRefGoogle Scholar
  12. 12.
    Koren, S., Treangen, T.J., Pop, M.: Bambus 2: scaffolding metagenomes. Bioinformatics 27(21), 2964–2971 (2011)CrossRefGoogle Scholar
  13. 13.
    Langmead, B., Salzberg, S.L.: Fast gapped-read alignment with Bowtie 2. Nat. Methods 9(4), 357–359 (2012)CrossRefGoogle Scholar
  14. 14.
    Liaw, A., Wiener, M.: Classification and regression by randomforest. R News 2(3), 18–22 (2002)Google Scholar
  15. 15.
    Lilliefors, H.W.: On the Kolmogorov-Smirnov test for normality with mean and variance unknown. J. Am. Stat. Assoc. 62(318), 399–402 (1967)CrossRefGoogle Scholar
  16. 16.
    Madduri, K., Ediger, D., Jiang, K., Bader, D.A., Chavarria-Miranda, D.: A faster parallel algorithm and efficient multithreaded implementations for evaluating betweenness centrality on massive datasets. In: 2009 IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2009, pp. 1–8. IEEE (2009)Google Scholar
  17. 17.
    Medvedev, P., Georgiou, K., Myers, G., Brudno, M.: Computability of models for sequence assembly. In: Giancarlo, R., Hannenhalli, S. (eds.) WABI 2007. LNCS (LNBI), vol. 4645, pp. 289–301. Springer, Heidelberg (2007)CrossRefGoogle Scholar
  18. 18.
    Mitchell, L., Sloan, T.M., Mewissen, M., Ghazal, P., Forster, T., Piotrowski, M., Trew, A.S.: A parallel random forest classifier for R. In: Proceedings of the Second International Workshop on Emerging Computational Methods for the Life Sciences, pp. 1–6. ACM (2011)Google Scholar
  19. 19.
    Peng, Y., Leung, H.C., Yiu, S.-M., Chin, F.Y.: Meta-IDBA: a de novo assembler for metagenomic data. Bioinformatics 27(13), i94–i101 (2011)CrossRefGoogle Scholar
  20. 20.
    Pop, M., Kosack, D.S., Salzberg, S.L.: Hierarchical scaffolding with bambus. Genome Res. 14(1), 149–159 (2004)CrossRefGoogle Scholar
  21. 21.
    Riondato, M., Kornaropoulos, E.M.: Fast approximation of betweenness centrality through sampling. In: Proceedings of the 7th ACM International Conference on Web Search and Data Mining, pp. 413–422. ACM (2014)Google Scholar
  22. 22.
    Salmela, L., Mäkinen, V., Välimäki, N., Ylinen, J., Ukkonen, E.: Fast scaffolding with small independent mixed integer programs. Bioinformatics 27(23), 3259–3265 (2011)CrossRefGoogle Scholar
  23. 23.
    Shakya, M., Quince, C., Campbell, J.H., Yang, Z.K., Schadt, C.W., Podar, M.: Comparative metagenomic and RRNA microbial diversity characterization using archaeal and bacterial synthetic communities. Environ. Microbiol. 15(6), 1882–1899 (2013)CrossRefGoogle Scholar
  24. 24.
    Treangen, T.J., Koren, S., Sommer, D.D., Liu, B., Astrovskaya, I., Ondov, B., Darling, A.E., Phillippy, A.M., Pop, M.: MetAMOS: a modular and open source metagenomic assembly and analysis pipeline. Genome Biol. 14(1), R2 (2013)CrossRefGoogle Scholar

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. 1.Department of Computer Science and Center for Bioinformatics and Computational BiologyUniversity of MarylandCollege ParkUSA

Personalised recommendations