Better Identification of Repeats in Metagenomic Scaffolding
Genomic repeats are the most important challenge in genome assembly. While for single genomes the effect of repeats is largely addressed by modern long-read sequencing technologies, in metagenomic data intra-genome and, more importantly, inter-genome repeats continue to be a significant impediment to effective genome reconstruction. Detecting repeats in metagenomic samples is complicated by characteristic features of these data, primarily uneven depth of coverage and the presence of genomic polymorphisms. The scaffolder Bambus 2 introduced a new strategy for repeat detection based on betweenness centrality, a measure originally used in social network analysis. Computing betweenness centrality exactly is, however, computationally intensive and impractical for large metagenomic datasets. Here we explore how effectively approximate algorithms for network centrality detect genomic repeats within metagenomic samples. We show that an approximate measure of centrality is far more computationally efficient, with minimal loss in the accuracy of detecting repeats in metagenomic data. We also show that combining multiple features of the scaffold graph provides a more effective strategy for identifying metagenomic repeats, significantly outperforming all other commonly used approaches.
Keywords: Metagenomics, Random forest, Betweenness centrality, Scaffolding, Algorithms, Graph
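To make the idea of approximate centrality concrete, the following is a minimal sketch, not the authors' implementation. It estimates betweenness centrality on an unweighted, undirected graph by running Brandes-style accumulation from a random sample of source nodes instead of from every node, then flags nodes whose estimated centrality lies well above the mean. The function names (`approx_betweenness`, `flag_repeats`), the source-sampling scheme, the adjacency-dictionary representation, and the standard-deviation cutoff are all assumptions made for illustration; the approximation algorithms and thresholds actually evaluated in the paper may differ.

```python
import random
from collections import deque


def approx_betweenness(adj, num_samples, seed=0):
    """Estimate betweenness centrality from a random sample of sources.

    adj: dict mapping each node to an iterable of its neighbours
         (hypothetical stand-in for a scaffold graph).
    Scores count ordered (source, target) pairs, so on an undirected
    graph they are twice the usual unordered-pair value.
    """
    nodes = list(adj)
    k = min(num_samples, len(nodes))
    if k <= 0:
        raise ValueError("need at least one node and one sample")
    sources = random.Random(seed).sample(nodes, k)
    score = {v: 0.0 for v in nodes}

    for s in sources:
        # Forward pass: BFS from s, counting shortest paths (sigma)
        # and recording shortest-path predecessors.
        sigma = {v: 0 for v in nodes}
        dist = {v: -1 for v in nodes}
        preds = {v: [] for v in nodes}
        sigma[s], dist[s] = 1, 0
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)

        # Backward pass: accumulate pair dependencies in reverse BFS order.
        delta = {v: 0.0 for v in nodes}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]

    # Extrapolate from k sampled sources to all n sources.
    scale = len(nodes) / k
    return {v: score[v] * scale for v in nodes}


def flag_repeats(centrality, num_std=3.0):
    """Return nodes whose centrality exceeds mean + num_std * std.

    The cutoff rule is an illustrative assumption, not taken from the text.
    """
    values = list(centrality.values())
    mean = sum(values) / len(values)
    std = (sum((x - mean) ** 2 for x in values) / len(values)) ** 0.5
    return {v for v, c in centrality.items() if c > mean + num_std * std}
```

With `num_samples` equal to the number of nodes, the estimate coincides with the exact value, at the cost of one breadth-first search per node. Lowering `num_samples` trades accuracy for running time, which is the trade-off the abstract describes.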
We thank Chris Hill for help with generating Fig. 1 and with the experiments. We also thank Todd Treangen for helping us improve the manuscript and design experiments.