Improving Metagenomic Assemblies Through Data Partitioning: A GC Content Approach

  • Fábio Miranda
  • Cassio Batista
  • Artur Silva
  • Jefferson Morais
  • Nelson Neto
  • Rommel Ramos
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10813)


Assembling metagenomic data sequenced by NGS platforms poses significant computational challenges, mainly due to large data volumes, sequencing errors, and variations in the size, complexity, diversity and abundance of the organisms present in a given metagenome. To overcome these problems, this work proposes an open-source bioinformatic tool called GCSplit, which partitions metagenomic sequences into subsets using a computationally inexpensive metric: the GC content. Experiments performed on real data show that preprocessing short reads with GCSplit prior to assembly reduces memory consumption and yields higher-quality results: the size of the largest contig and the N50 metric increase, while both the L50 value and the total number of contigs produced in the assembly decrease. GCSplit is available at
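
As an illustration of the idea, the following minimal sketch partitions reads by GC content and computes the N50 and L50 metrics mentioned above. It is not GCSplit's implementation: the function names `gc_content`, `partition_by_gc` and `n50_l50` are hypothetical, and the strategy of sorting reads by GC fraction and cutting them into near-equal contiguous subsets is an assumption made for the example.

```python
from typing import List, Tuple


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def partition_by_gc(reads: List[str], n_parts: int) -> List[List[str]]:
    """Sort reads by GC fraction and cut them into n_parts contiguous subsets.

    Assumption for this sketch: subsets are of near-equal size; the real
    tool may choose its boundaries differently.
    """
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    ordered = sorted(reads, key=gc_content)
    base, extra = divmod(len(ordered), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        # The first `extra` subsets receive one additional read.
        size = base + (1 if i < extra else 0)
        parts.append(ordered[start:start + size])
        start += size
    return parts


def n50_l50(contig_lengths: List[int]) -> Tuple[int, int]:
    """Return (N50, L50) for a list of contig lengths.

    Walking contigs from longest to shortest, N50 is the length of the
    contig at which the running sum first reaches half of the total
    assembly length, and L50 is the number of contigs used to get there.
    """
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if 2 * running >= total:
            return length, count
    return 0, 0  # empty input


if __name__ == "__main__":
    toy_reads = ["ATAT", "GCGC", "ATGC", "GGGA"]
    print(partition_by_gc(toy_reads, 2))
    print(n50_l50([2, 3, 4, 5, 6]))
```

In this sketch each subset would then be assembled separately. A higher N50 together with a lower L50 indicates that fewer, longer contigs cover half of the assembly.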


DNA sequencing; Metagenomics; Data partitioning; Bioinformatic tools; Metagenomic data preprocessing



This research is supported in part by CNPq under grant numbers 421528/2016–8 and 304711/2015–2. The authors would also like to thank CAPES for granting scholarships. The datasets were processed on the Sagarana HPC cluster, CPAD–ICB–UFMG.



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. Computer Science Graduate Program, Federal University of Pará, Belém, Brazil
  2. Institute of Biological Sciences, Federal University of Pará, Belém, Brazil
  3. Center of Genomics and Systems Biology, Federal University of Pará, Belém, Brazil
