Improving Metagenomic Assemblies Through Data Partitioning: A GC Content Approach
Assembling metagenomic data sequenced by NGS platforms poses significant computational challenges, owing to large data volumes, sequencing errors, and variation in the size, complexity, diversity and abundance of the organisms present in a given metagenome. To address these problems, this work proposes an open-source bioinformatic tool called GCSplit, which partitions metagenomic sequences into subsets using a computationally inexpensive metric: the GC content. Experiments performed on real data show that preprocessing short reads with GCSplit prior to assembly reduces memory consumption and yields higher-quality assemblies: the size of the largest contig and the N50 metric increase, while both the L50 value and the total number of contigs decrease. GCSplit is available at https://github.com/mirand863/gcsplit.
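The idea of partitioning reads by GC content, and the N50 and L50 metrics used to judge the resulting assemblies, can be illustrated with a minimal Python sketch. It is not GCSplit's implementation: the abstract does not say how subset boundaries are chosen, so the equal-sized split over GC-sorted reads, the handling of ambiguous bases, and the function names `gc_content`, `partition_by_gc` and `n50_l50` are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def gc_content(seq: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of a read."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    # Assumption: reads with no unambiguous bases get a GC content of 0.0.
    return gc / total if total else 0.0


def partition_by_gc(reads: Dict[str, str], k: int) -> List[List[str]]:
    """Split read IDs into k near-equal subsets ordered by increasing GC content.

    Hypothetical splitting rule: sort by GC content (ties broken by read ID)
    and cut the sorted list into k consecutive chunks.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    ordered = sorted(reads, key=lambda rid: (gc_content(reads[rid]), rid))
    base, extra = divmod(len(ordered), k)
    parts: List[List[str]] = []
    start = 0
    for i in range(k):
        # The first `extra` chunks receive one additional read.
        size = base + (1 if i < extra else 0)
        parts.append(ordered[start:start + size])
        start += size
    return parts


def n50_l50(contig_lengths: List[int]) -> Tuple[int, int]:
    """Return (N50, L50) for a list of contig lengths.

    Contigs are visited from longest to shortest; N50 is the length of the
    contig at which the running total first reaches half of the assembly
    size, and L50 is the number of contigs visited up to that point.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half:
            return length, count
    return 0, 0  # empty assembly


if __name__ == "__main__":
    toy_reads = {
        "r1": "ATATATAT",
        "r2": "GCGCGCGC",
        "r3": "ATGCATGC",
        "r4": "GGGCATAT",
    }
    print(partition_by_gc(toy_reads, 2))
    print(n50_l50([100, 80, 50, 30, 20, 20]))
```

In a workflow of this kind, each subset would be written to its own file and assembled separately. A higher N50 together with a lower L50 means that fewer, longer contigs cover half of the assembly, which is the direction of improvement the abstract reports.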
Keywords: DNA sequencing, Metagenomics, Data partitioning, Bioinformatic tools, Metagenomic data preprocessing
This research is supported in part by CNPq under grant numbers 421528/2016–8 and 304711/2015–2. The authors would also like to thank CAPES for granting scholarships. Datasets were processed on the Sagarana HPC cluster, CPAD–ICB–UFMG.