Abstract
Metagenomic gene classifications are significant in bioinformatics and computational biology research. There are huge interrelated datasets that deal with human characteristics, diseases and molecular functionalities. Analysis of metagenomic reorganization is a challenging issue due to their diversity and efficient classification tools. Graph based MapReducing approach can easily handle the genomic diversity. MapReduce has two parts such as mapping and reducing. In mapping phase, a recursive naive algorithm is used for generating K-mers. De-Bruijn graph is a compact representation of k-mers and finds out an optimal path (solution) for genome assembly. Similarity metrics have been utilized for finding similarity among the De-Oxy Ribonucleic Acid (DNA) sequences. In reducing side, Jaccard similarity and purity of clustering are used as datasets classifier to classify the sequences based on their similarity. Reducing phase can easily classify the DNA sequences from large database. Extensive experimental analysis has demonstrated that graph based MapReduce analysis generate optimal solutions. Remarkable improvements in time and space have recorded and observed. The results established that proposed framework performed faster than SSMA-SFSD when classified elements are increased. It provided better accuracy for metagenomic data clustering.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.References
Wooley JC, Godzik A, Friedberg I (2010) A primer on metagenomics. PLoS Comput Biol 6(2):e1000667
Ley RE, Hamady M, Lozupone C, Turnbaugh PJ, Ramey RR (2008) Evolution of mammals and their gut microbes. Science 320(80):1647–1651
Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S (2007) The Sorcerer II Global Ocean Sampling expedition: northwest Atlantic through eastern tropical Pacific. PLoS Biol 5:e77
Huttenhower C, Gevers D, Knight R, Abubucker S, Badger JH (2012) Structure, function and diversity of the healthy human microbiome. Nature 486:207–214
Besemer J, Borodovsky M (1999) Heuristic approach to deriving models for gene finding. Nucleic Acids Res 27(19):3911–3920
Greenblum S, Turnbaugh PJ, Borenstein E (2009) Metagenomic systems biology of the human gut microbiome reveals topological shifts associated with obesity and inflammatory bowel disease. PNAS 109:594–599
Qin J, Li Y, Cai Z, Li S, Zhu J (2012) A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature 490:55–60
Handelsman J (2007) Committee on metagenomics: challenges and functional applications. The National Academies Press, Washington
Pevzner P, Tang H, Waterman M (2001) An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci USA 98:9748–9753
Miller J, Koren S, Sutton G (2010) Assembly algorithms for next-generation sequencing data. Genomics 95:315–327
Compeau P, Pevzner P, Tesler G (2011) How to apply de Bruijn graphs to genome assembly. Nat Biotechnol 29:987–991
Peng Y, Leung HCM, Yiu SM, Chin FYL (2011) T-IDBA: a de novo iterative de Bruijn graph assembler for transcriptome. In: Bafna V, Sahinalp SC (eds) Research in computational molecular biology. RECOMB 2011. Lecture notes in computer science, vol 6577. Springer, Berlin, Heidelberg
Zerbino DR, Birney E (2008) Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res 18:821–829
Simpson JT, Wong K, Jackman K, Schein JE, Jones SJ, Birol I (2009) ABySS: a parallel assembler for short read sequence data. Genome Res 19(6):1117–1123
Butler J, MacCallum I, Kleber M, Shlyakhter IA, Belmonte MK, Lander ES, Nusbaum C, Jaffe DB (2008) AllPaths: de novo assembly of whole-genome shotgun microreads. Genome Res. 18:810–820
Namiki T, Hachiya T, Tanaka H, Sakakibara Y (2012) MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Res 40(20):e155
Grabherr M (2009) Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol 29:644–652
López V, del Río S, Benítez J, Herrera F (2014) Cost-sensitive linguistic fuzzy rule based classification systems under the MapReduce framework for imbalanced big data. Fuzzy Sets Syst 258:5–38
Miner D, Shook A (2012) MapReduce design patterns: building effective algorithms and analytics for Hadoop and other systems. O’Reilly Media, Inc., Sebastopol, CA
Dean J, Ghemawat S (2003) MapReduce: simplified data processing on large clusters. In: Proceedings. of Symposium on opearting systems design and implementation, vol 6, pp 1–10
Dean J, Ghemawat S (2004) MapReduce: simplified data processing on large clusters. In: OSDI 2004
Yinan W, Renner DW, Albert I, SzparaL ML (2015) VirAmp: a galaxy-based viral genome assembly pipeline. GigaScience 4:19
Chang Z, Li G, Li J, Zhang Y, Ashby C, Liu D, Cramer C, Huang X (2015) Bridger: a new framework for de novo transcriptome assembly using RNA-seqdata. Genome Biol 16:30
Hernandez D (2008) De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. Genome Res 18:802–809
Wang S, Cho H, Zhai CX, Berger B, Peng J (2015) Exploiting ontology graph for predicting sparsely annotated gene function. Bioinformatics 31:i357–i364
Yuzhen Y, Haixu T (2016) Utilizing de Bruijn graph of metagenome assembly for metatranscriptome analysis. Bioinformatics 32(7):1001–1008
Christopher BB (1997) dentification of genes in human genomicdna. Ph.d. Thesis. Stanford University, Stanford, CA,USA
Gens P, Enrique B, Roderic G (2000) Geneid in drosophila. Genome Res 10:511–515
Arthur D, Kirsten B, Edwin P, Steven S (2007) Identifying bacterial genes and endosymbiontdna with glimmer. Bioinformatics 23:7
Ewan B, Michele C, Richard D (2004) Gene wise and genome wise. Genome Res 14:988–995
Leila T, Oliver R, Saurabh G, Alexander S, Michael B, Serafim B, Burkhard M (2003) Agenda: homology-based gene prediction. Bioinformatics 19:1575–1577
Green P, Lipman D, Hillier L, Waterston R, States RD, Claverie JM (1993) Ancient conserved regions in new gene sequences and the protein databases. Science 259:1711–1716
Noguchi H, Park J, Takagi T (2006) MetaGene: prokaryotic gene finding from environmental genome shotgun sequences. Nucleic Acids Res 34(19):5623–5630
Hoff KJ, Lingner T, Meinicke P (2009) Orphelia:predicting genes in metagenomic sequencing reads. Nucleic Acids Res 37:W101–W105
Besemer J, Borodovsky M (1999) Heuristic approach to deriving models for gene finding. Nucleic Acids Res 27(19):3911–3920
Yang B, Peng Y, Leung H, Yiu SM, Qin J, Li R, Chin FYL (2010) Metacluster: unsupervised binning of environmental genomic fragments and taxonomic annotation. In: Proceedingsof the first ACM international conference on bioinformatics and computational biology, pp 170–179
Yang X, Zola J, Aluru S. (2011) Parallel metagenomic sequence clustering via sketching and maximal qQuasi clique enumeration on map-reduce clouds. In: Parallel and distributed processing symposium (IPDPS), 2011 IEEE International, pp 1223–1233
Olston C, Reed B, Srivastava U, Kumar R, Tomkins A. Pig latin: a not-so-foreign language for data processing. In: SIGMOD pp 1099–1110
Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Zhang N, Anthony S, Liu H, Murthy R (2010). Hive-a petabyte scale data warehouse using hadoop. In: ICDE, pp 996–1005
Chaiken R, Jenkins B, Larson PA, Ramsey B, Shakib D, Weaver S, Zhou J (2008) Scope: easy and efficient parallel processing of massive data sets. PVLDB 1(2):1265–1276
Río S, López V, Benítez J, Herrera F (2014) On the use of MapReduce for imbalanced big data using Random Forest. Inf Sci 285:112–137
Birney E, Zerbino DR (2008) Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res 18:821–829
Pevzner PA, Tang HX, Waterman MS (2001) An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci USA 98(17):9748–9753
Zerbino DR, Birney E (2008) Velvet: algorithms for de novo short read assembly using de bruijn graphs. Genome Res 18(5):821–829
Limasset A, Cazaux B, Rivals E, Peterlongo P (2016) Read mapping on de Bruijn graphs. Bioinformatics 17(1):237
Myers EW (2005) The fragment assembly string graph. Bioinformatics 21:ii79–ii85
Myers EW, Sutton GG, Delcher AL et al (2000) A whole-genome assembly of Drosophila. Science 287:2196–2204
Gross JL, Yellen J (2004) Handbook of graph theory. CRC Press LLC, Boca Raton
Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51:107–113
Dean J, Ghemawat S (2010) Mapreduce:a flexible data processing tool. ACM 53:72–77
Benkrid K, Liu Y, Benkrid A (2009) A highly parameterized and efficient FPGA-based skeleton for pairwise biological sequence alignment. IEEE Trans Very Large Scale Integr Syst 17(4):561–570
Edgar RC (2010) Search and clustering orders of magnitude faster than blast. Bioinformatics 26(19):2460–2461
Li W, Godzik A (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22(13):1658–1659
Broder AZ, Charikar M, Frieze AM, Mitzenmacher M (1998) Min-wise independent permutations. In: Proceedings of the thirtieth annual ACM symposium on theory of computing pp 327–336
Zhao Y, Karypis G (2001) Criterion functions for document clustering: experiments and analysis. Technical report, Department of Computer Science, University of Minnesota, Minneapolis
Needleman S, Wunsch C (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48(3):443–453
Smith T, Waterman M (1981) Identification of common molecular subsequences. J Mol Bwl 147:195–197
Hugenholtz P, Tyson GW (2008) Microbiology: metagenomic. Nature 455(7212):481–483
Chatterji S, Yamazaki I, Bai Z, Eisen J (2008) Compostbin: a dna composition-based algorithm for binning environmental shotgun reads. In: Annual international conference on research in computational molecular biology, Springer, pp 17–28
Khan I, Kamal S, Chowdhury L (2015) MSuPDA: a memory efficient algorithm for sequence alignment. Comput Life Sci 8(1):84–94
García S, Cano JR, Herrera F (2008) A memetic algorithm for evolutionary prototype selection: ascalingupapproach. Pattern Recognit 41(8):2693–2709
Price KV, Storn RM, Lampinen JA (2005) The differential evolution algorithm. In: Differential evolution: a practical approach to global optimization, pp 37–134. ISBN 978-3-540-31306-9
Neri F, Tirronen V (2009) Scale factor local searching differential evolution. Memet Comput 1(2):153–171
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Kamal, M.S., Parvin, S., Ashour, A.S. et al. De-Bruijn graph with MapReduce framework towards metagenomic data classification. Int. j. inf. tecnol. 9, 59–75 (2017). https://doi.org/10.1007/s41870-017-0005-z
Published:
Issue Date:
DOI: https://doi.org/10.1007/s41870-017-0005-z