De-Bruijn graph with MapReduce framework towards metagenomic data classification

Abstract

Metagenomic gene classification is significant in bioinformatics and computational biology research. Large, interrelated datasets describe human characteristics, diseases and molecular functions. Analyzing metagenomic data is challenging because of the diversity of the data and the lack of efficient classification tools. A graph-based MapReduce approach can handle this genomic diversity. MapReduce has two parts: mapping and reducing. In the mapping phase, a recursive naive algorithm generates k-mers. A De-Bruijn graph gives a compact representation of the k-mers and is used to find an optimal path (solution) for genome assembly. Similarity metrics are used to measure similarity among the deoxyribonucleic acid (DNA) sequences. In the reducing phase, Jaccard similarity and purity of clustering are used to classify the sequences based on their similarity, which allows DNA sequences from a large database to be classified. Extensive experimental analysis demonstrated that the graph-based MapReduce analysis generates optimal solutions, and improvements in both time and space were observed. The results established that the proposed framework performed faster than SSMA-SFSD as the number of classified elements increased, and that it provided better accuracy for metagenomic data clustering.
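The pipeline summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the greedy grouping rule in `reduce_phase`, and the `threshold` parameter are assumptions made for illustration, and the code runs on a single machine rather than on a MapReduce cluster. It shows recursive k-mer generation, a De-Bruijn graph built from (k-1)-mer prefixes and suffixes, Jaccard similarity between k-mer sets, and clustering purity.

```python
from collections import defaultdict


def kmers(seq, k):
    """Naive recursive k-mer generation (suitable for short reads only,
    because Python limits recursion depth)."""
    if len(seq) < k:
        return []
    return [seq[:k]] + kmers(seq[1:], k)


def de_bruijn_edges(kmer_list):
    """Map each (k-1)-mer prefix to the (k-1)-mer suffixes that follow it."""
    graph = defaultdict(list)
    for kmer in kmer_list:
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two k-mer collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def map_phase(reads, k):
    """Hypothetical map step: emit the k-mer set of every read."""
    return {read_id: set(kmers(seq, k)) for read_id, seq in reads.items()}


def reduce_phase(kmer_sets, threshold):
    """Hypothetical reduce step: greedily place each read in the first
    cluster whose representative (first member) is similar enough."""
    clusters = []
    for read_id, kset in kmer_sets.items():
        for cluster in clusters:
            if jaccard(kset, kmer_sets[cluster[0]]) >= threshold:
                cluster.append(read_id)
                break
        else:
            clusters.append([read_id])
    return clusters


def purity(clusters, labels):
    """Fraction of items that carry the majority label of their cluster."""
    total = sum(len(c) for c in clusters)
    if total == 0:
        return 0.0
    majority = 0
    for cluster in clusters:
        counts = defaultdict(int)
        for item in cluster:
            counts[labels[item]] += 1
        if counts:
            majority += max(counts.values())
    return majority / total


if __name__ == "__main__":
    reads = {"r1": "ACGTAC", "r2": "ACGTAA", "r3": "TTTTTT"}  # toy data
    sets = map_phase(reads, k=3)
    groups = reduce_phase(sets, threshold=0.5)
    print(de_bruijn_edges(kmers(reads["r1"], 3)))
    print(groups, purity(groups, {"r1": "A", "r2": "A", "r3": "B"}))
```

In this toy run, reads `r1` and `r2` share three of their five distinct 3-mers (Jaccard similarity 0.6), so they fall into one group at a threshold of 0.5, while `r3` forms its own group.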

Author information

Corresponding author

Correspondence to Nilanjan Dey.

Cite this article

Kamal, M.S., Parvin, S., Ashour, A.S. et al. De-Bruijn graph with MapReduce framework towards metagenomic data classification. Int. j. inf. tecnol. 9, 59–75 (2017). https://doi.org/10.1007/s41870-017-0005-z

Keywords

  • MapReduce
  • Metagenomic
  • K-mers
  • De-Bruijn graph
  • Jaccard similarity