
Identifying viruses from metagenomic data using deep learning



The recent development of metagenomic sequencing makes it possible to sequence microbial genomes, including viral genomes, at massive scale without the need for laboratory culture. Existing reference-based and gene homology-based methods are inefficient at identifying unknown viruses or short viral sequences in metagenomic data.


Here we developed a reference-free and alignment-free machine learning method, DeepVirFinder, for identifying viral sequences in metagenomic data using deep learning.


Trained on viral RefSeq sequences discovered before May 2015 and evaluated on those discovered after that date, DeepVirFinder outperformed the state-of-the-art method VirFinder at all contig lengths, achieving AUROC 0.93, 0.95, 0.97, and 0.98 for 300, 500, 1000, and 3000 bp sequences, respectively. Enlarging the training data with millions of additional purified viral sequences from metavirome samples further improved accuracy for under-represented virus groups. Applying DeepVirFinder to real human gut metagenomic samples, we identified 51,138 viral sequences belonging to 175 bins in patients with colorectal carcinoma (CRC). Ten bins were associated with cancer status, suggesting that viruses may play important roles in CRC.


Powered by deep learning and high-throughput metagenomic sequencing data, DeepVirFinder significantly improved the accuracy of viral identification and will assist the study of viruses in the era of metagenomics.
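The AUROC figures quoted above summarize how well a score separates viral from non-viral sequences. As a minimal sketch, assuming each contig has a prediction score and a binary label (1 = viral, 0 = prokaryotic), AUROC can be computed as the fraction of positive/negative pairs in which the positive example receives the higher score. The function name `auroc` and the example scores are hypothetical and are not taken from the paper.

```python
def auroc(scores, labels):
    """Return the AUROC as the probability that a randomly chosen positive
    example scores higher than a randomly chosen negative example.

    A tie between a positive and a negative counts as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical scores for six contigs (illustrative values only).
example_scores = [0.92, 0.81, 0.40, 0.65, 0.30, 0.10]
example_labels = [1, 1, 1, 0, 0, 0]
print(auroc(example_scores, example_labels))
```

The pairwise form is quadratic in the number of examples; it is chosen here for readability, and a rank-based computation would be preferable for large evaluation sets.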


References

  1. Norman, J. M., Handley, S. A., Baldridge, M. T., Droit, L., Liu, C. Y., Keller, B. C., Kambal, A., Monaco, C. L., Zhao, G., Fleshner, P., et al. (2015) Disease-specific alterations in the enteric virome in inflammatory bowel disease. Cell, 160, 447–460
  2. Reyes, A., Blanton, L. V., Cao, S., Zhao, G., Manary, M., Trehan, I., Smith, M. I., Wang, D., Virgin, H. W., Rohwer, F., et al. (2015) Gut DNA viromes of Malawian twins discordant for severe acute malnutrition. Proc. Natl. Acad. Sci. USA, 112, 11941–11946
  3. Ma, Y., You, X., Mai, G., Tokuyasu, T. and Liu, C. (2018) A human gut phage catalog correlates the gut phageome with type 2 diabetes. Microbiome, 6, 24
  4. Roux, S., Enault, F., Hurwitz, B. L. and Sullivan, M. B. (2015) VirSorter: mining viral signal from microbial genomic data. PeerJ, 3, e985
  5. Ren, J., Ahlgren, N. A., Lu, Y. Y., Fuhrman, J. A. and Sun, F. (2017) VirFinder: a novel k-mer based tool for identifying viral sequences from assembled metagenomic data. Microbiome, 5, 69
  6. Amgarten, D., Braga, L. P. P., da Silva, A. M. and Setubal, J. C. (2018) Marvel, a tool for prediction of bacteriophage sequences in metagenomic bins. Front. Genet., 9, 304
  7. Roux, S., Faubladier, M., Mahul, A., Paulhe, N., Bernard, A., Debroas, D. and Enault, F. (2011) Metavir: a web server dedicated to virome analysis. Bioinformatics, 27, 3074–3075
  8. Rampelli, S., Soverini, M., Turroni, S., Quercia, S., Biagi, E., Brigidi, P. and Candela, M. (2016) ViromeScan: a new tool for metagenomic viral community profiling. BMC Genomics, 17, 165
  9. Wommack, K. E., Bhavsar, J., Polson, S. W., Chen, J., Dumas, M., Srinivasiah, S., Furman, M., Jamindar, S. and Nasko, D. J. (2012) VIROME: a standard operating procedure for analysis of viral metagenome sequences. Stand. Genomic Sci., 6, 427–439
  10. Wood, D. E. and Salzberg, S. L. (2014) Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol., 15, R46
  11. Kim, D., Song, L., Breitwieser, F. P. and Salzberg, S. L. (2016) Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome Res., 26, 1721–1729
  12. Truong, D. T., Franzosa, E. A., Tickle, T. L., Scholz, M., Weingart, G., Pasolli, E., Tett, A., Huttenhower, C. and Segata, N. (2015) MetaPhlAn2 for enhanced metagenomic taxonomic profiling. Nat. Methods, 12, 902–903
  13. Buchfink, B., Xie, C. and Huson, D. H. (2015) Fast and sensitive protein alignment using DIAMOND. Nat. Methods, 12, 59–60
  14. Fouts, D. E. (2006) Phage_Finder: automated identification and classification of prophage regions in complete bacterial genome sequences. Nucleic Acids Res., 34, 5839–5851
  15. Lima-Mendez, G., Van Helden, J., Toussaint, A. and Leplae, R. (2008) Prophinder: a computational tool for prophage prediction in prokaryotic genomes. Bioinformatics, 24, 863–865
  16. Akhter, S., Aziz, R. K. and Edwards, R. A. (2012) PhiSpy: a novel algorithm for finding prophages in bacterial genomes that combines similarity- and composition-based strategies. Nucleic Acids Res., 40, e126
  17. Arndt, D., Grant, J. R., Marcu, A., Sajed, T., Pon, A., Liang, Y. and Wishart, D. S. (2016) PHASTER: a better, faster version of the PHAST phage search tool. Nucleic Acids Res., 44, W16–W21
  18. Roux, S., Hallam, S. J., Woyke, T. and Sullivan, M. B. (2015) Viral dark matter and virus-host interactions resolved from publicly available microbial genomes. eLife, 4, e08490
  19. Paez-Espino, D., Pavlopoulos, G. A., Ivanova, N. N. and Kyrpides, N. C. (2017) Nontargeted virus sequence discovery pipeline and virus clustering for metagenomic data. Nat. Protoc., 12, 1673–1682
  20. Alipanahi, B., Delong, A., Weirauch, M. T. and Frey, B. J. (2015) Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol., 33, 831–838
  21. Zeng, H., Edwards, M. D., Liu, G. and Gifford, D. K. (2016) Convolutional neural network architectures for predicting DNA-protein binding. Bioinformatics, 32, i121–i127
  22. Quang, D. and Xie, X. (2019) FactorNet: a deep learning framework for predicting cell type specific transcription factor binding from nucleotide-resolution sequential data. Methods, 166, 40–47
  23. Wang, M., Tai, C., E, W. and Wei, L. (2018) DeFine: deep convolutional neural networks accurately quantify intensities of transcription factor-DNA binding and facilitate evaluation of functional non-coding variants. Nucleic Acids Res., 46, e69
  24. Zhou, J. and Troyanskaya, O. G. (2015) Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods, 12, 931–934
  25. Quang, D. and Xie, X. (2016) DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Res., 44, e107
  26. Kelley, D. R., Snoek, J. and Rinn, J. L. (2016) Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res., 26, 990–999
  27. Poplin, R., Chang, P.-C., Alexander, D., Schwartz, S., Colthurst, T., Ku, A., Newburger, D., Dijamco, J., Nguyen, N., Afshar, P. T., et al. (2018) A universal SNP and small-indel variant caller using deep neural networks. Nat. Biotechnol., 36, 983–987
  28. Zeng, H. and Gifford, D. K. (2017) Predicting the impact of non-coding variants on DNA methylation. Nucleic Acids Res., 45, e99
  29. Li, Y., Quang, D. and Xie, X. (2017) Understanding sequence conservation with deep learning. In: Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, pp. 400–406. ACM
  30. Li, Y., Shi, W. and Wasserman, W. W. (2018) Genome-wide prediction of cis-regulatory regions using supervised deep learning methods. BMC Bioinformatics, 19, 202
  31. Singh, S., Yang, Y., Poczos, B. and Ma, J. (2019) Predicting enhancer-promoter interaction from genomic sequence with deep neural networks. Quant. Biol., 7, 122–137
  32. Yue, T. and Wang, H. (2018) Deep learning for genomics: A concise overview. arXiv:1802.00810
  33. Lauring, A. S., Frydman, J. and Andino, R. (2013) The role of mutational robustness in RNA virus evolution. Nat. Rev. Microbiol., 11, 327–336
  34. Glenn, T. C. (2011) Field guide to next-generation DNA sequencers. Mol. Ecol. Resour., 11, 759–769
  35. World Health Organization. (2014) World Cancer Report 2014. Stewart, B., Wild, C. P., eds., IARC
  36. Hawk, E. T. and Levin, B. (2016) Colorectal cancer prevention. J. Clin. Oncol., 23, 378–391
  37. Feng, Q., Liang, S., Jia, H., Stadlmayr, A., Tang, L., Lan, Z., Zhang, D., Xia, H., Xu, X., Jie, Z., et al. (2015) Gut microbiome development along the colorectal adenoma-carcinoma sequence. Nat. Commun., 6, 6528
  38. Vogtmann, E., Hua, X., Zeller, G., Sunagawa, S., Voigt, A. Y., Hercog, R., Goedert, J. J., Shi, J., Bork, P. and Sinha, R. (2016) Colorectal cancer and the human gut microbiome: reproducibility with whole-genome shotgun sequencing. PLoS One, 11, e0155362
  39. Nakatsu, G., Li, X., Zhou, H., Sheng, J., Wong, S. H., Wu, W. K. K., Ng, S. C., Tsoi, H., Dong, Y., Zhang, N., et al. (2015) Gut mucosal microbiome across stages of colorectal carcinogenesis. Nat. Commun., 6, 8727
  40. Zeller, G., Tap, J., Voigt, A. Y., Sunagawa, S., Kultima, J. R., Costea, P. I., Amiot, A., Bohm, J., Brunetti, F., Habermann, N., et al. (2014) Potential of fecal microbiota for early-stage detection of colorectal cancer. Mol. Syst. Biol., 10, 766
  41. Lu, Y. Y., Chen, T., Fuhrman, J. A. and Sun, F. (2017) COCACOLA: binning metagenomic contigs using sequence Composition, read CoverAge, CO-alignment and paired-end read LinkAge. Bioinformatics, 33, 791–798
  42. Dutilh, B. E., Cassman, N., McNair, K., Sanchez, S. E., Silva, G. G., Boling, L., Barr, J. J., Speth, D. R., Seguritan, V., Aziz, R. K., et al. (2014) A highly abundant bacteriophage discovered in the unknown sequences of human faecal metagenomes. Nat. Commun., 5, 4498
  43. El-Gebali, S., Mistry, J., Bateman, A., Eddy, S. R., Luciani, A., Potter, S. C., Qureshi, M., Richardson, L. J., Salazar, G. A., Smart, A., et al. (2018) The Pfam protein families database in 2019. Nucleic Acids Res., 47, D427–D432
  44. Zheng, T., Li, J., Ni, Y., Kang, K., Misiakou, M.-A., Imamovic, L., Chow, B. K. C., Rode, A. A., Bytzer, P., Sommer, M., et al. (2019) Mining, analyzing, and integrating viral signals from metagenomic data. Microbiome, 7, 42
  45. Edwards, R. A., McNair, K., Faust, K., Raes, J. and Dutilh, B. E. (2016) Computational approaches to predict bacteriophage-host relationships. FEMS Microbiol. Rev., 40, 258–272
  46. Ahlgren, N. A., Ren, J., Lu, Y. Y., Fuhrman, J. A. and Sun, F. (2017) Alignment-free d2* oligonucleotide frequency dissimilarity measure improves prediction of hosts from metagenomically-derived viral sequences. Nucleic Acids Res., 45, 39–53
  47. Gouy, M. and Gautier, C. (1982) Codon usage in bacteria: correlation with gene expressivity. Nucleic Acids Res., 10, 7055–7074
  48. Sharp, P. M., Rogers, M. S. and McConnell, D. J. (1985) Selection pressures on codon usage in the complete genome of bacteriophage T7. J. Mol. Evol., 21, 150–160
  49. Pride, D. T., Wassenaar, T. M., Ghose, C. and Blaser, M. J. (2006) Evidence of host-virus co-evolution in tetranucleotide usage patterns of bacteriophages and eukaryotic viruses. BMC Genomics, 7, 8
  50. Carbone, A. (2008) Codon bias is a major factor explaining phage evolution in translationally biased hosts. J. Mol. Evol., 66, 210–223
  51. Ponsero, A. J. and Hurwitz, B. L. (2019) The promises and pitfalls of machine learning for detecting viruses in aquatic metagenomes. Front. Microbiol., 10, 806
  52. Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J. and Mané, D. (2016) Concrete problems in AI safety. arXiv:1606.06565
  53. Hendrycks, D. and Gimpel, K. (2017) A baseline for detecting misclassified and out-of-distribution examples in neural networks. In: Proceedings of International Conference on Learning Representations 2017. Toulon
  54. Lakshminarayanan, B., Pritzel, A. and Blundell, C. (2017) Simple and scalable predictive uncertainty estimation using deep ensembles. In: Proceedings of Advances in Neural Information Processing Systems, pp. 6402–6413
  55. Liang, S., Li, Y. and Srikant, R. (2017) Enhancing the reliability of out-of-distribution image detection in neural networks. arXiv:1706.02690
  56. Hendrycks, D., Mazeika, M. and Dietterich, T. G. (2018) Deep anomaly detection with outlier exposure. arXiv:1812.04606
  57. Shafaei, A., Schmidt, M. and Little, J. J. (2018) Does your model know the digit 6 is not a cat? A less biased evaluation of outlier detectors. arXiv:1809.04729
  58. Ren, J., Liu, P. J., Fertig, E., Snoek, J., Poplin, R., DePristo, M. A., Dillon, J. V. and Lakshminarayanan, B. (2019) Likelihood ratios for out-of-distribution detection. arXiv:1906.02845
  59. Ovadia, Y., Fertig, E., Ren, J., Nado, Z., Sculley, D., Nowozin, S., Dillon, J. V., Lakshminarayanan, B. and Snoek, J. (2019) Can you trust your model’s uncertainty? Evaluating predictive uncertainty under dataset shift. arXiv:1906.02530
  60. Nalisnick, E., Matsukawa, A., Teh, Y. W. and Lakshminarayanan, B. (2019) Detecting out-of-distribution inputs to deep generative models using a test for typicality. arXiv:1906.02994
  61. Kingma, D. P. and Ba, J. (2015) Adam: A method for stochastic optimization. In: Proceedings of the 3rd International Conference on Learning Representations. San Diego
  62. Minot, S., Sinha, R., Chen, J., Li, H., Keilbaugh, S. A., Wu, G. D., Lewis, J. D. and Bushman, F. D. (2011) The human gut virome: inter-individual variation and dynamic response to diet. Genome Res., 21, 1616–1625
  63. Roux, S., Brum, J. R., Dutilh, B. E., Sunagawa, S., Duhaime, M. B., Loy, A., Poulos, B. T., Solonenko, N., Lara, E., Poulain, J., et al. (2016) Ecogenomics and potential biogeochemical impacts of globally abundant ocean viruses. Nature, 537, 689–693
  64. Fang, Z., Tan, J., Wu, S., Li, M., Xu, C., Xie, Z. and Zhu, H. (2019) PPR-Meta: a tool for identifying phages and plasmids from metagenomic fragments using deep learning. GigaScience, 8, giz066



The research was supported by the U.S. National Institutes of Health (R01GM120624), the National Science Foundation (DMS-1518001), the National Natural Science Foundation of China (11701546), and the Simons Collaboration on Computational Biogeochemical Modeling of Marine Ecosystems (CBIOMES; grant ID 549943). We thank Drs. Michael S. Waterman, Gesine Reinert, Ying Wang, Rui Jiang, Yang Lu, Lizzie Dorfman, Mr. Weili Wang, and Mr. Luigi Manna for helpful discussions and suggestions. We thank the USC Center for High Performance Computing (HPC) for help with using their cluster computers.

Author information

Correspondence to Jie Ren or Fengzhu Sun.

Ethics declarations

The authors Jie Ren, Kai Song, Chao Deng, Nathan A. Ahlgren, Jed A. Fuhrman, Yi Li, Xiaohui Xie, Ryan Poplin and Fengzhu Sun declare that they have no conflicts of interest.

All procedures performed in studies were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards.

Additional information

Note to Related Work of Fang et al. [64]

A preliminary version of this manuscript was posted to arXiv (arxiv.org/abs/1806.07810) on June 20, 2018. While the manuscript was under submission to journals, Fang et al. [64] used deep learning to classify metagenomic fragments into chromosomal, viral and plasmid sequences. Using nucleotide base encoding, they obtained prediction accuracy for viruses similar to that presented in this paper. The two studies should be considered independent.

Author summary: We developed a reference-free and alignment-free machine learning method, DeepVirFinder, for identifying viral sequences in metagenomic data using deep learning. Sequences from viral and prokaryotic genomes are used to train the model. The neural network consists of a convolutional layer, a max pooling layer, and two dense layers that generate a prediction score between 0 and 1. DeepVirFinder outperformed the state-of-the-art method VirFinder at all contig lengths, achieving AUROC 0.93, 0.95, and 0.97 for 300, 500, and 1000 bp sequences, respectively, and it will greatly assist the study of viruses in the era of metagenomics.
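The layer sequence described in the author summary (convolution, max pooling, two dense layers, score between 0 and 1) can be illustrated with a small forward pass in plain Python. This is a sketch under stated assumptions, not the DeepVirFinder implementation: one-hot base encoding, ReLU activations, global max pooling over positions, a sigmoid output, the layer sizes, and the randomly initialized (untrained) weights are all assumptions, and every function name below is hypothetical.

```python
import math
import random

BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as rows of 4 values; unknown bases (e.g. N) become all zeros."""
    return [[1.0 if b == base else 0.0 for base in BASES] for b in seq.upper()]


def conv_max_pool(x, filters):
    """Slide each filter (width x 4 weights) along the sequence, apply ReLU,
    and keep only the largest activation per filter (global max pooling)."""
    pooled = []
    for f in filters:
        width = len(f)
        best = 0.0  # ReLU floor: negative activations are clipped to zero
        for start in range(len(x) - width + 1):
            act = sum(f[i][j] * x[start + i][j] for i in range(width) for j in range(4))
            best = max(best, act)
        pooled.append(best)
    return pooled


def dense(v, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * a for w, a in zip(row, v)) + b for row, b in zip(weights, bias)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(seq, params):
    """Convolution -> max pooling -> dense (ReLU) -> dense (sigmoid)."""
    pooled = conv_max_pool(one_hot(seq), params["filters"])
    hidden = [max(0.0, h) for h in dense(pooled, params["w1"], params["b1"])]
    (logit,) = dense(hidden, params["w2"], params["b2"])
    return sigmoid(logit)


def random_params(n_filters=4, width=5, n_hidden=3, seed=0):
    """Untrained weights for illustration only; sizes are arbitrary assumptions."""
    rng = random.Random(seed)

    def u():
        return rng.uniform(-0.5, 0.5)

    return {
        "filters": [[[u() for _ in range(4)] for _ in range(width)] for _ in range(n_filters)],
        "w1": [[u() for _ in range(n_filters)] for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "w2": [[u() for _ in range(n_hidden)]],
        "b2": [0.0],
    }


params = random_params()
score = predict("ACGTTGCAAGGCTTACGATCG", params)
print(0.0 < score < 1.0)
```

Because the pooling step keeps one value per filter regardless of sequence length, the same weights can score contigs of different lengths, which is one plausible reason to pool globally in this setting. With untrained weights the score itself carries no meaning; the sketch only shows how data flows through the layers.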

Cite this article

Ren, J., Song, K., Deng, C. et al. Identifying viruses from metagenomic data using deep learning. Quant Biol (2020). https://doi.org/10.1007/s40484-019-0187-4



  • metagenome
  • deep learning
  • virus identification
  • machine learning