Viral Metagenome Annotation Pipeline

Lorenzi, Hernan

doi:10.1007/978-1-4614-6418-1_693-4



Introduction

Viruses are the most abundant and diverse organisms on Earth, yet only a small fraction of the viral genome sequence space has been decoded. Based on analyses of environmental viral communities, it is estimated that only 1% of the existing viral diversity has been explored. For more than a century, cultivation of viruses has remained the gold standard for virus discovery and characterization. One major limitation of this approach is that, for most viral species, the hosts (predominantly microbes) are either unknown or cannot be grown in culture. Viral metagenomics (VM) circumvents this limitation by sequencing viral genetic material isolated directly from the environment. A typical viral metagenomics workflow is depicted in Fig. 1. Viral metagenomic methods have evolved significantly since their beginnings. Initially, they involved viral particle purification and enrichment from environmental samples, shearing of isolated nucleic acids followed by an optional cDNA...



Author information

Authors and Affiliations

Department of Informatics, The J. Craig Venter Institute, 9704 Medical Center Drive, Rockville, MD, 20850, USA
Hernan Lorenzi


Corresponding author

Correspondence to Hernan Lorenzi.

Editor information

Editors and Affiliations

J. Craig Venter Institute (JCVI), Rockville, Maryland, USA
Karen E. Nelson


About this entry

Cite this entry

Lorenzi, H. (2013). Viral Metagenome Annotation Pipeline. In: Nelson, K. (ed.) Encyclopedia of Metagenomics. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-6418-1_693-4


DOI: https://doi.org/10.1007/978-1-4614-6418-1_693-4
Received: 20 January 2013
Accepted: 20 January 2013
Published: 05 April 2014
Publisher Name: Springer, New York, NY
Online ISBN: 978-1-4614-6418-1
eBook Packages: Springer Reference Biomedicine and Life Sciences; Reference Module Biomedical and Life Sciences
