Bioinformatic Approaches for Comparative Analysis of Viruses

  • Deyvid Amgarten
  • Chris Upton
Part of the Methods in Molecular Biology book series (MIMB, volume 1704)


The field of viral genomics has experienced an unprecedented increase in data volume. New strains of known viruses are constantly being added to the GenBank database, as are completely new species with little or no resemblance to the sequences already in our databases. In addition, metagenomic techniques have the potential to further increase both the number of sequenced genomes and the rate at which they are produced. It is also important to consider that viruses have a set of unique features that often break molecular biology dogmas, e.g., the flow of information from RNA to DNA in retroviruses and the use of RNA molecules as genomes. As a result, extracting meaningful information from viral genomes remains a challenge, and standard methods for comparing unknown sequences against our databases of characterized sequences may need to be modified. Several bioinformatic approaches and tools have therefore been created to address the challenge of analyzing viral data. In this chapter, we offer descriptions and protocols of some of the most important bioinformatic techniques for comparative analysis of viruses. We also discuss how the unique features of viruses can affect standard analyses and how to overcome some of the major sources of problems. Topics include: (1) clustering of related genomes, (2) whole-genome multiple sequence alignments for small RNA viruses, (3) protein alignments for marker genes, (4) analyses based on ortholog groups, and (5) taxonomic identification and comparison of viruses from environmental datasets.

Key words

Comparative analysis; Viral genomes; Virus; Genomics; Metagenomics; Viromes; Bioinformatics; Multiple sequence alignment; VOCs; BLAST; Ortholog groups



This work has been supported by grants #2014/16450-8 and #2015/14334-3, São Paulo Research Foundation (FAPESP) to D.A. and an NSERC Discovery grant to C.U.



Copyright information

© Springer Science+Business Media LLC 2018

Authors and Affiliations

  1. Department of Biochemistry, Institute of Chemistry, University of São Paulo, São Paulo, Brazil
  2. Department of Biochemistry and Microbiology, University of Victoria, Victoria, Canada
