Detection of Bacteriophages: Sequence-Based Systems

  • Siân V. Owen
  • Blanca M. Perez-Sepulveda
  • Evelien M. Adriaenssens
Living reference work entry


The invention of sequencing technologies has fundamentally changed molecular biology, including the way we look at bacteriophages. In addition to investigating bacteriophage plaques, electron micrographs, or the phenotypes of different mutants, we are now able to explore the entire genetic potential encoded in their genomes. As of July 2018, over 6,000 complete phage genome sequences were published in public databases, with over 28,000 partial sequences available.

In this chapter, we give an overview of the latest technologies that can be used to determine phage genome sequences, ranging from short-read platforms, which generally yield multiple gigabases of sequence data, to long-read technologies, which have the potential to sequence a bacteriophage genome in a single read. We then look at applications of sequencing technologies in detecting bacteriophages, from a single gene, through entire genomes, to the community level.


The recent advances in sequencing technologies have profoundly changed our understanding of bacteriophage biology and genomics. Phage genomics, the study of the nucleic acid content or full gene complement of a phage, has demonstrated the remarkable genetic diversity of the phage world. This diversity is so high that phages with similar morphologies can have no nucleotide sequence similarity and very limited amino acid similarities of orthologous proteins (Krupovic et al. 2011). As a result, sequencing of specific phage marker genes or the entire genome is more accurate in detecting the exact variant or isolate of a phage than electron microscopic- or host range-based methods.

Sequencing is becoming increasingly important in phage research. For example, phage taxonomy has started to move away from a morphology-based classification, which relies on the presence or absence, type, and length of the tail, toward a genome-based strategy that more accurately reflects the full diversity present in the virosphere (Ackermann 2011; Adriaenssens and Brister 2017; Adriaenssens et al. 2018). In phage therapy (the use of phages as therapeutic agents to control bacterial infections), sequencing of the entire genome is encouraged to ensure that the phages in question do not encode potentially damaging genes (e.g., toxins, lysogeny-related genes, antibiotic resistance genes) (Alavidze et al. 2016; Abedon 2017). Sequencing has also added an extra dimension to viral ecology, offering greater resolution of viral diversity and unearthing viral dark matter, i.e., viral/phage sequence fragments and genes without any database homologs (Filée et al. 2005; Hatfull 2015).

In the first part of this chapter, we give an overview of available sequencing technologies for phage research, grouped into short-read platforms and long-read platforms (Table 1). For these platforms, we review the different methodological steps and discuss some of the advantages and disadvantages of each technology. While the information presented here was up to date at the time of writing, these technologies evolve rapidly, and it is highly likely that new platforms or optimized technological approaches will come to the forefront in the next few years. The second part of the chapter deals with the application of sequencing technology in the detection of bacteriophages. We have divided these applications according to level of complexity, from detecting a single gene of a phage, through sequencing the complete genome, to detection of prophages in bacterial genomes, and even to simultaneous detection of entire communities of phages.
Table 1

Overview of technologies used in phage sequencing

Sanger
  Read length (bp): 800–1,000 single read
  Sequencing principle: Capillary electrophoresis of labeled chain-terminating dideoxynucleotides
  Advantages: High accuracy and resolution at the nucleotide level
  Disadvantages: Low throughput; high sequencing cost per base

Illumina MiSeq
  Read length (bp): 150–350 paired-end reads
  Sequencing principle: Sequencing by synthesis, with detection of fluorescently labeled nucleotides
  Advantages: Low amount of starting material (ng) allows the sequencing of bacteriophages from single plaques; PCR-free library preparation reduces the introduction of sequencing errors; widely used, allowing access in most laboratories at a low cost per read
  Disadvantages: Nextera-based library preparation will not be able to sequence phage genome ends

Ion Torrent PGM
  Read length (bp): 300–400 paired reads
  Sequencing principle: Detection of protons released during DNA synthesis
  Advantages: Ability to detect more variants compared to Illumina MiSeq, but with more false positives; short sequencing run time

PacBio
  Read length (bp): Up to 60,000 single reads
  Sequencing principle: Real-time detection of labeled nucleotide incorporation by a tethered DNA polymerase
  Advantages: Long-read lengths allow repeat region resolution; additional data on DNA-chemical modifications is generated
  Disadvantages: Poor accuracy at single read level; high amount of starting material required (μg); large hardware cost, large machine

Oxford Nanopore
  Read length (bp): Up to 1,000,000 single reads
  Sequencing principle: Real-time detection of nucleotide-mediated current changes as nucleic acid molecule passes through a tethered nanopore protein
  Advantages: Long-read lengths allow repeat region resolution; additional data on DNA-chemical modifications can be generated; base calling in real time permits instant sequence interrogation (i.e., results are available as they are generated, not only once the machine has finished); low hardware cost, portable machine allows for anytime-anywhere sequencing; only technology that allows direct sequencing of RNA molecules
  Disadvantages: Poor accuracy at single read level; high amount of starting material required (μg)

Note: Sequencing yield was not considered as a factor because of the small genome sizes of phages

Overview of Sequencing Methods

Short-Read Platforms

Historically, bacteriophage sequencing has been performed using Sanger technology. This sequencing technology uses capillary electrophoresis to detect precisely the selective incorporation of labeled chain-terminating dideoxynucleotides during in vitro replication of DNA. However, this approach can be expensive, depends in many cases on cloning the phage DNA, and does not allow separation of contaminating host DNA sequence (Klumpp et al. 2012). High-throughput (or “next-generation”) short-read sequencing has become the preferred method for sequencing phages, mostly due to the low cost per base and high output compared to other platforms (Rihtman et al. 2016). Currently, popular platforms for sequencing bacteriophages include Illumina® MiSeq and Ion Torrent Personal Genome Machine (Ion PGM™). These sequencing platforms differ mostly in their chemistry but are generally based on the same three steps: library preparation, amplification, and sequencing of dsDNA. For RNA phages, cDNA needs to be generated by reverse transcription, either before the library preparation step or by using library preparation kits that are optimized for RNA sequencing. Such kits can provide information on whether the genome is made up of dsRNA or ssRNA and on the sense of the single-stranded RNA.

During library preparation, the extracted phage DNA is quality checked before fragmentation into random overlapping parts, which can be done by either mechanical (e.g., sonication) or chemical methods (e.g., the nuclease fragmentase (New England Biolabs)). The amount of input DNA for library preparation is a crucial factor, especially in the study of phages, where limiting factors are either phage propagation to large numbers of particles or varying genome sizes. Several kits (e.g., Nextera XT (Illumina)) have been designed to significantly decrease the required amount of input DNA (to 1 ng) and reduce hands-on time by using “tagmentation,” a process that fragments, size-selects, and tags the input DNA. These new kits increase sequencing efficiency as they produce greater output data in a shorter overall time (Marine et al. 2011). However, the use of Nextera kits comes with the trade-off that the defined ends of linear phage genomes will be missed; these kits are, therefore, not recommended when the aim of sequencing is to generate a high-quality reference genome (Kot et al. 2014).

After fragmentation, the pieces of DNA are size-selected and ligated with adaptors specific for each platform. During the following amplification step, the DNA is copied several times in its specific position, creating reaction centers or clusters that allow the sequencer to distinguish the input DNA from background noise. The amplification step is carried out by annealing to complementary adapters attached either to beads in micelle droplets (emulsion PCR; e.g., 454 pyrosequencing and Ion Torrent), to solid plates (e.g., Illumina sequencing), or by creating nanoballs that are then placed in a flow cell (Complete Genomics (BGI)). The most notable differences between platforms can be observed during the sequencing step, which can be by synthesis or instead by ligation. Platforms based on sequencing by synthesis include Illumina (formerly Solexa) which detects fluorescently labeled nucleotides; Ion Torrent, which detects a change in pH; and 454 pyrosequencing, which senses the amount of light generated due to pyrophosphate release during nucleotide incorporation. Alternatively, SOLiD is a platform based on sequencing by ligation of a labeled probe to the target DNA (Goodwin et al. 2016).

The development of high-throughput sequencing technologies decreased sequencing costs and time compared to construction of clone libraries (Loman et al. 2012). The 454 pyrosequencing platform, the first platform to be optimized for high-throughput phage sequencing, has been recently discontinued (Henn et al. 2010; Marine et al. 2011). Currently, the most widely used platform for phage sequencing is Illumina technology, with its benchtop sequencer, MiSeq, particularly popular for phages, and the high-throughput machine HiSeq used for bacterial genomes, eukaryotic genomes, and metagenomes. These machines can provide large amounts of high-quality sequencing data in a reduced time when compared to other technologies. Due to the small size of phage genomes relative to those of other organisms, and the large amount of data generated, short-read platforms, especially Illumina, are thus far the preferred method for bacteriophage metagenomics, genome sequencing, and re-sequencing or, most recently, for detecting termini and packaging mechanisms (Garneau et al. 2017).

Long-Read Platforms

Long-read sequencing represents the most recent and transformative development in sequencing technology, and these technologies are collectively referred to as third-generation sequencing (TGS). Whereas short-read sequencing platforms may generate sequence reads of up to 1 kb, the capability of long-read sequencing platforms is now approaching read lengths of 1 Mb (which is approximately twice the length of the longest known phage genome, at nearly 500 kb). Unlike short-read sequencing, which has become dominated by a single platform, two platforms utilizing very different technologies are used for long-read sequencing.

Pacific Biosciences (PacBio) was the first company to make long-read sequencing technology available to the mass market, with the release of its RS platform in 2011, shortly followed by the widely adopted RS-II platform in 2013. PacBio sequencing technology is comparable to Illumina short-read technology in that it is polymerase-dependent; sequence data result from the detection of base incorporation by a DNA polymerase enzyme onto a growing nucleotide chain. However, whereas Illumina sequencing relies on detection of base incorporation within a clonal population of concomitantly amplified DNA fragments, PacBio sequencers capture incorporation signals from single DNA molecules. This feat is made possible by physical anchoring of polymerase enzymes within narrow wells, which allows video recording of laser excitation of each fluorescently labeled nucleotide in direct contact with the anchored polymerase during DNA synthesis. Although the technology is intrinsically error-prone due to dependence on a polymerase enzyme and signal noise resulting from unincorporated nucleotides, a high degree of accuracy is achieved by using hairpin DNA adapters to create circular templates, which are sequenced continuously until polymerase function declines. Repetitive sequencing of the same DNA fragment allows random errors to be detected downstream and results in the output of high-quality sequence data. The maximum read length of a PacBio sequencer is dependent on the life of individual polymerase enzymes, thought to be between 10 and 60 kb (Rhoads and Au 2015). Sequencing takes place on SMRT Cells, which are chips containing 150,000 anchored polymerase wells. The second PacBio sequencing platform, Sequel, was released in 2015 and increased the capacity of each sequencing run nearly seven times (Sequel SMRT Cells contain one million polymerase wells), generating 5–10 Gb of data per run.

PacBio sequence library preparation protocols require double-stranded DNA, and therefore this technology is only directly applicable to dsDNA phages. There are currently no reports of PacBio technology being used to sequence ssDNA or RNA phage genomes. However, RNA may be reverse-transcribed into cDNA for sequencing on a PacBio instrument (Tseng and Underwood 2013), and a second-strand DNA synthesis step may facilitate the sequencing of ssDNA phage genomes.

A frequently cited advantage of PacBio technology is the additional generation of chemical modification data along with sequence data (Rhoads and Au 2015). DNA-chemical modifications, such as the addition of methyl groups to cytosine residues (DNA methylation), cause characteristic kinetic changes in nucleotide incorporation rates by the polymerase and can be detected by automated analysis of the kinetic pattern of the DNA synthesis reaction. Identification of DNA modifications can be of particular interest for bacteriophages, as certain phage groups are known to incorporate modified bases into their genomes, potentially involved in escaping host restriction modification systems (Klumpp et al. 2010; Adriaenssens et al. 2012; Lee et al. 2018).

The second long-read sequencing technology in widespread use is being developed by Oxford Nanopore Technologies (ONT), and their prototype platform, MinION, was released in 2014 (Ip et al. 2015). Unlike the majority of DNA sequencing technologies, ONT does not detect nucleotide addition during DNA synthesis but instead directly detects the nucleotide composition of a single-stranded DNA or RNA molecule. The technology employs anchored pore proteins (nanopores), each under an electric current. Nucleic acid molecules are threaded and natively transported through the pore protein through the action of a coupled motor protein, and each nucleotide on the molecule causes characteristic and detectable current disruptions which are translated into sequence. ONT uses nucleic acid adaptors and tethers to facilitate the threading of single-stranded molecules into individual nanopore proteins. Individual reads of ~200,000 bp are reported (Ip et al. 2015), but reads of >1,000,000 bp have been reported anecdotally. On the MinION platform, sequencing takes place on flow cells harboring 512 nanopore channels and capable of generating 10–20 Gb of sequence data. ONT has recently released two high-throughput sequencing platforms based on the same technology as the MinION, the GridION and the PromethION, which run multiple flow cells or use flow cells with an increased number of nanopore channels. Though commercial ONT sequencing is one of the purported applications of the larger GridION and PromethION platforms, at the time of writing, ONT is not commercially available at the majority of sequencing facilities, and currently the primary users of ONT are research labs in possession of MinION devices.

The primary advantage of ONT is the portability and usability of the platforms. All platforms can be run on benchtops, using very simple DNA library preparation protocols, and sequence data can be interpreted in real time. Indeed, the portability of ONT sequencing technology was recently demonstrated by the sequencing of the phage lambda genome using a MinION device onboard the International Space Station (Castro-Wallace et al. 2017). Consequently, the obvious application of ONT is circumstances where rapid outputs are required such as in field or diagnostic settings. A major limitation of ONT sequencing is that it requires a high amount of input DNA, which may limit its use in low-biomass environments such as the ocean virome or for phages which are difficult to amplify. On the other hand, there may be an application for ONT technologies in high-biomass environments such as fecal samples. Perhaps ONT sequencing could be used to monitor the real-time abundance and sequence variation of phages and bacteria during phage therapy trials of gastrointestinal pathogens such as Clostridium difficile.

Furthermore, ONT is the only sequencing platform that permits direct sequencing of RNA molecules (as opposed to reverse transcription of RNA into cDNA for sequencing). Like PacBio technology, detection of chemically modified nucleotides is made possible by analysis of signature current changes (Rand et al. 2017). ONT may therefore be highly applicable to the sequencing of RNA phage genomes, and the technology has recently been used to sequence the native RNA genome of an influenza A virus for the first time, identifying chemical modifications that were undetected using cDNA-sequencing strategies (Keller et al. 2018).

A disadvantage of both long-read sequencing technologies is the poor accuracy of single reads. As single reads from both technologies represent the sequencing of single molecules, and therefore single-base calls, random error is frequent. Reads from the ONT MinION platform have been reported to have an error rate of 38.2% (Laver et al. 2015), and the error rate for single PacBio reads has been reported to be 11–15% (Rhoads and Au 2015). Therefore, some caution must be taken when utilizing these technologies for applications where accuracy is crucial, such as de novo genome assembly. Provided the input DNA is sequenced in enough depth, i.e., each nucleotide in the sequence is represented on multiple independent reads, erroneous base calls can be eliminated by consensus calling. This process is termed “read polishing,” and increased accuracy can also be achieved by combining the sequencing outputs from long- and short-read sequencing technologies to yield “hybrid” assemblies (Phillippy 2017).
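
The idea of eliminating random errors by consensus calling can be illustrated with a simple majority vote per alignment column. The following is only a minimal sketch: it assumes the reads have already been aligned to equal length with "-" marking gaps, the function name and the toy reads are hypothetical, and real polishing tools use considerably more sophisticated models than a per-column vote.

```python
from collections import Counter


def consensus(aligned_reads: list[str], gap: str = "-") -> str:
    """Majority-vote consensus over reads that are already aligned column by column."""
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("all aligned reads must have the same length")
    bases = []
    for column in zip(*aligned_reads):
        # Gaps do not vote in this simplified model.
        counts = Counter(base for base in column if base != gap)
        if not counts:
            continue  # every read has a gap in this column
        bases.append(counts.most_common(1)[0][0])
    return "".join(bases)


# Toy example: five error-containing reads covering the same 8 bp region.
reads = [
    "ACGTTGCA",
    "ACGTAGCA",  # substitution in column 5
    "ACCTTGCA",  # substitution in column 3
    "ACGTTGCA",
    "ACGTTG-A",  # deletion in column 7
]
print(consensus(reads))  # ACGTTGCA
```

Each individual error is outvoted because it occurs in only one of the five reads, which is why sufficient sequencing depth matters for these platforms.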

The primary advantage of long-read sequencing technologies such as PacBio and ONT over short-read technology such as Illumina is the ability to generate reads spanning repetitive sequence regions, thereby greatly improving DNA sequence assembly. Though large repetitive regions, such as sequence duplications and transposable elements, are not common features of phage genomes, there are instances where short-read technology alone is insufficient to complete phage genomes. For example, myoviruses infecting bacteria of the genus Bacillus can exhibit heavily chemically modified DNA which impedes routine short-read sequencing strategies (Klumpp et al. 2010). PacBio sequencing has been used to complete several of these myovirus genomes, showing that it can be a useful alternative for difficult-to-sequence phages (Klumpp et al. 2014). Restriction modification systems are known to be important in the interaction between phages and their bacterial hosts (Adams and Burdon 1985), and therefore the secondary use of long-read sequencing technologies to map chemical genome modifications may be an important application of long-read sequencing technology to phage biology in the future.

A Note on Sequence Data and Sharing

The platforms described above generate multiple gigabytes of data that need to be processed for quality, assembled and analyzed, but where possible should also be shared to improve reproducibility and promote open science.

The most appropriate way for data sharing is through the International Nucleotide Sequence Database Collaboration (INSDC) which links the three main international sequence data organizations, DDBJ (DNA Data Bank of Japan), EMBL-EBI (European Molecular Biology Laboratory-European Bioinformatics Institute), and NCBI (National Center for Biotechnology Information) (Karsch-Mizrachi et al. 2012). Submissions only need to be made to a database of one of these organizations to be shared across all. For the DDBJ and NCBI, unassembled sequencing reads should be deposited in the Sequence Read Archive, shortened to DRA (DDBJ) or SRA (NCBI) (Leinonen et al. 2011b). The associated metadata for studies and samples are collected as BioProjects and BioSamples, respectively (Barrett et al. 2012). The assembled and annotated complete phage genomes can be deposited in GenBank (Benson et al. 2013) or DDBJ annotated sequence submissions. For EMBL-EBI, the European Nucleotide Archive (ENA) accepts all types of sequences and associated metadata through the same submission portal (Leinonen et al. 2011a). Any genomes deposited and released through the above-described resources will become a part of publicly available databases that can be searched through the Basic Local Alignment Search Tool (BLAST) (Johnson et al. 2008).

In recent years, NCBI has developed virus-specific tools that can be used for phage sequence analysis. The NCBI Viral Genomes Resource offers specialized resources for the analysis of phage genomes, such as curated reference genome databases, sequence comparison tools, protein clusters, and custom downloads (Brister et al. 2015).

Applications of Sequencing-Based Detection Methods

Sequencing-based detection of bacteriophages is possible at different levels of complexity (Fig. 1). In this section, we discuss four specific applications of sequencing-based methods: single-gene amplicon sequencing, phage genome sequencing, prophage detection, and metagenomic sequencing of viral communities.
Fig. 1

Overview of possible applications of sequencing-based detection of bacteriophages at different levels of complexity. Sequencing discussed in this chapter describes, in order of rising complexity, the single-gene level, complete phage genome sequencing, prophage detection in bacterial genomes, and metagenomes. More complex systems to be sequenced require higher sequencing yields. Sanger sequencing, therefore, is only appropriate for gene-level detection or sequencing of very small genomes. Illumina MiSeq (Ion Torrent PGM) is ideally suited for phage genome sequencing, which can be complemented with long-read technology for detection of a phage genome in a single read. Prophage detection requires bacterial genome sequencing and specific computational identification tools. Metagenomic-based detection of phages requires high sequencing yields and often has low-input biomass, requiring experiment-specific decisions on the sequencing platform to use (e.g., Illumina HiSeq giving higher yields than MiSeq)

Gene-Based Detection of Bacteriophages

Perhaps the simplest way of detecting bacteriophages is to amplify a gene or gene fragment of the bacteriophage in question and then determine its sequence and taxonomic affiliation, a method called amplicon sequencing. This can be straightforward if the target bacteriophage has a published genome in an INSDC database. Then it is simply a matter of designing primers in a unique location of the genome, performing a PCR amplification and sequencing the PCR fragment. But it is also possible to detect “unknown” phages in a sample by amplifying a signature or marker gene. Unfortunately, there is no such thing as a universal viral/phage marker gene, comparable to the 16S rRNA gene in bacteria or the 18S rRNA gene in eukaryotes (Rohwer and Edwards 2002), which can be used to screen for any bacteriophage. There have been, however, primers developed that amplify signature genes targeting specific phage groups (Adriaenssens and Cowan 2014). In many cases, these are specific for a small group of viruses within a family of tailed phages.
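
Primer placement for a known target genome can be checked in silico before any bench work. The following is a minimal sketch, assuming exact primer matches on the forward strand of a linear template; the function names, the toy template, and the primers are hypothetical and are not taken from any published phage genome or primer set.

```python
from typing import Optional

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def in_silico_pcr(template: str, fwd: str, rev: str) -> Optional[str]:
    """Return the predicted amplicon (primers included), or None.

    Both primers must match the template exactly and only once, and the
    reverse primer site must lie downstream of the forward primer.
    """
    rev_site = reverse_complement(rev)  # reverse primer as seen on the top strand
    # str.count ignores overlapping matches, which is acceptable for this toy check.
    if template.count(fwd) != 1 or template.count(rev_site) != 1:
        return None
    start = template.find(fwd)
    rev_start = template.find(rev_site)
    if rev_start < start + len(fwd):
        return None
    return template[start:rev_start + len(rev_site)]


# Hypothetical 38 bp template and 8-mer primers, for illustration only.
template = "TTGACCATGCGTAAGGCTTACGATCGGATCCTTAGCAA"
fwd_primer = "CCATGCGT"
rev_primer = "AAGGATCC"

amplicon = in_silico_pcr(template, fwd_primer, rev_primer)
print(amplicon, len(amplicon))
```

Requiring a unique binding site for each primer mirrors the requirement that primers be designed in a unique location of the genome.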

Before the rise of short-read sequencing technology, PCR fragments used to be cloned into plasmid vectors, and the single inserts were sequenced with plasmid primers using Sanger sequencing. Currently, sequencing of amplicons representing diverse communities can be done with any of the sequencing platforms described above, but longer reads will lead to better resolution of viral diversity. Processing of the sequencing results is generally performed by clustering sequencing reads into operational taxonomic units (OTUs) based on a threshold similarity score (90–99% identity) or by direct sequence variant comparison. Several pipelines are available for the analysis of amplicon data dealing with quality control, removal of chimeric reads, and OTU clustering, such as MOTHUR (Schloss et al. 2009) and QIIME (Caporaso et al. 2010). The resulting OTUs can then be assigned to a taxonomic group and further investigated using phylogenetics.
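
Clustering reads into OTUs at a similarity threshold can be illustrated with a greedy centroid approach. This is a simplified sketch and not the algorithm implemented in MOTHUR or QIIME: it assumes amplicon reads that are already trimmed and aligned to the same length, and the function names and toy sequences are hypothetical.

```python
def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length, pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_otus(seqs: list[str], threshold: float = 0.97) -> list[list[str]]:
    """Greedy clustering; the first member of each cluster serves as its centroid."""
    clusters: list[list[str]] = []
    for seq in seqs:
        for cluster in clusters:
            if identity(seq, cluster[0]) >= threshold:
                cluster.append(seq)
                break
        else:
            # No existing centroid is similar enough, so start a new OTU.
            clusters.append([seq])
    return clusters


# Toy 20 bp amplicons: s2 differs from s1 at 1 position, s3 at 6 positions.
s1 = "ACGTACGTACGTACGTACGT"
s2 = "ACGTACGTACGTACGTACGA"
s3 = "TTGTACGAACGTTCGTACCA"

print(len(cluster_otus([s1, s2, s3], threshold=0.90)))  # 2
print(len(cluster_otus([s1, s2, s3], threshold=0.97)))  # 3
```

The same three reads give two OTUs at 90% identity and three at 97%, which shows how the chosen threshold directly shapes the apparent diversity.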

In the following section, we will give an overview of the phage groups which have been detected in previous studies using PCR amplification and sequencing.

Cyanophages

Most phage amplicon studies have been targeted toward discovering the diversity of cyanophages in the environment, i.e., phages infecting cyanobacteria. The signature genes which have been used successfully in the past include structural genes, such as the portal protein (Fuller et al. 1998; Zhong et al. 2002; Short and Suttle 2005) or major capsid protein of myoviruses (Baker et al. 2006), or metabolism-related genes such as the ones coding for photosystem II proteins psbA and psbD (Zeidner et al. 2003; Millard et al. 2004; Clokie et al. 2006; Sullivan et al. 2006; Wang and Chen 2008) or the ATPase phoH (Marston and Amrich 2009; Goldsmith et al. 2011). Some of these marker genes target more specific or less diverse groups than others, for example, the structural protein markers target only a subgroup of phages belonging to the Myoviridae, whereas phoH is present in 40% of cultured marine phages, in certain eukaryotic viruses and in some phages infecting enteric bacteria.

T4-Like Phages

The type isolate for this group of phages with contractile tails, Escherichia phage T4, is an iconic phage with a long history in molecular biology. Its major capsid protein (gp23) has been the basis for most of the primer sets (Filée et al. 2005; Comeau and Krisch 2008; Marston and Amrich 2009; Chow and Fuhrman 2012). Originally used in the marine environment, these gp23 amplicons have been found around the globe in, for example, rice paddy soil (Fujii et al. 2008; Wang et al. 2009a, b), freshwater lakes in Russia (Butina et al. 2010), and even in Antarctica (López-Bueno et al. 2009) and an Arctic glacier (Bellas and Anesio 2013). The detection of this very diverse group using amplicons overlaps with cyanophage detection as many cyanophages with myovirus morphology are related to T4.

T7-Like Phages

Detection of T7-like phages belonging to the family Podoviridae can be done by targeting the DNA polymerase (polA) gene, with at least nine primer sets published so far (Breitbart et al. 2004; Labonté et al. 2009; Chen et al. 2009; Dekel-Bird et al. 2013). Partial sequences have been detected in a range of habitats around the world, including marine, freshwater, and terrestrial (Breitbart et al. 2004; Chen et al. 2009; Huang et al. 2010).


ssDNA phages belonging to the subfamily Gokushovirinae, family Microviridae, have been recently shown to be ubiquitous across habitats and geographic regions, based on amplification of the major capsid protein gene (Hopkins et al. 2014; Székely and Breitbart 2016). Using this gene, bloom-bust patterns and fluctuations in microvirus abundance were discovered in two freshwater lakes in France (Zhong et al. 2015).

Other Potential Signature Genes

There are other genes which are conserved among groups of phages, but which have not yet been investigated using gene-based sequencing approaches. A great resource to find a signature gene for a phage group of interest is the prokaryotic virus orthologous groups (pVOGs) database, formerly known as phage orthologous groups (POGs) (Kristensen et al. 2013; Grazziotin et al. 2017). This database comprises protein clusters (pVOGs) for all sequenced bacterial and archaeal viruses and can therefore be used to identify shared genes within any taxon of interest. For example, the second largest pVOG, VOG4544 terminase large subunit, is found in 83% of the tailed phages (order Caudovirales) (Grazziotin et al. 2017) and has been used in many phylogenetic analyses from isolates to virome studies (Sullivan et al. 2009; Roux et al. 2014). RNA phages all encode an RNA-dependent RNA polymerase, which can act as a signature gene, but the published primer sets have so far only recovered RNA viruses infecting eukaryotes (Culley et al. 2003; Culley and Steward 2007). A helpful tool for the choice of signature genes and primer design for a phage group of interest is PhiSigns, which is available both as a stand-alone tool and as a web-based application (Dwivedi et al. 2012).
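
The selection of a candidate signature gene from orthologous group data amounts to ranking groups by how many genomes of the taxon of interest carry them. The following is a minimal sketch under stated assumptions: the presence/absence table, the genome names, and the group labels are invented for illustration, and no pVOGs or PhiSigns data or interfaces are used.

```python
from collections import Counter


def group_prevalence(genome_to_groups: dict[str, set[str]]) -> dict[str, float]:
    """Fraction of genomes in which each orthologous group occurs at least once."""
    n_genomes = len(genome_to_groups)
    counts = Counter(g for groups in genome_to_groups.values() for g in groups)
    return {group: count / n_genomes for group, count in counts.items()}


def candidate_markers(genome_to_groups: dict[str, set[str]],
                      min_prevalence: float = 0.8) -> list[str]:
    """Groups at or above the prevalence cutoff, most widespread first."""
    prevalence = group_prevalence(genome_to_groups)
    hits = [g for g, p in prevalence.items() if p >= min_prevalence]
    return sorted(hits, key=lambda g: (-prevalence[g], g))


# Hypothetical presence/absence data for five phage genomes.
toy_table = {
    "phage_A": {"terL", "mcp", "portal"},
    "phage_B": {"terL", "mcp"},
    "phage_C": {"terL", "portal", "psbA"},
    "phage_D": {"terL", "mcp", "portal"},
    "phage_E": {"terL", "mcp"},
}

print(candidate_markers(toy_table))  # ['terL', 'mcp']
```

A high prevalence cutoff favors broadly conserved genes such as the terminase large subunit, whereas a lower cutoff would surface markers restricted to subgroups.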

Genome Sequencing

The year 2017 marked 40 years since the first bacteriophage, the ssDNA phage ɸX174, was sequenced (Sanger et al. 1977), followed shortly after by the genomes of the reference phages lambda and T7 (Sanger et al. 1982; Dunn et al. 1983). Due to their small sizes, ranging from approximately 3.5 kb (Friedman et al. 2009) to nearly 500 kbp (Hatfull and Hendrix 2011), phage genomes were sequenced well in advance of the first bacterial genome, in 1995 (Fleischmann et al. 1995). However, the number of complete, i.e., finished or entirely sequenced genomes, and “whole genome shotgun” sequences available in public databases is higher for bacteria than for phages, despite the genome size differences that make the latter easier to sequence and assemble. Still, sequencing of novel phage genomes has increased greatly during the last two decades (Adriaenssens and Brister 2017).

The detection of bacteriophages based on their genome sequences could only be achieved after having a repertoire of different completed or close to completion genomes. Likewise, sequencing bacteriophage genomes is crucial to broadening knowledge of their biology, such as metabolic processes and interactions with their host and environment. One of the main drawbacks at present is therefore the small number of phage genomes available, which represents only a very small fraction of the predicted diversity of bacteriophages (Perez Sepulveda et al. 2016). Despite the relatively small number of sequenced phage genomes, their sequences have contributed significantly to the discovery of several phage genes directly involved in the host metabolism (Zeidner et al. 2003; Millard et al. 2004, 2009, 2010; Sullivan et al. 2006; Wang and Chen 2008; Sabehi et al. 2012; Chan et al. 2015). These auxiliary metabolic genes (AMGs) can play important roles in redirecting host metabolism by, for example, guiding carbon flux to the biosynthesis of deoxynucleotides through the pentose phosphate pathway and hence favoring bacteriophage replication (Thompson et al. 2011). As mentioned before, detection of bacteriophages using marker genes can only be used for a limited number of bacteriophages, but with a fuller understanding of their genomes, either the range of markers can be extended or potentially the relatively small genomes themselves can be used as markers, allowing detection of supposedly “unculturable” bacteriophages.

Obtaining bacteriophages for sequencing normally involves the infection of suitable hosts by “plaque assay,” in which individual plaques (i.e., visible clearings of bacterial culture that represent the lysis of the host by replication of bacteriophages; see “Detection of Bacteriophage: Plaques and Plaque Assay”) are picked and propagated further to increase phage biomass. Bacteriophages may be further concentrated by methods such as polyethylene glycol precipitation and/or cesium chloride density gradient centrifugation. Nucleic acids are then extracted and purified for sequencing using standard methods, such as phenol-chloroform phase separation or the use of commercial kits. The dependence on the use of a sensitive host bacterium has limited the approach to sequencing only those bacteriophages capable of infecting the small proportion of bacteria that can be grown under laboratory conditions, even if those bacteria do not necessarily represent the “primary” host. This has become a problem for environments where not many hosts can be cultured. An example of this phenomenon is presented in a study by Brum and colleagues, who found only 39 genomes that could be associated with “cultured” bacteriophages compared to an estimated total of 5,476 different dominant bacteriophage populations in the upper ocean (Brum et al. 2015).

Another issue with this method of culturing phages is that a plaque can contain more than one phage, which can be due to spontaneous induction of host prophages (Henn et al. 2010; Cowley et al. 2015). Having multiple phages in a sample can cause incorrect assemblies and misinterpretation of genomes. In recent years, methods have been developed and optimized to sequence phage genomes extracted from single plaques, reducing the input material and costs associated with library preparation (Kot et al. 2014; Baym et al. 2015). In line with these advances, Rihtman and colleagues demonstrated successful Illumina high-throughput sequencing and assembly of samples containing multiple bacteriophages; combining multiple genomes per library preparation allows for cost-effective nucleic acid extraction and sequencing (Rihtman et al. 2016). As a general rule, a read coverage (i.e., the average number of sequencing reads covering each base of the genome) of 30X is adequate for successful assembly of a genome. Because the properties of different bacteriophages vary, a coverage of 100X has been suggested for obtaining a reliable assembly when multiplexing genomes. It is worth noting, however, that the efficiency described when sequencing multiple bacteriophages per library does not necessarily apply to closely related, i.e., similar, bacteriophages. Thus, to avoid errors caused by undetected mis-assemblies, it is recommended to multiplex only genomes obtained from different hosts.
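The coverage figures above can be turned into a rough estimate of how many reads to request. The following is a minimal sketch, assuming uniform coverage and equal representation of each genome in a pool; the genome size and read length are hypothetical values chosen for illustration and do not come from the studies cited.

```python
import math


def mean_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of reads covering each base (the 'X' in 30X)."""
    return n_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Smallest read count that reaches the target mean coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


# Hypothetical example: a 50 kb phage genome sequenced with 150 bp reads.
genome_size = 50_000
read_length = 150

# Reads for a single genome at the general 30X guideline.
single = reads_needed(30, read_length, genome_size)

# Reads per genome at the 100X suggested when multiplexing genomes.
multiplexed = reads_needed(100, read_length, genome_size)

print(single, multiplexed)
```

In practice coverage is rarely uniform and pooled genomes are rarely equimolar, so such numbers are better read as lower bounds than as targets.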

The study of bacteriophage genomes has increased the knowledge of their structure and composition, which in turn allows the development and design of novel methods for detection, including new bioinformatic tools. Phage genomes are normally free from complex sequences that can massively affect genome assembly, such as transposable elements and repetitive sequences (i.e., gene duplications or variable-number tandem repeats). However, re-sequencing of cultured bacteriophages using different sequencing technologies can help to address the potential issues derived from sequence complexity, allowing increased accuracy and better discrimination of correct assemblies by including more data. Additionally, re-sequencing of bacteriophages decreases the coverage needed for proper assembly and provides significant information regarding the evolution of bacteriophage genomes (Puxty et al. 2015).

Sequence-Based Identification of Prophages

A prophage represents a stage in the life cycle of a temperate phage wherein the phage genetic material is transmitted vertically with that of the bacterial host, either integrated into the chromosome or existing as a low-copy number plasmid (see “Lysogeny”). Prophages have been shown to be important elements of horizontal gene transfer which can contribute significantly to bacterial niche adaptation and population dynamics (Bossi et al. 2003; Fortier and Sekulovic 2013) (see “Transduction”). In particular, prophages have played a crucial role in the evolution of some notable bacterial pathogens, by encoding toxins and virulence factors, such as the Stx toxin of Shiga-toxigenic Escherichia coli, the cholera toxin of Vibrio cholerae, and the C1 neurotoxin of Clostridium botulinum (Brüssow et al. 2004). Consequently, the identification of prophages is an important application of sequencing technologies.

Because temperate phages spend part of their life cycle within the genomes of bacteria, temperate phage genome sequencing is a natural by-product of the sequencing of bacterial genomes, and no specific methodological considerations are necessary to sequence bacterial genomes containing prophages. It is likely, therefore, that temperate phages are the most deeply sequenced of all phages. In fact, given that bacteria typically have multiple prophage sequences incorporated into their genomes (Casjens 2003), it could be argued that more genome sequences exist for temperate phages than for their bacterial hosts and therefore for any other group of organisms on the planet. However, the challenge in accessing this wealth of temperate phage genome sequence data lies in identifying prophages in bacterial genome sequences.

Analogous to identifying bacteriophages from the environment using gene markers, temperate phages can be identified from within bacterial sequence space using sequence markers. A number of computational tools have been developed to mine bacterial genome datasets for prophage sequences including Phage_Finder (Fouts 2006), Prophage Finder (Bose and Barber 2006), Prophinder (Lima-Mendez et al. 2008), PHAST (Zhou et al. 2011), PhiSpy (Akhter et al. 2012), VirSorter (Roux et al. 2015), and PHASTER (Arndt et al. 2016). All these tools rely on characteristics of prophage genome sequences, such as identification of sequences with homology to characterized phage genes and identification of direct repeats corresponding to phage attachment (att) sites or regions of DNA with differential GC skew, protein length, or transcription strand directionality.
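One of the compositional signals mentioned above, differential GC skew, can be illustrated with a sliding-window calculation. This is a minimal sketch on a toy sequence, not the implementation used by any of the tools listed; the function names, window size, and step size are assumptions made for illustration, and real tools combine several signals rather than relying on skew alone.

```python
def gc_skew(seq: str) -> float:
    """GC skew (G - C) / (G + C) of a sequence; 0.0 if it has no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def windowed_gc_skew(seq: str, window: int, step: int) -> list[tuple[int, float]]:
    """Return (start, skew) for each full window along the sequence."""
    return [
        (start, gc_skew(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


# Toy sequence: a G-rich stretch followed by a C-rich stretch, standing in
# for a host region next to an inserted element with a different skew.
toy = "GGGGAGGGGA" * 2 + "CCCCTCCCCT" * 2

profile = windowed_gc_skew(toy, window=20, step=20)
print(profile)
```

An abrupt change in skew between neighboring windows, as in the toy sequence, is the kind of local discontinuity that can flag a candidate region for closer inspection.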

Unfortunately, there are currently several limitations to these approaches. Firstly, most of the listed prophage identification tools are designed to be run on complete (single-contig) bacterial genome sequences. This is because the modular nature of prophages makes it impossible to accurately predict which prophage-containing contigs belong together when multiple contigs are present. Prophage prediction tools rely heavily on the locality of prophage-signature sequences, but this locality is likely to be arbitrary in unfinished, multi-contig assemblies. Though one of the recent prophage prediction tools, PHASTER, is able to handle multi-contig files, a single functional prophage that assembles as two contigs would likely be assigned as two “incomplete” prophages. As the vast majority of bacterial genome sequence data exists as unfinished, multi-contig assemblies, accessing the prophage content of these genomes remains challenging.

A second limitation lies in inferring functionality of the prophage. Prophage sequences within bacterial genomes can exist in a variety of functional states, from fully functional (capable of induction and replication), to defective but capable of resuscitation, to extremely degraded, representing a host-domesticated prophage remnant. Distinguishing between functional, nonfunctional, and incomplete prophage remnants in silico is extremely difficult. This task is further complicated by the issue of multi-contig genome assemblies discussed above. How can a remnant prophage island be computationally distinguished from a functional prophage disrupted across separate contigs in an unfinished bacterial genome assembly? Even in complete, single-contig assemblies, mutations that inactivate prophage function can be as subtle as single noncoding SNPs (Owen et al. 2017), making the accurate computational prediction of prophage function virtually impossible.

In addition to sequencing prophages together with their bacterial hosts, other methods to sequence temperate phages exist. Functional temperate phages may be easily sequenced by amplification on a sensitive host, virion concentration, and purification, as has been described for phage genome sequencing above. Even when putative temperate phages cannot be replicated, for example, if a sensitive host strain is not available, prophage induction (i.e., using SOS response-inducing agents such as mitomycin C, norfloxacin, or nalidixic acid, or UV-light exposure) may be used to trigger the formation of phage particles (Raya and Hébert 2009) (see “Lysogeny”). Phage particles can then be purified from the culture supernatant for sequencing. This makes it possible to determine which of multiple prophage-like sequences present in a genome can be induced to form phage particles.

An important consideration when sequencing induced phages from culture supernatant is the removal of contaminating bacterial chromosomal DNA. Thorough DNase treatment procedures should therefore be undertaken, such that non-prophage chromosomal DNA can no longer be detected in the sample by PCR, to ensure that only virion-encapsulated phage DNA is sequenced. The advantage of this technique is that it may allow temperate phages to be identified in previously uncharacterized bacteria, in which little functional information can be obtained based on homology to known genes. Sequencing of protein-encapsulated DNA would provide convincing evidence for the existence of novel prophages, even if the sequence had no homology to known phages.

Finally, methods have been developed to selectively enrich DNA samples for phages that exist as extrachromosomal elements, for example, as circular or linear plasmids. Certain bacterial genera such as Chlamydia and Borrelia, which are associated with small genome sizes, have been found to harbor extrachromosomal prophage plasmids rather than integrated prophages (Casjens 2003). A method to selectively enrich genomic DNA samples for extrachromosomal prophage elements in Staphylococcus by the use of plasmid purification kits has been reported (Utter et al. 2014). However, the increasing adoption of long-read sequencing technology may represent a more effective way to investigate extrachromosomal prophage elements.

Metagenomics-Based Detection of Bacteriophages

It is possible to detect phages in any environment without prior knowledge of their genomes using shotgun metagenomics or, more specifically, metaviromics. In its most basic form, metagenomics is the sequencing of all the nucleic acid extracted from an environment (or any sample of interest). For phages, and viruses in general, metagenomics protocols are amended to account for the smaller genome sizes of bacteriophages compared to bacteria, fungi, and protists. In many low-biomass environments, a concentration step is necessary to reduce the volume of aquatic input material before sample processing; concentration is generally performed using tangential flow filtration (Vega Thurber et al. 2009) or, for smaller volumes, spin filter columns in a benchtop centrifuge (Bolduc et al. 2012). The most commonly used step following concentration is the combination of viral enrichment by size exclusion of the cellular fraction using filters and nuclease treatment for the removal of free-floating DNA and RNA (Vega Thurber et al. 2009; Hall et al. 2014). When using size exclusion to remove bacteria, it is important to keep in mind the size of the phage target: some jumbo myoviruses, such as Pseudomonas phages phiKZ and EL and their relatives (head diameter of >120 nm and tail length of ~200 nm), can potentially clog the pores of a 0.22 μm filter (Hertveldt et al. 2005). Subsequently, the phage (viral) community nucleic acid can be extracted and sequenced using next-generation sequencing platforms. In most cases, researchers will be targeting dsDNA, as this genome type represents most known phages.

Metaviromics was first used to explore uncultured marine virus communities, revealing that a large fraction of these communities was made up of phages (Breitbart et al. 2002). Since this seminal paper, the techniques and approaches in metaviromics have been optimized and updated to investigate viral/phage ecology, in a field that has started to come of age (Sullivan 2015; Sullivan et al. 2017). Virtually every habitat sampled and analyzed with metaviromics has shown that phages make up a substantial fraction, if not the majority, of identified sequences. These habitats range from pristine environments in the polar regions (López-Bueno et al. 2009; Zablocki et al. 2014; Aguirre de Cárcer et al. 2015; Adriaenssens et al. 2017), through the global oceans (Huang et al. 2010; Mizuno et al. 2013; Brum et al. 2015) and freshwater lakes (Roux et al. 2012; Labonté and Suttle 2013; Skvortsov et al. 2016), to soils (Fierer et al. 2007; Zablocki et al. 2017). Currently, only marine epipelagic habitats have been sampled near saturation, allowing network-based clustering approaches to describe the full diversity, revealing that the most abundant viral clusters represent phages infecting members of the phyla Actinobacteria, Proteobacteria, Bacteroidetes, Cyanobacteria, and Deferribacteres (Roux et al. 2016).

Metaviromics has become the go-to method to identify human-associated phage communities. Early studies of the human gut revealed highly diverse and unknown phage communities, with the majority of the known phage signatures belonging to tailed phages of the order Caudovirales (Breitbart et al. 2003). Comparative analyses revealed high interpersonal differences in gut phage communities, but low levels of change over time within the same individual, and a predominance of temperate phages (Reyes et al. 2010, 2012, 2015; Minot et al. 2011; Manrique et al. 2016). Metaviromic sequencing of ultrasmall amounts of DNA also showed unique phage communities on the human skin with differences according to topical sites and large intrapersonal differences (Hannigan et al. 2015). Other metavirome studies have found viral communities dominated by bacteriophages in bodily fluids, such as saliva and urine, and to a lesser extent in blood where phages might have originated from contamination of the sequencing procedure (Pride et al. 2012; Santiago-Rodriguez et al. 2015; Moustafa et al. 2017).

In an alternative approach, metaviromic sequencing analyses have been used to identify the phages present in phage cocktails used in phage therapy treatments and experiments. The first such cocktail analyzed was a Russian cocktail called ColiProteus (Microgen) used against E. coli and Proteus infections (McCallin et al. 2013). This study revealed 17 different phage groups present in the cocktail at different abundances, suggesting that some of the low abundance groups were by-products of phage cocktail production. The second study investigated the Intesti phage cocktail from the Eliava Institute in Georgia, active against a range of enteric bacteria (Zschach et al. 2015) (see “Current Updates in the Long-Standing Phage Research Centers in Georgia, Poland and Russia”). The metaviromic analysis showed 23 different sequence groups, called phage clusters by the researchers, falling within the families Myoviridae, Siphoviridae, and Podoviridae and an unassigned grouping. These two studies showed that these phage cocktails were more complex than initially assumed and might contain additional (unwanted) phage sequences at low abundances.

One of the most interesting findings to come from a metaviromic sequencing approach is the discovery of a highly abundant phage in human gut viromes (Dutilh et al. 2014). The researchers in this study used a cross-assembly approach (assembling multiple datasets together in one contig set) in order to increase contig length and establish a co-occurrence profile of contigs over the different samples (Dutilh et al. 2012). They were able to reconstruct a circular contig of ~97 kb representing a phage genome they labeled crAssphage and verified its existence by long-range PCR and Sanger sequencing. In a comparison with all published metagenomes at the time, this phage was found to make up a significant portion of gut metagenomes [up to 90% of reads of the twin dataset (Reyes et al. 2010)] and represented 1.7% of all sequencing reads from human feces, making crAssphage one of the most abundant phages in publicly available datasets. With more phage genomes sequenced and metagenome data becoming available, it is possible that more of these abundant, unknown phages will be discovered.
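The idea of a co-occurrence profile can be sketched as follows: contigs whose read depth rises and falls together across samples are candidates for belonging to the same genome. This is only an illustration under stated assumptions; the depth values and contig names are invented, and the use of Pearson correlation with a fixed threshold is a choice made here for simplicity, not a description of the published cross-assembly method.

```python
from math import sqrt


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation between two equally long depth profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no co-occurrence signal
    return cov / sqrt(var_x * var_y)


def co_occurring(depth: dict[str, list[float]], threshold: float = 0.9):
    """Return contig pairs whose depth profiles correlate at or above threshold."""
    names = sorted(depth)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if pearson(depth[a], depth[b]) >= threshold
    ]


# Hypothetical mean read depth of three contigs across four samples.
depth = {
    "contig_A": [10.0, 0.0, 40.0, 5.0],
    "contig_B": [20.0, 0.0, 80.0, 10.0],  # tracks contig_A across samples
    "contig_C": [0.0, 30.0, 2.0, 25.0],   # different abundance pattern
}

pairs = co_occurring(depth)
print(pairs)
```

Here only contig_A and contig_B are paired, because their depths are proportional in every sample, whereas contig_C follows an opposing pattern.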


The development of sequencing technology and the subsequent boom in next-generation sequencing platforms (both short- and long-read platforms) have been fundamental in advancing bacteriophage research. These methods have not only contributed to an explosion of genomes in public databases but have also provided an opportunity to exploit sequencing-based methods for bacteriophage detection. At the simplest level of complexity, phages can be detected by the sequencing of a single marker gene. Whole genome sequencing of isolated phages has populated reference databases, while bacterial genome sequencing has led to the discovery of a plethora of previously undetected prophage genomes. At the community level, shotgun sequencing methods have made it possible for researchers to investigate all bacteriophages in a sample without previous knowledge of its content. In conclusion, sequencing technology has added a new layer to bacteriophage research, opening new avenues of research, from exploitation of genes for biotechnological applications to population ecology.



  1. Abedon ST (2017) Information phage therapy research should report. Pharmaceuticals 10:1–17
  2. Ackermann H-W (2011) Bacteriophage taxonomy. Microbiol Aust 32:90–94
  3. Adams RLP, Burdon RH (1985) The function of DNA methylation in bacteria and phage. In: Molecular biology of DNA methylation. Springer, New York, pp 73–87
  4. Adriaenssens E, Brister JR (2017) How to name and classify your phage: an informal guide. Viruses 9:70
  5. Adriaenssens EM, Cowan DA (2014) Using signature genes as tools to assess environmental viral ecology and diversity. Appl Environ Microbiol 80:4470–4480
  6. Adriaenssens EM, Ackermann H-W, Anany H et al (2012) A suggested new bacteriophage genus: “Viunalikevirus”. Arch Virol 157:2035–2046
  7. Adriaenssens EM, Kramer R, Van Goethem MW et al (2017) Environmental drivers of viral community composition in Antarctic soils identified by viromics. Microbiome 5:83
  8. Adriaenssens EM, Wittmann J, Kuhn JH et al (2018) Taxonomy of prokaryotic viruses: 2017 update from the ICTV Bacterial and Archaeal Viruses Subcommittee. Arch Virol 163:1125–1129
  9. Aguirre de Cárcer D, López-Bueno A, Pearce DA, Alcamí A (2015) Biodiversity and distribution of polar freshwater DNA viruses. Sci Adv 1:e1400127
  10. Akhter S, Aziz RK, Edwards RA (2012) PhiSpy: a novel algorithm for finding prophages in bacterial genomes that combines similarity- and composition-based strategies. Nucleic Acids Res 40:e126
  11. Alavidze Z, Aminov R, Betts A et al (2016) Silk route to the acceptance and re-implementation of bacteriophage therapy. Biotechnol J 11:595–600
  12. Arndt D, Grant JR, Marcu A et al (2016) PHASTER: a better, faster version of the PHAST phage search tool. Nucleic Acids Res 44:W16–W21
  13. Baker AC, Goddard VJ, Davy J et al (2006) Identification of a diagnostic marker to detect freshwater cyanophages of filamentous cyanobacteria. Appl Environ Microbiol 72:5713–5719
  14. Barrett T, Clark K, Gevorgyan R et al (2012) BioProject and BioSample databases at NCBI: facilitating capture and organization of metadata. Nucleic Acids Res 40:57–63
  15. Baym M, Kryazhimskiy S, Lieberman TD et al (2015) Inexpensive multiplexed library preparation for megabase-sized genomes. PLoS One 10:e0128036
  16. Bellas CM, Anesio AM (2013) High diversity and potential origins of T4-type bacteriophages on the surface of Arctic glaciers. Extremophiles 17:861–870
  17. Benson DA, Cavanaugh M, Clark K et al (2013) GenBank. Nucleic Acids Res 41:36–42
  18. Bolduc B, Shaughnessy DP, Wolf YI et al (2012) Identification of novel positive-strand RNA viruses by metagenomic analysis of archaea-dominated Yellowstone hot springs. J Virol 86:5562–5573
  19. Bose M, Barber RD (2006) Prophage Finder: a prophage loci prediction tool for prokaryotic genome sequences. In Silico Biol 6:223–227
  20. Bossi L, Fuentes JA, Mora G, Figueroa-Bossi N (2003) Prophage contribution to bacterial population dynamics. J Bacteriol 185:6467–6471
  21. Breitbart M, Salamon P, Andresen B et al (2002) Genomic analysis of uncultured marine viral communities. Proc Natl Acad Sci U S A 99:14250–14255
  22. Breitbart M, Hewson I, Felts B et al (2003) Metagenomic analyses of an uncultured viral community from human feces. J Bacteriol 185:6220–6223
  23. Breitbart M, Miyake JH, Rohwer F (2004) Global distribution of nearly identical phage-encoded DNA sequences. FEMS Microbiol Lett 236:249–256
  24. Brister JR, Ako-Adjei D, Bao Y, Blinkova O (2015) NCBI viral genomes resource. Nucleic Acids Res 43:D571–D577
  25. Brum JR, Ignacio-Espinoza JC, Roux S et al (2015) Patterns and ecological drivers of ocean viral communities. Science 348:1261498
  26. Brüssow H, Canchaya C, Hardt WD (2004) Phages and the evolution of bacterial pathogens: from genomic rearrangements to lysogenic conversion. Microbiol Mol Biol Rev 68:560–602
  27. Butina TV, Belykh OI, Maksimenko SY, Belikov SI (2010) Phylogenetic diversity of T4-like bacteriophages in Lake Baikal, East Siberia. FEMS Microbiol Lett 309:122–129
  28. Caporaso JG, Kuczynski J, Stombaugh J et al (2010) QIIME allows analysis of high-throughput community sequencing data. Nat Methods 7:335–336
  29. Casjens S (2003) Prophages and bacterial genomics: what have we learned so far? Mol Microbiol 49:277–300
  30. Castro-Wallace SL, Chiu CY, John KK et al (2017) Nanopore DNA sequencing and genome assembly on the International Space Station. Sci Rep 7:18022
  31. Chan Y-WW, Millard AD, Wheatley PJ et al (2015) Genomic and proteomic characterization of two novel siphovirus infecting the sedentary facultative epibiont cyanobacterium Acaryochloris marina. Environ Microbiol 17:4239–4252
  32. Chen F, Wang K, Huang S et al (2009) Diverse and dynamic populations of cyanobacterial podoviruses in the Chesapeake Bay unveiled through DNA polymerase gene sequences. Environ Microbiol 11:2884–2892
  33. Chow C-ET, Fuhrman JA (2012) Seasonality and monthly dynamics of marine myovirus communities. Environ Microbiol 14:2171–2183
  34. Clokie MRJ, Millard AD, Mehta JY, Mann NH (2006) Virus isolation studies suggest short-term variations in abundance in natural cyanophage populations of the Indian Ocean. J Mar Biol Assoc UK 86:499–505
  35. Comeau AM, Krisch HM (2008) The capsid of the T4 phage superfamily: the evolution, diversity, and structure of some of the most prevalent proteins in the biosphere. Mol Biol Evol 25:1321–1332
  36. Cowley LA, Beckett SJ, Chase-Topping M et al (2015) Analysis of whole genome sequencing for the Escherichia coli O157:H7 typing phages. BMC Genomics 16:271
  37. Culley AI, Steward GF (2007) New genera of RNA viruses in subtropical seawater, inferred from polymerase gene sequences. Appl Environ Microbiol 73:5937–5944
  38. Culley AI, Lang AS, Suttle CA (2003) High diversity of unknown picorna-like viruses in the sea. Nature 424:1054–1057
  39. Dekel-Bird NP, Avrani S, Sabehi G et al (2013) Diversity and evolutionary relationships of T7-like podoviruses infecting marine cyanobacteria. Environ Microbiol 15:1476–1491
  40. Dunn JJ, Studier FW, Gottesman M (1983) Complete nucleotide sequence of bacteriophage T7 DNA and the locations of T7 genetic elements. J Mol Biol 166:477–535
  41. Dutilh BE, Schmieder R, Nulton J et al (2012) Reference-independent comparative metagenomics using cross-assembly: CrAss. Bioinformatics 28:3225–3231
  42. Dutilh BE, Cassman N, McNair K et al (2014) A highly abundant bacteriophage discovered in the unknown sequences of human faecal metagenomes. Nat Commun 5:4498
  43. Dwivedi B, Schmieder R, Goldsmith DB et al (2012) PhiSiGns: an online tool to identify signature genes in phages and design PCR primers for examining phage diversity. BMC Bioinform 13:37
  44. Fierer N, Breitbart M, Nulton J et al (2007) Metagenomic and small-subunit rRNA analyses reveal the genetic diversity of bacteria, archaea, fungi, and viruses in soil. Appl Environ Microbiol 73:7059–7066
  45. Filée J, Tétart F, Suttle CA, Krisch HM (2005) Marine T4-type bacteriophages, a ubiquitous component of the dark matter of the biosphere. Proc Natl Acad Sci U S A 102:12471–12476
  46. Fleischmann R, Adams M, White O et al (1995) Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269:496–512
  47. Fortier L-C, Sekulovic O (2013) Importance of prophages to evolution and virulence of bacterial pathogens. Virulence 4:354–365
  48. Fouts DE (2006) Phage_Finder: automated identification and classification of prophage regions in complete bacterial genome sequences. Nucleic Acids Res 34:5839–5851
  49. Friedman SD, Genthner FJ, Gentry J et al (2009) Gene mapping and phylogenetic analysis of the complete genome from 30 single-stranded RNA male-specific coliphages (family Leviviridae). J Virol 83:11233–11243
  50. Fujii T, Nakayama N, Nishida M et al (2008) Novel capsid genes (g23) of T4-type bacteriophages in a Japanese paddy field. Soil Biol Biochem 40:1049–1058
  51. Fuller NJ, Wilson WH, Joint IR, Mann NH (1998) Occurrence of a sequence in marine cyanophages similar to that of T4 g20 and its application to PCR-based detection and quantification techniques. Appl Environ Microbiol 64:2051–2060
  52. Garneau JR, Depardieu F, Fortier L-C et al (2017) PhageTerm: a tool for fast and accurate determination of phage termini and packaging mechanism using next-generation sequencing data. Sci Rep 7:8292
  53. Goldsmith DB, Crosti G, Dwivedi B et al (2011) Development of phoH as a novel signature gene for assessing marine phage diversity. Appl Environ Microbiol 77:7730–7739
  54. Goodwin S, McPherson JD, McCombie WR (2016) Coming of age: ten years of next-generation sequencing technologies. Nat Rev Genet 17:333–351
  55. Grazziotin AL, Koonin EV, Kristensen DM (2017) Prokaryotic Virus Orthologous Groups (pVOGs): a resource for comparative genomics and protein family annotation. Nucleic Acids Res 45:D491–D498
  56. Hall RJ, Wang J, Todd AK et al (2014) Evaluation of rapid and simple techniques for the enrichment of viruses prior to metagenomic virus discovery. J Virol Methods 195:194–204
  57. Hannigan GD, Meisel JS, Tyldsley AS et al (2015) The human skin double-stranded DNA virome: topographical and temporal diversity, genetic enrichment, and dynamic associations with the host microbiome. MBio 6:e01578–e01515
  58. Hatfull GF (2015) Dark matter of the biosphere: the amazing world of bacteriophage diversity. J Virol 89:8107–8110
  59. Hatfull GF, Hendrix RW (2011) Bacteriophages and their genomes. Curr Opin Virol 1:298–303
  60. Henn MR, Sullivan MB, Stange-Thomann N et al (2010) Analysis of high-throughput sequencing and annotation strategies for phage genomes. PLoS One 5:e9083
  61. Hertveldt K, Lavigne R, Pleteneva E et al (2005) Genome comparison of Pseudomonas aeruginosa large phages. J Mol Biol 354:536–545
  62. Hopkins M, Kailasan S, Cohen A et al (2014) Diversity of environmental single-stranded DNA phages revealed by PCR amplification of the partial major capsid protein. ISME J 8:2093–2103
  63. Huang S, Wilhelm SW, Jiao N, Chen F (2010) Ubiquitous cyanobacterial podoviruses in the global oceans unveiled through viral DNA polymerase gene sequences. ISME J 4:1243–1251
  64. Ip CLC, Loose M, Tyson JR et al (2015) MinION analysis and reference consortium: phase 1 data release and analysis. F1000Research 4:1075
  65. Johnson M, Zaretskaya I, Raytselis Y et al (2008) NCBI BLAST: a better web interface. Nucleic Acids Res 36:W5–W9
  66. Karsch-Mizrachi I, Nakamura Y, Cochrane G (2012) The international nucleotide sequence database collaboration. Nucleic Acids Res 40:D33–D37
  67. Keller MW, Rambo-Martin BL, Wilson MM et al (2018) Direct RNA sequencing of the complete Influenza A virus genome. bioRxiv
  68. Klumpp J, Lavigne R, Loessner MJ, Ackermann H-W (2010) The SPO1-related bacteriophages. Arch Virol 155:1547–1561
  69. Klumpp J, Fouts DE, Sozhamannan S (2012) Next generation sequencing technologies and the changing landscape of phage genomics. Bacteriophage 2:190–199
  70. Klumpp J, Schmuki M, Sozhamannan S et al (2014) The odd one out: Bacillus ACT bacteriophage CP-51 exhibits unusual properties compared to related Spounavirinae W.Ph. and Bastille. Virology 462:299–308
  71. Kot W, Vogensen FK, Sørensen SJ, Hansen LH (2014) DPS – a rapid method for genome sequencing of DNA-containing bacteriophages directly from a single plaque. J Virol Methods 196:152–156
  72. Kristensen DM, Waller AS, Yamada T et al (2013) Orthologous gene clusters and taxon signature genes for viruses of prokaryotes. J Bacteriol 195:941–950
  73. Krupovic M, Prangishvili D, Hendrix RW, Bamford DH (2011) Genomics of bacterial and archaeal viruses: dynamics within the prokaryotic virosphere. Microbiol Mol Biol Rev 75:610–635
  74. Labonté JM, Suttle CA (2013) Metagenomic and whole-genome analysis reveals new lineages of gokushoviruses and biogeographic separation in the sea. Front Microbiol 4:404
  75. Labonté JM, Reid KE, Suttle CA (2009) Phylogenetic analysis indicates evolutionary diversity and environmental segregation of marine podovirus DNA polymerase gene sequences. Appl Environ Microbiol 75:3634–3640
  76. Laver T, Harrison J, O’Neill PA et al (2015) Assessing the performance of the Oxford Nanopore Technologies MinION. Biomol Detect Quantif 3:1–8
  77. Lee Y-J, Dai N, Walsh SE et al (2018) Identification and biosynthesis of thymidine hypermodifications in the genomic DNA of widespread bacterial viruses. Proc Natl Acad Sci
  78. Leinonen R, Akhtar R, Birney E et al (2011a) The European nucleotide archive. Nucleic Acids Res 39:31–34
  79. Leinonen R, Sugawara H, Shumway M (2011b) The sequence read archive. Nucleic Acids Res 39:2010–2012
  80. Lima-Mendez G, Van Helden J, Toussaint A, Leplae R (2008) Prophinder: a computational tool for prophage prediction in prokaryotic genomes. Bioinformatics 24:863–865
  81. Loman NJ, Misra RV, Dallman TJ et al (2012) Performance comparison of benchtop high-throughput sequencing platforms. Nat Biotechnol 30:434–439. Scholar
  82. López-Bueno A, Tamames J, Velázquez D et al (2009) High diversity of the viral community from an Antarctic lake. Science 326:858–861. Scholar
  83. Manrique P, Bolduc B, Walk ST et al (2016) Healthy human gut phageome. Proc Natl Acad Sci 113:10400–10405. Scholar
  84. Marine R, Polson SW, Ravel J et al (2011) Evaluation of a transposase protocol for rapid generation of shotgun high-throughput sequencing libraries from nanogram quantities of DNA. Appl Environ Microbiol 77:8071–8079. Scholar
  85. Marston MF, Amrich CG (2009) Recombination and microdiversity in coastal marine cyanophages. Environ Microbiol 11:2893–2903. Scholar
  86. McCallin S, Alam Sarker S, Barretto C et al (2013) Safety analysis of a Russian phage cocktail: from MetaGenomic analysis to oral application in healthy human subjects. Virology 443:187–196. Scholar
  87. Millard A, Clokie MRJ, Shub DA, Mann NH (2004) Genetic organization of the psbAD region in phages infecting marine Synechococcus strains. Proc Natl Acad Sci U S A 101:11007–11012. Scholar
  88. Millard AD, Zwirglmaier K, Downey MJ et al (2009) Comparative genomics of marine cyanomyoviruses reveals the widespread occurrence of Synechococcus host genes localized to a hyperplastic region: implications for mechanisms of cyanophage evolution. Environ Microbiol 11:2370–2387. Scholar
  89. Millard AD, Gierga G, Clokie MRJ et al (2010) An antisense RNA in a lytic cyanophage links psbA to a gene encoding a homing endonuclease. ISME J 4:1121–1135. Scholar
  90. Minot S, Sinha R, Chen J et al (2011) The human gut virome: inter-individual variation and dynamic response to diet. Genome Res 21:1616–1625. Scholar
  91. Mizuno CM, Rodriguez-Valera F, Kimes NE, Ghai R (2013) Expanding the marine virosphere using metagenomics. PLoS Genet 9:e1003987. Scholar
  92. Moustafa A, Xie C, Kirkness E et al (2017) The blood DNA virome in 8,000 humans. PLoS Pathog 13:e1006292. Scholar
  93. Owen SV, Wenner N, Canals R et al (2017) Characterization of the prophage repertoire of African Salmonella Typhimurium ST313 reveals high levels of spontaneous induction of novel phage BTP1. Front Microbiol 8:235. Scholar
  94. Perez Sepulveda B, Redgwell T, Rihtman B et al (2016) Marine phage genomics: the tip of the iceberg. FEMS Microbiol Lett 363:fnw158. Scholar
  95. Phillippy AM (2017) New advances in sequence assembly. Genome Res 27:xi–xiii. Scholar
  96. Pride DT, Salzman J, Haynes M et al (2012) Evidence of a robust resident bacteriophage population revealed through analysis of the human salivary virome. ISME J 6:915–926. Scholar
  97. Puxty RJ, Perez-Sepulveda B, Rihtman B et al (2015) Spontaneous deletion of an “ORFanage” region facilitates host adaptation in a “photosynthetic” cyanophage. PLoS One 10:e0132642. Scholar
  98. Rand AC, Jain M, Eizenga JM et al (2017) Mapping DNA methylation with high-throughput nanopore sequencing. Nat Methods 14:411–413. Scholar
  99. Raya R, Hébert EM (2009) Isolation of phage via induction of lysogens. In: Clokie MR, Kropinski AM (eds) Bacteriophages: methods and protocols. Humana Press, New York, pp 23–32
  100. Reyes A, Haynes M, Hanson N et al (2010) Viruses in the faecal microbiota of monozygotic twins and their mothers. Nature 466:334–338
  101. Reyes A, Semenkovich NP, Whiteson K et al (2012) Going viral: next-generation sequencing applied to phage populations in the human gut. Nat Rev Microbiol 10:607–617
  102. Reyes A, Blanton LV, Cao S et al (2015) Gut DNA viromes of Malawian twins discordant for severe acute malnutrition. Proc Natl Acad Sci 112:201514285
  103. Rhoads A, Au KF (2015) PacBio sequencing and its applications. Genomics Proteomics Bioinform 13:278–289
  104. Rihtman B, Meaden S, Clokie MRJ et al (2016) Assessing Illumina technology for the high-throughput sequencing of bacteriophage genomes. PeerJ 4:e2055
  105. Rohwer F, Edwards R (2002) The Phage Proteomic Tree: a genome-based taxonomy for phage. J Bacteriol 184:4529–4535
  106. Roux S, Enault F, Robin A et al (2012) Assessing the diversity and specificity of two freshwater viral communities through metagenomics. PLoS One 7:e33641
  107. Roux S, Tournayre J, Mahul A et al (2014) Metavir 2: new tools for viral metagenome comparison and assembled virome analysis. BMC Bioinform 15:76
  108. Roux S, Enault F, Hurwitz BL, Sullivan MB (2015) VirSorter: mining viral signal from microbial genomic data. PeerJ 3:e985
  109. Roux S, Brum JR, Dutilh BE et al (2016) Ecogenomics and potential biogeochemical impacts of globally abundant ocean viruses. Nature 537:689–693
  110. Sabehi G, Shaulov L, Silver DH et al (2012) A novel lineage of myoviruses infecting cyanobacteria is widespread in the oceans. Proc Natl Acad Sci U S A 109:2037–2042
  111. Sanger F, Nicklen S, Coulson AR (1977) DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci 74:5463–5467
  112. Sanger F, Coulson AR, Hong GF et al (1982) Nucleotide sequence of bacteriophage lambda DNA. J Mol Biol 162:729–773
  113. Santiago-Rodriguez TM, Ly M, Bonilla N, Pride DT (2015) The human urine virome in association with urinary tract infections. Front Microbiol 6:1–12
  114. Schloss PD, Westcott SL, Ryabin T et al (2009) Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol 75:7537–7541
  115. Short CM, Suttle CA (2005) Nearly identical bacteriophage structural gene sequences are widely distributed in both marine and freshwater environments. Appl Environ Microbiol 71:480–486
  116. Skvortsov T, de Leeuwe C, Quinn JP et al (2016) Metagenomic characterisation of the viral community of Lough Neagh, the largest freshwater lake in Ireland. PLoS One 11:e0150361
  117. Sullivan MB (2015) Viromes, not gene markers, for studying double-stranded DNA virus communities. J Virol 89:2459–2461
  118. Sullivan MB, Lindell D, Lee JA et al (2006) Prevalence and evolution of core photosystem II genes in marine cyanobacterial viruses and their hosts. PLoS Biol 4:e234
  119. Sullivan MB, Krastins B, Hughes JL et al (2009) The genome and structural proteome of an ocean siphovirus: a new window into the cyanobacterial “mobilome”. Environ Microbiol 11:2935–2951
  120. Sullivan MB, Weitz JS, Wilhelm S (2017) Viral ecology comes of age. Environ Microbiol Rep 9:33–35
  121. Székely AJ, Breitbart M (2016) Single-stranded DNA phages: from early molecular biology tools to recent revolutions in environmental microbiology. FEMS Microbiol Lett 363:1–9
  122. Thompson LR, Zeng Q, Kelly L et al (2011) Phage auxiliary metabolic genes and the redirection of cyanobacterial host carbon metabolism. Proc Natl Acad Sci 108:E757–E764
  123. Tseng E, Underwood JG (2013) Full length cDNA sequencing on the PacBio® RS. J Biomol Tech 24:S45
  124. Utter B, Deutsch DR, Schuch R et al (2014) Beyond the chromosome: the prevalence of unique extra-chromosomal bacteriophages with integrated virulence genes in pathogenic Staphylococcus aureus. PLoS One 9:e100502
  125. Vega Thurber R, Haynes M, Breitbart M et al (2009) Laboratory procedures to generate viral metagenomes. Nat Protoc 4:470–483
  126. Wang K, Chen F (2008) Prevalence of highly host-specific cyanophages in the estuarine environment. Environ Microbiol 10:300–312
  127. Wang G, Hayashi M, Saito M et al (2009a) Survey of major capsid genes (g23) of T4-type bacteriophages in Japanese paddy field soils. Soil Biol Biochem 41:13–20
  128. Wang G, Murase J, Taki K et al (2009b) Changes in major capsid genes (g23) of T4-type bacteriophages with soil depth in two Japanese rice fields. Biol Fertil Soils 45:521–529
  129. Zablocki O, Van Zyl L, Adriaenssens EM et al (2014) High-level diversity of tailed phages, eukaryote-associated viruses and virophage-like elements in the metaviromes of Antarctic soils. Appl Environ Microbiol 80:6888–6897
  130. Zablocki O, Adriaenssens EM, Frossard A et al (2017) Metaviromes of extracellular soil viruses along a Namib Desert aridity gradient. Genome Announc 5:e01470–e01416
  131. Zeidner G, Preston CM, Delong EF et al (2003) Molecular diversity among marine picophytoplankton as revealed by psbA analyses. Environ Microbiol 5:212–216
  132. Zhong Y, Chen F, Wilhelm SW et al (2002) Phylogenetic diversity of marine cyanophage isolates and natural virus communities as revealed by sequences of viral capsid assembly protein gene g20. Appl Environ Microbiol 68:1576–1584
  133. Zhong X, Guidoni B, Jacas L, Jacquet S (2015) Structure and diversity of ssDNA Microviridae viruses in two peri-alpine lakes (Annecy and Bourget, France). Res Microbiol 166:644–654
  134. Zhou Y, Liang Y, Lynch KH et al (2011) PHAST: a fast phage search tool. Nucleic Acids Res 39:W347–W352
  135. Zschach H, Joensen KG, Lindhard B et al (2015) What can we learn from a metagenomic analysis of a Georgian bacteriophage cocktail? Viruses 7:6570–6589

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Siân V. Owen (1, 2)
  • Blanca M. Perez-Sepulveda (1)
  • Evelien M. Adriaenssens (1; corresponding author)

  1. Microbiology Research Group, Institute of Integrative Biology, University of Liverpool, Liverpool, UK
  2. Department of Biomedical Informatics and Laboratory of Systems Pharmacology, Harvard Medical School, Boston, USA
