Detection of Bacteriophages: Sequence-Based Systems
The invention of sequencing technologies has fundamentally changed molecular biology, including the way we look at bacteriophages. In addition to investigating bacteriophage plaques, electron micrographs, or the phenotypes of different mutants, we are now able to explore the entire genetic potential encoded in their genomes. As of July 2018, over 6,000 complete phage genome sequences were published in public databases, with over 28,000 partial sequences available.
In this chapter, we give an overview of the latest technologies that can be used to determine phage genome sequences, ranging from short-read platforms which generally give multiple gigabases of sequence data to long-read technologies which have the potential to sequence a bacteriophage genome in one single read. We then look at applications of sequencing technologies in detecting bacteriophages, from a single gene, over entire genomes, to the community level.
The recent advances in sequencing technologies have profoundly changed our understanding of bacteriophage biology and genomics. Phage genomics, the study of the nucleic acid content or full gene complement of a phage, has demonstrated the remarkable genetic diversity of the phage world. This diversity is so high that phages with similar morphologies can have no nucleotide sequence similarity and very limited amino acid similarities of orthologous proteins (Krupovic et al. 2011). As a result, sequencing of specific phage marker genes or the entire genome is more accurate in detecting the exact variant or isolate of a phage than electron microscopic- or host range-based methods.
Sequencing is becoming increasingly important in phage research. For example, phage taxonomy has started to move away from morphology-based classification based on the presence/absence, type, and length of the tail toward a genome-based strategy more accurately reflecting the full diversity present in the virosphere (Ackermann 2011; Adriaenssens and Brister 2017; Adriaenssens et al. 2018). In phage therapy – the use of phages as therapeutic agents to control bacterial infections – sequencing of the entire genome is encouraged to ensure that the phages in question do not encode potentially damaging genes (e.g., toxins, lysogeny-related genes, antibiotic resistance genes) (Alavidze et al. 2016; Abedon 2017). Sequencing has also provided an extra dimension to viral ecology, offering a greater resolution of their diversity and unearthing viral dark matter, i.e., viral/phage sequence fragments and genes without any database homologs (Filée et al. 2005; Hatfull 2015).
Table: Overview of technologies used in phage sequencing

Sanger
Read length (bp): 800–1,000, single read
Principle: Capillary electrophoresis of labeled chain-terminating dideoxynucleotides
Advantages and disadvantages: High accuracy and resolution at the nucleotide level; high sequencing cost per base

Illumina MiSeq
Read length (bp): 150–350, paired-end reads
Principle: Sequencing by synthesis, with detection of fluorescently labeled nucleotides
Advantages and disadvantages: Low amount of starting material (ng) allows the sequencing of bacteriophages from single plaques; PCR-free library preparation reduces the introduction of sequencing errors; widely used, allowing access in most laboratories at a low cost per read; Nextera-based library preparation will not be able to sequence phage genome ends

Ion Torrent PGM
Read length (bp): 300–400, paired reads
Principle: Detection of protons released during DNA synthesis
Advantages and disadvantages: Ability to detect more variants compared to Illumina MiSeq, but with more false positives; short sequencing run time

PacBio
Read length (bp): Up to 60,000, single reads
Principle: Real-time detection of labeled nucleotide incorporation by a tethered DNA polymerase
Advantages and disadvantages: Long-read lengths allow repeat region resolution; additional data on DNA-chemical modifications is generated; poor accuracy at single read level; high amount of starting material required (μg); large hardware cost, large machine

ONT MinION
Read length (bp): Up to 1,000,000, single reads
Principle: Real-time detection of nucleotide-mediated current changes as nucleic acid molecule passes through a tethered nanopore protein
Advantages and disadvantages: Long-read lengths allow repeat region resolution; additional data on DNA-chemical modifications can be generated; base calling in real time permits instant sequence interrogation (i.e., results are available as they are generated, not only once the machine has finished); low hardware cost, portable machine allows for anytime-anywhere sequencing; only technology that allows direct sequencing of RNA molecules; poor accuracy at single read level; high amount of starting material required (μg)
Overview of Sequencing Methods
Historically, bacteriophage sequencing has been performed using Sanger technology. This sequencing technology uses capillary electrophoresis to detect precisely the selective incorporation of labeled chain-terminating dideoxynucleotides during in vitro replication of DNA. However, this approach can be expensive, depends in many cases on cloning the phage DNA, and does not allow separation of contaminating host DNA sequence (Klumpp et al. 2012). High-throughput (or “next-generation”) short-read sequencing has become the preferred method for sequencing phages, mostly due to the low cost per base and high output compared to other platforms (Rihtman et al. 2016). Currently, popular platforms for sequencing bacteriophages include Illumina® MiSeq and Ion Torrent Personal Genome Machine (Ion PGM™). These sequencing platforms differ mostly in their chemistry but are generally based on the same three steps: library preparation, amplification, and sequencing of dsDNA. For RNA phages, cDNA needs to be generated by reverse transcription, either before the library preparation step or by using library preparation kits that are optimized for RNA sequencing and can provide information on whether the genome is made up of dsRNA or ssRNA and the sense of the single-stranded RNA.
During library preparation, the extracted phage DNA is quality checked before fragmentation into random overlapping parts, which can be done by either mechanical (e.g., sonication) or enzymatic methods (e.g., the nuclease Fragmentase (New England Biolabs)). The amount of input DNA for library preparation is a crucial consideration, especially in the study of phages, where limiting factors are either phage propagation to large numbers of particles or varying genome sizes. Several kits (e.g., Nextera XT (Illumina)) have been designed to significantly decrease the required amount of input DNA (to 1 ng) and reduce hands-on time by using “tagmentation,” a process that fragments, size-selects, and tags the input DNA. These new kits increase sequencing efficiency as they produce greater output data in a shorter overall time (Marine et al. 2011). However, the use of Nextera kits comes with the trade-off that the defined ends of linear phage genomes will be missed and is, therefore, not recommended when the aim of sequencing is to generate a high-quality reference genome (Kot et al. 2014).
After fragmentation, the pieces of DNA are size-selected and ligated with adaptors specific for each platform. During the following amplification step, the DNA is copied several times in its specific position, creating reaction centers or clusters that allow the sequencer to distinguish the input DNA from background noise. The amplification step is carried out by annealing to complementary adapters attached either to beads in micelle droplets (emulsion PCR; e.g., 454 pyrosequencing and Ion Torrent), to solid plates (e.g., Illumina sequencing), or by creating nanoballs that are then placed in a flow cell (Complete Genomics (BGI)). The most notable differences between platforms can be observed during the sequencing step, which can be by synthesis or instead by ligation. Platforms based on sequencing by synthesis include Illumina (formerly Solexa) which detects fluorescently labeled nucleotides; Ion Torrent, which detects a change in pH; and 454 pyrosequencing, which senses the amount of light generated due to pyrophosphate release during nucleotide incorporation. Alternatively, SOLiD is a platform based on sequencing by ligation of a labeled probe to the target DNA (Goodwin et al. 2016).
The development of high-throughput sequencing technologies decreased sequencing costs and time compared to construction of clone libraries (Loman et al. 2012). The 454 pyrosequencing platform, the first platform to be optimized for high-throughput phage sequencing, has been recently discontinued (Henn et al. 2010; Marine et al. 2011). Currently, the most widely used platform for phage sequencing is Illumina technology, with its benchtop sequencer, MiSeq, particularly popular for phages, and the high-throughput machine HiSeq used for bacterial, eukaryotic, and metagenomes. These machines can provide large amounts of high-quality sequencing data in a reduced time when compared to other technologies. Due to the small size of phage genomes relative to other organisms, and the large amount of data generated, short-read platforms, especially Illumina, thus far are the preferred method for bacteriophage metagenomics, genome sequencing, and re-sequencing or, most recently, to detect termini and packaging mechanisms (Garneau et al. 2017).
Long-read sequencing represents the most recent and transformative development in sequencing technology, and these technologies are collectively referred to as third-generation sequencing (TGS). Whereas short-read sequencing platforms may generate sequence reads of up to 1 kb, the capability of long-read sequencing platforms is now approaching read lengths of 1 Mb (which is approximately twice the length of the longest known phage genome, at nearly 500 kb). Unlike short-read sequencing, which has become dominated by a single platform, two platforms utilizing very different technologies are used for long-read sequencing.
Pacific Biosciences (PacBio) was the first company to make long-read sequencing technology available to the mass market, with the release of its RS platform in 2011, shortly followed by the widely adopted RS-II platform in 2013. PacBio sequencing technology is comparable to Illumina short-read technology in that it is polymerase-dependent; sequence data result from the detection of base incorporation by a DNA polymerase enzyme onto a growing nucleotide chain. However, whereas Illumina sequencing relies on detection of base incorporation within a clonal population of concomitantly amplified DNA fragments, PacBio sequencers capture incorporation signals from single DNA molecules. This feat is made possible by physical anchoring of polymerase enzymes within narrow wells, which allows video recording of laser excitation of each fluorescently labeled nucleotide in direct contact with the anchored polymerase during DNA synthesis. Although the technology is intrinsically error-prone due to dependence on a polymerase enzyme and signal noise resulting from unincorporated nucleotides, a high degree of accuracy is achieved by using hairpin DNA adapters to create circular templates, which are sequenced continuously until polymerase function declines. Repetitive sequencing of the same DNA fragment allows random errors to be detected downstream and results in the output of high-quality sequence data. The maximum read length of a PacBio sequencer is dependent on the life of individual polymerase enzymes, thought to be between 10 and 60 kb (Rhoads and Au 2015). Sequencing takes place on SMRT Cells, which are chips containing 150,000 anchored polymerase wells. The second PacBio sequencing platform, Sequel, was released in 2015 and increased the capacity of each sequencing run nearly seven times (Sequel SMRT Cells contain one million polymerase wells), generating 5–10 Gb of data per run.
PacBio sequence library preparation protocols require double-stranded DNA, and therefore this technology is only directly applicable to dsDNA phages. There are currently no reports of PacBio technology to sequence ssDNA or RNA phage genomes. However, RNA may be reverse-transcribed into cDNA for sequencing on a PacBio instrument (Tseng and Underwood 2013), and a second strand DNA synthesis step may facilitate the sequencing of ssDNA phage genomes.
A frequently cited advantage of PacBio technology is the additional generation of chemical modification data along with sequence data (Rhoads and Au 2015). DNA-chemical modifications, such as the addition of methyl groups to cytosine residues (DNA methylation), cause characteristic kinetic changes in nucleotide incorporation rates by the polymerase and can be detected by automated analysis of the kinetic pattern of the DNA synthesis reaction. Identification of DNA modifications can be of particular interest for bacteriophages, as certain phage groups are known to incorporate modified bases into their genomes, potentially involved in escaping host restriction modification systems (Klumpp et al. 2010; Adriaenssens et al. 2012; Lee et al. 2018).
The second long-read sequencing technology in widespread use is being developed by Oxford Nanopore Technologies (ONT), and their prototype platform, MinION, was released in 2014 (Ip et al. 2015). Unlike the majority of DNA sequencing technologies, ONT does not detect nucleotide addition during DNA synthesis but instead directly detects the nucleotide composition of a single-stranded DNA or RNA molecule. The technology employs anchored pore proteins (nanopores), each under an electric current. Nucleic acid molecules are threaded and natively transported through the pore protein through the action of a coupled motor protein, and each nucleotide on the molecule causes characteristic and detectable current disruptions which are translated into sequence. ONT uses nucleic acid adaptors and tethers to facilitate the threading of single-stranded molecules into individual nanopore proteins. Individual reads of ~200,000 bp are reported (Ip et al. 2015), but reads of >1,000,000 bp have been reported anecdotally. On the MinION platform, sequencing takes place on flow cells harboring 512 nanopore channels and capable of generating 10–20 Gb of sequence data. ONT has recently released two high-throughput sequencing platforms based on the same technology as the MinION, the GridION and the PromethION, which run multiple flow cells or use flow cells with an increased number of nanopore channels. Though commercial ONT sequencing is one of the purported applications of the larger GridION and PromethION platforms, at the time of writing, ONT is not commercially available at the majority of sequencing facilities, and currently the primary users of ONT are research labs in possession of MinION devices.
The primary advantage of ONT is the portability and usability of the platforms. All platforms can be run on benchtops, using very simple DNA library preparation protocols, and sequence data can be interpreted in real time. Indeed, the portability of ONT sequencing technology was recently demonstrated by the sequencing of the phage lambda genome using a MinION device onboard the International Space Station (Castro-Wallace et al. 2017). Consequently, the obvious application of ONT is circumstances where rapid outputs are required such as in field or diagnostic settings. A major limitation of ONT sequencing is that it requires a high amount of input DNA, which may limit its use in low-biomass environments such as the ocean virome or for phages which are difficult to amplify. On the other hand, there may be an application for ONT technologies in high-biomass environments such as fecal samples. Perhaps ONT sequencing could be used to monitor the real-time abundance and sequence variation of phages and bacteria during phage therapy trials of gastrointestinal pathogens such as Clostridium difficile.
Furthermore, ONT is the only sequencing platform that permits direct sequencing of RNA molecules (as opposed to reverse transcription of RNA into cDNA for sequencing). Like PacBio technology, detection of chemically modified nucleotides is made possible by analysis of signature current changes (Rand et al. 2017). ONT may therefore be highly applicable to the sequencing of RNA phage genomes, and the technology has recently been used to sequence the native RNA genome of an influenza A virus for the first time, identifying chemical modifications that were undetected using cDNA-sequencing strategies (Keller et al. 2018).
A disadvantage of both long-read sequencing technologies is the poor accuracy of single reads. As single reads from both technologies represent the sequencing of single molecules, and therefore single-base calls, random error is frequent. Reads from the ONT MinION platform have been reported to have an error rate of 38.2% (Laver et al. 2015), and the error rate for single PacBio reads has been reported to be 11–15% (Rhoads and Au 2015). Therefore, some caution must be taken when utilizing these technologies for applications where accuracy is crucial, such as de novo genome assembly. Provided the input DNA is sequenced in enough depth, i.e., each nucleotide in the sequence is represented on multiple independent reads, erroneous base calls can be eliminated by consensus calling. This process is termed “read polishing,” and increased accuracy can also be achieved by combining the sequencing outputs from long- and short-read sequencing technologies to yield “hybrid” assemblies (Phillippy 2017).
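The idea behind consensus calling can be illustrated with a minimal Python sketch. It assumes the reads have already been aligned to one another and padded to equal length, with "-" marking positions a read does not cover; the function name and this toy input format are illustrative assumptions, and real polishing tools work from full alignments and error models rather than a simple majority vote.

```python
from collections import Counter


def consensus(aligned_reads):
    """Majority-vote consensus over pre-aligned, equal-length reads.

    '-' marks a position that a read does not cover (toy format,
    assumed for this sketch only).
    """
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("reads must be aligned to the same length")

    called = []
    for column in zip(*aligned_reads):
        # Count the bases observed at this position, ignoring gaps.
        counts = Counter(base for base in column if base != "-")
        # Emit 'N' when no read covers the position.
        called.append(counts.most_common(1)[0][0] if counts else "N")
    return "".join(called)


reads = [
    "ACGTTGCA",
    "ACGATGCA",  # random error at index 3
    "ACGTTGCA",
    "ACCTTGCA",  # random error at index 2
    "ACGTTGCA",
]
print(consensus(reads))  # ACGTTGCA
```

Because the two errors fall at different positions and are each outvoted by four correct reads, the consensus recovers the underlying sequence. This only works for random errors; systematic errors shared by most reads would survive the vote.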
The primary advantage of long-read sequencing technologies such as PacBio and ONT over short-read technology such as Illumina is the ability to generate reads spanning repetitive sequence regions, thereby greatly improving DNA sequence assembly. Though large repetitive regions, such as sequence duplications and transposable elements, are not common features of phage genomes, there are instances where short-read technology alone is insufficient to complete phage genomes. For example, myoviruses infecting bacteria of the genus Bacillus can exhibit heavily chemically modified DNA which impedes routine short-read sequencing strategies (Klumpp et al. 2010). PacBio sequencing has been used to complete several of these myovirus genomes, showing that it can be a useful alternative for difficult-to-sequence phages (Klumpp et al. 2014). Restriction modification systems are known to be important in the interaction between phages and their bacterial hosts (Adams and Burdon 1985), and therefore the secondary use of long-read sequencing technologies to map chemical genome modifications may be an important application of long-read sequencing technology to phage biology in the future.
A Note on Sequence Data and Sharing
The platforms described above generate multiple gigabytes of data that need to be processed for quality, assembled and analyzed, but where possible should also be shared to improve reproducibility and promote open science.
The most appropriate way for data sharing is through the International Nucleotide Sequence Database Collaboration (INSDC) which links the three main international sequence data organizations, DDBJ (DNA Data Bank of Japan), EMBL-EBI (European Molecular Biology Laboratory-European Bioinformatics Institute), and NCBI (National Center for Biotechnology Information) (Karsch-Mizrachi et al. 2012). Submissions only need to be made to a database of one of these organizations to be shared across all. For the DDBJ and NCBI, unassembled sequencing reads should be deposited in the Sequence Read Archive, shortened to DRA (DDBJ) or SRA (NCBI) (Leinonen et al. 2011b). The associated metadata for studies and samples are collected as BioProjects and BioSamples, respectively (Barrett et al. 2012). The assembled and annotated complete phage genomes can be deposited in GenBank (Benson et al. 2013) or DDBJ annotated sequence submissions. For EMBL-EBI, the European Nucleotide Archive (ENA) accepts all types of sequences and associated metadata through the same submission portal (Leinonen et al. 2011a). Any genomes deposited and released through the above-described resources will become a part of publicly available databases that can be searched through the Basic Local Alignment Search Tool (BLAST) (Johnson et al. 2008).
NCBI has in recent years developed virus-specific tools that can be used for phage sequence analysis. The NCBI Viral Genomes Resource offers specialized resources for analysis of phage genomes, such as curated reference genome databases, sequence comparison tools, protein clusters, and custom downloads (Brister et al. 2015).
Applications of Sequencing-Based Detection Methods
Gene-Based Detection of Bacteriophages
Perhaps the simplest way of detecting bacteriophages is to amplify a gene or gene fragment of the bacteriophage in question and then determine its sequence and taxonomic affiliation, a method called amplicon sequencing. This can be straightforward if the target bacteriophage has a published genome in an INSDC database. Then it is simply a matter of designing primers in a unique location of the genome, performing a PCR amplification and sequencing the PCR fragment. But it is also possible to detect “unknown” phages in a sample by amplifying a signature or marker gene. Unfortunately, there is no such thing as a universal viral/phage marker gene, comparable to the 16S rRNA gene in bacteria or the 18S rRNA gene in eukaryotes (Rohwer and Edwards 2002), which can be used to screen for any bacteriophage. There have been, however, primers developed that amplify signature genes targeting specific phage groups (Adriaenssens and Cowan 2014). In many cases, these are specific for a small group of viruses within a family of tailed phages.
Before the rise of short-read sequencing technology, PCR fragments used to be cloned into plasmid vectors, and the single inserts were sequenced with plasmid primers using Sanger sequencing. Currently, sequencing of amplicons representing diverse communities can be done with any of the sequencing platforms described above, but longer reads will lead to better resolution of viral diversity. Processing of the sequencing results is generally performed by clustering sequencing reads into operational taxonomic units (OTUs) based on a threshold similarity score (90–99% identity) or by direct sequence variant comparison. Several pipelines are available for the analysis of amplicon data dealing with quality control, removal of chimeric reads, and OTU clustering, such as MOTHUR (Schloss et al. 2009) and QIIME (Caporaso et al. 2010). The resulting OTUs can then be assigned to a taxonomic group and further investigated using phylogenetics.
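Threshold-based OTU clustering can be sketched in a few lines of Python. The example below is a simplified greedy centroid approach and not the algorithm of any particular pipeline: it assumes the amplicon reads are already aligned and of equal length, so that identity is simply the fraction of matching positions, and the function names are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes pre-aligned, equal-length sequences")
    if not a:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(sequences, threshold=0.97):
    """Assign each sequence to the first OTU whose centroid it matches.

    A sequence that matches no existing centroid at >= threshold
    founds a new OTU and becomes its centroid.
    """
    centroids = []  # one representative sequence per OTU
    members = []    # members[i] holds the sequences assigned to OTU i
    for seq in sequences:
        for i, centroid in enumerate(centroids):
            if identity(seq, centroid) >= threshold:
                members[i].append(seq)
                break
        else:
            centroids.append(seq)
            members.append([seq])
    return members


amplicons = [
    "ACGTACGTAC",
    "ACGTACGTAA",  # 1 mismatch to the first sequence (90% identity)
    "TTTTACGTAC",  # 3 mismatches to the first sequence (70% identity)
]
otus = greedy_otus(amplicons, threshold=0.9)
print(len(otus))  # 2
```

The outcome of a greedy scheme depends on input order, which is one reason pipelines typically sort reads (for example by abundance) before clustering.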
In the following section, we will give an overview of the phage groups which have been detected in previous studies using PCR amplification and sequencing.
Most phage amplicon studies have been targeted toward discovering the diversity of cyanophages in the environment, i.e., phages infecting cyanobacteria. The signature genes which have been used successfully in the past include structural genes, such as the portal protein (Fuller et al. 1998; Zhong et al. 2002; Short and Suttle 2005) or major capsid protein of myoviruses (Baker et al. 2006), or metabolism-related genes such as the ones coding for photosystem II proteins psbA and psbD (Zeidner et al. 2003; Millard et al. 2004; Clokie et al. 2006; Sullivan et al. 2006; Wang and Chen 2008) or the ATPase phoH (Marston and Amrich 2009; Goldsmith et al. 2011). Some of these marker genes target more specific or less diverse groups than others, for example, the structural protein markers target only a subgroup of phages belonging to the Myoviridae, whereas phoH is present in 40% of cultured marine phages, in certain eukaryotic viruses and in some phages infecting enteric bacteria.
The type isolate for this group of phages with contractile tails, Escherichia phage T4, is an iconic phage with a long history in molecular biology. Its major capsid protein (gp23) has been the basis for most of the primer sets (Filée et al. 2005; Comeau and Krisch 2008; Marston and Amrich 2009; Chow and Fuhrman 2012). Originally used in the marine environment, these gp23 amplicons have been found around the globe in, for example, rice paddy soil (Fujii et al. 2008; Wang et al. 2009a, b), freshwater lakes in Russia (Butina et al. 2010), and even in Antarctica (López-Bueno et al. 2009) and an Arctic glacier (Bellas and Anesio 2013). The detection of this very diverse group using amplicons overlaps with cyanophage detection as many cyanophages with myovirus morphology are related to T4.
Detection of T7-like phages belonging to the family Podoviridae can be done by targeting the DNA polymerase (polA) gene, with at least nine primer sets published so far (Breitbart et al. 2004; Labonté et al. 2009; Chen et al. 2009; Dekel-Bird et al. 2013). Partial sequences have been detected in a range of habitats around the world, including marine, freshwater, and terrestrial (Breitbart et al. 2004; Chen et al. 2009; Huang et al. 2010).
ssDNA phages belonging to the subfamily Gokushovirinae, family Microviridae, have been recently shown to be ubiquitous across habitats and geographic regions, based on amplification of the major capsid protein gene (Hopkins et al. 2014; Székely and Breitbart 2016). Using this gene, bloom-bust patterns and fluctuations in microvirus abundance were discovered in two freshwater lakes in France (Zhong et al. 2015).
Other Potential Signature Genes
There are other genes which are conserved among groups of phages, but which have not yet been investigated using gene-based sequencing approaches. A great resource to find a signature gene for a phage group of interest is the prokaryotic virus orthologous groups (pVOGs) database, formerly known as phage orthologous groups (POGs) (Kristensen et al. 2013; Grazziotin et al. 2017). This database comprises protein clusters (pVOGs) for all sequenced bacterial and archaeal viruses and can therefore be used to identify shared genes between any taxon of interest. For example, the second largest pVOG, VOG4544 terminase large subunit, is found in 83% of the tailed phages (order Caudovirales) (Grazziotin et al. 2017) and has been used in many phylogenetic analyses from isolates to virome studies (Sullivan et al. 2009; Roux et al. 2014). RNA phages all encode an RNA-dependent RNA polymerase, which can act as a signature gene, but only eukaryotic-infecting RNA viruses have been found with published primer sets (Culley et al. 2003; Culley and Steward 2007). A helpful tool for the choice of signature genes and primer design of a phage group of interest is PhiSigns, which is both available as stand-alone tool and web-based application (Dwivedi et al. 2012).
The year 2017 marked 40 years since the first bacteriophage, the ssDNA phage ɸX174, was sequenced (Sanger et al. 1977), followed shortly after by the genomes of the reference phages lambda and T7 (Sanger et al. 1982; Dunn et al. 1983). Due to their small sizes, ranging from approximately 3.5 kb (Friedman et al. 2009) to nearly 500 kb (Hatfull and Hendrix 2011), phage genomes were sequenced well in advance of the first bacterial genome, in 1995 (Fleischmann et al. 1995). However, the number of complete, i.e., finished or entirely sequenced genomes, and “whole genome shotgun” sequences available in public databases is higher for bacteria than for phages, despite the genome size differences that make the latter easier to sequence and assemble. Still, sequencing of novel phage genomes has increased greatly during the last two decades (Adriaenssens and Brister 2017).
The detection of bacteriophages based on their genome sequences became possible only once a repertoire of complete or near-complete genomes was available. Likewise, sequencing bacteriophage genomes is crucial to broadening knowledge of their biology, such as metabolic processes and interactions with their host and environment. A main drawback is therefore the currently small number of phage genomes available, which represents only a very small fraction of the predicted diversity of bacteriophages (Perez Sepulveda et al. 2016). Despite the relatively small number of sequenced phage genomes, their sequences have contributed significantly to the discovery of several phage genes directly involved in the host metabolism (Zeidner et al. 2003; Millard et al. 2004, 2009, 2010; Sullivan et al. 2006; Wang and Chen 2008; Sabehi et al. 2012; Chan et al. 2015). These auxiliary metabolic genes (AMGs) can play important roles in redirecting host metabolism by, for example, guiding carbon flux to the biosynthesis of deoxynucleotides through the pentose phosphate pathway and hence favoring bacteriophage replication (Thompson et al. 2011). As mentioned before, detection of bacteriophages using marker genes can only be used for a reduced number of bacteriophages, but by having a fuller understanding of their genomes, either the range of markers can be extended or potentially the relatively small genomes can be used as markers, allowing detection of supposedly “unculturable” bacteriophages.
Obtaining bacteriophages for sequencing normally involves the infection of suitable hosts by “plaque assay” in which individual plaques (i.e., visible clearing of bacterial culture that represents the lysis of the host by replication of bacteriophages) (see “Detection of Bacteriophage: Plaques and Plaque Assay”) are picked (i.e., selected) and propagated further to increase phage biomass. Bacteriophages may be further concentrated by methods such as polyethylene glycol precipitation and/or cesium chloride density gradient centrifugation. Nucleic acids are then extracted and purified for sequencing using standard methods, such as phenol-chloroform phase separation or the use of commercial kits. The dependence on the use of a sensitive host bacterium has limited the approach to sequencing only those bacteriophages capable of infecting the small proportion of bacteria that can be grown under laboratory conditions, even if those bacteria do not necessarily represent the “primary” host. This has become a problem for environments where not many hosts can be cultured. An example of this phenomenon is presented in a study by Brum and colleagues who found only 39 genomes that could be associated with “cultured” bacteriophages compared to an estimated total of 5,476 different dominant bacteriophage populations in the upper ocean (Brum et al. 2015).
Another issue related to this method of culturing phages is that these plaques can contain more than one phage, which can be due to spontaneous induction of any host prophages (Henn et al. 2010; Cowley et al. 2015). Having multiple phages in a sample can cause incorrect assemblies and misinterpretation of genomes. During recent years, there have been developments and method optimizations to sequence phage genomes extracted from single plaques, reducing the input material and costs associated with library preparation (Kot et al. 2014; Baym et al. 2015). In line with these advances, Rihtman and colleagues demonstrated successful Illumina high-throughput sequencing and assembly of samples containing multiple bacteriophages; the combination of multiple genomes per library preparation allows for cost-effective nucleic acid extraction and sequencing (Rihtman et al. 2016). As a general rule, a read coverage (i.e., the number of sequencing reads generated for each base of the genome) of 30X is adequate for successful assembly of a genome. Due to variation in the properties of different bacteriophages, in the case of multiplexing genomes, a coverage of 100X has been suggested for obtaining a reliable assembly. It is worth noting, however, that the efficiency described when sequencing multiple bacteriophages per library does not necessarily apply to closely related, i.e., similar, bacteriophages. Thus, in order to avoid mistakes caused by possible undetectable mis-assemblies, it is recommended to only multiplex genomes obtained from different hosts.
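As a rough planning aid, mean coverage can be estimated as the number of reads multiplied by the read length and divided by the genome size. The Python sketch below applies this relationship to the 30X and 100X figures mentioned above; the 50 kb genome and 250 bp read length are hypothetical values chosen for illustration, and the calculation assumes reads are spread evenly over the genome with no host contamination.

```python
import math


def mean_coverage(n_reads, read_length, genome_size):
    """Mean coverage = total sequenced bases / genome size."""
    return n_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Smallest number of reads that reaches the target mean coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


# Hypothetical example values (not taken from the text).
GENOME_SIZE = 50_000  # bp
READ_LENGTH = 250     # bp

print(reads_needed(30, READ_LENGTH, GENOME_SIZE))   # 6000
print(reads_needed(100, READ_LENGTH, GENOME_SIZE))  # 20000
```

In practice, coverage is uneven and some reads derive from host DNA, so the number of reads actually sequenced is usually set well above this minimum.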
The study of bacteriophage genomes has increased the knowledge of their structure and composition, which consequently allows the development and design of novel methods for detection, including new bioinformatic tools. Phage genomes are normally free from complex sequences that can massively affect genome assembly, such as transposable elements and repetitive sequences (i.e., gene duplications or variable-number tandem repeats). However, re-sequencing of cultured bacteriophages using different sequencing technologies can help to address the potential issues derived from sequence complexity, allowing increased accuracy and better discrimination of correct assemblies by including more data. Additionally, re-sequencing of bacteriophages decreases the needed coverage for proper assembly, thereby providing significant information regarding the evolution of bacteriophage genomes (Puxty et al. 2015).
Sequence-Based Identification of Prophages
A prophage represents a stage in the life cycle of a temperate phage wherein the phage genetic material is transmitted vertically with that of the bacterial host, either integrated into the chromosome or existing as a low-copy number plasmid (see “Lysogeny”). Prophages have been shown to be important elements of horizontal gene transfer which can contribute significantly to bacterial niche adaptation and population dynamics (Bossi et al. 2003; Fortier and Sekulovic 2013) (see “Transduction”). In particular, prophages have played a crucial role in the evolution of some notable bacterial pathogens, by encoding toxins and virulence factors, such as the Stx toxin of Shiga-toxigenic Escherichia coli, the cholera toxin of Vibrio cholerae, and the C1 neurotoxin of Clostridium botulinum (Brüssow et al. 2004). Consequently, the identification of prophages is an important application of sequencing technologies.
Because temperate phages spend part of their life cycle within the genome of bacteria, temperate phage genome sequencing is a natural by-product of the sequencing of bacterial genomes, and no specific methodological considerations are necessary to sequence bacterial genomes containing prophages. It is likely, therefore, that temperate phages are the most deeply sequenced of all phages. In fact, given that bacteria typically have multiple prophage sequences incorporated into their genomes (Casjens 2003), it could be argued that more genome sequences exist for temperate phages than for their bacterial hosts and therefore any other group of organisms on the planet. However, the challenge of accessing this wealth of temperate phage genome sequence data lies in identifying prophages in bacterial genome sequences.
Analogous to identifying bacteriophages from the environment using gene markers, temperate phages can be identified from within bacterial sequence space using sequence markers. A number of computational tools have been developed to mine bacterial genome datasets for prophage sequences including Phage_Finder (Fouts 2006), Prophage Finder (Bose and Barber 2006), Prophinder (Lima-Mendez et al. 2008), PHAST (Zhou et al. 2011), PhiSpy (Akhter et al. 2012), VirSorter (Roux et al. 2015), and PHASTER (Arndt et al. 2016). All these tools rely on characteristics of prophage genome sequences, such as identification of sequences with homology to characterized phage genes and identification of direct repeats corresponding to phage attachment (att) sites or regions of DNA with differential GC skew, protein length, or transcription strand directionality.
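Two of the sequence signals mentioned above, GC skew and direct repeats flanking an integrated element, can be illustrated with a short sketch. This is not how any of the listed tools is implemented; the function names, the toy sequence, and the parameter values are assumptions made for illustration only, and real tools combine such signals with homology searches against known phage genes.

```python
from collections import defaultdict


def gc_skew(seq: str) -> float:
    """GC skew (G - C) / (G + C); defined as 0.0 when the window has no G or C."""
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def windowed_gc_skew(seq: str, window: int, step: int):
    """GC skew for each full-length window, returned as (start, skew) pairs."""
    seq = seq.upper()
    return [
        (start, gc_skew(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


def direct_repeats(seq: str, k: int, min_gap: int):
    """Exact k-mers found twice on the same strand, at least min_gap bases apart.

    Returns sorted (first_position, second_position, kmer) tuples. Such pairs
    are candidates for att-like sites flanking an inserted element.
    """
    seq = seq.upper()
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    hits = []
    for kmer, pos in positions.items():
        for a in range(len(pos)):
            for b in range(a + 1, len(pos)):
                if pos[b] - pos[a] >= min_gap:
                    hits.append((pos[a], pos[b], kmer))
    return sorted(hits)


# Toy sequence: a hypothetical 12 bp att-like repeat flanking a 40 bp insert
# whose two halves have opposite GC skew.
att = "TTGACCTGAAGC"
toy = "AAAA" + att + "G" * 20 + "C" * 20 + att + "TTTT"

print(direct_repeats(toy, k=12, min_gap=20))
print(windowed_gc_skew(toy, window=20, step=4))
```

On real genomes, exact-match repeats of this length occur by chance and att sites may be imperfect, so a signal like this would only be one line of evidence among several.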
Unfortunately, there are currently several limitations to these approaches. Firstly, most of the listed prophage identification tools are designed to be run on complete (single-contig) bacterial genome sequences. This is because the naturally modular nature of prophages makes it impossible to accurately predict which prophage-containing contigs belong together when multiple contigs are present. Prophage prediction tools rely heavily on the locality of prophage-signature sequences, but this locality is likely to be arbitrary in unfinished, multi-contig assemblies. Though one of the recent prophage prediction tools, PHASTER, is able to handle multi-contig files, a single functional prophage that assembles as two contigs would likely be assigned as two “incomplete” prophages. As the vast majority of bacterial genome sequence data exists as unfinished, multi-contig assemblies, accessing the prophage content of these genomes remains challenging.
A second limitation lies in inferring functionality of the prophage. Prophage sequences within bacterial genomes can exist in a variety of complex functional states, from fully functional (capable of induction and replication), to defective but capable of resuscitation, to extremely degraded representing a host-domesticated prophage remnant. Distinguishing between functional, nonfunctional, and incomplete prophage remnants in silico is extremely difficult. This task is further complicated by the issue of multi-contig genome assemblies discussed above. How can a remnant prophage island be computationally distinguished from a functional prophage disrupted across separate contigs in an unfinished bacterial genome assembly? Even in complete, single-contig assemblies, mutations that inactivate prophage function can be as subtle as single noncoding SNPs (Owen et al. 2017), making the accurate computational prediction of prophage function virtually impossible.
As well as the sequencing of prophages together with their bacterial hosts, other methods to sequence temperate phages exist. Functional temperate phages may be easily sequenced by amplification on a sensitive host, virion concentration, and purification, as has been described for phage genome sequencing above. Even when putative temperate prophages cannot be replicated, for example, if a sensitive host strain is not available, prophage induction using SOS response-inducing agents (e.g., mitomycin C, norfloxacin, or nalidixic acid) or UV-light exposure may be used to induce the formation of phage particles (Raya and Hébert 2009) (see “Lysogeny”). Phage particles can then be purified from the culture supernatant for sequencing. When multiple prophage-like sequences are present in the genome, this approach makes it possible to determine which of them can be induced to form phage particles.
An important consideration when sequencing induced phages from culture supernatant is the removal of contaminating bacterial chromosomal DNA. Thorough DNase treatment procedures should therefore be undertaken, such that non-prophage chromosomal DNA can no longer be detected in the sample by PCR, to ensure that only virion-encapsulated phage DNA is sequenced. The advantage of this technique is that it may allow temperate phages to be identified in previously uncharacterized bacteria, in which little functional information can be obtained based on homology to known genes. Sequencing of protein-encapsulated DNA would provide convincing evidence for the existence of novel prophages, even if the sequence had no homology to known phages.
Finally, methods have been developed to selectively enrich DNA samples for phages that exist as extrachromosomal elements, for example, as circular or linear plasmids. Certain bacterial genera such as Chlamydia and Borrelia, which are associated with small genome sizes, have been found to harbor extrachromosomal prophage plasmids rather than integrated prophages (Casjens 2003). A method to selectively enrich genomic DNA samples for extrachromosomal prophage elements in Staphylococcus by the use of plasmid purification kits has been reported (Utter et al. 2014). However, the increasing adoption of long-read sequencing technology may represent a more effective way to investigate extrachromosomal prophage elements.
Metagenomics-Based Detection of Bacteriophages
It is possible to detect phages in any environment without any prior knowledge about their genome using shotgun metagenomics or, more specifically, metaviromics. In its most basic form, metagenomics is the sequencing of all the nucleic acid extracted from the environment (or any sample of interest). For phages, and viruses in general, these metagenomics protocols are amended to account for the smaller genome sizes of bacteriophages compared to bacteria, fungi, and protists. In many environments with low biomass, a concentration step is necessary to reduce the volume of aquatic input material before sample processing, with concentration generally performed using tangential flow filtration (Vega Thurber et al. 2009) or, for smaller volumes, by using spin filter columns in a benchtop centrifuge (Bolduc et al. 2012). The most commonly used step following concentration is the combination of viral enrichment by size exclusion of the cellular fraction using filters and nuclease treatment for the removal of free-floating DNA and RNA (Vega Thurber et al. 2009; Hall et al. 2014). When using size exclusion to remove bacteria, it is important to keep in mind the size of the phage target, with some of the jumbo myoviruses such as Pseudomonas phages phiKZ and EL and their relatives (head diameter of >120 nm and tail length of ~200 nm) potentially clogging a 0.22 μm filter pore (Hertveldt et al. 2005). Subsequently, the phage (viral) community nucleic acid can be extracted and sequenced using next-generation sequencing platforms. In most cases, researchers will be targeting dsDNA as this genome group represents most known phages.
Metaviromics was first used to explore uncultured marine virus communities, revealing that a large fraction of these communities was made up of phages (Breitbart et al. 2002). Ever since this seminal paper, the techniques and approaches in metaviromics have been optimized and updated to investigate viral/phage ecology, in a field that has started to come of age (Sullivan 2015; Sullivan et al. 2017). Virtually every habitat sampled and analyzed with metaviromics has shown that phages make up a substantial fraction, if not the majority, of identified sequences. These habitats range from pristine environments in the polar regions (López-Bueno et al. 2009; Zablocki et al. 2014; Aguirre de Cárcer et al. 2015; Adriaenssens et al. 2017), over the global oceans (Huang et al. 2010; Mizuno et al. 2013; Brum et al. 2015) and freshwater lakes (Roux et al. 2012; Labonté and Suttle 2013; Skvortsov et al. 2016), to soils (Fierer et al. 2007; Zablocki et al. 2017). Currently, only marine epipelagic habitats have been sampled near saturation, allowing network-based clustering approaches to describe the full diversity, revealing that the most abundant viral clusters represent phages infecting members of the phyla Actinobacteria, Proteobacteria, Bacteroidetes, Cyanobacteria, and Deferribacteres (Roux et al. 2016).
Metaviromics has become the go-to method to identify human-associated phage communities. Early studies of the human gut revealed highly diverse and unknown phage communities, with the majority of the known phage signatures belonging to tailed phages of the order Caudovirales (Breitbart et al. 2003). Comparative analyses revealed high interpersonal differences in gut phage communities, but low levels of change over time within the same individual, and a predominance of temperate phages (Reyes et al. 2010, 2012, 2015; Minot et al. 2011; Manrique et al. 2016). Metaviromic sequencing of ultrasmall amounts of DNA also showed unique phage communities on the human skin with differences according to topical sites and large intrapersonal differences (Hannigan et al. 2015). Other metavirome studies have found viral communities dominated by bacteriophages in bodily fluids, such as saliva and urine, and to a lesser extent in blood where phages might have originated from contamination of the sequencing procedure (Pride et al. 2012; Santiago-Rodriguez et al. 2015; Moustafa et al. 2017).
In an alternative approach, metaviromic sequencing analyses have been used to identify the phages present in phage cocktails used in phage therapy treatments and experiments. The first such cocktail analyzed was a Russian cocktail called ColiProteus (Microgen) used against E. coli and Proteus infections (McCallin et al. 2013). This study revealed 17 different phage groups present in the cocktail at different abundances, suggesting that some of the low abundance groups were by-products of phage cocktail production. The second study investigated the Intesti phage cocktail from the Eliava Institute in Georgia, active against a range of enteric bacteria (Zschach et al. 2015) (see “Current Updates in the Long-Standing Phage Research Centers in Georgia, Poland and Russia”). The metaviromic analysis showed 23 different sequence groups, called phage clusters by the researchers, falling within the families Myoviridae, Siphoviridae, and Podoviridae and an unassigned grouping. These two studies showed that these phage cocktails were more complex than initially assumed and might contain additional (unwanted) phage sequences at low abundances.
One of the most interesting findings to come from a metaviromic sequencing approach is the discovery of a highly abundant phage in human gut viromes (Dutilh et al. 2014). The researchers in this study used a cross-assembly approach (assembling multiple datasets together in one contig set) in order to increase contig length and establish a co-occurrence profile of contigs over the different samples (Dutilh et al. 2012). They were able to reconstruct a circular contig of ~97 kb representing a phage genome they labeled crAssphage and verified its existence by long-range PCR and Sanger sequencing. In a comparison with all published metagenomes at the time, this phage was found to make up a significant portion of gut metagenomes [up to 90% of reads of the twin dataset (Reyes et al. 2010)] and represented 1.7% of all sequencing reads from human feces, making crAssphage one of the most abundant phages in publicly available datasets. With more phage genomes sequenced and metagenome data becoming available, it is possible that more of these abundant, unknown phages will be discovered.
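The idea behind a co-occurrence profile can be sketched as follows: contigs that derive from the same genome should rise and fall in read depth together across samples. The code below is only an illustration of that principle; the Pearson correlation, the threshold, the function names, and the depth values are assumptions and are not taken from the study or from the software it used.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length depth profiles (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def co_occurring(profiles, seed, threshold):
    """Names of contigs whose depth profile correlates with the seed contig."""
    seed_profile = profiles[seed]
    return sorted(
        name
        for name, profile in profiles.items()
        if name != seed and pearson(seed_profile, profile) >= threshold
    )


# Hypothetical read depths of three cross-assembled contigs in five samples.
profiles = {
    "contig_A": [120, 5, 300, 0, 80],
    "contig_B": [240, 10, 600, 0, 160],  # tracks contig_A at twice the depth
    "contig_C": [3, 150, 2, 90, 1],      # abundant where contig_A is rare
}

print(co_occurring(profiles, "contig_A", threshold=0.9))
```

Contigs grouped in this way would still need independent confirmation, as was done in the study by long-range PCR and Sanger sequencing.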
The development of sequencing technology and the subsequent boom in next-generation sequencing platforms (both short- and long-read platforms) have been fundamental in advancing bacteriophage research. These methods have not only contributed to an explosion of genomes in public databases but have also provided an opportunity to exploit sequencing-based methods for bacteriophage detection. At the simplest level of complexity, phages can be detected by the sequencing of a single marker gene. Whole genome sequencing of isolated phages has populated reference databases, while bacterial genome sequencing has led to the discovery of a plethora of previously undetected prophage genomes. At the community level, shotgun sequencing methods have made it possible for researchers to investigate all bacteriophages in a sample without previous knowledge of its content. In conclusion, sequencing and next-generation sequencing technology have added a new layer to bacteriophage research, opening new avenues of research, from exploitation of genes for biotechnological applications to population ecology.
- Ackermann H-W (2011) Bacteriophage taxonomy. Microbiol Aust 32:90–94
- Breitbart M, Miyake JH, Rohwer F (2004) Global distribution of nearly identical phage-encoded DNA sequences. FEMS Microbiol Lett 236:249–256. https://doi.org/10.1111/j.1574-6968.2004.tb09654.x
- Hannigan GD, Meisel JS, Tyldsley AS et al (2015) The human skin double-stranded DNA virome: topographical and temporal diversity, genetic enrichment, and dynamic associations with the host microbiome. mBio 6:e01578-15. https://doi.org/10.1128/mBio.01578-15
- Keller MW, Rambo-Martin BL, Wilson MM et al (2018) Direct RNA sequencing of the complete Influenza A virus genome. bioRxiv. https://doi.org/10.1101/300384
- Lee Y-J, Dai N, Walsh SE et al (2018) Identification and biosynthesis of thymidine hypermodifications in the genomic DNA of widespread bacterial viruses. Proc Natl Acad Sci. https://doi.org/10.1073/pnas.1714812115
- Millard AD, Zwirglmaier K, Downey MJ et al (2009) Comparative genomics of marine cyanomyoviruses reveals the widespread occurrence of Synechococcus host genes localized to a hyperplastic region: implications for mechanisms of cyanophage evolution. Environ Microbiol 11:2370–2387. https://doi.org/10.1111/j.1462-2920.2009.01966.x
- Zablocki O, Van Zyl L, Adriaenssens EM et al (2014) High-level diversity of tailed phages, eukaryote-associated viruses and virophage-like elements in the metaviromes of Antarctic soils. Appl Environ Microbiol 80:6888–6897. https://doi.org/10.1128/AEM.01525-14
- Zhong Y, Chen F, Wilhelm SW et al (2002) Phylogenetic diversity of marine cyanophage isolates and natural virus communities as revealed by sequences of viral capsid assembly protein gene g20. Appl Environ Microbiol 68:1576–1584. https://doi.org/10.1128/AEM.68.4.1576-1584.2002