Background

The bulk of the human-associated virome resides in the distal gastrointestinal tract and is composed of tailed double-stranded (ds) DNA bacteriophages (dsDNA phages) [1,2,3] that, in the recent virus megataxonomy, are classified as the class Caudoviricetes under the phylum Uroviricota [4]. The ternary interactions between phages, bacteria, and their human hosts are being elucidated at an increasing pace through experiments on model systems and sequencing of the uncultured community of viruses (virome) [5,6,7,8,9]. Comparisons of the human gut virome within and between individuals unveil remarkable longitudinal stability and high diversity of resident phages [2, 10, 11]. Although the human gut offers a rich source of phage genomic diversity, the virome so far has been explored to a much lesser extent than the whole community (metagenome), composed of viruses, bacteria, and archaea. The rapid growth of the public whole-community metagenomic data offers the opportunity to identify numerous novel phage genomes lurking in metagenomes.

Tailed dsDNA phages encapsidate their genome as a linear molecule, but depending on the terminal genomic arrangement, many complete phage genomes assemble into a “circular” contig (i.e., a contig with direct terminal repeats) [12]. Thus, circularity can be used as one feature to identify putative complete phage genomes in viromes and metagenomes. However, the comparatively small size of dsDNA phage genomes (~50 kb, on average) [13] and the estimated low virus-to-microbe ratio in the gut (1:10) [1] jointly translate into a relatively small amount of phage DNA present in whole-community metagenomic libraries [14]. Moreover, similar-sized plasmids also assemble into circular contigs [15]. A recently developed computational method aims to address this problem by focusing specifically on the assembly of circular phage genomes and their automatic discrimination from plasmids based on gene content [16]. Nevertheless, the genetic repertoire shared between plasmids and phages, for example, the parABS partitioning system encoded by both Escherichia coli phage P1 and plasmids [17], can obfuscate their automatic annotation-based discrimination and necessitate manual curation. Despite these challenges, there is a pressing need to reduce the amount of viral “dark matter” in the human gut by identifying and classifying phages for reference-based analyses [18, 19].

The global organization of the virosphere was recently captured in a comprehensive, unified framework that uses protein domains encoded by viral hallmark genes to infer evolutionary connections between major groups of viruses [4]. This framework was subsequently approved by the International Committee on the Taxonomy of Viruses (ICTV) as the comprehensive, multi-rank taxonomy of viruses. In particular, dsDNA viruses possess either the HK97 fold or the double jelly-roll fold in their major capsid proteins, along with distinct ATPases involved in capsid maturation, and thus appear to have independent origins, justifying their separation into two realms (the highest virus taxon rank) [4]. Tailed dsDNA phages, with their HK97 major capsid proteins, comprise the order Caudovirales within the class Caudoviricetes, under the phylum Uroviricota (which also includes the distantly related herpesviruses of animals), and are further classified into 9 families. With the now formally recognized ability to classify viruses from sequence data alone [20], phylogenomic analysis of uncultured phage genomes can delineate novel taxa.

Here, we describe 3738 completely assembled phage genomes discovered by analysis of 5742 whole-community human gut metagenomes. Using abundance, taxonomy, and genomic composition as criteria to select genomes for further scrutiny, three groups of phages comprising potential new families were analyzed in detail. All these candidate families, named “Quimbyviridae,” “Flandersviridae,” and “Gratiaviridae,” consist of phages infecting bacteria of the phylum Bacteroidetes, and the first two are widely distributed and abundant in human gut viromes. The phages in these families and others yet to be classified encode enzymes that are involved in the response of cells to oxidative stress, implicating phages in the tolerance of anaerobes to oxygen. Furthermore, comparative genomic analysis exposed genetic cassettes that are unique to some genera in each family and thus appear to be relatively recent acquisitions involved in phage-host interactions. Addition of all the phage genomes identified here to public databases will substantially expand the known phage diversity and augment taxonomic classification of the human gut virome.

Methods

Identification of phage genomes in human gut metagenomes

A total of 5742 whole-community metagenome assemblies generated from human fecal samples were downloaded from the NCBI Assembly database (accessed 8/2019). To limit the search space to likely complete genomes, 95,663 “circular” contigs (50–200 bp direct overlap at contig ends) were extracted from these assemblies. Next, 304 phage-specific protein alignments from the CDD database [21] and 117 custom alignments [22] were converted to Hidden Markov Models (HMMs) using hmmpress (v. 3.2.1). Proteins in the 95,663 contigs were predicted by Prodigal (v. 2.6.3) [23] in the metagenomic mode and searched against the combined set of 421 phage-specific HMMs using hmmsearch, with the relaxed e-value cutoff of < 0.05. Contigs with at least one hit (n = 4907) were selected for a second round of searches after correcting for re-assigned codons, as follows. All contigs were searched for the presence of tRNAs using tRNAscan-SE (v. 2.0) [24]. In 212 contigs, an amber stop codon-suppressor tRNA was identified. ORFs were re-predicted for these contigs with the amber stop codon re-assigned to glutamine, given that this reassignment is most commonly observed in human gut phages [25, 26]. The re-translated contigs were added back to the database and all contigs were subjected to a second profile search with a stricter e-value cutoff (< 0.01). Contigs were classified as phages when exceeding 3 kbp in length and possessing at least one ORF that matched a capsid, portal, or large terminase subunit protein profile below the e-value threshold. The phage classifications were cross-checked with Seeker [27] and ViralVerify [16]. In cases where both tools classified a contig as non-phage, the protein annotations were examined manually, revealing four contigs of ambiguous identity that were discarded.
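The two selection rules in this paragraph (a direct terminal overlap of 50–200 bp, then a length and marker-profile criterion) can be illustrated with a minimal Python sketch. The function names, the exact-match comparison of contig ends, and the representation of profile hits as a plain list of e-values are assumptions made for illustration; they are not taken from the pipeline used in this study.

```python
def terminal_overlap(seq, min_len=50, max_len=200):
    """Return the length of the longest direct repeat shared by both
    contig ends within [min_len, max_len], or 0 if there is none."""
    seq = seq.upper()
    longest = min(max_len, len(seq) // 2)
    for k in range(longest, min_len - 1, -1):
        # A "circular" contig starts and ends with the same k bases.
        if seq[:k] == seq[-k:]:
            return k
    return 0


def is_phage_contig(length_bp, marker_evalues, min_length=3000, max_evalue=0.01):
    """Apply the text's criteria: longer than 3 kbp and at least one
    capsid, portal, or TerL profile hit below the e-value cutoff."""
    return length_bp > min_length and any(e < max_evalue for e in marker_evalues)
```

Real assemblies may carry mismatches in the terminal repeat, so a production implementation would likely need a tolerant comparison rather than exact string equality.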

Collection of phage genomes in GenBank

Taxonomic accession codes corresponding to all prokaryotic viruses were collected from the NCBI Taxonomy database and used to extract sequences longer than 3 kbp from the non-redundant nucleotide database (accessed 09/2019). The protein predictions for each genome sequence were retrieved using the “efetch” functionality in the entrez direct command line tools [28]. Genomic sequences lacking protein predictions were discarded.

Dereplication and annotation of phage genomes

The collections of GenBank and human gut phage genomes were each dereplicated at 95% average nucleotide identity across 80% of the genome length using dRep (v. 2.6.2) [29] and its associated dependencies, Mash [30] and FastANI [31], with all other settings left as default. The proteins from these contigs were collected and clustered at 95% amino acid identity across 50% of the protein length using mmclust [32]. The representative protein sequences were combined into a single BLAST database and compared against the multiple sequence alignments (MSAs) in the CDD database [21] with PSI-BLAST [33] at an e-value cutoff of 0.01. If the representative protein sequence produced a significant result, the representative and all constituent members of the protein cluster were annotated using the best hit.
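The last step, transferring the best hit of a cluster representative to every member of its protein cluster, can be sketched as follows. The data layout (dictionaries keyed by representative identifier) and the function name are hypothetical, and treating the 0.01 cutoff as inclusive is an assumption.

```python
def propagate_annotations(clusters, best_hits, max_evalue=0.01):
    """Annotate every cluster member with the best hit of its representative.

    clusters  : {representative_id: [member_id, ...]}
    best_hits : {representative_id: (annotation, e_value)}
    """
    annotations = {}
    for representative, members in clusters.items():
        hit = best_hits.get(representative)
        if hit is None:
            continue  # the representative produced no hit at all
        annotation, e_value = hit
        if e_value > max_evalue:
            continue  # hit is not significant at the chosen cutoff
        annotations[representative] = annotation
        for member in members:
            annotations[member] = annotation
    return annotations
```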

Phylogenetic reconstruction

Alignments of the large terminase subunit (TerL), capsid, or portal protein were constructed as previously described [34]. Marker proteins from the metagenomic phages were combined with markers from GenBank phages into a single database and initially clustered to 50% amino acid identity using mmclust [32]. The clusters were aligned using MUSCLE [35]; cluster alignments were then compared to each other using HHsearch (v. 3.0) [36]. The cluster-cluster similarity scores were converted to distances as $d_{A,B} = -\ln\left(S_{A,B} / \min(S_{A,A}, S_{B,B})\right)$, where $S_{A,B}$ is the similarity between the profiles A and B. An unweighted pair group method with arithmetic mean (UPGMA) dendrogram was then constructed using the estimated cluster distances. Tips of the tree (depth < 1.5) were used to guide the pairwise alignment of the clusters at the tree leaves with HHalign, creating larger protein clusters. The resulting alignments were filtered to remove sites with more than 50% gaps and a homogeneity lower than 0.1 [37]. The filtered alignment was used to construct an approximate maximum-likelihood tree using FastTree [38], with the Whelan-Goldman model of amino acid evolution and gamma-distributed site rates. Examination of the trees identified 353 nearly identical PhiX-174 sequences that were removed from subsequent analyses as contamination from a sequencing reagent.
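The score-to-distance conversion described above can be written out as a short Python sketch. The function names and the dictionary layout for the similarity scores are assumptions for illustration; the resulting pairwise distances are what a UPGMA implementation would take as input.

```python
import math


def profile_distance(s_ab, s_aa, s_bb):
    """Distance between profiles A and B: -ln(S_AB / min(S_AA, S_BB))."""
    return -math.log(s_ab / min(s_aa, s_bb))


def distance_matrix(scores):
    """Build pairwise distances from similarity scores.

    scores maps (a, b) with a < b to the cross-score and (a, a) to the
    self-score of each profile (hypothetical layout).
    """
    names = sorted({a for a, _ in scores})
    return {
        (a, b): profile_distance(scores[(a, b)], scores[(a, a)], scores[(b, b)])
        for a in names
        for b in names
        if a < b
    }
```

Normalizing by the smaller self-score means that a profile scoring as well against another cluster as against itself gets a distance of zero.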

Phage genome analysis

A gene-sharing network of phage genomes was constructed using Vcontact2 (v. 0.9.19) [39], with default search settings against the database of dereplicated GenBank phage genomes. The results were imported into Cytoscape (v. 3.8) [40] for visualization.

The ORFs for selected groups of phages (see the main text) were additionally annotated through HHblits searches against the Uniprot database clustered to 30% identity and the PDB database clustered to 70% identity (available at http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/, accessed 02/2020) [41]. Genomes encoding a predicted reverse transcriptase (RT) were examined for the presence of repeats corresponding to a diversity-generating retroelement using DGRScan with default settings [42]. To identify repeats outside of the 10 kb RT-centered window (the default window of DGRScan), the template repeats were used as BLASTn queries against the encoding genome with the following parameters: -dust no -perc_identity 75 -qcov_hsp_perc 50 -ungapped -word_size 4.

Fractional abundance of phage genomes in metagenomes

Dereplicated phage genomes from the NCBI GenBank database were combined with the dereplicated gut phages into a single database and indexed for read recruitment using Bowtie2 [43]. A collection of 1241 human gut viromes was downloaded from the NCBI SRA using the SRA-toolkit (v. 2.10) and quality filtered with fastp (v. 0.20.1) [44]. The quality-filtered virome reads were mapped to a database containing the reference human genome (GCF_000001405), phiX-174 (NC_001422.1), and cloning vectors (available from ftp://ftp.ncbi.nlm.nih.gov/pub/UniVec/) using Bowtie2 with default parameters. Unaligned, “decontaminated” reads were then recruited to the phage database using Bowtie2 with default parameters, except for the following additions: “--no-unal --maxins 1000000”. The length-normalized fractional abundance of each phage genome in each virome was calculated as described previously [1].
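The exact formula for the length-normalized fractional abundance is given in [1] and is not reproduced here. The sketch below shows one common way to compute such a quantity and should be read as an assumption, not as the published procedure: reads recruited to each genome are divided by genome length, and the resulting densities are rescaled to sum to 1 within a virome.

```python
def fractional_abundance(read_counts, genome_lengths):
    """Length-normalized fractional abundance within one virome.

    read_counts    : {genome_id: number of recruited reads}
    genome_lengths : {genome_id: genome length in bp}
    Assumed formula: (reads / length) divided by the sum of that ratio
    over all genomes in the database.
    """
    density = {
        genome: read_counts.get(genome, 0) / length
        for genome, length in genome_lengths.items()
    }
    total = sum(density.values())
    if total == 0:
        return {genome: 0.0 for genome in density}
    return {genome: value / total for genome, value in density.items()}
```

With equal read counts, a 1 kb genome receives four times the fractional abundance of a 4 kb genome, which is the intended effect of the length normalization.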

Host prediction from CRISPR-spacer matches

A database of CRISPR spacers was compiled from previous surveys of CRISPR-Cas systems [45, 46]. Each spacer was used as a BLASTN [47] query against the phage genomes, using word size of 8 and low complexity filtering disabled. A phage-host prediction was inferred if the spacer was 95% identical over 95% of its length to a phage sequence.
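The matching rule (at least 95% identity over at least 95% of the spacer length) amounts to a simple filter on BLASTN hits. The following Python sketch assumes that the percent identity, alignment length, and spacer length have already been parsed from the search output; the function name and the inclusive thresholds are illustrative assumptions.

```python
def spacer_matches(percent_identity, alignment_length, spacer_length,
                   min_identity=95.0, min_coverage=0.95):
    """Return True if a spacer-protospacer hit supports a host prediction."""
    coverage = alignment_length / spacer_length
    return percent_identity >= min_identity and coverage >= min_coverage
```

For a typical 32-nt spacer, these thresholds tolerate at most one mismatch in a full-length alignment (31/32 is about 96.9% identity), and the alignment may be at most one nucleotide shorter than the spacer.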

Prediction of anti-CRISPR proteins

Identification of anti-CRISPR proteins (Acrs) was carried out as previously described [48]. Briefly, each protein was assigned a score by the Acr prediction model ranging between 0 and 1, where a higher score corresponds to a higher likelihood of the protein being an Acr. The proteins were then clustered at 50% amino acid identity, and a cluster was considered a candidate Acr if it satisfied the following criteria: (1) its members received a mean score of 0.9 or above, (2) its members are present in a directon of 5 or fewer genes, (3) at least one of the directons encodes an HTH domain-containing protein, and (4) the cluster does not produce a hit with an HHpred probability greater than 0.9 to any PDB or CDD database sequence [21].
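The four criteria can be expressed as a single predicate over a protein cluster, as in the hedged Python sketch below. The class and field names are hypothetical. The sketch also interprets criterion (2) as "at least one member lies in a directon of 5 or fewer genes" and evaluates criterion (3) on those short directons, which is one possible reading of the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Directon:
    n_genes: int    # number of genes in the directon
    has_hth: bool   # encodes an HTH domain-containing protein


@dataclass
class ProteinCluster:
    scores: List[float]         # Acr model score of each member (0 to 1)
    directons: List[Directon]   # directons in which members are found
    best_hhpred_prob: float     # best HHpred probability vs PDB/CDD (0 to 1)


def is_candidate_acr(cluster):
    """Apply criteria (1) to (4) from the text to one protein cluster."""
    mean_score = sum(cluster.scores) / len(cluster.scores)
    short_directons = [d for d in cluster.directons if d.n_genes <= 5]
    return (
        mean_score >= 0.9                               # criterion 1
        and bool(short_directons)                       # criterion 2
        and any(d.has_hth for d in short_directons)     # criterion 3
        and cluster.best_hhpred_prob <= 0.9             # criterion 4
    )
```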

Results

Identification of novel phage genomes from whole-community human gut metagenomes

The collection of 5742 whole-community assembled metagenomes was searched for the presence of complete phage genomes. To limit the search space to likely closed genomes, “circular” contigs were extracted from these assemblies that contained direct repeats at their termini, within the typical k-mer size used to assemble short reads (50–200 bp, n = 95,663). Each contig was searched for open reading frames (ORFs) matching a known phage marker profile (i.e., the terminase large subunit, major capsid protein, or portal protein). In total, 3738 contigs encode at least one ORF that passed the e-value and length cutoff criteria (the “Methods” section) (Additional file 1). Dereplication at approximately 95% mean nucleotide identity reduced the number of phage marker-matching contigs to 1886 (Additional file 2). A subset of 664 contigs encoded all three markers, 531 encoded two of the three, and the remaining 691 possessed a single detectable marker (Additional file 3). The putative phage contigs had a median length of 44.9 kb, which is consistent with the recent estimates of the median genome size of dsDNA phages [49]. To exclude any contaminating contigs (e.g., a plasmid harboring an integrated phage), each was assessed with ViralVerify [16] and Seeker [27], two bioinformatic tools trained to discriminate phage genomes from other sequences. These tools classified all but 36 of the selected contigs as phages with varying levels of confidence (Additional file 2). Upon manual examination for typical phage genes other than the markers, four of the 36 unassigned contigs were discarded, and the remaining 32 were found to represent false negative classifications by these tools, as judged by the presence of signature phage genes. Although we cannot rule out the possibility that some non-phage contigs were retained erroneously, the results collectively suggest that the set of circular marker-matching contigs predominantly consists of complete phage genomes.

To determine the host ranges of the phages, a database of CRISPR spacers from prokaryotic genomes was used to query the metagenomic phages for potential matches. In total, 553 (29%) of the dereplicated phage genomes were found to be targeted by at least one CRISPR-Cas system allowing host prediction (Additional file 4). The most common predicted hosts were Firmicutes (323 phages), followed by Bacteroidetes (143), Actinobacteria (43), Proteobacteria (41), and Verrucomicrobia (4). Among the identified phages, 111 were predicted to infect at least two different bacterial genera, consistent with other studies demonstrating that related bacteria possess CRISPR spacers targeting the same phage [45, 50]. Notably, 359 of the dereplicated phages harbor at least one protospacer identical to another gut phage (Additional file 4), indicating that a single CRISPR spacer often can confer immunity against multiple phages.

Many phages have been found to encode anti-CRISPR proteins (Acrs) to parry CRISPR-Cas defenses [51,52,53,54]. Given their function in counter-defense, Acrs evolve rapidly and show limited sequence similarity to experimentally characterized Acrs, making inference challenging [51]. However, a machine-learning based method has been recently developed that utilizes genomic context to identify candidate Acrs [48]. Application of this method showed that 41 phages, 16 of which were found to be targeted by a CRISPR-Cas system of their inferred host, encoded at least one candidate Acr (Additional file 5). The highest-scoring Acrs belong to four phages that are targeted by Bifidobacterium CRISPR-Cas systems. All four phages are > 97% identical over > 90% of their length at the nucleotide level to uncharacterized prophages in cultured Bifidobacterium isolates (Additional file 5), confirming their host-tropism assignment via CRISPR spacer-protospacer matches. In these phages, two candidate Acr-encoding genes lie between the large terminase subunit and integrase (Additional file 6). The localization of the Acr-encoding genes suggests they are expressed not only upon initial entry into the host cell and during lysogeny [55], but also upon transition to the lytic program to prevent cleavage of progeny phage genomes by CRISPR-Cas, as demonstrated experimentally in Listeria-infecting phages [56]. Transcription of Acrs is typically regulated by HTH domain-containing proteins termed Acr-associated proteins (Acas) [57]. Indeed, in the Bifidobacterium phages identified here, a short HTH domain-encoding ORF is located immediately downstream of the Acrs and can be predicted to regulate the expression of these two genes throughout the phage lifecycle. 
While these uncharacterized Bifidobacterium phages possess the characteristic features of Acr loci, the great majority of the phages identified in this work did not harbor any detectable Acrs yet were targeted by CRISPR-Cas (Additional file 4). Some of these phages might encode distinct Acrs undetectable by the method we used that was trained on a collection of previously characterized Acrs, whereas others might employ alternative anti-CRISPR strategies.

Taxonomic decomposition of the gut phages identifies previously unknown putative families

Phylogenetic trees were constructed for the large terminase subunit (TerL), major capsid protein (MCP) and portal protein encoded in each phage genome using an iterative approach to construct the underlying alignments [34]. The trees were constructed alongside reference proteins derived from phage genomes extracted from the NCBI GenBank database. Reflecting the set of protein profiles employed to identify the phage contigs, 1480 (78%) genomes were assigned to the phylum Uroviricota, 360 (19%) to the Phixviricota, and 46 (2%) to the Loebvirae (Additional file 3). The phylum Phixviricota includes Escherichia coli phage phiX174 that is used as a sequencing reagent; however, the 360 Phixviricota phages detected in this analysis do not include any sequences closely related to phiX174 (see the “Methods” section). The remaining analyses focus on the taxonomic decomposition of the phages that belong to Uroviricota, given that these contigs represent by far the largest fraction of the recovered genomes and that the Loebvirae and Phixviricota phyla are the subject of recent taxonomic analyses [58, 59].

The phylum Uroviricota is organized into a single class (Caudoviricetes) and order (Caudovirales), but a new order encompassing the crAss-like phages, a common and apparently most abundant group of phages in the human gut virome, is being proposed [60]. Our profiles recovered 141 phage genomes (dereplicated from 601 total genomes) that displayed phylogenetic relationship with the crAss-like phages and are the subject of a separate study [22]. The Caudovirales are presently organized into 9 families, but 3 of these (Myoviridae, Podoviridae, Siphoviridae) are expansive and demonstrably polyphyletic [61,62,63] and were thus not used for family-level taxonomic assignment although the remaining 6 families represent only a small fraction of the phages available in GenBank. The phylogenetic tree of TerL, a hallmark protein that is frequently used for phylogenetic reconstruction of Caudovirales phages and appears to be the best phylogenetic marker thanks to its ubiquity among phages and high level of sequence conservation [63, 64], reveals only 34 gut phages that belong to one of these 6 ICTV-accepted families (Fig. 1 and Additional file 3). The remainder of these unclassified phages are likely to found new families presently composed entirely of uncultured phages or belong to families with a cultured representative that have yet to be defined under the new multi-rank taxonomy of viruses.

Fig. 1

Three candidate families of Caudovirales phages discovered in human gut metagenomes. a Phylogenetic tree of the large terminase subunit encoded by Caudovirales phage genomes in GenBank (n = 3931) and in gut metagenomes (n = 1298). Branches are colored according to the current ICTV families, except for the Myoviridae, Podoviridae, or Siphoviridae, which are in orange. The outermost ring indicates the location of candidate families proposed in this study: 1, “Quimbyviridae” phages; 2, “Flandersviridae” phages; 3, “Gratiaviridae” phages (see main text). b Gene-sharing network of the Uroviricota phages. Phage genomes identified in human gut metagenomes (blue nodes) were compared to phages in the GenBank database (colored as in panel a, with the addition of the crAss-like phages in brown and the new Caudovirales families proposed in this study in black). c, d Abundance of phages across human gut viromes. The x-axis depicts the fractional abundance of a given phage averaged across all viromes (n = 1241); the y-axis is the fraction of viromes from which a given phage recruits at least one read. Each phage genome (n = 7888 total) is colored at the taxonomic level of order (c) or family (Uroviricota families only) (d)

Selection of candidate families for comparative genomics

The taxonomic analysis based on phage hallmark proteins demonstrates that few phages in the human gut belong to a currently accepted ICTV-family. To prioritize candidate families for in-depth analysis, we next complemented the hallmark gene-based taxonomic analysis with whole-genome comparisons and abundance calculations of each phage relative to GenBank phages.

A gene-sharing network was constructed with the phages recovered from metagenomes and those deposited in GenBank. Edges are drawn between two viral genomes, represented as nodes, based on the number of ORFs that share significant sequence similarity [39]. Most of the metagenome-recovered phages bore multiple connections within the network to GenBank phages, in agreement with the manual curation of these contigs as genuine phage genomes (Fig. 1b). However, two large groups of phages (tentatively labelled “Flandersviridae” and “Gratiaviridae”) were weakly connected to the larger network, reflecting disparate genome content. The divergence of the gene content of these phages from those of previously known phages and their distinct position in phylogenetic trees (Fig. 1a and see below) indicate that they represent novel genera, and likely, new families.

To quantify the fractional abundance [1] of each phage in the human gut viral community, reads from a collection of 1241 human gut viromes were competitively mapped against a database containing the metagenome-recovered and GenBank phages. The majority of the genomes recruit at least one read (“detection”) from no more than 2% of the viromes (Q1–Q3, 0–2% of viromes) (Fig. 1c-d), consistent with the previously reported individuality of the human gut viromes [2, 7, 11]. A notable exception are the crAss-like phages [65], which are detected in a substantially larger fraction of the viromes (Q1–Q3, 9–28%), in agreement with previous reports of their cosmopolitan distribution [66, 67]. One uncharacterized Caudovirales genome was frequently observed in the collection of human gut viromes (54%, Fig. 1c), suggesting that this phage is also cosmopolitan. To rule out the possibility that the observed frequency stemmed from non-specific read mapping to one or a few loci, rather than the complete genome, the coverage of sequencing reads across the genome (accession OMAC01000147.1) was examined. The broad coverage of this genome in the viromes confirms that its frequent detection is not an artifact, although several loci present in the reference sequence were absent in the viromes (Additional file 7). The exceptional detection of this uncharacterized phage (hereafter referred to as Quimbyvirus, after the character Mayor Quimby from the Simpsons) in the human gut viral community warrants its detailed examination.
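Two quantities used in this paragraph, the detection frequency of a phage across viromes and the breadth of read coverage along its genome, can be sketched in Python as follows. The function names and input formats (a list of per-virome read counts, and half-open alignment intervals) are illustrative assumptions and do not describe the scripts used in this study.

```python
def detection_frequency(reads_per_virome):
    """Fraction of viromes in which a phage recruits at least one read."""
    if not reads_per_virome:
        return 0.0
    detected = sum(1 for n in reads_per_virome if n >= 1)
    return detected / len(reads_per_virome)


def coverage_breadth(intervals, genome_length):
    """Fraction of genome positions covered by at least one read.

    intervals are half-open (start, end) alignment coordinates.
    """
    covered = 0
    current_end = 0
    for start, end in sorted(intervals):
        start = max(start, current_end)  # skip positions already counted
        if end > start:
            covered += end - start
            current_end = end
    return covered / genome_length
```

A phage detected in many viromes but with low coverage breadth would point to non-specific mapping at a few loci, which is the artifact the coverage inspection is meant to exclude.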

Thus, three groups of phages were selected for in-depth analysis based on their distinct positions in the phylogenetic trees of the marker genes (all three groups), combined with divergent gene contents (“Flandersviridae” and “Gratiaviridae”) and high abundance in the human gut viral community (“Flandersviridae” and “Quimbyviridae”). A comparative genomic analysis of each candidate family is presented below, case-by-case.

“Quimbyviridae” phages are abundant, hypervariable phages infecting Bacteroides

In the TerL phylogenetic tree, Quimbyvirus belongs to a group of phages whose closest characterized relatives include the Vequintavirinae and Ounavirinae subfamilies, under the now defunct Myoviridae family. To elucidate the taxonomic affiliation of Quimbyvirus, genomes from adjacent branches were examined (Fig. 2). The median genome length of Quimby-like phages is 75.2 kb, close to the genome size of a branch basal to the Quimby-like branch, “group 4986” (72 kb), but smaller than the genomes of other phages in adjacent branches, Ounavirinae (88 kb) and Vequintavirinae (145 kb). Despite the similarity in genome size, phylogenetic reconstructions of the portal protein and MCP separate the Quimby-like phages from group 4986 (Additional file 8). Moreover, most Quimby-like phages encode a DnaG-family primase and DnaB-family helicase that are both absent in group 4986. However, in one branch of Quimby-like phages, the primase was lost from the replication module. The genomes of this branch encode a protein adjacent to the DnaB-family helicase with significant structural similarity to the winged helix-turn-helix domain of RepA (HHpred probability, 96.5) (Fig. 2). RepA-family proteins mediate replication of plasmids by interacting with host DnaG primases [68], suggesting that the RepA-like protein coopts the host primase during replication, triggering the loss of the phage-encoded dnaG in this lineage. Consistent with a RepA-mediated episomal replication strategy, no integrase is identifiable in the genomes on this branch, yet the phages encode numerous antirepressors, proteins involved in the lysis-lysogeny decision of temperate phages [69, 70]. The rest of the Quimby-like phages harbor a full-length, three-domain tyrosine integrase, indicating that these phages integrate into their host cell genome (Fig. 2).
Based on the topologies of the TerL, portal, MCP, and DnaG trees, we propose that Quimby-like phages represent a novel taxonomic group at the family rank (henceforth, the “Quimbyviridae”). The potential differences in replication strategies (episomal vs. integrated) combined with the topologies of the phylogenetic trees of marker proteins suggest that “Quimbyviridae” splits into two distinct subfamilies.

Fig. 2

Phylogenetic tree of the large terminase subunit and genome maps of Quimby-like phages. a Individual genome maps of Quimby-like phages and ICTV classified phages are shown to the right of each branch. The ORFs are colored according to function: large terminase subunit (red), structural components (blue), DNA replication and repair (orange), lysogeny (pink), general function (green), and unknown (grey). b Expansion of four Quimby-like phages and a single gut phage genome from an adjacent branch (“group 4986”). The diversity-generating retroelement and hypervariable ORFs are highlighted with a dashed box and asterisk. The nucleotide scales differ between individual genome maps in both panels

Fig. 3

Phylogenetic tree of the large terminase subunit and complete genome maps for “Flandersviridae.” a Genome maps of members of the “Flandersviridae” and selected ICTV-classified phages were constructed and colored as in Fig. 2. b Genome maps of three genera from the “Flandersviridae” family. The dashed box highlights the insertion of licD- and ispD-family enzymes in the replication module of one “Flandersviridae” phage

The Quimbyvirus genome aligns with a cryptic prophage of the bacterium Bacteroides dorei (CP011531.1), with 95% nucleotide sequence identity across 92% of its length, indicating that B. dorei, a common constituent of human gut microbiomes [71], carries a prophage closely related to Quimbyvirus. Inspection of the alignment shows that Quimbyvirus site-specifically integrates into the tRNA-Asp gene of B. dorei, a typical site of prophage integration [72]. The hosts of the other “Quimbyviridae” phages, determined through CRISPR-spacer analysis, include the Prevotella, Bacteroides, and Parabacteroides genera within the phylum Bacteroidetes and the Lachnospiraceae within the phylum Firmicutes. In contrast, the hosts of group 4986 do not include any Bacteroidetes. The differences in the inferred host ranges support separating group 4986 from “Quimbyviridae” phages and suggest that group 4986 might represent a novel family, but these genomes were not investigated further.

Some of the “Quimbyviridae” phages harbor diversity-generating retroelements (DGRs), a cassette of genes that selectively mutate a short locus, known as the variable repeat, that is part of a C-type lectin or an immunoglobulin-like domain [73, 74]. Targeted mutation of these domains yields proteins with altered binding affinities and specificities [75]. The DGR cassette in Bordetella phage BPP-1 of the genus Rauchvirus is the only experimentally studied DGR system in a phage, where diversification of the C-type lectin domain-containing tail fiber gene enables adsorption to different host cell receptors [76]. In Quimbyvirus, the RT component of the DGR is encoded by overlapping ORFs in all three frames (ORFs 52-54), suggesting that the active RT is produced by two programmed frameshifts. Although overlapping ORFs and programmed frameshifts have been identified in many compact tailed phage genomes [77,78,79,80], DGR RTs have thus far only been predicted to be encoded by a single ORF. To discern if the frameshifts render the RT inactive, the variable repeats were examined for adenine-specific substitutions, a hallmark of DGR-mediated variation [74]. The two variable repeats reside in ORF 47 and 80 of the Quimbyvirus genome, which both encode proteins containing C-type lectin domains, the canonical target of DGRs [73] (Fig. 2). Alignment of the variable repeats with their cognate template repeats from nearly identical Quimbyvirus genomes (> 95% average nucleotide identity) allowed the detection of 22 adenine sites in the variable repeat exhibiting substitutions whereas all other bases were nearly perfectly conserved (Additional file 9). Collectively, these results suggest that the frameshifted RT possesses the selective infidelity that characterizes DGR-mediated hypervariation.
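The test for adenine-specific substitutions can be illustrated with a short Python sketch that compares an aligned template repeat with its cognate variable repeat and counts mismatches at template adenines separately from mismatches at all other template bases. The function name and the requirement of equal-length, gap-free input are simplifying assumptions.

```python
def classify_substitutions(template_repeat, variable_repeat):
    """Count mismatches at template adenines versus all other positions.

    Both repeats must be aligned without gaps and have equal length.
    Returns (mismatches_at_adenine, mismatches_elsewhere).
    """
    if len(template_repeat) != len(variable_repeat):
        raise ValueError("repeats must be aligned to the same length")
    at_adenine = 0
    elsewhere = 0
    for t, v in zip(template_repeat.upper(), variable_repeat.upper()):
        if t == v:
            continue
        if t == "A":
            at_adenine += 1
        else:
            elsewhere += 1
    return at_adenine, elsewhere
```

A DGR-like signature corresponds to many mismatches in the first count and few or none in the second, which is the pattern reported for the Quimbyvirus variable repeats.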

The first variable repeat resides in the C-terminus of ORF 51 that is located downstream of the tail fiber genes, suggesting that this gene codes for a structural component of the virion, similar to the hypervariable tail fiber of phage BPP-1 [76, 81]. The second DGR target locus is in ORF 84 that is distal to the phage structural gene module and is expressed from the opposite DNA strand, suggestive of a non-structural protein. The genomic neighborhood of ORF 80 includes genes coding for a nuclease, four methyltransferases and a tRNA ligase within 7 kb. The nuclease shows significant sequence and structural similarity to E. coli mutY (HHpred probability, 97.3, Additional file 10), a DNA glycosylase involved in base excision repair. The methyltransferases are most similar to adenine- and cytosine-modifying enzymes (HHpred probability 100 and 99.9, respectively, Additional file 10) that likely prevent cleavage by host restriction endonucleases. Similarly, the tRNA ligase might repair tRNAs cleaved by host anticodon endonucleases [82]. Overall, the adjacency of ORF 84 with defense- and counterdefense-related genes implies that this hypervariable phage protein plays a role in the phage-host conflicts; however, the exact functions of the DGR and hypervariable target proteins during the life cycle of “Quimbyviridae” phages remain to be investigated.

“Flandersviridae” phages are common and abundant in whole-community metagenomes

Analysis of the phylogenetic trees of TerL identified a deep branch of 29 gut phages (dereplicated from 196 total genomes) that joins the family Ackermannviridae (Fig. 3a). Annotation of the ORFs encoded by the 29 representative contigs demonstrated that the genomes are colinear, confirming that they belong to a cohesive group (Fig. 3b). The cohesiveness of this group was confirmed by the gene-sharing network, where these genomes form a coherent cluster that has few connections to the larger network (Fig. 1b), reflecting distant (if any) similarity between most of the proteins encoded by these phages and proteins of phages in GenBank. The median genome size of the phages in this group is 85.2 kb, compared to 157.7 kb among the Ackermannviridae phages. There is a conserved module of structural genes that encode the MCP, portal, sheath and baseplate proteins, TerL, and the virion maturation proteinase. The presence of a contractile tail sheath indicates that these viruses possess contractile tails similar to those in the family Ackermannviridae, in agreement with the TerL phylogeny. Several of the genes within the structural block contain immunoglobulin-like or C-type lectin domains (e.g., BACON and GH5, respectively), which are predicted to play a role in adhesion of the virion to bacterial cells or host-associated mucosal glycans [83,84,85,86]. Downstream of the structural block is a module of genes involved in DNA replication that includes a DnaB-family helicase, DnaG-family primase, and DNA polymerase I (PolA). The polA gene is widely distributed among dsDNA phages and therefore serves as a useful marker for delineating the diversity of phage replication modules [87]. Phylogenetic reconstruction of both polA and dnaG encoded by these phages confirmed their monophyly (Additional file 11). Following the replication module is an approximately 20 kb long locus containing ORFs that showed no detectable similarity to functionally characterized proteins. 
Two of the phages are matched by CRISPR spacers encoded by Bacteroides and Parabacteroides spp., indicating that these bacteria serve as hosts. Based on the large terminase and polA phylogenies, the colinearity of their genomes, and their differences from known phages in both genome size and content, we propose that these Bacteroides-infecting phages represent a novel taxonomic group at the family rank, hereafter “Flandersviridae” (after the region where some of the metagenomes were sampled).

Although all members of the “Flandersviridae” are syntenic, some contain an insertion of two adjacent genes encoding nucleotidyltransferase superfamily enzymes within the DNA replication module. One enzyme belongs to the ispD family that is involved in the biosynthesis of isoprenoids [88, 89], and the other is a licD family enzyme that is responsible for the addition of phosphorylcholine to teichoic acids present in bacterial cell walls [90] (Fig. 3). To our knowledge, neither of these enzymes has been reported in phages previously. Given that only some members of the “Flandersviridae” possess these genes, they are unlikely to perform essential functions in phage reproduction, and instead could be implicated in phage-host interactions. The licD family enzyme might modify teichoic acids to prevent superinfection by other phages, given that these polysaccharides serve as receptors for some phages to adsorb to the host cells [91]. The role of ispD is less clear because ispD family enzymes catalyze one step in the biosynthesis of isopentenyl pyrophosphate, a building block for a large variety of diverse isoprenoids [92]. Phages manipulate host metabolic networks including central carbon metabolism, nucleotide metabolism and translation [93]; the discovery of ispD present in the “Flandersviridae” phage genomes might add to this list the isoprenoid biosynthetic pathway.

Complete “Flandersviridae” phage genomes were recovered from 249 whole-community human gut metagenomic assemblies. Their frequent assembly into closed contigs suggests that these phages might persist in their host cells as extrachromosomal circular DNA molecules, similar to phage P1 [94]. However, neither genes involved in DNA partitioning nor lysis-lysogeny switches are readily identifiable in the “Flandersviridae” genomes. Thus, this group of phages might be obligately lytic, although discerning the lifestyle of a phage from the genome sequence alone is challenging [95]. Regardless of their lifestyle, the frequent recovery of these phages from whole-community metagenomes implies that they are common members of the human gut virome. Indeed, the “Flandersviridae” phages reach a detection frequency similar to that of the crAss-like phages (Fig. 1d), although there are fewer Flanders-like phages in the database. As with the “Quimbyviridae,” the even coverage of sequencing reads across one “Flandersviridae” genome (accession OLOC01000071.1) confirms that its detection is not artifactual (Additional file 12). The high fractional abundance and detection of Flanders-like phages in viromes generally agree with their frequent assembly from whole-community metagenomes, although they were not the most abundant (see the “Discussion” section). Overall, Flanders-like phages represent a previously undetected phage group that is widely distributed in human gut viromes.
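One simple way to put a number on "even coverage" is the coefficient of variation of windowed read depth. The sketch below is illustrative only: the metric, the window size, and the function names are assumptions and do not describe the analysis behind Additional file 12. The input is assumed to be a non-empty list of per-base depths, for example parsed from a depth table produced by a read mapper.

```python
from statistics import mean, pstdev


def windowed_mean_depth(depths, window=1000):
    """Mean read depth in consecutive, non-overlapping windows."""
    return [mean(depths[i:i + window]) for i in range(0, len(depths), window)]


def coverage_cv(depths, window=1000):
    """Coefficient of variation of windowed depth (0 means perfectly even).

    `depths` must be non-empty; a genome with zero mean depth returns inf.
    """
    windows = windowed_mean_depth(depths, window)
    mu = mean(windows)
    return pstdev(windows) / mu if mu else float("inf")


# Toy profiles (hypothetical): uniform depth versus reads piled on one region.
even_profile = [30] * 5000
spiky_profile = [0] * 4000 + [150] * 1000
print(coverage_cv(even_profile), coverage_cv(spiky_profile))
```

Under these assumptions, a genome whose reads pile up on a single conserved region yields a much larger value than one covered uniformly, which is the pattern expected when only a shared gene, rather than the whole phage, is present in a sample.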

“Gratiaviridae,” a putative novel family of phages infecting Bacteroides

A deeply branching cluster of 18 genomes (dereplicated from 45 total) is basal to the families Autographiviridae, Drexlerviridae, and Chaseviridae on the TerL phylogenetic tree (Fig. 4a). Although these contigs are not commonly present in gut viromes (Fig. 1d), their deep relationship with established phage families prompted an in-depth genome analysis of these putative phages. All 18 genomes encode a DnaG-family primase and a DnaE-family polymerase, and phylogenetic reconstruction for these genes demonstrates monophyly of these phages; the sole exception is the dnaE gene of bacteriophage phiST, a marine Cellulophaga-infecting phage that belongs to the polyphyletic, currently defunct Siphoviridae family [96] (Additional file 13). The dnaG and dnaE genes are nested within a module of other replication-associated genes that include superfamily I and II helicases, SbcCD exonucleases, and a RecA family ATPase (Fig. 4b). The structural module is composed of genes that encode an MCP, capsid maturation protease, portal protein, baseplate proteins, and a contractile tail sheath protein. Although these genomes are not strictly colinear, as observed for the “Flandersviridae” phages, the overall similarity of the proteins encoded by these phages is apparent in the gene-sharing network, where they form a coherent cluster that shares some edges with the crAss-like phages (Fig. 1b). As for the crAss-like phages, the hosts predicted from CRISPR-spacer matches belong to the Bacteroides and Parabacteroides genera (Additional file 4). Taken together, the phylogeny and genomic organization of these phages indicate that they represent a new family, provisionally named “Gratiaviridae” (after the pioneering phage biologist Dr. Andre Gratia).

Fig. 4

Phylogenetic tree of the large terminase subunit and genome maps of the “Gratiaviridae” phages. a Genome maps of ICTV-classified phages were constructed and colored as in Fig. 3. b Genome maps of four genera from the Gratiaviridae family. The dashed box highlights a HipA-family kinase domain-containing protein, AAA-family ATPase, and glycosyltransferase (see main text)

In addition to structural and replication proteins, “Gratiaviridae” phages encode several enzymes of the ferritin-like diiron-carboxylate superfamily. The ferritin-like enzymes encoded by these phages belong to two families, namely, DNA protecting proteins (DPS) and manganese-catalases. Manganese-catalases have not been documented in phage genomes, and DPS-like enzymes have only been observed in seven Lactobacillus-infecting phages [97]. Both enzymes are involved in the tolerance of anaerobes to oxidative stress. Catalases detoxify hydrogen peroxide to oxygen and water, enhancing survival of anaerobic Bacteroides in the presence of oxygen [98]. DPS enzymes catalyze a reaction between oxygen and free iron to yield insoluble iron oxide, lowering the concentration of both intracellular oxygen and free iron levels that would otherwise react with hydrogen peroxide and produce a hydroxyl radical, the most toxic reactive oxygen species [99, 100]. “Gratiaviridae” phages might deploy catalase- and DPS-like enzymes during infection to enhance the tolerance of their strictly anaerobic Bacteroides hosts to oxidative damage. Notably, these enzymes were not restricted to the “Gratiaviridae” but could be identified in 196 (manganese catalase) and 36 (DPS) other phage genomes, including the “Flandersviridae.” The frequent identification of these enzymes in gut phage genomes underscores the importance of intracellular iron and reactive oxygen species concentration for productive infections in an anaerobic environment.

Five of the “Gratiaviridae” phages encode a protein containing a serine/threonine protein kinase domain with distant but significant sequence similarity to HipA family kinases (HHpred probability 99, Additional file 10). Whereas HipA family kinases are present in numerous, phylogenetically distinct bacterial genomes as the toxin component of a distinct variety of type II toxin-antitoxin systems [101, 102], there are only two characterized examples of protein kinases encoded by phages. The protein kinase of T7-like phages phosphorylates RNA polymerase and RNAse III early during infection as part of the takeover of the host cell transcriptional and translational machinery [103,104,105]. In contrast, the protein kinase of E. coli phage 933W is expressed during lysogeny and mediates abortive infection upon superinfection of the host cell by phage HK97 [106]. The HipA-like kinase is unlikely to function early during infection like the kinase of T7-like phages because, in all five “Gratiaviridae” phages, the kinase is encoded between the portal protein and MCP genes, which are expressed late during infection in numerous cultured phages [107, 108]. Instead, the kinase might confer immunity to heterotypic phage infection, analogous to the kinase encoded by 933W [106]. In support of an immunity-related role, an AAA-family ATPase and a glycosyltransferase are encoded immediately upstream of the kinase in all five phage genomes (Fig. 4). Glycosyltransferases are encoded within capsular polysaccharide biosynthetic loci [109] and phase variation of the capsular polysaccharides confers immunity from phages that rely on these molecules for adsorption [110]. The specific roles of the HipA-family kinase, ATPase and glycosyltransferase are unknown, but collectively, these enzymes might modify host cell capsules, granting temporary immunity to heterotypic phage infection while the morphogenesis of “Gratiaviridae” progeny virions completes.

Discussion

A search of human gut metagenomes identified 3738 putative complete phage genomes. To recover complete phage genomes, this analysis restricted the search space to metagenomic contigs with direct terminal repeats, which are present at the termini of some phage genomes that consequently form circular assemblies [12]. Circular assemblies can also arise upon sequencing a concatemer of DNA present during phage DNA replication and packaging [12], in which case the direct repeats are a technical artifact and are the same length as the k-mer size used to assemble the contigs. Phages with different replication and DNA packaging strategies that lack direct repeats, such as members of the phyla Preplasmiviricota and Dividoviricota or Escherichia phage Mu [12], do not yield circular assemblies and thus were not detected here. As a result, the set of phage genomes recovered by this strategy is both biased and an underestimate. The results are also skewed towards smaller genomes that are more likely to assemble into a single contig, although, in one metagenome, a 294 kb phage genome was identified (Additional file 1). Despite these limitations, phylogenetic and comparative genomic analyses suggest that this set of contigs includes many previously unnoticed lineages of phages, some of them most likely at the family rank.
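The circularity criterion discussed above can be sketched as a search for an exact direct repeat shared by the two ends of a contig. This is a minimal illustration, not the procedure used in this study: the function names, the exact-match requirement, and the length bounds are assumptions chosen for clarity.

```python
def terminal_repeat_length(contig: str, min_len: int = 10, max_len: int = 1000) -> int:
    """Length of the longest prefix of `contig` that recurs exactly at its
    3' end, or 0 if no such repeat reaches `min_len`."""
    seq = contig.upper()
    # A direct terminal repeat cannot exceed half of the contig length.
    upper = min(max_len, len(seq) // 2)
    for length in range(upper, min_len - 1, -1):
        if seq[:length] == seq[-length:]:
            return length
    return 0


def repeat_matches_kmer_size(repeat_len: int, kmer_size: int) -> bool:
    """Flag repeats whose length equals the assembly k-mer size, the signature
    of the assembly-derived repeats described in the text."""
    return repeat_len > 0 and repeat_len == kmer_size


# Toy contig (hypothetical): a 12-nt repeat flanking a short unique region.
toy_contig = "ACGTACGTTGCA" + "GGGGCCCCAATT" + "ACGTACGTTGCA"
repeat = terminal_repeat_length(toy_contig)
print(repeat, repeat_matches_kmer_size(repeat, kmer_size=12))
```

Contigs that return 0 here would be excluded by a repeat-based filter even if they represent complete genomes, which is the source of the bias noted in the text.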

The family-rank phage lineages proposed here were defined using a combination of approaches. Principally, phylogenetic analysis of the large terminase subunit (TerL), a hallmark gene of the Uroviricota phylum, revealed branches of genomes distinct from any of those reported in the GenBank database (Fig. 1a). Supporting the phylogenetic results, the genomes of phages on adjacent branches were of similar length and largely syntenic, whereas those on distant branches possessed entirely different architectures (Figs. 2, 3, and 4).

In two of the three proposed families (“Flandersviridae” and “Gratiaviridae”), the genomes are largely disconnected from other phages in the gene-sharing network. The phages in the family “Quimbyviridae” share genes with numerous phylogenetically distinct phages (Fig. 1c), although not enough to warrant automated assignment into the same “viral cluster” (Additional file 2) [39]. Phages that infect phylogenetically related hosts share genes more frequently with one another than they do with phages of distantly related hosts [111, 112]. The proximity of the “Quimbyviridae” to other phages in the gene-sharing network most likely reflects a similar preference for Bacteroides hosts (Additional file 4). Phylogenetic reconstruction of hallmark genes helped to delineate the “Quimbyviridae” as a distinct group, which was otherwise obscured by the numerous connections in the gene-sharing network. Overall, a combination of phylogenetic analysis of hallmark genes and gene-sharing analysis will facilitate the taxonomic classification of the gut viral community into higher levels of organization.
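The idea of a gene-sharing network can be conveyed with a deliberately simplified sketch in which each genome is reduced to the set of protein clusters it encodes and edges are drawn between genomes whose sets overlap sufficiently. The Jaccard index and the 0.2 threshold used here are stand-ins chosen for illustration; they are not the scoring scheme of the network tool used in this study [39], and all genome and cluster names are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Shared protein clusters divided by all clusters in either genome."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def gene_sharing_edges(
    genomes: Dict[str, Set[str]], min_score: float = 0.2
) -> List[Tuple[str, str, float]]:
    """Return (genome, genome, score) for every pair at or above `min_score`."""
    names = sorted(genomes)
    edges = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            score = jaccard(genomes[first], genomes[second])
            if score >= min_score:
                edges.append((first, second, score))
    return edges


# Toy input (hypothetical): two related genomes and one unrelated genome.
toy_genomes = {
    "phage_A": {"PC1", "PC2", "PC3", "PC4"},
    "phage_B": {"PC1", "PC2", "PC3", "PC5"},
    "phage_C": {"PC9", "PC10"},
}
print(gene_sharing_edges(toy_genomes))
```

In such a representation, a group like the “Flandersviridae” would appear as a tightly connected cluster with few edges to the rest of the network, whereas a group that shares genes with many other phages would be embedded among them and harder to delineate without hallmark-gene phylogeny.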

Two groups of phages were selected for in-depth analysis based on their frequent recovery in metagenomes and viromes. Complete genomes of “Flandersviridae” and “Quimbyviridae” phages were assembled in 249 and 20 whole-community metagenomes, respectively. Yet, Quimbyvirus was more frequently detected in the viromes than any “Flandersviridae” phage (Fig. 1d). The discrepancy can be attributed to several factors, including sampling bias, the greater number of “Flandersviridae” genomes in the reference database “diluting” the number of mapped reads per genome, or the presence of variable loci (e.g., the variable repeats of DGRs) that break contig assemblies [113]. Regardless, both groups encompass abundant members of the human gut virome. Predictably, the hosts of these phages include Bacteroides spp., which are some of the most dominant bacterial taxa of the human gut [114] and serve as hosts for other common human gut phages [67, 115]. Much of the uncharacterized “dark matter” in these phage genomes is likely to be dedicated to preventing superinfection of the Bacteroides host cells by such phages and to countering the host defenses. Although defense systems in Bacteroidetes remain poorly characterized in general, most of the bacteria possess active CRISPR-Cas systems, and numerous CRISPR spacers targeting the phages analyzed here were detected using stringent thresholds with a low estimated false discovery rate (0.06) [45]. This implies that many, if not most, of the phages infecting Bacteroidetes would encode Acrs. However, the currently available prediction method, which was trained on the sequences of previously identified Acrs, detected putative Acrs in only a small minority of these phages. The remaining phages of Bacteroidetes might encode distinct Acrs or employ alternative anti-CRISPR strategies.

Several phage genera possess DGRs, including Quimbyvirus. Metagenomic surveys have shown that DGRs are enriched in the viruses that inhabit gastrointestinal environments [113, 116]. Combined with the induction of DGR-carrying phages from human gut bacteria [117, 118], these observations reflect a prominent role of hypervariability in phage-host interactions in the gastrointestinal environment. Notably, Quimbyvirus and another DGR-carrying phage (Hankyphage, BK010646.1) lysogenize the same Bacteroides species, and both phages are frequently detected in human gut viromes [118]. These commonalities aside, the Quimbyvirus DGR RT is encoded by three overlapping reading frames and targets two proteins, one in the structural module and one in a defense-related island. DGRs have been associated with putative defense and signaling systems in cyanobacterial and gammaproteobacterial genomes [119, 120], but beyond the presence of the C-type lectin fold, the hypervariable proteins possess few other recognizable domains, which obscures their precise roles.

The third group analyzed in this study, the “Gratiaviridae,” is not abundant but occupies a deep position on the TerL tree relative to the Autographiviridae, Chaseviridae, and Drexlerviridae families. Analysis of the “Gratiaviridae” genomes will facilitate the future organization of these families into higher taxonomic ranks, potentially at the order level. Furthermore, analysis of the “Gratiaviridae” genomes demonstrated the presence of catalase- and DPS-family enzymes that mediate cellular responses to oxidative stress [121]. Oxygen concentrations vary along the length of the gastrointestinal tract and are lower in the distal than in the proximal gut [122]. Oxygen also diffuses from tissues radially into the lumen [123], and in combination with other factors, these gradients affect the structure and composition of the gastrointestinal microbiota [124]. The acquisition of oxygen-detoxifying enzymes by the “Gratiaviridae” and other gut phages signals a need to supplement their host cell’s tolerance to oxidative damage during infection, which might be especially important for cells that reside near the tissue surface where oxygen exposure is higher.

A unique feature of some “Gratiaviridae” phages is a HipA-family protein kinase. The T7-like phages (within the Autographiviridae family) and Escherichia phage 933W (currently unclassified at the family level) encode PKC-family protein kinases that function during host cell takeover and abortive infection, respectively [103, 106]. A third, CotH-family protein kinase domain is occasionally observed in phage genomes, where it is fused to a hypervariable C-type lectin domain [73, 116], but these proteins are currently unstudied. The “Gratiaviridae” phages recruited a fourth family of protein kinases that, together with the phage-encoded glycosyltransferase, might modify the host cell envelope, contributing to the prevention of superinfection.

Conclusions

In summary, comparative genomic analysis of the phages described here, along with the complementary analysis of crAss-like phages [22], substantially increases the characterized diversity of phages, primarily, those infecting Bacteroidetes bacteria, which are major components of the human gut microbiome. These findings also expand the repertoire of phage gene functions, notably, by adding the isoprenoid metabolic pathway, catalase-like enzymes, HipA family protein kinases, and hypervariable genes implicated in defense. All of these open multiple directions for experimental study.