Introduction

The gut microbiota is a complex microbial ecosystem that plays crucial roles in pig health, nutrient metabolism, and productivity1,2,3,4. Intestinal virus infections are prevalent in the global pig industry, and the many diverse viruses that inhabit the pig gut can greatly affect the structure and function of the gut microbial ecosystem5. Controlling these diseases solely through vaccines or medication is difficult. Thus, understanding the pig gut virome is essential for improving pig health and enhancing production efficiency. However, despite the ubiquity of these viruses, knowledge of virome diversity in the pig gut ecosystem is limited, and most viral genomes are absent from existing genome databases. Furthermore, most virome databases are built from human gut metagenomic samples6,7,8, and a specialized virus database for pigs is lacking9. Therefore, a comprehensive database of the virome from the pig gut microbiota is a prerequisite for characterizing virus diversity, understanding host‒virus interactions, and resolving the functions of viral genomes.

Currently, there is a wealth of metagenomic data available that offers a unique opportunity to discover viral genomes10. Despite being generated through untargeted methods without virus particle enrichment, these datasets still contain a significant number of viral genomes. Several databases, such as GOV211, IMG/VR412, GVD7, MGV8, GPD6, and RVD13, have been established to facilitate the analysis of viromes from various environments using metagenomic data. Since the establishment of these databases, the number of publicly available pig gut microbiota datasets has rapidly increased9,14,15,16,17,18,19,20,21.

To make use of the existing resources and provide a global view of the pig gut virome, we developed metav, a comprehensive viral analysis pipeline, and used it to examine 4650 metagenomic samples. We optimized several software tools to improve processing speed and end-to-end output. The resulting Pig Virome Database (PVD) of the gut contains 5,566,804 viral contig sequences estimated to be >50% complete, representing 48,299 vOTU genomes. Our analysis revealed a diverse and complex pig gut virome, with a high proportion of vOTUs (92.83%) that were unique compared to other databases. Moreover, we identified several potential novel viral species in the pig gut. These findings enhance our understanding of the pig gut virome and provide insights into the complexity of gut ecosystems, emphasizing the importance of further research in this field.

Results

The DNA viruses from the pig gut microbiome

In this study, our objective was to create a comprehensive pig virome database (PVD) of the gut using next-generation sequencing of metagenomic samples and metav, a virus detection pipeline that integrates established methods and signatures such as VirSorter2 and DeepVirFinder for viral sequence identification22,23. Our findings indicated that the choice of assembler, specifically Megahit versus metaSPAdes24,25, had a minimal effect on the recovery of viruses, which was consistent with prior research8. We used the Megahit assembler and, after testing different k-mer lists, selected --k-list 39,59,79 based on computational efficiency. We applied metav to 4650 fecal and gut content samples from various regions worldwide, including the USA, China, Europe, and Australia, and identified 19,577,760 unique, single-contig viral genomes exceeding 1.5 kb in length, which allowed us to comprehensively capture DNA viruses in the pig gut (Fig. 1a and Supplementary Fig. 1). By clustering viral contigs at 95% average nucleotide identity (ANI) and 85% alignment fraction (AF), we identified 3,219,413 viral operational taxonomic units (vOTUs) at the species level (Fig. 1a). In addition, we evaluated vOTU genome quality using CheckV and found 48,299 vOTU genomes (corresponding to 5,566,804 viral contig sequences) that were at least 50% complete26, including 12,515 complete vOTU genomes (corresponding to 2,099,394 viral contig sequences) (Fig. 1b, c, Supplementary Table 1, and Supplementary Fig. 2). Notably, although only 1.50% of the total vOTU genomes were more than 50% complete, they accounted for 28.43% of all viral contig sequences (5,566,804 of 19,577,760) (Fig. 1c). Moreover, most vOTU genomes with completeness exceeding 50% were longer than 10 kb, indicating their potential to provide more reliable results8,27.

Fig. 1: Summary of viral genomes recovered from pig gut metagenomes.

a Overview of the PVD database viral discovery pipeline. b Proportion of vOTU genomes based on quality tiers. c Distribution of density and number for different vOTU qualities: complete (n = 12,515), high quality (n = 9600), medium quality (n = 26,184), low quality (n = 1,726,336), and not determined (n = 1,444,778).
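The proportions quoted for the quality tiers can be recomputed directly from the counts in the Fig. 1c caption and the contig totals given in the text. The short Python sketch below does only that arithmetic; the variable names are illustrative and are not part of metav or CheckV.

```python
# Quality-tier counts as reported in the Fig. 1c caption.
tier_counts = {
    "complete": 12_515,
    "high quality": 9_600,
    "medium quality": 26_184,
    "low quality": 1_726_336,
    "not determined": 1_444_778,
}

# Contig totals as reported in the Results text.
total_contigs = 19_577_760
contigs_in_medium_plus = 5_566_804

# All vOTUs, and those in the three tiers above 50% completeness.
total_votus = sum(tier_counts.values())
medium_plus_votus = sum(
    tier_counts[tier] for tier in ("complete", "high quality", "medium quality")
)

votu_share = 100 * medium_plus_votus / total_votus
contig_share = 100 * contigs_in_medium_plus / total_contigs

print(f"total vOTUs: {total_votus:,}")
print(f"vOTUs >50% complete: {medium_plus_votus:,} ({votu_share:.2f}%)")
print(f"share of all viral contigs: {contig_share:.2f}%")
```

The three upper tiers add up to the 48,299 vOTU genomes used in the downstream analyses.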

Newly established and expanded pig gut viral diversity

To ensure comparability and consistency with established quality standards for genomic comparisons28,29, we focused on vOTU genomes whose completeness exceeded 50%. We used CheckV on the IMG/VR4, MGV, GVD, and GPD datasets to remove low-quality sequences, retaining vOTU genome fragments with more than 50% completeness. Following previous studies6,8, we combined all vOTU genomes with >50% completeness from these four datasets with those from the PVD database and clustered them at the species level using a query coverage of 90% and an identity of 70%. Our results showed that the PVD database significantly enhanced the diversity of viral genomes from pig gut microbiota metagenomic samples. Of the 37,650 species-level vOTUs in the PVD database, 34,952 (92.83%) did not cluster with any vOTU genomes from the other datasets (Fig. 2). The generalist IMG/VR4 database contains many more viruses because it does not contain exclusively mammalian gut samples. Of the 1,504,202 vOTU genomes from the IMG/VR4, 1,455,627 (96.77%) were unique compared to those from other databases (Fig. 2). However, the human gut metagenome samples used to construct the GVD, MGV, and GPD databases represented a greater proportion of viral genomes in the IMG/VR4 database than pig gut viruses. Specifically, GVD, MGV, and GPD had 4712 (58.79%), 43,464 (95.19%), and 25,377 (53.56%) sequences from human gut metagenomes, respectively, while GPD, GVD, and MGV had 20,695 (43.68%), 3247 (40.51%), and 1910 (4.39%) unique sequences compared to the other databases, respectively (Fig. 2). Our findings highlighted the importance of using comprehensive and high-quality databases to improve our understanding of viral diversity in different environments. The improved detection of viral reads in whole metagenomes and expanded coverage of virus‒host connections in the PVD database make it particularly useful for studying pig gut viruses.

Fig. 2: Clustering and comparison with four existing datasets of vOTUs that were above medium-quality.

The vOTU genomes from the PVD (n = 37,650) catalog were compared with other virus genomes with >50% completeness from four databases: IMG/VR4 (n = 1,504,202), MGV (n = 43,464), GPD (n = 47,383), and GVD (n = 8015) at the species level with the parameters query coverage 90, and identity 70.

Taxonomic annotation

We utilized the new ICTV classification, which abolished the paraphyletic morphological families Podoviridae, Siphoviridae, and Myoviridae and the order Caudovirales and replaced them with the class Caudoviricetes to categorize all tailed bacterial and archaeal viruses featuring icosahedral capsids and a double-stranded DNA genome30. After using the new ICTV classification, the majority of vOTUs (65.36%) were identified as Caudoviricetes (Fig. 3a and Supplementary Table 2). However, only 10.97% of the medium-quality vOTU genomes could be annotated at the family level using the new ICTV database (Fig. 3b, c). These findings suggested significant gaps in knowledge regarding pig gut virus taxonomy, which is also a significant challenge for analyzing viromes from different environments8,31,32.

Fig. 3: The taxonomy and composition of vOTUs with >50% completeness.

a Composition of vOTUs at the class level. b Composition of vOTUs at the family level. c Circle chart that depicts the vOTU taxonomy at both the class and family levels. The names of the classes are shown in red font, and those of families are shown in black font.

Host prediction and temperate identification

Accurately predicting virus hosts is a critical step towards understanding virus-host interactions and utilizing them to manipulate the gut microbiota ecosystem or design innovative phage tools33,34. In this study, we used the iPHoP tool, which integrates multiple methods, to reliably predict the host taxonomy of medium-quality vOTUs at the genus level35. As expected, Firmicutes and Bacteroidetes, the most abundant bacterial phyla in the gut microbiome, were found to be common hosts for viruses at the phylum level (Fig. 4a). At the genus level, Prevotella and Bacteroides (belonging to Bacteroidetes) and Vescimonas and Faecousia (belonging to Firmicutes) were the most commonly assigned hosts (Fig. 4b). Additionally, some viruses were predicted to infect Lactobacillus, Ruminococcus, Faecalibacterium, and Blautia, which are known to play important roles in pig productivity and feed efficiency36,37.

Fig. 4: Host prediction and temperate identification of vOTUs.

a Number of vOTUs with >50% completeness for each predicted host at the phylum level. b Number of vOTUs with >50% completeness for the top 30 predicted hosts at the genus level. c The ratio of virulent and temperate phages for total vOTUs with >50% completeness identified by PhaTYP and BACPHLIP. d The ratio of virulent and temperate phages for the top 30 vOTUs with >50% completeness identified by BACPHLIP at the genus level. The letters “S”, “P”, “F”, “F_C”, “F_A”, and “B” before the italicized genus names represent the Spirochaetota, Pseudomonadota, Firmicutes, Firmicutes_C, Firmicutes_A, and Bacteroidetes phyla, respectively.

A previous study has shown that using incomplete viral contigs for lifestyle prediction can lead to an overestimation of the number of virulent viruses8. Thus, we used only vOTU genomes of at least high quality (>90% completeness) to classify phages as virulent or temperate with two software programs, BACPHLIP and PhaTYP (Fig. 4c). BACPHLIP predicted the majority (64%) of these vOTU genomes to be virulent (Fig. 4c). PhaTYP, a Python library for bacteriophage lifestyle prediction with a Bidirectional Encoder Representations from Transformers (BERT)-based model, was used to confirm the lifestyles of the same set of high-quality vOTUs (>90% completeness, n = 22,115). The results showed that the temperate, virulent, and unknown categories accounted for 62%, 34%, and 4%, respectively, of all predictions. However, other studies have reported that virulent viruses accounted for 65–91% of the total viruses in the human gut environment, which might be due to the inclusion of low-quality vOTUs8,38. These discrepancies suggest that future research should use high-quality vOTUs to predict viral lifestyles more reliably.

Phylogenomic analyses of viruses

Viruses that infect vertebrates can cause serious diseases in pigs, leading to significant economic losses for the swine industry. For example, porcine circovirus can cause porcine circovirus-associated disease (PCVAD)39,40. In our metagenomic samples, we did not detect Asfarviridae, a large encapsulated double-stranded DNA virus41, possibly because the samples were collected from healthy or noninfected pigs. To explore the diversity of pig gut viruses, we constructed a phylogenetic tree based on 265 genomes with >50% completeness from the PVD (Fig. 5). The majority of viruses from the PVD represented new lineages across the tree. Smacoviridae and Circoviridae were the most diverse families found, primarily due to the broad phylogenetic distribution of the vOTUs belonging to these groups, which are circular replication-associated protein (Rep)-encoding single-stranded DNA viruses42,43,44. Moreover, we found porcine circovirus-like viruses that have been associated with porcine diarrheal disease43. Additionally, we found 206 viral genomes with >50% completeness in which the host was dominated by Methanobacteriaceae and Methanomethylophilaceae, consistent with the archaeal virome found in the human gut (Supplementary Fig. 3)45. Compared with recently published collections of viruses from diverse environments and the human gut8,46, our analysis of the PVD resulted in a substantial expansion of pig gut viral diversity.

Fig. 5: Phylogenetic tree of vOTUs with vertebrates as hosts.

A phylogenetic tree was constructed from 265 viral genomes derived from the PVD. The tree was plotted using iTOL. Branch color indicates whether a lineage is represented by previously published databases (IMG/VR4, MGV, GPD, and GVD) (black) or is unique to the PVD database (green). The different colors in the middle ring represent classifications at the family level. The outer ring displays the length (kb) for each vOTU.

Functional capacity of the gut virome

To investigate the potential roles of the pig gut virome, we identified 2,362,631 protein-coding genes across the 48,299 above-mentioned vOTU genomes of at least medium quality. To explore their functional potential, we used eggNOG-mapper v.2 for comparison against eggNOG 5.0 using default parameters6,47. Overall, 60% (1,416,705/2,362,631) of the vOTU genes matched the database (Fig. 6a). However, 73% of these genes were not assigned any biological function, highlighting our limited understanding of the pig gut virome. Among the matched genes, the largest protein clusters were single-stranded DNA-binding proteins (n = 35,964), followed by proteins involved in other typical viral functions, such as DNA polymerase and replicative DNA helicase (Fig. 6a). With respect to potential metabolic functions, we detected a total of 53,090 genes, with the top five categories being nucleotide metabolism (n = 11,016), amino acid metabolism (n = 9492), carbohydrate metabolism (n = 9036), metabolism of cofactors and vitamins (n = 6652), and energy metabolism (n = 4533) (Supplementary Fig. 4).

Fig. 6: Functional landscape of vOTUs that were above medium quality.

a Functional annotations for the 30 largest protein clusters. b Protein-coding viral genes were identified against a structured antibiotic-resistance gene (SARG) database, mobile genetic elements (MGEs) database, virulence factor database (VFDB), quorum sensing databases, and BacMet (Antibacterial Biocide & Metal Resistance Genes) with BLASTp using Diamond with the following specific parameters: --evalue 1e-5 --query-cover 85 --id 80. c The gene structure of one vOTU containing ARGs and MGEs. The red font represents MGEs, while the blue font represents ARGs.

We then compared these genes to the SARG, MGEs, VFDB, quorum sensing, and BacMet databases using Diamond’s BLASTp mode, with parameters set to --evalue 1e-5, --query-cover 85, and --id 80 (Fig. 6b and Supplementary Tables 3–7). When checking the SARG database for antibiotic resistance genes (ARGs), we identified only 69 protein-coding genes in 55 viral contigs (55/48,299 = 0.11%) (Supplementary Table 3). Among these, 10 contigs carried two resistance genes and 2 contigs carried three (Supplementary Table 3). The top two categories were macrolide-lincosamide-streptogramin (MLS) (n = 21) and multidrug (n = 20) (Fig. 6b). We observed that the query cover and identity parameters had a significant impact on the number of ARGs detected. Various studies have used different parameters to detect ARGs carried by viruses, often below the 80% identity threshold used in our study and recommended by Yin et al.48,49,50,51,52. This could lead to inaccuracies and overestimations in the reported number of ARGs49,50,52. For example, by setting --query-cover 85 and --id 60 in BLASTp, we obtained 4564 hits, about 66 times more than with the previous setting. To compare results across different studies and establish more appropriate parameters for detecting virus-carried ARGs, comprehensive evaluations of parameters should be conducted in the future53,54. We utilized a previously established database to detect mobile genetic elements (MGEs) in pig gut vOTUs55. A total of 52 protein-coding genes were found, dominated by two categories: transposase (n = 21) and integrase (n = 15) (Fig. 6b and Supplementary Table 4), which are known to play important roles in the transposition of ARGs21,56. However, we found only one viral genome containing both ARGs (n = 2) and MGEs (n = 1) (Fig. 6c). Moreover, most ARG-positive vOTUs found in pig metagenomes were not active53. Taken together, our results suggest that the pig gut virome carries a small number of ARGs and might contribute minimally to the gut resistome.
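The sensitivity of the ARG count to the identity threshold can be illustrated with a small Python sketch. It counts distinct genes with at least one hit passing both thresholds; the hit tuples are invented toy data, and the function name and tuple layout are assumptions for illustration rather than Diamond output parsing. Only the final ratio uses figures from the text (4564 and 69).

```python
def count_genes_with_hits(hits, min_identity, min_query_cover):
    """Count distinct genes that have at least one hit passing both thresholds.

    hits: iterable of (gene_id, percent_identity, query_cover_percent).
    """
    return len(
        {
            gene_id
            for gene_id, identity, query_cover in hits
            if identity >= min_identity and query_cover >= min_query_cover
        }
    )


# Hypothetical toy hits, not taken from the study.
toy_hits = [
    ("g1", 92.0, 98.0),
    ("g2", 81.5, 90.0),
    ("g3", 64.0, 95.0),
    ("g4", 72.3, 88.0),
    ("g5", 85.0, 60.0),  # high identity but short alignment: fails query cover
    ("g3", 61.0, 99.0),  # second hit for the same gene, counted once
]

strict = count_genes_with_hits(toy_hits, min_identity=80, min_query_cover=85)
relaxed = count_genes_with_hits(toy_hits, min_identity=60, min_query_cover=85)
print(strict, relaxed)

# Fold change between the two settings reported in the text.
reported_fold_change = 4564 / 69
print(f"{reported_fold_change:.1f}")
```

Lowering only the identity cutoff doubles the toy count, which mirrors, at a much smaller scale, the roughly 66-fold difference reported above.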

We also explored the distribution patterns of virulence factor genes (VFGs) carried by the virome (Fig. 6b and Supplementary Table 5). The results showed that VFGs were dominated by adherence (n = 50) and immune modulation (n = 34), consistent with previous studies57. Notably, we detected a small number of BacMet resistance protein genes (n = 48) in the pig gut vOTUs (Fig. 6b and Supplementary Table 6). The majority of metal resistance genes in the pig gut virome were associated with arsenic (As) (n = 7) and copper (Cu) (n = 6). Cu is extensively used as a feed additive for pigs and has a significant impact on the dissemination of antibiotic resistance genes (ARGs) in manure and soil58. We also examined vOTU proteins against the Quorum Sensing of Human Gut Microbes (QSHGM) database59. We identified only 74 protein genes of pig gut vOTUs that utilized quorum sensing languages, with the top three languages being HAQs, AI-2, and CAI-1 (Fig. 6b and Supplementary Table 7).

Discussion

In the present study, we conducted a comprehensive analysis of pig gut viral genomes by mining publicly available metagenomic samples (n = 4650) using our integrated pipeline designated metav (https://github.com/mijiandui/metav). The software tools in this pipeline were optimized to improve the speed and accuracy of metagenomic analysis and enable comparisons between different microbial ecosystems. We identified 19,577,760 draft-quality viral genomes, representing an estimated 3,219,413 species-level vOTUs. After applying CheckV, the PVD database was established, which included 5,566,804 viral contigs with more than 50% completeness, forming 48,299 species-level vOTUs. This is comparable in size to the MGV (n = 54,118) and GPD (n = 46,480) human gut virome databases6,8. However, a pairwise clustering comparison between IMG/VR4, GPD, MGV, GVD, and PVD showed that PVD still contained 34,952 (92.83%) unique vOTUs, making it, to our knowledge, the largest resource of pig gut viral genomes. Although this study focused on DNA viruses, other studies have explored RNA viruses5,10,60,61,62,63. Thus, future investigations could utilize metatranscriptomic data from pig gut microbiota samples to investigate RNA viromes.

The ICTV has updated its taxonomy, which has led to significant changes compared to the previous version30. geNomad, which rapidly recovers viral genomes from metagenomic and metatranscriptomic data, is significantly faster than similar tools, can process large datasets, and accurately assigns identified viruses to the latest ICTV taxonomy release64. However, the taxonomy update has made it difficult to compare the taxonomic results of this study with those of previous studies. To address this issue, it would be useful, in the future, to integrate several large-scale viral genome databases and create a unified, updated, and standardized catalog that is compatible with the new ICTV taxonomy. It is also essential to establish a unified and standardized taxonomy classification pipeline or platform65.

The results of our study provided significant insights into the composition and function of the pig gut virome. Our analysis revealed a diverse and complex pig gut virome, consisting of numerous viral operational taxonomic units (vOTUs). In addition, we have identified several potentially novel viral species and found evidence of multiple viral species existing in the pig gut. These discoveries highlight the importance of continued research in this field, as they could have significant implications for our understanding of viral evolution, pathogenesis, and host‒pathogen interactions. Moreover, our findings suggest that the existence of multiple viral species is common in the pig gut, which has important implications for future studies investigating the impact of the pig gut virome on pig health and productivity and the potential for viral transmission to other animals and humans. However, this study did not identify certain viruses, such as the African swine fever double-stranded DNA virus, that have a significant impact on the pig industry. The potential reason is that the samples collected for this study were from healthy pigs. In the future, building comprehensive pig intestinal virome databases derived from pigs affected by different pathogens (including viruses), and conducting a comparative analysis with the viromes of healthy pigs, will provide essential data support for the prevention and control of corresponding pig diseases. Given the lessons learned from the COVID-19 pandemic, such as the potential for emerging infectious diseases to adversely affect human health, there is a need for increased awareness of zoonoses66,67. Therefore, we propose a “One Health” framework that emphasizes the importance of studying the “ecosystem-animal-livestock-human” pathogen system using metagenomic technology. This approach should include combining rich metadata, such as climate change data, to reveal the sources, hosts, transmission, and evolutionary mechanisms of known and unknown pathogens68.

In summary, our study lays the groundwork for further research into the pig gut virome. Our identification of new viral species and evidence of coexistence highlight the intricacies of gut ecosystems and emphasize the importance of further investigation in this area. Our findings provide the basis for a more comprehensive understanding of viral and microbial ecology in the pig gut and are expected to facilitate the monitoring and maintenance of pig health.

Methods

Assembly and viral identification in the gut contents and feces of pigs

We developed “metav”, a pipeline for identifying viruses from raw metagenomic reads generated via next-generation sequencing. The pipeline comprised three main steps: 1) quality control of the raw reads using fastp v.0.23.2 (with parameters --detect_adapter_for_pe and --dont_eval_duplication -w 16), followed by the removal of host (pig genome: Sscrofa11.1) contamination using BWA v.0.7.17 with parameters (-k 31 -p -S -K 200000000)69,70; 2) assembly of the data using Megahit v.1.2.9 (with --k-list 39,59,79)25; and 3) viral identification using VirSorter2 v.2.2.3 and DeepVirFinder v.2.022,23. Viral sequences were identified with the criteria Keep1 (viral_gene >0) and Keep2 (viral_gene = 0 AND (host_gene = 0 OR score >= 0.95 OR hallmark >2)) based on VirSorter2’s SOP71 and q-value >= 0.95 using DeepVirFinder. We collected metagenomic data from various sources, including SRP188615/PRJNA52640516,17, CNP000082414, PRJEB1175520, PRJNA78846217, PRJNA77506219, PRJEB2206218, PRJCA0096099, PRJEB4411872, and 70 samples collected in our laboratory. In total, 4650 samples were used in our study to extract the viral contigs. Each sample was processed using the above pipeline, and then all the viral contigs were combined for subsequent analysis. To ensure the reliability of the results, all subsequent analyses were conducted using vOTUs of at least medium quality (>50% completeness).
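The Keep1 and Keep2 criteria quoted above translate into a simple boolean rule. The following Python sketch encodes only that rule as written; the function name, the argument names, and the toy records are illustrative assumptions, and the sketch does not model the DeepVirFinder criterion or how the outputs of the two tools were combined in metav.

```python
def passes_virsorter2_rules(viral_gene, host_gene, score, hallmark):
    """Return True if a contig satisfies Keep1 or Keep2 as quoted in the text."""
    keep1 = viral_gene > 0
    keep2 = viral_gene == 0 and (host_gene == 0 or score >= 0.95 or hallmark > 2)
    return keep1 or keep2


# Hypothetical toy records: (contig id, viral_gene, host_gene, score, hallmark).
toy_contigs = [
    ("contigA", 3, 1, 0.60, 0),  # Keep1: at least one viral gene
    ("contigB", 0, 0, 0.50, 0),  # Keep2: no viral genes and no host genes
    ("contigC", 0, 2, 0.97, 0),  # Keep2: high score
    ("contigD", 0, 2, 0.80, 1),  # fails both criteria
    ("contigE", 0, 5, 0.70, 3),  # Keep2: more than two hallmark genes
]

kept = [
    contig_id
    for contig_id, viral_gene, host_gene, score, hallmark in toy_contigs
    if passes_virsorter2_rules(viral_gene, host_gene, score, hallmark)
]
print(kept)
```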

Viral contig clustering and quality control

We applied CheckV v.1.0.1 (database v.1.4) to assess the quality of all viral sequences26. All viral contigs were clustered into species-level vOTUs based on 95% average nucleotide identity (ANI) and 85% alignment fraction (AF) using pairwise ANI calculations from the CheckV repository’s rapid genome clustering supporting code. All-vs-all local alignments were performed with the BLAST+ package v.2.13.0, with the parameter -max_target_seqs 10000. Pairwise ANI values were calculated by combining local alignments between sequence pairs using the anicalc.py script. UCLUST-like clustering was carried out with the aniclust.py script using the parameters recommended by MIUVIG (95% ANI + 85% AF). The sequences of vOTUs resulting from clustering were extracted from the viral contigs, and their quality was re-evaluated using CheckV.
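The idea behind UCLUST-like centroid clustering at 95% ANI and 85% AF can be sketched in a few lines of Python. This is a simplified illustration written for this section, not the aniclust.py script itself: it assumes that pairwise ANI and the alignment fraction of the target sequence have already been computed, processes sequences from longest to shortest, and assigns each unassigned neighbour to the first centroid that covers it. The function name, input layout, and toy data are hypothetical.

```python
from collections import defaultdict


def greedy_centroid_clusters(lengths, hits, min_ani=95.0, min_af=85.0):
    """Greedy, length-ordered centroid clustering.

    lengths: dict mapping sequence id -> length in bp.
    hits: iterable of (query, target, ani, target_af), where target_af is the
          percentage of the target covered by alignments to the query.
    Returns a dict mapping each centroid id to its member ids (centroid first).
    """
    neighbours = defaultdict(list)
    for query, target, ani, target_af in hits:
        if query != target and ani >= min_ani and target_af >= min_af:
            neighbours[query].append(target)

    clusters = {}
    assigned = set()
    # Longest sequence first; ties broken by id so the result is deterministic.
    for seq_id in sorted(lengths, key=lambda s: (-lengths[s], s)):
        if seq_id in assigned:
            continue
        members = [seq_id]
        assigned.add(seq_id)
        for other in neighbours[seq_id]:
            if other not in assigned:
                members.append(other)
                assigned.add(other)
        clusters[seq_id] = members
    return clusters


# Hypothetical toy input.
toy_lengths = {"c1": 40_000, "c2": 38_000, "c3": 12_000, "c4": 9_000}
toy_hits = [
    ("c1", "c2", 97.2, 91.0),  # passes both thresholds
    ("c2", "c1", 97.2, 86.5),
    ("c1", "c3", 96.0, 60.0),  # ANI passes, AF too low
    ("c3", "c4", 99.0, 95.0),  # passes both thresholds
]

clusters = greedy_centroid_clusters(toy_lengths, toy_hits)
print(clusters)
```

In this toy input c3 is not merged with c1 even though the ANI exceeds 95%, because the alignment covers only 60% of c3.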

Viral taxonomy annotation

Viral taxonomy identification is challenging due to the lack of a specific marker gene for viral sequences and the presence of a large amount of ‘dark matter’ in various environments. A new ICTV taxonomy was released in 2022 and abolished the order Caudovirales and the families Myoviridae, Siphoviridae, and Podoviridae30. We applied geNomad v.1.2.064 to obtain the taxonomy of vOTUs above 50% completeness. The new ICTV taxonomy database (v.214) was used.

Functional annotation, host prediction, and lifestyle prediction

All protein-coding genes of vOTUs were predicted using prodigal-gv v.2.9.073. Genes were annotated based on Diamond searches against protein databases, including eggNOG 5.0 and VOGDB (http://vogdb.org)47, using eggNOG-mapper with default parameters74. The structured antibiotic resistance gene (SARG) database48, antibacterial biocide and metal resistance genes (BacMet) database75, mobile genetic elements (MGEs) database55, virulence factor database (VFDB)76, and quorum sensing database59 were searched with BLASTp using Diamond77 (v.2.0.15.153) with the specific parameters: --evalue 1e-5 --query-cover 85 --id 80. iPHoP v1.3.335 was used for the vOTU host prediction with the Aug_2023 release database and default parameters. We used BACPHLIP v.0.9.678 and PhaTYP79, the latter of which uses bidirectional encoder representations from transformers (BERT), to determine whether the vOTUs were likely to be temperate or virulent.

Comparison to other viral reference databases

In this study, the vOTUs from the PVD were compared against four reference databases: IMG/VR v.4.012, GVD v.2.07, MGV v.1.08, and GPD v.1.06. To improve computational efficiency and facilitate cross-database comparison, all sequences were filtered to species-level vOTUs with more than 50% completeness according to the CheckV results before clustering between different databases. The sequences were then combined and clustered using the supporting code in the CheckV repository with the following BLASTn-specific parameters: evalue 1e-5, max-target-seqs 10,000, query coverage 90, and identity 70. We extracted the viral clusters (VCs) spanning different databases: if sequences from two databases clustered together, the vOTU was considered shared between them; otherwise, the vOTUs were considered different. The comparison results were visualized using the UpSetR package80.
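The shared-versus-unique rule described above can be expressed compactly: a vOTU is unique to a database when every member of its cluster comes from that same database. The Python sketch below applies this rule to invented toy clusters; the function name, identifiers, and toy data are hypothetical. The final lines recompute the reported PVD proportion from the 34,952 unique vOTUs and the PVD total given in the Fig. 2 caption (n = 37,650).

```python
from collections import Counter


def unique_per_database(clusters, database_of):
    """Return {database: (unique_count, total_count)}.

    clusters: iterable of lists of vOTU ids that clustered together.
    database_of: dict mapping each vOTU id to its source database.
    A vOTU counts as unique when its cluster holds members of one database only.
    """
    totals = Counter(database_of.values())
    unique = Counter()
    for members in clusters:
        databases = {database_of[member] for member in members}
        if len(databases) == 1:
            unique[databases.pop()] += len(members)
    return {db: (unique[db], totals[db]) for db in totals}


# Hypothetical toy input.
toy_database_of = {
    "pvd_1": "PVD", "pvd_2": "PVD", "pvd_3": "PVD",
    "mgv_1": "MGV", "gpd_1": "GPD", "img_1": "IMG/VR4",
}
toy_clusters = [["pvd_1", "mgv_1"], ["pvd_2"], ["pvd_3"], ["gpd_1", "img_1"]]

summary = unique_per_database(toy_clusters, toy_database_of)
print(summary)

# Reported PVD figures: unique vOTUs over the total in the Fig. 2 caption.
pvd_unique_percent = 100 * 34_952 / 37_650
print(f"{pvd_unique_percent:.2f}%")
```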

Phylogenetic analyses

We generated phylogenetic trees for vOTUs with completeness above 50% and a taxonomic assignment whose predicted hosts were vertebrates or archaea. First, we performed multiple sequence alignment for each type of vOTU sequence using MAFFT v.7.515 with the “--auto” option81. Second, we inferred a concatenated nucleic acid sequence phylogeny from the multiple sequence alignment using FastTree v2.1.11 with the parameters “-nt -gtr”82. Finally, the tree was midpoint rooted and visualized using iTOL83.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.