Abstract
The human vagina harbours diverse microorganisms—bacteria, viruses and fungi—with profound implications for women’s health. Genome-level analysis of the vaginal microbiome across multiple kingdoms remains limited. Here we utilize metagenomic sequencing data and fungal cultivation to establish the Vaginal Microbial Genome Collection (VMGC), comprising 33,804 microbial genomes spanning 786 prokaryotic species, 11 fungal species and 4,263 viral operational taxonomic units. Notably, over 25% of prokaryotic species and 85% of viral operational taxonomic units remain uncultured. This collection significantly enriches genomic diversity, especially for prevalent vaginal pathogens such as BVAB1 (an uncultured bacterial vaginosis-associated bacterium) and Amygdalobacter spp. (BVAB2 and related species). Leveraging VMGC, we characterize functional traits of prokaryotes, notably Saccharofermentanales (an underexplored yet prevalent order), along with prokaryotic and eukaryotic viruses, offering insights into their niche adaptation and potential roles in the vagina. VMGC serves as a valuable resource for studying vaginal microbiota and its impact on vaginal health.
Similar content being viewed by others
Main
The human vagina is a diverse ecosystem hosting bacteria, viruses, fungi and other micro-eukaryotes which are crucial for maintaining women’s and fetal health1,2. In healthy women, Lactobacillus species dominate the vaginal microbiota, producing lactic acid to sustain an acidic environment and prevent the proliferation of harmful microorganisms3,4. Conversely, dysbiosis of the vaginal microbiota often leads to various health issues, including bacterial vaginosis characterized by the overgrowth of anaerobic bacteria such as Bifidobacterium vaginalis (formerly Gardnerella vaginalis), Fannyhessea vaginae and Prevotella species5,6.
Early research on the vaginal microbiota predominantly relied on culture-based or real-time PCR methods, posing inherent limitations in capturing the complete microbial diversity. The advent of next-generation sequencing, particularly amplicon sequencing targeting 16S ribosomal RNA gene or internal transcribed spacer (ITS) regions, has significantly advanced our understanding, uncovering uncultivated microorganisms associated with vaginal health, such as bacterial vaginosis-associated bacterium (BVAB) 1 to 37. However, while amplicon sequencing reveals taxonomic information, it falls short in exploring the functional characteristics and genome variations (for example, for lactobacilli8) of the microbial community. Studies using whole metagenomic sequencing have elucidated the compositional and functional diversity of the vagina microbiome9,10,11. Utilizing this technology, the Human Microbiome Project (HMP) has provided an overview of microbial and functional profiles across various body sites, including the vagina12. The study by Jie et al.11 has uncovered associations between the vaginal microbiota and life history, while the work of Fettweis et al. linked the vaginal microbiota to preterm birth9. A pivotal contribution to the reference database comes from the human vaginal non-redundant gene catalogue (VIRGO)10, which includes a total of 0.95 million genes compiled from metagenomic and bacterial genomic data. Although these findings confirm the presence of a diverse microbial community in the vagina, our knowledge of the full spectrum of microorganisms remains limited, primarily due to the lack of vaginal reference genomes.
In recent years, significant progress has been made in applying the metagenome-assembled genome (MAG) binning methodology to obtain a vast amount of genomic information from both cultivated and uncultivated microorganisms. This approach has shown efficacy in exploring microbiomes across various human body sites, including the gut13, oral cavity14 and skin15. In particular, Pasolli et al.16 have performed a comprehensive collection of over 150,000 MAGs representing the human microbiota. However, this collection encompasses merely 151 genomes from the vagina16. Consequently, there exists an urgent need to undertake metagenomic-based efforts aimed at establishing an extended reference genome database specifically tailored for the vaginal microbiota.
Herein, we have taken a comprehensive approach by integrating available metagenomic datasets and microbial genomes and engaging in in-house cultivation of vaginal fungi to construct the Vaginal Microbial Genome Collection (VMGC). The VMGC comprises 19,542 prokaryotic genomes, 38 fungal genomes and 14,224 viral genomes, spanning across 786 prokaryotic species, 11 fungal species and 4,263 species-level viral operational taxonomic units (vOTUs). Based on the VMGC, our analyses have revealed unexplored microbial species, genomic variations and functional characteristics within the human vaginal microbiome.
Results
Construction of the VMGC
The flowchart of the construction of VMGC is shown in Extended Data Fig. 1a. We collected a total of 4,472 publicly available vaginal metagenomic samples sourced from the human vagina, spanning 32 studies across the United States (n = 2,741 samples), France (749 samples), China (581 samples) and 11 other transcontinental countries (Supplementary Tables 1 and 2). Applying quality control, metagenomic assembly and both single- and multi-coverage metagenomic binning processes in each sample generated a total of 63,654 preliminary MAGs with a minimum length of 200 kbp. Subsequent dereplication and quality screening yielded 18,570 prokaryotic genomes with ≥50% completeness and <5% contamination (Supplementary Table 3) (see Methods for quality criteria). In addition, a search for eukaryotes from the preliminary MAGs identified 17 eukaryotic genomes with a minimum genome size of 4.46 Mbp, including 13 fungal genomes, 2 genomes of Trichomonas vaginalis and 2 genomes of Theileria parva (Supplementary Table 4).
We complemented our collection with publicly available prokaryotic and fungal genomes sourced from the National Center of Biotechnology Information (NCBI) RefSeq genome database and obtained 1,189 prokaryotic and 18 fungi genomes that were previously isolated from the female vagina. After quality filtering, we retained 972 high-quality prokaryotic and 17 fungal genomes for further analysis (Fig. 1a and Supplementary Tables 3 and 4). In addition, we performed fungal cultivation from the vaginal swabs of healthy women and obtained 8 cultured fungi included in our genome collection.
Among 19,542 prokaryotic genomes (that is, 18,570 MAGs and 972 isolated genomes), 10,127, 8,397 and 1,017 were classified as medium-quality, high-quality and near-complete genomes, respectively (Fig. 1b), following the criteria revised from the MIMAG (Minimum Information about a Metagenome-Assembled Genome) standard17 (Methods). The median N50 length and genome size of all prokaryotic genomes were 38.0 kbp (interquartile range (IQR) = 14.5–81.4 kbp) and 1.36 Mbp (IQR = 1.11–1.62 Mbp), respectively, with a downward trend from near-complete to medium-quality genomes (Fig. 1c,d). Of the 38 fungal genomes (that is, 13 MAGs and 25 isolated genomes), the medium completeness and contamination were 92.7% and 0.4%, respectively, and the medium genome size was 12.7 Mbp (IQR = 11.9–14.6 Mbp) (Fig. 1e).
Finally, to further expand the horizon of the viral community, we performed virus identification on the metagenome-assembled contigs using a combined feature- and homology-based approach, resulting in 14,224 viral sequences achieving at least 50% completeness (Fig. 1a and Supplementary Table 5). Based on the CheckV algorithm18, 10.3% of viral sequences were estimated as complete viral genomes, 29.3% were of high quality (completeness ≥90%) and 60.4% were of medium quality (Fig. 1f). In addition, 42.3% of viral sequences were recognized as integrated proviruses by CheckV, and the viral regions were extracted. Most viral genomes (98.9%) showed a low degree of genomic contamination (<10%) (Fig. 1g), and analysis of the host-to-virus gene ratio also revealed low microbial host contamination (Extended Data Fig. 1b). The genome size of viral sequences ranged from 5.0 kbp to 417.5 kbp, with a median length of 33.5 kbp (IQR = 24.0–41.1 kbp). High-quality viruses had a larger genome size than medium-quality viruses; however, the complete viruses seemed to have the smallest genomes (Fig. 1h).
Taxonomic landscape of vaginal prokaryotic species
We conducted a genome-wide clustering analysis on 19,542 prokaryotic genomes, resulting in 786 species-level genome bins (SGBs, referred to as ‘species’ thereafter) based on the 95% nucleotide similarity threshold19. Rarefaction analysis revealed that species richness did not reach saturation (Extended Data Fig. 2a), suggesting the existence of additional undiscovered species. However, these species are likely to be predominantly rare, as the number of species approaches a plateau when considering only those with at least two conspecific genomes. About 41.8% (n = 329) of the 786 species had one or more genomes previously isolated from the human vagina, while the remaining 58.1% (n = 457) species had only MAGs available (Fig. 2a,b). We compared the prokaryotic species to NCBI RefSeq isolated genomes to extend the search to species from other environments. This analysis identified an additional 249 species (31.7% of all prokaryotic species) showing high homology to the genomes previously cultured from other habitats such as the human gastrointestinal tract, oral cavity or natural environments, leaving 208 species (26.5%) currently uncultured (Fig. 2b and Supplementary Table 6).
Taxonomic analysis of vaginal prokaryotic species revealed the presence of 15 phyla, 18 classes, 43 orders, 87 families and 239 genera. Dominant phyla included Actinomycetota (consisting of 28.5% of all species), Bacillota_A (21.2%), Bacteroidota (16.3%) and Bacillota (16.0%), followed by Pseudomonadota (6.5%) and Bacillota_C (4.5%) (Fig. 2a and Extended Data Fig. 2b). At the lower taxonomic levels, some taxa such as Bacteroidales (15.0% of all species; mainly consisting of Prevotella spp.), Actinomycetales (15.0%; mainly Bifidobacterium and Actinomycetaceae spp.), Lactobacillales (12.6%; mainly Lactobacillus and Streptococcus spp.), Tissierellales (9.2%; mainly Anaerococcus and Peptoniphilus spp.), Coriobacteriales (6.4%; mainly Fannyhessea spp.) and Mycobacteriales (6.1%; mainly Corynebacterium spp.) were of the major clades. Unlike the human gut microbiota which consisted of many species belonging to the genus Bacteroides and the order Lachnospirales (for example, Clostridium, Ruminococcus and Eubacterium spp.)13, the vaginal microbiota had rare representatives of these, with only 0.6% (n = 5) of species being Bacteroides and 5.3% (n = 42) being Lachnospirales.
The uncultured species were widely distributed across different phyla, especially Patescibacteria, Campylobacterota, Spirochaetota and Bacillota_A (Fig. 2a). We further calculated the phylogenetic diversity based on phylogenetic trees for nine phyla with more than three uncultured species. This analysis showed that the uncultured species accounted for, on average, 52.8% (ranging from 34.3% to 83.0%) of phylogenetic diversity in the trees for these phyla, with the highest proportion of phylogenetic diversity in Patescibacteria and Campylobacterota (Fig. 2c). In particular, despite that there are an average of 25.9% of species in 4 dominated vaginal phyla being uncultured, they contributed an average of 51.8% of phylogenetic diversity.
Some uncultured species, including SGB010 (known as BVAB1, a typical uncultured bacterial vaginosis-associated bacterium7, consisting of 461 genomes), SGB013 (391 genomes) and SGB015 (374 genomes), contained a large number of genomes. Importantly, these genomes were assembled from vaginal metagenomes from various countries (Extended Data Fig. 2c), suggesting their widespread presence in the human population. Furthermore, we profiled the relative abundances of all VMGC prokaryotic species in metagenomic samples. Consistent with previous studies10,20, species that belonged to Lactobacillus (for example, Lactobacillus iners, Lactobacillus crispatus and Lactobacillus jensenii), Bifidobacterium (Bifidobacterium vaginale), Prevotella (Prevotella bivia and Prevotella timonensis) and Fannyhessea (F. vaginae) were the main components of the samples. Several unclassified bacteria including BVAB1 (SGB010), Bifidobacterium sp. (SGB042 and SGB058) and Prevotella sp. (SGB013) were also among the high-abundance species (Fig. 2d and Extended Data Fig. 2c).
Functional configuration of vaginal prokaryotic species
We annotated the functions of vaginal prokaryotic species. Approximately 67.4% of protein-coding genes could be accommodated in the Kyoto Encyclopedia of Genes and Genomes (KEGG) database21, mostly in metabolism and genetic information processing genes (Extended Data Fig. 3a). In addition, 5.2% of genes were annotated into the Carbohydrate-Active enZYmes (CAZy) database22, while 1.2% and 0.2% were identified as virulence factors and antibiotic resistance genes, respectively.
Our comprehensive vaginal genome collection may enable a high-resolution functional analysis of the vaginal microbiome. To demonstrate this, we focused on nine functional modules that had been reported to be associated with vaginal diseases or metabolic abnormalities6,23 (Supplementary Table 7). We found that Lactobacillales (mainly L. crispatus and L. iners) and Enterobacterales (primarily Escherichia coli) significantly contributed to gene abundance in biofilm formation synthesis, followed by several potential pathogens such as Pseudomonas aeruginosa and BVAB1 (Fig. 2e and Extended Data Fig. 3b). Genes involved in the biosynthesis of sialidase, an enzyme capable of disrupting the vaginal mucosal barrier23,24, were predominantly encoded by Actinomycetales (mainly Bifidobacterium piotii and B. vaginale) (Extended Data Fig. 3c). Lactobacillales (mainly L. iners) and Bifidobacterium spp. showed the highest gene abundances related to the synthesis of cytolysins (for example, inerolysin and vaginolysin) and haemolysin (Extended Data Fig. 3d); these toxins pose a high risk for inflammation and the disruption of the vaginal epithelial barrier25,26. Mycobacteriales and Enterobacterales showed the highest gene abundance in the synthesis of phospholipase C and urease, while Ureaplasma parvum and Pseudomonas spp. also encoded a high gene abundance for urease. Several biogenic amines, including cadaverine, N-acetylputrescine and trimethylamine, are major amine components of vaginal fluid and are implicated in vaginitis and the activation of proinflammatory responses27,28,29. For these, Enterobacterales (mainly E. coli) and Saccharofermentanales (mainly Amygdalobacter indicium and Mageeibacillus indolicus) encoded the major gene abundance responsible to produce cadaverine and trimethylamine, while various beneficial (for example, Lactobacillus gasseri and Lactobacillus johnsonii) and harmful species (for example, E. coli and P. bivia) harboured genes associated with N-acetylputrescine production (Extended Data Fig. 3e).
Similarly, we quantified the functions of short-chain fatty acids and antibiotic resistance genes synthetization in vaginal metagenomes. This analysis showed that Lactobacillales encoded the highest gene abundances associated with the biosynthesis of acetate and lactate; Actinomycetales was identified as a potential primary producer of succinate, while Veillonellales and Tissierellales played significant roles in producing propionate and butyrate, respectively (Fig. 2e). Besides, Lactobacillales, Actinomycetales, Pseudomonadales and Enterobacterales showed the highest abundances of genes for antibiotic resistance.
Description of Saccharofermentanales in the human vagina
Saccharofermentanales, a prevalent vaginal order seldom found in other parts of the human body, contained several bacterial vaginosis-associated species, including SGB009 (A. indicium, previously known as BVAB2), SGB034 (M. indolicus, known as BVAB3) and SGB080 (Amygdalobacter nucleatus, a BVAB2-like species), in VMGC (Fig. 3a). M. indolicus has been previously isolated and sequenced in a limited number (n = 3) of available genomes, while the genomes of A. indicium and A. nucleatus were first sequenced during the preparation of this study (each with one complete genome in NCBI)30. By contrast, our MAGs, specifically SGB009, SGB034 and SGB080, have yielded 12, 3 and 1 near-complete genomes, along with a substantial number of high-quality genomes (360, 98 and 13, respectively). Interestingly, we observed that the completeness of A. indicium and A. nucleatus might be underestimated in the current CheckM2 algorithm31, as their complete genomes remain at 94.1% estimated completeness, respectively (Fig. 3b). Similar assessment results were also revealed by other algorithms such as CheckM and BUSCO (based on 124 universal single-copy orthologs (USCOs) for prokaryotes32) (Extended Data Fig. 4a). To elucidate this, we analysed the 124 USCOs for each species and found that they individually lack 14 USCO genes (Fig. 3c and Extended Data Fig. 4b), indicating that their genomes may have lost certain key genes during evolution, leading to a decrease in their ‘completeness’. This phenomenon was also observed in SGB034 and other Saccharofermentanales species, with nine USCO genes seemingly absent in all Saccharofermentanales species, likely accompanying marked genome reduction in the human vagina (Extended Data Fig. 4c). In addition, five of the nine missing USCO genes were involved in de novo purine biosynthesis (Fig. 3c); the lack of this pathway has yet to be reported in other reproductive tract pathogens, such as Treponema pallidum, Chlamydia trachomatis and Mycoplasma pneumoniae33.
Next, we investigated the functional characterization of three prevalent Saccharofermentanales species. Nearly half (49.5%) of KEGG functional orthologs overlapped among three species, with some of these being rare in other vaginal species (Extended Data Fig. 5a). Among these, hexokinase (K00844) was only detected in Saccharofermentanales species and a few other taxa (six Veillonellales, three Bacteroidales and two Selenomonadales species), suggesting a high hexose (for example, glucose, mannose and fructose) utilization capacity for them. Conversely, some functions varied significantly between different Saccharofermentanales species. For example, SGB034 harboured more KEGG functional orthologs involved in the metabolism of thiamine, glycerophospholipid and glycerolipid, while the capacity for cadaverine production was only found in SGB009 and SGB080 (Extended Data Fig. 5b). Specifically, we found that the species SGB009 and SGB080 encoded Lsr-type autoinducer-2 (AI-2) receptors, which are virtually absent in other vaginal bacteria (Fig. 3d). As an important signalling molecule for interspecies communication, AI-2 plays a crucial role in regulating multiple bacterial behaviours (for example, survival, biofilm formation and virulence-related gene expression) via AI-2 receptors34,35. In the vaginal microbiome, AI-2 was catalytically synthesized by the LuxS gene, which is encoded by numerous bacteria including Bifidobacterium, Prevotella, Lactobacillus and Streptococcus species (Fig. 3d,e). Research indicated that bacteria possessing AI-2 delivery and internalization pathways can not only regulate the formation of their biofilms through AI-2 but also internalize exogenous AI-2 to disrupt the biofilm formation of other competing bacteria, even if they lack the LuxS gene36,37,38. This suggests that the species SGB009 and SGB080 can interfere with quorum sensing and the growth of other vaginal bacteria by internalizing AI-2, potentially giving them an advantage in competition.
Characteristics of vaginal fungal species
We clustered the 38 fungal genomes in VMGC into 11 species with a 95% genome similarity threshold (Supplementary Table 4). Taxonomic classification revealed nine known fungal species, whereas the other two species (all derived from vaginal fungal cultivation in this study) were not represented in the NCBI RefSeq database (Extended Data Fig. 6a). Two species, Clavispora lusitaniae and Pichia kudriavzevii, were represented only by MAGs without isolated genomes, and the presence of these fungi in the woman’s vagina had been previously reported39. In addition, we calculated the mapping rates and prevalence rates of the 11 fungal species in large-scale vaginal metagenome datasets. This analysis revealed that Candida albicans, Nakaseomyces glabratus and Saccharomyces cerevisiae were the most dominant fungi in the human vagina, followed by Candida parapsilosis, P. kudriavzevii and C. lusitaniae (Extended Data Fig. 6b). Besides, C. albicans, Aspergillus chevalieri, C. parapsilosis and S. cerevisiae were the most prevalent fungal species, each with prevalent rates exceeding 10% in the vaginal mycobiome.
Taxonomic and functional diversity of the vaginal viruses
To explore the taxonomic content of the vaginal virome, we deduplicated the viral genomes of VMGC into 4,263 species-level vOTUs at 95% nucleotide identity (Supplementary Table 8). Rarefaction analysis showed that the accumulation curve of vOTUs was not reaching a plateau at the current number of viral genomes, despite the non-singleton vOTUs (consisting of 35.5% of all vOTUs) reaching saturation (Fig. 4a). In addition, we conducted a clustering analysis on VMGC vOTUs and all viral species from five large-scale virome databases including Gut Virome Database 40, Gut Phage Database41, Metagenomic Gut Virus catalogue42, Oral Virus Database43 and all NCBI RefSeq viral genomes. The result revealed that 85.8% of vOTUs in the VMGC were not found in any of the other virome databases (Fig. 4b). These findings indicate a significant unexplored diversity of viral content in the human vagina as well as in VMGC.
Taxonomic classification revealed that 66.0% (2,814 out of 4,263) of the vOTUs could be robustly assigned to known viral families, including 2,744 vOTUs assigned to 13 prokaryotic viral families and 70 vOTUs assigned to 7 eukaryotic viral families (Fig. 4c). The most frequently observed families in VMGC were Siphoviridae (n = 2,134 vOTUs) and Myoviridae (n = 461), followed by Papillomaviridae (n = 61), Rountreeviridae (n = 39), Podoviridae (n = 36), Quimbyviridae (n = 27), Autographiviridae (n = 14), p-crAss-like (n = 13) and other families. Siphoviridae and Myoviridae were the most dominant viral families in human body sites including the gut and oral cavity40,41,42,43; however, another viral family that predominated in the human gut, Microviridae, was rarely found in the vagina (n = 4 vOTUs in VMGC). In addition, 86.4% of vOTUs in VMGC had at least one predicted prokaryotic host (Fig. 4c). At the phylum level, the predicted hosts of vaginal viruses were dominated by Actinomycetota, Bacteroidota, Bacillota and Bacillota_A.
We constructed a phylogenetic tree of 4,263 vOTUs based on their genome-wide similarity at the protein level (Fig. 4d). The tree revealed that the viruses tend to cluster by both family-level taxonomies and potential host affiliations. This finding largely agreed with that obtained from the oral viral catalogue and suggested that host adaptation was an important factor driving genomic evolution in human-associated viruses43. The Siphoviridae vOTUs were predicted to infect Actinomycetota, Bacillota, Bacillota_A and other bacteria (Fig. 4c). Strikingly, Siphoviridae contained numerous viruses (n = 440; corresponding to 20.6% of all Siphoviridae viruses) that were predicted to infect hosts across multiple prokaryotic phyla; however, this phenomenon was rarely observed in viruses of other families (average 2.4% for 12 prokaryotic viral families) (Supplementary Table 8), suggesting a relatively broad host range of vaginal Siphoviridae members. Myoviridae were predicted to infect Bacillota_A, Bacillota, Pseudomonadota and others, but Rountreeviridae and Autographiviridae seemed to be specific to the infection of Bacillota and Bacillota_C, respectively. Notably, the Bacteroidota phages were remarkably concentrated in the phylogenetic tree (Fig. 4d). Specifically, out of 520 phages infecting Bacteroidota, the majority (86.2%) were unclassified viruses, indicating the existence of many previously unknown taxa of Bacteroidetes phages in the human vagina.
To elaborate on function profile of the vaginal virome, we annotated the functions of all vOTUs using the KEGG database, resulting in 13.9% of viral genes with KEGG orthology annotations. Among these annotated genes, 49.0% were involved in genetic information processing, while 19.9% were viral auxiliary metabolic genes (Fig. 4e). Focusing on the most prevalent auxiliary metabolic genes in VMGC, many of them were related to the metabolism of peptidoglycan, nucleotide, amino acids, folate and sulfur (Fig. 4f). Further CAZy annotation revealed that the vaginal viruses encoded numerous peptidoglycan lyase/hydrolase-related enzymes (Extended Data Fig. 7), likely facilitating bacterial cell wall degradation during viral infection44.
Characteristics of vaginal eukaryotic viruses
Of the 70 eukaryotic vOTUs identified, we specifically focused on the Papillomaviridae viruses (n = 61), considering the wide distribution of human papillomaviruses (HPVs) in the vagina and their implication in diseases such as cervical cancer45. Almost all papillomaviruses in VMGC had complete or high-quality genomes, mostly containing complete L1 gene (encoding the major capsid protein) for HPV typing (Fig. 5a and Supplementary Table 9). We assigned 61 Papillomaviridae vOTUs into 58 HPV types (including 5 unidentified types) based on their L1 genes, with 52 types previously discovered in the woman’s vagina and 6 types previously found in human penile and skin swabs. Likewise, a phylogenetic analysis of 61 Papillomaviridae vOTUs (n = 627 genomes) and the NCBI RefSeq Papillomaviridae viruses (n = 475 non-redundant genomes) revealed that our vOTUs represented the major phylogenetic clades of the human vaginal papillomaviruses (Fig. 5b and Extended Data Fig. 8a). Furthermore, based on our large number of genomes, we found that for certain HPV types, their genome evolutionary relationships had obvious regional stratification (Fig. 5c for the examples of HPV types 52 and 58). Compositional analysis of vaginal metagenomes based on VMGC showcased its potential in identifying enrichment of certain HPV types in patients with cervical lesions or cancer (q < 0.05; Extended Data Fig. 8b). Collectively, these findings highlighted the value of VMGC Papillomaviridae viruses as a representable reference catalogue for subsequent HPV research.
Besides, three eukaryotic vOTUs, v0116 (Virgaviridae virus), v00f5 and v06d5 (Retroviridae viruses), had not been reported in the human vagina. The Virgaviridae v0116 genome was closely related to the pepper mild mottle virus, a plant virus that had also been found in the human gut, with a 99.8% nucleotide similarity, likely due to dietary ingestion46. The Retroviridae v00f5 and v06d5 had a high nucleotide similarity (99.9% and 97.0%) with a xenotropic murine leukaemia virus-related viruses that had previously been detected in human prostate tumour47.
Vaginal microbial gene catalogue and comparison with VIRGO
To explore the gene content of VMGC, we clustered approximately 26 million protein-coding genes from all prokaryotic, fungal and viral genomes at 50% (VMGC-50), 90% (VMGC-90) and 95% (VMGC-95) amino acid identity, generating three catalogues with 595,219, 1,415,799 and 1,786,695 million non-redundant genes, respectively. Rarefaction analysis revealed that the accumulation curves of VMGC-95 and VMGC-90 were not yet reaching a plateau, whereas the curve of VMGC-50 approached saturation (Fig. 6a), suggesting that VMGC represented an insufficient gene space but relatively saturated gene family space of the human vaginal microbiome. In addition, we compared the gene content of VMGC-90 with VIRGO10, which represented 0.62 million genes at 90% amino acid identity (referred to as VIRGO-90; Fig. 6b). About 70.6% of genes in VIRGO-90 were covered by VMGC-90, while only 30.8% VMGC-90 genes were also included in VIRGO-90. The remaining genes in VIRGO-90 might be there because this catalogue contained numerous genes from low-abundance microbes or other sequences (for example, plasmids, accessory genes) not binned into MAGs, or high intraspecies diversity in the human vaginal microbiome (Extended Data Fig. 9a)10. Nevertheless, VMGC provided an increase of 162% in coverage of the vaginal microbiome protein space over the VIRGO (from 0.62 to a total of 1.63 million protein clusters). Comparison with other databases also revealed that VMGC-90 significantly expands the protein content of the vaginal microbiota (Extended Data Fig. 9b). Furthermore, mapping of vaginal metagenomes against the vaginal microbial gene and genome catalogues showed an average mapping rate of 74.5% and 83.8%, respectively, in samples from 14 different countries, slightly higher than that of the VIRGO (average 71.7%) (Fig. 6c and Extended Data Fig. 9c,d). The mapping rate was higher in samples from Europe or North America than in those from China, Fiji or South Africa (p«0.001), suggesting a relatively low representativeness of VMGC genes in the latter regions.
We further compared the gene content of prokaryotes, fungi and viruses at the VMGC-90 level. Over a half (53.7%) of viral genes were shared with the prokaryotes (Fig. 6d and Extended Data Fig. 10a,b), likely due to the frequent exchange of genetic materials in these two kingdoms, whereas only a few genes were shared between fungi and prokaryotes/viruses. We also compared the functions of prokaryote-specific, fungus-specific and virus-specific genes. As expected, the virus-specific genes mainly encoded the typically viral enzymes involving genetic information processing (for example, DNA replication and repair, and transcription) and prokaryotic defence system, while the prokaryote- and fungus-specific genes had more metabolism-associated genes (Extended Data Fig. 10c).
Discussion
The VMGC is a large-scale reference sequence resource of the human vaginal microbiome with 33,804 genomes across three microbial kingdoms. A considerable proportion of species in VMGC, including over 25% of prokaryotic and 85% of viral species, have not been successfully cultured. In addition, surpassing the VIRGO database10, VMGC offers an extensive expansion of microbial gene sequences, enhancing our ability to study the functional potential of the vaginal microbiota.
Functionally, our genome-level analysis found that L. iners and B. vaginale were the primary contributors of genes associated with pore-forming toxins such as haemolysin and inerolysin, consistent with the previous study23. In addition to the common microbial contributors to urease and phospholipase C production in the human vagina such as Ureaplasma spp.23, we have observed frequent encoding of these enzymes in the genomes of Ralstonia pickettii, P. aeruginosa and Staphylococcus epidermidis. For the underexplored prokaryotes, we observed that members of Saccharofermentanales, such as BVAB2, contributed to many genes related to cadaverine production, which has not been reported previously. In addition, Saccharofermentanales members were deficient in the de novo purine biosynthesis pathway, and it is known that similar microorganisms rely on host-derived purines for their growth33. Interestingly, the evolutionary trajectory of Saccharofermentanales members appears to be associated with genome reduction, a characteristic feature observed in obligate intracellular parasites of humans48, providing hints about the nutritional requirements and environmental preferences of these bacteria. These results may provide a reference value for the functional contributions of these bacteria to the vaginal microecosystem and their potential implications in disease development.
The VMGC includes 14,224 viral genomes across 4,263 species, with a significant proportion predicted to infect prokaryotic organisms in the vagina. The exploration of these viruses, particularly those that specifically target vaginal pathogens such as B. vaginale, F. vaginae, BVABs and so on, offers potential insights into phage therapy against vaginal microbial infections. The VMGC also served as a valuable resource for studying the intra-species diversity of viruses within the vaginal environment and investigating the geographical heterogeneity of their transmission.
In summary, the VMGC significantly advanced our knowledge of the vaginal microbiome, aiding in the development of targeted interventions, diagnostics and personalized approaches for vaginal health. However, it is worth noting some limitations of the VMGC. First is that the microbial sequences captured by the VMGC are not complete, especially for vaginal samples from the non-western population. The second is that the VMGC lacked fungal genomes from members of Cladosporium and Malassezia, among others which were observed in the vaginal samples using the ITS sequencing approach49. Third, many phages in the VMGC are unknown, and their functional exploration based on existing databases remains limited. Further isolation, cultivation and physiological studies are necessary to gain a deeper understanding of their genomic information and potential contributions to vaginal health. Another concern is contamination from experimental operations or kits in public vaginal metagenome datasets used for constructing the VMGC. Although our assessment indicated a low level of contamination (Supplementary Fig. 1), caution is still warranted when utilizing the VMGC.
Methods
The retrieval of prokaryotic and fungal genomes in the NCBI database
We searched for human vaginal microbial genomes from approximately 520,000 prokaryotic and 13,000 fungal genomes recorded in the NCBI genome database (https://www.ncbi.nlm.nih.gov/genome/browse/) until April 2023. We first excluded genomes with any mention of the following keywords in their NCBI BioSample information (https://ftp.ncbi.nlm.nih.gov/biosample/): ‘uncult*’, ‘MAG’ and ‘bin’. Subsequently, we extracted genomes that included any mention of the following keywords in their NCBI BioSample information: ‘ovary’, ‘vagin*’, ‘endomet*’, ‘genita*’, ‘cervi*’, ‘uterus’, ‘reprodu*’, ‘posterior’, ‘douglas’ and ‘fallopian’. This process yielded approximately 2,100 candidate genomes. To further refine our selection, we manually checked and extracted genomes of cultured isolates from the candidate genomes based on the metadata records in their BioSample information, Genomes Online Database information (https://gold.jgi.doe.gov/) or original studies. This refinement resulted in 1,189 prokaryotic and 18 fungal isolate genomes. These genomes then underwent genome quality filtering. For prokaryotic genomes, we assessed their quality using CheckM2 v1.0.131 and GUNC v1.0.5 tools50. We retained 972 genomes meeting the criteria of ≥50% completeness, <5% contamination, a genome quality score of ≥50 (defined as %completeness − 5×%contamination) and a clade separation score of <0.45 (Supplementary Table 3). In terms of fungal isolate genomes, the quality assessment was performed using BUSCO v5.4.2 with the parameter ‘-m genome’ based on the fungi_odb10.2019–11–20 database32. We retained 17 fungal genomes with ≥50% completeness and <5% contamination (Supplementary Table 4).
Collection of vaginal metagenomic samples
We performed an extensive search in the NCBI database until October 2023, targeting human vaginal metagenomic samples annotated as vagin*, cervi*, endomet* and so on. The search yielded a collection of 3,123 vaginal metagenomes from 32 different studies (Supplementary Table 1). Furthermore, we obtained additional 1,366 vaginal metagenomic samples from the VIRGO website (https://virgo.igs.umaryland.edu/)10, after removing duplicates found in the NCBI samples. The accession ID of each sample is provided in Supplementary Table 2.
Fungal culture and shotgun genome sequencing
The study received approval from the Ethics Committee of the Second Affiliated Hospital of Dalian Medical University (number DMU20210082). Informed consent was obtained from all volunteers. Fungal cultivation was performed based on fresh specimens of genital tract secretions from three healthy women: subject 1 (21 years old), subject 2 (36 years old) and subject 3 (47 years old). The vaginal samples were collected by a gynaecologist from vaginal fornix with a cotton swab. Before the examination, the recruited women were asked to avoid vaginal douching, sexual intercourse and other vaginal manipulations for 3 days. Women with vaginal bleeding or vaginal and cervical inflammation were excluded from the research.
A total of 1 ml of suspension was inoculated immediately onto each agar plate containing fungal culture medium, which was incubated until colonies were observed under aerobic and anaerobic conditions (32 °C). After incubation for 3–14 days, various fungal colonies were observed on the cultured agar plates, while no fungal colonies were observed on the agar plates for the culture of the sterile diluent buffer as the negative control. From each incubated agar plate, phenotypically distinct colonies were picked onto the corresponding fresh medium for further purification. Then, a single colony was picked and restreaked on the corresponding medium to purify the fungal strains. The purified fungal strains for identification and storage were inoculated in a 5 ml modified Martin broth medium under the same culturing conditions (32 °C, 130 r.p.m. shaking). Shotgun genome sequencing of the cultured vaginal fungi was performed based on the Illumina NovaSeq platform for 2 × 150 bp paired-end sequencing. Sequencing libraries were prepared by using the NEBNext Ultra DNA Library Prep Kit following the manufacturer’s recommendations.
Preprocessing and assembly
To ensure data quality, raw reads from each sample were subjected to quality control using fastp v0.20.151 with the parameters ‘-u 30 -q 20 -l 30 -y --trim_poly_g’ for samples with read lengths of 75 bp or less and ‘-u 30 -q 20 -l 60 -y --trim_poly_g’ for samples with read lengths greater than 75 bp. A secondary complexity filter was applied to each sample using bbduk with the parameters ‘entropy=0.6 entropywindow=50 entropyk=5’ from the BBTools suite v39.0052. Subsequently, the reads deriving from human and Escherichia phage phiX174 were removed by mapping the quality-filtered reads against the CHM13v2.0and phiX174 (NCBI accession NC_001422.1) genomes using Bowtie2 v2.4.1 with the parameters ‘--end-to-end --fast’53. The remaining reads were considered clean reads for each sample. As a result of this preprocessing, a total of 4,472 vaginal metagenomic samples, representing approximately 5.0 Tbp high-quality non-human metagenomic data, were available for further analysis (Supplementary Table 2). The clean reads were de novo assembled into contigs for each sample using MEGAHIT v1.2.954 with the parameters ‘--k-list 21,41,61,81,101,121’, resulting in over 4.1 million long contigs (minimum length of 2 kbp; total length of 38.3 Gbp; with 38.9% of these contigs longer than 5 kbp).
Genome analysis for prokaryotic populations
MAGs
Only assembly files containing contigs with lengths greater than 2,000 bp were used to recover MAGs. Referring to recent studies55,56, we developed a ‘mash-based multi-coverage binning approach’ based on Mash57 to select a specific number of samples for multi-sample read alignment and MAG recovering. The specific approach is outlined as follows: (1) calculate the Mash distance among the assembled contig sets of all sample pairs using Mash v2.3 with the parameters ‘-s 100000 -k 32’57; (2) for each sample, we selected the N most similar samples based on Mash distance, mapping their reads to the contigs of the target samples, and calculated the contigs’ sequencing depth in the most similar samples using the jgi_summarize_bam_contig_depths tool58; (3) for each target sample, we integrated the depth files of its N most similar samples using combine.pl (https://github.com/WatsonLab/single_and_multiple_binning/blob/main/scripts/combine.pl) to obtain the contigs’ multi-sample depth. Based on this integrated depth file, we performed multi-coverage metagenomic binning for the contigs using MetaBAT 2 with the parameters ‘-m 2000 -s 200000 --seed 2020’58, ultimately obtaining the raw bins for each sample. To determine the optimal value for N, representing the number of most similar samples used for calculating sequencing depth, we randomly selected 80 vaginal samples and conducted multi-coverage binning processes for each sample using the top 10–60 closest samples (from 4,472 samples) based on Mash distance. This test revealed that, compared to single-coverage binning, all multi-coverage binning approaches, regardless of selecting 10–60 samples, yielded a significantly higher number of MAGs and a lower proportion of heterozygous MAGs (Supplementary Fig. 2). After N reached 20, the trend of increasing MAG quantity approached a plateau, indicating that selecting the 20 closest samples for multi-coverage binning is appropriate for vaginal samples. Using the described approach, we applied multi-coverage binning to all vaginal metagenomic samples. In addition, we performed single-coverage binning for each sample. Subsequently, using dRep v3.4.0 with the parameters ‘-pa 0.9 -sa 0.99 -nc 0.3 --S_algorithm fastANI’59, we conducted strain-level de-redundancy (99% average nucleotide identity (ANI)) for quality-controlled (see Quality assessment of prokaryotic genomes section) MAGs obtained from both multi-coverage and single-coverage binning within each sample, resulting in a total of quality-controlled 18,570 MAGs (where 4,713 and 13,857 come from single-coverage and multi-coverage binning, respectively).
Quality assessment of prokaryotic genomes
The completeness and contamination of prokaryotic genomes (for example, MAGs and isolated genomes) were estimated using CheckM2 v1.0.131, and the genome chimerism of prokaryotic genomes was evaluated by GUNC v1.0.550. Genomes with <50% completeness, ≥5% contamination, quality score of <50 or clade separation score of ≥0.45 were removed for further analyses. During the process of categorizing the integrity of prokaryotic genomes, we referred to the revised MIMAG standard17. In general, the medium-quality MAGs were characterized by ≥50% completeness and <5% contamination; the high-quality MAGs showed ≥90% completeness and <5% contamination; the near-complete MAGs were defined as the high-quality MAGs with the presence of 5S, 16S and 23S rRNA genes, as well as at least 18 types of transfer RNA. The identification of rRNA within MAGs was performed using cmsearch in INFERNAL v1.1.460, while the identification of tRNAs was conducted using tRNAscan-SE v2.0.1161 with the parameters ‘-L -B’.
Clustering, taxonomic and phylogenetic analysis
All prokaryotic MAGs (n = 18,570) and isolate genomes (n = 972) were clustered at the species level by dRep v3.4.059 with the parameters ‘-pa 0.9 -sa 0.95 -nc 0.3 --S_algorithm fastANI’, resulting in 786 SGBs. For each SGB, MAGs are sorted in the order of quality groups: near-complete, high-quality and medium-quality. Within each group, they are further sorted based on the genome quality score from high to low. The top-ranked MAG was considered as the reference genome for the corresponding SGB. Taxonomic classification of the SGBs was carried out by the classify_wf model in the Genome Taxonomy Database Toolkit (GTDB-Tk) v1.4.0 with GTDB database r214.162. The phylogenetic tree created by GTDB-Tk was visualized using iTOL v6.7.4 (https://itol.embl.de/)63. In addition, the phylogenetic diversity of each phylum was calculated based on the aforementioned phylogenetic tree using the function ‘pd’ in the R package picante v1.8.2.
Taxonomic profiles in vaginal metagenomic samples
We used an in-house script to construct Kraken2 and Bracken databases comprising 786 prokaryotic species in the VMGC. The clean reads were mapped against these prokaryotic genomes using Kraken v2.1.3 with confidence score 0.1 and other default parameters64. Based on these alignment results, Bracken v2.8 was used to assess the number of reads recruited by each taxon65. In the Bracken output, the fraction of total reads for each taxon is considered as the relative abundance of that taxon in the sample.
Functional annotation
We performed gene prediction of prokaryotic genomes in VMGC using Prodigal v2.6.366 with the parameter ‘-p meta’. To characterize the function of these predictive proteins, we searched for them in KEGG21, CAZy22, Virulence Factor Database67 and the combined antimicrobial resistance databases (customized in our previous study43) using diamond v2.0.13.15168. For KEGG and CAZy annotations, a protein was successfully assigned into a specific KEGG functional ortholog or CAZyme if it achieved a score of at least 60 across 50% of the sequences. For Virulence Factor Database annotations, a protein was successfully assigned into a specific virulence factor if it achieved a 60% identity across 50% of the sequences. For different types of antimicrobial resistance gene, we applied specific criteria to annotate proteins based on the method provided by a previous study69. These criteria included the following: (1) proteins associated with beta-lactamases were annotated based on a minimum amino acid similarity of 90%, (2) proteins associated with multiple drug resistance were annotated based on a minimum amino acid similarity of 70% and (3) for other types of antimicrobial resistance gene, a minimum amino acid similarity of 80% was required for annotation.
Quantification of functional family
To quantify the functional contribution of prokaryotic species to the vaginal ecosystem, we calculated the weighted abundance of each functional ortholog (for example, KEGG functional ortholog or CAZyme) within different SGBs. For SGB i, if it has Ni,j genes annotated with a functional ortholog j, the weighted abundance of this ortholog in a sample can be calculated by \({W}_{i,\;j}=\frac{{N}_{i,j}}{{R}_{i}}\), where Ri is the relative abundance of the SGB i in this sample. In terms of each functional module (for example, lactate production), its weighted abundance was calculated by summing the weighted abundances of the functional orthologs associated with this functional module (Supplementary Table 7).
Genomic analysis of Saccharofermentanales members
In addition to CheckM2, we also used CheckM v1.1.370 and BUSCO v5.4.232 (with the parameter ‘-m genome’ based on the bacteria_odb10.2019-06-26 database) to estimate the completeness and contamination of all Saccharofermentanales genomes in the VMGC. Both CheckM v1.1.3 and BUSCO showed <90% completeness of these genomes including three complete genomes (NCBI accession GCF_029101565.1, GCF_000025225.2 and GCF_029101585.1). Moreover, we built a phylogenetic tree comprising 14 Saccharofermentanales species within the VMGC, along with all Saccharofermentanales genomes (n = 192) recorded in GTDB database r214.162, using PhyloPhlAn v0.9971.
Genome analysis for fungal populations
Identification of fungal MAGs
We tried to identify fungal genomes from the pool of over 63,000 raw MAGs. To reduce the computational burden, we first extracted 371 raw MAGs that had a genomic size of over 3 Mb. Subsequently, non-eukaryotic sequences were removed from each bin using EukRep v0.6.7 with the parameter ‘-min 2000’72, which generated 38 MAGs with a genomic size of over 3 Mb. The genome quality of these MAGs was assessed using BUSCO v5.4.2 with the parameter ‘-m genome’ based on the fungi_odb10.2019-11-20 database. The 30 MAGs showed >50% completeness and <5% contamination and were considered as potential fungal genomes. These 30 MAGs further underwent intra-sample and strain-level deduplication (99% ANI) by dRep v3.4.0 with the parameters ‘-pa 0.9 -sa 0.99 -nc 0.3 -S_algorithm fastANI’59, resulting in 13 fungal MAGs.
Taxonomic classification and clustering of fungal MAGs
We performed a preliminary taxonomic classification for 13 MAGs based on their internal transcribed spacer 1 (ITS1) sequences. The ITS1 sequences were extracted using QIIME v2021.2.073 with the parameters ‘-p-f-primer ACCTGCGGARGGATCA -p-r-primer GAGATCCRTTGYTRAAAGTT -p-max-length 300’. Subsequently, we applied QIIME with the parameters ‘feature-classifier classify-sklearn’ to annotate the ITS1 sequences based on the UNITE fungal ITS database version 8.374. Based on ITS1 annotation, we further estimated the ANI between the MAGs and NCBI fungal genomes that shared identical taxonomic assignments at the genus level. This was performed using dRep v3.4.0 with the parameters ‘dereplicate -pa 0.9 -sa 0.95 -nc 0.3 --S_algorithm fastANI’. Based on the ANI results, the MAGs were identified as known species (>95%) and genera (>80%). Likewise, all fungal genomes in the VMGC were grouped into the species-level clusters at 95% ANI.
Phylogenetic analysis of fungal genomes
We constructed a phylogenetic tree for 13 fungal MAGs and 25 fungal isolate genomes based on their single-copy protein markers. First, the prediction of fungal protein-coding genes was implemented using GeneMark-ES v4.68_lic75 with the parameters ‘--fungus --ES --min_contig 20000’, leading to 222,113 predictive proteins. All predictive protein sequences were clustered into 57,230 protein families at 50% identity level with MMseqs2 v12.113e376. Among these protein families, we identified 88 protein families as single-copy protein markers. These markers had to appear only once in over 80% of fungal genomes. We also ensured that the protein sequences within each marker cluster showed a similar sequence size, with a coefficient of variation of 20%. The protein markers from each genome were combined into a concatenated sequence. Furthermore, all concatenated sequences were aligned using MAFFT v7.47577, and a phylogenetic tree was constructed using IQ-TREE v2.1.278 with the parameters ‘-m MFP -bb 1000’.
Taxonomic profiles in vaginal metagenomic samples
First, the clean reads of vaginal metagenomic samples were mapped against fungal species-level genomes in the VMGC using Bowtie253. To improve alignment specificity and minimize read contaminations from non-fungal populations, the clean reads mapped into fungal genomes were further aligned against the following databases: (1) ITS sequences from the UNITE database, (2) 16S ribosomal DNA sequences from the SILVA 138 99% OTUs reference database79 and (3) all prokaryotic genomes in the VMGC. Any reads that were mapped to these databases were excluded from the taxonomic profiling process. To be considered present in a specific vaginal sample, a fungal species was required to capture at least three clean reads. The prevalence rate of a fungal species is then calculated based on this presence, determined by the number of vaginal samples with reads mapped against the species divided by the total number of samples. For the species-level profiles of each vaginal sample, the read count of each species was first normalized by dividing its genomic size. The normalized read count was then divided by the sum of all normalized read counts in the sample, defining the relative abundance of a species.
Genome analysis for viral populations
Identification of virus sequences
We carried out the virus analysis within vaginal metagenomic samples according to our previous studies43,80. Assembled contigs with lengths of >5,000 bp were used for virus identification. We initially evaluated contigs using CheckV v0.7.018, removing those with a host gene count surpassing 10 and a host gene count exceeding five times the number of viral genes. Provirus fragments were detected and extracted from the contigs by CheckV. For the remaining sequences, contigs meeting any of the following criteria are considered possible virus sequences: (1) a higher number of viral genes compared to host genes according to CheckV v0.7.0, (2) detection as viruses using the thresholds of score >0.90 and P < 0.01 in DeepVirFinder v1.081 and (3) identification as viruses by VIBRANT v1.2.182 using default parameters. To reduce non-viral sequence contaminations, we identified bacterial universal single-copy orthologs (bacterial USCOs) within the possible viral genomes using hmmsearch83 with default parameters. The bacterial USCO ratio, defined as the number of bacterial USCOs divided by the total number of genes in each viral genome, was then calculated to assess the level of potential contamination. The genomes showing ≥5% bacterial USCO ratio were subsequently eliminated from the analysis. The remaining genomes with CheckV-estimated completeness of >50% were designated as the final viral genomes in the VMGC (VMGC-v).
Genome clustering and phylogenetic tree
The viral genomes were clustered into a vOTU based on a nucleotide similarity threshold of 95% across 85% of the genome using BLASTn v2.12.0 with the options ‘-evalue 1e-10 -word_size 20’. The representative genome was defined as the largest viral sequence within each vOTU. This clustering approach was also applied to evaluate the overlap of viral sequences among different virome databases (VMGC viruses, Oral Virus Database43, Gut Virome Database40, Gut Phage Database41, Metagenomic Gut Virus catalogue42, NCBI RefSeq-viruses). To understand the phylogenetic relationships between genomes, we constructed a viral proteomic tree using the ViPTreeGen v1.1.2 with the default parameter84.
Taxonomic classification
The taxonomic classification of viral sequences was carried out by searching for their protein sequences in a combined database. This database was generated using proteins from Virus-Host DB (downloaded in May 2021)85, crAss-like proteins from Guerin’s study86 and viral proteins from Benler’s study87 and Ye’s study88. The protein-coding sequences of viral genomes were predicted by Prodigal v2.6.366 with the parameter ‘-p meta’, and blasted against the combined database using diamond v2.0.13.151 with the parameters ‘--id 30 --subject-cover 50 --query-cover 50 --min-score 50’. For small viral genomes with a gene count of less than 30, a viral genome was annotated into a known viral family based on the criterion that more than one-fifth of its proteins showed a match to the same family. Conversely, for large viral genomes with a gene count of 30 or more, a viral sequence was assigned to a known viral family if at least 10 of its proteins were matched to the same family.
Virus-host prediction
Virus-host prediction was conducted based on all prokaryotic genomes (19,542 MAGs) in the VMGC. We applied various criteria to detect potential prokaryotic hosts of viruses, including the following approaches: (1) CRISPR (clustered regularly interspaced short palindromic repeats) spacer sequences were predicted in prokaryotic MAGs using MinCED v0.4.2 with the parameter ‘-minNR 2’89. If a host CRISPR spacer sequence matched to a viral genome with a bit-score of 45 or higher using BLASTn (with the parameters ‘-evalue 1e-5 -word_size 8’), the virus was assigned to that host. (2) The viral sequence was blasted against host genome sequences and assigned a host if the viral sequence was matched to the host genome with ≥90% nucleotide identity and ≥30% viral coverage.
Phylogenetic analysis of Papillomaviridae members
We classified the Papillomaviridae members in the VMGC into traditional papillomavirus types based on their L1 structural protein. We extracted the coding sequence labelled L1 structural proteins from all Papillomaviridae members available in the NCBI database. For our Papillomaviridae genomes, we selected putative L1 structural genes with a minimum length of 1,300 bp and shared a nucleotide identity of at least 40% with known L1 structural genes. Based on L1 sequence similarity, the Papillomaviridae members were identified as known types (>90%), species (>70%) and genera (>60%)90. A phylogenetic tree was built based on available L1 structural genes from all papillomavirus in the VMGC and NCBI database using mafft-sparsecore.rb77 and IQ-TREE tools v2.1.2 (ref. 78).
Function annotation
The viral function annotation followed the same method used for the annotation of prokaryotic protein-coding sequences as described above.
Vaginal microbial gene catalogue
We created a comprehensive gene catalogue of the vaginal microbiome by combining all putative protein-coding sequences derived from prokaryotic, fungal and viral genomes in the VMGC. The clustering of protein sequences was implemented using the easy-linclust model in MMseqs2 v12.113e376. This clustering approach was also applied to evaluate the overlap of protein sequences between the VMGC and VIRGO (a comprehensive vaginal gene catalogue) based on an amino acid similarity of 90% across 80% of the sequence. Furthermore, we compared the performance of the VMGC and VIRGO on recruiting clean reads from vaginal metagenomic samples. We mapped the clean reads from each sample against all gene sequences from VMGC and VIRGO, respectively, using Bowtie2 v2.4.1 with the parameter ‘--end-to-end --mm --fast’. The mapping read rate was calculated as the number of mapped reads divided by the total number of clean reads in each sample. It is worth noting that, when calculating the mapping rate based on the genome set for VMGC, this set comprises genomes from 786 prokaryotic SGBs, 11 fungal species and 4,263 vOTUs in the VMGC.
Statistics and reproducibility
We did not use a statistical method to determine the sample size, and the vast majority of collected samples are included in the analysis, with filtered samples typically being excluded due to lack of available clean reads after data quality control or mapping to the reference database. Data collection and analysis were not conducted blindly to the experimental conditions, except for fungal cultivation. Volunteers providing vaginal samples for fungal cultivation were not randomized; specific exclusion criteria have been described in the ‘Fungal culture and shotgun genome sequencing’ section. Given the sparse and non-normally distributed microbial data, relevant statistical analyses were performed using non-parametric tests such as the Wilcoxon signed-rank test.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
The data files of VMGC, including prokaryotic, eukaryotic and viral genome sequences, annotation files and the updated Kraken database, have been deposited in the Zenodo repository with the accession ID 10457006 (https://doi.org/10.5281/zenodo.10457005) (ref. 91). The assembled genomes of the cultivated fungal strains were deposited in the NCBI database with BioProject accession ID PRJNA1100704. All cultivated fungal strains in this study are maintained at Dalian Medical University and can be obtained from the corresponding authors upon request. Source data are provided with this paper.
Code availability
The metadata, intermediate results, and analysis and visualization codes used in this study have been uploaded into the GitHub repository, accessible at https://github.com/RChGO/VMGC.
References
Martin, D. H. The microbiota of the vagina and its influence on women’s health and disease. Am. J. Med. Sci. 343, 2–9 (2012).
Anahtar, M. N., Gootenberg, D. B., Mitchell, C. M. & Kwon, D. S. Cervicovaginal microbiota and reproductive health: the virtue of simplicity. Cell Host Microbe 23, 159–168 (2018).
Petrova, M. I., Lievens, E., Malik, S., Imholz, N. & Lebeer, S. Lactobacillus species as biomarkers and agents that can promote various aspects of vaginal health. Front. Physiol. 6, 81 (2015).
Chee, W. J. Y., Chew, S. Y. & Than, L. T. L. Vaginal microbiota and the potential of Lactobacillus derivatives in maintaining vaginal health. Microb. Cell Fact. 19, 203 (2020).
Ling, Z. et al. Associations between vaginal pathogenic community and bacterial vaginosis in Chinese reproductive-age women. PLoS ONE 8, e76589 (2013).
Onderdonk, A. B., Delaney, M. L. & Fichorova, R. N. The human microbiome during bacterial vaginosis. Clin. Microbiol. Rev. 29, 223–238 (2016).
Oakley, B. B., Fiedler, T. L., Marrazzo, J. M. & Fredricks, D. N. Diversity of human vaginal bacterial communities and associations with clinically defined bacterial vaginosis. Appl. Environ. Microbiol. 74, 4898–4909 (2008).
Arnold, J. W., Simpson, J. B., Roach, J., Kwintkiewicz, J. & Azcarate-Peril, M. A. Intra-species genomic and physiological variability impact stress resistance in strains of probiotic potential. Front. Microbiol. 9, 242 (2018).
Fettweis, J. M. et al. The vaginal microbiome and preterm birth. Nat. Med. 25, 1012–1021 (2019).
Ma, B. et al. A comprehensive non-redundant gene catalog reveals extensive within-community intraspecies diversity in the human vagina. Nat. Commun. 11, 940 (2020).
Jie, Z. et al. Life history recorded in the vagino-cervical microbiome along with multi-omes. Genom. Proteom. Bioinform. 20, 304–321 (2022).
Human Microbiome Project Consortium Structure, function and diversity of the healthy human microbiome. Nature 486, 207–214 (2012).
Almeida, A. et al. A unified catalog of 204,938 reference genomes from the human gut microbiome. Nat. Biotechnol. 39, 105–114 (2021).
Zhu, J. et al. Over 50,000 metagenomically assembled draft genomes for the human oral microbiome reveal new taxa. Genom. Proteom. Bioinform. https://doi.org/10.1016/j.gpb.2021.05.001 (2021).
Saheb Kashaf, S. et al. Integrating cultivation and metagenomics for a multi-kingdom view of skin microbiome diversity and functions. Nat. Microbiol. 7, 169–179 (2022).
Pasolli, E. et al. Extensive unexplored human microbiome diversity revealed by over 150,000 genomes from metagenomes spanning age, geography, and lifestyle. Cell 176, 649–662 e620 (2019).
Bowers, R. M. et al. Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea. Nat. Biotechnol. 35, 725–731 (2017).
Nayfach, S. et al. CheckV assesses the quality and completeness of metagenome-assembled viral genomes. Nat. Biotechnol. https://doi.org/10.1038/s41587-020-00774-7 (2020).
Jain, C., Rodriguez, R. L., Phillippy, A. M., Konstantinidis, K. T. & Aluru, S. High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat. Commun. 9, 5114 (2018).
Ravel, J. et al. Vaginal microbiome of reproductive-age women. Proc. Natl Acad. Sci. USA 108, 4680–4687 (2011).
Kanehisa, M., Sato, Y., Kawashima, M., Furumichi, M. & Tanabe, M. KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res. 44, D457–D462 (2016).
Lombard, V., Golaconda Ramulu, H., Drula, E., Coutinho, P. M. & Henrissat, B. The carbohydrate-active enzymes database (CAZy) in 2013. Nucleic Acids Res. 42, D490–D495 (2014).
Zhu, B., Tao, Z., Edupuganti, L., Serrano, M. G. & Buck, G. A. Roles of the microbiota of the female reproductive tract in gynecological and reproductive health. Microbiol Mol. Biol. Rev. 86, e0018121 (2022).
Briselden, A. M., Moncla, B. J., Stevens, C. E. & Hillier, S. L. Sialidases (neuraminidases) in bacterial vaginosis and bacterial vaginosis-associated microflora. J. Clin. Microbiol. 30, 663–666 (1992).
Cauci, S. et al. Specific immune response against Gardnerella vaginalis hemolysin in patients with bacterial vaginosis. Am. J. Obstet. Gynecol. 175, 1601–1605 (1996).
Ragaliauskas, T. et al. Inerolysin and vaginolysin, the cytolysins implicated in vaginal dysbiosis, differently impair molecular integrity of phospholipid membranes. Sci. Rep. 9, 10606 (2019).
Wolrath, H., Forsum, U., Larsson, P.-G. & Borén, H. Analysis of bacterial vaginosis-related amines in vaginal fluid by gas chromatography and mass spectrometry. J. Clin. Microbiol. 39, 4026–4031 (2001).
Nelson, T. M. et al. Vaginal biogenic amines: biomarkers of bacterial vaginosis or precursors to vaginal dysbiosis? Front. Physiol. 6, 253 (2015).
Srinivasan, S. et al. Metabolic signatures of bacterial vaginosis. Mbio 6, 10.1128/mBio.00204-15 (2015).
Srinivasan, S. et al. Amygdalobacter indicium gen. nov., sp. nov., and Amygdalobacter nucleatus sp. nov., gen. nov.: novel bacteria from the family Oscillospiraceae isolated from the female genital tract. Int. J. Syst. Evol. Microbiol. 73, 10.1099/ijsem.0.006017 (2023).
Chklovski, A., Parks, D. H., Woodcroft, B. J. & Tyson, G. W. CheckM2: a rapid, scalable and accurate tool for assessing microbial genome quality using machine learning. Nat. Methods 20, 1203–1212 (2023).
Manni, M., Berkeley, M. R., Seppey, M., Simao, F. A. & Zdobnov, E. M. BUSCO update: novel and streamlined workflows along with broader and deeper phylogenetic coverage for scoring of eukaryotic, prokaryotic, and viral genomes. Mol. Biol. Evol. 38, 4647–4654 (2021).
Chua, S. M. & Fraser, J. A. Surveying purine biosynthesis across the domains of life unveils promising drug targets in pathogens. Immunol. Cell Biol. 98, 819–831 (2020).
Pereira, C. S., de Regt, A. K., Brito, P. H., Miller, S. T. & Xavier, K. B. Identification of functional LsrB-like autoinducer-2 receptors. J. Bacteriol. 191, 6975–6987 (2009).
Xavier, K. B. & Bassler, B. L. Interference with AI-2-mediated bacterial cell–cell communication. Nature 437, 750–753 (2005).
Pereira, C. S., McAuley, J. R., Taga, M. E., Xavier, K. B. & Miller, S. T. Sinorhizobium meliloti, a bacterium lacking the autoinducer‐2 (AI‐2) synthase, responds to AI‐2 supplied by other bacteria. Mol. Microbiol. 70, 1223–1235 (2008).
Han, X. et al. Riemerella anatipestifer lacks luxS, but can uptake exogenous autoinducer-2 to regulate biofilm formation. Res. Microbiol. 166, 486–493 (2015).
Zhang, L. et al. Sensing of autoinducer-2 by functionally distinct receptors in prokaryotes. Nat. Commun. 11, 5371 (2020).
Nejat, Z. A. et al. Molecular identification and antifungal susceptibility pattern of non-albicans Candida species isolated from vulvovaginal candidiasis. Iran. Biomed. J. 22, 33 (2018).
Gregory, A. C. et al. The gut virome database reveals age-dependent patterns of virome diversity in the human gut. Cell Host Microbe 28, 724–740 e728 (2020).
Camarillo-Guerrero, L. F., Almeida, A., Rangel-Pineros, G., Finn, R. D. & Lawley, T. D. Massive expansion of human gut bacteriophage diversity. Cell 184, 1098–1109 e1099 (2021).
Nayfach, S. et al. Metagenomic compendium of 189,680 DNA viruses from the human gut microbiome. Nat. Microbiol 6, 960–970 (2021).
Li, S. et al. A catalog of 48,425 nonredundant viruses from oral metagenomes expands the horizon of the human oral virome. iScience 25, 104418 (2022).
Rodriguez-Rubio, L., Martinez, B., Donovan, D. M., Rodriguez, A. & Garcia, P. Bacteriophage virion-associated peptidoglycan hydrolases: potential new enzybiotics. Crit. Rev. Microbiol. 39, 427–434 (2013).
Piyathilake, C. J. et al. Cervical microbiota associated with higher grade cervical intraepithelial neoplasia in women infected with high-risk human papillomaviruses. Cancer Prev. Res. 9, 357–366 (2016).
Zhang, T. et al. RNA viral community in human feces: prevalence of plant pathogenic viruses. PLoS Biol. 4, e3 (2006).
Hohn, O. et al. Lack of evidence for xenotropic murine leukemia virus-related virus (XMRV) in German prostate cancer patients. Retrovirology 6, 1–11 (2009).
Sakharkar, K. R., Dhar, P. K. & Chow, V. T. Genome reduction in prokaryotic obligatory intracellular parasites of humans: a comparative analysis. Int. J. Syst. Evol. Microbiol. 54, 1937–1941 (2004).
Godoy-Vitorino, F. et al. Cervicovaginal fungi and bacteria associated with cervical intraepithelial neoplasia and high-risk human papillomavirus infections in a hispanic population. Front. Microbiol. 9, 2533 (2018).
Orakov, A. et al. GUNC: detection of chimerism and contamination in prokaryotic genomes. Genome Biol. 22, 1–19 (2021).
Chen, S., Zhou, Y., Chen, Y. & Gu, J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890 (2018).
Bushnell, B. BBTools: a suite of fast, multithreaded bioinformatics tools designed for analysis of DNA and RNA sequence data (Joint Genome Institute, 2018).
Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).
Li, D., Liu, C. M., Luo, R., Sadakane, K. & Lam, T. W. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics 31, 1674–1676 (2015).
Mattock, J. & Watson, M. A comparison of single-coverage and multi-coverage metagenomic binning reveals extensive hidden contamination. Nat. Methods https://doi.org/10.1038/s41592-023-01934-8 (2023).
Carter, M. M. et al. Ultra-deep sequencing of Hadza hunter–gatherers recovers vanishing gut microbes. Cell 186, 3111–3124 e3113 (2023).
Ondov, B. D. et al. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 17, 132 (2016).
Kang, D. D. et al. MetaBAT 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. PeerJ 7, e7359 (2019).
Olm, M. R., Brown, C. T., Brooks, B. & Banfield, J. F. dRep: a tool for fast and accurate genomic comparisons that enables improved genome recovery from metagenomes through de-replication. ISME J. 11, 2864–2868 (2017).
Nawrocki, E. P. & Eddy, S. R. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics 29, 2933–2935 (2013).
Chan, P. P., Lin, B. Y., Mak, A. J. & Lowe, T. M. tRNAscan-SE 2.0: improved detection and functional classification of transfer RNA genes. Nucleic Acids Res. 49, 9077–9096 (2021).
Chaumeil, P. A., Mussig, A. J., Hugenholtz, P. & Parks, D. H. GTDB-Tk: a toolkit to classify genomes with the Genome Taxonomy Database. Bioinformatics https://doi.org/10.1093/bioinformatics/btz848 (2019).
Letunic, I. & Bork, P. Interactive Tree Of Life (iTOL) v5: an online tool for phylogenetic tree display and annotation. Nucleic Acids Res. 49, W293–W296 (2021).
Wood, D. E., Lu, J. & Langmead, B. Improved metagenomic analysis with Kraken 2. Genome Biol. 20, 1–13 (2019).
Lu, J., Breitwieser, F. P., Thielen, P. & Salzberg, S. L. Bracken: estimating species abundance in metagenomics data. PeerJ Comput. Sci. 3, e104 (2017).
Hyatt, D. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11, 119 (2010).
Chen, L. et al. VFDB: a reference database for bacterial virulence factors. Nucleic Acids Res. 33, D325–D328 (2005).
Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nat. Methods 12, 59–60 (2015).
Li, S. et al. Cataloguing and profiling of the gut virome in Chinese populations uncover extensive viral signatures across common diseases. Preprint at bioRxiv https://doi.org/10.1101/2022.12.27.522048 (2022).
Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P. & Tyson, G. W. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 25, 1043–1055 (2015).
Segata, N., Börnigen, D., Morgan, X. C. & Huttenhower, C. PhyloPhlAn is a new method for improved phylogenetic and taxonomic placement of microbes. Nat. Commun. 4, 2304 (2013).
West, P. T., Probst, A. J., Grigoriev, I. V., Thomas, B. C. & Banfield, J. F. Genome-reconstruction for eukaryotes from complex natural microbial communities. Genome Res. 28, 569–580 (2018).
Kuczynski, J. et al. Using QIIME to analyze 16S rRNA gene sequences from microbial communities. Curr. Protoc. Microbiol. 1, Unit 1E 5 (2012).
Abarenkov, K. et al. The UNITE database for molecular identification of fungi—recent updates and future perspectives. New Phytol. 186, 281–285 (2010).
Ter-Hovhannisyan, V., Lomsadze, A., Chernoff, Y. O. & Borodovsky, M. Gene prediction in novel fungal genomes using an ab initio algorithm with unsupervised training. Genome Res. 18, 1979–1990 (2008).
Steinegger, M. & Soding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).
Katoh, K. & Standley, D. M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 30, 772–780 (2013).
Nguyen, L.-T., Schmidt, H. A., Von Haeseler, A. & Minh, B. Q. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol. 32, 268–274 (2015).
Quast, C. et al. The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Res. 41, D590–D596 (2012).
Yan, Q. et al. Characterization of the gut DNA and RNA viromes in a cohort of Chinese residents and visiting Pakistanis. Virus Evol. 7, veab022 (2021).
Ren, J. et al. Identifying viruses from metagenomic data using deep learning. Quant. Biol. 8, 64–77 (2020).
Kieft, K., Zhou, Z. & Anantharaman, K. VIBRANT: automated recovery, annotation and curation of microbial viruses, and evaluation of viral community function from genomic sequences. Microbiome 8, 90 (2020).
Eddy, S. R. Accelerated profile HMM searches. PLoS Comput. Biol. 7, e1002195 (2011).
Nishimura, Y. et al. ViPTree: the viral proteomic tree server. Bioinformatics 33, 2379–2380 (2017).
Mihara, T. et al. Linking virus genomes with host taxonomy. Viruses 8, 66 (2016).
Guerin, E. et al. Biology and taxonomy of crAss-like bacteriophages, the most abundant virus in the human gut. Cell Host Microbe 24, 653–664 e656 (2018).
Benler, S. et al. Thousands of previously unknown phages discovered in whole-community human gut metagenomes. Microbiome 9, 78 (2021).
Ye, S. et al. An atlas of human viruses provides new insights into diversity and tissue tropism of human viruses. Bioinformatics 38, 3087–3093 (2022).
Bland, C. et al. CRISPR recognition tool (CRT): a tool for automatic detection of clustered regularly interspaced palindromic repeats. BMC Bioinf. 8, 209 (2007).
Bernard, H.-U. et al. Classification of papillomaviruses (PVs) based on 189 PV types and proposal of taxonomic amendments. Virology 401, 70–79 (2010).
Guo, R. VMGC – human vaginal microbiome genome collection. Zenodo https://doi.org/10.5281/zenodo.10457005 (2024).
Liu, H., Liang, H. & Li, D. et al. Association of cervical dysbacteriosis, HPV oncogene expression, and cervical lesion progression. Microbiol. Spectr. 10, e00151–22 (2022).
Bloom, S. M., Mafunda, N. A. & Woolston, B. M. et al. Cysteine dependence of Lactobacillus iners is a potential therapeutic target for vaginal microbiota modulation. Nat. Microbiol. 7, 434–450 (2022).
Acknowledgements
This work was supported by Science and Technology Innovative Project of Bao’an, Shenzhen (grant no. 2019JD318 to L.H.); COVID-19 Research & Application Project, Shenzhen Bao’an Chinese Medicine Developing Fund (grant no. 2020KJCX-KTYJ-76 to L.H.); and National Natural Science Foundation of China (grant no. 82370563 to Q.Y.).
Author information
Authors and Affiliations
Contributions
L.H., S.L., Q.Y. and W.S. conceived the study. R.G., S.L., Yue Zhang, J.K., J. Meng, H.Y. and J.Z. performed the analyses. X.W., S.G., Y.L., Z.X., P.Z., J. Ma, W.Y., Yan Zhang, G.H. and Z.D. performed culturing and sequencing experiments. L.H. and W.S. supervised the work and provided funding. S.L. and R.G. wrote the paper. All authors read, edited and approved the paper.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Microbiology thanks Alexandre Almeida, Donovan Parks and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Supplemental figure for the construction of the VMGC.
(a) Data sources and processing workflow for the construction of the VMGC. MAG, metagenome-assembled genome. QS, quality score. CSS, clade separation score. (b) Host:virus gene ratio analysis of the viral sequences in VMGC. For each viral sequence, the ‘viral genes’ and ‘microbial host genes’ were classified by CheckV based on a database created using known virus and prokaryotic genes, and the ‘host:viral gene ratio’ was calculated to estimate the potential microbial host gene contamination for the viral sequences in VMGC. Left panel, estimated for all viruses of VMGC and compared with other existing viral databases; right panel, estimated for the proviruses and non-proviruses of VMGC. The ‘host:viral gene ratio’ for all VMGC viruses was 1:16.0 (1:33.9 for proviruses and 1:10.5 for non-proviruses), suggesting minimal potential for microbial host gene contamination within our viral collection.
Extended Data Fig. 2 Summary of prokaryotic species in the VMGC.
(a) Rarefaction curves of the number of species detected as a function of the number of nonredundant genomes analyzed. Curves are depicted both for all the prokaryotic species and after excluding singleton species (represented by only one genome). (b) Phylogenetic tree for 786 prokaryotic species. The inner and outer layers depict the phylum-level taxonomic and cultivation information of the species, respectively. (c) The top 100 species with the highest number of genomes in VMGC. Upper panel, heatmap showing the geographic distribution of MAGs of the prokaryotic species. Middle panel, the number of MAGs and isolated genomes of the prokaryotic species. Bottom panel, the average relative abundance of the prokaryotic species.
Extended Data Fig. 3 Summary of prokaryotic functions in the VMGC.
(a) Proportions of annotated proteins in all putative proteins from 786 prokaryotic species using different functional annotation databases. The left bar plot shows the proportion of unannotated proteins (white block), while the right bar plot displays the proportion of proteins annotated into different functional modules. (b-e) The mean weighted abundance of disease-associated functional modules in different prokaryotic species across the vaginal samples. Each bar plot presents the top 15 species with the highest abundance. The species are colored according to their order-level taxonomic classification. The results of functional modules are grouped into 4 categories: (b) biofilm formation (mainly to protect harmful bacteria), (c) sialidase (disruption of the vaginal mucosal barrier, (d) toxins (cytolysin, hemolysin) and enzymes (urease, phospholipase C) (disruption of the epithelial barrier and induction of inflammation), (e) biogenic amines (cadaverine, N-acetylputrescine, and trimethylamine) (mainly to produce unpleasant odor, elevating pH to promote the growth of harmful bacteria).
Extended Data Fig. 4 Completeness of three dominated species of Saccharofermentanales.
(a) Boxplot showing the completeness of three species, estimated by BUSCO, CheckM, and CheckM2. In the boxplot, the center line represents the median, the box limits show the upper and lower quartiles, the whiskers extend to 1.5 times the interquartile range, and points outside the whiskers are considered outliers. (b) The missing proportions of 124 universal single-copy orthologs (USCOs) across the genomes of 3 species of Saccharofermentanales. Each point in the figure represents a specific USCO and is sorted based on the corresponding missing proportion. The total number of genomes within each species is shown in the bracket located in the upper right corner of the figure. (c) Phylogenetic tree of all Saccharofermentanales genomes in the VMGC and the GTDB-tk database. The inner and outer layers depict the sources and genome sizes of the species, respectively.
Extended Data Fig. 5 Functional comparison among SGB009, SGB034, and SGB080 based on the KEGG annotation.
(a) The overlap of KEGG functional orthologs (KOs) among SGB009, SGB034, and SGB080. (b) The overview of KEGG metabolism pathways in SGB009, SGB034, and SGB080. The figure is generated by Interactive Pathways Explorer (iPath) v3 (https://pathways.embl.de/). The line color is consistent with the color of the numbers in Figure (a).
Extended Data Fig. 6 Characteristics of fungal populations in the VMGC.
(a) Phylogenetic tree illustrating the relationship between 38 fungal genomes based on their single-copy protein markers. The orange point represents the genome derived from the metagenomic binning algorithm, the green point represents the genome recorded in the NCBI genome database, and the blue point represents the genome cultivated by this study. (b) Mapping rates of reads and prevalence rates for 11 fungal species in the vaginal mycobiome.
Extended Data Fig. 7 Distribution of carbohydrate-active enzymes (CAZymes) encoded by the vOTUs.
CBM, carbohydrate-binding module; CE, carbohydrate esterase; GH, glycoside hydrolase; GT, glycosyltransferase; PL, polysaccharide lyase.
Extended Data Fig. 8 Phylogenetic and compositional analysis of Papillomaviridae.
(a) Phylogenetic tree constructed based on the L1 proteins of the Papillomaviridae genomes present in the VMGC and the NCBI RefSeq database. The outer ring shows the species-level classification of Papillomaviridae genomes. The numbers located at the ends of the branches correspond to the HPV types. Genomes obtained from the NCBI RefSeq database are depicted as grey points at the tips of the branches, while genomes from the VMGC that failed to classify into a specific HPV type are represented by red triangles. (b) Viral compositional analysis of vaginal metagenomes based on VMGC. Vaginal metagenomes were downloaded from the Liu et al.’s study [1], and the compositional profiles of metagenomes were generated based on VMGC viral genomes. The comparison of HPV abundances is shown between health controls and cervical lesion patients. Boxplot showing the relative abundances of different groups. In the boxplot, the center line represents the median, the box limits show the upper and lower quartiles, the whiskers extend to 1.5 times the interquartile range, and points outside the whiskers are considered outliers. HC, health controls; HR-HPV, high-risk HPV positive without cervical lesion group; CIN, precancerous lesions with high-risk HPV group; CC, invasive cervical cancer group. Wilcoxon rank-sum test: *, q < 0.05; **, q < 0.01; ***, q < 0.00192.
Extended Data Fig. 9 Comparison of VMGC-90 and other vaginal gene catalogues.
(a) Source of the VIRGO-specific genes. (b) The overlap of proteins between the VMGC-90 and other vaginal microbial gene catalogues. Three non-redundant gene catalogues were constructed based on: 1) vaginal MAGs from the Pasolli et al.’s study [1] (left panel), and 2) a recently published Lactobacillus genomic catalogue that included 1,091 previously unreported isolate genomes, partial genomes, and metagenome-assembled genomes (MAGs) [2] (right panel). These gene catalogues were constructed using the same methodology and parameters as VMGC-90. (c) Mapping rate of the vaginal microbial gene catalogue across all investigated samples. (d) Boxplot showing the mapping rates of VIRGO and VMGC genes and genomes across all samples. Samples are grouped by their countries. In the boxplot, the center line represents the median, the box limits show the upper and lower quartiles, the whiskers extend to 1.5 times the interquartile range, and points outside the whiskers are considered outliers16,93.
Extended Data Fig. 10 Comparison of VMGC-90 genes across multiple kingdoms.
(a-b) The overlap of prokaryotic and viral genes in the VMGC-90. Venn plot showing the comparisons of prokaryotic genes and genes from completeness (a) and incomplete viruses (b). Red numbers show the percentages of viral genes covered by prokaryotic genes for each comparison. (c) The KEGG annotation of all microbial genes in the VMGC. Upper panels, proportions of the annotated genes; bottle panels, pie plot showing the proportions of pathways of the annotated genes at the KEGG level B.
Supplementary information
Supplementary Information
Supplementary Figs. 1 and 2.
Supplementary Tables
Supplementary Tables 1–9.
Supplementary Data 1
Source data for Supplementary Figs. 1 and 2.
Source data
Source Data Fig. 1
Statistical source data for Fig. 1.
Source Data Fig. 2
Statistical source data for Fig. 2.
Source Data Fig. 3
Statistical source data for Fig. 3.
Source Data Fig. 4
Statistical source data for Fig. 4.
Source Data Fig. 5
Statistical source data for Fig. 5.
Source Data Fig. 6
Statistical source data for Fig. 6.
Source Data Extended Data Fig. 1
Statistical source data for Extended Data Figs. 1–10.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Huang, L., Guo, R., Li, S. et al. A multi-kingdom collection of 33,804 reference genomes for the human vaginal microbiome. Nat Microbiol (2024). https://doi.org/10.1038/s41564-024-01751-5
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41564-024-01751-5
- Springer Nature Limited