Background

Rapid identification of pathogens in a particular environment is important for pathogen-risk management. Human pathogens are ubiquitous in the environment, and infections from particular environments have been reported worldwide. For example, soil-related infectious diseases are common [1, 2]. Legionella longbeachae infection has been reported in many cases, mainly due to potting mixes and composts [3]. Survival of enteric viruses and bacteria has also been detected in various water environments, including aquifers and lakes [4,5,6,7].

Examination of pathogens from infected individuals with a particular clinical syndrome has been a major achievement of modern medical microbiology [8]. Nevertheless, we still know little about the magnitude of the abundance and diversity of known common pathogens in various environments, which is very important for the development of appropriate precautions for individuals who come in contact with certain environmental substrates. This can be realized through metagenomic detection of pathogenic factors in a time-efficient and high-throughput manner using next-generation sequencing methods [8].

Metagenomic detection of pathogens can be accomplished through different schemes. Li et al. [9] examined the level and diversity of bacterial pathogens in sewage treatment plants using a 16S rRNA amplicon-based metagenomic procedure. Quantitative PCR has also been applied for monitoring specific pathogens in wastewater [10]. More studies have applied the whole-genome-assembly scheme to detect one or multiple dominant pathogens, most of which were for viral detection in clinical samples [11,12,13,14]. Although metagenomic-based whole-genome-assembly for bacterial pathogen detection can be conducted at the single species level [15], its computational requirements are high if it is in a high-throughput fashion. In 2014, Baldwin et al. [16] designed the PathoChip for screening pathogens in human tissues by targeting unique sequences of viral and prokaryotic genomes with multiple probes in a microarray. This approach can screen virtually all pathogen-enriched samples in a high-throughput manner.

Despite the aforementioned progress in metagenomic tools for pathogen detection, metagenomic screening for bacterial pathogens in environments such as soil, where microbial diversity is tremendous, is still challenging. This is mostly due to difficulty in assembling short reads generated by next-generation sequencing [8]. The whole-genome-assembly approach is efficient at identifying viromes, but not at dealing with bacterial pathogens from metagenomes especially when target pathogens are of low abundance. Amplicon-based approaches are able to detect bacterial pathogens in a high-throughput manner; however, it is well known that phenotypic diversity exists widely across and within microbial species of a genus because of divergent evolution [17, 18]. This also holds true for pathogenic factors [19]. Moreover, toxin factors, such as the Shiga toxin (stx) of Shigella, are primarily transferable through lateral gene transfer, which leads to the continuous evolution of pathogen species [20]. Therefore, it is necessary to examine the pathogen diversity in environmental metagenomes using essential virulence genes as biomarkers.

In this study, a toxin-centered virulence factors database was built, and the well-developed Local BLASTP method was applied to detect virulence factors in various environments. This procedure is metagenome-based and can be conducted in a high-throughput fashion, which greatly simplifies development of precautions for pathogen-enriched environments.

Methods and materials

Environments and their metagenomes

Sixty-nine metagenomes were selected and downloaded from the MG-RAST server (Table 1). These metagenomes were derived from ocean water, freshwater, wastewater, natural soil, deserts, and feces, representing the major environmental media found worldwide. Sequencing methods of the metagenomes include the illumina, Ion Torrent and 454 platforms, and predicted proteins in the metagenomes ranged from 33,743 (fresh water, ID mgm4720261) to 11,587,259 (grassland soil, ID mgm 4623645). The gene-calling results from the MG-RAST server were used for toxin factor screening in this study. The taxonomic composition at the genus level was also retrieved from the MG-RAST server for 27 representative metagenomes.

Table 1 General information regarding the metagenomes retrieved from the MG-RAST server

Toxin factor database

A toxin-centered database was established for bacterial pathogen detection in metagenomes in this study. Candidate toxin factors for pathogenic screening of environmental metagenomes were gathered based on well-studied pathogens summarized in the Virulence Factor Database [21], a soil borne pathogen report by Jeffery and van der Putten [2], and a manure pathogen report by the United States Water Environment Federation [22]. Sequences of the toxin factors were then retrieved by searching the UniProt database using the toxin plus pathogen names as an entry [23], while typical homologs at a cut-off E value of 10−6 were gathered from GenBank based on BLAST results. A protein database was then built for Local BLSATP study (Additional file 1). Considering that virulence process involves several essential factors including toxins, various pathogen-derived secretion proteins were also included in the database, and it was tested that whether secretion proteins were as specific as toxin proteins for pathogen detection. The disease relevance of all virulence factors was screened using the WikiGenes system [24] and relevant publications (Table 2).

Table 2 Typical virulence factors investigated in this study and their disease–relevance

Local BLASTP

The Local BLASTP was applied following the procedure used in our previous study [58, 59]. Basically, the gene-calling results of each metagenome were searched against the toxin factor database using BLASTP. The cut-off expectation E value was set as 10−6. The results of the Local BLASTP were then copied to an Excel worksheet, after which they were subjected to duplicate removal, quality control, and subtotaled according to database ID. Duplicate removal was based on the hypothesis that each sequence contains one copy of a specific toxin factor, since the gene-calling results were used. For quality control of the BLAST results, a cut-off value of 40% for identity and 20 aa [1/3 of the length of the shortest toxin factors (e.g., the Heat-Stable Enterotoxin C)] for query alignment length were used to filter the records. The toxins abundance matrix was formed for subsequent analyses.

Specificity tests of the Local BLASTP method

Sequences from the toxin database established in this study, as “known sequences” to the database, were selected randomly and searched against the database using the BLASTP procedure. The genome of Clostridium perfringens ATCC 13124 (NC_008261), as “unknown” sequences to the database, was subject to the Local BLASTX procedure as well. Homologous proteins were searched exhaustively in the GenBank database using BLASTP, with the representative toxin factors in the toxins database as a query. Sequences were retrieved and aligned using ClustalW [60], and Maximum-likelihood phylogeny was conducted with MEGA 7 [61].

Data analysis

The toxin frequency in each metagenome was normalized to a total gene frequency of 10,000,000 to eliminate the effects of gene pool size. Toxin abundance in the 69 metagenomes was visualized using Circos [62]. The genus abundance of 27 selected metagenomes representing the main environment types was calculated and sorted by genus name, followed by manual construction of a genus abundance matrix for subsequent biodiversity-toxin abundance Canonical Correspondence Analysis using R [63] with the package ‘vegan’ [64].

Results and discussion

In this study, a toxin-centered database was built for bacterial pathogen screening in various microbiomes through a Local BLASTP procedure. The specificity of the procedure was tested, the relative abundance of toxins in the microbiomes was examined, and the toxin-taxonomic abundance correspondence analysis was performed.

Like the previously established Local BLASTN method for antibiotic and metal resistance genes screening [58, 59, 65], the Local BLASTP method using the toxin-centered pathogen database in this study was successful at accurately identifying toxin proteins from the database. For screening of the Clostridium perfringens ATCC 13124 genome, the methods successfully detected the pore-forming genes and multiple copies of the glucosyltransferase (toxB-like) and ADP-ribosyltransferase (spvB-like) genes, based on the raw data. These results are consistent with the virulence genetic features of Clostridium sp. [21], which have not been well detailed in the GenBank annotation record. Such a cross-validation positively indicated that the Local BLASTP procedure established here is useful in predicting toxin genes in unknown genomes. Yet for a semi-quantitative method to estimate toxin factors in metagenomes, a false positive analysis is required to examine to what level mismatch is included in the Local BLASTP results. Actually, the cut-off values of identity greatly impact the homolog virulence factor abundance returned. At cut-off values of 40% for identity and 20 aa for alignment length, only four records for Clostridium perfringens ATCC 13124 genome query were returned after duplication removal, one for 1-phosphatidylinositol phosphodiesterase, one for pore-forming alveolysin, one for Ornithine carbamoyltransferase and one for RNA interferase NdoA. At a cut-off identity value of 35%, one more record (Toxin secretion ATP binding protein) was returned. This means that the Local BLASTP procedure was able to detect the virulence factors in unknown genomic dataset at least semi-quantitatively, with proper cut-off values for data quality control. The accuracy of the BLASTP procedure in virulence factor detection was further tested using the genomes of Bacillus thuringiensis serovar konkukian str. 97-27 (AE017355.1) and Helicobacter pylori 26695 (AE000511.1).

As mentioned above, functional genes including toxin factors may partly evolve through lateral gene transfer, which makes their taxonomic affiliation difficult. It is thus interesting to explore how specific toxin factors are associated with the taxonomic units of pathogens. Here, I explored this issue by investigating the taxonomic distribution of homologs of toxins retrieved from the GenBank database. Generally, at a lower expectation value, most toxins were associated with a specific group of pathogens. For example, at the default cut-off E value, 241 out of 242 returned records of Mycobacterium tuberculosis RelE homologs fell within the phylum Actinobacteria. Moreover, 89% of these homologs were from the genus Mycobacterium, while 99.7% of Yersinia pestis CdiA homologs and 92.7% of Bordetella pertussis cya homologs belonged to Proteobacteria, and homologs of Aeromonas dhakensis repeats-in toxin (RtxA) were mostly associated with the class Gammaproteobacteria (206 out of 242). However, no obvious genus-toxin association was identified. It is worth noting that these results largely depended on the availability of toxin sequences in each taxonomic unit. The lack of a genus-toxin association basically denied the possibility of detecting a specific pathogen using a specific toxin as a single signature.

It is still not clear whether virulence secretion proteins are specific for pathogen detection as signatures, though they are essential for virulence process [20]. For example, the contact-dependent toxin delivery protein CdiA was found to be widespread in bacteria [33]. The relative abundance of secretion proteins in the 69 microbiomes was determined as well as that of the toxins which are essential to virulence processes. The results of the present study showed that the abundance of secretion proteins selected in the database was strongly correlated with the toxin abundance (R2 = 0.74, P = 0.0068, Fig. 1). The most abundant secretion proteins included L. waltersii toxin secretion protein (LWT1SS), L. pneumophila toxin secretion protein ApxIB, and Aeromonas hydrophila RTX transporter (RtxB) (data not shown). Further exploration indicated that although A. hydrophila RtxB homologs from GenBank were found in all Proteobacteria classes, most of the RtxB-harboring species have been reported to be pathogens, including Vibrio spp. [64, 66], Pseudomonas spp., Neisseria meningitides [67], Ralstonia spp. [68], and Yersinia spp. [21]. This may imply the pathogen-specific nature of secretion proteins included in the database, and that toxin secretion proteins can be used as signatures for pathogen detection as well.

Fig. 1
figure 1

Correlation between relative abundance of toxins and secretion proteins in the global microbiomes (N = 69)

Toxin-phyla CCA results showed that all phyla can be clearly separated into two groups, and that almost all toxins were associated with Proteobacteria, Nitrospirae, and Firmicutes (Fig. 2). Considering the phylum-specificity of the toxins stated above, these results can be biased because of the taxonomic affiliation of toxins included in the Local BLASTP database. The taxonomic distribution proportion of currently available genomes of identified pathogens was reflected in the toxin database, with Proteobacteria and Firmicutes accounting for the majority of the genomes. However, the CCA results may also indicate, at least in part, a proportional lack of pathogens in some phyla, such as Crenarchaeota, Euryarchaeota, Verrucomicrobia, and Bacteroidetes [69]. Archaea cannot easily absorb phage particles because of their extracellular structures, which differ from bacteria [70]. A recent study by Li et al. [9] also found that the five most abundant bacterial pathogens were from either Proteobacteria or Firmicutes in wastewater microbiomes. Taken together, these findings could indicate that Proteobacteria or Firmicutes were evolutionarily enriched with pathogens when they dominated most environmental microbiomes on the planet [71, 72].

Fig. 2
figure 2

Canonical correspondence analysis of the associations between phyla and toxins from typical environments

Interestingly, there was a strong association between the phylum Nitrospirae and toxins of RNase inteferases (MvpA and MapC) and Listeria monocytogenes1-phosphatidylinositol phosphodiesterase PLC. Further searches against the UniProt database [73] revealed no homologous records of MvpA and PLC from Nitrospirae, and only 109 out of 15,574 bacterial records for VapC were from Nitrospirae. These findings imply that there may be many more Nitrospirae pathogens harboring MvpA and PLC that have yet to be discovered.

The screening of toxins in the 69 global microbiomes revealed the most prevalent toxins and pathogen-enriched environments. Specifically, the results showed that pore-forming RTX toxin and ornithine carbamoyltransferase ArgK were most prevalent globally in terms of both occurrence and relative abundance (Fig. 3). RTX toxins comprise a large family of pore-forming exotoxins. Known homologs in the GenBank database of Aeromonas dhakensis RtxA were mainly in the genera of Aeromonas, Pseudomonas (e.g., CP015992), Vibrio (e.g., CP002556), and Legionella (e.g., CP015953). These genera are well known to be associated with gastroenteritis, eye and wound infections, cholera and legionellosis, and RTX toxins are a key part of the virulence systems of each of these conditions [74,75,76,77]. The argK gene is a part of the Pht cluster, which contains genes for the synthesis of phaseolotoxin in Ps. syringae pv. phaseolicola [78]. ArgK plays an essential role in the survival and pathogenicity of Ps. syringae. Known ArgK proteins mainly come from Pseudomonas, Escherichia, and Mycobacterium, which are widespread and persistent in the environment [79]. In addition, Cya is worth noting as an essential unit of Bacillus anthracis virulence that causes anthrax and may lead to mammalian death [80]. Known homologs in the GenBank database of Bacillus anthracis Cya were mainly from Bacillus spp., Bordetella spp., Pseudomonas aeruginosa, Yersinia pseudotuberculosis, and Vibrio spp. Their presence in the environment should be carefully examined and precautions should be taken to prevent infection by these organisms since many of them are associated with very common diseases such as whooping cough.

Fig. 3
figure 3

Circular visualization of the toxin abundance in the microbiomes selected from locations worldwide. The designated environment abbreviation can be found in Table 1

The main purpose of the Local BLASTP method established here was to screen pathogen-enriched environments to enable development of precautionary measures. Our results clearly indicated that contaminated freshwater, feces, and harbor sediment microbiomes were rich in pathogens (Fig. 4). Although there was no detailed background information regarding these environments in this study, the results presented herein may provide important implications for pathogen-related risk control. Surprisingly, two lake water microbiomes from Nanjing, China contained the highest toxin factors among the 69 samples. Further investigation of the location and contamination status supported the sewage-nature of the lake water. In China, most polluted lakes receive sewage that includes feces materials [81]. According to an official survey conducted in 2015, Nanjing has 28 lakes with a total area of 14 km2, among which 96.4% are classified as polluted (Class V of the national standard). Studies have documented that pathogens tend to be enriched in polluted waters [13]. It is not surprising to find that feces samples had very high abundance of toxins. Epidemical statistics have indicated that feces are the most important pathway for diarrheal diseases, which is a leading cause of childhood death globally [82]. Meanwhile, dry soil environments like desert soil and desert tailings were found to contain relatively less toxin factors. It is still unclear to what extent the environments stressed by long-lasting drought or metal pollution suppress the colonization and development of pathogens [83]. In all, the association between environmental factors and pathogen abundance merits a systematic exploration in the future.

Fig. 4
figure 4

A Boxplot showing the relative abundance of toxins detected from the metagenomes in this study. Drysoil includes the desert soil and desert mine tailings

Conclusions

A Local BLASTP procedure was established for rapid detection of toxins in environmental samples. Screening of global microbiomes in this study provided a quantitative estimate of the most prevalent toxins and most pathogen-enriched environments.