Design and application of a novel two-amplicon approach for defining eukaryotic microbiota
Due to a lack of systematic diagnostics, our understanding of the diversity and role of eukaryotic microbiota in human health is limited. While studies have shown fungal communities to be significant modulators of human health, information on the prevalence of taxa such as protozoa and helminths has been limited to a small number of species for which targeted molecular diagnostics are available. To probe the diversity of eukaryotic microbes and their relationships with other members of the microbiota, we applied in silico and experimental approaches to design a novel two-amplicon surveillance tool, based on sequencing regions of ribosomal RNA genes and their internal transcribed spacers. We subsequently demonstrated the utility of our approach by characterizing the eukaryotic microbiota of 46 hospitalized Malawian children suffering from Severe Acute Malnutrition (SAM).
Through in silico analysis and validation on a diverse panel of eukaryotes, we identified 18S rRNA variable genetic regions 4 and 5 (18S V4 V5), together with a region encoding 28S rRNA variable genetic region 2 and the internal transcribed spacers (transITS), as optimal for the systematic classification of eukaryotes. Sequencing of these regions revealed protozoa in all stool samples from children with SAM and helminths in most, including several eukaryotes previously implicated in malnutrition and diarrheal disease. Clinical comparisons revealed no association between protozoan parasites and diarrhea or HIV reactivity. However, the presence of Blastocystis correlated with bacterial alpha diversity and increased abundance of specific taxa, including Sporobacter, Cellulosibacter, Oscillibacter, and Roseburia.
We suggest this novel two-amplicon based strategy will prove an effective tool to deliver new insights into the role of eukaryotic microbiota in health and disease.
KeywordsMicrobiome Eukaryotes Parasites Next-generation sequencing Amplicon sequencing Malnutrition
Severe Acute Malnutrition
Microbial communities in the human body are significant modulators of health and disease. The contributions of gut bacteria have been well documented; however, excluding fungi , little is known concerning the diversity and role of eukaryotic microbiota. Studies relying on targeted molecular diagnostics, such as PCR and ELISA or microscopy, have revealed the presence of several protozoa, including both commensal and parasitic species, in the gut of healthy individuals. For example, Blastocystis was present in stool from 56% of 105 Danish adults and all of the 93 samples obtained from Senegalese children [2, 3]. Furthermore, Dientamoeba fragilis has been found in 68% of children in a Danish daycare, while Giardia was identified in 8% of children attending Toronto daycares and 32% of children sampled in rural Ecuador [4, 5, 6]. Although typically considered asymptomatic, Blastocystis and Dientamoeba have been linked with inflammatory bowel disease and irritable bowel syndrome [7, 8, 9, 10], conditions that promote gut bacterial dysbiosis [11, 12]. Indeed, it is becoming apparent that bacteria and eukaryotic microbiota feature complex relationships both with each other and the host immune system; specific gut bacteria have been shown to confer protective immunity to the host against transmission of Plasmodium and Toxoplasma gondii, the etiological agents of malaria and toxoplasmosis respectively [13, 14, 15, 16]. Conversely, induction of the host inflammasome by the protozoan Tritrichomonas musculis reduced Salmonella infection in mice, but exacerbated T cell-driven colitis .
Though more sensitive than microscopy or culture-based diagnosis [6, 18, 19, 20], molecular diagnostics are only available for a limited number of pathogens and provide no information on the diversity of eukaryotes in a defined environment. In attempts to better characterize the contribution of microbial eukaryotes to health, marker gene studies, analogous to bacterial 16S rRNA gene surveys, have been proposed. In these approaches, variable genetic regions are targeted for PCR amplification using universal DNA primers which bind to all organisms; sequencing of the resultant amplicons subsequently yields a readout of taxa present [21, 22, 23, 24]. In one such study, amplicons generated from the 18S rRNA gene revealed protozoa in stool from all 23 Malawian individuals sampled . However, such PCR-based approaches are not without challenges. The enormous sequence diversity exhibited by eukaryotes can result in established primers failing to amplify regions from entire phyla [22, 24]. Conversely, high sequence conservation in the region being targeted can limit resolution at the level of species or genus.
To address these limitations, we present a two-amplicon approach to generate a robust profile of eukaryotic microbiota. We perform a systematic in silico analysis to select two candidate amplicons suitable for classification of eukaryotic microbes: (1) the 18S rRNA variable genetic regions 4 and 5 (V4 V5) and (2) a region encompassing the internal transcribed spacers (ITS), together with the 18S rRNA gene V9 and the 28S rRNA gene V1 V2, collectively referred to as transITS. We design and validate universal DNA amplicon primers for these two regions using a diverse panel of eukaryotes. We subsequently demonstrate the effectiveness of our approach to survey eukaryotic microbiota in a study of stool samples from 46 Malawian children hospitalized for Severe Acute Malnutrition (SAM) [26, 27], a cohort with a high expected prevalence of intestinal parasites. We show that in addition to fungi, protozoa and helminths are prevalent in this cohort. We identify associations between protozoan Blastocystis and bacterial taxa, but find no associations between gut parasites and clinical features, including HIV reactivity and diarrhea.
Amplicon identification and primer design
In silico analyses were performed to evaluate the ability of each region to correctly classify taxa. For each variable region, 100 copies of each representative sequence were generated with a 1% mutation rate introduced to simulate sequencing error. Based on taxonomies of top hits in a sequence comparison with genes from SILVA, 18S rRNA gene V4 and 28S rRNA gene V2 achieved the highest accuracies in identifying protozoan (91.7%, 92.9%) and helminth (96%, 98.5%) species (Fig. 1b; Additional file 1: Table S1). Extending 18S rRNA gene amplicons to include two variable regions increased accuracy to 94.9% and 97.5% respectively, in the V4 V5 region. Identification of fungal species, by comparison, achieved a maximum accuracy of 83%, highlighting the reliance on ITS1 and ITS2 sequences for fungal studies. Based on these results, we selected 18S rRNA gene V4 V5 and 28S rRNA gene V2 for DNA primer design, as well as the Earth Microbiome Project (EMP) recommended 18S rRNA gene V9 .
Focusing on the 18S rRNA gene V9, no single primer outperformed others across all clades. While 1391f is predicted to bind more Excavata, Amoebozoa, and Alveolata genes, V9-9_F performed better for Stramenopiles, Rhizaria, Nematoda, and Platyhelminthes, with an additional advantage of low predicted binding to bacterial rRNA genes, reducing contamination. The accuracy of predicted reverse primer binding was likely compromised by incomplete reference genes. For the 28S rRNA gene V2 region, we found that lack of conservation at the 5′ end of this region resulted in the loss of 80% of genes from Excavata. Instead, we designed a primer to bind within V2 (28S V2_F). Primers V2_F and V2-rev6_R are predicted to bind up to 50% more Excavata and 6% more Alveolata sequences, at a 15% reduction in Rhizaria, compared to published primers (Fig. 2b). The resulting truncated amplicon (V2 t) is ~ 50% shorter (Fig. 1a), however, limiting taxonomic assignment.
Given the limitations of the reverse 18S rRNA gene V9 and forward 28S rRNA gene V2 primers, we explored the possibility of generating a larger amplicon (“transITS”) using the 18S V9-9_F and 28S V2-rev6_R primers. The transITS benefits from sequence variability of the entire 28S rRNA gene V2 and ITS previously used for strain typing. Amplification of ITS regions alone is not feasible for protozoa due to high variability in 5.8S and 5′ of 28S rRNA genes. Given ITS hypervariability and paucity of protozoan reference sequences, we reasoned that 18S rRNA gene V9 and 28S rRNA gene V2 could provide high-level taxonomic information, and where available, ITS regions could resolve species or strains.
Based on our in silico analyses, we validated the following primer pairs on a diverse panel of eukaryotes that include nine protozoa, three fungi, two helminths, and cultured human cells: 515f and 1119r, V4-1_F and V4-4_R, and 18S V9-9_F and 28S V2-rev6_R. For the 18S rRNA gene V4 V5 region, our V4-1_ F and V4-4_R primers were successful across the entire panel with the exception of the fungus A. fumigatus, which produced only a weak PCR product. On the other hand, the 515f and 1119r primer pair showed poor amplification of Blastocystis, Entamoeba, and Giardia (Fig. 2c). As for our 18S V4 V5 primer set, transITS amplicons were generated across the panel with the exception of A. fumigatus (Fig. 2c). As this amplicon can reach lengths of several thousand bases in select taxa such as Sarcocystis, we employed a processive high fidelity polymerase and 3-min extension times to maximize amplification (e.g., as shown with the successful recovery of the ~ 4000 bp human transITS region).
18S rRNA gene V4 V5 amplicon surveys reveal eukaryotic microbiota in children with SAM
Characteristics of patients hospitalized for severe acute malnutrition
n = 44
n = 27
n = 17
Female, n (%)1
Diarrhea, n (%)2
HIV reactive, n (%)3
In an attempt to verify taxonomic assignments based on 200 nt of the high-quality amplicon forward reads, we incorporated > 100 nt of reverse read information and used a binning and assembly approach to undertake more robust phylogenetic analyses for three major protozoan clades (Alveolates, Trichomonads, and Blastocystis; Fig. 3c). In brief, read pairs assigned to each clade were binned into discrete genera. Each genus bin was then assembled resulting in a total of 336 contigs. After merging contigs sharing ≥ 99% sequence identity, we recovered 27 contigs generated from at least 100 read pairs (3, 11, and 13 for Alveolates, Trichomonads, and Blastocystis respectively). To expand representatives in each clade, we additionally included four contigs associated with Gregarina, Tetratrichomonas, and Blastocystis sp. ST6 and unknown ST (generated from 32, 26, 56, and 23 read pairs respectively). Together with sequences from representative organisms, these contigs were used to construct three clade-specific phylogenies. Both Trichomonad and Apicomplexan phylogenies were consistent with amplicon assignments. The position of Eimeriorina contig 1 suggests these amplicons derive from a member of the Toxoplasmatinae, members of which are challenging to resolve at this locus. Contigs assembled from amplicons previously classified as “unknown Trichomonad” appear most closely related to Pentatrichomonas, while amplicon sequences from the Tetratrichomonas contig appear to have been misassigned due to close sequence similarity with Trichomonas in the 200 nt classified region. The majority of Blastocystis amplicon sequences were classified to the correct subtype, with the exception of ST1 contigs 5, 7, and 8, collectively representing 762 amplicon sequences. With additional sequence information, we identified amplicon sequences classified as “unknown subtype” as ST3.
We note that while the Luminex diagnostic identified Giardia and Entamoeba histolytica in 13 and 1 samples, respectively, both taxa were missing from our amplicon analyses. We did detect Giardia and a close relative, Entamoeba dispar, during primer validation tests (Fig. 2c). It may be that the abundance of E. histolytica is below the sensitivity of our system. Primer mismatches to Giardia rRNA genes may limit amplification in the presence of competing sequences, a limitation also noted for 515f and 1119r primers .
TransITS amplicon surveys complement findings from the 18S rRNA gene V4 V5 amplicon
TransITS amplicons have lengths up to several thousand bases, and include segments of the 18S and 28S rRNA genes, the 5.8S rRNA gene, and ITS regions. For Illumina sequencing, these amplicons were randomly fragmented to median 300 bp using the Nextera DNA library kit. Sequencing generated 44,750 to 192,224 quality-filtered reads per sample (see Additional file 1: Table S3). Reads (≥ 100 nt) were classified by sequence similarity to a custom reference database using the BWA-MEM aligner , followed by a BLAST search of unclassified sequences against the NCBI nucleotide database, using ≤ 3% nucleotide mismatch and ≥ 70% sequence coverage. We identified eukaryotes, bacteria, a small number of archaeal sequences, and over 1.1 million unclassified sequences out of 5.3 million (0.1 to 54% per sample). Among eukaryotes, we found eight protozoan genera, a microsporidian, six helminths, and six phyla of fungi (Fig. 3d; Additional file 1: Table S4). Pentatrichomonas hominis and Entamoeba coli were detected in the same patients as found by 18S rRNA gene V4 V5 data, and Blastocystis was found in 12 patients, with similar patterns of abundance. Trichomonas and Tritrichomonas were detected at low abundance in three and two samples, respectively. While we did not detect Cryptosporidium using this amplicon, interestingly, we identified microsporidian parasite Enterocytozoon bieneusi in 12 samples. Helminth sequences were classified primarily as Thysanotaenia congolensis isolate CY7 in one sample, for which no 18S rRNA gene V4 V5 data was generated. The most abundant fungal genera were Candida, Nakaseomyces, Trichosporon, Kazachstania, Pichia, Clavispora, and Saccharomyces.
We evaluated the similarity between the profiles of eukaryotic microbiota generated by the 18S rRNA gene V4 V5 and transITS amplicons through taxon co-occurrence using Hamming distance comparisons and Bray-Curtis dissimilarity. We find a higher co-occurrence of genera and phyla in pairs of amplicon datasets from the same patient compared to between patients (p values 8E−6 and 0.006) (Fig. 3e). Bray-Curtis analysis including taxa detected by both amplicon approaches found significantly higher compositional similarity between 18S rRNA gene V4 V5 and transITS profiles from the same patients (p value 1.57E−05) (Fig. 3f). Indeed, the two amplicons share six of the top 10 abundant genera. However, no significant associations were found when all taxa were included, consistent with previous studies that compared multiple amplicons [34, 35]. Using our two-amplicon approach, we were able to detect an additional 39 genera to those identified by 18S rRNA gene sequencing alone, while the 22 genera captured by both amplicons represent a high-confidence set of protozoa, helminths, and fungi identified in Malawian SAM patients (Fig. 3g). This demonstrates the potential to sample a broader range of taxa, circumventing amplicon-specific biases, and to cross-validate organisms identified by two sequencing experiments.
As the transITS amplicons are fragmented for sequencing, we might expect a uniform sampling of reads from across the entire locus. To explore this, amplicon sequences were mapped to a representative reference sequence from the assigned species. Where no reference sequence was available, amplicon sequences were binned on the basis of species assignment and each bin assembled to form contigs. Contig consensus sequences were subsequently used as a reference. Interestingly, rather than displaying uniform coverage across the entire length of the amplicon, we find that amplicon sequences assigned to Enterocytozoon, Blastocystis, and Pentatrichomonas exhibit uneven coverage (Fig. 4b). For example, for Blastocystis, this is evident in the ITS2 region and 5′ of 28S rRNA gene. Other taxa, such as Entamoeba and Taenia exhibit limited representation across the locus. Entamoeba reads, for example, map only to the V9 region of the 18S rRNA gene, reflecting the low read coverage of Entamoeba sequences observed in Fig. 4a and lack of reference information for Entamoeba coli ITS and the 28S rRNA gene. Similarly, reference information is only available for the 28S rRNA gene for Thysanotaenia. Reads assigned with low scores to Taenia ITS and the 5.8S rRNA gene may belong to Thysanotaenia (Fig. 4b), as these were identified in the same sample. As genomes and sequences from clinical isolates are deposited into public databases, we predict that we will be able to classify more sequences with greater confidence.
Finally, we explored over 3000 unclassified reads with weak similarity to Parabasalids (Fig. 4a). Assembly of reads generated four contigs (> 10 reads), with the largest (995 nt long, containing 2103 reads) sharing 77% sequence identity over 62% of its length with the Trichomonas vaginalis 28S rRNA gene AF202181. Reads from this contig largely derive from two samples (1169 and 919 reads) in which our data from the 18S rRNA gene V4 V5 indicate the presence of an uncharacterized trichomonad. This likely suggests that the transITS reads derive from the same taxon; however, lack of DNA precludes further experimental validation.
Blastocystis colonization is associated with changes in bacterial microbiota
To explore potential relationships between gut eukaryotes and bacteria, we characterized the bacterial communities through amplicon sequencing of the 16S rRNA gene. Sequencing generated 26,887 to 179,945 quality filtered reads per sample (Additional file 1: Table S3) and were dominated by Bacteroidetes, Firmicutes, and Proteobacteria (Fig. 3h). Proteobacteria represented > 20% of sequences in 26 out of 46 patients and > 80% in 6 patients. Of these, the most highly represented were Escherichia-Shigella, Enterobacter, Pantoea, and Klebsiella, genera known to include pathogenic members.
Differential abundance analysis of bacterial OTUs using ALDEx2 revealed that only Blastocystis colonization was correlated with significant changes in 15 bacterial taxa confirming many co-occurrence associations, including the two Proteobacteria, Bilophila, and Vampirovibrio (Fig. 5b; Additional file 2: Figure S2; Additional file 1: Table S6). The most significant changes were associated with the Firmicutes Oscillibacter, Sporobacter, Cellulosibacter, and Roseburia. Notably, of the two butyrate producers Faecalibacterium and Roseburia previously correlated with Blastocystis , we were only able to confirm a significant association with Roseburia. sPLS-DA and relative abundance analyses of bacterial composition further corroborate our findings and show separation of patient samples based on Blastocystis colonization (Fig. 5 c, d). The first sPLS-DA component, driven by changes in Sporobacter, Cellulosibacter, Oscillibacter, Eubacterium, and Lutispora abundance, explained 20% of the variance between samples, with an estimated 9% classification error rate based on tenfold cross-validation. Bacterial richness and diversity analyses revealed that only Blastocystis colonization correlated with changes in bacterial population structure (p < 0.05 or p < 0.005; Fig. 5e; Additional file 1: Table S7). Whether Blastocystis modulates the growth and colonization of bacteria or is itself dependent on the presence of specific bacterial taxa requires further investigation.
No evidence of association between parasite carriage, HIV reactivity, and diarrhea
Finally, we explored the relationships between clinical features of patients with SAM and the eukaryotic organisms Blastocystis, Cryptosporidium, Giardia, and Enterocytozoon. In our cohort, these eukaryotes showed no significant association with age, HIV reactivity, edema status, weight for height Z-scores (a measure of wasting), or with diarrhea (Additional file 1: Table S8). Similar results were obtained when including as appropriate age, sex, and HIV as cofounders in the models.
Despite the health and economic burden of a handful of intestinal protozoan and helminthic parasites worldwide, the diversity and role of these taxa in the microbiome is largely unexplored. Targeted surveys have revealed high carriage rates of a number of parasites in asymptomatic individuals, and studies are beginning to uncover key interactions between these organisms and gut bacteria [13, 37, 38, 39, 40, 41, 42]. For example, treatment of mice with Lactobacillus and Bifidobacterium decrease Plasmodium burden , while gut colonization by E. coli 086:B7 and E. coli CFT073 protect against malaria transmission and Cryptosporidium infection, respectively [13, 41]. Helminth infections have been suggested to promote growth of protective Clostridiales, inhibiting colonization by an inflammatory Bacteroides species . To better study these relationships, a broad surveillance diagnostic is needed. In an attempt to overcome limitations such as primer bias and poor taxonomic resolution associated with existing biomarkers in the 18S rRNA gene, we designed a two-amplicon approach, based on 18S and 28S rRNA genetic regions and internal transcribed spacers. By combining two regions, we report a high confidence set of taxa identified by two independent diagnostics and the potential to capture greater protozoan and helminth diversity in a pilot cohort of SAM patients, as well as uncover interactions with gut bacteria.
Our identification of amplicon V4 V5 in the 18S rRNA gene builds on CBOL Protist Working Group recommendations for the use of the 18S rRNA V4 genetic region for broad eukaryotic surveys  and previous ecological surveys [44, 45, 46]. Previous studies have shown that sequencing two amplicon regions in this gene provides complementary information on eukaryotic microbes [45, 47]. Here, we extend this approach by combining data from the 18S rRNA gene with that from the ITS regions and the 28S rRNA gene in our transITS amplicon. The ITS region is currently used for strain typing and fungal classification, while the 28S rRNA gene has been found as more accurate for the classification of key groups of protozoa and helminths [43, 48, 49, 50]. While primer bias is an unavoidable limitation of amplicon studies, by sequencing two regions using different primer pairs, we identified 39 additional taxa compared to the use of the 18S rRNA V4 V5 genetic region alone. Taxa identified by both amplicons represent a high confidence set of eukaryotes and provide method validation. Low abundance organisms identified by a single amplicon may require downstream corroboration to rule out false positives. The high proportion of unclassified transITS sequences or reads with weak similarity to Amoebozoa, Parabasilids, and Platyhelminthes suggests eukaryotic diversity that remains undocumented in public databases. Application of this amplicon to further clinical and environmental samples could reveal additional diversity and provide phylogenetic resolution not possible with 18S rRNA gene studies; however, the current need for transITS fragmentation significantly curbs its potential. Recent improvements in error rates associated with long-read sequencing technologies such as those developed by Oxford Nanopore and Pacific Biosciences offer much promise in this regard. Alternative molecular tagging approaches could similarly improve the recovery of full-length amplicons and reduce spurious contig assemblies.
Severe Acute Malnutrition (SAM), characterized by wasting or nutritional edema, affects an estimated 20 million children under the age of five each year, killing up to 35% even when hospitalized [51, 52]. While previous studies have demonstrated a causal relationship between gut bacteria and SAM [53, 54], the impact of eukaryotic microbes is less well established. Given that malnutrition is a common symptom of intestinal infections by microbial eukaryotes such as Entamoeba histolytica, Giardia lamblia, and Cryptosporidium parvum [26, 55, 56, 57], understanding the potential contribution of eukaryotes to SAM relies on the availability of a diagnostic capable of surveying a broad diversity of eukaryotic microbes. Using a two-amplicon approach, we show a high prevalence of protozoa in a Malawian cohort of SAM patients, with two or more protozoa in 84% of samples. Our findings of colonization by multiple Blastocystis subtypes are in agreement with previous surveys of healthy Malawian children , as well as individuals from both low-income countries and Western populations [2, 3]. Trichomonads and multiple Entamoeba species have previously been identified in healthy Malawian children . Here, we identified Entamoeba coli and multiple trichomonads in our malnourished cohort. The occurrence of trichomonads in the context of malnutrition warrants attention as Tritrichomonas spp. have been reported as rare opportunistic pathogens in humans [58, 59, 60, 61] and shown to significantly alter the immune profile and exacerbate colitis in mice [17, 62]. Indeed, the commensal Pentatrichomonas hominis was previously suspected of causing gastrointestinal symptoms in two patients . The prevalence of Cryptosporidium and Enterocytozoon bieneusi in the cohort is clinically relevant, as these taxa are commonly isolated from children with SAM exhibiting persistent diarrhea, and E. bieneusi has been associated with chronic diarrhea and intestinal malabsorption in HIV-positive adults [64, 65, 66, 67, 68, 69, 70, 71]. Detection of nematode and cestode sequences is also pertinent as intestinal infections by a handful of species are widely implicated in malnutrition and stunting .
We noted a positive association between Blastocystis and complex carbohydrate metabolizing bacteria Ruminococcus, Eubacterium, and Cellulosibacter, as well as the butyrate producer, Roseburia—taxa typically associated with gut health [36, 54, 73, 74, 75]. Future association studies with larger cohorts may reveal whether these interactions are Blastocystis subtype-specific. Although Giardia and Cryptosporidium-induced bacterial dysbiosis have been reported in mice [76, 77], we found no significant associations in our cohort. We note that possible exposure of children to antibiotics prior to hospitalization may confound our analysis. As stool was collected shortly after admission, we therefore expect that effects of antibiotic treatment at the hospital had minimal impact on our findings.
Nearly half of the children in this cohort were HIV reactive. Previous studies have suggested that Blastocystis and Cryptosporidium infections are common causes of persistent diarrhea in HIV-positive children . However, we found no evidence of an association between parasite carriage, HIV reactivity, and diarrhea, albeit in a relatively limited sample set, and the absence of conclusive diagnosis of infection.
While serving to illustrate the utility of our approach, the small sample size used in this study together with the complexity of the SAM phenotype and lack of samples from healthy subjects preclude us from drawing specific conclusions on the contribution of eukaryotic taxa to clinical symptoms. Applying this methodology to future cross-sectional and longitudinal case-control studies is expected to reveal clinically relevant relationships between gut eukaryotes and bacteria and help distinguish resident eukaryotes from acute infections.
During the development of our two-amplicon approach, we noted a number of limitations. First, the length of the V4 V5 genetic regions in the 18S rRNA gene exceeds current read lengths generated by Illumina sequencing platforms, resulting in incomplete sequences for many taxa, a limitation we expect to be mitigated by improvements in sequencing technology. Second, the lack of comprehensive sequence resources for rRNA genes and ITS restricts our ability to capture the full spectrum of eukaryotic diversity. Third, the recovery of human amplicons in our analyses suggests that host contamination may restrict a more comprehensive sampling of eukaryotic microbiota. Strategies such as human DNA blocking primers are needed to reduce host DNA signal. Finally, our inability to detect Giardia, which was detected by the Luminex pathogen panel, may be the result of primer mismatches or low abundance. Despite these limitations, we believe this two-amplicon approach provides an effective strategy for characterizing the eukaryotic microbiota of environmental samples, including those relevant to human health.
To generate a diverse and robust profile of eukaryotic microbiota in the human gut, we developed a strategy based on sequencing two independent taxonomically informative genetic regions. The combined sequence information allowed us to uncover protozoa, microsporidia, helminths, and fungi, including two organisms previously associated with malnutrition, and putative relationships between the eukaryote Blastocystis and bacteria. We expect future applications of this method to clinical and environmental samples will begin to uncover the full diversity of eukaryotic microbes, together with their interactions with other microbiota, as well as their role in health and disease.
Evaluation of rRNA variable genetic regions, primer design, and validation
Sequence variability was calculated as the Shannon entropy at each position of aligned protozoan 18S (n = 19,427) and 28S (n = 1061) rRNA genes obtained from the SILVA v128 database . Variable region boundaries were identified through manual inspection. To evaluate the potential for taxonomic classification, 100 copies of each variable region (≥ 30 nt) were mutated with a 1% frequency in silico and compared to reference sequences using the Burrows-Wheeler Aligner MEM (BWA-MEM) algorithm . Primers were designed by manual inspection of conserved stretches in protozoan and helminth kingdoms. In silico primer binding predictions were generated using the SILVA TestProbe 3.0 tool , allowing 1 weighted mismatch. Experimental primer evaluation was carried out using PCR conditions described below on gDNA isolated from a panel of cultured protozoa, fungi, helminths, and a human foreskin fibroblast cell line (ATCC SCRC-1041).
DNA extraction and sequencing
Details of clinical sample collection are provided in Additional file 2: Supplementary Methods; 100–150 mg of formed or 100 μL of loose stool was subjected to 5 min of bead beating and DNA was extracted using easyMAG (bioMerieux, Marcie-l’Etoile, France). Samples were screened for stool pathogens using the xTAG Gastrointestinal Pathogen Panel (Luminex, Austin, TX). Amplicons were generated in triplicate using the iProof polymerase (Bio-Rad Cat. #1725301, Hercules, CA) as follows: 94 °C 45 s, 56 °C 45 s, and 72 °C 1 min (18S rRNA gene V4 V5, 30 cycles, primers V4-1 forward 5′-GCGGTAATTCCAGCTC-3′ and V4-4 reverse 5′-GCCMTTCCGTCAATTCC-3′) or 3 min (transITS, 35 cycles, primers V9-9 5′-CCTGCCMTTTGTACACACC-3′ and V2-rev6 5′-GGNGCGAAAGACTMATCG-3′). Samples from 46 children produced amplicons and were used in this study. TransITS amplicons were fragmented using the Nextera kit (Illumina, San Diego, CA), and both libraries were barcoded and sequenced with MiSeq V3 (300 bp × 2) (Illumina, San Diego, CA). 16S rRNA gene V4 amplicons were generated using 515f and 806r primers, and libraries were sequenced using MiSeq V2 (150 bp × 2) (Illumina, San Diego, CA).
Sequence processing and classification
Paired amplicon reads were quality filtered using Trimmomatic-0.36 , transITS and 16S rRNA gene V4 reads were merged using VSEARCH v1.11.1 , and all were dereplicated. In the 18S rRNA gene V4 V5 amplicon, due to low percentage of read overlap, 200 nt of high-quality forward reads were taxonomically classified using SINA v1.2.11 , with minimum 90% sequence identity and 1 neighbor. To confirm taxonomies, reverse reads (≥ 100 nt in length) were included in subsequent phylogenetic analyses. Paired reads from select genera were assembled into contigs with maximum 1% mismatch using the Geneious v9.02 de novo assembler . Contigs composed of ≥ 100 reads (or the highest number of reads in the low abundance taxa Gregarina, Tetratrichomonas, and Blastocystis sp. ST6 and unknown ST) were aligned with clade representatives using MUSCLE (50 iterations), and optimal gap trimming was performed using trimAl v1.2 [83, 84]. Bayesian phylogenies were constructed using MrBayes , with the HKY85 substitution model, 1 million chain length, four heated chains, a burn-in of 250,000, and four gamma categories. Trees were visualized using FigTree v1.4.3 , and branches with posterior probability values ≤ 0.7 were collapsed.
TransITS reads were classified using BWA-MEM  based on the closest sequence match (≥ 97% sequence identity, ≥ 70% coverage) in a custom database. The database was built using 18S and 28S rRNA genes obtained from SILVA, UNITE, and RFAM databases [29, 33, 87, 88], as well as protozoan and helminth rRNA genes captured from the NCBI nucleotide database using the search terms “ITS1,” “ITS2,” and “28S.” Reference sequences were trimmed with cutadapt v1.12 , using amplicon primer sequences. Positional analysis of reads was performed by mapping merged reads to known reference sequences, or where no references were available, to contig consensus sequences generated through de novo sequence assembly, using Geneious v.9.02.
Unclassified amplicon sequences represented by ≥ 500 reads were submitted for a BLAST  search of the NCBI nucleotide database (downloaded Dec 1, 2017) and assigned the taxonomy of the top hit (≥ 98% sequence identity and coverage). Blastocystis subtypes were identified by ≥ 98% sequence identity to their respective genomes; equivalent scores to more than one subtype were assigned “Blastocystis unknown ST.” All reference gene and genome accessions are provided in Additional file 1: Table S5.
16S rRNA gene amplicons were clustered de novo to 97% using the VSEARCH cluster_fast algorithm, and taxonomies were assigned using the RDP classifier (release 11, training set 16) .
Analysis of microbiota composition
Microbiota composition and alpha diversity analyses were carried out in R 3.4 using Phyloseq 1.20.0  and ggplot2 2.2.1  packages. Similarity of eukaryotic 18S rRNA gene V4 V5 and transITS profiles were evaluated using taxon co-occurrence and Bray-Curtis compositional dissimilarity. Distances between samples, based on genus co-occurrence, were calculated using Hamming distances in the e1071 1.6-8 package  as follows. OTU tables were transformed to binary matrices, using a range of 5–100 minimum read thresholds to indicate the presence of taxa, and numbers of co-occurring taxa were scored. Significance was tested between the distribution of Hamming distances scores of profiles of the same patients compared to different patients using the two-sample Kolmogorov-Smirnov test. Bray-Curtis distances were calculated using the composition of genera identified in at least 20% of 18S rRNA gene V4 V5 and transITS datasets. Significance was tested for within-patient and between-patient comparisons using the Wilcoxon rank sum test.
Putative associations between eukaryotic and bacterial taxa were identified using taxon co-occurrence, differential abundance analysis, and sparse partial least squares discriminant analysis. Co-occurrence of bacterial and eukaryotic taxa in patient samples was calculated using Hamming distances as above, using a minimum of five reads to designate occurrence. Taxa occurring in < 20% or > 80% of samples were removed to reduce spurious associations. Distances were clustered and plotted using the hclust and heatmap.2 functions in the gplots 3.0.1 R package . Bacterial differential abundance analysis was carried out with ALDEx2 v1.10.0 , using centre log-transformed data, prefiltered for bacterial taxa present at ≥ 2 reads in ≥ 2 patients. Abundance was compared in samples positive and negative for eukaryotes with enough variance for analysis (Blastocystis, Cryptosporidium, Giardia, and Enterocytozoon) or HIV reactivity, and p values were calculated using the Wilcoxon rank sum test with Benjamini-Hochberg correction. Tests were carried out using minimums of 2, 5, 50, and 100 reads in patient samples to designate the presence of eukaryote, and p values < 0.05 in at least two tested conditions were deemed significant. Sparse partial least squares discriminant analysis (sPLS-DA) was used to evaluate relationships between patient samples positive and negative for Blastocystis, based on bacterial composition. Data were prefiltered for bacterial genera present at ≥ 5 reads in ≥ 10% of patients, and abundance was normalized using cumulative sum scaling in the MixOmics 6.2.0  and MetagenomeSeq  packages. The model was evaluated through tenfold cross-validation, and classification error rates averaged over 50 repetitions. Ordered multinomial logistic regression was used to explore relationships between clinical features and the four eukaryotic organisms, binned into undetected, low, medium, and high detection.
We thank Leah Cowen, Brent Derry and Bret Pearson for providing helminth and fungal DNA.
Funding was provided by the Center for Global Child Health, SickKids Research Institute & Natural Sciences and Engineering Research Council of Canada to J.P. (DG-06664). AP is supported by a Restracomp scholarship administered by the Research Training Centre (Hospital for Sick Children).
Availability of data and materials
The datasets generated during the current study are available through the NCBI Sequence Read Archive (BioProject PRJNA483794). Code is available in the GitHub repository https://github.com/ParkinsonLab/Eukaryotic_microbiome_analysis.
MEG, RHJB, and JP conceived the study. AP performed the experiments. AP and CB carried out the data analysis. PWW and DSG contributed to the NGS sequencing design. WV and RHJB provided the clinical samples. AP, CB, and JP wrote the manuscript. All authors reviewed and approved the manuscript.
Ethics approval and consent to participate
Study approval was obtained from the Research Ethics Board at The Hospital for Sick Children (Toronto, Canada) and the College of Medicine Research Ethics Committee at the University of Malawi (Blantyre, Malawi). Samples were collected with informed consent by study participants.
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
- 6.Mejia R, Vicuna Y, Broncano N, Sandoval C, Vaca M, Chico M, Cooper PJ, Nutman TB. A novel, multi-parallel, real-time polymerase chain reaction approach for eight gastrointestinal parasites provides improved diagnostic capabilities to resource-limited at-risk populations. Am J Trop Med Hyg. 2013;88(6):1041–7.PubMedPubMedCentralCrossRefGoogle Scholar
- 14.Raetz M, Hwang SH, Wilhelm CL, Kirkland D, Benson A, Sturge CR, Mirpuri J, Vaishnava S, Hou B, Defranco AL, et al. Parasite-induced TH1 cells and intestinal dysbiosis cooperate in IFN-gamma-dependent elimination of Paneth cells. Nat Immunol. 2013;14(2):136–42.PubMedCrossRefPubMedCentralGoogle Scholar
- 19.Bruijnesteijn van Coppenraet LE, Wallinga JA, Ruijs GJ, Bruins MJ, Verweij JJ. Parasitological diagnosis combining an internally controlled real-time PCR assay for the detection of four protozoa in stool samples with a testing algorithm for microscopy. Clin Microbiol Infect. 2009;15(9):869–74.PubMedCrossRefPubMedCentralGoogle Scholar
- 20.Mero S, Kirveskari J, Antikainen J, Ursing J, Rombo L, Kofoed PE, Kantele A. Multiplex PCR detection of Cryptosporidium sp, Giardia lamblia and Entamoeba histolytica directly from dried stool samples from Guinea-Bissauan children with diarrhoea. Infect Dis (Lond). 2017;49(9):655–63.CrossRefGoogle Scholar
- 25.Parfrey LW, Walters WA, Lauber CL, Clemente JC, Berg-Lyons D, Teiling C, Kodira C, Mohiuddin M, Brunelle J, Driscoll M, et al. Communities of microbial eukaryotes in the mammalian gut within the context of environmental eukaryotic diversity. Front Microbiol. 2014;5:298.PubMedPubMedCentralCrossRefGoogle Scholar
- 26.Attia S, Versloot CJ, Voskuijl W, van Vliet SJ, Di Giovanni V, Zhang L, Richardson S, Bourdon C, Netea MG, Berkley JA, et al. Mortality in children with complicated severe acute malnutrition is related to intestinal and systemic inflammation: an observational cohort study. Am J Clin Nutr. 2016;104(5):1441–9.PubMedPubMedCentralCrossRefGoogle Scholar
- 27.Versloot CJ, Voskuijl W, van Vliet SJ, van den Heuvel M, Carter JC, Phiri A, Kerac M, Heikens GT, van Rheenen PF, Bandsma RHJ. Effectiveness of three commonly used transition phase diets in the inpatient management of children with severe acute malnutrition: a pilot randomized controlled trial in Malawi. BMC Pediatr. 2017;17(1):112.PubMedPubMedCentralCrossRefGoogle Scholar
- 33.Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM: arXiv:13033997v1 [q-bioGN]; 2013.Google Scholar
- 40.Heimesaat MM, Bereswill S, Fischer A, Fuchs D, Struck D, Niebergall J, Jahn HK, Dunay IR, Moter A, Gescher DM, et al. Gram-negative bacteria aggravate murine small intestinal Th1-type immunopathology following oral infection with Toxoplasma gondii. J Immunol. 2006;177(12):8785–95.PubMedCrossRefPubMedCentralGoogle Scholar
- 52.World Health Organization. Guideline: Updates on the Management of Severe Acute Malnutrition in Infants and Children. Geneva: World Health Organization; 2013.Google Scholar
- 57.Buzigi E. Prevalence of intestinal parasites, and its association with severe acute malnutrition related Diarrhoea. J Biol Agriculture Healthcare. 2015;5(2):81.Google Scholar
- 58.Duboucher C, Caby S, Dufernez F, Chabe M, Gantois N, Delgado-Viscogliosi P, Billy C, Barre E, Torabi E, Capron M, et al. Molecular identification of Tritrichomonas foetus-like organisms as coinfecting agents of human Pneumocystis pneumonia. J Clin Microbiol. 2006;44(3):1165–8.PubMedPubMedCentralCrossRefGoogle Scholar
- 62.Escalante NK, Lemire P, Cruz Tleugabulova M, Prescott D, Mortha A, Streutker CJ, Girardin SE, Philpott DJ, Mallevaey T. The common mouse protozoa Tritrichomonas muris alters mucosal T cell homeostasis and colitis susceptibility. J Exp Med. 2016;213(13):2841–50.PubMedPubMedCentralCrossRefGoogle Scholar
- 63.Meloni D, Mantini C, Goustille J, Desoubeaux G, Maakaroun-Vermesse Z, Chandenier J, Gantois N, Duboucher C, Fiori PL, Dei-Cas E, et al. Molecular identification of Pentatrichomonas hominis in two patients with gastrointestinal symptoms. J Clin Pathol. 2011;64(10):933–5.PubMedCrossRefPubMedCentralGoogle Scholar
- 64.Amadi B, Kelly P, Mwiya M, Mulwazi E, Sianongo S, Changwe F, Thomson M, Hachungula J, Watuka A, Walker-Smith J, et al. Intestinal and systemic infection, HIV, and mortality in Zambian children with persistent diarrhea and malnutrition. J Pediatr Gastroenterol Nutr. 2001;32(5):550–4.PubMedCrossRefPubMedCentralGoogle Scholar
- 68.Espern A, Morio F, Miegeville M, Illa H, Abdoulaye M, Meyssonnier V, Adehossi E, Lejeune A, Cam PD, Besse B, et al. Molecular study of microsporidiosis due to Enterocytozoon bieneusi and Encephalitozoon intestinalis among human immunodeficiency virus-infected patients from two geographical areas: Niamey, Niger, and Hanoi, Vietnam. J Clin Microbiol. 2007;45(9):2999–3002.PubMedPubMedCentralCrossRefGoogle Scholar
- 69.Brasil P, Sodre FC, Cuzzi-Maya T, Gutierrez MC, Mattos H, Moura H. Intestinal microsporidiosis in HIV-positive patients with chronic unexplained diarrhea in Rio de Janeiro, Brazil: diagnosis, clinical presentation and follow-up. Rev Inst Med Trop Sao Paulo. 1996;38(2):97–102.PubMedCrossRefPubMedCentralGoogle Scholar
- 73.Hedin CR, McCarthy NE, Louis P, Farquharson FM, McCartney S, Taylor K, Prescott NJ, Murrells T, Stagg AJ, Whelan K, et al. Altered intestinal microbiota and blood T cell phenotype are shared by patients with Crohn's disease and their unaffected siblings. Gut. 2014;63(10):1578–86.PubMedCrossRefPubMedCentralGoogle Scholar
- 74.Machiels K, Joossens M, Sabino J, De Preter V, Arijs I, Eeckhaut V, Ballet V, Claes K, Van Immerseel F, Verbeke K, et al. A decrease of the butyrate-producing species Roseburia hominis and Faecalibacterium prausnitzii defines dysbiosis in patients with ulcerative colitis. Gut. 2014;63(8):1275–83.PubMedCrossRefPubMedCentralGoogle Scholar
- 76.Barash NR, Maloney JG, Singer SM, Dawson SC. Giardia alters commensal microbial diversity throughout the murine gut. Infect Immun. 2017;85(6).Google Scholar
- 82.Kearse M, Moir R, Wilson A, Stones-Havas S, Cheung M, Sturrock S, Buxton S, Cooper A, Markowitz S, Duran C, et al. Geneious Basic: an integrated and extendable desktop software platform for the organization and analysis of sequence data. Bioinformatics. 2012;28(12):1647–9.PubMedPubMedCentralCrossRefGoogle Scholar
- 89.Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnetjournal. 2011;17(1):10.Google Scholar
- 94.e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien [https://CRAN.R-project.org/package=e1071].
- 95.gplots: Various R Programming Tools for Plotting Data [https://CRAN.R-project.org/package=gplots].
- 96.Fernandes AD, Reid JN, Macklaim JM, McMurrough TA, Edgell DR, Gloor GB. Unifying the analysis of high-throughput sequencing datasets: characterizing RNA-seq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis. Microbiome. 2014;2:15.PubMedPubMedCentralCrossRefGoogle Scholar
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.