Background

In contemporary biology, cell resources are routinely manipulated via various techniques under a range of conditions. During the course of such manipulations, eukaryotic cells are potentially exposed to microorganisms that cause prominent morphological and physiological changes in their host cells, and such changes often result in erroneous experimental conclusions [1,2,3]. In medical and clinical settings, it is imperative to detect infectious agents in donated cells to avoid donor-patient disease transmission [4,5,6]. Despite a community-wide effort to introduce precautions to prevent contamination, the pervasiveness of unexpected microbial contaminants in publications has recently been reported [7,8,9]. This diminished quality is due, in part, to intrinsic difficulties in assaying for contamination, e.g., window periods, primer dependency, and drug resistance. As an alternative solution to these problems, next-generation sequencing (NGS) has been shown to be an effective approach [6, 10, 11].

Recently, NGS-based studies have intensively addressed the presence of specific microorganisms (e.g., Mycoplasma) [7,8,9] and the influence of cross-contamination caused by exogenous sources (e.g., laboratory reagents and sequencer carryover) [12,13,14,15]. While computational methods employing efficient bioinformatics strategies have greatly contributed to such studies [16,17,18,19], fundamental challenges still remain [20, 21]. One difficulty in particular is how to deal with sequenced reads that can be mapped to multiple microbial genomes simultaneously, which leads to detection uncertainty [17, 21, 22]. In fact, biological resources contaminated by multiple microorganisms are not uncommon, and the nature of higher intra- and interspecies sequence similarities in microbial communities is well known; that is, distinct species belonging to the same genus have > 97% sequence identity [23]. There are also species in different genera that are difficult to distinguish genomically [21]; for instance, the genome sequence of Enterobacteria phage phiX174, a routinely used spike-in species in Illumina sequencing, shares > 95% identity with the sequences of the G4 and Alpha3 Microvirus genera [24].

In this study, to improve the certainty of NGS-based contaminant detection, we developed a computational approach that rigorously investigates the genomic origin of sequenced reads. Unlike existing rapid and quasi-alignment approaches, our method repeatedly performs read mapping coupled with a scoring scheme that weights the reads unmapped to the host genome but mapped to multiple contaminant genomes. This approach allows estimation of the probability of chance occurrence of the detected contaminants. By setting human as a host and bacteria/viruses/fungi as contaminants, we demonstrate the robust performance of the proposed method by analyzing synthetic data. Next, we analyzed over 400 NGS samples to profile the contamination landscape, which yielded a catalog of the microbes prevalent in the molecular experiments. Furthermore, we applied a matrix factorization algorithm using our profiles to infer the functional impacts of contamination, thus providing a novel window into the complexities of host-microbe interactions.

Results

Identification and quantification of host-unmapped microbial reads

Our first goal was to extract exogenous reads from the input NGS reads by performing greedy alignments. Similar to the initial screening step in published methods [18, 25, 26], our method thoroughly discards host-related reads (steps I to IV in Fig. 1a). Unlike the sequential subtracting approach used in other published methods [13, 18, 25], our method independently maps the screened reads to individual microbial genomes (step V in Fig. 1a), which enables us to define the mapping status of each read (step VI in Fig. 1a), i.e., a read is categorized as either a “uniq-species-hit” (or “uniq-genus-hit”), which is uniquely mapped to a specific species (or genus), or as a “multi-species-hit” (or “multi-genera-hit”), which is repeatedly mapped to multiple species (or genera).

Fig. 1
figure 1

Overall structure of the proposed pipeline and results of the performance assessment. a Schematic representation of the proposed pipeline that executes rigorous read alignment with a large-scale genome database. b FDR distribution in the reversion tests considering falsely mapped reads to other species or to other genera. Particular genera, including Raoultella, Shigella, and Kluyvera, are difficult to distinguish genomically. c Comparative analysis for the effects of uniq-genus-hits and weighted multi-genera-hits in quantification. “Total mapped” represents the sum of uniq-genus-hits (Unique and Unambiguous) and multi-genera-hits (Multiple and Ambiguous). “Weighted” represents the adjusted “Total mapped” by our scoring scheme. d Correlations between the detection quantification and spike-in concentration assayed by DNA-seq (0-day cultured hPDL-MSCs with antibiotics). e RPMH differences among three NGS protocols in Mycoplasma spike-in detections (3-day cultured hPDL-MSCs)

Prior to quantifying microbe abundance, our method tests the statistical significance of the unique microbe hits by preparing an ensemble of unique hits with random read sets (step VIII in Fig. 1a). If the observed value of the unique hits is significantly greater than its random ensemble mean value, the pipeline reports the microbe as a potential contaminant. Microbes that were detected with no unique hits are considered not of interest. Next, to calculate an RPMH (reads per million host-mapped reads) value for each species (or genus), our method weighs the reads repeatedly mapped to the multiple microbes reported (step VII in Fig. 1a). The RPMH at a sample level is based on the sum of the raw counts of microbe-mapped reads. In summary, the proposed method explores uniquely mapped reads, as a primary key, and exploits the weighted contributions of reads mapped to multiple microbial genomes (see the “Methods” section).

Parameter tuning with simulated reads

To assess the performance of our mapping approach (steps V and VI in Fig. 1a), we first conducted a reversion test with random microbial read sets, which measures the ratio of reads that correctly mapped to their origin genomes. We prepared 10,000 reads (1000 × 10 species) per run and repeated the test 1000 times with different read sets. We also tested different parameters for Bowtie2 [27]. Since the reversion test uses intact DNA fragments randomly selected, if the pipeline works perfectly, all the species will be detected with the 1000 reads.

With the default parameters (Fig. 1b), when counting false positives at the species level (i.e., multi-species-hits), 17% of the tested species had over 5% multi-species-hits. When allowing reversion errors within the same genus (i.e., counting uniq-genus-hits), only 0.7% of the genera (11 out of 1504) showed over 5% multi-genera-hits. The other parameters of Bowtie2 had no effect on these results (Additional file 1: Figure S1A-C). This observation implies the presence of high sequence similarity at the species level. We calculated the ratios by running PathSeq [18], FastQ Screen [28], and DecontaMiner [29] (Additional file 2). Of note, comparing existing pipelines is not straightforward because different aligners are employed and databases are inaccessible in some cases. With this in mind, the results indicated that the pipelines exhibit inferior performance for a portion of the reads, similar to our pipeline (Additional file 1: Figure S2A). These results suggest that the FDRs likely depend on the degree of microbial intra-species sequence homology causing ambiguous multi-species-hits, rather than on intrinsic algorithmic differences in the pipelines.

We next investigated the influence of interspecies sequence homology. Overall, although the reversion test ensures 1000 microbial reads as the intensity of a species, counting only the uniq-genus-hits showed lower intensity (i.e., loss of accuracy due in part to the occurrence of multi-genera-hits), while taking the sum of all of the hits showed higher intensity (i.e., gain of ambiguity due to the involvement of multi-genera-hits) (Additional file 1: Figure S1D). The existing pipelines we tested exhibited the same propensity in detection accuracy (Additional file 1: Figure S2B). These results point out the inadequacy in the consideration of uniquely mapped reads only and the need for careful handling of multi-genera-hits that causes ambiguity in the contamination source.

To overcome this issue, we designed a scoring scheme for multi-genera-hits (step VII in Fig. 1a). Based on the overall mapping status of the input reads, multi-genera-hit reads are rigorously penalized when a larger number of uniq-genus-hits are found; however, the penalty is relaxed when uniq-genus-hits are less frequent (Additional file 1: Figure S3). Overall, our pipeline incorporating this scoring scheme quantifies robust intensities compared to the simple sum of all of the hits (Additional file 1: Figure S1D). To clarify further, we performed a comparative analysis with the genera detected with over 5% FDR levels in Fig. 1b. The result demonstrated that the loss of accuracy can successfully recover when the weighted multi-genera-hits are considered (Fig. 1c and Additional file 3: Table S1). In addition, our detections of uniq-genus-hits and multi-genera-hits were highly comparable to FastQ screen with Bowtie2, which supports the validity of our mapping strategy tuned with Bowtie2. Interestingly, whereas the local alignment strategies (i.e., PathSeq and FastQ screen) increased the gain of ambiguity, our pipeline reduced it by the scoring scheme.

In this analysis, we observed nine unexpected genera with uniq-genus-hit reads resulting from misalignments for complex reasons (Additional file 3: Table S2). For example, a few reads of Escherichia coli were uniquely mapped to Lambdavirus in 3 out of 1000 runs. To test whether these uniq-genus-hits are rare events, we prepared random reads from our microbe genome database that discarded Lambdavirus genomes and then mapped them to the genera detected in each of the three runs to collect random uniq-genus-hits. After 1000 runs, in the case of Lambdavirus, the observation of ten unique hits showed almost zero deviation above the mean of the uniq-genus-hits from the mapping of random read sets (p = 0.475 with z-score 0.063), implying a chance occurrence of the observed uniq-genus-hits (Additional file 3: Table S2).

Considering these results, we adjusted the proposed method to quantify the microbe abundance at genus-level resolution and additionally reported species-level quantifications. Evaluation of the significance of the uniq-genus-hits of a genus prior to quantification is critical to avoid false results. For this purpose, instead of adopting the arbitrary criteria used in other methods [9, 14, 16], the proposed pipeline conducts the abovementioned mapping with random read sets to estimate the probability of the occurrence of uniquely mapped reads (step VIII in Fig. 1a). The genus having significant unique hits is finally quantified by the scoring scheme (step IX in Fig. 1a).

Analysis of spike-in contaminants with mesenchymal stem cells

To validate the performance with real-world data, we prepared human periodontal ligament-derived mesenchymal stem cells (hPDL-MSCs) by culturing with and without antibiotic treatments and by adding viable spike-in microbes. We performed DNA-seq, RNA-seq, and ATAC-seq assays with these samples (Table 1). hPDL-MSCs are a promising clinical resource for periodontal regeneration, as studied by our group [30].

Table 1 Profiling of spike-in microbes with host-unmapped NGS reads

As shown in Table 1, the spike-in microbes can be quantified with uniq-genus-hits only, decreasing the contribution of weighted multi-genera-hits. In the case of the DNA-seq assay with six spike-in species, we quantified the sample-level RPMHs that were well correlated with the spike-in concentrations (Fig. 1d). At the genus level, we could detect four species at 60 CFU and five species at 1100 CFU (p < 0.001), but failed to detect 60 CFU of Candida albicans (p = 0.2), as did BWA-align [31] and Taxonomer [17, 32]. By contrast, BWA-mem and NovoAlign found < 76 C. albicans reads with local alignments to low-complexity sequence loci. Of note, the C. albicans genome includes a particularly high content of repetitive sequences [33]. These results suggest that the microbial genomic context is one of the factors to determine the detection accuracy particularly in the case of lower contamination degree. In fact, the pipelines increased the detection variability at 60 CFU spike-ins as shown in Fig. 1d; PathSeq with BWA-mem reported a relatively higher concentration and the k-mer matching of Taxonomer broadly reduced the concentrations along with filtering a number of potential host-relevant reads (i.e., 165,777 in Sample1, 85,530 in Sample2, and 84,590 in Sample3).

With regard to antibiotic effects, the DNA-seq assay with 3-day-cultured cells clearly demonstrated that antibiotic supplementation causes a ~ 1000-fold decrease in the sample-level RPMH compared with that of cells cultured without antibiotics. In particular, Acholeplasma was markedly sensitive to sterilization compared with Mycoplasma (Table 1 and Fig. 1e), suggesting the presence of varying drug sensitivities among microbes.

In summary, we concluded that the concentration of spike-in cells can be recovered via our approach. Based on the results of the DNA-seq assays at ~ 0.1× coverage depth of the host genome with 60 CFU of microbes, we estimated 0.01 RPMH as an approximation of the limit of detection (LOD). That is, one microbial read will exist when 100 million host reads are sequenced. However, LOD verification depends on multiple factors, including microbial genomic context, antibiotic susceptibility, sequencing depth, and sequencing protocol. In this regard, the results of spike-in tests suggest that the ATAC-seq assay offers a remarkable ability to detect contaminants (Fig. 1e) with very few input reads shown in Table 1.

Detection of prevalent contaminants in public RNA-seq data

To profile the contamination landscape in public data, we downloaded 389 human RNA-seq datasets from ENCODE and Illumina Human BodyMap 2.0 (hereinafter called “IHBM2”) and extracted the potential host-unmapped microbial reads with scattered percentages in the input reads (Additional file 1: Figure S4A), which amounted to 0.15–18.7% in ENCODE and 0.54–3.0% in IHBM2. Interestingly, the relative level of microbe-mapped reads increased in a sample when the relative level of host-mapped reads decreased (Fig. 2a). Overall, 98% of samples fell within the range of 103–105 RPMHs, forming a reference range for RNA-seq sample-level RPMHs (Fig. 2b).

Fig. 2
figure 2

Investigation of 389 public RNA-seq datasets to profile potential contaminants. a Distribution of the microbe-mapped reads inversely correlated with that of the host-mapped reads. b Distribution of sample-level RPMHs. Of the samples, 98% are within 1000 to 100,000 RPMHs. c Genus-level read counts of 4040 occurrences of 240 genera across the 389 samples. d RPMHs of the 4040 occurrences, 91% of which are within 10 to 10,000 RPMHs. e Twenty-eight genera detected in both ENCODE and Illumina Human BodyMap2.0 (IHBM2) samples; the x-axis labels are colored black for bacteria, blue for fungi, and red for viruses

At the genus level, we detected 240 genera across the samples (p < 0.001). These genera appeared 4040 times, including widespread multi-genera-hits (Fig. 2c). Using the weighted read counts, we quantified the genus-level RPMHs of the 4040 occurrences, 91% of which were located within 10 to 104 RPMHs (Fig. 2d). Among the 240 genera, 56 were known contaminants in NGS experiments [12], such as Bacillus, Pseudomonas, and Escherichia (Additional file 1: Figure S4B). The remainder included 28 genera commonly found in ENCODE and IHBM2 samples (Fig. 2e). In particular, Cutibacterium, including the species C. acnes (formerly Propionibacterium acnes), which is readily detected on human skin, was the most prevalent, supporting the findings in a previous study [34].

Since the IHBM2 samples exhibited unique patterns, as shown in Fig. 2b and d, we next investigated their contamination characteristics by performing cluster analyses. The analysis clearly separated the sequencing libraries and revealed an increased magnitude of contamination in the 16 tissue-mixture samples, likely because producing such samples involved more cell-processing steps (Fig. 3a); this separation led to the bimodal distribution shown in Fig. 2b. To confirm the influence of cell-processing complexity, we further analyzed 22 samples of embryonic stem cells (ESCs) that were sequenced at five time points during culturing on various differentiation media [35]. This analysis revealed three clusters strongly associated with the cell types and time points and found elevated levels of contamination in the differentiated ESCs (Fig. 3b), suggesting that intricate cell manipulation poses a higher risk of contamination.

Fig. 3
figure 3

Results of the hierarchical clustering analysis with contamination profiles. a Contamination profile of the Illumina Human BodyMap2.0 (IHBM2) samples showing the increased RPMHs in 16 tissue-mixture RNA-seq datasets. b Contamination profile of ESCs (SRP067036) showing three clusters associated with differentiation and time points

Finally, we analyzed host-microbe chimeric reads with paired-end (PE) ENCODE and IHBM2 samples. That is, one end of a PE read was mapped to the host and its counterpart to one or more microbes, and vice versa. The total number of chimeric reads was very low among all of the microbe-mapped reads, implying no considerable influence on the quantification of host gene expression: only 972,812 out of 750,736,667 microbe-mapped PE reads in the ENCODE samples and 93,723 out of 28,622,763 microbe-mapped PE reads in the IHBM2 samples. On the other hand, most of the chimerism existed in host gene bodies that encode ribosome components, transporters, and signaling molecules (Additional file 3: Table S3). The genes were also upregulated in Mycoplasma-infected samples as described below. This finding should be further studied to understand the association between NGS read chimerism and microbial hijacking mechanisms.

Identifying genes responding to Mycoplasma infection in MSCs

Mycoplasma is notorious for infecting cultured cells and has been frequently detected in public NGS data [8, 9, 36]. Hence, we profiled the genus-level RPMHs of Mycoplasma from the 389 ENCODE and IHBM2 samples as well as from 43 heavily infected samples consisting of seven BL DG-75 samples already known to be infected [9] and 36 lung cancer and stem cell samples. As a result, 110 out of the 432 samples (25.5%) contained at least one Mycoplasma uniq-genus-hit, but only 22 samples (5%) included significant uniq-genus-hits (Fig. 4a). This large discrepancy again suggests the importance of the careful handling of homologous and erroneous NGS reads, which is imperative to infer contaminant prevalence with certainty.

Fig. 4
figure 4

Results of the Mycoplasma prevalence analysis and the functional impacts on host cells. a Twenty-two out of 432 public RNA-seq datasets contained significant Mycoplasma-mapped reads (red-colored bar) that were normalized to RPMHs (blue-colored line); the x-axis labels are colored black for DRA001846, gray for IHBM2, blue for ENCODE, and red for Mycoplasma-positive samples. b Gene expression correlation plots between Mycoplasma-positive (Myco+) and Mycoplasma-negative (Myco-) MSCs; Myco(+) hPDL-MSCs are Mycoplasma spike-in cells (2000 CFU × 7 species, 3 days cultured without antibiotics), FPKMs were transformed onto the log10 scale by adding one, and the black-labeled genes are the 13 genes listed in d. c Highly enriched Gene Ontology terms and Reactome pathways (q value after Bonferroni correction < 0.001). d Venn diagram showing unique or shared differentially upregulated genes (DUGs) in MSCs, including 13 out of 967 DUGs unique to Myco(+) MSCs. e Expression levels of the 13 genes in Myco(+) ESCs and MSCs; the values are expressed as relative TPM (transcripts per million)

To investigate host gene expression changes during Mycoplasma infection, we identified DEGs between Mycoplasma-positive Myco(+) hPDL-MSCs and uninfected Myco(−) hPDL-MSCs. We performed the same analysis by incorporating the Myco(+) human bone marrow MSCs (hBM-MSCs) used in Fig. 4a and Myco(−) hBM-MSCs (GSE90273). We also sequenced and identified DEGs from Myco(−) hBM-MSCs as a control. Of note, although decreases in gene expression should also be studied, we focused on the differentially upregulated genes (DUGs) in the Myco(+) samples to enable clear interpretations. We identified 86 and 2185 DUGs in Myco(+) hPDL-MSCs and in Myco(+) hBM-MSCs, respectively (Fig. 4b), 31 of which existed in both classes of MSCs. Although the DUGs are broadly involved in RNA processing, the genes are significantly enriched in cotranslational protein transport processes and with pathways involved in infection responses (Fig. 4c). None of these enrichments were observed among the 3538 DEGs in Myco(−) hBM-MSCs (Additional file 1: Figure S5). Among the 967 DUGs identified in Myco(+) MSCs, we ultimately retrieved 13 genes that are specifically upregulated in Myco(+) hPDL-MSCs and hBM-MSCs (Fig. 4d).

These results imply that the Mycoplasma in the MSCs addressed here utilizes host protein biosynthesis machinery related to the ER-associated degradation (ERAD) pathway, a well-known microbial entry point [37, 38]. Moreover, one can infer that the abnormal increase in the expression levels of the 13 DUG RNAs is a candidate diagnostic marker for infection. Indeed, the DUGs were also upregulated either in Myco(+) ESCs or other Myco(+) MSCs (Fig. 4e).

Inference of the functional impact of multiple contaminants

As shown in Fig. 5a, a few genes among the 967 DUGs in the Myco(+) MSCs were upregulated in Myco(+) DG-75 samples, which suggests a different type of response in lymphoma. We investigated the correspondence between gene expression levels and Mycoplasma concentrations in the samples and identified genes potentially associated with the infection (Additional file 1: Figure S6A); however, significant GO terms were not detected, which is consistent with the findings of a previous report [9]. Remarkably, the DG-75 samples were heavily contaminated with multiple microbes (Fig. 5b), and the gene expression levels exhibited diverse correlation patterns with the concentrations of other microbes (Additional file 1: Figure S6B), implying a profound influence of co-contaminants on phenotypes.

Fig. 5
figure 5

Inference of DUGs associated with multiple contaminants in Myco(+) DG75 samples. a Expression profile of 967 DUGs unique to Myco(+) MSCs. b Contamination profile with MSC, ESC, and DG-75 samples; the x-axis labels are colored black for Myco(−) and red for Myco(+). c Schematic representation of module identification from two input profiles by the jNMF algorithm. d An example showing the module that captured genes and contaminants co-elevated in a DG-75 sample. e Network representation of the association between genes and contaminants co-elevated in the seven DG-75 samples; GO:0010941 is the enriched GO term in the genes found in at least four DG-75 samples (p = 3.76e−3). f Expression profiles of the 33 genes involved in the biological process “regulation of cell death”, DG75_1 (GSM1197380), DG75_2 (GSM1197385), DG75_3 (GSM1197386), DG75_4 (GSM1197381), DG75_5 (GSM1197382), DG75_6 (GSM1197383), DG75_7 (GSM1197384), NB_1 (GSM2225743), and NB_2 (GSM2225744)

To facilitate the inference of the impact of multiple contaminants, we employed a joint non-negative matrix factorization (jNMF) algorithm [39, 40] that modulates multiple genes and contaminants associated in a set of samples (Fig. 5c). We first prepared seven input datasets, each of which contained five Myco(−) BL cell lines and one of the seven Myco(+) DG-75 samples. After preparing contamination and transcriptome profiles for each dataset, we repeatedly ran the jNMF algorithm by setting a series of parameters for testing the clustering stability (Additional file 1: Figure S7). In the case of DG75_1 (GSM1197380), the jNMF algorithm retrieved the module that specifically includes elements co-elevated in the dataset, i.e., 550 genes and 34 contaminants, including Mycoplasma (Fig. 5d). By gathering this type of module from all of the results of the seven input datasets, we could build a network modeling the connectivity between the upregulated genes and microbe concentrations in the DG-75 samples (Fig. 5e).

The network consisted of 4322 edges connecting 2289 genes, 68 microbes, and seven samples. Of these genes, 259 genes were common to least four DG-75 samples, and the biological process “regulation of cell death” (GO:0010941) was significantly enriched in a subset of them (p = 3.76e−3). This subset (33 genes) included tumor necrosis factor receptors, which paradoxically play pro-tumorigenic or pro-apoptotic functions [41], and humanin-like proteins, which potentially produce mitochondria-derived peptides that inhibit apoptosis [42]. Some of the genes were also highly expressed in normal B cells, where they are likely involved in activating immune responses. The Myco(−) BL cell lines exhibited repression of these apoptosis-related genes (Fig. 5f), which implies that the effect is not specific to cancerous cell types.

These results suggest that the severely contaminated DG-75 samples resisted contamination by multiple microbes via inflammation pathways and survived by inhibiting apoptotic pathways via mitochondria-related mechanisms or via the inhibitory effect of Mycoplasma on apoptosis [36]. Collectively, we concluded that jNMF facilitates the inference of how phenotypes (i.e., gene expression in this case) have been affected by the complex activities of co-contaminants.

Discussion

We sought to assess the feasibility of NGS-based contaminant detection and to improve its certainty by conducting microbe spike-in experiments and by analyzing public data. For profiling microbial contamination, the use of metagenomics approaches that depend on phylogenetic markers or de novo assembly seems to offer little benefit, because the sterilization of microbes and sequencing library preparation from host cell DNA lead to dilution and degradation of microbe-derived nucleic acids [13, 14]. Furthermore, since microbial communities can contaminate host cells, a comprehensive catalog of microbial genomes must be considered to avoid false inferences. Preliminarily, we detected phiX174 in 77 out of 341 ENCODE samples with the numbers of mapped reads ranging from 177 (ENCSR000AEG) to 7,031,626 (ENCSR000AAL). Surprisingly, fewer than six reads in a sample were the uniq-genus-hits of phiX174, and the remainder were multi-genera-hits for phylogenetic neighbor bacteriophages [24, 43, 44]. This situation, which makes it difficult to identify the true species, may occur frequently, as the uniquely mapped and multi-mapped reads in the public datasets exhibited a broad range of intensities (Fig. 2c).

We here developed a straightforward approach that uses a large-scale genome database and exploits multi-mapped reads that were discarded in previous studies. Although our method successfully detected the origins of microbes from the simulated reads of random mixtures, the detection certainty was still imperfect, particularly at species-level resolution. To overcome this issue, we attempted to estimate whether unique microbe-mapped reads are likely observed by chance. We found that 80% of the 110 public RNA-seq samples in which uniq-genus-hits of Mycoplasma were detected resulted from random occurrences, and 5% of 432 RNA-seq samples were most likely infected with Mycoplasma. Moreover, we estimated 103–105 sample-level RPMHs consisting of 10–104 genus-level RPMHs, consistent with previous reports; however, these results illustrated more dispersion than expected. Of note, it is possible that these RPMH estimations are limited to the samples used here, as microbes are highly sensitive to environmental conditions due to distinct genomic context, growth rate, antibiotic susceptibility, and invasion mechanism, and RPMH distributions depend greatly on the sample sets analyzed.

As shown by the results of the spike-in analyses, even though the experimental conditions were identical, the profiles differed between the DNA-seq, RNA-seq, and ATAC-seq assays. Remarkably, RNA-seq profiling tended to include more diverse microbes. This tendency may be attributed to the relatively complex sample handling required, which leads to a higher risk of contamination. Indeed, elaborate cell manipulations, such as tissue mixture and induction of cell differentiation, result in increased contamination diversity and intensity. On the other hand, because most prokaryotes have histone-free supercoiled nucleoids [45], ATAC-seq is superior for microbe detection with very low numbers of input reads. This suggests that the ratio of microbe-to-human DNA accessibility is useful to the NGS-based microbial contaminant detection more than the ratios of the genome and transcriptome sizes. This aspect of our work should be explored in more detail in future studies.

By analyzing public NGS samples, we found that microbes from the genus Cutibacterium are widespread contaminants, which is thought to arise naturally [12]. In addition to known contaminants, our microbe catalog suggests that the major sources of contamination are laboratory reagents and experimental environments. Importantly, any microbial contamination can trigger phenotypic changes in the host cells; however, the response pathways are diverse and unclear. For example, the genes aberrantly expressed during Mycoplasma infection differed greatly between MSCs and cancer cells. Therefore, as an approach to systematically infer the effects of contamination, we used network analysis with jNMF. This approach revealed that host-contaminant interactions alter the molecular landscape, and such alterations could result in erroneous experimental conclusions.

Conclusions

The findings in this study reinforce our appreciation of the extreme importance of precisely determining the origins and functional impacts of contamination to ensure quality research. In conclusion, NGS-based contaminant detection supported by efficient informatics approaches offers a promising opportunity to comprehensively profile contamination landscapes.

Methods

Step-by-step procedure of the proposed pipeline

The proposed pipeline shown in Fig. 1a consists of step-by-step operations detailed below.

Step I (quality control): Trimmomatic [46], with the option “ILLUMINACLIP:adapter_file:2:30:10 LEADING:20 TRAILING:20 MINLEN:36,” assesses the quality of the input NGS reads by removing adapters and trimming reads.

Step II (mapping to host reference genome): HISAT2 [47] coupled with Bowtie2 [27] with the option “-k 1” aligns the quality-controlled reads to a host reference genome.

Step III (removing host-relevant reads): To remove any potential host reads, Bowtie2 with “--sensitive” and via BLASTn with the options “-evalue 0.001 -perc_identity 80 -max_target_seqs 1” sequentially align the unmapped reads again to alternative host genomic and transcriptomic sequences.

Step IV (making low-complexity sequences): The host-unmapped reads that still remain are candidate contaminant-origin reads. To reduce false discovery, TANTAN [48] masks the low-complexity sequences in the host-unmapped reads.

Step V (mapping to a microbe genome): Bowtie2, with the option “--sensitive,” aligns the masked sequences to one set of bacterial, viral, or fungal genomes of species belonging to the same genus. This step is independently repeated with each of the 2289 genera.

Step VI (categorizing read-mapping status): A mapped read is categorized as either a “uniq-genus-hit” (i.e., uniquely mapped to a specific genus) or a “multi-genera-hit” (i.e., repeatedly mapped to multiple genera). The statistics is gathered from the mapping results, which includes the total number of microbe-mapped reads (i.e., sum of “uniq-genus-hit” and “multi-genera-hit”) and the total number of host-mapped reads.

Step VII (defining a shape of scoring function): The total number of microbe-mapped reads (n) and the number of genera of each “multi-genera-hit” read (Ti) define an exponential function for weighting the “multi-genera-hit” reads. That is, a score Si for the read i that was mapped to Ti different genera (or a single genus) is given by

$$ {S}_i={e}^{\frac{-n\left({T}_i-1\right)}{\sum_{j=1}^n{T}_j}}. $$

Thus, a read uniquely mapped to a genus is counted as 1.0, whereas a read mapped to multiple genera is penalized by the exponential function.

Step VIII (testing statistical significance of unique hits): To test the chance occurrence of the “uniq-genus-hit” reads that were mapped to specific microbes, the pipeline first randomly samples n reads (i.e., the total number of microbe-mapped reads) from the microbe genomes that discard the observed microbial genomes. Next, the pipeline aligns the random reads to the observed microbial genomes and counts the uniquely mapped reads. This procedure is repeated ten times to prepare an ensemble of random numbers of unique reads for each observed genus. The numbers for a genus are converted into z-scores, and the null hypothesis that no difference exists between the observation and the mean of its ensemble is tested, resulting in a p value.

Step IX (calculating RPMHs): For sample-level quantification, a normalized RPMH score (reads per million host-mapped reads) is calculated as RPMH = n/m × 106, where n and m are the total number of microbe-mapped reads and the total number of host-mapped reads in a given input dataset, respectively. For genus-level quantification, the RPMH of a genus G is calculated by

$$ \mathrm{RPMH}(G)=\frac{\sum_{k={1}^{S_k}}^{\overset{`}{n}}}{m}, $$

where \( \overset{`}{n} \) is the total number of reads uniquely or repeatedly mapped to G.

Preparation of random microbial reads for reversion

Ten species belonging to distinct genera were randomly selected, and 1000 100-base pair (bp) DNA fragments from the genome of a selected species were prepared. A run of the reversion test uses the 10,000 reads (1000 reads × 10 species) and calculates the false discovery rate (FDR) for each species; that is, TN/(TN + TP), where TP (true positive) is the number of reads mapped to their origin and TN (true negative) is the number of reads mapped to others. If the method works perfectly, the species tested will be detected with 1000 uniquely mapped reads (see Additional file 2).

Cell collection and culture

Human bone marrow-derived MSCs (hBM-MSCs) were purchased from Lonza (Lonza, Walkersville, MD, USA), and periodontal ligament-derived MSCs (hPDL-MSCs) were prepared as previously described [49]. Briefly, periodontal ligament (PDL) tissue samples separated from the middle third of a patient’s wisdom tooth were digested with collagenase (Collagenase NB 6 GMP Grade from Clostridium histolyticum; Serva, Heidelberg, Germany)/dispase (Godo Shusei Co., Tokyo, Japan), and single-cell suspensions were passed through a 70-μm cell strainer (Falcon, Franklin Lakes, N.J., USA). The collected cells were incubated in a culture plate (Falcon T-25 flask, Primaria; BD Biosciences, San Jose, CA, USA) in complete medium: α-MEM (Sigma-Aldrich, St. Louis, MO, USA) containing 10% fetal bovine serum (Gibco; Thermo Fisher Scientific, Waltham, MA, USA), 2 mM l-glutamine (Sigma-Aldrich, St. Louis, MO, USA), and 82.1 μg/ml l-ascorbic acid phosphate magnesium salt n-hydrate (Wako Junyaku, Tokyo, Japan) with the antibiotics gentamicin (40 μg/ml, GENTCIN; Schering-Plough, Osaka, Japan) and amphotericin B (0.25 μg/m, FUNGIZONE; Bristol-Myers Squibb, Tokyo, Japan). After three passages for expansion in T-225 flasks, the cells were preserved in freezing media (STEM-CELLBANKER GMP grade; Nihon Zenyaku Kogyo, Fukushima, Japan) and stored in liquid nitrogen.

Spike-in test of microbes with human PDL-MSCs

The frozen cells were rapidly thawed with gentle shaking in a water bath at 37 °C. Next, the cells were spiked and cultured in complete medium with and without antibiotics (40 μg/ml gentamicin and 0.25 μg/m amphotericin B). Then, 2 × 105 cells were spiked with either Bioball® (BioMérieux, France) or seven species of Mycoplasma (Additional file 3: Table S4), 60 or 1100 colony-forming units (CFU) of each Bioball, or 2000 CFU of each Mycoplasma species. Genomic DNA was isolated 0 or 3 days after the spike-in using a NucleoSpin Blood Kit (Macherery-Nagel Inc., Easton, PA, USA), and total RNA was isolated using a NucleoSpin RNA kit (Macherery-Nagel Inc., Easton).

Sequencing of DNA and RNA libraries

DNA-seq libraries were prepared using 100 ng DNA and the Illumina TruSeq Nano Kit, following the manufacturer’s instructions. RNA-seq libraries were prepared using 200 ng total RNA and the SureSelect Strand-Specific RNA Reagent Kit (Agilent Technologies, Santa Clara, CA, USA), following the manufacturer’s instructions. ATAC-seq libraries were prepared using 50,000 cells, according to a published protocol [50]. Sequencing of 36-bp single ends of the RNA libraries from mycoplasma-free hPDL-MSCs (three biological replicates) and hBM-MSCs (three biological replicates) was performed with an Illumina HiSeq2500 system. Sequencing of the 100-bp paired ends of the libraries of hPDL-MSCs with microbe spike-in was conducted with an Illumina HiSeq3000 system.

Implementation of joint non-negative matrix factorization

Joint non-negative matrix factorization (jNMF) has been successfully applied for the detection of the so-called modules in multiple genomic data [40, 51, 52]. Briefly, given N multiple non-negative data matrices \( {X}_{m\times {n}_I\left(I=1,\dots, N\right)} \), jNMF decomposes the input matrices into a common basis matrix Wm × k and a set of coefficient matrices \( {H}_{k\times {n}_I} \) by minimizing a squared Euclidean error function formulated as

$$ \min \sum \limits_{I=1}^N{\left\Vert {X}_I-W{H}_I\right\Vert}_F^2\ \left(\mathrm{s}.\mathrm{t}.W\ge 0,{H}_I\ge 0\right), $$

where k is the factorization rank and F is the Frobenius norm. To optimize this objective function, a multiplicative update procedure was performed by starting with randomized values for W and HI, which is well described in many publications [40, 51, 53]. In a single trial, the update procedure was repeated R times, and the trial was restarted T times. During the trials, consensus matrices Cm × m and \( {C}_{n_I\times {n}_I\ \left(I=1,\dots, N\right)} \) were built to calculate the co-clustering probabilities of all of the input elements, i.e., the cophenetic correlation coefficient values [39]. For example, if the maximal value of the jth factorization rank coincides with the ith element in Wm × k, all of the elements in m having > 0.8 with the ith element in Cm × m were modulated. In this study, N = 2 (i.e., contamination profile and expression profile) and m = 6 (i.e., five Myco(−) samples and one Myco(+) sample) were used. Thus, m, n1, and n2 represent cells, contaminants, and genes, respectively. The parameters T = 100, R = 5000, and k = 3 were set after testing the clustering stabilities with the combinations of T = (10,50,100), R = (1000,2000,5000), and k = (2, 3, 4, 5) by calculating the cophenetic correlation coefficient values [39]. The input profiles retaining elements with > 3 TPM and > 1 RPMH were converted to the log10 scale by adding one.

Preparation of public datasets

The human reference genome (hg38) was downloaded from the UCSC genome browser [54], and alternative sequences of the reference genome were downloaded from the NCBI BLAST DB [55]. To build the microbial genome database, the complete genomes of bacteria, viruses, and fungi were obtained from the NCBI RefSeq [56], consisting of 11,360 species from 2289 genera. Raw RNA-seq datasets (341) were downloaded from the ENCODE project [57], and additional raw RNA-seq datasets were downloaded from NCBI’s GEO and SRA, including 48 Illumina Human BodyMap 2.0 (GSE30611), 22 ESCs (SRP067036), seven Burkitt’s lymphoma (BL) DG-75 cell lines (GSE49321), 26 lung cancer cell lines (DRA001846), and ten stem cells (PRJNA277616). The RNA-seq data for the EBV-negative BL cell lines (BL-41, BL-70, CA46, GA-10, and ST486) were obtained from the CCLE [58].

Bioinformatics analysis

To analyze the RNA-seq data, the HISAT2-Bowtie2 pipeline and the Cufflinks package [47, 59] were used with hg38 and RefSeq gene annotation. After retrieving genes with > 3 FPKMs in at least one sample, Cuffmerge and Cuffdiff were performed to detect differentially expressed genes (DEGs) satisfying a q value cutoff < 0.05 (Benjamini-Hochberg correction p value) and a > 2.0 fold-change (fc) cutoff. To analyze the RPMH clusters, R language function hclust was used. The Euclidean distances among the RPMHs were adjusted by quantile normalization and mean centering, and the hierarchical average linkage method was used to group genera. To analyze the enrichment of Gene Ontology (GO) terms and pathways, the GOC web tool [60] was used with the “GO biological process complete” and “Reactome pathways” datasets by selecting the option “Bonferroni correction for multiple testing.”

NovoAlign (V.3.08) was downloaded from the Novocraft [61], and Taxonomer was performed on the Taxonomer website [32]. The network data were visualized by using software Cytoscape (V.3.5.1). PathSeq [18], FastQ Screen [28], and DecontaMiner [29] were installed with their reference databases. Because FastQ Screen accepts limited number of genomes, the input reads were mapped to ten specific genomes only. Detailed information on the existing pipelines can be found from Additional file 2. To calculate the sample-level RPMHs in Fig. 1d, the existing pipelines were used to analyze the host-unmapped reads of our pipeline, and the total number of microbe-mapped reads was divided by the total number of host-mapped reads from our pipeline. As the total number of microbe-mapped reads, for Taxonomer, the numbers of ambiguous, bacterial, fungal, phage, phix, and viral bins in the output file were summed up. For DecontaMiner, the total counts of “TOTAL_READS” in the output file were collected. For PathSeq, the read count of the column “read” when the column “type” is “root” in the output file was collected.