Laboratory contamination in airway microbiome studies
The low bacterial load in samples acquired from the lungs has made studies on the airway microbiome vulnerable to contamination from bacterial DNA introduced during sampling and laboratory processing. We have examined the impact of laboratory contamination on samples collected from the lower airways by protected (through a sterile catheter) bronchoscopy and explored various in silico approaches to dealing with the contamination post-sequencing. Our analyses included quantitative PCR and targeted amplicon sequencing of the bacterial 16S rRNA gene.
The mean bacterial load varied by sample type for the 23 study subjects (oral wash > 1st fraction of protected bronchoalveolar lavage > protected specimen brush > 2nd fraction of protected bronchoalveolar lavage; p < 0.001). By comparison to a dilution series of known bacterial composition and load, an estimated 10–50% of the bacterial community profiles for lower airway samples could be traced back to contaminating bacterial DNA introduced from the laboratory. We determined the main source of laboratory contaminants to be the DNA extraction kit (FastDNA Spin Kit). The removal of contaminants identified using tools within the Decontam R package appeared to provide a balance between keeping and removing taxa found in both negative controls and study samples.
The influence of laboratory contamination will vary across airway microbiome studies. By reporting estimates of contaminant levels and making use of contaminant identification tools (e.g. the Decontam R package) based on statistical models that limit the subjectivity of the researcher, the accuracy of inter-study comparisons can be improved.
Keywords: Microbiome, Contamination, Low biomass, Respiratory, 16S rRNA gene
COPD: Chronic obstructive pulmonary disease
NCS: Negative control sample
OTU: Operational taxonomic unit
PBAL: Protected bronchoalveolar lavage
PSB: Protected specimen brush
QIIME: Quantitative Insights into Microbial Ecology
The most common method used for studying the bacterial communities of the lower respiratory tract is high throughput amplicon sequencing of the bacterial 16S ribosomal RNA (16S rRNA) marker gene . Some studies use sputum samples [2, 3], with inevitable questions regarding the degree to which the samples are representative of the lower respiratory tract as opposed to contamination from the upper respiratory tract. The emerging gold standard for lower respiratory tract samples is protected bronchoscopy (sampling via a sterile catheter) . However, even with protected bronchoscopy the samples are processed through extensive laboratory workflows that include at minimum steps of bacterial DNA extraction, PCR amplification of the marker gene, and preparation for sequencing. Each step opens up the possibility for the introduction of contaminating bacterial DNA from the laboratory environment, with greatest impact on samples with the lowest bacterial load .
Accurate analysis of the lower respiratory tract microbiome will require separate consideration of both of the aforementioned contamination sources - that from the upper respiratory tract introduced during sampling and that introduced during laboratory processing steps. We have previously shown that protected bronchoscopy offers some protection from upper airway contamination . In the current study, we address the issue of contamination from the laboratory.
The impact of laboratory contamination is typically evaluated through the inclusion of negative control samples (NCS) that are processed through all steps of DNA extraction and library preparation for sequencing alongside the study samples. The approach is not perfect, as one may expect to find taxa in the NCS that also belong to the bacterial communities of the sampled site. Researchers are thus faced with a difficult decision with regard to what to do with the information acquired from the NCS. Some groups have removed all taxa identified in NCS from their study samples [4, 6, 7]. Others single out taxa they believe likely represent contaminants . Bioinformatic tools that aim to tease out the authentic microbiota signal using statistical models are currently being developed [9, 10, 11] (e.g. Decontam), but these have yet to be tested on lower respiratory tract sequencing data.
In the current paper we illustrate an effective workflow for evaluating the quality of lower respiratory tract samples for accurate assessment of bacterial composition. Objectives of the study were i) to determine the influence of contamination on lower respiratory tract samples as a function of bacterial load, ii) to determine the main source of contamination in our laboratory setting and iii) to explore common in silico approaches to dealing with contamination.
(row label missing): 63.0 ± 6.7; 68.2 ± 5.2; 63.6 ± 3.1
Smoker pack years: 11.8 ± 6.1; 25.2 ± 8.1; 12.1 ± 6.2
FEV1 (% predicted): 97.0 ± 13.7; 72.6 ± 23.2; 101.6 ± 9.3
Bacterial load varies with sample type
Bacterial load and impact of laboratory contamination
Monitoring procedural contamination
Having learned that contaminating bacterial DNA likely represents a substantial proportion (10–50%) of the sequencing output for the lower airway samples in our study, we attempted to identify the main contamination source. We performed ten simulated bronchoscopy procedures (no patient) over two days to capture the environmental contaminants that may have been introduced during sampling.
All procedural control samples were sequenced together on the same sequencing run (Run A). Additional control samples were sequenced on a second run (Run B) and included samples of molecular grade water that were processed through the DNA extraction protocol without the introduction of PBS. Although sequenced on a separate sequencing run (Run B), the molecular grade water samples would indicate whether the PBS was the main source of contamination. A sample of molecular grade water that was not processed through the DNA extraction protocol (PCR water) was also included on both sequencing runs (Run A and B). This latter sample would reflect contamination introduced during PCR and sequencing steps without interference from contamination introduced during sampling and DNA extraction steps.
To differentiate between PBS and DNA extraction as contamination sources, we compared the molecular grade water samples (Run B) to the corresponding PCR water sample sequenced on the same run. The molecular grade water (n = 3) (Run B) contained a mean number of 124,941 sequences and 107 OTUs, whereas the PCR water (Run B) contained 126,103 sequences and only 39 OTUs. Importantly, the taxonomic profile of the molecular grade water (Run B) resembled that of the procedural control samples (Run A), whereas the PCR water did not, indicating that the main source of contamination was the DNA extraction kit (Fig. 3).
Exploring in silico approaches to dealing with contamination in LRT samples
Decontam performance test on the Salmonella dilution series (SDS)
In the Decontam introduction paper , the authors illustrate how Decontam is able to diminish the contaminant signal from the serially diluted Salmonella datasets published in the Salter paper . As our study also included a Salmonella dilution series (SDS), we were able to test the Decontam package tools on sequencing data generated in the context of our laboratory setting after processing through our chosen bioinformatic pipeline.
Of the three approaches tested in Decontam, the “either” method most effectively removed the contaminant signal from the bacterial community profiles of the samples; even in the most diluted sample, over 50% of the sequences mapped to the Salmonella genus. Of concern, however, is that the PBS sample also consisted of over 50% Salmonella. Also present in the PBS sample were oral/lung-specific genera including Veillonella, Streptococcus and Neisseria, which are obvious contaminants from the procedural samples sequenced on the same run. The number of reads in the PBS sample after processing in Decontam was only 32. We therefore learn that, although effective, removal of contaminant OTUs identified in Decontam may also magnify another type of noise in the sequencing data, particularly that from cross-sample contamination during library preparation or index misassignment during MiSeq sequencing.
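The magnification effect described above can be illustrated with a minimal Python sketch. All counts and OTU names are invented for illustration (chosen so that 32 reads remain, as in the PBS sample; the per-OTU split is not from the study), and the MIN_READS cutoff is an assumption that each study would need to set for itself.

```python
def remove_contaminants(counts, contaminants):
    """Drop flagged contaminant OTUs from one sample's {otu: read_count} mapping."""
    return {otu: n for otu, n in counts.items() if otu not in contaminants}


def relative_abundance(counts):
    """Convert read counts to fractions of the sample total."""
    total = sum(counts.values())
    if total == 0:
        return {otu: 0.0 for otu in counts}
    return {otu: n / total for otu, n in counts.items()}


# Hypothetical blank sample: nearly all reads come from reagent contaminants,
# plus a handful of reads that leaked in from neighbouring samples.
blank = {
    "contaminant_A": 60000,
    "contaminant_B": 39968,
    "Salmonella": 18,
    "Streptococcus": 9,
    "Veillonella": 5,
}
flagged = {"contaminant_A", "contaminant_B"}  # assumed output of a contaminant caller

cleaned = remove_contaminants(blank, flagged)
remaining_reads = sum(cleaned.values())
profile = relative_abundance(cleaned)

MIN_READS = 1000  # assumed depth cutoff, not a value from the study
too_shallow = remaining_reads < MIN_READS

print(remaining_reads, round(profile["Salmonella"], 2), too_shallow)
```

Before filtering, the leaked reads are a negligible fraction of the blank; after filtering they make up the whole profile. Reporting the remaining read count next to each filtered profile is one simple way to keep such samples from being over-interpreted.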
In the current paper we illustrate an effective workflow for evaluating the quality of lower airway samples for amplicon-based analysis of bacterial composition. Our results show that the low bacterial load in samples from the lungs makes them vulnerable to bacterial DNA contamination, which in our study mainly originated from DNA extraction kits. Even with contaminants representing an estimated 10–50% of the sequencing output for these samples, we demonstrate that most of the contaminating signal can be removed post sequencing using recently developed bioinformatic approaches.
Through the processing and sequencing of a serially diluted culture of Salmonella , we were able to define the threshold bacterial load at which contamination would begin to dominate the bacterial profile in our samples. At an input of between 10^3 and 10^4 Salmonella/mL, we observed that contaminants constituted more than 50% of the bacterial profile of the sample. The use of alternative protocols for sample processing and sequencing can shift this threshold up or down, so the threshold should be determined independently in each study. Biesbroek et al.  for example show in their study how the choice of DNA extraction kit will affect the DNA yield and in turn the placement of samples above or below a defined threshold of bacterial load at which contamination becomes a problem. Despite differences in laboratory protocols, our results are in agreement with Salter and colleagues  who in their study also recommend an input of more than 10^3–10^4 bacterial cells. The concordance of our results may partially be explained by the use of a DNA extraction kit from the same manufacturer (FastDNA Spin Kit, MP Biomedicals).
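The threshold determination described above can be sketched in Python as follows. This is a minimal illustration, assuming a monoculture dilution series in which every read not assigned to the spiked-in taxon counts as contamination; the read counts are invented and the function names are hypothetical.

```python
def contaminant_fraction(counts, target="Salmonella"):
    """Fraction of reads in a sample that do not belong to the spiked-in target."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return 1.0 - counts.get(target, 0) / total


def first_dominated_load(series, cutoff=0.5):
    """Return the highest input load at which contaminants exceed `cutoff`.

    `series` maps input load (cells/mL) to {taxon: read_count}.
    Returns None if no dilution step crosses the cutoff.
    """
    for load in sorted(series, reverse=True):
        if contaminant_fraction(series[load]) > cutoff:
            return load
    return None


# Invented read counts for a ten-fold dilution series (not the study's data).
series = {
    10**7: {"Salmonella": 99000, "other": 1000},
    10**6: {"Salmonella": 97000, "other": 3000},
    10**5: {"Salmonella": 90000, "other": 10000},
    10**4: {"Salmonella": 70000, "other": 30000},
    10**3: {"Salmonella": 40000, "other": 60000},
    10**2: {"Salmonella": 10000, "other": 90000},
}

threshold = first_dominated_load(series)
print(threshold)
```

With these invented counts the cutoff is crossed between the 10^4 and 10^3 steps. A ten-fold series only brackets the threshold to within one order of magnitude, which is why it is reported as a range.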
Using the Salmonella dilution series as a reference we were able to determine the degree of laboratory contamination in the various sample types (OW, PBAL1, PBAL2, PSB) collected from participants in the MicroCOPD study. The average bacterial load in the samples acquired from the lungs was highest for PBAL1 samples (10^6 bacteria/mL) and approximately an order of magnitude lower for PSB and PBAL2 samples. This could mean that the first lavage fraction harvests a larger portion of the resident microbiota, but it could also reflect a dilution effect, as lavage yield tends to increase in the second fraction. We used a sterile inner catheter for lavage sampling, to minimize contamination from BAL, something no other study has done to our knowledge. It is however possible that the first fraction of lavage (PBAL1) is more susceptible to contamination from the upper airways during sampling compared to PBAL2 and PSB samples . Thus, the question remains as to whether PBAL1 with its higher bacterial load is a more representative sample compared to PBAL2 and PSB samples or if we are simply swapping contamination sources (contaminating bacterial DNA introduced from the upper airways during sampling versus contaminating bacterial DNA introduced during laboratory processing steps). The optimal sample type may thus be a question of which contamination source is easiest to identify and remove post sequencing.
Through the sequencing of procedural control samples and PCR negative control samples that were not processed through the DNA extraction protocol, we were able to trace the main source of contamination back to the DNA extraction kit. Our findings are in agreement with several other studies [5, 15, 16]. The difference in the microbiota readout between the procedural control samples and the negative control samples is likely explained by differences in lot number for the DNA extraction kits. Salter and colleagues report differences in contaminant profiles for three replicates of SDS extracted using different lots of the FastDNA Spin Kit for soil; similar to our results, they also found that one SDS replicate was dominated by unclassified Enterobacteriaceae.
Publications such as that by Salter and colleagues have led to an increased awareness of the effects of contamination on microbiome studies of low biomass samples [5, 16]. Most studies now process negative control samples that allow for monitoring of the contaminant signal introduced from the laboratory. However, the inclusion of NCS only partly addresses the issue. In our study for example, we recognized that a major Streptococcus OTU found in procedural samples (OW, PBAL, PSB) was also among the top 20 most abundant OTUs found in NCS. A comparison of the relative abundance of the Streptococcus OTU in procedural samples and NCS indicated that the OTU was likely not a contaminant. However, deciding where to set the abundance threshold above which an OTU should be identified as a contaminant is not always straightforward. The Decontam package in R has been developed to identify contaminants using statistical models . The Decontam developers demonstrate the accuracy of their approach on the Salmonella dilution series datasets generated in the Salter publication. We show that, in the context of our laboratory setting, Decontam is also efficient at removing the contaminant signal from the SDS in our study. Using Decontam we were also able to confirm the identity of the Streptococcus OTU found in both procedural samples and the NCS as a non-contaminant.
We acknowledge that our study does not address all issues related to bacterial load in microbiome sequencing data. First, the serially diluted Salmonella monoculture does not provide insight into the effects of bacterial load on the relative abundance of bacteria in a more complex microbiota sample. Biesbroek et al.  show in their study examining the microbiota of a serially diluted saliva sample, an increase in the relative abundance of Proteobacteria and Firmicutes and a decrease in Bacteroidetes across the dilution series. Proteobacteria likely reflect contaminants, as has been suggested in several papers [14, 17], again illustrating the inverse relationship between bacterial load and the influence of contamination as observed in our study. The observed increase in relative abundance of Firmicutes and concurrent decrease in Bacteroidetes is however of concern, as these phyla hold members often detected in studies of the lung microbiome (e.g. Veillonella and Prevotella). The field would benefit from studies addressing the potential effects of bacterial load on the measured relative abundance of taxa in a more complex sample, particularly those that are suspected core lung microbiota members. Second, we did not quantify the amount of human DNA in the procedural samples. The presence of human DNA may affect the efficiency of the qPCR reaction , and thereby also the accuracy of the direct comparison to the SDS. Studies evaluating the impact of contamination might consider quantification of human DNA for an even more accurate estimate of contamination.
Measured amounts of bacteria will vary in lower airway samples collected with different bronchoscopic sampling techniques (e.g. PBAL1, PBAL2, PSB in the current study). These differences combined with the inverse relationship between bacterial load and bacterial DNA contamination will render some sampling modalities dominated by contaminating taxa.
Differences in protocols for sampling, laboratory processing and bioinformatics analysis across studies will require investigators to evaluate the impact of contamination in the context of their own laboratory setting. We encourage investigators to report an estimate of the degree of contamination in their datasets defined against a sample of known bacterial load as exemplified in the current study. We further suggest the use of contaminant identification tools (e.g. Decontam) based on statistical models for the objective removal of laboratory contaminants in lung microbiome sequencing data. Such measures will enable more accurate inter-study comparisons and may also resolve discrepancies between studies that have likely impeded our understanding of the role of the microbiota in chronic lung diseases.
Study subjects (n = 23) were chosen from the Bergen COPD Microbiome Study (short name “MicroCOPD”) , to give an equal representation of healthy (n = 9) and diseased (asthma (n = 4), COPD (n = 10)) states. Details on data collection and the bronchoscopy procedures have been previously published [4, 12]. Briefly, adult subjects with and without obstructive lung disease, recruited from Western Norway, underwent voluntary bronchoscopies between 2013 and 2015. All subjects were examined in the stable state, not having received antibiotics for at least 2 weeks prior to the procedure. All bronchoscopies were performed by experienced chest physicians at the outpatient clinic at the Department of Thoracic Medicine, Haukeland University Hospital. The regional ethical committee (REK-Nord, case # 2011/1307) approved the study, and all patients gave written informed consent.
Sample types acquired per patient included the first and second fraction of 2 × 50 mL bronchoalveolar lavage (PBAL1 and PBAL2) sampled through a sterile inner catheter (Plastimed Combicath, Le Plessis Bouchard, France) of the bronchoscope while the scope itself was wedged in the right middle lobe, and three protected specimen brushes subsequently sampled from the right lower lobe (rPSB), an oral wash (OW), and a negative control sample (NCS). Additional procedural control samples were collected after ten simulated bronchoscopy procedures (no patient) carried out over two days; samples included a bronchoscope rinse (BR), a catheter rinse (CR), a protected specimen brush (PSB), a sample of phosphate buffered saline (PBS) transferred to a cryotube (CT) and a sample of PBS used for collection of all samples. The PBS used for sample collection was sterilized by sterile filtration (0.22 μm) and autoclaving at 121 °C for 15 min. To study the relationship between bacterial load and the influence of contaminating bacterial DNA in our laboratory setting , we included a ten-fold dilution series of Salmonella enterica serovar Typhimurium (ATCC 14028) (ATCC, Manassas, VA, USA) (SDS).
Bacterial DNA extraction using enzymatic and mechanical lysis steps
Samples were treated with the lytic enzymes mutanolysin, lysozyme and lysostaphin (all from Sigma-Aldrich, St. Louis, MO, USA) and subsequently processed through the FastDNA Spin Kit (MP Biomedicals, LLC, Solon, OH, USA) following the manufacturer’s instructions. Procedural samples were processed using different lots of the DNA extraction kit (#79113, #84562, #57212, #62903). The procedural controls and the SDS were processed using a kit of the same lot number (#93678). The sample volume used as input varied with sample type (for procedural samples: 450 μl for PSB and NCS and 1800 μl for OW, PBAL1, PBAL2; for procedural control samples: 450 μl for PBS and CT, 550 μl for PSB and 1800 μl for BR and CR; for samples in the SDS: 500 μl). DNA was eluted in a total volume of 100 μl.
Quantification of bacterial load by quantitative PCR (qPCR)
The bacterial load in the samples was determined by probe-based qPCR targeting the bacterial 16S rRNA gene (region V1–V2) using forward primer 5′-AGAGTTTGATCCTGGCTCAG-3′, reverse primer 5′-CTGCTGCCTYCCGTA-3′ and probe 5′-6-FAM-TAACACATGCAAGTCGA-BHQ-1-3′ (locked nucleic acid bases are underlined; 6-FAM: 6-carboxyfluorescein; BHQ-1: Black Hole Quencher-1) [7, 18, 19, 20]. PCR reactions were carried out using the following cycling conditions: an initial cycle at 95 °C for 5 min followed by 45 cycles of 95 °C for 5 s, 60 °C for 20 s and 72 °C for 10 s and a final extension cycle of 72 °C for 2 min. A standard curve was constructed from genomic DNA from E. coli strain JM109 (Zymo Research, Irvine, CA, USA).
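How a standard curve of this kind is typically used to convert a quantification cycle (Ct) into a load estimate can be sketched in Python as below. This is a generic least-squares illustration, not the authors' analysis: the standards, Ct values and function names are invented, and the result is expressed in the units of the standard (here, 16S gene copies per reaction) rather than in cells.

```python
import math


def fit_standard_curve(log10_copies, ct_values):
    """Ordinary least-squares fit of Ct = slope * log10(copies) + intercept."""
    n = len(log10_copies)
    if n < 2 or n != len(ct_values):
        raise ValueError("need at least two paired standards")
    mean_x = sum(log10_copies) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in log10_copies)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(log10_copies, ct_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def copies_from_ct(ct, slope, intercept):
    """Invert the standard curve to estimate copies for an unknown sample."""
    return 10 ** ((ct - intercept) / slope)


def amplification_efficiency(slope):
    """Efficiency implied by the slope; 1.0 means perfect doubling per cycle."""
    return 10 ** (-1.0 / slope) - 1.0


# Invented ten-fold standards (log10 copies per reaction) and their Ct values.
standards_log10 = [2, 3, 4, 5, 6]
standards_ct = [33.0, 29.7, 26.4, 23.1, 19.8]

slope, intercept = fit_standard_curve(standards_log10, standards_ct)
estimated_copies = copies_from_ct(28.05, slope, intercept)
efficiency = amplification_efficiency(slope)

print(round(slope, 2), round(intercept, 2), round(math.log10(estimated_copies), 2))
```

Converting copies per reaction into bacteria per mL additionally requires the template volume, the elution volume, the sample input volume and an assumed 16S copy number per genome, none of which are fixed by this sketch.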
MiSeq sequencing of the bacterial 16S rRNA gene
The bacterial composition in the samples was determined by paired-end sequencing of the 16S rRNA gene (region V3–V4) following instructions provided in the Illumina 16S Metagenomic Sequencing Library Preparation guide (Part no. 15044223 Rev. B). PCR cycling conditions were modified from the commercial protocol and consisted of an initial cycle at 95 °C for 3 min followed by 45 cycles of 95 °C for 30 s, 55 °C for 30 s, 72 °C for 30 s and a final extension cycle at 72 °C for 5 min.
Bioinformatic sequence processing steps
Bioinformatic sequence processing steps were performed using tools provided within the Quantitative Insights into Microbial Ecology (QIIME) bioinformatic package, version 1.9.1. In short, raw sequences were retrieved from the MiSeq sequencer in the form of demultiplexed forward and reverse fastq files (paired-end reads). Primer sequences were trimmed off and forward and reverse reads were joined. Chimeric sequences identified using the VSEARCH program  were subsequently removed. Remaining sequences were grouped into open-reference operational taxonomic units (OTUs) using UCLUST  and the GreenGenes reference database (v.13.8) . Small OTUs, defined as those containing less than 0.005% of the total sequence count in the dataset, were then filtered out . Taxonomy was assigned to OTUs using the naïve Bayesian RDP Classifier  together with the GreenGenes reference database (v.13.8) . The resulting OTU table displaying the sequence count in each OTU for each sample was the starting point for all subsequent analyses. The QIIME commands used for generating the working OTU table are provided in the Additional file 3: Supplementary Methods.
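The small-OTU filter described above can be expressed as a short Python sketch. It shows only the logic of the 0.005% cutoff on an invented OTU table; the study itself applied the filter with QIIME commands (see the Supplementary Methods), and the function name and data layout here are assumptions.

```python
def filter_small_otus(otu_table, min_fraction=0.00005):
    """Remove OTUs whose total count is below `min_fraction` of all reads.

    `otu_table` maps OTU id -> {sample id: read count}.
    The default of 0.00005 corresponds to the 0.005% cutoff in the text.
    """
    otu_totals = {otu: sum(samples.values()) for otu, samples in otu_table.items()}
    grand_total = sum(otu_totals.values())
    cutoff = min_fraction * grand_total
    return {otu: samples for otu, samples in otu_table.items()
            if otu_totals[otu] >= cutoff}


# Invented table with 1,000,000 reads in total, so the cutoff is about 50 reads.
otu_table = {
    "OTU_1": {"S1": 500000, "S2": 400000},
    "OTU_2": {"S1": 60000, "S2": 39930},
    "OTU_3": {"S1": 60, "S2": 0},
    "OTU_4": {"S1": 4, "S2": 6},
}

filtered = filter_small_otus(otu_table)
print(sorted(filtered))
```

Because the cutoff scales with the total sequence count of the dataset, the same OTU can pass the filter in one run and fail it in another of different depth.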
In silico contaminant identification and removal
Two approaches to contaminant identification and subsequent removal were tested. In the first approach contaminant OTUs were identified through their presence in NCS. NCS OTUs were filtered out from the procedural samples (OW, PSB, PBAL) collected under the same procedure using QIIME commands (illustrated in the supplementary methods). In the second approach, contaminant OTUs were identified based on statistical models using the Decontam package  in R. Contaminant OTUs identified using the Decontam isContaminant function (method = either, user defined threshold = 0.5) were filtered out of the main OTU working table using QIIME commands.
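The first approach above (removing every OTU detected in the NCS of the same procedure) can be sketched in Python as follows. The sample names follow the text, but the OTU names, counts and function names are invented, and the study performed this step with QIIME commands. The second approach relies on the Decontam package in R and is not reproduced here.

```python
def otus_present(counts, min_reads=1):
    """Set of OTUs with at least `min_reads` reads in a sample."""
    return {otu for otu, n in counts.items() if n >= min_reads}


def remove_ncs_otus(procedure):
    """Filter every sample of one procedure against that procedure's NCS.

    `procedure` maps sample name -> {otu: read_count} and must contain "NCS".
    Returns the filtered study samples (the NCS itself is not returned).
    """
    ncs_otus = otus_present(procedure["NCS"])
    return {
        name: {otu: n for otu, n in counts.items() if otu not in ncs_otus}
        for name, counts in procedure.items()
        if name != "NCS"
    }


# Invented counts for one bronchoscopy procedure.
procedure = {
    "NCS": {"otu_reagent": 900, "otu_strep": 40},
    "OW": {"otu_strep": 50000, "otu_prevotella": 20000, "otu_reagent": 30},
    "PBAL1": {"otu_strep": 3000, "otu_reagent": 2500, "otu_veillonella": 1200},
}

filtered = remove_ncs_otus(procedure)
print(filtered["OW"], filtered["PBAL1"])
```

In this invented example the streptococcal OTU is discarded from both study samples because a few of its reads appear in the NCS, even though it is far more abundant in the oral wash. This is the kind of over-removal that motivates model-based contaminant identification.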
For further details on study design, sample collection, preparation of Salmonella samples, DNA extraction, qPCR, 16S rRNA gene sequencing and bioinformatics, please see the Additional file 3: Supplementary Methods.
The authors wish to thank Marit Aardal, Kristina Apalseth, Hildegunn Bakke Fleten, Ane Aamli Gagnat, Ingvild Haaland, Tuyen Thi Van Hoang, Gunnar Husebø, Kristel Knudsen, Sverre Lehmann, Lise Østgård Monsen, Randi Sandvik and Øistein Svanes for their contributions to the data collection and/or analyses.
TME, RN, HGW, EN and TK participated in the planning and collection of procedural samples in the MicroCOPD study. HGW planned the sequencing analyses. CD, TK, EN, TME and RN participated in planning and collection of procedural control samples. CD and TK performed DNA extraction and library preparation for sequencing. CD performed qPCR, bioinformatics analyses and drafted the manuscript. RN and TME participated in bioinformatics analyses and drafting of the manuscript. All authors participated in the revision of the manuscript and approved the final version for publication.
The MicroCOPD study was funded by unrestricted grants and fellowships from Helse Vest, Bergen Medical Research Foundation, the Endowment of timber merchant A. Delphin and wife through the Norwegian Medical Association and GlaxoSmithKline through the Norwegian Respiratory Society. The funding bodies had no role in the design of the study, data collection and analysis, interpretation of data, or in writing the manuscript.
Ethics approval and consent to participate
The study was approved by the Regional Committees for Medical and Health Research Ethics (REK-Nord, case # 2011/1307) and was conducted in accordance with the Declaration of Helsinki. All study subjects signed informed consent forms.
Consent for publication
CD, HGW, TK, EN: The authors declare that they have no competing interests.
TME: Reports bursary from Boehringer Ingelheim for educational meetings within the last three years, unrelated to the current study.
RN: Reports grants from GlaxoSmithKline, during the conduct of the study; grants from Boehringer Ingelheim, grants and personal fees from AstraZeneca, grants from Novartis, grants from Boehringer Ingelheim, personal fees from GlaxoSmithKline, outside the submitted work.
- 7. Dickson RP, Erb-Downward JR, Freeman CM, Walker N, Scales BS, Beck JM, et al. Changes in the lung microbiome following lung transplantation include the emergence of two distinct Pseudomonas species with distinct clinical associations. PLoS One. 2014;9. https://doi.org/10.1371/journal.pone.0097214.
- 13. Davis NM, Proctor D, Holmes SP, Relman DA, Callahan BJ. Simple statistical identification and removal of contaminant sequences in marker-gene and metagenomics data. bioRxiv. 2018. https://doi.org/10.1101/221499.
- 16. Glassing A, Dowd SE, Galandiuk S, Davis B, Chiodini RJ. Inherent bacterial DNA contamination of extraction and sequencing reagents may affect interpretation of microbiota in low bacterial biomass samples. Gut Pathog. 2016;8. https://doi.org/10.1186/s13099-016-0103-7.
Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.