Background

Metagenomic sequencing of cell-free DNA (cfDNA) offers a highly sensitive approach to screen for pathogens in clinical samples [1,2,3,4]. The sensitivity of metagenomic sequencing of cfDNA in plasma can be boosted by the implementation of library preparations optimized to recover short, degraded microbial cfDNA [5], or by strategies that selectively enrich microbial DNA or deplete host DNA [6,7,8]. A major remaining challenge is the relatively poor specificity of cfDNA metagenomic sequencing, which is limited by alignment noise, annotation errors in reference genomes, and environmental contamination [9].

Here, we report low biomass background correction (LBBC), a tool to filter background contamination and noise in cfDNA metagenomic sequencing datasets. We have applied LBBC to two independent datasets. We first re-analyzed a dataset from a previous study that investigated the utility of urinary cfDNA as an analyte to monitor urinary tract infection (UTI) [2]. Next, we generated a new dataset of cfDNA in amniotic fluid collected from uncomplicated pregnancies or those complicated with clinical chorioamnionitis at term, a common heterogeneous condition that can occur in the presence or absence of intra-amniotic infection [10]. We report a first, detailed study of the properties of cfDNA in amniotic fluid. For both datasets, detailed microbiologic workups, including results from conventional bacterial culture and/or PCR, were available to benchmark the LBBC workflow. We demonstrate that LBBC greatly improves the specificity of cfDNA metagenomic sequencing, while minimally affecting its sensitivity.

Results

To extract sequence information from cfDNA isolates, we used a single-stranded DNA library preparation that improves the recovery of microbial cfDNA relative to host cfDNA by up to 70-fold for cfDNA in plasma [5]. We quantified microbial cfDNA by alignment of sequences to microbial reference genomes [11, 12] (see the “Methods” section). We identified two classes of noise, which we addressed using a bioinformatics workflow that implements both novel and previously described filtering approaches [13, 14] (Fig. 1a). The first type of noise can be classified as “digital crosstalk” and stems from errors in alignment and contaminant sequences that are present in microbial reference genomes, including human-related sequences or sequences from other microbes. Digital crosstalk affects distinct segments of a microbial genome and gives rise to inhomogeneous coverage of the reference genome. We computed the coefficient of variation in the per-base genome coverage for all identified species (CV, computed as the standard deviation in genome coverage divided by the mean coverage) and removed taxa for which the CV differed greatly from the CV determined for a uniformly sampled genome of the same size (see the “Methods” section), because this indicated that a significant number of sequences assigned to the genome are due to digital crosstalk.

Fig. 1
figure 1

Algorithm design and application to metagenomic sequencing of urinary cfDNA. a Diagram of the major components of the LBBC workflow. b Genus-level bacterial cfDNA (in RGE, see bar) across 44 urinary cfDNA samples from a kidney transplant cohort. Samples (columns) are grouped by clinical diagnosis (EN, Enterococcus; EC, E. coli; Neg., negative) and sex of subject. Rows are individual genera detected. c Abundance matrix after application of LBBC

A second class of noise is due to physical contamination of the sample with environmental DNA present at the time of collection and in reagents used for DNA isolation and sequencing library preparation [13]. We reasoned that the total biomass of environmental DNA would be consistent for samples prepared in the same batch. LBBC filters environmental contaminants by performing batch variation analysis on the absolute abundance of microbial DNA quantified with high accuracy. The core elements of LBBC can be implemented using any metagenomics abundance estimation algorithm which makes use of sequence alignment to full microbial genomes. In our analysis, we estimate the genomic abundance of each species using a maximum likelihood model implemented in GRAMMy [12] (see the “Methods” section). GRAMMy helps ameliorate the impact of closely related genomes [12]. From the relative abundance of species, we compute the absolute number of molecules in a dataset corresponding to a specific species, considering differences in genome sizes for all identified microbes. The total biomass of microbial DNA is then estimated as the proportion of sequencing reads derived from a species, multiplied by the measured biomass inputted in the library preparation reaction. Recent approaches have identified environmental contaminants by (i) looking for batch-by-batch covariation in the relative abundance of microbes measured by metagenomic sequencing or (ii) examining the (inverse) correlation between biomass of the sample and the relative abundance of microbial DNA in the sample [13, 14]. These studies have shown the dramatic effect of environmental contamination in low biomass settings. LBBC effectively combines these two prior approaches into one. Using this analysis applied to the metagenomic cfDNA datasets described below, we estimate that the total biomass of environmental, contaminant DNA can exceed 100 pg (range of 0 to 230.4 pg). This is a small amount of DNA (< 1% of sequencing reads) that nonetheless can significantly impact the interpretation of metagenomic sequencing results. We further incorporated a known-template, negative control in the library preparation procedures to identify any remaining contaminant sequences. The use of a negative control is recommended for metagenomics studies [9] and was implemented in our previous work [2, 15]. Here, we compared the microbial abundance detected in samples to those in controls to set a baseline for environmental contamination. This analysis indicated that, on average, only 46% of physical contaminant species determined by LBBC are removed using comparison to a negative control alone, supporting the need for the additional filters implemented in LBBC.

We evaluated and optimized LBBC using a dataset available from a recently published study that assessed the utility of urinary cfDNA for the monitoring of bacterial infection of the urinary tract [2]. We analyzed 44 cfDNA datasets from male and female kidney recipients. These included 16 datasets from subjects with E. coli UTI, 11 datasets from subjects with Enterococcus UTI, and 17 datasets from subjects without UTI, as determined by conventional urine culture performed on the same day. Prior to application of the LBBC algorithm, the ratio of sequences assigned as non-host vs host (paired host reads relative to sequences assigned to microbial taxa) was 4.4 × 10−1 ± 1.68 in this dataset. We detected 616 bacterial genera across all 44 samples (Fig. 1b; RGE > 10−6), many of which were atypical in the urinary tract, including Herminiimonas and Methylobacterium, albeit at very low abundance.

We defined two parameters for threshold-based filtering; these are (1) the maximum difference in the observed CV and that of a uniformly sequenced taxon for the same sequencing depth and genome size, ΔCVmax, and (2) the minimum allowable within-batch variation, σ2min. A third, fixed parameter was used to remove species identified in the negative controls (threshold 10-fold the observed representation in the negative controls). We optimized these parameters based on following metric:

$$ {\mathrm{BC}}_{\mathrm{score}}={k}_{\mathrm{TP}}\left(\mathrm{TP}\right)+{k}_{\mathrm{TN}}\left(\mathrm{TN}\right)+{k}_{\mathrm{FP}}\left(\mathrm{FP}\right)+{k}_{\mathrm{FN}}\left(\mathrm{FN}\right)+{k}_U(U), $$

where {TP, TN, FP, FN} is the number of true positives, true negatives, false positives, and false negatives, respectively, U is the total number of identified taxa for which an orthogonal measurement was not performed, and the coefficients k for these values represent weights to optimize the filtering parameters. Here, we chose {kTP, kTN, kFP, kFN, kU} = {4, 2, − 1, − 2, − 0.2} and used nonlinear minimization by gradient descent on the variable BCscore to determine an optimal set of threshold parameters: {ΔCVmax, σ2min} = {2.00, 3.16 pg2}.

Applying LBBC with these parameters to urinary cfDNA microbiome profiles led to a diagnostic sensitivity of 100% and specificity of 91.8%, when analyzed against results from conventional urine culture. We computed a confusion matrix (see the “Methods” section) and determined the accuracy of the test to be 0.886 (no information rate, NIR = 0.386, p < 10−10). Without LBBC, the test achieved a sensitivity of 100% but a specificity of 3.3%, and an accuracy of 0.000 (as most samples have both E. coli and Enterococcus). Applying a simple filter that excludes taxa with relative abundance below a pre-defined threshold (RGE > 0.1) led to an accuracy of 0.864 (sensitivity of 81.5%, specificity of 96.7%); however, such filtering does not remove sources of physical or digital noise at high abundance and may remove pathogens present at low abundance. After applying LBBC, we observed far fewer bacterial genera outside of Escherichia and Enterococcus in samples from patients diagnosed with UTI (Fig. 1c). LBBC did not remove bacteria that are known to be commensal in the female genitourinary tract, including species from the genera Gardnerella and Ureaplasma [16]. For male subjects without UTI, we detected a single Lactobacillus species among all subjects, consistent with the view that the male urinary tract is sterile in the absence of infection. For patients with UTI, the urinary microbiomes were less diverse in males compared with females, as previously reported [17]. These examples illustrate that LBBC conserves key relationships between pathogenic and non-pathogenic bacteria.

We next applied LBBC to the analysis of cfDNA in amniotic fluid. Circulating cfDNA in maternal plasma has emerged as a highly valuable analyte for the screening of aneuploidy in pregnancy [18], but no studies have examined the properties of cfDNA in amniotic fluid. No studies have furthermore assessed the utility of amniotic fluid cfDNA as an analyte to monitor clinical chorioamnionitis, the most common diagnosis related to infection made in labor and delivery units worldwide [19]. Traditionally, it was thought that clinical chorioamnionitis was due to microbial invasion of the amniotic cavity (i.e., intra-amniotic infection), which elicits a maternal inflammatory response characterized by maternal fever, uterine tenderness, tachycardia, and leukocytosis as well as fetal tachycardia and a foul-smelling amniotic fluid [20, 21]. However, recent studies in which amniocentesis has been used to characterize the microbiologic state of the amniotic cavity and the inflammatory response [amniotic fluid interleukin (IL)-6 > 2.6 ng/ml [22]] show that only 60% of patients with the diagnosis of clinical chorioamnionitis have proven infection using culture or molecular microbiologic techniques [10]. The remainder of the patients has clinical chorioamnionitis in the presence of intra-amniotic inflammation (i.e., sterile intra-amniotic inflammation) or without neither intra-amniotic inflammation nor microorganisms in the amniotic cavity [10]. Therefore, the emergent picture is that clinical chorioamnionitis at term is a heterogeneous syndrome, which requires further study to optimize maternal and neonatal outcomes [23]. We analyzed 40 amniotic cfDNA isolates collected from the following study groups of women: (1) with clinical chorioamnionitis and detectable microorganisms (n = 10), (2) with clinical chorioamnionitis without detectable microorganisms (n = 15), and 93 without clinical chorioamnionitis (i.e., normal full-term pregnancies) (n = 15). Microorganisms were detected by cultivation and broad-range PCR coupled with electrospray ionization mass spectrometry or PCR/ESI-MS (see the “Methods” section). Data from several independent clinical assays were available, including levels of interleukin 6 (IL-6), white and red blood cell counts, and glucose levels (see the “Methods” section).

We obtained 77.7 ± 31.8 million paired-end reads per sample, yielding a per-base human genome coverage of 1.90 × ± 0.88 ×. The data provide unique insight into the properties of amniotic fluid cfDNA. For women carrying a male fetus, we used the coverage of the Y chromosome relative to autosomes to estimate the fetal fraction of cfDNA in amniotic fluid (see the “Methods” section). The fetal fraction ranged from 6.0 to 100% and was strongly anticorrelated with inflammatory markers such as IL-6 [24, 25] (Spearman’s rho of − 0.763, p = 1.34 × 10−4, n = 20; Fig. 2a). We attribute this observation to the recruitment of immune cells to the amniotic cavity during infection [26, 27]. We next used paired-end read mapping to determine the fragment length profiles of cfDNA in amniotic fluid (Fig. 2b). We found that amniotic fluid cfDNA was highly fragmented (median length 108 bp) and lacked the canonical peak at 167 bp typically observed in the fragmentation profile of plasma cfDNA [18, 28]. To determine size differences between fetal and maternal cfDNA in amniotic fluid, we computed the median fragment length for molecules derived from the X and Y chromosomes in cfDNA from male pregnancy samples. We hypothesized that if all cfDNA in a sample originated from the male fetus, the median fragment lengths for the X- and Y-aligned DNA would be equivalent, and, conversely, in samples with a large fraction of cfDNA originating from the mother, a length discrepancy may arise. Using this approach, we found that fetal-derived cfDNA was shorter than maternal-derived cfDNA (up to 31 bp shorter; Fig. 2c). Previous reports have similarly noted that fetal cfDNA in urine and plasma is shorter than maternal cfDNA [29, 30].

Fig. 2
figure 2

Properties of fetal, maternal, and microbial cfDNA in amniotic fluid. a Comparison of IL-6 levels to the fraction of reads derived from the fetus. b Fragment length profile of chromosome 21 derived cfDNA in amniotic fluid (n = 40). c Comparison of clinically measured IL-6 levels to the difference in the median fragment length for cfDNA originating from the X and Y chromosomes. Colors for a and c correspond to clinical status. d Bacterial species and viral families detected by cfDNA metagenomic sequencing and LBBC. Crosshairs indicate bacteria identified by 16S sequencing. Chor./−, chorioamnionitis, no detectable microorganisms; Chor./+, chorioamnionitis, detectable microorganisms

We next examined the utility of LBBC for the diagnosis of clinical chorioamnionitis. Prior to the application of the LBBC algorithm, the ratio of sequences assigned as non-host vs host (paired host reads relative to sequences assigned to microbial taxa) was 1.08 × 10−2 ± 4.76 × 10−2 in this dataset. After applying LBBC with a relaxed batch variation minimum to account for species-level analysis (σ2min = 1 pg2), no bacteria were detected in the normal pregnancy group (Fig. 2d), in line with recent studies that point to a sterile amniotic cavity and placenta in the absence of infection [31, 32]. The cfDNA sequencing assay detected only 6 of the 14 bacterial genera identified by bacterial culture or PCR/ESI-MS, and was unable to identify a fungal pathogen, Candida albicans, detected by PCR/ESI-MS (see the “Methods” section). We asked if these false negatives were due to LBBC filtering. Relaxation of the filtering thresholds revealed that Ureaplasma was removed in four samples by the batch variation filter; other false negatives were not due to LBBC filtering. Interestingly, in all cases of chorioamnionitis without detectable microorganisms, no bacterium was identified (Fig. 2d), in line with previous evidence showing that chorioamnionitis and intra-amniotic inflammation can occur in the absence of microbial invasion of the amniotic cavity [10]. Last, in two samples, we identified a high burden of viral DNA, including papillomavirus in one sample and bacteriophage in another (Fig. 2d), demonstrating the utility of cfDNA paired with LBBC to detect viruses in the amniotic fluid.

Discussion

cfDNA metagenomic sequencing is emerging as a powerful approach to screen for infection [3]. The technique has inherent high sensitivity, but lower specificity. Here, we described LBBC, a simple computational workflow to filter background contamination and noise in cfDNA metagenomic sequencing datasets. LBBC analyzes batch effects, the uniformity of the genome coverage and the relationship between microbial abundance and total biomass of the sample to identify and filter noise contributions. Though batch effects can arise at any step, we found some steps are more prone to contamination and hence batch effects, in particular the cfDNA extraction batch, while others had very little effect, e.g., the sequencing instrument. Other possible batch effects include the date of processing (influencing reagent batch) and location where samples were prepared (e.g., in a clean room or in a lab environment with multiple experiments being performed); the sources of contamination in metagenomic sequencing and batch effects have been reviewed recently [9, 33].

The three filtering steps implemented in LLBC are appropriate for the analysis of any low-biomass sample, not limited to cfDNA isolates, and can be readily implemented, in a modular fashion, provided that (1) the total DNA biomass going into the sample preparation is measured and recorded, (2) batch information is available, and (3) the microbial abundance is determined by a sequence-based alignment method [12]. This last point is of importance, because of the several popular algorithms for metagenomic sequence classification, including Metaphlan, which relies on reduced reference genomes [34]. Such approaches preclude the ability to estimate sequencing coverage uniformity across the genome, required for the CV filter that is part of LBBC [12]. To our knowledge, LBBC is the first filtering scheme to analyze sequencing coverage heterogeneity across thousands of microbial genomes and filter results based on the coefficient of variation in sequence coverage.

Conclusions

We have described LBBC, a bioinformatics noise filtering tool informed by the uniformity of the coverage of microbial genomes and the batch variation in the absolute abundance of microbial cfDNA. We applied LBBC to a recently published urinary cfDNA dataset. Comparison to clinical testing showed that LBBC greatly improves the specificity of metagenomic cfDNA sequencing while minimally affecting the sensitivity of the assay (Fig. 1). We next applied LBBC to a novel dataset of cfDNA from the amniotic fluid of subjects with and without clinical chorioamnionitis. This dataset allowed us to characterize the properties of maternal and fetal DNA in the amniotic sac for the first time (Fig. 2). While LBBC greatly reduces the noise in metagenomic sequencing, some technical challenges, inherent to metagenomic read assignments, remain. For example, some reads, originating from a source microbe, can incorrectly align to taxa with a highly similar genome; LBBC reduces the frequency of erroneous read assignments, but it does not completely remove these reads.

The application of LBBC to a new dataset of cfDNA in amniotic fluid revealed a bacteria-free environment in healthy full-term pregnancies and in a subset of patients with clinical chorioamnionitis and intra-amniotic inflammation as well as in the presence of pathogenic bacteria in many cases of clinical chorioamnionitis with intra-amniotic infection and inflammation. In addition, few microbial taxa were identified in cases of chorioamnionitis with no detectable bacteria via culture or PCR/ESI-MS. In summary, metagenomic cfDNA sequencing, complemented with a background reduction workflow, enables identification of potential pathogens in clinical samples with both high sensitivity and specificity.

Methods

Sample description—urinary cfDNA

Forty-four sample datasets were selected from a recent study [2]. Urine samples were collected under an Institution Review Board protocol that was approved at Weill Cornell Medicine. All subjects provided written informed consent. Datasets were selected from the study from one of two groups: (1) UTI—those corresponding to a same-day positive urine culture (> 10,000 CFU/mL) indicating monomicrobial E. coli, Enterococcus faecium, or Enterococcus faecalis UTI. A single sample from the original study [2] (GU14) was excluded due to the high likelihood that it was R. ornithinolytica infection incorrectly diagnosed as an E. coli UTI. (2) No UTI—samples from patients with same-day negative standard urine culture and no microorganisms detected at earlier or later dates. Sample metadata is included in Additional file 1.

Sample description—amniotic fluid cfDNA

Forty samples were collected from a cohort of subjects with full-term pregnancy, which were uncomplicated (n = 15), or burdened with clinical chorioamnionitis with detectable microorganisms (n = 10) or clinical chorioamnionitis without detectable microorganisms (n = 15). Amniotic fluid samples were obtained by transabdominal amniocentesis performed for evaluation of the microbial and inflammatory status of the amniotic cavity in patients with clinical chorioamnionitis, whereas women approaching term underwent an amniocentesis for assessment of fetal lung maturity. Twenty of the 40 samples were from mothers pregnant with male fetus. Clinical chorioamnionitis was diagnosed by the presence of maternal fever (temperature > 37.8 °C) accompanied by two or more of the following criteria: (1) uterine tenderness, (2) foul-smelling amniotic fluid, (3) fetal tachycardia (heart rate > 160 beats/min), (4) maternal tachycardia (heart rate > 100 beats/min), and (5) maternal leukocytosis (leukocyte count > 15,000 cells/mm3) [20, 24]. Amniotic fluid samples were transported to the clinical laboratory in a sterile capped syringe and cultured for aerobic and anaerobic bacteria, including genital Mycoplasmas. The clinical tests also included the determination of amniotic fluid white blood cell (WBC) count [35], glucose concentration [36], and Gram stain [37]. Microbial invasion of the amniotic cavity was defined as a positive amniotic fluid culture and/or polymerase chain reaction with electrospray ionization mass spectrometry (PCR/ESI-MS) (Ibis® Technology—Pathogen, Carlsbad, CA, USA) test result [38]. Intra-amniotic inflammation was defined as an amniotic fluid IL-6 concentration > 2.6 ng/mL [22]. Sample metadata is included in Additional file 1.

cfDNA extraction and library preparation

Amniotic fluid samples were thawed from − 80 °C and centrifuged at 1500×g for 5 min. The top 175 μL of supernatant was removed and placed in a 1.5-mL tube with 825 μL of 1 × PBS and pipette mixed. The amniotic fluid was diluted to 1 mL in PBS, and cfDNA was isolated using the “Urine Supernatant 1 mL” protocol of the QiaAmp circulating nucleic acid extraction kit. Total cfDNA was eluted into 30 μL of the elution buffer. The DNA concentration was determined using the Qubit 3.0 Fluorometer (dsDNA HS Qubit). Libraries of extracted amniotic fluid cfDNA were prepared using a single-stranded DNA library preparation method. For this study, sample batches were not continuous between the cfDNA extraction, library preparation, and sequencing steps due to sample processing constraints. LBBC can address batch effects at any stage but will perform best if samples are maintained in the same batch throughout sample processing.

cfDNA sequencing

Paired-end DNA sequencing was performed on Illumina NextSeq 500 (2 × 75 bp) at Cornell University or Illumina HiSeq (2 × 100 bp) at Michigan State University. Paired-end fastq files were trimmed to 75 bp, and samples processed on both NextSeq and HiSeq platforms were concatenated into a single file for each sample.

Fetal fraction determination

Adapter-trimmed reads were aligned to the UCSC hg19 build using bwa mem [39]. Duplicates, low-quality reads, and reads with secondary sequence alignments were removed. Aligned bam files were processed in 500 bp windows using the R package HMMcopy (version 1) [40]. We determined the coverage exclusively in these regions with high mappability scores to extrapolate the coverage of the whole chromosome. The fetal fraction was determined as 2Y/A for subjects who were known to be pregnant with male fetuses, where Y and A are the inferred sequencing coverage of the Y chromosome and autosomes, respectively. To confirm the accuracy of the measurement, we ran the algorithm on samples from subjects with female fetuses, which we would expect to have a zero fetal fraction. We determined very few misalignments to the Y chromosome (median 2.6%, n = 20).

Microbial abundance determination

Fastq files were trimmed (Trimmomatic-0.32 [41]) and aligned to the human genome (UCSC hg19 build) using bowtie2 [42] (in very sensitive mode, version 2.3.5.1). Human-unaligned reads were retrieved and aligned to an annotated NCBI microbial database using BLAST [11] (blastn, NCBI BLAST 2.2.28+). After read alignment, a maximum likelihood estimator, GRAMMy (version 1), was used to adjust the BLAST hits [12]. The adjusted hits to each taxon and respective genome size of each taxon were used to compute the taxon genome coverage. The ratio of each taxon’s genomic coverage to that of human chromosome 21 was used to compute the relative genomic abundance of each taxon in each sample.

Low biomass background correction

The biomass correction method was employed in three steps: (1) BLAST hits were collected for every taxon with ten alignments or more. Genomes were aggregated into 1-kbp bins and the number of alignments within each bin was determined. The coefficient of variation (the standard deviation in alignments per bin divided by the mean number of alignments per bin) was calculated for each taxon in the sample. Given the number of alignments to a specific taxon and the taxon size, we randomly generated reads across the genome to simulate uniform sampling. The CV of this simulated taxon was calculated (CVsim). The difference between the CV and CVsim (ΔCV) was then determined to look at coverage statistic discrepancy. CV and ΔCV were calculated for every taxon in every sample in the cohort. Taxa were removed if they exceeded a maximum allowable ΔCV value.

(2) The mass of each taxon present in a sample was calculated by calculating the adjusted number of BLAST hits from GRAMMy, dividing by the total number of sequencing reads, and multiplying by the mass of DNA added into library preparation (measured using a Qubit 3.0 Fluorometer). Taxon biomasses were compared across samples extracted or prepared within batches using the “cov” command standard in R. The diagonal of the output matrix reveals the variation within the batch for a given taxon. Taxa with variation below the minimum filtering parameter (σ2) were removed from every sample in the batch.

(3) For all of our wet lab procedures, a negative control (dsDNA synthetic oligos of length 25 bp, 40 bp, 55 bp, and 70 bp; each resuspended 0.20 μM eluted in TE buffer) was processed alongside samples in batches. Microbial controls were sequenced alongside samples and were designed to take up 1–3% of the sequencing lane (roughly four to 12 million reads). Control samples were processed through the bioinformatics pipeline, and the taxa read proportion was calculated (raw BLAST hits to a taxon divided by total raw sequencing reads). The taxa read proportion was calculated in samples and compared with that in the controls. Taxa for which the read proportion did not exceed 10-fold higher than the contaminant read proportion were removed. Following processing, the relative genomic abundance (measured in relative genomic equivalents, RGE) was summed for taxa to the species, genus, or family level, depending on desired output.

Correction optimization

To facilitate the optimization of filtering parameters ΔCVmax and σ2min, we created a store based on a linear combination of values related to the true positive, true negative, false positive, and false negative rates. We optimized these parameters based on the following metric:

$$ {\mathrm{BC}}_{\mathrm{score}}={k}_{\mathrm{TP}}\left(\mathrm{TP}\right)+{k}_{\mathrm{TN}}\left(\mathrm{TN}\right)+{k}_{\mathrm{FP}}\left(\mathrm{FP}\right)+{k}_{\mathrm{FN}}\left(\mathrm{FN}\right)+{k}_U(U), $$

where {TP, TN, FP, FN} is the number of true positives, true negatives, false positives, and false negatives, respectively; U is the total number of identified taxa for which a secondary method of identification was not performed; and the coefficients k for these values represent weights to optimize the filtering parameters based on the specifics of the application. Here, we chose {kTP, kTN, kFP, kFN, kU} = {4, 2, − 1, − 2, − 0.25} and used nonlinear minimization by gradient descent to minimize (1 – BCscore) to determine an optimal set of threshold parameters.

Other statistical analyses

All statistical analyses were performed in R. Correlation measurements were performed using Spearman correlations (function cor.test). To compute the confusion matrix in analysis of the urinary cfDNA datasets, we constructed four possible observable states for each sample: Escherichia positive, Enterococcus positive, both Escherichia and Enterococcus positive, and double negative. Observation of the state was determined with the reduced microbial matrix after filtering. Observed state was compared with standard urine culture as the reference. A 4 × 4 confusion matrix was constructed, and statistics, including the accuracy and no information rate, were determined using the “confusionMatrix” command from the R caret package.

Versions of software and references

Reads were aligned to human genome build hg19. Nonhuman reads were aligned to a NCBI reference database (downloaded 2015). The following packages (with versions) were used to build the LBBC package and analyze the data in R (version 3.6.1): caret (6.0-84), data. table (1.12.6), devtools (2.2.1), ggplot2 (3.2.1), ggpubr (0.2.3), ineq (0.2-13), MASS (7.3-51.4), reshape2 (1.4.3), roxygen2 (6.1.1), and taxize (0.9.9).