Background

Sequencing the genomes of microbial communities within a host organism's environment has opened new avenues for research into host-microbe interactions. After metagenomic sequencing, several analysis steps are necessary to achieve a comprehensive understanding of the microbiome's composition. Due to the often massive amount of data involved, efficient processing is essential to minimise unnecessary computations. Sequenced data often contains sequences from the host and privacy concerns arise when the host is human. Additionally, non-microbial sequences could introduce bias in downstream analyses. Therefore, the removal of host sequences should be prioritised at an early stage [1].

Decontamination is often managed in an ad-hoc manner by utilising alignment tools to search reads against a host genome database. Ad-hoc approaches may be unnecessarily complicated and could lead to suboptimal performance. The lack of standardised best practices for this procedure hinders comparison across studies.

Some generic metagenome analysis pipelines, such as ATLAS [2] and Sunbeam [3] have integrated modules for host decontamination. ATLAS employs BBsplit [4, 5], while Sunbeam employs BWA [6] as its decontamination method. Using Bowtie2 [7] with the ‘un-conc’ option is also occasionally suggested for host decontamination. With the ‘un-conc’ option, Bowtie2 requires both reads in a pair to map concordantly to the genome.

We could only identify two dedicated tools specifically designed for removing contaminating sequences: DeconSeq and GenCoF. DeconSeq [8] is a command-line tool for identifying and removing sequence contamination from genomic and metagenomic datasets. DeconSeq integrates a modified version of BWA-SW [6], its underlying classifier, directly into its source code, making modifications challenging. DeconSeq supports only single-end Illumina reads, and its code has not been updated since 2013. GenCoF [9] is a graphical user interface for rapidly removing human genome contaminants from metagenomic datasets, limited to short reads. Its GUI nature makes it unsuitable for scripting, and extending GenCoF is difficult as it has Bowtie2 [7] integrated directly into its source code. Both tools have clear limitations, rendering them less suitable for most large-scale datasets. Thus, we developed the new tool HoCoRT to address this gap. To provide recommendations for different circumstances and default settings, we evaluated the performance of various underlying classification methods.

Implementation

HoCoRT is an open-source command-line-based tool written in Python 3. It is designed to be user-friendly and can be effortlessly installed as a package using Bioconda [10] or as a Docker container. HoCoRT features a modular pipeline design and utilises well-established classification, mapping, and alignment tools to classify sequences into host and non-host (microbial) sequences. The current pipeline modules encompass the BBMap tool in the BBTools suite [4, 5], BioBloom [11], Bowtie2 [7], BWA-MEM2 [12], HISAT2 [13], Kraken2 [14], and Minimap2 [15]. Moreover, modules can pipe data through different tools sequentially. While users can configure pipeline options, recommended defaults are provided. HoCoRT can be extended by creating new modules that utilise other tools. Additionally, HoCoRT offers a comprehensive Python library with an API that can serve as a backend for other tools. HoCoRT supports both reading and writing optionally compressed FASTQ files. The tool also manages the construction of database index files. Built-in help functions and error messages ensure the tool's documentation is readily accessible. HoCoRT relies on Samtools [16]. For a comprehensive list of all software mentioned in this work, including version numbers and references, please refer to Additional file 1: Table S1.

Evaluation

The classification speed and accuracy of HoCoRT using several different underlying methods and settings were investigated with synthetic and real-world datasets. The GitHub repository at https://github.com/ignasrum/hocort-eval provides the scripts used to generate the synthetic datasets and conduct performance evaluations.

HoCoRT was evaluated on synthetic HiSeq, MiSeq and Nanopore data mimicking human gut and oral microbiomes. The synthetic human gut microbiome datasets comprised a mix of 1% human host sequences and 99% microbial sequences, while the synthetic human oral microbiome datasets included a mix of 50% human host and 50% microbial sequences. Human reads were derived from the Genome Reference Consortium Human Build 38 patch release 13 (GRCh38.p13), while microbial reads were extracted pseudo-randomly from a set of 100 bacterial, fungal, and viral sequences from NCBI GenBank [17]. Accession numbers for the microbial genome sequences are provided in the GitHub repository. To assess the variance in performance, seven different datasets (using distinct random seeds) were generated for each of the six combinations of microbiome (gut and oral) and sequencing technology (HiSeq, MiSeq, and Nanopore), resulting in a total of 42 datasets. Each synthetic short read dataset contains 5 million read pairs randomly generated using InSilicoSeq [18] with the prebuilt HiSeq (2 × 125 bp) and MiSeq (2 × 300 bp) error profiles, while each long read dataset contains 2.5 million single-ended reads generated using NanoSim [19] (average 2159 bp, range 54–98,320 bp). The ‘read profile’ used by NanoSim was generated from the NCBI Sequence Read Archive (SRA) dataset with accession ERR3279199. This dataset consists of unpaired human Nanopore MinION sequencing reads, more specifically, the NA12878 sample and another individual with ataxia-pancytopenia syndrome. The Nanopore basecaller chosen was Guppy.

Seventeen pipelines were examined using Illumina data: Seal, BBDuk, BBSplit, BioBloom, Bowtie2 in end-to-end and local mode, both with and without the ‘un-conc’ option, HISAT2, Kraken2, BBMap in default and fast mode, BWA-MEM2, Kraken2 followed by Bowtie2 in end-to-end mode, Kraken2 followed by HISAT2, Minimap2, and finally Kraken2 followed by Minimap2. Four pipelines were examined using Nanopore data: BioBloom, Minimap2, Kraken2 followed by Minimap2, and Kraken2. CONSULT [20] was considered, but the lack of pre-compiled binaries or packages and its considerable memory requirements make it impractical. CLARK [21] was also considered, but not included due to its primary taxonomic classification focus. The newer Kraken2 tool has been shown to be many times faster and much less memory-demanding than CLARK without any loss of accuracy [14].

The ability to detect human host sequences was tested, and the sensitivity, precision, and accuracy were calculated. True positives (TP) were sequences correctly identified as human, while false positives (FP) were sequences incorrectly identified as human. True negatives (TN) represented sequences correctly identified as microbial, and false negatives (FN) were sequences incorrectly identified as microbial. Sensitivity was calculated as TP/(TP + FN), precision was calculated as TP/(TP + FP), and accuracy was calculated as (TP + TN)/(TP + FP + TN + FN). Given the synthetic human gut microbiome datasets’ 1% human sequences, accuracy here primarily reflects precision, while the accuracy calculated for the oral microbiome datasets is more balanced. When recommending tools, accuracy was prioritised, followed by speed. Performance analysis utilised a Snakemake pipeline [22] on a desktop PC with an AMD Ryzen 7 1700X 8 core/16 thread 3.4 GHz CPU, 64 GB RAM and 4 TB HDD running Linux. No quality filtering or other pre-processing was conducted.

HoCoRT’s performance was compared to DeconSeq using two synthetic human gut datasets with 5 million single-ended short reads each; one with HiSeq and one with MiSeq reads. They were generated as described above, but with single-ended reads, due to the inability of DeconSeq to handle paired-end reads.

The performance of HoCoRT was also evaluated using two real human gut microbiome datasets from the SRA. The first dataset (SRR18498477) consists of gut microbiomes from people living with HIV sequenced using Illumina HiSeq technology. The second dataset (SRR9847864) comprises three healthy human gut microbiome samples sequenced using Oxford Nanopore Technology. We employed BLAST [23] in MegaBLAST mode with an E-value threshold of 1∙10–10 and HoCoRT with both Bowtie2 and Minimap2 pipelines to assess the amount of remaining human host contamination.

Results and discussion

The classification speed and accuracy of HoCoRT on the synthetic gut microbiome are shown in Fig. 1 and Table 1. Overall, BioBloom, Bowtie2 in end-to-end mode, and HISAT2 performed best for short reads and are recommended due to high accuracy and speed across scenarios. The best tools detect almost all human short reads, but also incorrectly include a small number of bacterial reads. Kraken2 consistently displayed the highest speed with a minor reduction in accuracy. For long reads, the sensitivity decreased substantially, with only 59% of human reads detected in the best-case scenario, achieved by a combination of Kraken2 and Minimap2. Synthetic oral microbiome results are presented in Additional file 1: Fig. S1 and Table S2. These results are similar to the gut microbiome results, but Bowtie2 was clearly slower than the other options, while also the most accurate, in particular for the MiSeq reads. This may be due to the higher number of aligned human sequences.

Fig. 1
figure 1

Overview of HoCoRT performance on simulated gut microbiome datasets. Box plots of HoCoRT runtime in seconds (top) and classification accuracy (bottom) using several different classification modules and parameters on Illumina HiSeq (yellow, left), MiSeq (cyan, middle) and Nanopore data (red, right). Table 1 contains additional results, including those for BioBloom (on Nanopore data), BBMap, BBSplit, Bowtie2 with the ‘un-conc’ option, and BWA-MEM2, which were excluded from this figure due to outliers

Table 1 Detailed HoCoRT performance on simulated human gut microbiome datasets

HoCoRT’s performance was compared to DeconSeq using human gut datasets with short reads. The HoCoRT Bowtie2 (end-to-end) pipeline exhibited substantially higher alignment speed than DeconSeq for both HiSeq (34X) and MiSeq (49X) reads, and slightly better accuracy, as shown in Additional file 1: Table S3.

Lastly, HoCoRT’s performance was evaluated using two real human gut microbiome datasets. Up to 0.03% of the reads in these datasets were identified as human, as shown in Additional file 1: Table S4. Minimap2 identified the highest number of reads, followed by BLAST and then Bowtie2. BLAST required about 40 times more time than the other tools. Since the true number of human reads is unknown, we cannot calculate sensitivity and precision. Based on the results from the synthetic datasets, most of the true human reads are probably detected, but how many of the predicted human reads that really are microbial is difficult to estimate.

For short reads, based on the overall very high sensitivity of most tools, it appears that almost all human host sequences can be detected reliably in microbiomes, while a small number of microbial sequences may be incorrectly classified as human. If necessary, some precision may be traded-off for decreased run-time, depending on the specific use case and how important it is to keep as many microbial sequences as possible.

For long reads, the situation is more challenging and only about 59% of the actual human host reads were detected in the best-case scenario. Improved tools are required to reliably detect human host contamination in long reads with the error profiles studied.

Additional results and a comprehensive description of HoCoRT can be found in the first author’s master’s thesis [24].

Conclusions

A dedicated, flexible, extendable, and modular tool for removing host sequence contamination is now available, free of charge. We have conducted a comprehensive comparison of classification methods and offer corresponding recommendations. The HoCoRT tool is expected to streamline the decontamination step in microbiome data analysis and deliver reliable performance.

Availability and requirements

Project name: HoCoRT. Project home page: https://github.com/ignasrum/hocort. Operating system(s): Linux and macOS (except for the BioBloom module). Programming language: Python. Other requirements: Samtools [14] and other packages. Please see GitHub repository for details. License: MIT license. Any restrictions to use by non-academics: None.