SEPATH: benchmarking the search for pathogens in human tissue whole genome sequence data leads to template pipelines
Human tissue is increasingly being whole genome sequenced as we transition into an era of genomic medicine. With this arises the potential to detect sequences originating from microorganisms, including pathogens amid the plethora of human sequencing reads. In cancer research, the tumorigenic ability of pathogens is being recognized, for example, Helicobacter pylori and human papillomavirus in the cases of gastric non-cardia and cervical carcinomas, respectively. As of yet, no benchmark has been carried out on the performance of computational approaches for bacterial and viral detection within host-dominated sequence data.
We present the results of benchmarking over 70 distinct combinations of tools and parameters on 100 simulated cancer datasets spiked with realistic proportions of bacteria. mOTUs2 and Kraken are the highest performing individual tools achieving median genus-level F1 scores of 0.90 and 0.91, respectively. mOTUs2 demonstrates a high performance in estimating bacterial proportions. Employing Kraken on unassembled sequencing reads produces a good but variable performance depending on post-classification filtering parameters. These approaches are investigated on a selection of cervical and gastric cancer whole genome sequences where Alphapapillomavirus and Helicobacter are detected in addition to a variety of other interesting genera.
We provide the top-performing pipelines from this benchmark in a unifying tool called SEPATH, which is amenable to high throughput sequencing studies across a range of high-performance computing clusters. SEPATH provides a benchmarked and convenient approach to detect pathogens in tissue sequence data helping to determine the relationship between metagenomics and disease.
Keywords: Metagenomics, Pipeline, Taxonomy, Classification, SEPATH, Cancer, Pathogens, Bioinformatics, Bacteria, Viral
BAM: Binary alignment map file format
HPCC: High performance computing cluster
NCBI: National Center for Biotechnology Information
PPV: Positive predictive value (precision)
RAM: Random access memory
The estimated incidence of cancer attributed to infection surpasses that of any individual type of anatomically partitioned cancer. Human papillomavirus (HPV) causes cervical carcinoma, and Helicobacter pylori facilitates gastric non-cardia carcinoma induction [2, 3]. The role of HPV in tumorigenesis is understood and has clinical implications: HPV screening programs have been adopted and several vaccines exist, targeting a wide range of HPV subtypes. The amount of whole genome sequencing data generated from tumor tissue is rapidly increasing with recent large-scale projects including The Cancer Genome Atlas (TCGA) Program, International Cancer Genome Consortium (ICGC) (including the Pan-Cancer Analysis of Whole Genomes, PCAWG), Genomics England’s 100,000 Genomes Project, and at least nine other large-scale national sequencing initiatives emerging. When such samples are whole genome sequenced, DNA from any pathogens present will also be sequenced, making it possible to detect and quantify pathogens, as recently shown in cancer by Feng et al. and Zapatka et al. Protocols for these projects do not typically encompass negative control samples and do not use extraction methods optimized for microbiome analysis, yet careful consideration of contamination and correlation of output results with clinical data could generate hypotheses without any additional cost for isolated metagenomics projects. The scope of potential benefits from analyzing cancer metagenomics is broad and could inform multiple prominent research topics including cancer development, treatment resistance, and biomarkers of progression. It is therefore important to consider the performance of pathogen sequence classification methods in the context of host-dominated tissue sequence data.
Traditionally, the identification of microbiological entities has centered around culture-based methodologies. More recently, there has been an increase in taxonomic profiling by using amplicon analysis of the 16S ribosomal RNA gene. Whole genome sequencing, however, presents an improved approach that can interrogate all regions of every constituent genome whether prokaryotic or not and provides a wider range of possible downstream analyses. The increasingly widespread use of whole genome sequencing technologies has resulted in an explosion of computational methods attempting to obtain accurate taxonomic classifications for metagenomic sequence data. Typically, these tools rely on references of assembled or partially assembled genomes to match and classify each sequencing read or assembled contig. One issue with this approach is that there exists an uneven dispersion of interest in the tree of life, rendering some clades underrepresented or entirely absent. Furthermore, sequence similarity between organisms and contamination in reference genomes inhibit the perfect classification of every input sequence [14, 15, 16]. A recent study has shown that the increasing size of databases such as NCBI RefSeq has also resulted in more misclassified reads at species level with reliable classifications being pushed higher up the taxonomic tree. Because of this species-level instability, we initially chose to carry out metagenomic investigations at a genus level, prior to investigating lower taxonomic levels, particularly for experiments with low numbers of non-host sequences.
Computational tools for metagenomic classification can be generalized into either taxonomic binners or taxonomic profilers. Taxonomic binners such as Kraken [18, 19], CLARK, and StrainSeeker attempt to make a classification on every input sequence whereas taxonomic profilers such as MetaPhlAn2 [22, 23] and mOTUs2 [24, 25] typically use a curated database of marker genes to obtain a comparable profile for each sample. This generally means that taxonomic profilers are less computationally intensive in comparison with binners but may be less effective with low amounts of sequences. Although there is a large number of tools available purely for sequence classification, at the time of writing, there is a limited selection of computational pipelines available that process data at high throughput and produce classifications from raw reads with all appropriate steps including quality control. Examples of these include PathSeq [26, 27, 28], which utilizes a BLAST-based approach, and IMP, which utilizes MaxBin for classification.
Community-driven challenges such as Critical Assessment of Metagenome Interpretation (CAMI) provide one solution to independently benchmark the ever-growing selection of tools used for metagenomic classification. CAMI provides a useful starting point for understanding classification tools on samples with differing complexity, but it is unlikely to provide an accurate comparison for more niche areas of taxonomic classification such as ancient microbiome research or for intra-tumor metagenomic classification dominated by host sequences.
Classifying organisms within host tissue sequence data provides an additional set of challenges. In addition to the limitations in the tool performance, there is also a low abundance of pathogenic sequences compared to the overwhelming proportion of host sequence data as well as high inter-sample variability. Cancer sequences are also known to be genetically heterogeneous and unstable in nature providing a further cause for caution when classifying non-host sequences and rendering the accurate removal of host reads difficult [33, 34, 35].
Here, we present and discuss the development of SEPATH, template computational pipelines designed specifically for obtaining classifications from within human tissue sequence data and optimized for large WGS studies. This paper provides rationale for the constituent tools of SEPATH by analyzing the performance of tools for quality trimming, human sequence depletion, metagenomic assembly, and classification. We present the results of over 70 distinct combinations of parameters and post-classification filtering strategies tested on 100 simulated cancer metagenomic datasets. We further assess the utility of these pipelines by running them on a selection of whole genome cancer sequence data. We analyze a selection of samples from cervical cancer, where it is expected that Alphapapillomavirus will be frequently identified, and gastric cancer, where it is expected that Helicobacter will be identified. A selection of 10 pediatric medulloblastoma samples is also analyzed, for which few, if any, taxa are expected to be identified due to the historically noted sterility of the brain, although this is currently a subject of debate within the scientific community.
The process of obtaining pathogenic classifications from host tissue reads can be broken down into a few key computational steps: sequence quality control, host sequence depletion, and taxonomic classification. For these computational steps, a series of tools and parameters were benchmarked on simulated metagenomes (see the “Methods” section). These metagenomes emulate empirical observations from other cancer tissue sequence data, with the percentage of human reads ranging from 87 to > 99.99%. Genomes from 77 species were selected as constituents for the metagenomes. These species were identified from Kraal et al. with additional bacterial species associated with cancer, e.g., Helicobacter pylori (see Additional file 1 for a full description of each simulation).
Human sequence depletion
A large proportion of sequence reads from tumor whole genome sequencing datasets are human in origin. It is essential to remove as many host reads as possible—firstly, to limit the opportunity for misclassification and, secondly, to significantly reduce the size of data thereby reducing the computational resource requirement.
Three methods of host depletion were investigated on 11 simulated datasets (2 × 150 bp Illumina reads). Two of these methods were k-mer-based methods: Kontaminant [39, 40] and BBDuk. The third method involved extracting unmapped reads following BWA-MEM alignment, an approach that is facilitated by the likelihood that data will be available as host-aligned BAM files in large-scale genomic studies. BWA-MEM is used as a baseline, and parameters were set to be as preservative as possible of any potential non-human reads.
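As an illustration of the alignment-based baseline, the following minimal sketch selects read pairs from SAM-formatted text using the standard flag bits for "read unmapped" (0x4) and "mate unmapped" (0x8). It is not the SEPATH implementation: the function names, the toy records, and the choice to retain a pair when either mate is unmapped (one possible reading of "as preservative as possible") are assumptions made for this example.

```python
READ_UNMAPPED = 0x4   # SAM flag bit: this read did not align
MATE_UNMAPPED = 0x8   # SAM flag bit: its mate did not align


def non_host_read_names(sam_lines, require_both=False):
    """Return names of read pairs to retain after host alignment.

    By default a pair is retained when either mate is unmapped (an assumed,
    permissive policy); set require_both=True to keep only pairs in which
    neither mate aligned to the host reference.
    """
    kept = {}  # dict used as an insertion-ordered set
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        read_un = bool(flag & READ_UNMAPPED)
        mate_un = bool(flag & MATE_UNMAPPED)
        keep = (read_un and mate_un) if require_both else (read_un or mate_un)
        if keep:
            kept[name] = True
    return list(kept)


def sam_record(name, flag):
    """Build a minimal, made-up SAM line; only name and flag matter here."""
    return "\t".join([name, str(flag), "*", "0", "0", "*", "*", "0", "0",
                      "ACGT", "IIII"])


example = [
    "@HD\tVN:1.6",
    sam_record("pairA", 77), sam_record("pairA", 141),  # both mates unmapped
    sam_record("pairB", 99), sam_record("pairB", 147),  # both mates mapped
    sam_record("pairC", 73), sam_record("pairC", 133),  # one mate unmapped
]
print(non_host_read_names(example))
print(non_host_read_names(example, require_both=True))
```

On real data the same flag logic would normally be applied to BAM files with a dedicated library or command-line tool rather than by parsing text.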
In an attempt to capture k-mers specific to cancer sequences, a BBDuk database was generated containing human reference genome 38 concatenated with coding sequences of all cancer genes in the COSMIC database. With the additional cancer sequences, a near-identical performance was obtained when compared with just the human reference database (Fig. 1b, c). Therefore, including extra cancer sequences did not alter the retention of pathogen-derived reads, providing an opportunity for increased human sequence removal on real data without sacrificing bacterial sensitivity. To test a BBDuk database capturing a higher degree of human sequence variation, we also included additional human sequences from a recent analysis of the African “pan-genome”. Including these extra sequences removed slightly more bacterial reads, but this had a very minor effect (Fig. 1c).
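The idea behind k-mer-based depletion with an extended reference can be sketched in a few lines. This toy example is not BBDuk and does not reproduce its parameters; the sequences, the k-mer size, and the function names are invented for illustration. A read is discarded when it shares at least `min_hits` canonical k-mers with the host index, so adding further human-derived sequences to the index can only remove more reads.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_kmers(seq, k):
    """Return the set of strand-independent (canonical) k-mers in seq."""
    out = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        revcomp = kmer.translate(_COMPLEMENT)[::-1]
        out.add(min(kmer, revcomp))
    return out


def build_host_index(references, k):
    """Pool canonical k-mers from every host-derived reference sequence."""
    index = set()
    for ref in references:
        index |= canonical_kmers(ref, k)
    return index


def deplete(reads, host_index, k, min_hits=1):
    """Keep reads sharing fewer than min_hits k-mers with the host index."""
    return [r for r in reads
            if len(canonical_kmers(r, k) & host_index) < min_hits]


# Made-up sequences standing in for a host reference, an extra
# cancer-associated sequence, and three sequencing reads.
host_ref = "ACGTTGCAAGGCTTAACCGG"
extra_ref = "TTTTGGGGCCCCAAAATTTT"
reads = ["GTTGCAAGGC", "GGGGCCCCAAAA", "ATATATCGCGATAT"]

host_only = build_host_index([host_ref], k=8)
host_plus = build_host_index([host_ref, extra_ref], k=8)
print(deplete(reads, host_only, k=8))
print(deplete(reads, host_plus, k=8))
```

In this toy case the extended index removes one additional human-derived read while the non-host read is retained under both indexes, mirroring the behavior described above.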
Taxonomic classification: bacterial datasets
Kraken utilizes over 125 times the RAM requirement of mOTUs2 (Fig. 2d; median 256 GB vs 2 GB RAM for Kraken and mOTUs2, respectively; p = 2.2 × 10^−16, Mann-Whitney U test); Kraken was run with the database loaded into RAM to improve runtime. Historically, alignment-based taxonomic classification tools have been slow, but by using the reduced 40 marker gene database, mOTUs2 has much lower run times. CPU time was on average marginally higher for mOTUs2 compared to Kraken (Fig. 2d), but we noticed the elapsed time was actually lower (data not shown).
Bacterial proportion estimation
Bacterial classification following metagenomic assembly
The data above demonstrate that mOTUs2 and Kraken have comparable performances. However, Kraken, in contrast to mOTUs2, can classify non-bacterial sequences. When run on raw reads, Kraken typically requires post-classification filtering strategies in order to obtain high performance (Additional file 3: Figure S2). Post-classification filtering involves applying criteria to remove low-quality classifications from taxonomic results. Applying a metagenomic assembly algorithm to quality-trimmed non-host reads might provide a rapid filtering approach that reduces the need for read-based thresholds.
Filtering these datasets by number of contigs is non-ideal, as it would remove classifications from taxa that assembled well into a small number of contigs. An evolution of Kraken, KrakenUniq, was run on these contigs to further illuminate the relationship between taxa detection and metrics more advanced than those of Kraken 1, including the coverage of the clade in the reference database and the number of unique k-mers (Fig. 4d, Additional file 6: Figure S5). This analysis reveals that on our challenging datasets, no set of filtering parameters could obtain perfect performance. Upon investigation of a single dataset, it was observed that 13 out of 17,693 contigs assigned to different genera were responsible for false-positive classifications, resulting in a drop of PPV to 0.83 (Additional file 7: Figure S6). These contigs were extracted and used as input for NCBI’s MegaBLAST with standard parameters. Of the 13 false-positive contigs, 3 were correctly reclassified, 3 were incorrectly classified, and the remaining 7 obtained no significant hits. This highlights that these contigs may suffer from misassembly or non-uniqueness that is not improved by use of a tool with a different approach.
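For reference, the genus-level metrics used throughout can be computed from the set of genera truly present in a simulated dataset and the set reported by a tool. The sketch below uses the standard definitions of sensitivity, PPV, and F1; the function name and the example genera are assumptions for illustration and do not correspond to any specific dataset in this benchmark.

```python
def genus_metrics(truth, predicted):
    """Return (sensitivity, PPV, F1) for presence/absence calls of genera."""
    truth, predicted = set(truth), set(predicted)
    tp = len(truth & predicted)   # genera present and reported
    fp = len(predicted - truth)   # genera reported but absent
    fn = len(truth - predicted)   # genera present but missed
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    denom = ppv + sensitivity
    f1 = 2 * ppv * sensitivity / denom if denom else 0.0
    return sensitivity, ppv, f1


# Made-up example: five of six true genera recovered, one false positive.
truth = {"Helicobacter", "Prevotella", "Neisseria",
         "Streptococcus", "Fusobacterium", "Veillonella"}
predicted = {"Helicobacter", "Prevotella", "Neisseria",
             "Streptococcus", "Fusobacterium", "Escherichia"}
sensitivity, ppv, f1 = genus_metrics(truth, predicted)
print(round(sensitivity, 2), round(ppv, 2), round(f1, 2))
```

Because a single false-positive genus lowers PPV substantially when few genera are present, a small number of misassigned contigs can have a visible effect on the score.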
Taxonomic classification: viral datasets
Bacterial consensus classification
Real cancer whole genome sequence data
In both cervical and gastric cancers, expansion of these pipelines to larger datasets would help to characterize the role of many other reported genera. Medulloblastoma samples are expected to be mostly sterile, and this is well reflected, with only a very low number of genera at low read counts (number of genera and total reads across all samples: 75 and 11,213,997 for cervical, 102 and 16,269,893 for gastric, and 27 and 138,712 for medulloblastoma). Kraken appears to be more sensitive, making a greater number of classifications overall and classifying the same taxa as present in a higher number of samples than mOTUs2.
SEPATH template pipelines
We have demonstrated pipelines for detecting bacterial genera and viral species in simulated and real whole genome sequence data from cancer samples. These pipelines perform well in terms of sensitivity and PPV and utilize computational resources effectively. The two top-performing classification tools, Kraken and mOTUs2, have very different underlying mechanics despite achieving similar performance. Kraken builds a database by minimizing and compressing every unique k-mer for each reference genome. Kraken begins the analysis by breaking down each input read into its constituent k-mers and matching each of these to the user-generated reference database. The sequence is classified probabilistically by the leaf in the highest weighted root to leaf path in a taxonomic tree. In comparison with Kraken, mOTUs2 uses a highly targeted approach by analyzing 40 universal phylogenetic bacterial marker genes for classification. Overall, mOTUs2 uses 7726 marker gene-based operational taxonomic units (mOTUs). Classifications are obtained by an alignment to this database using BWA-MEM with default parameters [25, 42].
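The root-to-leaf scoring described for Kraken can be illustrated with a toy example. This is a simplified sketch and not Kraken's code: the k-mer table, the taxonomy, and the function name are invented, exact string lookup replaces Kraken's compressed database, and ties between equally scored paths are broken alphabetically here for simplicity.

```python
from collections import Counter


def classify_read(read, k, kmer_to_taxon, parent):
    """Assign a read to the leaf of the highest-weighted root-to-leaf path.

    kmer_to_taxon maps each known k-mer to a taxon; parent maps each taxon
    to its parent (None for the root). Both are toy stand-ins for a database.
    """
    hits = Counter()
    for i in range(len(read) - k + 1):
        taxon = kmer_to_taxon.get(read[i:i + k])
        if taxon is not None:
            hits[taxon] += 1
    if not hits:
        return None  # no k-mer matched: read is unclassified

    def path_score(taxon):
        # Sum k-mer hits on the path from this taxon up to the root.
        score = 0
        while taxon is not None:
            score += hits.get(taxon, 0)
            taxon = parent.get(taxon)
        return score

    # Leaves of the hit-induced tree: hit taxa that are not ancestors
    # of any other hit taxon.
    ancestors = set()
    for taxon in hits:
        node = parent.get(taxon)
        while node is not None:
            ancestors.add(node)
            node = parent.get(node)
    leaves = sorted(t for t in hits if t not in ancestors)
    return max(leaves, key=path_score)


parent = {
    "root": None,
    "Bacteria": "root",
    "Helicobacter": "Bacteria",
    "Escherichia": "Bacteria",
    "Helicobacter pylori": "Helicobacter",
}
kmer_to_taxon = {
    "AAAA": "Helicobacter pylori",
    "AAAC": "Helicobacter",
    "CCCC": "Bacteria",
    "GGGG": "Escherichia",
}
print(classify_read("AAAACCCCGGGG", 4, kmer_to_taxon, parent))
```

In this toy read, the path ending at Helicobacter pylori accumulates three k-mer hits against two for the path ending at Escherichia, so the read is assigned to the former.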
mOTUs2 has been developed with quantitative abundance in mind. It estimates the proportion of sequences originating from unknown taxa (denoted by “ − 1” in mOTUs2 reports) and adjusts abundance values from detected clades accordingly to account for this. Kraken read distribution can be improved by using a Bayesian framework to redistribute the assigned reads using Bracken. A comparison of relative abundance between mOTUs2 and Bracken was carried out during the production of mOTUs2 as reported in Milanese et al., which demonstrated that mOTUs2 appeared to provide more accurate predictions. We therefore recommend our Kraken pipelines for accurate representations of presence/absence and suggest that abundance-weighted β-diversity metrics from these pipelines should be interpreted with caution. A further caveat of the assembly Kraken pipeline is that it requires successful metagenomic assembly. While MetaSPAdes worked well on our simulations, idiosyncrasies of differing technologies and datasets may hinder a successful assembly. In this event, we would recommend running Kraken classification on quality-trimmed and human-depleted sequencing reads without assembly.
The data in this paper support the use of mOTUs2 for quantitative bacterial measurements, which together with the high classification performance on simulated data suggests that both binary and non-binary β-diversity measures would be representative of the true values of the dataset, suggesting a conferred accuracy in bacterial community profiling. Furthermore, mOTUs2 differs from the current methods that rely purely on bacterial reference sequences by incorporating data from metagenome-assembled genomes, suggesting that mOTUs2 captures a differing scope of classifications to our Kraken database, which was developed using reference genomes. Although both tools are state-of-the-art at the time of writing, they are likely to contain biases in terms of what they are able to classify, which pertains to previous sequencing efforts of the sampling site. The human gut microbiome, for example, is currently believed to be better characterized than other body sites.
For bacterial classification, we noted a higher performance at taxonomic levels above genus level, but performance appears to drop at species level (Additional file 3: Figure S2). Because of this drop, combined with the instability of species-level classification, we urge caution when working at the species level on this type of data. At lower taxonomic levels, the retention of BAM files from mOTUs2 could theoretically allow for subsequent investigations at more specific taxonomic nodes (such as strain level) by investigating single-nucleotide variation. Kraken also automatically produces subgenus-level classifications where the input data and reference database permit. Validating performance at these taxonomic levels would require extensive performance benchmarking which has not been conducted here. Benchmarking tools and databases as they emerge is an important task, as they greatly influence performance. It is hoped that utilities presented here will assist future benchmarking efforts.
The use of SEPATH pipelines on real cancer sequence data suggests overall agreement between Kraken and mOTUs2 but reveals important considerations for subsequent analysis. Kraken appears to be more sensitive than mOTUs2 in this real data, possibly because different parameters were used to accommodate the shorter read lengths (2 × 100 bp in real sample data compared to 2 × 150 bp in simulated data). Using sequencing protocols optimized for microbial detection compared to human sequencing projects is likely to result in a higher and more even microbial genome coverage and subsequently more classifications with mOTUs2, which has been demonstrated recently in the analysis of fecal metagenomes of colorectal cancer patients. In this study, mOTUs2 provided interesting “unknown” classifications which would not be captured by standard Kraken databases. We therefore recommend Kraken as the primary tool of investigation on tissue, but mOTUs2 has great potential in the confirmatory setting and for investigating unknown taxa. A consensus approach of different tools on much larger real datasets would likely help in distinguishing between the peculiarities (particularly false positives) of individual tools and true-positive results, which would benefit the accurate characterization of human tissue metagenomes.
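One simple way to express such a consensus is to retain only genera reported by a minimum number of tools. The sketch below is an assumed, minimal formulation for illustration (the function name, the threshold, and the example calls are hypothetical) and is not the consensus procedure evaluated in this paper.

```python
from collections import Counter


def consensus_genera(calls_by_tool, min_tools=2):
    """Return genera reported by at least min_tools of the given tools."""
    counts = Counter()
    for genera in calls_by_tool.values():
        counts.update(set(genera))  # count each genus once per tool
    return sorted(g for g, n in counts.items() if n >= min_tools)


# Made-up calls for a single sample.
calls = {
    "Kraken": ["Helicobacter", "Prevotella", "Escherichia"],
    "mOTUs2": ["Helicobacter", "Prevotella"],
}
print(consensus_genera(calls))
```

A stricter threshold trades sensitivity for PPV, since a genus detected by only one tool is discarded even when it is truly present.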
A benchmark into metagenomic classification tools has revealed high-performing approaches to process host-dominated sequence data with low pathogenic abundance on a large selection of challenging simulated datasets. We provide these pipelines for the experienced user to adjust according to their own resource availability and provide our simulated metagenomes for others to use freely for independent investigations. mOTUs2 provides fast and accurate bacterial classification with good quantitative predictions. MetaSPAdes and Kraken provide bacterial and viral classification with assembled contigs as a useful downstream output. We have shown that SEPATH forms a consensus alongside PathSeq to achieve near-perfect genus-level bacterial classification performance. Using SEPATH pipelines will contribute towards a deeper understanding of the cancer metagenome and generate further hypotheses regarding the complicated interplay between pathogens and cancer.
Metagenomes were simulated using a customized version of Better Emulation for Artificial Reads (BEAR) and using in-house scripts to generate proportions for each reference genome (Additional file 8: Figure S7, https://github.com/UEA-Cancer-Genetics-Lab/BEAR). These proportions were based on previously analyzed cancer data. Firstly, the number of total bacterial reads (in both pairs) was generated by a random selection of positive values from a normal distribution function with a mean of 28,400,000 and a standard deviation of 20,876,020. The number of human reads in the sample was set to the difference between this number and 600 million (the total number of reads in both pairs). The number of bacterial species was randomly sampled from the reference species available, and the number of bacterial reads available was picked from a gamma distribution of semi-random shape. The number of reads for each bacterial species was distributed among contigs proportionately depending on the contig length. This produced a file with contigs and proportions of final reads which was provided to BEAR to generate paired-end FASTA files for each of the 100 metagenomes with approximately 300 million reads per paired-end file (complete metagenome compositions can be found in Additional file 1, viral components in Additional file 9). An error model was generated following the BEAR recommendations from a sample provided by Illumina containing paired-end reads that were 150 bp in read length (https://basespace.illumina.com/run/35594569/HiSeqX_Nextera_DNA_Flex_Paternal_Trio). This sample was selected to best resemble data originating from within Genomics England’s 100,000 Genomes Project. These simulated metagenomes can be downloaded from the European Nucleotide Archive (https://www.ebi.ac.uk/ena/data/view/PRJEB31019).
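The sampling scheme described above can be sketched as follows. This is an illustrative reconstruction and not the in-house scripts: the function names, the fixed gamma shape, the way per-species weights are normalized, and the toy contig lengths are all assumptions; only the mean, standard deviation, and the 600 million read total are taken from the text.

```python
import random

TOTAL_READS = 600_000_000   # total reads in both pairs
BACT_MEAN = 28_400_000      # mean of the normal distribution
BACT_SD = 20_876_020        # standard deviation of the normal distribution


def draw_bacterial_reads(rng):
    """Draw a positive bacterial read count by rejection sampling."""
    while True:
        n = int(rng.gauss(BACT_MEAN, BACT_SD))
        if 0 < n < TOTAL_READS:
            return n


def simulate_composition(species_contigs, rng, n_species, gamma_shape=1.0):
    """Return (human_reads, bacterial_reads, reads per species and contig).

    species_contigs maps species -> {contig name: contig length}.
    gamma_shape is fixed here for simplicity (an assumption).
    """
    bacterial = draw_bacterial_reads(rng)
    human = TOTAL_READS - bacterial
    chosen = rng.sample(sorted(species_contigs), n_species)
    weights = [rng.gammavariate(gamma_shape, 1.0) for _ in chosen]
    total_weight = sum(weights)
    plan = {}
    for species, weight in zip(chosen, weights):
        species_reads = bacterial * weight / total_weight
        contigs = species_contigs[species]
        total_len = sum(contigs.values())
        # Split the species' reads across contigs in proportion to length.
        plan[species] = {name: species_reads * length / total_len
                         for name, length in contigs.items()}
    return human, bacterial, plan


# Made-up reference set with invented contig lengths.
references = {
    "species_A": {"contig1": 1_600_000},
    "species_B": {"contig1": 2_000_000, "contig2": 500_000},
    "species_C": {"contig1": 900_000, "contig2": 300_000},
}
human, bacterial, plan = simulate_composition(references, random.Random(1), 2)
print(human + bacterial == TOTAL_READS, len(plan))
```

The resulting per-contig read counts correspond to the kind of proportion file that would then be passed to a read simulator.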
Tool performance benchmarking
Samples were trimmed for quality, read length, and adapter content with Trimmomatic prior to running any classification (default parameters were minimum read length = 35 and minimum phred quality of 15 over a sliding window of 4). SEPATH has trimming parameters set as default that prevent any excessive removal of data (including any reads that may be pathogenic), but these should be adjusted according to the nature of the data being analyzed.
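The effect of these defaults can be illustrated with a simplified sliding-window trimmer. This sketch is not Trimmomatic and may choose a slightly different cut position than the real tool; the function name and the example reads are assumptions, while the window size (4), quality threshold (15), and minimum length (35) follow the defaults stated above.

```python
def sliding_window_trim(seq, quals, window=4, min_avg_q=15, min_len=35):
    """Trim a read at the first window whose mean quality is too low.

    seq is the base string and quals the per-base phred scores. Returns the
    trimmed sequence, or None if it falls below the minimum length.
    """
    cut = len(seq)
    for start in range(len(seq) - window + 1):
        window_quals = quals[start:start + window]
        if sum(window_quals) / window < min_avg_q:
            cut = start  # simplified: cut at the start of the failing window
            break
    trimmed = seq[:cut]
    return trimmed if len(trimmed) >= min_len else None


# Made-up 50 bp reads: high-quality bases followed by a low-quality tail.
read = "ACGTA" * 10
kept = sliding_window_trim(read, [30] * 40 + [2] * 10)
dropped = sliding_window_trim(read, [30] * 30 + [2] * 20)
print(len(kept), dropped)
```

Lowering the minimum length or quality threshold retains more data, which may matter when pathogen-derived reads are already scarce.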
Real cancer whole genome sequence analysis
Sequencing data from cancer tissue was obtained from The Cancer Genome Atlas (TCGA-CESC and TCGA-STAD), International Cancer Genome Consortium (ICGC) PedBrain Tumor Project, and ICGC Chinese Gastric Cancer project. These sequencing reads were pre-processed through a common pipeline to obtain reads unaligned to the human genome and were additionally quality trimmed and depleted for human reads using SEPATH standard parameters but with a database consisting of human reference genome 38, African “pan-genome” project sequences, and COSMIC cancer genes as previously mentioned. Kraken was run on quality-trimmed reads, and a confidence threshold of 0.2 was applied to the reports. mOTUs2 was run for the genus-level analysis on the same reads using a minimum of 2 marker genes and a non-standard minimum alignment length of 50 to account for the shorter read length. A minimum threshold of 100 reads per classification was applied to Kraken files, and mOTUs2 results were left unfiltered.
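The minimum read threshold can be applied to a Kraken report with a short script. The sketch below assumes the six-column, tab-delimited Kraken 1 report layout (percentage, reads in clade, reads assigned directly, rank code, NCBI taxonomy ID, indented name); the function name and the report rows are made up, and the confidence threshold of 0.2 is a separate step that is not shown here.

```python
def genera_passing(report_lines, min_reads=100):
    """Return {genus name: clade read count} for genera meeting min_reads."""
    passing = {}
    for line in report_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip malformed or empty lines
        clade_reads = int(fields[1])   # reads rooted at this taxon
        rank_code = fields[3].strip()  # "G" marks genus-level rows
        name = fields[5].strip()       # names are indented by depth
        if rank_code == "G" and clade_reads >= min_reads:
            passing[name] = clade_reads
    return passing


def report_row(pct, clade, direct, rank, taxid, name, depth):
    """Build one made-up report line in the assumed layout."""
    return "\t".join([f"{pct:.2f}", str(clade), str(direct), rank,
                      str(taxid), "  " * depth + name])


report = [
    report_row(98.00, 9800, 9800, "U", 0, "unclassified", 0),
    report_row(1.50, 150, 10, "G", 209, "Helicobacter", 1),
    report_row(1.40, 140, 140, "S", 210, "Helicobacter pylori", 2),
    report_row(0.50, 50, 50, "G", 561, "Escherichia", 1),
]
print(genera_passing(report))
```

Here only the genus with at least 100 clade-level reads is retained, and species-level rows are ignored because the analysis is performed at genus level.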
Computational tools and settings
All analyses for figures were carried out in R version 3.5.1 (2018-07-02). All scripts and raw data used to make the figures can be found in the supplementary information and on https://github.com/UEA-Cancer-Genetics-Lab/sepath_paper. In addition to the “other requirements” mentioned below, this paper used the following software as part of the analysis: picard 2.10.9, samtools v1.5, BEAR (https://github.com/UEA-Cancer-Genetics-Lab/BEAR commit: a58df4a01500a54a1e89f42a6c7314779273f9b2), BLAST v2.6.0+, Diamond v0.9.22, MUMmer v3.2.3, Jellyfish v1.1.11, Kaiju v1.6.3, Kontaminant (pre-release, GitHub commit: d43e5e7), KrakenUniq (GitHub commit: 7f9de49a15aac741629982b35955b12503bee27f), MEGAHIT (GitHub commit: ef1bae692ee435b5bcc78407be25f4a051302f74), MetaPhlAn2 v2.6.0, Gottcha v1.0c, Centrifuge v1.0.4, FASTA Splitter v0.2.6, Perl v5.24.1, bzip2 v1.0.5, gzip v1.3.12, and Singularity v3.2.1.
Python v3.5.5 was used with the exception of BEAR, which used Python 2.7.12. The following Python modules were used: SeqIO of BioPython v1.68, os, sys, gzip, time, subprocess, and glob. The following are the R packages used and their versions: Cowplot v0.9.3, dplyr v0.7.6, ggExtra v0.8, ggplot2 v3.0.0, ggpubr v0.1.8, ggrepel v0.8.0, purrr v0.2.5, ggbeeswarm v0.6.0, see v0.2.0.9, RColorBrewer v1.1-2, readr v1.1.1, reshape2 v1.4.3, tidyr v0.8.1, and tidyverse v1.2.1.
Availability and requirements
Project name: SEPATH
Project home page: https://github.com/UEA-Cancer-Genetics-Lab/sepath_tool_UEA
Operating system(s): Linux-based high performance computing cluster environments
Programming language: Python 3, Bash
Other requirements: Python v3.5, Snakemake v3.13.3, Trimmomatic v0.36, Java v.8.0_51, bbmap v37.28, mOTUs2 v2.0.1, Kraken 1, Spades v3.11.1, Pysam v0.15.1
License: GPL version 3 or later
The research presented in this paper was carried out on the High Performance Computing Cluster supported by the Research and Specialist Computing Support service at the University of East Anglia. We acknowledge and thank the support received from Big C, Prostate Cancer UK, Cancer Research UK C5047/A14835/A22530/A17528, Bob Champion Cancer Trust, The Masonic Charitable Foundation successor to The Grand Charity, The King Family, and the Stephen Hargrave Trust.
For the submission of reference genomic sequence data to NCBI that were used in producing simulated metagenomes in this paper, we would like to thank and acknowledge the following:
- Genome Reference Consortium Human Build 38 – Genome Reference Consortium (GRC) and the International Human Genome Sequencing Consortium (IHGSC)
- TIGR for the submissions of Helicobacter pylori, Haemophilus influenzae, Enterococcus faecalis, and Mycoplasma genitalium
- University of Wisconsin – Madison – E. coli Genome Project for their submission of E. coli
- Baylor College of Medicine for their submissions of Corynebacterium accolens, Pasteurella dagmatis, Rothia dentocariosa, Streptococcus parasanguinis, Corynebacterium glucuronolyticum, Corynebacterium pseudogenitalium, Peptoniphilus duerdenii, and Finegoldia magna
- J. Craig Venter Institute for their submissions of Corynebacterium tuberculostearicum, Ureaplasma urealyticum, Bulleidia extructa, Prevotella buccalis, Peptoniphilus harei, Anaerococcus prevotii, Peptoniphilus sp. BV3C26, Propionimicrobium sp. BV2F7, Anaerococcus lactolyticus, Mobiluncus curtisii, and Campylobacter rectus
- The Human Microbiome Project for their submission of Gemella haemolysans
- Radboud University Nijmegen Medical Centre for their submission of Moraxella catarrhalis
- Goettingen Genomics Laboratory for their submission of Cutibacterium acnes
- The Chinese National Human Genome Centre, Shanghai, for their submission of Staphylococcus epidermidis
- The Department of Microbiology, University of Kaiserslautern, for their submission of Streptococcus mitis
- Kitasato University for their submission of Bacteroides fragilis
- Washington University Genome Sequencing Center for their submissions of Abiotrophia defectiva, Catonella morbi, Blautia hansenii, Dialister invisus, Clostridium spiroforme, Eubacterium ventriosum, Faecalibacterium prausnitzii, Ruminococcus torques, and Anaerococcus hydrogenalis
- Integrated Genomics for their submission of Fusobacterium nucleatum
- Washington University School of Medicine in St. Louis–McDonnell Genome Institute for their submission of Kingella oralis
- DOE Joint Genome Institute for their submissions of Leptotrichia goodfellowii, Streptobacillus moniliformis, Veillonella parvula, Porphyromonas somerae, Porphyromonas bennonis, Campylobacter ureolyticus, Varibaculum cambriense, Actinotignum urinale, Propionimicrobium lymphophilum, Prevotella corporis, and Anaerococcus prevotii
- European Consortium for their submission of Listeria monocytogenes
- Georg-August University Goettingen, Genomic and Applied Microbiology, Goettingen Genomics Laboratory, for their submission of Mannheimia haemolytica
- INRS-Institut Armand Frappier for their submission of Neisseria elongata
- Broad Institute for their submissions of Neisseria mucosa, Treponema vincentii, Fusobacterium gonidiaformans, Actinobaculum massiliense, Actinomyces neuii, Actinomyces turicensis, Propionimicrobium lymphophilum, and Corynebacterium pyruviciproducens
- Institut National de la Recherche Agronomique (INRA) for their submission of Streptococcus thermophilus
- The Sanger Institute for their submission of Salmonella enterica
- JGI for their submission of Prevotella bivia
- The Genome Institute for their submission of Enterococcus faecalis
- The University of Tokyo for their submission of Prevotella disiens
- URMITE for their submission of Prevotella timonensis
- Aalborg University for their submission of Actinotignum schaalii
- The Robert Koch Institute for their submission of Sneathia sanguinegens
- The Genome Institute at Washington University for their submission of Peptoniphilus coxii
- Institut Pasteur for their submission of Streptococcus agalactiae
- University Medical Centre Utrecht for their submission of Staphylococcus aureus
- National Microbiology Laboratory, Public Health Agency of Canada, for their submission of Streptococcus anginosus
- USDA, ARS, and WRRC for their submission of Campylobacter ureolyticus
AG developed the manuscript and SEPATH, is responsible for the metagenome simulation and tool benchmarking, and produced all graphical presentations. GR alongside AG modified BEAR for metagenomic simulation. GR developed the Kraken database and supervised the early development of SEPATH. RH advised on the metagenomic content of the simulated datasets. DB, RL, CC, and RH contributed towards the development of the final manuscript. RL and DB supervised the production and development of SEPATH. DB obtained and processed the cancer whole genome sequencing files prior to AG running SEPATH. DB and CC developed the original concept of SEPATH. All authors read and approved the final manuscript.
Funding for this project was obtained from the Big C Cancer Charity, grant reference: 16-09R.
Ethics approval and consent to participate
N/A. All data presented in this article was analyzed from publicly available sources.
Consent for publication
Competing interests
The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.