Abstract
Variant callers typically produce massive numbers of false positives for structural variations, such as cancer-relevant copy-number alterations and fusion genes resulting from genome rearrangements. Here we describe an ultrafast and accurate detector of somatic structural variations that reduces read-mapping costs by filtering out reads matched to pan-genome k-mer sets. The detector, which we named ETCHING (for efficient detection of chromosomal rearrangements and fusion genes), reduces the number of false positives by leveraging machine-learning classifiers trained with six breakend-related features (clipped-read count, split-read count, supporting paired-end read count, average mapping quality, depth difference and total length of clipped bases). When benchmarked against six callers on reference cell-free DNA, validated biomarkers of structural variants, matched tumour and normal whole genomes, and tumour-only targeted sequencing datasets, ETCHING was 11-fold faster than the second-fastest structural-variant caller at comparable performance and memory use. The speed and accuracy of ETCHING may aid large-scale genome projects and facilitate practical implementations in precision medicine.
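The read-filtering idea summarized above can be illustrated with a minimal sketch. It is not the ETCHING implementation: the function names, the use of canonical k-mers, the toy value of k and the `min_novel` threshold are all assumptions made for illustration. A read is kept for mapping only if it contains at least one k-mer absent from the reference (pan-genome) k-mer set.

```python
# Illustrative sketch only; names and parameters are hypothetical,
# not taken from the ETCHING source code.

_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(_COMPLEMENT[base] for base in reversed(seq))


def canonical_kmers(seq, k):
    """Yield the canonical form (lexicographic minimum of a k-mer and its
    reverse complement) of every k-mer in seq, skipping k-mers with N."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if "N" in kmer:
            continue
        yield min(kmer, reverse_complement(kmer))


def build_kmer_set(sequences, k):
    """Collect the canonical k-mers of all reference sequences."""
    return {kmer for seq in sequences for kmer in canonical_kmers(seq, k)}


def filter_reads(reads, reference_kmers, k, min_novel=1):
    """Keep reads with at least min_novel k-mers absent from the reference
    set; reads fully explained by reference k-mers are filtered out."""
    kept = []
    for read in reads:
        novel = sum(1 for kmer in canonical_kmers(read, k)
                    if kmer not in reference_kmers)
        if novel >= min_novel:
            kept.append(read)
    return kept


if __name__ == "__main__":
    k = 5  # toy value chosen for the example; real k-mers are much longer
    reference = build_kmer_set(["AAAACCCCGGGGTTTTACGTAC"], k)
    reads = [
        "CCCCGGGGTT",      # matches the reference
        "GTACGTAAAA",      # matches the reference on the reverse strand
        "AAAACCCCTTAGGA",  # contains sequence absent from the reference
    ]
    print(filter_reads(reads, reference, k))
```

Only the reads that survive this filter would need to be mapped, which is where the reduction in read-mapping cost would come from under these assumptions.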
Data availability
WGS data from 26 multiple-myeloma (MM) samples, RNA-seq data from 24 matched samples and PacBio long-read sequencing data from two MM samples can be downloaded from the Korean Nucleotide Archive (KONA; PRJKA220342; https://www.kobic.re.kr/kona/) with controlled access. TPS data from reference materials are available at http://big.hanyang.ac.kr/ETCHING. Genomes used to build PGK and PGK2 are listed in Supplementary Table 1. WGS data from 46 BRCA, 20 PRAD and 32 LUAD samples were downloaded from TCGA (https://cancergenome.nih.gov). kLUAD WGS datasets (49) were acquired from a previous study (ref. 33). WGS and PacBio long-read sequencing data from HCC1395/HCC1395BL were downloaded from the NCBI Sequence Read Archive (SRA) under accession number SRP162370. Cancer-panel datasets were downloaded from SRA under accession number SRP042598. NSCLC cancer-panel data were acquired from a previous study (ref. 47). Source data are provided with this paper.
Code availability
All source code and binaries of ETCHING (version 1.4.0) and the in-house code (LR_Filter and ETCHING_bench) used in the study are available at http://big.hanyang.ac.kr/ETCHING and on GitHub (https://github.com/ETCHING-team). ETCHING was designed for 64-bit Linux systems with at least 16 GB of RAM. The image file containing all code, models and demo data is available on the Amazon Elastic Compute Cloud (ID: ami-07c7a7d8934784df9; Region: us-east-1 (Northern Virginia)).
References
Chiang, C. et al. The impact of structural variation on human gene expression. Nat. Genet. 49, 692–699 (2017).
Sharp, A. J., Cheng, Z. & Eichler, E. E. Structural variation of the human genome. Annu. Rev. Genomics Hum. Genet. 7, 407–442 (2006).
Mitelman, F., Johansson, B. & Mertens, F. The impact of translocations and gene fusions on cancer causation. Nat. Rev. Cancer 7, 233–245 (2007).
Stratton, M. R., Campbell, P. J. & Futreal, P. A. The cancer genome. Nature 458, 719–724 (2009).
Beroukhim, R. et al. The landscape of somatic copy-number alteration across human cancers. Nature 463, 899–905 (2010).
Stankiewicz, P. & Lupski, J. R. Structural variation in the human genome and its role in disease. Annu. Rev. Med. 61, 437–455 (2010).
Macintyre, G., Ylstra, B. & Brenton, J. D. Sequencing structural variants in cancer for precision therapeutics. Trends Genet. 32, 530–542 (2016).
Di Fiore, P. P. et al. erbB-2 is a potent oncogene when overexpressed in NIH/3T3 cells. Science 237, 178–182 (1987).
Slamon, D. J. et al. Human breast cancer: correlation of relapse and survival with amplification of the HER-2/neu oncogene. Science 235, 177–182 (1987).
Soda, M. et al. Identification of the transforming EML4-ALK fusion gene in non-small-cell lung cancer. Nature 448, 561–566 (2007).
Lugo, T. G., Pendergast, A. M., Muller, A. J. & Witte, O. N. Tyrosine kinase activity and transformation potency of bcr-abl oncogene products. Science 247, 1079–1082 (1990).
Abyzov, A., Urban, A. E., Snyder, M. & Gerstein, M. CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Res. 21, 974–984 (2011).
Chen, K. et al. BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nat. Methods 6, 677–681 (2009).
Ye, K., Schulz, M. H., Long, Q., Apweiler, R. & Ning, Z. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics 25, 2865–2871 (2009).
Wang, J. et al. CREST maps somatic structural variation in cancer genomes with base-pair resolution. Nat. Methods 8, 652–654 (2011).
Schroder, J. et al. Socrates: identification of genomic rearrangements in tumour genomes by re-aligning soft clipped reads. Bioinformatics 30, 1064–1072 (2014).
Rausch, T. et al. DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics 28, i333–i339 (2012).
Yang, L. et al. Diverse mechanisms of somatic structural variations in human cancer genomes. Cell 153, 919–929 (2013).
Layer, R. M., Chiang, C., Quinlan, A. R. & Hall, I. M. LUMPY: a probabilistic framework for structural variant discovery. Genome Biol. 15, R84 (2014).
Chen, X. et al. Manta: rapid detection of structural variants and indels for germline and cancer sequencing applications. Bioinformatics 32, 1220–1222 (2016).
Cameron, D. L. et al. GRIDSS: sensitive and specific genomic rearrangement detection using positional de Bruijn graph assembly. Genome Res. 27, 2050–2060 (2017).
Wala, J. A. et al. SvABA: genome-wide detection of structural variants and indels by local assembly. Genome Res. 28, 581–591 (2018).
Chong, Z. et al. novoBreak: local assembly for breakpoint detection in cancer genomes. Nat. Methods 14, 65–67 (2017).
Moncunill, V. et al. Comprehensive characterization of complex structural variations in cancer by directly comparing genome sequence reads. Nat. Biotechnol. 32, 1106–1112 (2014).
ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium. Pan-cancer analysis of whole genomes. Nature 578, 82–93 (2020).
Collins, R. L. et al. A structural variation reference for medical and population genetics. Nature 581, 444–451 (2020).
Cameron, D. L., Di Stefano, L. & Papenfuss, A. T. Comprehensive evaluation and characterisation of short read general-purpose structural variant calling software. Nat. Commun. 10, 3240 (2019).
Gong, T., Hayes, V. M. & Chan, E. K. F. Detection of somatic structural variants from short-read next-generation sequencing data. Brief. Bioinform. https://doi.org/10.1093/bib/bbaa056 (2020).
Zhang, J. et al. INTEGRATE: gene fusion discovery using whole genome and transcriptome data. Genome Res. 26, 108–118 (2016).
Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at https://arxiv.org/abs/1303.3997 (2013).
Wright, M. N. & Ziegler, A. ranger: A fast implementation of random forests for high dimensional data in C++ and R. J. Stat. Softw. https://doi.org/10.18637/jss.v077.i01 (2017).
Chen, T. & Guestrin, C. XGBoost: a scalable tree boosting system. In Proc. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 785–794 (Association for Computing Machinery, 2016).
Lee, J. J. et al. Tracing oncogene rearrangements in the mutational history of lung adenocarcinoma. Cell 177, 1842–1857.e21 (2019).
Xia, L. C. et al. SVEngine: an efficient and versatile simulator of genome structural variations with features of cancer clonal evolution. Gigascience https://doi.org/10.1093/gigascience/giy081 (2018).
Kosugi, S. et al. Comprehensive evaluation of structural variation detection algorithms for whole genome sequencing. Genome Biol. 20, 117 (2019).
Derrien, T. et al. Fast computation and applications of genome mappability. PLoS ONE 7, e30377 (2012).
Avet-Loiseau, H. et al. High incidence of translocations t(11;14)(q13;q32) and t(4;14)(p16;q32) in patients with plasma cell malignancies. Cancer Res. 58, 5640–5645 (1998).
Avet-Loiseau, H. et al. Rearrangements of the c-myc oncogene are present in 15% of primary human multiple myeloma tumors. Blood 98, 3082–3086 (2001).
Chakravarty, D. et al. OncoKB: a precision oncology knowledge base. JCO Precis. Oncol. https://doi.org/10.1200/PO.17.00011 (2017).
Mertens, F., Johansson, B., Fioretos, T. & Mitelman, F. The emerging complexity of gene fusions in cancer. Nat. Rev. Cancer 15, 371–381 (2015).
Chesi, M. et al. IAP antagonists induce anti-tumor immunity in multiple myeloma. Nat. Med. 22, 1411–1420 (2016).
Raponi, S. et al. Biallelic BIRC3 inactivation in chronic lymphocytic leukaemia patients with 11q deletion identifies a subgroup with very aggressive disease. Br. J. Haematol. 185, 156–159 (2019).
Blakemore, S. J. et al. Clinical significance of TP53, BIRC3, ATM and MAPK-ERK genes in chronic lymphocytic leukaemia: data from the randomised UK LRF CLL4 trial. Leukemia 34, 1760–1774 (2020).
Frazzi, R. BIRC3 and BIRC5: multi-faceted inhibitors in cancer. Cell Biosci. 11, 8 (2021).
Uhrig, S. et al. Accurate and efficient detection of gene fusions from RNA sequencing data. Genome Res. 31, 448–460 (2021).
Abo, R. P. et al. BreaKmer: detection of structural variation in targeted massively parallel sequencing data using kmers. Nucleic Acids Res. 43, e19 (2015).
Shin, H. T. et al. Junction Location Identifier (JuLI): accurate detection of DNA fusions in clinical sequencing for precision oncology. J. Mol. Diagn. 22, 304–318 (2020).
Kokot, M., Dlugosz, M. & Deorowicz, S. KMC 3: counting and manipulating k-mer statistics. Bioinformatics 33, 2759–2761 (2017).
Zito Marino, F. et al. A new look at the ALK gene in cancer: copy number gain and amplification. Expert Rev. Anticancer Ther. 16, 493–502 (2016).
Pasini, L. et al. TrkA is amplified in malignant melanoma patients and induces an anti-proliferative response in cell lines. BMC Cancer 15, 777 (2015).
Huang, M. E. et al. Use of all-trans retinoic acid in the treatment of acute promyelocytic leukemia. Blood 72, 567–572 (1988).
Slovak, M. & Campbell, L. International System of Human Cytogenetic Nomenclature (ISCN) (Karger, 2009).
Patro, R., Duggal, G., Love, M. I., Irizarry, R. A. & Kingsford, C. Salmon provides fast and bias-aware quantification of transcript expression. Nat. Methods 14, 417–419 (2017).
Frankish, A. et al. GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Res. 47, D766–D773 (2019).
Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).
Robinson, J. T. et al. Integrative genomics viewer. Nat. Biotechnol. 29, 24–26 (2011).
Acknowledgements
We thank all BIGLab members for critical reading of the manuscript and for comments, S. Kim and S. Yoon of Seoul National University and W.-Y. Park of the Samsung Medical Center for providing resources. The results shown in this study are in part based on data generated by the TCGA Research Network (https://www.cancer.gov/tcga). This work was supported by the National Research Foundation (NRF) funded by the Ministry of Science & ICT (2014M3C9A3063541, 2020R1A4A1018398, 2021R1A2C3005835, 2022M3A9I2082294 and 2022M3E5F1018502 to J.-W.N.) and by the Korean Health Technology R&D Project, Ministry of Health and Welfare, Republic of Korea (HI15C3224 to J.-W.N.).
Author information
Contributions
J.S., M.-H.C., D.Y. and V.A.M. performed analyses. J.S., M.-H.C., D.Y. and V.A.M. contributed to writing the codes. J.S., M.-H.C. and B.N. contributed to parallel computing. J.L., J.W.P. and M.S.Y. contributed to the data processing of benchmarking datasets. S.K., S.-H.S., Y.K., S.-S.Y. and Y.S.J. provided validation datasets. Y.J.K. and J.-G.J. performed experimental validations. J.S., D.Y. and J.-W.N. contributed to the writing of the manuscript. D.B., T.-M.K. and J.-W.N. supervised the project. J.-W.N. conceived the idea.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Biomedical Engineering thanks Ryan Layer and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Comparison of running time.
a, Wall-clock times of ETCHING and other tools from FASTQ input files to SV predictions on benchmarking BRCA samples. These include the mapping procedures of the FASTQ files from tumour and normal samples. We measured the running times using 30 threads on DELL PowerEdge R830 servers. b and c, The comparisons of the running times of ETCHING (from FASTQ) and other tools (from pre-mapped BAM) to SV predictions in CPU time on a single thread (b) and wall-clock time on 30 threads (c).
Extended Data Fig. 2 Benchmarking on BRCA and kLUAD samples.
Benchmarking results in auPR for ETCHING versus other tools on six BRCA (a) and nine kLUAD (b) samples by SV type.
Extended Data Fig. 3 Benchmarking on simulation data.
Benchmarking results for ETCHING versus other tools on simulation data sets.
Extended Data Fig. 4 Systematic benchmarking of SV prediction by ETCHING versus other tools over different genomic contexts and SV sizes with HCC1395 data.
a, The number of SV calls by each tool for defined SV size categories: 100 bp ≤ L < 1 kb (100bp–1Kb), 1 kb ≤ L < 1 Mb (1Kb–1 Mb), and 1 Mb ≤ L (≥1 Mb), as well as inter-chromosomal rearrangements (Inter-Chr., that is, TRAs), where L is SV size. b, The SV ratios associated with repetitive elements, different MP scores, and different GC ratios. c, Recall and precision of the SV callers for SVs that overlap repeats. d, Recall and precision of the SV callers for SVs in regions over different genomic MP scores. e, Recall and precision of the SV callers for SVs located in regions over different GC ratios.
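The size categories in panel a can be expressed as a small binning function. This is a sketch under stated assumptions rather than the procedure used for the figure: the function name and the returned labels are hypothetical, SV size is approximated as the distance between two breakends on the same chromosome, and 1 kb and 1 Mb are taken as 1,000 bp and 1,000,000 bp.

```python
# Illustrative sketch only; the function name, labels and the definition of
# SV size as breakend distance are assumptions, not the authors' code.

def sv_size_category(chrom1, pos1, chrom2, pos2):
    """Assign an SV, described by its two breakends, to a size category."""
    if chrom1 != chrom2:
        # Breakends on different chromosomes: inter-chromosomal (TRA)
        return "Inter-Chr."
    size = abs(pos2 - pos1)  # L, approximated as breakend distance in bp
    if size < 100:
        return None  # below the smallest category considered
    if size < 1_000:
        return "100 bp to 1 kb"
    if size < 1_000_000:
        return "1 kb to 1 Mb"
    return ">= 1 Mb"
```

Counting the returned labels per caller would give a tally of the kind shown in panel a.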
Extended Data Fig. 5 Validation of SVs using PacBio long-reads on HCC1395.
a, The number of SVs of each SV type. b, The area under PR curves (auPR) of ETCHING and other tools on the gold-standard SV sets. c, The validation rates of ETCHING and other tools.
Extended Data Fig. 6 Benchmarking on 32 LUAD samples.
Benchmarking results for ETCHING versus other tools on 32 LUAD samples by SV type for five different performance metrics.
Extended Data Fig. 7 Benchmarking on 20 PRAD samples.
Benchmarking results for ETCHING versus other tools on 20 PRAD samples by SV type for five different performance metrics. Note that each boxplot has 8 dots because the 13 samples with low SV numbers were pooled and treated as a single sample.
Extended Data Fig. 8 SVs detected by each tool.
Summary of detected SV biomarkers (a) and actionable targets (b) by DELLY, LUMPY, Manta, SvABA, novoBreak, and GRIDSS.
Extended Data Fig. 9 SV and FG prediction from TPS data, paired with WT alleles (regarded as matched-normal).
a, The TP calls (labelled as ‘Found’ in orange) and false negatives (labelled as ‘Missed’ in grey) of SV callers for cfDNA reference materials – Complete Reference (CR), Complete Mutation Mix (CMM), and Mutation Mix v2 (MMv2) – with different mutant allele ratios (0.5 to 5.0%; grey to black). CR and CMM include NCOA4-RET, EML4-ALK, and CD74-ROS1 FGs, and MMv2 includes NCOA4-RET and TPR-ALK FGs. The total TP for each tool is indicated in the lower right corner. b, Benchmarking SV callers on the reference materials including ETCHING with PGK.
Extended Data Fig. 10 PML-RARA detection.
a, Wall-clock times for detecting PML-RARA fusions on WGS data of seven APML samples. b, PML-RARA fusions detected by each tool.
Supplementary information
Main Supplementary Information
Supplementary methods and figures.
Supplementary tables
Supplementary tables.
Source data
Source Data for Fig. 2
Source data.
Source Data for Fig. 3
Source data and unprocessed gels.
Source Data for Fig. 4
Source data.
Source Data for Fig. 5
Source data.
Source Data for ED Fig. 4
Source data.
Source Data for ED Fig. 6
Source data.
Source Data for ED Fig. 7
Source data.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Sohn, J., Choi, M.-H., Yi, D. et al. Ultrafast prediction of somatic structural variations by filtering out reads matched to pan-genome k-mer sets. Nat. Biomed. Eng. 7, 853–866 (2023). https://doi.org/10.1038/s41551-022-00980-5
DOI: https://doi.org/10.1038/s41551-022-00980-5