Abstract
Cancer is generally characterized by acquired genomic aberrations in a broad spectrum of types and sizes, ranging from single nucleotide variants to structural variants (SVs). At least 30% of cancers have a known pathogenic SV used in diagnosis or treatment stratification. However, research into the role of SVs in cancer has been limited due to difficulties in detection. Biological and computational challenges confound SV detection in cancer samples, including intratumor heterogeneity, polyploidy, and distinguishing tumor-specific SVs from germline and somatic variants present in healthy cells. Classification of tumor-specific SVs is challenging due to inconsistencies in detected breakpoints, derived variant types and biological complexity of some rearrangements. Full-spectrum SV detection with high recall and precision requires integration of multiple algorithms and sequencing technologies to rescue variants that are difficult to resolve through individual methods. Here, we explore current strategies for integrating SV callsets and for enabling the use of tumor-specific SVs in precision oncology.
The importance of structural variant detection in cancer
Genomic aberrations acquired in cancer genomes encompass a broad spectrum of types and sizes. These range from single nucleotide variants (SNVs) to larger structural variants (SVs) that impact genome organization (Fig. 1, Table 1)1,2. SVs are a major contributor to genomic variation: they affect more base pairs in the genome than SNVs3 and can have serious phenotypic impact4,5. Some SVs are known to drive carcinogenesis, and SVs resulting in gene fusions were the first recurrent mutations observed in many pediatric cancers6,7. With at least 30% of cancer genomes affected by a pathogenic SV, detection of SVs is essential for both diagnosis and treatment stratification6,7,8,9,10,11. In addition, discovering new oncogenic SV driver events is beneficial for understanding cancer etiology. However, research into the role of SVs in cancer has been limited due to difficulties in their detection, which have partially resulted from co-opting sequencing technologies designed for SNV detection.
Advances in sequencing technologies have increased the number of SVs identified per genome from ~2.1–2.5k in the 1000 Genomes Project to more than 27k in recent multi-platform sequencing efforts3,4,12. Specifically for the cancer genomics community, recent contributions of the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium have provided an extensive resource of paired tumor-normal genomes13. The insights obtained from multi-platform analyses also highlight current SV blindspots in cancer variant databases like COSMIC. Despite technological innovations, confident SV detection in cancer genomes remains challenging due to biological factors including contamination from healthy tissue, intratumor heterogeneity and polyploidy. Identification of variants acquired in tumor cells requires discerning tumor-specific somatic SVs (TSSVs) from variants in the germline and mosaic variants present in unaffected cells14. This is often done by differential analysis between paired tumor-normal samples15. The classification of SVs as tumor-specific or normal is confounded by inconsistencies in detected breakpoints and derived variant types, as well as the biological complexity of some rearrangements.
Confident SV detection and subsequent classification of variants as either germline, tumor-specific or mosaic variation in healthy tissue is not only important for diagnostics and cancer etiology but also for research into cancer predisposition and genetic interactions. In addition, the genetic context of somatic variants and interplay with germline variants may influence their tumorigenic potential16. Here, we focus on the detection of TSSVs from paired tumor-normal WGS data. First, we explore current approaches for SV detection and their integration, whilst accounting for challenges specific to cancer samples. Second, we address different approaches aimed at distinguishing TSSVs from normal SVs. Third, we highlight the impact that long-read sequencing can have on somatic SV detection. Last, we explore how orthogonal sequencing technologies can be combined to improve TSSV detection.
Detection of somatic SVs in short-read WGS data
SVs can be detected using short-read sequencing data based on patterns in aligned reads (Fig. 1). These reads are sequenced as paired ends of 150–250 bp long. Changes in read-depth (RD) are used to derive copy-number variants (CNVs). Discordant read-pairs (DP) that align with an abnormal distance and/or orientation to the reference genome are suited for detecting large SVs. Split or soft-clipped reads (SR) are partially mapped reads and can indicate breakpoints with base-pair resolution17. Both the alignment method and the reference genome used influence the performance of SV detection algorithms17,18. BWA-MEM is predominantly used for alignment prior to SV detection, as it provides secondary alignments to reads mapping to multiple locations rather than placing the reads randomly19,20. However, alignment uncertainty is inherent to short-read sequencing data. In parallel, the reference genome continues to evolve, resulting in improved alignments and fewer false-positive variants in studies which adopted GRCh38 (hg38) compared to GRCh37 (hg19)8,21,22,23.
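To make the discordant read-pair (DP) signal concrete, the following minimal Python sketch labels a read pair by its insert size and orientation. The function name, the library mean and standard deviation, and the three-standard-deviation cutoff are illustrative assumptions and are not taken from any specific SV detection algorithm; real tools combine such signals with RD and SR evidence.

```python
def classify_pair(insert_size, orientation, mean=400.0, sd=50.0, n_sd=3.0):
    """Return a coarse label for one aligned read pair.

    orientation: 'FR' is the expected forward-reverse layout for
    paired-end reads; 'RF', 'FF' and 'RR' are abnormal orientations.
    mean, sd, n_sd: hypothetical insert-size distribution and cutoff.
    """
    # Same-strand pairs suggest an inverted segment.
    if orientation in ("FF", "RR"):
        return "inversion-like"
    # Everted pairs suggest a tandem duplication.
    if orientation == "RF":
        return "duplication-like"
    # Pairs mapping too far apart suggest missing sequence in the sample.
    if insert_size > mean + n_sd * sd:
        return "deletion-like"
    # Pairs mapping too close together suggest extra sequence in the sample.
    if insert_size < mean - n_sd * sd:
        return "insertion-like"
    return "concordant"


if __name__ == "__main__":
    for size, orient in [(400, "FR"), (5000, "FR"), (400, "RF")]:
        print(size, orient, classify_pair(size, orient))
```

The labels carry a "-like" suffix because a single pair is only weak evidence; a call would normally require several consistent pairs.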
Combinatorial algorithms integrate multiple read-alignment patterns
The latest generation of SV detection algorithms that combine multiple read-alignment patterns can detect SVs across a broad range of types and sizes. At present, many different strategies and methods exist (Table 2). How these combinatorial algorithms integrate read-alignment patterns influences their ability to detect specific variant classes (Fig. 2A)24,25. As a result, no single algorithm performs best across the full spectrum of SVs, implying that integration of multiple algorithms is beneficial25. Although most studies comparing SV algorithms focus on germline SVs, these findings were recently also confirmed for somatic SV detection26. The methodology used by DELLY, LUMPY, Manta, SvABA, and GRIDSS for detecting SVs (Box 1) achieves high performance in detecting both germline and somatic SVs25,26.
SV-level integration of multiple algorithms improves precision
Since the optimal detection algorithm differs between SV type and size range, full-spectrum SV detection with high recall and precision currently requires multiple algorithms25,27. The optimal method to combine the resulting callsets remains a largely unanswered question and a variety of tools and in-house pipelines are currently used4,13,25,28. To compare and combine SV callsets, variants from the same genomic rearrangement need to be merged first. This is complicated by diversity in breakpoint resolution and SV typing (Fig. 2B). The recent review by Ho et al. addresses different “ensemble” integration approaches currently in use in germline SV research4. In general, simple integration strategies use (reciprocal) overlap or breakpoint distance to merge SVs whilst more complex solutions combine this with read-evidence integration, local assembly or machine learning29,30,31,32.
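The simple merging strategies mentioned above can be sketched as follows. This is a minimal Python illustration, assuming each call is a `(chrom, start, end, svtype)` tuple; the 50% reciprocal-overlap threshold, the 100 bp breakpoint window and the requirement that SV types match are assumptions chosen for the example, not recommendations from the cited tools.

```python
def reciprocal_overlap(a, b):
    """Smallest fraction of either interval covered by their overlap.

    a, b: (start, end) tuples on the same chromosome.
    """
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[1] - a[0]), overlap / (b[1] - b[0]))


def same_event(a, b, min_ro=0.5, max_bp_dist=100):
    """Decide whether two calls likely describe the same rearrangement.

    a, b: (chrom, start, end, svtype) tuples.
    min_ro, max_bp_dist: hypothetical thresholds for this sketch.
    """
    if a[0] != b[0] or a[3] != b[3]:
        return False
    ro = reciprocal_overlap(a[1:3], b[1:3])
    # Breakpoint distance rescues small events where overlap is unstable.
    close = (abs(a[1] - b[1]) <= max_bp_dist
             and abs(a[2] - b[2]) <= max_bp_dist)
    return ro >= min_ro or close


if __name__ == "__main__":
    call_a = ("chr1", 1000, 2000, "DEL")
    call_b = ("chr1", 1100, 2050, "DEL")
    print(same_event(call_a, call_b))
```

Requiring identical SV types is the strictest choice; because algorithms can type the same rearrangement differently, a pipeline may prefer to relax that condition.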
After overlapping variants are merged, integration of SV callsets from multiple algorithms can either be performed by taking the union or intersection (Fig. 2B). Since achieving high precision takes priority in most cancer research and clinical applications, an intersection strategy is often preferred but reduces recall. The precision/recall trade-off can be optimized by carefully selecting which tools to intersect25 and by taking the union of pairwise intersections26.
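The difference between these integration strategies can be shown with plain set operations. The sketch below assumes that overlapping variants have already been merged, so that each callset is a set of shared event identifiers; the tool names and identifiers are placeholders.

```python
from itertools import combinations


def union_all(callsets):
    """Every event reported by at least one algorithm (highest recall)."""
    return set().union(*callsets.values())


def intersection_all(callsets):
    """Only events reported by every algorithm (highest precision)."""
    return set.intersection(*callsets.values())


def union_of_pairwise_intersections(callsets):
    """Events supported by at least two algorithms."""
    supported = set()
    for a, b in combinations(callsets.values(), 2):
        supported |= a & b
    return supported


if __name__ == "__main__":
    # Hypothetical merged event identifiers per tool.
    callsets = {
        "tool_a": {1, 2, 3},
        "tool_b": {2, 3, 4},
        "tool_c": {3, 4, 5},
    }
    print(sorted(union_all(callsets)))
    print(sorted(intersection_all(callsets)))
    print(sorted(union_of_pairwise_intersections(callsets)))
```

In this toy input the union of pairwise intersections retains events 2, 3 and 4, whereas the strict intersection retains only event 3, which illustrates how the former recovers recall while still requiring independent support.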
Distinguishing somatic from germline SVs
TSSV detection aims to identify variants that uniquely occur in a patient’s tumor cells. Typically paired tumor-normal samples are used to classify SVs as either germline, mosaic-normal or tumor-specific variants15. Detection of TSSVs is a two-step process that involves the detection of SVs in both samples, followed by differential analysis of the callsets (Fig. 2C). Also, cancer genomes can have highly complex rearrangements. Alternatively, if patient-derived healthy material is not available, SVs can be filtered using a panel-of-normals. A sufficiently large panel-of-normals can provide more statistical power for filtering recurrent germline variants, but is less effective than a patient-derived normal sample when filtering rare or private germline variants4. Also, strictly filtering out regions with germline CNVs excludes potentially interesting genomic regions from SV analysis, which are susceptible to rearrangements because of their architecture33.
Tools for somatic SV detection in WGS data
Somatic SV detection algorithms differ in their approach to identify TSSVs from paired tumor-normal samples and as a result can classify the same event differently26. Despite their differences, DELLY, LUMPY, SvABA, Manta, and GRIDSS have successfully been used to report somatic SVs in various studies34,35,36,37. DELLY and LUMPY use ad hoc filtering whereby SVs supported by at least one read from the normal sample are removed from the tumor SV callset34,35, which is highly sensitive to contamination. In contrast, Manta uses a probabilistic scoring system for somatic SVs integrating evidence from tumor and normal reads36. SvABA uses both the tumor and normal data during assembly before distinguishing somatic variants38. GRIDSS has yet another approach and applies extensive rule-based filtering to both single break-ends and breakpoints37,39.
Specialized somatic SV detection tools such as Lancet and Varlociraptor account for challenges specific to the identification of TSSVs (Box 2)31,40. The first challenge in comparing tumor and normal SV callsets is that SV breakpoints and types can differ, analogous to the issues with overlapping SV callsets of different algorithms25. Second, somatic SVs are often complex, which can be problematic for algorithms that are not equipped to resolve these complex SV signatures and instead infer (false-positive) small indels41. As an alternative to ad hoc filtering of SV callsets, Varlociraptor and Lancet, respectively, compare breakpoints and aberrant reads between tumor-normal samples at an earlier stage of the analysis (Fig. 2C). Specifically, Varlociraptor compares the statistical support for an altered reference with simulated variant versus an unadjusted reference (Box 2)31. Using read-level or breakpoint-level comparison can account for subsequent mutations at germline variant locations, as these mutations may complicate somatic-germline comparisons. Third, issues inherent to analyzing tumor samples such as contamination, polyploidy, and heterogeneity are accounted for by Varlociraptor and Lancet (Box 2).
Challenges for accurate SV detection in cancer genomes
The analysis of tumor-normal paired samples is confounded by challenges inherent to cancer samples, including polyploidy, heterogeneity and contamination17. First, potential aneuploidy of tumor cells complicates haplotype reconstruction and phasing reads12,42. Second, intratumor heterogeneity can result in multiple subclonal variants which have low allele frequency (AF) and few supporting reads, making them difficult to detect. Third, contamination of the tumor sample with healthy material and vice versa complicates differential analysis between paired samples due to mislabelled reads. This can result in algorithms falsely discarding somatic variants with one or more supporting reads from the control sample. Adjusting the filtering threshold based on an estimated contamination fraction is a balance between precision and sensitivity for detecting low-AF variants.
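The effect of contamination on read-count filtering can be illustrated with a small Python sketch. It is not the rule used by any particular tool: the function name and the way the tolerated number of normal reads is derived (estimated tumor-in-normal contamination times the tumor AF times the normal depth) are assumptions made for this example. With a contamination estimate of zero it reduces to the strict rule of discarding any variant with a supporting read in the normal sample.

```python
def keep_as_somatic(tumor_alt, tumor_depth, normal_alt, normal_depth,
                    contamination=0.0):
    """Hypothetical filter for one variant in a tumor-normal pair.

    tumor_alt, normal_alt: variant-supporting read counts.
    tumor_depth, normal_depth: total read depth at the locus.
    contamination: assumed fraction of tumor-derived reads in the normal.
    """
    if tumor_alt == 0 or tumor_depth == 0:
        return False
    tumor_af = tumor_alt / tumor_depth
    # Supporting reads expected in the normal purely from contamination.
    allowed_normal_reads = contamination * tumor_af * normal_depth
    return normal_alt <= allowed_normal_reads


if __name__ == "__main__":
    # One supporting read in the normal sample.
    print(keep_as_somatic(12, 60, 1, 30))                     # strict rule
    print(keep_as_somatic(12, 60, 1, 30, contamination=0.2))  # tolerant rule
```

Raising the tolerated read count rescues true somatic variants in contaminated samples but also lets more germline or mosaic-normal variants through, which is the precision/sensitivity balance described above.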
The detection of rare TSSVs is limited by sequencing depth and AF. In practice, a minimum of 20% AF is required for reliable variant detection from tumor-normal pairs26,31. Increasing sequencing depth to 75×–90× for tumor samples improves the sensitivity of detection, especially for variants below 20% AF, whilst maintaining precision26. In addition, interpretation of TSSV allele frequencies is not straightforward since they can reflect intratumor heterogeneity and/or multiple alleles within a polyploid tumor genome. Note that the SV type should be considered during AF interpretation43. For diploid normal cells, variants are expected to have an AF of 0%, 50%, or 100%, or 33% in the case of a heterozygous duplication. However, mosaic-normal variants can occur at varying AF and be difficult to distinguish from TSSVs14. Computational modeling with AF can provide insight into intratumor heterogeneity and clonal architecture, both of which are important for therapeutic resistance and relapse44. The majority of SV tools operate under a diploid genome assumption. A multitude of tools independently quantify purity and ploidy of tumor samples; however, benchmarking studies show little consensus39,45. These tools can rely solely on CNV deletion events to model the cell purity and ploidy, and/or incorporate heterozygous known SNPs into their probabilistic models. At present, only SVclone uses SVs to estimate intra-tumor heterogeneity due to the complexities of calculating variant AF for SVs43.
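The expected AF values quoted above follow from simple copy-number arithmetic, which the Python sketch below extends to an impure tumor sample. It assumes a clonal variant and the commonly used mixture relationship AF = purity × mutated copies / (purity × tumor copies + (1 − purity) × normal copies); the function and parameter names are illustrative. As noted in the text, calculating AF for SVs is more involved than this, so the sketch is only a first approximation.

```python
def expected_af(mutated_copies, total_copies, purity=1.0, normal_copies=2):
    """Expected allele frequency of a clonal variant in a mixed sample.

    mutated_copies: copies carrying the variant in the variant cells.
    total_copies: total copies of the locus in the variant cells.
    purity: fraction of cells carrying the variant (1.0 = unmixed sample).
    normal_copies: copies of the locus in the admixed normal cells.
    """
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must be between 0 and 1")
    denominator = purity * total_copies + (1.0 - purity) * normal_copies
    if denominator == 0:
        return 0.0
    return purity * mutated_copies / denominator


if __name__ == "__main__":
    print(expected_af(1, 2))              # heterozygous, diploid
    print(expected_af(2, 2))              # homozygous, diploid
    print(expected_af(1, 3))              # one variant copy out of three
    print(expected_af(1, 4, purity=0.5))  # tetraploid tumor, 50% purity
```

The last call shows how polyploidy and contamination together push a clonal variant below the 20% AF mentioned above as the practical limit for reliable detection.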
Computational challenges of complex variant detection
Genomic instability in cancer genomes results in more breakpoints and more complex SVs compared to germline variation46. Complex SVs are characterized by signatures of many breakpoints clustering together and are hypothesized to be caused by a single catastrophic process followed by repair or progressive rearrangements47. The presence of breakpoint clusters complicates the inference of the underlying genomic rearrangements and therefore also the identification of tumor-specific events. Alternatively, when breakpoint clusters confound confident SV calling, breakpoint-level differential analysis can be used to identify tumor-specific events. In addition, unsupervised clustering can discern complex from simple SVs and help to study both events more accurately41.
Technical limitations of short-read WGS influence SV detection
The detection of SVs is also influenced by technical limitations of the sequencing platform, most notably genome coverage bias and alignment uncertainty. Illumina (IL) is currently the most commonly used short-read sequencing platform since it is relatively affordable and fast, and has a high nucleotide accuracy (>99%)48. However, IL sequencing has inherent biases in genome coverage in regions that have a very low or very high GC content (<10% or >85% GC) or long homopolymers49. Although PCR-free library preparation reduces GC biases, it requires a large amount of input DNA (Table 3)49.
The detection of SVs relies on identifying aberrant read alignment patterns (Fig. 1). Reads derived from highly homologous regions, such as pseudogenes and segmental duplications, are often not long enough to uniquely map to the reference genome50. Yet repeat-rich regions comprise about half of the human genome and are vulnerable to SVs due to homologous recombination errors and replication slippage33,51. Depending on the alignment algorithm, uncertainty usually results in either random placement of reads or multi-mapping to all possible locations52. Multi-mapping, for example as done by BWA-MEM, causes unequal genome coverage altering the signal-to-noise ratio52. Hence, alignment uncertainty is problematic for accurate SV detection and should be addressed with a sound statistical model30,31,52. Current estimates suggest ~55 Mb of GRCh38 are “dark regions” inaccessible to IL sequencing due to alignment ambiguity (i.e., repeat-rich regions) or the sequencing chemistry (i.e., GC content)53. The over 4000 affected gene bodies53 also include disease-related genes, such as the TERT promoter which was found to be mutated in 9% of tumors in the PCAWG study but mutations can be missed due to its high GC content13.
Impact of long-read sequencing
Single-molecule long-read sequencing technologies by Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) are valuable for SV detection54. PacBio and ONT generate reads of ~10+ kb versus ~250 bp from IL; the longer reads reduce alignment ambiguity and do not have a GC bias, resulting in improved coverage of “dark” regions in the genome55. In addition, long reads allow for haplotype phasing of variants and de novo assembly of complex rearrangements56. For example, sequencing lung cancer cell lines with PromethION detected both known cancer-driver SNVs and revealed large previously unknown genomic rearrangements, including an 8 Mb amplification of MYC57. Similarly, direct comparison of a PacBio assembly with IL sequencing shows ~2.5× more uniquely identified SVs (~48k and ~20k, respectively), in particular more inversions and 50 bp–2 kb insertions/deletions located in repeat-rich regions12.
Limitations of long-read sequencing
The disadvantages of PacBio and ONT platforms include costs and sample requirements, which are substantial compared to IL sequencing and can be problematic for tumor samples (Table 3)55. In addition, they have a lower nucleotide accuracy of ~85% for single molecule sequencing and up to 99% using consensus sequencing of the same DNA molecule58,59,60,61. Continuous improvements in algorithms for base calling and error correction have increased the accuracy of these platforms58,59. Since low nucleotide accuracy can impede read-alignment, error correction potentially improves SV detection by increasing the fraction of aligned reads62. However, error-correction strategies come with trade-offs for SV detection. Long reads can be aligned to each other as a self-correction strategy when sufficient coverage (~50×) is available55. However, haplotyping information is lost as a result of using the consensus of reads with mixed molecular origin. This makes the consensus sequence unsuitable for variant phasing or for studying intra-tumor heterogeneity or polyploidy. Alternatively, short reads can be used for error correction by aligning them to the long reads, but this approach only improves accuracy of genomic regions accessible to IL sequencing55,61.
Long-read data requires specialized algorithms
Long-read SV detection algorithms are either based on de novo assembly or read-alignment to a reference genome. Assembly-based strategies have a higher sensitivity for detecting non-template insertions and homozygous SVs. During assembly, contigs are compared to the reference genome and can provide more evidence than individual reads32,55. However, variant calling using alignment requires less coverage than assembly (~20× versus ~50×) and statistical significance when identifying SVs is achieved relatively easily due to the low alignment uncertainty of long reads32,50,55. Compared to assembly methods, alignment-based approaches are more suited to identify heterozygous SVs and more robust to amplifications in highly homologous regions such as low-complexity regions12,55. Within clinical applications, often insufficient resources are available to perform long-read sequencing of tumor-normal pairs to depths required for de novo assembly (Table 3). Therefore, we focus on using alignment-based strategies (Table 2).
Alignment of long reads differs from short reads due to the increase in base pairs to align and different error profiles55. Although BWA-MEM offers support for long reads, it often infers many small gaps during alignment and misses large indels63,64. Specialized long-read alignment algorithms have been developed to overcome these issues. In contrast to short-read data, there is no best practice for which aligner should be used when performing SV detection63,64,65,66. Preliminary comparisons suggest that NGMLR and minimap2 perform well and both algorithms are designed to handle the higher error rates and adjust for the 1 bp indels in long reads12.
Alignment-based SV detection algorithms for long-read data
Currently, many tools are actively developed to detect SVs from alignment of ONT and PacBio data (Table 2). However, studies comparing long-read SV detection tools have been scarce and predominantly show the limitations of available truth sets by identifying many novel variants12,67. At present only nanomonsv reports somatic SVs from long-read data68. The commonly used tools SVIM and Sniffles have shown good precision and sensitivity in multiple performance assessments63,67,69. They were among the first to process both ONT and PacBio data despite their different error profiles and have been followed by additional tools like NanoVar and CuteSV (Table 2). Similar to short-read SV detection tools, long-read tools combine multiple read-alignment patterns to detect SVs. They infer patterns similar to split reads and discordant pairs using intra-alignment and inter-alignment signatures, despite long reads not being paired-end. Similar to short-read tools, using a consensus callset created by intersecting multiple long-read SV detection algorithms can increase precision32,67. Alternatively, machine learning approaches can attain greater improvements in precision and sensitivity than ad hoc intersection, given a truth set is available for training32.
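An intra-alignment signature can be illustrated by scanning the CIGAR string of a single long-read alignment for large insertions and deletions. The Python sketch below is a simplified stand-in for what alignment-based tools do, not the implementation of any of them; the function name and the 50 bp minimum size are assumptions, and inter-alignment signatures (split alignments of one read) are not covered.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def intra_alignment_signatures(ref_start, cigar, min_size=50):
    """List (svtype, reference_position, length) for large indels.

    ref_start: reference coordinate of the first aligned base.
    cigar: CIGAR string of one alignment.
    min_size: hypothetical minimum event size to report.
    """
    position = ref_start
    signatures = []
    for length_text, op in CIGAR_OP.findall(cigar):
        length = int(length_text)
        if op == "I" and length >= min_size:
            signatures.append(("INS", position, length))
        elif op == "D" and length >= min_size:
            signatures.append(("DEL", position, length))
        # Only these operations consume reference bases.
        if op in "MDN=X":
            position += length
    return signatures


if __name__ == "__main__":
    print(intra_alignment_signatures(1000, "500M120D300M2I200M75I100M"))
```

Small indels such as the 2 bp insertion in the example are ignored, which matters for long reads because their error profile produces many short spurious gaps.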
Multi-platform data integration to improve detection of somatic SVs in cancer
Limitations in both short-read and long-read WGS can potentially be overcome by using a multi-platform approach and as such improve the identification of TSSVs. Integration can improve both precision and sensitivity by combining read-alignment patterns (Fig. 2A) and integrating SV callsets from multiple algorithms or technologies (Fig. 2B).
Gene fusion detection by combined analysis of RNA and WGS
Integration of genomic and transcriptomic data can further improve variant detection and provide insight into the phenotypic effect of SVs; specifically resolving gene fusions, splice variants and linking SVs to altered gene expression70. RNA sequencing of tumor samples offers unique advantages such as tissue specificity and time specificity, but obtaining high-quality RNA can be problematic. In addition, sufficient expression is necessary to detect events, which may impede detection of low AF variants.
RNA-seq is especially suitable for detecting gene fusion events through their chimeric transcripts. Gene fusions have high clinical relevance since they are often cancer drivers and otherwise occur rarely in the general population6,70. Specialized gene fusion algorithms predict gene fusions from chimeric transcripts by using read-alignment patterns such as SR crossing exonic junctions and DP mapping to both gene partners71. However, these algorithms can suffer from a high false positive rate which requires extensive filtering72. Chimeric transcripts can occur without genomic rearrangement, for example through intergenic splicing (trans-splicing and cis-splicing) or transcriptional slippage on short homologous sequences73. Since these chimeric transcripts are also present in healthy cells, this advocates for tissue matched RNA-seq of paired tumor-normal samples to allow the identification of tumor-specific events.
Combining RNA-seq with WGS data could resolve specificity issues and improve gene fusion detection. By itself, WGS can detect gene fusions, but not the occurrence of functional transcripts. Although sometimes used for validation purposes74, there are no established algorithms which integrate WGS and RNA-seq such that they both contribute to detection. The advantages of combining WGS, RNA-seq and exome sequencing have been demonstrated for detecting SVs in heterogeneous pediatric cancers75. Similarly, joint analysis of RNA-seq and short-read WGS in the PCAWG study identified the underlying SV for 82% of gene fusions. The remaining fusions were either the result of RNA-only alterations such as transcriptional read-through or underdetection of SVs5.
Integration of short-read and long-read WGS
Short-read and long-read data can complement each platform’s strengths and overcome individual limitations12. Combining SV callsets after detection can increase sensitivity and requiring orthogonal support for variants across platforms can increase their confidence. However, the union or intersection of callsets is still affected by platform-specific technical biases. Read-level integration can overcome some of these issues as illustrated by error correction approaches which use IL reads to improve the accuracy of PacBio/ONT reads55. Likewise, hybrid assembly of short and long reads benefits from their respective high accuracy and scaffolding properties. Localized hybrid assembly tailored to SV detection as implemented by HySA shows that problematic SVs can be detected that have too little support in either PacBio or IL76. However, HySA cannot infer somatic SVs and some variants were missed due to few supporting aberrant IL reads and PacBio alignment issues. Hybrid assembly can also reduce coverage requirements for de novo assembly77.
As an alternative to long-read technologies, linked-read sequencing from 10× Genomics (10×) performs well for haplotype construction and variant phasing12. A read-barcode is added during library preparation to trace the molecule of origin at costs similar to IL sequencing78 (Table 3). In addition, 10× can report variants in repeat-rich regions not accessible by standard short-read IL sequencing79,80. Integration of short-read WGS and 10× enabled chromosome-scale haplotyping and phasing of detected variants of the polyploid cancer cell line HepG281,82. Variant phasing can help to gain biological insights, as shown for associated regulatory and coding mutations in treatment-resistant prostate cancer83 and identification of SVs as potential cancer drivers by altering cis-regulation of genes84.
Discovery of large, complex variants by chromatin assays
Combining sequencing data with technologies that provide insight into genomic organization can elucidate large complex rearrangements. Technologies such as Bionano Genomics (BNG) and Hi–C have shown limitations of SV detection using sequencing. The combination of short-read WGS, BNG, and Hi–C on a cancer cell line showed most of the large (>1 Mb) intra-chromosomal and inter-chromosomal SV events were uniquely detected by a single technology with only ~20–35% validated by multiple platforms8. Each platform has its own scope of variant detection. Short-read WGS detected the largest number of variants across a broad range, whilst BNG and Hi–C lack base-pair resolution but can detect >1 kb deletions in repeat-rich regions unlike short-read WGS8. BNG has promising diagnostic applications as it can confidently detect large variants with low input requirements (Table 3). Also, BNG had full concordance with standard diagnostic assays in pediatric ALL and identified additional variants85.
Incorporating pre-existing technologies in ongoing studies
Continuous technological improvements provide exciting new data and SV discoveries, but this does not make existing datasets obsolete. The phenotypic effect of CNVs is often better understood than for SVs and established technologies have had more opportunity to collect samples, including rare cancer types. Currently many samples are available in repositories that profile genomic imbalances either via SNV array or exome sequencing technologies13,86. Challenges in integrating these datasets result from differences between technologies, such as breakpoint resolution and platform-specific biases, and systematic solutions are rare87. The widely varying detection resolution of different technologies invalidates callset intersection strategies, as smaller events are below the detection limits for lower resolution arrays, and exome sequencing is limited to events involving multiple exons. The absence of an event in a callset should not be considered proof that the event does not exist. Gene-centric approaches based on unions seem the most applicable. Although integration of pre-existing datasets assayed with different technologies with recently acquired datasets provides a complex computational challenge and is often ignored, it is likely to be an ongoing issue as technologies and platforms continue to evolve.
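A gene-centric union can be sketched as follows: events from each technology are collapsed to the affected gene, and the platforms supporting each event are recorded rather than required to agree. The data layout, platform labels and example genes in this Python sketch are hypothetical.

```python
from collections import defaultdict


def gene_centric_union(callsets):
    """Merge per-platform events by gene.

    callsets: {platform: iterable of (gene, event_type)}.
    Returns {gene: {event_type: set of supporting platforms}}.
    """
    merged = defaultdict(lambda: defaultdict(set))
    for platform, events in callsets.items():
        for gene, event_type in events:
            merged[gene][event_type].add(platform)
    return {gene: dict(events) for gene, events in merged.items()}


if __name__ == "__main__":
    # Hypothetical calls from two technologies with different resolution.
    callsets = {
        "array": [("MYC", "gain")],
        "exome": [("MYC", "gain"), ("CDKN2A", "loss")],
    }
    for gene, events in gene_centric_union(callsets).items():
        print(gene, events)
```

An event reported by only one platform is kept and annotated with its single source, reflecting the point above that absence from a lower-resolution callset is not evidence that the event does not exist.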
Challenges in using sequencing for precision oncology
In clinical practice, next-generation sequencing (NGS) is increasingly used to replace targeted assays, subject to budgetary and sample requirements. NGS can simultaneously detect different variant types and discover new biomarkers, and is more cost-effective than a series of single-gene assays. Although turn-around times are often longer, sensitivity and precision are maintained88 provided sufficient sequencing depth is achieved26,31. As a result, NGS makes pan-cancer biomarker testing feasible, leading to the approval of drugs based on molecular alterations shared by different cancer types, such as the use of TRK inhibitors for all solid tumors with an NTRK fusion88. However, the distribution of NGS data over multiple repositories and lack of data harmonization complicates clinical decision-making and prevents precision medicine from reaching its full potential.
Variant interpretation is a major challenge in precision oncology often done by expert panels such as interdisciplinary molecular tumor boards88. Despite its challenges, integration of multi-omics data is increasingly being used to improve variant interpretation and increase the number of identified drivers or actionable targets5,88,89. However, standards on variant interpretation and prioritization are still emerging90. As a result, there is low concordance between the recommendations of different molecular tumor boards when given identical case studies, especially for complex genomic alterations90.
Recent initiatives have attempted to resolve this need for standardization in variant assessment and clinical decision-making through the Molecular Tumor Board Portal91 and Somatic Working Group of the Clinical Genome92. Both harmonize different variant repositories, curated knowledge bases and computational predictions to acquire insights into variant-gene-drug-disease relationships with the focus on clinical use. Although extremely valuable, these efforts focus only on SNVs and to a limited extent gene fusions. Similar initiatives for SVs and complex genomic alterations are currently lacking, largely because tumor-specific SVs are not yet commonly used as molecular targets or biomarkers to guide patient-specific treatment. We anticipate that improved confidence of TSSV detection will enable the subsequent research necessary for the use of the full spectrum of variants in precision oncology.
Conclusion
The field of SV detection is continuously improving through advancements in sequencing technologies and tools. These advancements will contribute to discoveries into the role of SVs in cancer, as well as the incorporation of SVs in precision oncology programs. Nevertheless, SV detection and interpretation in tumor samples is complicated by unique biological and technical challenges, i.e., contamination, intra-tumor heterogeneity and aneuploidy. These challenges are addressed by algorithms specialized in identifying TSSVs from tumor-normal paired sequencing data, which requires both SV detection and distinguishing tumor-specific variants.
Based on studies of normal genomic variation, a multi-platform approach is necessary to detect the full spectrum of variants and reduce false positives. Truth sets and procedures developed for SV detection from short-read data show that combining multiple tools improves precision and recall. Despite this, short-read sequencing has inherent limitations such as GC coverage bias and mapping ambiguities leading to inaccessible genomic regions. Long-read sequencing technologies can resolve large, complex SVs and improve coverage, but have lower per-nucleotide accuracy, higher costs and sample requirements. SV detection tools for long-read data have yet to mature with performance assessments and truth sets lacking.
Integration of long-read and short-read data is likely required for complete characterization of tumor genomes. However, for a sequencing technology to be adopted in clinical laboratories, it must offer clear added value over standardized assays and be fast and affordable. Because IL and 10× provide high-accuracy WGS at low sample requirements, they are the most feasible options for tumor-normal sequencing in a clinical setting. Supplementary low-coverage sequencing with ONT can cover regions inaccessible to short-read WGS and aid in variant phasing. Alternatively, RNA sequencing has proven highly beneficial in a clinical setting for the detection of gene fusion events.
In conclusion, improving detection of TSSVs by integrating data derived from multiple platforms and detection tools enables the use of TSSVs in precision oncology and research into their role in cancer. With accurate TSSV datasets becoming more available, previously uncharted territories of variant types can be explored to potentially discover novel SV cancer driver events.
Data availability
No datasets were generated or analyzed during this study.
References
Vogelstein, B. & Kinzler, K. W. Cancer genes and the pathways they control. Nat. Med. 10, 789–799 (2004).
Aplan, P. D. Causes of oncogenic chromosomal translocation. Trends Genet. 22, 46–55 (2006).
Sudmant, P. H. et al. An integrated map of structural variation in 2,504 human genomes. Nature 526, 75 (2015).
Ho, S. S., Urban, A. E. & Mills, R. E. Structural variation in the sequencing era. Nat. Rev. Genet. 21, 1–19 (2019).
Calabrese, C. et al. Genomic basis for RNA alterations in cancer. Nature 578, 129–136 (2020).
Mitelman, F., Johansson, B. & Mertens, F. The impact of translocations and gene fusions on cancer causation. Nat. Rev. Cancer 7, 233–245 (2007).
Wang, Y., Wu, N., Liu, D. & Jin, Y. Recurrent fusion genes in leukemia: an attractive target for diagnosis and treatment. Curr. Genomics 18, 378–384 (2017).
Dixon, J. R. et al. Integrative detection and analysis of structural variation in cancer genomes. Nat. Genet. 50, 1388–1398 (2018).
Dupain, C. et al. Discovery of new fusion transcripts in a cohort of pediatric solid cancers at relapse and relevance for personalized medicine. Mol. Ther. 27, 200–218 (2019).
Cairncross, J. G. et al. Specific genetic predictors of chemotherapeutic response and survival in patients with anaplastic oligodendrogliomas. J. Natl Cancer Inst. 90, 1473–1479 (1998).
Cohen, M. H. et al. Approval summary for imatinib mesylate capsules in the treatment of chronic myelogenous leukemia. Clin. Cancer Res. 8, 935–942 (2002).
Chaisson, M. J. P. et al. Multi-platform discovery of haplotype-resolved structural variation in human genomes. Nat. Commun. 10, 1784 (2019).
Pleasance, E. D. et al. Pan-cancer analysis of whole genomes. Nature 578, 82–93 (2020).
Van Horebeek, L., Dubois, B. & Goris, A. Somatic variants: new kids on the block in human immunogenetics. Trends Genet. 35, 935–947 (2019).
Mandelker, D. & Ceyhan-Birsoy, O. Evolving significance of tumor-normal sequencing in cancer care. Trends Cancer Res. 6, 31–39 (2020).
Ramroop, J. R., Gerber, M. M. & Toland, A. E. Germline variants impact somatic events during tumorigenesis. Trends Genet. 35, 515–526 (2019).
Liu, B. et al. Structural variation discovery in the cancer genome using next generation sequencing: computational solutions and perspectives. Oncotarget 6, 5477–5489 (2015).
Ruffalo, M., LaFramboise, T. & Koyuturk, M. Comparative analysis of algorithms for next-generation sequencing read alignment. Bioinformatics 27, 2790–2796 (2011).
Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at arXiv [q-bio.GN] (2013).
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
Pan, B. et al. Similarities and differences between variants called with human reference genome HG19 or HG38. BMC Bioinforma. 20, 17–29 (2019).
Eisfeldt, J., Mårtensson, G., Ameur, A., Nilsson, D. & Lindstrand, A. Discovery of Novel Sequences in 1,000 Swedish Genomes. Mol. Biol. Evol. 37, 18–30 (2019).
Guo, Y. et al. Improvements and impacts of GRCh38 human reference on high throughput sequencing data analysis. Genomics 109, 83–90 (2017).
Lin, K., Smit, S., Bonnema, G., Sanchez-Perez, G. & de Ridder, D. Making the difference: integrating structural variation detection tools. Brief. Bioinform. 16, 852–864 (2015).
Kosugi, S. et al. Comprehensive evaluation of structural variation detection algorithms for whole genome sequencing. Genome Biol. 20, 117 (2019).
Gong, T., Hayes, V. M. & Chan, E. K. F. Detection of somatic structural variants from short-read next-generation sequencing data. Brief. Bioinform. bbaa056 (2020).
Pabinger, S. et al. A survey of tools for variant analysis of next-generation genome sequencing data. Brief. Bioinforma. 15, 256–278 (2014).
Zarate, S. et al. Parliament2: Accurate structural variant calling at scale. GigaScience. 9, giaa145 (2020).
Mohiyuddin, M. et al. MetaSV: an accurate and integrative structural-variant caller for next generation sequencing. Bioinformatics 31, 2741 (2015).
Wittler, R., Marschall, T., Schönhuth, A. & Mäkinen, V. Repeat- and error-aware comparison of deletions. Bioinformatics 31, 2947–2954 (2015).
Köster, J., Dijkstra, L. J., Marschall, T. & Schönhuth, A. Varlociraptor: enhancing sensitivity and controlling false discovery rate in somatic indel discovery. Genome Biol. 21, 1–25 (2020).
Zhou, A., Lin, T. & Xing, J. Evaluating nanopore sequencing data processing pipelines for structural variation identification. Genome Biol. 20, 1–13 (2019).
Carvalho, C. M. B. & Lupski, J. R. Mechanisms underlying structural variant formation in genomic disorders. Nat. Rev. Genet. 17, 224–238 (2016).
Rausch, T. et al. DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics 28, i333–i339 (2012).
Layer, R. M., Chiang, C., Quinlan, A. R. & Hall, I. M. LUMPY: a probabilistic framework for structural variant discovery. Genome Biol. 15, R84 (2014).
Chen, X. et al. Manta: rapid detection of structural variants and indels for germline and cancer sequencing applications. Bioinformatics 32, 1220–1222 (2016).
Cameron, D. L. et al. GRIDSS: sensitive and specific genomic rearrangement detection using positional de Bruijn graph assembly. Genome Res. (2017).
Wala, J. A. et al. SvABA: genome-wide detection of structural variants and indels by local assembly. Genome Res. 28, 581–591 (2018).
Cameron, D. L. et al. GRIDSS, PURPLE, LINX: unscrambling the tumor genome via integrated analysis of structural variation and copy number. Preprint at bioRxiv https://doi.org/10.1101/781013. (2019).
Narzisi, G. et al. Genome-wide somatic variant calling using localized colored de Bruijn graphs. Commun. Biol. 1, 20 (2018).
Li, Y. et al. Patterns of structural variation in human cancer. Nature 578, 112–121 (2020).
Huddleston, J. et al. Discovery and genotyping of structural variation from long-read haploid genome sequence data. Genome Res. 27, 677–685 (2017).
Cmero, M. et al. Inferring structural variant cancer cell fraction. Nat. Commun. 11, 1–15 (2020).
Griffith, M. et al. Optimizing cancer genome sequencing and analysis. Cell Syst. 1, 210 (2015).
Luo, Z., Fan, X., Su, Y. & Huang, Y. S. Accurity: accurate tumor purity and ploidy inference from tumor-normal WGS data by jointly modelling somatic copy number alterations and heterozygous germline single-nucleotide-variants. Bioinformatics 34, 2004–2011 (2018).
Yi, K. & Ju, Y. S. Patterns and mechanisms of structural variations in human cancer. Exp. Mol. Med. 50, 98 (2018).
Kinsella, M., Patel, A. & Bafna, V. The elusive evidence for chromothripsis. Nucleic Acids Res. 42, 8231–8242 (2014).
Goodwin, S., McPherson, J. D. & Richard McCombie, W. Coming of age: ten years of next-generation sequencing technologies. Nat. Rev. Genet. 17, 333 (2016).
Ross, M. G. et al. Characterizing and measuring bias in sequence data. Genome Biol. 14, R51 (2013).
Li, W. & Freudenberg, J. Mappability and read length. Front. Genet. 5, 381 (2014).
Lander, E. S. et al. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).
Oloomi, S. M. H. The Impact of Multi-mappings in Short Read Mapping. Doctoral dissertation (2018).
Ebbert, M. T. W. et al. Systematic analysis of dark and camouflaged genes reveals disease-relevant genes hiding in plain sight. Genome Biol. 20, 97 (2019).
De Coster, W. & Van Broeckhoven, C. Newest methods for detecting structural variations. Trends Biotechnol. 37, 973–982 (2019).
Sedlazeck, F. J., Lee, H., Darby, C. A. & Schatz, M. C. Piercing the dark matter: bioinformatics of long-range sequencing and mapping. Nat. Rev. Genet. 19, 329–346 (2018).
Gong, L. et al. Picky comprehensively detects high-resolution structural variants in nanopore long reads. Nat. Methods 15, 455–460 (2018).
Sakamoto, Y. et al. Long-read sequencing for non-small-cell lung cancer genomes. Genome Res. 30, 1243–1257 (2020).
Rang, F. J., Kloosterman, W. P. & de Ridder, J. From squiggle to basepair: computational approaches for improving nanopore sequencing read accuracy. Genome Biol. 19, 90 (2018).
Travers, K. J., Chin, C.-S., Rank, D. R., Eid, J. S. & Turner, S. W. A flexible and efficient template format for circular consensus sequencing and SNP detection. Nucleic Acids Res. 38, e159 (2010).
Wenger, A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat. Biotechnol. 37, 1155–1162 (2019).
Fu, S., Wang, A. & Au, K. F. A comparative evaluation of hybrid error correction methods for error-prone long reads. Genome Biol. 20, 1–17 (2019).
Sakamoto, Y., Sereewattanawoot, S. & Suzuki, A. A new era of long-read sequencing for cancer genomics. J. Hum. Genet. 65, 3–10 (2019).
Sedlazeck, F. J. et al. Accurate detection of complex structural variations using single-molecule sequencing. Nat. Methods 15, 461–468 (2018).
Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics 26, 589–595 (2010).
Chaisson, M. J. & Tesler, G. Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinforma. 13, 238 (2012).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
De Coster, W. et al. Structural variants identified by Oxford Nanopore PromethION sequencing of the human genome. Genome Res. 29, 1178–1187 (2019).
Shiraishi, Y. et al. Precise characterization of somatic structural variations and mobile element insertions from paired long-read sequencing data with nanomonsv. Preprint at bioRxiv https://doi.org/10.1101/2020.07.22.214262. (2020).
Heller, D. & Vingron, M. SVIM: structural variant identification using mapped long reads. Bioinformatics 35, 2907–2915 (2019).
Reisle, C. et al. MAVIS: merging, annotation, validation, and illustration of structural variants. Bioinformatics 35, 515–517 (2019).
Haas, B. J. et al. Accuracy assessment of fusion transcript detection via read-mapping and de novo fusion transcript assembly-based methods. Genome Biol. 20, 1–16 (2019).
Peng, Z. et al. Hypothesis: artifacts, including spurious chimeric RNAs with a short homologous sequence, caused by consecutive reverse transcriptions and endogenous random primers. J. Cancer 6, 555–567 (2015).
Chwalenia, K., Facemire, L. & Li, H. Chimeric RNAs in cancer and normal physiology. Wiley Interdiscip. Rev. 8, e1427 (2017).
Gao, Q. et al. Driver fusions and their implications in the development and treatment of human cancers. Cell Rep. 23, 227–238.e3 (2018).
Rusch, M. et al. Clinical cancer genomic profiling by three-platform sequencing of whole genome, whole exome and transcriptome. Nat. Commun. 9, 1–13 (2018).
Fan, X., Chaisson, M., Nakhleh, L. & Chen, K. HySA: a Hybrid Structural variant Assembly approach using next-generation and single-molecule sequencing technologies. Genome Res. 27, 793–800 (2017).
Ma, Z. S., Li, L., Ye, C., Peng, M. & Zhang, Y.-P. Hybrid assembly of ultra-long Nanopore reads augmented with 10x-Genomics contigs: Demonstrated with a human genome. Genomics 111, 1896–1901 (2019).
Zheng, G. X. Y. et al. Haplotyping germline and cancer genomes with high-throughput linked-read sequencing. Nat. Biotechnol. 34, 303–311 (2016).
Marks, P. et al. Resolving the full spectrum of human genome variation using Linked-Reads. Genome Res. 29, 635–645 (2019).
Mostovoy, Y. et al. A hybrid approach for de novo human genome sequence assembly and phasing. Nat. Methods 13, 587–590 (2016).
Zhou, B. et al. Haplotype-resolved and integrated genome analysis of the cancer cell line HepG2. Nucleic Acids Res. 47, 3846 (2019).
Bell, J. M. et al. Chromosome-scale mega-haplotypes enable digital karyotyping of cancer aneuploidy. Nucleic Acids Res. 45, e162–e162 (2017).
Viswanathan, S. R. et al. Structural alterations driving castration-resistant prostate cancer revealed by linked-read genome sequencing. Cell 174, 433–447.e19 (2018).
Zhang, Y. et al. High-coverage whole-genome analysis of 1220 cancers reveals hundreds of genes deregulated by rearrangement-mediated cis -regulatory alterations. Nat. Commun. 11, 1–14 (2020).
Neveling, K. et al. Next generation cytogenetics: comprehensive assessment of 48 leukemia genomes by genome imaging. Preprint at bioRxiv https://doi.org/10.1101/2020.02.06.935742. (2020).
Barrett, T. et al. NCBI GEO: archive for functional genomics data sets—update. Nucleic Acids Res. 41, D991–D995 (2012).
Zhou, Z., Wang, W., Wang, L.-S. & Zhang, N. R. Integrative DNA copy number detection and genotyping from sequencing and array-based platforms. Bioinformatics 34, 2349–2355 (2018).
Malone, E. R., Oliva, M., Sabatini, P. J. B., Stockley, T. L. & Siu, L. L. Molecular profiling for precision cancer therapies. Genome Med. 12, 1–19 (2020).
Nattestad, M. et al. Complex rearrangements and oncogene amplifications revealed by long-read DNA and RNA sequencing of a breast cancer cell line. Genome Res. 28, 1126–1135 (2018).
Rieke, D. T. et al. Comparison of treatment recommendations by molecular tumor boards worldwide. JCO Precis. Oncol. 2, 1–14 (2018).
Tamborero, D. et al. Support systems to guide clinical decision-making in precision oncology: The Cancer Core Europe Molecular Tumor Board Portal. Nat. Med. 26, 992–994 (2020).
Yu, Y. et al. PreMedKB: an integrated precision medicine knowledgebase for interpreting relationships between diseases, genes, variants and drugs. Nucleic Acids Res. 47, D1090–D1101 (2018).
Tham, C. Y. et al. NanoVar: accurate characterization of patients’ genomic structural variants using low-depth nanopore sequencing. Genome Biol. 21, 1–15 (2020).
Roberts, H. E. et al. Short and long-read genome sequencing methodologies for somatic variant detection; genomic analysis of a patient with diffuse large B-cell lymphoma. Preprint at bioRxiv https://doi.org/10.1101/2020.03.24.999870. (2020).
Spies, N. et al. Genome-wide reconstruction of complex structural variants using read clouds. Nat. Methods 14, 915–920 (2017).
Genomics, 10x. Whole Genome Phasing and SV Calling. 10x Genomics Support https://support.10xgenomics.com/genome-exome/software/pipelines/latest/using/wgs. (2020)
Jiang, T. et al. Long-read-based human genomic structural variation detection with cuteSV. Genome Biol. 21, 1–24 (2020).
Stancu, M. C. et al. Mapping and phasing of structural variation in patient genomes using nanopore sequencing. Nat. Commun. 8, 1–13 (2017).
English, A. C., Salerno, W. J. & Reid, J. G. PBHoney: identifying genomic variants via long-read discordance and interrupted mapping. BMC Bioinforma. 15, 1–7 (2014).
Pacific Biosciences. pbsv. https://github.com/PacificBiosciences/pbsv. (2020)
Boivin, V. et al. Reducing the structure bias of RNA-Seq reveals a large number of non-annotated non-coding RNA. Nucleic Acids Res. 48, 2271–2286 (2020).
Sati, S. & Cavalli, G. Chromosome conformation capture technologies and their impact in understanding genome function. Chromosoma 126, 33–44 (2016).
Jain, M. et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat. Biotechnol. 36, 338–345 (2018).
Tyson, J. R. et al. MinION-based long-read sequencing and assembly extends the Caenorhabditis elegans reference genome. Genome Res. 28, 266–274 (2018).
Rhoads, A. & Au, K. F. PacBio sequencing and its applications. Genom. Proteom. Bioinforma. 13, 278–289 (2015).
Laver, T. et al. Assessing the performance of the Oxford Nanopore Technologies MinION. Biomol. Detection Quant. 3, 1 (2015).
Jain, M. et al. Improved data analysis for the MinION nanopore sequencer. Nat. Methods 12, 351–356 (2015).
Chen, P. et al. Modelling BioNano optical data and simulation study of genome map assembly. Bioinformatics 34, 3966 (2018).
Niu, L. et al. Amplification-free library preparation with SAFE Hi-C uses ligation products for deep sequencing to improve traditional Hi-C analysis. Commun Biol. 2, 1–8 (2019).
Díaz, N. et al. Chromatin conformation analysis of primary patient tissue using a low input Hi-C method. Nat. Commun. 9, 1–13 (2018).
Acknowledgements
This work was financially supported by KiKa.
Author information
Contributions
A.S. and P.K. substantially contributed to the conception and design of the article. I.A.E.M.B. and J.H.K. drafted the article. All authors discussed the concepts and contributed to the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
van Belzen, I.A.E.M., Schönhuth, A., Kemmeren, P. et al. Structural variant detection in cancer genomes: computational challenges and perspectives for precision oncology. npj Precis. Onc. 5, 15 (2021). https://doi.org/10.1038/s41698-021-00155-6