Finding small somatic structural variants in exome sequencing data: a machine learning approach

  • Original Paper
  • Computational Statistics

Abstract

Genetic variation forms the basis for diversity but can also be harmful and cause diseases, such as tumors. Structural variants (SVs) are an example of complex genetic variation; they comprise many nucleotides, ranging up to several megabases. Owing to recent developments in sequencing technology, it has become feasible to elucidate the genetic state of a person’s genes (i.e. the exome) or even the complete genome. Here, a machine learning approach is presented to find small disease-related SVs with the help of sequencing data. The method uses differences in the characteristics of mapping patterns between tumor and normal samples at a genomic locus. In this way, the method aims to be directly applicable to exome sequencing data and to improve the detection of SVs, since specific SV detection methods for such data are currently lacking. The method has been evaluated in a simulation study as well as with exome data from patients with acute myeloid leukemia. An implementation of the algorithm is available at https://github.com/lenz99-/svmod.
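
The idea of comparing mapping patterns between a tumor and a normal sample at one locus can be illustrated with a minimal sketch. It is not the authors' implementation: the `Read` record, the three summary features (clipped fraction, mean insert size, mean mapping quality) and all function names are assumptions chosen for illustration only. The abstract does not state which characteristics the method actually uses.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Read:
    """Hypothetical, simplified view of one aligned read at a locus."""
    insert_size: int    # template length of the read pair
    soft_clipped: bool  # True if part of the read was clipped by the aligner
    mapq: int           # mapping quality


def locus_features(reads):
    """Summarise the mapping pattern of one sample at one locus.

    The feature set is an assumption for illustration, not the paper's.
    """
    if not reads:
        raise ValueError("no reads at this locus")
    n = len(reads)
    return {
        "clipped_fraction": sum(r.soft_clipped for r in reads) / n,
        "mean_insert_size": mean(r.insert_size for r in reads),
        "mean_mapq": mean(r.mapq for r in reads),
    }


def tumor_normal_difference(tumor_reads, normal_reads):
    """Return per-feature differences (tumor minus normal) at one locus."""
    tumor = locus_features(tumor_reads)
    normal = locus_features(normal_reads)
    return {name: tumor[name] - normal[name] for name in tumor}


# Toy data: half of the tumor reads are clipped, none of the normal reads.
tumor = [Read(300, True, 60), Read(500, True, 60),
         Read(400, False, 60), Read(400, False, 60)]
normal = [Read(300, False, 60), Read(300, False, 60)]
diff = tumor_normal_difference(tumor, normal)
```

In a setting like the one described in the abstract, such difference vectors, computed per locus, could serve as input to a classifier that separates loci with a somatic SV from loci without one. The choice of classifier is left open here.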

Acknowledgments

We thank the anonymous reviewers for their suggestions, which contributed to improving the manuscript. We also thank the MessAge group and the Bioinformatics Core Unit at IMB for providing extra computational resources when they were needed.

Author information

Corresponding author

Correspondence to Matthias Kuhn.

Additional information

This work has been supported by the German Research Foundation (DFG) Grant RO3500/4-1 within the Research Unit FOR 1961 and by the German Federal Ministry of Research and Education, Grant 031A424 “HaematoOPT”.

Cite this article

Kuhn, M., Stange, T., Herold, S. et al. Finding small somatic structural variants in exome sequencing data: a machine learning approach. Comput Stat 33, 1145–1158 (2018). https://doi.org/10.1007/s00180-016-0674-2

  • DOI: https://doi.org/10.1007/s00180-016-0674-2
