Abstract
Genetic variation forms the basis for diversity but can also be harmful and cause diseases such as tumors. Structural variants (SVs) are an example of complex genetic variation that comprises many nucleotides, ranging up to several megabases. Owing to recent developments in sequencing technology, it has become feasible to elucidate the genetic state of a person’s genes (i.e. the exome) or even the complete genome. Here, a machine learning approach is presented to find small disease-related SVs with the help of sequencing data. The method uses differences in the characteristics of mapping patterns between tumor and normal samples at a genomic locus. In this way, the method aims to be directly applicable to exome sequencing data and to improve the detection of SVs, since SV detection methods tailored to such data are currently lacking. The method has been evaluated in a simulation study as well as on exome data from patients with acute myeloid leukemia. An implementation of the algorithm is available at https://github.com/lenz99-/svmod.
Acknowledgments
We thank the anonymous reviewers for their suggestions, which helped to improve the manuscript. We also thank the MessAge group and the Bioinformatics Core Unit at IMB for providing extra computational resources when they were needed.
Additional information
This work has been supported by the German Research Foundation (DFG) Grant RO3500/4-1 within the Research Unit FOR 1961 and by the German Federal Ministry of Research and Education, Grant 031A424 “HaematoOPT”.
About this article
Cite this article
Kuhn, M., Stange, T., Herold, S. et al. Finding small somatic structural variants in exome sequencing data: a machine learning approach. Comput Stat 33, 1145–1158 (2018). https://doi.org/10.1007/s00180-016-0674-2
DOI: https://doi.org/10.1007/s00180-016-0674-2