Bioinformatics Analysis of Whole Exome Sequencing Data

Part of the Methods in Molecular Biology book series (MIMB, volume 1881)


This chapter contains a step-by-step protocol for identifying somatic SNPs and small indels from next-generation sequencing data of tumor samples and matched normal samples. The workflow presented here is largely based on the Broad Institute’s “Best Practices” guidelines and makes use of their Genome Analysis Toolkit (GATK) platform. Variants are annotated with population allele frequencies from resources such as gnomAD, with curated clinical interpretations from ClinVar, and with curated effect predictions from dbNSFP, using VCFtools, SnpEff, and SnpSift.
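
As a rough illustration of what the annotation step enables downstream, the sketch below reads the INFO column of a single annotated VCF record and flags variants that are rare in a population database. The INFO key `gnomAD_AF`, the 1% cutoff, the function names, and the example record are all assumptions made for illustration. The real key depends on how the annotation was produced, and this is not the chapter's own code.

```python
from typing import Dict, Optional


def parse_info(info: str) -> Dict[str, str]:
    """Split a VCF INFO column ("K1=V1;K2=V2;FLAG") into a dict.

    Flag-style entries without "=" are stored with an empty string value.
    """
    fields: Dict[str, str] = {}
    if info == ".":  # "." means the INFO column is empty
        return fields
    for item in info.split(";"):
        if not item:
            continue
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


def population_af(vcf_line: str, af_key: str = "gnomAD_AF") -> Optional[float]:
    """Return the population allele frequency stored under af_key, if any.

    af_key is a hypothetical annotation name; adjust it to match your VCF.
    """
    columns = vcf_line.rstrip("\n").split("\t")
    if len(columns) < 8:
        raise ValueError("expected at least 8 tab-separated VCF columns")
    raw = parse_info(columns[7]).get(af_key)
    if raw is None or raw in ("", "."):
        return None
    # Multi-allelic sites carry one value per ALT allele; use the largest.
    values = [float(v) for v in raw.split(",") if v not in ("", ".")]
    return max(values) if values else None


def is_rare(vcf_line: str, max_af: float = 0.01, af_key: str = "gnomAD_AF") -> bool:
    """Treat a variant as rare if it is unannotated or at most max_af."""
    af = population_af(vcf_line, af_key)
    return af is None or af <= max_af


if __name__ == "__main__":
    # Made-up record, not real data.
    example = "chr1\t1000\t.\tA\tG\t.\tPASS\tDP=50;gnomAD_AF=0.0001"
    print(is_rare(example))
```

Treating unannotated variants as rare is a deliberate choice here: a somatic variant that is absent from population databases is usually one to keep, not discard. A stricter pipeline might handle missing values separately.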

Key words

Next-generation sequencing; Cancer research; Exome sequencing; Genome sequencing; Clinical genomics; Somatic variant detection; Variant annotation



The authors would like to thank the institutions, developers, and documenters of the informatics tools used in this chapter’s workflows. Genomics and disease research in general benefit hourly from the availability of tools such as Bioconda, BWA, GATK, HaplotypeCaller, Mutect2, Samtools, SnpEff, VarScan, and VCFtools, as well as public resources such as ClinVar and gnomAD.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. BRCF Bioinformatics Core, University of Michigan, Ann Arbor, USA
  2. Division of Hematology and Oncology, Department of Internal Medicine, University of Michigan, Ann Arbor, USA
