Abstract
This chapter presents a step-by-step protocol for identifying somatic SNPs and small indels from next-generation sequencing data of tumor samples and matched normal samples. The workflow is largely based on the Broad Institute’s “Best Practices” guidelines and uses their Genome Analysis Toolkit (GATK) platform. Variants are annotated with population allele frequencies and curated annotations from resources such as gnomAD and ClinVar, and with curated effect predictions from dbNSFP, using VCFtools, SnpEff, and SnpSift.
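As a rough illustration of how the calling, filtering, and annotation stages described above chain together, the following Python sketch assembles command lines for a single tumor/normal pair. It is not the chapter's protocol: the function name, the output file naming, and the jar file names (`snpEff.jar`, `SnpSift.jar`) are assumptions made for this example, and the exact Mutect2 arguments differ between GATK releases, so each command should be checked against the documentation of the installed version. The sketch only builds the commands; it does not run them.

```python
from typing import List, Optional, Tuple

# One workflow step: the argument vector, plus the file that should receive
# the tool's standard output (None when the tool writes its own output file).
Step = Tuple[List[str], Optional[str]]


def build_somatic_workflow(
    reference: str,
    tumor_bam: str,
    normal_bam: str,
    normal_sample: str,
    snpeff_genome: str,
    clinvar_vcf: str,
    prefix: str = "sample",
) -> List[Step]:
    """Assemble (but do not run) a minimal call, filter, annotate sequence.

    All file names derived from `prefix` are hypothetical conventions
    chosen for this sketch.
    """
    raw_vcf = f"{prefix}.raw.vcf.gz"
    filtered_vcf = f"{prefix}.filtered.vcf.gz"
    snpeff_vcf = f"{prefix}.snpeff.vcf"
    clinvar_out = f"{prefix}.snpeff.clinvar.vcf"
    return [
        # Call candidate somatic SNVs and indels in the tumor against the matched normal.
        (["gatk", "Mutect2", "-R", reference,
          "-I", tumor_bam, "-I", normal_bam,
          "-normal", normal_sample, "-O", raw_vcf], None),
        # Apply Mutect2's filters to the raw calls.
        (["gatk", "FilterMutectCalls", "-R", reference,
          "-V", raw_vcf, "-O", filtered_vcf], None),
        # Predict variant effects; SnpEff prints the annotated VCF to stdout.
        (["java", "-jar", "snpEff.jar", snpeff_genome, filtered_vcf], snpeff_vcf),
        # Add ClinVar annotations; SnpSift also prints to stdout.
        (["java", "-jar", "SnpSift.jar", "annotate", clinvar_vcf, snpeff_vcf],
         clinvar_out),
    ]
```

Each returned pair could be handed to `subprocess.run(argv, stdout=open(path, "w"))` when a stdout path is given, or to `subprocess.run(argv)` otherwise. Keeping command construction separate from execution makes the sequence easy to inspect or log before anything is run.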
Acknowledgments
The authors would like to thank the institutions, developers, and documenters of the informatics tools used in this chapter’s workflows. Genomics and disease research in general benefit hourly from the availability of tools such as Bioconda, BWA, GATK, HaplotypeCaller, Mutect2, Samtools, SnpEff, VarScan, and VCFtools, as well as public resources such as ClinVar and gnomAD.
Copyright information
© 2019 Springer Science+Business Media, LLC, part of Springer Nature
About this protocol
Cite this protocol
Ulintz, P.J., Wu, W., Gates, C.M. (2019). Bioinformatics Analysis of Whole Exome Sequencing Data. In: Malek, S. (ed) Chronic Lymphocytic Leukemia. Methods in Molecular Biology, vol 1881. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-8876-1_21
DOI: https://doi.org/10.1007/978-1-4939-8876-1_21
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-8875-4
Online ISBN: 978-1-4939-8876-1
eBook Packages: Springer Protocols