Evaluation of Whole Genome Sequencing Data

Hübschmann, Daniel; Schlesner, Matthias

doi:10.1007/978-1-4939-9151-8_15

Part of the book series: Methods in Molecular Biology (MIMB, volume 1956)

Abstract

Whole genome sequencing (WGS) can provide comprehensive insights into the genetic makeup of lymphomas. Here we describe a selection of methods for the analysis of WGS data, including alignment, the identification of different classes of genomic variants, and the identification of driver mutations and mutational signatures. We further outline design considerations for WGS studies and describe a variety of quality control measures for detecting common problems in the data.

Acknowledgments

This work has been supported by the German Ministry of Science and Education (BMBF) in the framework of the ICGC MMML-Seq (01KU1002A-J) and the ICGC DE-MINING (01KU1505E) projects and the Heidelberg Center for Human Bioinformatics (HD-HuB) within the German Network for Bioinformatics Infrastructure (de.NBI) (#031A537A, #031A537C). We are grateful to all present and previous members of the Division of Theoretical Bioinformatics, the DKFZ-HIPO bioinformatics team, the Omics IT and Data Management Core Facility, and the Bioinformatics and Omics Data Analytics group of the German Cancer Research Center (DKFZ, Heidelberg), as well as coworkers in the ICGC MMML-Seq and PedBrain projects who were involved in establishing the procedures described here.

Author information

Daniel Hübschmann
Present address: Division of Stem Cells and Cancer, German Cancer Research Center (DKFZ), Heidelberg, Germany; Heidelberg Institute for Stem Cell Technology and Experimental Medicine (HI-STEM), Heidelberg, Germany

Authors and Affiliations

Division of Theoretical Bioinformatics (B080), German Cancer Research Center (DKFZ), Heidelberg, Germany
Daniel Hübschmann
Department of Pediatric Immunology, Hematology and Oncology, University Hospital, Heidelberg, Germany
Daniel Hübschmann
Bioinformatics and Omics Data Analytics (B240), German Cancer Research Center (DKFZ), Heidelberg, Germany
Matthias Schlesner

Corresponding author

Correspondence to Matthias Schlesner.

Editor information

Editors and Affiliations

Institute of Cell Biology (Cancer Research), Medical School, University of Duisburg-Essen, Nordrhein-Westfalen, Essen, Germany
Ralf Küppers

Cite this protocol

Hübschmann, D., Schlesner, M. (2019). Evaluation of Whole Genome Sequencing Data. In: Küppers, R. (ed) Lymphoma. Methods in Molecular Biology, vol 1956. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-9151-8_15

DOI: https://doi.org/10.1007/978-1-4939-9151-8_15
Published: 19 February 2019
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-9150-1
Online ISBN: 978-1-4939-9151-8
eBook Packages: Springer Protocols
