Whole-Exome Sequencing Data – Identifying Somatic Mutations

  • Roberta Spinelli
  • Rocco Piazza
  • Alessandra Pirola
  • Simona Valletta
  • Roberta Rostagno
  • Angela Mogavero
  • Manuela Marega
  • Hima Raman
  • Carlo Gambacorti-Passerini

Abstract

Next-generation sequencing of hematological malignancies generates a tremendous amount of data. Storing, managing, and analyzing terabytes of sequencing data, often drawn from very different sources, is a challenging bioinformatics problem. Our project focuses on sequence analysis of human cancer genomes, with the aim of identifying the genetic lesions underlying tumor development. However, the automated detection of somatic mutations and the statistical testing used to identify genetic lesions remain open problems. We therefore propose a computational procedure that handles large-scale sequencing data to detect exonic somatic mutations in a tumor sample. The pipeline comprises several steps based on open-source software and the R language: alignment, detection of mutations, annotation, functional classification, and visualization of results. We analyzed Illumina whole-exome sequencing data from five leukemic patients and their five paired controls, plus one colon cancer sample and its paired control. The results were validated by Sanger sequencing.
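
The idea of using a paired control to isolate somatic mutations can be illustrated with a minimal sketch. It is not the authors' pipeline: the `Variant` record, the `somatic_candidates` function, and the example calls are all hypothetical, and the sketch assumes that variant calling has already produced one list of variants per sample. A variant is kept as a somatic candidate only if it appears in the tumor sample and is absent from the paired control.

```python
from typing import Iterable, List, NamedTuple, Set


class Variant(NamedTuple):
    """Hypothetical minimal variant record (not a format used by the authors)."""
    chrom: str
    pos: int   # 1-based position on the chromosome
    ref: str   # reference allele
    alt: str   # alternate allele observed in the sample


def somatic_candidates(tumor: Iterable[Variant],
                       control: Iterable[Variant]) -> List[Variant]:
    """Return tumor variants that are absent from the paired control.

    Variants shared with the control are treated as germline and dropped.
    Input order is preserved and duplicates are reported once.
    """
    germline: Set[Variant] = set(control)
    seen: Set[Variant] = set()
    result: List[Variant] = []
    for variant in tumor:
        if variant not in germline and variant not in seen:
            seen.add(variant)
            result.append(variant)
    return result


# Made-up example calls, for illustration only.
tumor_calls = [
    Variant("chr9", 1000, "A", "G"),
    Variant("chr22", 2000, "C", "T"),
    Variant("chr1", 3000, "G", "A"),
]
control_calls = [
    Variant("chr22", 2000, "C", "T"),  # also in the control: treated as germline
]

candidates = somatic_candidates(tumor_calls, control_calls)
for v in candidates:
    print(f"{v.chrom}:{v.pos} {v.ref}>{v.alt}")
```

A real analysis would also need to account for coverage and base quality in both samples, since a variant can be missing from the control simply because the position was poorly sequenced. The sketch ignores this.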

Keywords

Chronic Myeloid Leukemia, Somatic Mutation, Ingenuity Pathway Analysis, Human Genome Reference, Integrative Genomic Viewer
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Abbreviations

ANNOVAR

annotation of genetic variants

BAM

binary alignment/map format

BC

blast crisis

BWA

Burrows–Wheeler alignment

BWT

Burrows–Wheeler transform

CIRCOS

circular visualization of tabular data

CML

chronic myeloid leukemia

CNV

copy number variation

DNA

deoxyribonucleic acid

IGV

integrative genomics viewer

INDEL

insertion or deletion

IPA

ingenuity pathway analysis

LOH

loss of heterozygosity

LRT

likelihood ratio test

PDGFR

platelet-derived growth factor receptor

PML-RAR

promyelocytic leukemia-retinoic acid receptor

Ph

Philadelphia chromosome

R

R programming language

RTA

real time analyzer

SAM

sequence alignment map

SAMtools

tools for sequence alignment maps

SCS

sequencing control software

SIFT

sorting intolerant from tolerant

SNP

single-nucleotide polymorphism

SNV

single nucleotide variant

UCSC

University of California Santa Cruz

aCGH

array-comparative genomic hybridization

aCML

atypical chronic myeloid leukemia

ddNTP

dideoxynucleotide

gDNA

genomic DNA

Copyright information

© Springer-Verlag 2014

Authors and Affiliations

  1. Department of Health Science, University of Milano-Bicocca, Monza, Italy
  2. Department of Health Science, University of Milano-Bicocca, Monza, Italy
  3. Department of Health Science, University of Milano-Bicocca, Monza, Italy
  4. Department of Health Science, University of Milano Bicocca, Monza, Italy
  5. Department of Health Science, University Milano Bicocca, Monza, Italy
  6. Department of Health Science, Università di Milano Bicocca, Monza, Italy
  7. Institut für Neuropathologie, Justus-Liebig-Universität, Gießen, Germany
  8. Department of Health Science, University of Milano Bicocca, Monza, Italy
  9. Department of Health Science, University of Milano Bicocca, Monza, Italy