Whole-Exome Sequencing Data – Identifying Somatic Mutations

  • Roberta Spinelli
  • Rocco Piazza
  • Alessandra Pirola
  • Simona Valletta
  • Roberta Rostagno
  • Angela Mogavero
  • Manuela Marega
  • Hima Raman
  • Carlo Gambacorti-Passerini


The use of next-generation sequencing instruments to study hematological malignancies generates a tremendous amount of sequencing data. This creates a challenging bioinformatics problem: storing, managing, and analyzing terabytes of sequencing data, often generated from very different data sources. Our project focuses mainly on sequence analysis of human cancer genomes, in order to identify the genetic lesions underlying the development of tumors. However, the automated detection of somatic mutations and the statistical testing used to identify genetic lesions remain open problems. We therefore propose a computational procedure for handling large-scale sequencing data in order to detect exonic somatic mutations in a tumor sample. The proposed pipeline comprises several steps based on open-source software and the R language: alignment, detection of mutations, annotation, functional classification, and visualization of results. We analyzed Illumina whole-exome sequencing data from five leukemic patients and their five paired controls, plus one colon cancer sample and its paired control. The results were validated by Sanger sequencing.
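The core idea behind detecting somatic mutations from a tumor sample and its paired control can be illustrated with a small sketch. This is not the authors' pipeline: it assumes that variant calling has already been done for both samples, and it simply keeps the tumor variants that are absent from the control. The `Variant` type, the `somatic_candidates` function, and the coordinates are hypothetical toy values chosen for illustration.

```python
from typing import Iterable, List, NamedTuple


class Variant(NamedTuple):
    """A single-nucleotide variant call (hypothetical minimal record)."""
    chrom: str
    pos: int   # 1-based position on the chromosome
    ref: str   # reference allele
    alt: str   # alternate allele observed in the sample


def somatic_candidates(tumor: Iterable[Variant],
                       control: Iterable[Variant]) -> List[Variant]:
    """Return tumor variants not seen in the paired control.

    Variants shared with the control are treated as germline and dropped.
    The result is sorted by (chrom, pos, ref, alt) as plain tuples, so
    chromosome names are ordered as strings, not numerically.
    """
    control_set = set(control)
    return sorted(v for v in set(tumor) if v not in control_set)


# Toy calls, not real data: one variant is shared by both samples.
tumor_calls = [
    Variant("chr9", 1000, "C", "T"),
    Variant("chr22", 2000, "G", "A"),
    Variant("chr1", 500, "A", "G"),
]
control_calls = [Variant("chr1", 500, "A", "G")]

candidates = somatic_candidates(tumor_calls, control_calls)
for v in candidates:
    print(f"{v.chrom}:{v.pos} {v.ref}>{v.alt}")
```

A real analysis would also need to account for sequencing depth, base and mapping quality, and allele frequency in both samples before a variant is reported as somatic; the set difference above only captures the tumor-versus-control comparison.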


Keywords: Chronic Myeloid Leukemia, Somatic Mutation, Ingenuity Pathway Analysis, Human Genome Reference, Integrative Genomic Viewer



Abbreviations

ANNOVAR: annotation of genetic variants
BAM: binary alignment format
BC: blast crisis
BWA: Burrows–Wheeler alignment
BWT: Burrows–Wheeler transform
Circos: circular visualization of tabular data
CML: chronic myeloid leukemia
CNV: copy number variation
DNA: deoxyribonucleic acid
IGV: integrative genomics viewer
indel: insertion or a deletion
IPA: ingenuity pathway analysis
LOH: loss of heterozygosity
LRT: linear regression technique
PDGFR: platelet-derived growth factor receptor
PML-RAR: promyelocytic leukemia-retinoic acid receptor
Ph: Philadelphia chromosome
R: R programming language
RTA: real time analyzer
SAM: sequence alignment map
SAMtools: tools for sequence alignment maps
SCS: sequencing control software
SIFT: sorts intolerant from tolerant
SNP: single-nucleotide polymorphism
SNV: single nucleotide variant
UCSC: University of California Santa Cruz
aCGH: array-comparative genomic hybridization
aCML: atypical chronic myeloid leukemia
gDNA: genomic DNA



Copyright information

© Springer-Verlag 2014

Authors and Affiliations

  1. Department of Health Science, University of Milano-Bicocca, Monza, Italy
  2. Department of Health Science, University of Milano-Bicocca, Monza, Italy
  3. Department of Health Science, University of Milano-Bicocca, Monza, Italy
  4. Department of Health Science, University of Milano Bicocca, Monza, Italy
  5. Department of Health Science, University Milano Bicocca, Monza, Italy
  6. Department of Health Science, Università di Milano Bicocca, Monza, Italy
  7. Institut für Neuropathologie, Justus-Liebig-Universität, Gießen, Germany
  8. Department of Health Science, University of Milano Bicocca, Monza, Italy
  9. Department of Health Science, University of Milano Bicocca, Monza, Italy
