Computational Analysis in Cancer Exome Sequencing

  • Perry Evans
  • Yong Kong
  • Michael Krauthammer
Protocol
Part of the Methods in Molecular Biology book series (MIMB, volume 1176)

Abstract

Exome sequencing in cancer is a powerful tool for identifying mutational events across the coding region of human genes. Here, we describe computational methods that use exome sequencing reads from cancer samples to identify somatic single nucleotide variants (SNVs), copy number alterations, and short insertions and deletions (InDels). We further describe analytical methods to generate lists of driver genes with more mutational events than expected by chance.

Key words

Cancer, Single nucleotide variant, Copy number variation, Exome sequencing, Gene burden, InDels

1 Introduction

Exome sequencing of paired normal and tumor samples allows for the identification of somatic changes in a patient, including single nucleotide variants (SNVs), copy number changes, and short InDels (insertions and deletions). The falling cost of exome sequencing allows for the aggregation of such mutations across the coding regions of human genes, yielding a more complete picture of the mutational landscape of cancer. Here, we outline methods for using such data to identify genes that might be important for cancer. We first present methods for preprocessing and alignment of sequencing reads, followed by methods for identifying genes with more mutational events than expected by chance. The underlying methods need to address the many genomic biases that affect the number of observed mutations, such as chromatin state and the replication timing of the mutated genomic loci. Our analytical methods use indirect evidence to estimate the magnitude of these biases, mainly in the form of neutral mutational events (i.e., synonymous mutations).

2 Materials

For this analysis, we used existing software and developed our own scripts as necessary. All software used is freely available. Our sequencing data consisted of fastq files generated by Illumina GA2 and HiSeq machines, after exome capture using the NimbleGen SeqCap EZ Exome Library kit.

2.1 Data Files

  1. Fastq files from sequencing runs for all normal and tumor samples.
  2. A file indicating what genomic bases are covered in the exome capture region.

2.2 Software

  1. Btrim: software to trim adapters and low-quality regions in reads, obtained at http://graphics.med.yale.edu/trim/ [1].
  2. BWA: Burrows-Wheeler Aligner, obtained at http://bio-bwa.sourceforge.net/ [2].
  3. SAMtools: Sequence Alignment/Map tools, obtained at http://samtools.sourceforge.net/ [3].
  4. Variant Effect Predictor: tool for predicting the functional consequences of known and unknown variants, downloaded from http://useast.ensembl.org/info/docs/variation/vep/index.html.
  5. CONTRA: Copy Number Targeted Resequencing Analysis, obtained at http://sourceforge.net/apps/mediawiki/contra-cnv/index.php [4].
  6. CMDS: correlation matrix diagonal segmentation, obtained at https://dsgweb.wustl.edu/qunyuan/software/cmds [5].
  7. DNAcopy: DNA copy number data analysis in R, obtained at http://www.bioconductor.org/packages/2.11/bioc/html/DNAcopy.html.
  8. Circos: tool for visualizing data in a circular layout, obtained at http://circos.ca/ [6].
  9. Integrative Genomics Viewer: tool for visualizing genomic tracks, downloaded from http://www.broadinstitute.org/igv/ [7].
  10. LiftOver: tool to convert genome coordinates between assemblies, obtained at http://genome.ucsc.edu/cgi-bin/hgLiftOver.
  11. R: language and environment for statistical computing and graphics, downloaded from http://cran.r-project.org/.

2.3 Databases

  1. UCSC GRCh37/hg19 human genome sequence and RefGene.txt file (see Note 1).
  2. dbSNP: NCBI database of short genetic variations, obtained at http://www.ncbi.nlm.nih.gov/SNP/ or http://genome.ucsc.edu/.
  3. 1000 Genomes: catalog of human genetic variation, obtained at ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/.

3 Methods

The methods outlined below were developed while analyzing 99 melanoma samples and their matched normal samples [8]. These methods are applicable to sequence data generated by any next-generation sequencing and capture platform. Figure 1 shows an overview of the methods.

Fig. 1. Sequence analysis pipeline. For each sample, reads are aligned to the UCSC human genome. Following alignment, sequence data are used to produce copy number variations, single nucleotide variants (SNVs), and insertion/deletions (InDels). SNV data are used to find loss of heterozygosity regions in tumors, and to determine genes with significant nonsynonymous SNV burden across samples.

3.1 Alignment

  1. For each exome fastq file, trim low-quality regions from reads using Btrim with a window size of 10 and a quality cutoff of 25. If adapters are present, trim them with Btrim as well.
  2. Align trimmed reads to GRCh37/hg19 using BWA with default parameters.
  3. For each sample, find the average sequencing error rate as the fraction of bases from sequencing reads that do not match the reference genome at their aligned position.
  4. For each sample, determine the average coverage for each base in the capture region.
  5. For each sample, find the percent of bases covered at least eight times across the capture region.
  6. Use the three sequence quality measures above to discard samples where coverage is too low or the error rate is too high (see Note 2).

     
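The three quality measures in steps 3 to 6 can be sketched in a few lines of Python. This is only an illustration: the function names, the input layout (a list of per-base depths over the capture region plus mismatch and aligned-base totals), and the default cutoffs in `passes_qc` are assumptions made for the example, not values prescribed by the protocol.

```python
from statistics import mean


def qc_metrics(depths, mismatched_bases, aligned_bases, min_depth=8):
    """Return (error_rate, mean_coverage, fraction_covered) for one sample.

    depths: read depth at every base of the capture region (hypothetical input).
    mismatched_bases / aligned_bases: totals over all aligned reads.
    """
    if aligned_bases == 0 or not depths:
        raise ValueError("no aligned bases or empty capture region")
    error_rate = mismatched_bases / aligned_bases
    mean_coverage = mean(depths)
    # Fraction of capture-region bases covered by at least min_depth reads.
    fraction_covered = sum(d >= min_depth for d in depths) / len(depths)
    return error_rate, mean_coverage, fraction_covered


def passes_qc(metrics, max_error=0.01, min_mean_cov=30, min_fraction=0.9):
    """Apply sample-level cutoffs; the defaults are illustrative assumptions."""
    error_rate, mean_coverage, fraction_covered = metrics
    return (error_rate <= max_error
            and mean_coverage >= min_mean_cov
            and fraction_covered >= min_fraction)
```

In practice the per-base depths would come from the alignment files restricted to the capture region, and the cutoffs would be chosen after inspecting the distribution of each measure across all samples.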

3.2 SNV Calling

  1. Use SAMtools to remove PCR duplicates and call single nucleotide variants (SNVs).
  2. Annotate SNV protein effects (synonymous, nonsynonymous, UTR, intron) based on the annotation files from UCSC. RefGene.txt from UCSC contains the exon and intron boundaries as well as the transcript start and end positions. Using this information, the phase of each nucleotide within its codon can be determined. Use the Variant Effect Predictor on intronic SNVs to identify splice-site variants.
  3. Check synonymous and nonsynonymous annotations for adjacent pairs of SNVs affecting the same codon. If such a pair is present, scan the sequencing reads for the occurrence of both SNVs on the same read, and predict the amino acid change from the simultaneous mutations.
  4. Filter SNVs, keeping those that meet all of these criteria: a mutant allele frequency of at least 13%, a SAMtools mapping score of at least 40, a minimum coverage of four mutant and eight total reads, at least one forward and one reverse read, and a uniform mapping of reads with the mutant allele.
  5. Produce a list of novel SNVs by filtering out SNVs present in dbSNP and 1000 Genomes (see Note 3).
  6. Call an SNV somatic if both of the following criteria hold: at most one mutant read in the normal sample, and a significantly different variant-to-total read ratio between tumor and normal samples as assessed by Fisher's exact test (p-value threshold of 0.001).
  7. There will be cases where a tumor has a clear SNV, but the normal coverage at this position is too low to determine whether the SNV is somatic or present in both tumor and normal. In these cases, call the SNV somatic if it is somatic in another sample.

     
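The read-count filters of step 4 and the somatic test of step 6 can be illustrated as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the strand requirement is interpreted as applying to mutant-supporting reads, the "uniform mapping" criterion is not modeled, and Fisher's exact test is computed one-sided (higher mutant fraction in tumor), which the protocol does not specify.

```python
from math import comb


def passes_snv_filter(mut_reads, total_reads, mapping_score, fwd_mut, rev_mut):
    """Step 4 thresholds; fwd_mut/rev_mut are mutant reads per strand (assumed)."""
    return (total_reads >= 8
            and mut_reads >= 4
            and mut_reads / total_reads >= 0.13
            and mapping_score >= 40
            and fwd_mut >= 1
            and rev_mut >= 1)


def fisher_greater(tumor_mut, tumor_total, normal_mut, normal_total):
    """One-sided Fisher's exact p-value for a higher mutant fraction in tumor."""
    mutant = tumor_mut + normal_mut
    total = tumor_total + normal_total
    # Hypergeometric tail: probability of seeing at least tumor_mut of the
    # mutant reads among the tumor reads if reads were assigned at random.
    tail = sum(comb(mutant, k) * comb(total - mutant, tumor_total - k)
               for k in range(tumor_mut, min(mutant, tumor_total) + 1))
    return tail / comb(total, tumor_total)


def is_somatic(tumor_mut, tumor_total, normal_mut, normal_total, alpha=0.001):
    """Step 6: at most one mutant read in normal and a significant Fisher test."""
    if normal_mut > 1:
        return False
    return fisher_greater(tumor_mut, tumor_total, normal_mut, normal_total) < alpha
```

For example, 10 mutant reads out of 20 in the tumor against 0 out of 20 in the normal passes the test, whereas 2 out of 20 against 0 out of 20 does not reach the 0.001 threshold.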

3.3 Somatic SNV Call Precision and Sensitivity

  1. Randomly choose a few hundred SNVs called in tumors and validate these with Sanger sequencing (see Note 4). Call the set of validated SNVs V.
  2. Calculate the precision of SNV calling as the percent of called SNVs that were validated by Sanger sequencing.
  3. Find the subset of V that is called as somatic by the computational pipeline. Call this set SV. Use Sanger sequencing to test for the absence of all SNVs in SV in the matched normal samples. This test indicates how many SNVs are truly somatic, ST. Calculate the precision of somatic SNV calls as ST/SV, the percent of SNVs called as somatic that were validated to be somatic by Sanger sequencing.
  4. In each normal sample, identify variants present in dbSNP. Calculate the sensitivity of SNV calls as the percent of normal SNPs called in the corresponding matched tumor samples (see Note 5).
  5. Take all SNVs from V, and test normal samples with Sanger sequencing to see if the SNVs are also present in the matched normal samples. Retain SNVs that are truly somatic, and calculate the sensitivity of the somatic calling pipeline as the percent of truly somatic SNVs detected computationally.

     
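The precision and sensitivity calculations above reduce to set arithmetic. The sketch below assumes each SNV is represented by a hashable identifier (for example a sample, chromosome, position, and allele tuple); the function names are hypothetical.

```python
def precision(called, confirmed):
    """Fraction of called variants that were confirmed (e.g., ST/SV in step 3)."""
    called, confirmed = set(called), set(confirmed)
    if not called:
        raise ValueError("no called variants")
    return len(called & confirmed) / len(called)


def sensitivity(truth, detected):
    """Fraction of true variants that the pipeline detected (steps 4 and 5)."""
    truth, detected = set(truth), set(detected)
    if not truth:
        raise ValueError("no true variants")
    return len(truth & detected) / len(truth)
```

For step 2, `called` is the set of SNVs submitted to Sanger sequencing and `confirmed` is V. For step 5, `truth` is the set of SNVs in V confirmed as somatic by Sanger sequencing and `detected` is SV.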

3.4 InDel Calling

  1. Use SAMtools to call InDels.
  2. Filter for InDels with a mutant allele frequency greater than 0.1, a minimum SNP score of 250 in at least one tumor sample, and an absence from dbSNP.
  3. Find somatic InDels by excluding all InDels present in normal samples. When an InDel is not annotated in a normal sample, require the normal sequence coverage to be at least 8 independent reads for the tumor InDel to pass the filter.
  4. Use the Variant Effect Predictor on all InDels to determine their effects. Retain InDels that cause frame-shifts or affect stop codons.

3.5 Somatic Copy Number Analysis

  1. Use CONTRA with the multimapped read exclusion flag to determine copy number events for each matched melanoma sample.
  2. The sequence coverage log fold change between tumor and normal produced by CONTRA can be visualized with the Integrative Genomics Viewer.
  3. Count the number of samples for which at least one exon in a gene has a significant CONTRA call and fit a Poisson distribution to the resulting sample counts per gene. Retain genes that have copy number events in significantly more samples than expected.
  4. Run CMDS across all samples. Limit genes with CONTRA copy number calls to those that are located in chromosomal bands with significant CMDS calls to produce a final list of enriched somatic copy number events.

     
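Step 3 can be sketched as follows, assuming the per-gene sample counts are already available as a dictionary. The function names and the significance cutoff are assumptions for the example; the Poisson rate is fitted by its maximum-likelihood estimate, the mean count per gene.

```python
from math import exp


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    term = exp(-lam)  # P(X = 0)
    below = 0.0
    for i in range(k):
        below += term
        term *= lam / (i + 1)  # P(X = i + 1) from P(X = i)
    return max(0.0, 1.0 - below)


def enriched_genes(sample_counts, alpha=0.05):
    """Return {gene: p-value} for genes altered in more samples than expected.

    sample_counts: gene -> number of samples with a significant call
    (hypothetical input); alpha is an illustrative cutoff.
    """
    lam = sum(sample_counts.values()) / len(sample_counts)
    result = {}
    for gene, count in sample_counts.items():
        p = poisson_upper_tail(count, lam)
        if p < alpha:
            result[gene] = p
    return result
```

Genes with zero altered samples should be included in `sample_counts` so that the fitted rate reflects the whole capture region, and the resulting p-values would normally be corrected for multiple testing.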

3.6 Loss of Heterozygosity

  1. Identify heterozygous positions in normal samples as SNVs with a mutant allele frequency ranging from 0.4 to 0.6 and a sequence depth of at least 10. For each normal heterozygous SNV, record the zygosity of the matched tumor SNV. Tumor SNV zygosity is determined using a binomial test to compare the observed mutant allele frequency to 0.5. Use DNAcopy to perform circular binary segmentation on homozygous and heterozygous positions in tumor.
  2. Filter each DNAcopy call for region-wide LOH by requiring that a high percentage of tumor SNVs within the region be labeled as homozygous.
  3. For a summary visual of LOH regions, use Circos to view each sample's LOH regions in a circular layout.
  4. Use LOH regions to determine the admixture of normal and tumor cells within a tumor sample as follows. For all tumor SNVs within LOH regions, determine the average absolute difference between each SNV's mutant allele frequency and 0.5. Double this average to obtain a measure of tumor sample purity. Pure samples will have a value close to one. Samples with low purity can be discarded in further analyses.

     
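The zygosity test in step 1 and the purity estimate in step 4 can be written compactly. This sketch assumes a two-sided exact binomial test against 0.5 and an illustrative significance level; the function names are hypothetical, and the segmentation itself (DNAcopy) is not reproduced here.

```python
from math import comb


def binom_two_sided_half(k, n):
    """Two-sided exact binomial p-value for H0: mutant allele frequency = 0.5."""
    lower = sum(comb(n, i) for i in range(0, k + 1)) / 2 ** n  # P(X <= k)
    upper = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n  # P(X >= k)
    return min(1.0, 2 * min(lower, upper))


def is_homozygous(mut_reads, total_reads, alpha=0.05):
    """Label a tumor SNV homozygous if its frequency departs from 0.5.

    alpha is an assumed cutoff, not a value given in the protocol.
    """
    return binom_two_sided_half(mut_reads, total_reads) < alpha


def purity(allele_freqs):
    """Twice the mean absolute deviation from 0.5 over SNVs in LOH regions."""
    if not allele_freqs:
        raise ValueError("no SNVs in LOH regions")
    return 2 * sum(abs(f - 0.5) for f in allele_freqs) / len(allele_freqs)
```

With mutant allele frequencies of 0.9, 0.1, 0.95, and 0.05 inside LOH regions, the purity estimate is 0.85; a pure tumor with complete LOH would give frequencies near 0 or 1 and an estimate near 1.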

3.7 Identification of Genes with Significant Somatic Nonsynonymous Mutation Burden

  1. Construct a file of simulated mutations as follows. For each base in the exome capture region, mutate to all possible nucleotides, and record the amino acid encoded by the new codon.
  2. If mutations in your data have a genomic context bias, divide the simulation file into different contexts to account for mutation biases (see Note 6).
  3. For each gene in the exome capture region, and for each genomic context, use the simulation file to determine the ratio of nonsynonymous to synonymous changes for the gene if all possible mutations occurred according to the observed distribution of somatic mutations (see Note 7).
  4. For each mutated gene and each genomic context, find the expected number of nonsynonymous mutations by multiplying the observed number of synonymous mutations by the ratio of nonsynonymous to synonymous calculated above. Obtain a p-value for each genomic context using a binomial test to compare the observed number of nonsynonymous mutations to the expected number given the length of the gene for this context (see Note 8).
  5. For each gene, combine p-values across genomic contexts using Fisher's combined probability test to arrive at a final p-value for the gene.
  6. P-values must be corrected for multiple testing. We used the Benjamini–Hochberg procedure, as implemented in R through the BH function provided in the Mutoss package. We entered p-values for every gene in the capture region. Genes with no nonsynonymous mutations were given a p-value of 1.
  7. Some tumors may have many nonsynonymous mutations in the same gene, resulting in a high mutation burden ranking that is attributable to only a few samples. In this case, the gene will not be a relevant driver for the whole sample set. To avoid this problem, you can add a filter to your gene list that requires a gene to be mutated in a minimum number of samples.
  8. To ensure that your gene list is robust, you can remove the top 5% or 10% of tumors by nonsynonymous SNV count and run the mutation burden analysis on this slightly smaller set of tumors. You can report genes that are ranked highly in both burden runs.
  9. To further narrow down this final list of significant genes, compare the gene list with LOH regions and genes known to be expressed in tumors. Genes that have a significant mutation burden, are expressed in tumors, and are often in LOH regions are good candidates for cancer drivers.
  10. If your tumor samples can be divided into different classes, you might want to run this mutation burden analysis for each sample class, provided there are enough somatic mutations. As an example, for melanoma tumors, we split our tumors based on sun exposure. We only ran the burden analysis on sun-exposed tumors because sun-shielded tumors had too few mutations to adequately fill the genomic contexts for all genes (see Note 9).

     
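Steps 1, 3, 4 (expected count), and 5 can be illustrated with the sketch below. It is a simplified example under several assumptions: the input is an in-frame coding sequence in uppercase A, C, G, T rather than a capture-region file, the optional `weights` dictionary stands in for the observed mutation distribution, genomic contexts are not separated, and all function names are hypothetical. Fisher's combined test uses the closed-form chi-square survival function for even degrees of freedom.

```python
from math import exp, log

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def simulate_counts(cds, weights=None):
    """Weighted counts of all possible (nonsynonymous, synonymous) changes.

    weights: optional {(ref, alt): weight} reflecting the observed mutation
    distribution; every change counts equally when omitted (an assumption).
    """
    nonsyn = syn = 0.0
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    for start in range(0, usable, 3):
        codon = cds[start:start + 3]
        for pos, ref in enumerate(codon):
            for alt in BASES:
                if alt == ref:
                    continue
                w = 1.0 if weights is None else weights.get((ref, alt), 0.0)
                mutant = codon[:pos] + alt + codon[pos + 1:]
                if CODON_TABLE[mutant] == CODON_TABLE[codon]:
                    syn += w
                else:
                    nonsyn += w
    return nonsyn, syn


def expected_nonsyn(observed_syn, nonsyn, syn):
    """Step 4: observed synonymous count times the simulated N/S ratio."""
    if syn == 0:
        raise ValueError("no synonymous sites; a fallback is needed (Note 8)")
    return observed_syn * nonsyn / syn


def fisher_combined(pvalues):
    """Step 5: Fisher's combined probability test over per-context p-values."""
    # half_stat is X/2 where X = -2 * sum(ln p); p is clamped to avoid log(0).
    half_stat = -sum(log(max(p, 1e-300)) for p in pvalues)
    term, total = 1.0, 0.0
    for i in range(len(pvalues)):  # chi-square survival function, df = 2k
        total += term
        term *= half_stat / (i + 1)
    return min(1.0, exp(-half_stat) * total)
```

As a small check, the sequence ATGGGG has 15 nonsynonymous and 3 synonymous possible changes under uniform weighting, so two observed synonymous mutations would give an expected count of 10 nonsynonymous mutations.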

3.8 Identification of Genes with Significant Mutations that Abrogate Protein Function

  1. For a special case of mutation burden, gather the nonsense mutations, splice-site variants, frame-shift InDels, and InDels that affect stop codons for all samples. For each gene, calculate the sum of these mutation types over all samples (see Note 10).
  2. For each gene, use a binomial test to find the probability of observing the deleterious mutation count given the gene length and the exome-wide probability of a deleterious mutation occurring. For gene length, use the length of the gene in the capture region.
  3. Use the Benjamini–Hochberg procedure to correct for multiple testing, and apply any of the additional filters from the nonsynonymous gene burden test above as needed.

     
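A minimal sketch of steps 2 and 3 follows. The function names and input dictionaries are hypothetical, the exome-wide rate is assumed to be the total deleterious count divided by the total captured length, and the binomial test is taken as the upper tail P(X >= observed). The tail is computed as one minus the lower sum, which is adequate for a sketch but loses precision for extremely small p-values.

```python
def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    if k <= 0 or p >= 1.0:
        return 1.0
    term = (1.0 - p) ** n  # P(X = 0)
    odds = p / (1.0 - p)
    below = 0.0
    for i in range(min(k, n + 1)):
        below += term
        term *= (n - i) / (i + 1) * odds  # P(X = i + 1) from P(X = i)
    return max(0.0, 1.0 - below)


def deleterious_pvalues(counts, lengths):
    """Per-gene p-values from deleterious counts and captured gene lengths.

    counts: gene -> number of deleterious mutations over all samples.
    lengths: gene -> number of bases of the gene in the capture region.
    """
    rate = sum(counts.values()) / sum(lengths.values())  # exome-wide, per base
    return {gene: binom_upper_tail(counts.get(gene, 0), lengths[gene], rate)
            for gene in lengths}


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Every gene in the capture region should be passed to `benjamini_hochberg`, including genes without deleterious mutations, which receive a p-value of 1 from `deleterious_pvalues`.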

4 Notes

  1. All genomic coordinates for this work reference hg19. When a dataset, like 1000 Genomes, is not represented in hg19 coordinates, use liftOver to convert coordinates to hg19.
  2. Melanomas showed an average sequence error rate around 0.24%, an average coverage of 65 reads (minimum 30 and maximum 93), and at least 90% of the capture region bases were covered by at least 8 reads.
  3. dbSNP contains some rare but important cancer mutations, like BRAF V600E at chr7:140453136 and chr7:140453137. Entries like these are marked in dbSNP as having an unknown origin. Variants of unknown origin should therefore not be used to filter out SNVs in the novel filtering process.
  4. For melanoma, we validated 266 SNVs with Sanger sequencing.
  5. We assume that variant normal bases that correspond to SNPs will rarely be mutated in tumor samples.
  6. As an example, melanoma has three relevant genomic contexts: dipyrimidine cytosines, non-dipyrimidine cytosines, and thymidines (and their complements).
  7. To find the observed distribution of mutations, you can count either synonymous mutations or both nonsynonymous and synonymous mutations. Using only the synonymous mutations is preferred because nonsynonymous mutations might be affected by selection. If nonsynonymous mutations are used to determine the observed mutation distribution, exclude those from the genes with the highest nonsynonymous mutation counts.
  8. If a gene has no synonymous mutations for a context, use the genome-wide frequency of synonymous mutations for this context with the gene's length and nonsynonymous to synonymous ratio to arrive at an expected number of nonsynonymous mutations.
  9. For sun-exposed melanomas, a median of 171 nonsynonymous mutations across 61 tumors was sufficient to run the mutation burden analysis, but a median of 9 nonsynonymous mutations across 26 sun-shielded tumors resulted in too few mutations. To produce a candidate gene list of cancer drivers for tumors with few mutations, you can simply find genes with recurrent nonsynonymous mutations from different samples that affect the same codon.
  10. These mutations have a higher probability of affecting protein function than mutations that change one amino acid to another, and should be analyzed separately.

     
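The recurrence search suggested in Note 9 for tumors with few mutations can be sketched as a simple grouping step. The input layout (one tuple of sample, gene, and codon position per nonsynonymous mutation), the function name, and the minimum of two samples are assumptions for the example.

```python
from collections import defaultdict


def recurrent_codons(mutations, min_samples=2):
    """Return {(gene, codon): n_samples} for codons hit in several samples.

    mutations: iterable of (sample, gene, codon_position) tuples describing
    somatic nonsynonymous mutations (hypothetical input format).
    """
    samples_by_site = defaultdict(set)
    for sample, gene, codon in mutations:
        # A set ensures that repeated calls within one sample count once.
        samples_by_site[(gene, codon)].add(sample)
    return {site: len(samples)
            for site, samples in samples_by_site.items()
            if len(samples) >= min_samples}
```

For instance, a mutation at BRAF codon 600 seen in two different tumors would be reported with a count of 2, while a codon mutated in a single tumor would be dropped.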

Acknowledgements

This work was supported by the Yale SPORE in Skin Cancer funded by the National Cancer Institute grant number 1 P50 CA121974 (principal investigator, Ruth Halaban), the Melanoma Research Alliance (a Team award to Ruth Halaban and M.K.), The National Library of Medicine Training grant 5T15LM007056 (P.E.), Yale Comprehensive Cancer Center (M.K.), and Gilead Sciences, Inc. (M.K.).

References

  1. Kong Y (2011) Btrim: a fast, lightweight adapter and quality trimming program for next-generation sequencing technologies. Genomics 98(2):152–153
  2. Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25(14):1754–1760
  3. Li H et al (2009) The sequence alignment/map format and SAMtools. Bioinformatics 25(16):2078–2079
  4. Li J et al (2012) CONTRA: copy number analysis for targeted resequencing. Bioinformatics 28(10):1307–1313
  5. Zhang Q et al (2010) CMDS: a population-based method for identifying recurrent DNA copy number aberrations in cancer from high-resolution data. Bioinformatics 26(4):464–469
  6. Krzywinski M et al (2009) Circos: an information aesthetic for comparative genomics. Genome Res 19(9):1639–1645
  7. Thorvaldsdottir H, Robinson JT, Mesirov JP (2012) Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform 14(2):178–192
  8. Krauthammer M et al (2012) Exome sequencing identifies recurrent somatic RAC1 mutations in melanoma. Nat Genet 44(9):1006–1014

Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  1. Department of Pathology, Yale University School of Medicine, New Haven, USA
  2. W.M. Keck Foundation Biotechnology Resource Laboratory, Department of Molecular Biophysics and Biochemistry, Yale University School of Medicine, New Haven, USA
