Introduction

Cancer is a complex genetic disease, and understanding the myriad genetic factors involved in oncogenesis is an important step towards prevention and treatment. During the stepwise process of tumorigenesis, cells acquire a series of somatic mutations that lead to excessive cell growth and eventually to the development of cancer. The progression to cancer can be accelerated when the individual also carries a germline mutation in a cancer susceptibility gene (Knudson 1971). According to the Cancer Gene Census (Futreal et al. 2004), the majority of known cancer mutations are somatic mutations, but some germline polymorphisms with a connection to cancer have also been identified. Identification of these mutations and polymorphisms can lead to the discovery of the genes that control cancer development and, therefore, also serve as attractive therapeutic targets.

The importance of a targeted approach towards cancer treatment has been emphasized by a number of successful therapies brought to market in recent years. Novartis’ Gleevec is an example of a drug that resulted from the identification of a cancer-causing genetic abnormality (Druker et al. 2001). A chromosomal translocation resulting in the constitutively active protein tyrosine kinase bcr-abl was identified as the causal event in the development of chronic myelogenous leukemia (Lugo et al. 1990). A small-molecule compound was discovered through high-throughput screening as a potent inhibitor of bcr-abl and was then developed into Gleevec, a commercial therapy that inhibits bcr-abl to block tumor growth while having minimal effect upon normal cells. Several other drugs have also been developed to target specific proteins that are commonly mutated in cancers. For example, Genentech’s Herceptin is a HER2-specific antibody that is effective in treating breast cancers that overexpress the gene HER2, and AstraZeneca’s Iressa was the first of several EGFR inhibitors to treat carcinomas that have excess EGFR activity (Ciardiello et al. 2000; Vogel et al. 2002).

Rapid improvements in genomic technologies have allowed for large-scale genotyping and sequencing of cancer tissues and normal genomes as well. This influx of sequence data has revealed a vast array of genetic variations present in cancer, with a large portion of both somatic mutations and naturally occurring variations in the form of single-nucleotide substitutions. Among these single-nucleotide changes, missense mutations in which a single-nucleotide change within a gene results in an amino acid substitution in the protein product are the most investigated (Ding et al. 2008; Forbes et al. 2008; Greenman et al. 2007; Jones et al. 2008; TCGA 2008; Parsons et al. 2008; Sjoblom et al. 2006; Wood et al. 2007). The primary question facing the interpretation of this wealth of data is the delineation of functional mutations from those that are simply the result of the genetic instability inherent in cancer genomes.

The most common ways of analyzing missense mutations are focused on two distinct but related goals. In the case of recently published large-scale sequencing efforts, the analysis is gene-centric and attempts to identify highly mutated genes that are, therefore, likely to be important in the development of a specific cancer (Ding et al. 2008; Greenman et al. 2007; Jones et al. 2008; TCGA 2008; Parsons et al. 2008; Sjoblom et al. 2006; Wood et al. 2007). The premise behind this frequency-based approach is that genes that are mutated significantly more often than would be expected by chance probably function to favor tumor growth when mutated. This methodology requires a large dataset to provide sufficient statistical power and its strength lies in the identification of important genes in the condition of interest. Complementary to this approach is a mutation-centric view that removes a given mutation from the disease context in which it was observed and attempts to predict its functionality based solely on the substitution itself. These methods have the benefit of being able to potentially identify the actual causal mutation, as opposed to just the causal gene. Identification of specific functional mutations could give additional insight into the biological mechanisms of the disease.

Although the majority of large-scale sequencing efforts to date have focused on protein-coding regions, next generation sequencing technologies are beginning to allow for whole-genome sequencing of individual samples (Ley et al. 2008; Wheeler et al. 2008). This will bring in a wealth of information on mutations occurring in non-genic genomic regions, which will in turn require different analysis techniques. Single-nucleotide polymorphism (SNP) analysis has shown that alterations in non-coding sequences can have significant functional effects and contributions towards disease (Chorley et al. 2008; Srebrow and Kornblihtt 2006), and so making full use of whole-genome sequencing data will require analysis of mutations found outside of genes. The tools for predicting the functionality of a non-coding mutation are limited, but there exist a number of methods and databases that attempt to map the various non-coding functional regions to the genome through sequence analysis (Cartegni et al. 2003; Conde et al. 2006; Enright et al. 2003; Freimuth et al. 2005; Griffiths-Jones et al. 2008; Hallikas et al. 2006; Kim et al. 2008; Lambert et al. 2004; Matys et al. 2003; Palin et al. 2006; Ponomarenko et al. 2002, 2003; Riva and Kohane 2002; Sandelin et al. 2004; Tabaska and Zhang 1999; Thierry-Mieg and Thierry-Mieg 2006; Wang 2008). These tools model specific sequence elements such as transcription factor-binding sites based on the experimentally verified regions and provide the most effective means of large-scale prediction of where functional sites lie. They can also help in functional prediction by first identifying mutations that lie in functional elements and then determining which of those may perturb the functionality of the element. Most such tools give a quantitative measure of how likely a given sequence is to be a functional element of interest (e.g. a transcription factor-binding site), and so simply examining the difference in scores between a mutated sequence and its original sequence can give an idea of the functionality of the mutation.
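The score-difference idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the scoring scheme of any tool cited here: the position weight matrix `TOY_PWM`, its values, and the function names `site_score` and `score_change` are all hypothetical and chosen only to show the arithmetic.

```python
# Hypothetical log-odds position weight matrix (PWM) for a 4-bp motif.
# The values are invented for illustration and come from no real database.
TOY_PWM = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": 1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": -1.0, "T": 1.0},
]


def site_score(seq, pwm):
    """Sum the per-position weights for a sequence as long as the motif."""
    if len(seq) != len(pwm):
        raise ValueError("sequence length must equal motif length")
    return sum(column[base] for column, base in zip(pwm, seq.upper()))


def score_change(ref_seq, offset, alt_base, pwm):
    """Return score(mutated) - score(reference) for one substitution."""
    mutated = ref_seq[:offset] + alt_base + ref_seq[offset + 1:]
    return site_score(mutated, pwm) - site_score(ref_seq, pwm)


# A G>A change at the third position of the toy consensus site "ACGT".
delta = score_change("ACGT", 2, "A", TOY_PWM)
print(delta)
```

A strongly negative difference would suggest that the substitution weakens the predicted element, and a positive one that it strengthens or creates it. Any threshold for calling a change functional would have to be calibrated against the specific tool being used.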

Distinct from the study of somatic mutations in cancer is the investigation of naturally occurring human germline mutations that could contribute to the risk of cancer and other genetic diseases. A large number of SNPs have been identified, but functional information is still sparse (Sherry et al. 2001). Large-scale systematic genotyping projects (Frazer et al. 2007) have employed high-throughput genotyping technologies that enable investigations into associations between variation and disease risk. Genome-wide association studies (GWAS) have discovered SNPs that contribute to the risk of cancer development, but many of the identified risk alleles require additional analysis to validate and understand (Amos et al. 2008; Broderick et al. 2007; Easton et al. 2007; Eeles et al. 2008; Gold et al. 2008; Gudmundsson et al. 2007; Hunter et al. 2007; Kingsmore et al. 2008; Tenesa et al. 2008; Thomas et al. 2008; Tomlinson et al. 2008; Zanke et al. 2007). SNPs are distributed throughout the human genome in both coding and non-coding regions, but many methods used for analyzing mutations are equally valuable for evaluating the potential functionality of SNPs, which could be a valuable step towards the interpretation and utilization of GWAS data.

This review focuses on the analysis of single-nucleotide substitutions in the context of cancer, with a particular spotlight on recent large-scale cancer genome sequencing projects. We examine the methods by which cancer sequencing efforts can leverage their data to identify disease-driving genes and provide an overview of amino acid-change-based bioinformatics analysis methods, many of which are also applicable to non-cancer inherited diseases (Karchin 2009; Mooney 2005; Ng and Henikoff 2006; Steward et al. 2003). We also review the current knowledge of functional mutations that act in ways other than alteration of protein sequence, such as mutations that alter gene expression or splicing.

Mutation frequency-based analysis

Several large-scale cancer genome exon re-sequencing projects have recently been published by four publicly funded consortiums, including groups at Johns Hopkins University (JHU) (Jones et al. 2008; Parsons et al. 2008; Sjoblom et al. 2006; Wood et al. 2007), the Sanger Institute (Greenman et al. 2007), the Cancer Genome Atlas (TCGA 2008), and the Tumor Sequencing Project (TSP) (Ding et al. 2008) (Table 1). The JHU group focused on sequencing nearly the complete human transcriptome with a limited number of samples (11–24 samples in each cancer type), whereas the other groups sequenced a smaller number of candidate genes and, therefore, could afford to cover a larger number of samples in a single cancer type (as high as 188 lung adenocarcinoma samples in the case of the TSP study). Although the first strategy allows for the detection of novel cancer genes due to the larger search space, the latter strategy enables a broader survey of possible mutations in genes that are already known to be involved in cancer.

Table 1 List of large-scale cancer re-sequencing projects

It has been estimated that most of the observed cancer mutations are functionally neutral and are, thus, often referred to as passenger mutations, while a smaller set of driver mutations will actually confer growth advantages to tumors (Greenman et al. 2007; Sjoblom et al. 2006). Driver mutations increase the fitness of cells that they reside in and are assumed to be under positive selection during the multistage neoplastic progression. This selection should result in driver mutations occurring more frequently in tumor samples; hence, the most common approach for identifying driver mutations is based on calculating mutation frequencies with the assumption that a higher prevalence implies functionality. This method of frequency-based analysis typically requires an estimation of the non-synonymous background mutation rate (nsBMR) followed by calculation of the statistical likelihood of observing a certain number of mutations based on the nsBMR. For example, a gene harboring a significantly greater number of mutations than expected by chance would be considered a driver, since it is likely that mutations in that gene are selected for during oncogenesis. In the following sections, we will discuss how mutation data were analyzed in the recently published cancer genome studies with regard to the above-referenced steps.

It is crucial to determine a valid nsBMR, which, if underestimated, would overstate the significance of the observed mutations. The nsBMR can be estimated empirically from presumed passenger mutations, as shown in the studies conducted by Jones et al. (2008) and Parsons et al. (2008). These studies estimated the nsBMR from the set of genes remaining after the most highly mutated previously known driver genes were removed from the dataset. More commonly, the nsBMR is indirectly estimated as the product of the mutation rate of synonymous mutations and the expected ratio of the number of passenger non-synonymous mutations to the number of synonymous mutations (NS/S). With rare exceptions, a synonymous mutation is not likely to change the function of the protein and is, therefore, usually considered functionally neutral and not subject to selective pressure (Greenman et al. 2006). The passenger NS/S ratio is obtained by dividing the total number of possible non-synonymous changes by the total number of possible synonymous changes within the sequenced nucleotides (Ding et al. 2008; TCGA 2008; Wood et al. 2007). This ratio, ranging from 2 to 3, may result in an overestimation of the nsBMR because some of the possible non-synonymous mutations may be detrimental to the growth of the tumors and are, thus, under negative selection. A different approach is to use the observed NS/S in human population SNPs, approximately 1 (Jones et al. 2008; Parsons et al. 2008; Wood et al. 2007), which may result in an underestimation due to a greater selective pressure on germline mutations. Because these approaches delineate, respectively, an upper bound and a lower bound for the NS/S ratio, the average of the two values is used in the Jones et al. (2008) and Parsons et al. (2008) studies.
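The indirect estimate described above can be made concrete with a short Python sketch. It counts every possible single-nucleotide change in a coding sequence as synonymous or non-synonymous using the standard genetic code, then multiplies a synonymous rate by an NS/S ratio. Several choices here are assumptions of the sketch rather than details taken from the cited studies: every possible change is weighted equally (no context-specific rates), stop-gain changes are counted as non-synonymous, and the function names are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code with codons ordered TTT, TTC, TTA, TTG, TCT, ...
# ("*" marks a stop codon).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_possible_changes(coding_seq):
    """Count possible (non-synonymous, synonymous) single-base changes.

    Assumption: stop-gain and stop-loss changes count as non-synonymous,
    and a trailing partial codon is ignored.
    """
    seq = coding_seq.upper()
    nonsyn = syn = 0
    for start in range(0, len(seq) - len(seq) % 3, 3):
        codon = seq[start:start + 3]
        for pos in range(3):
            for alt in BASES:
                if alt == codon[pos]:
                    continue
                mutant = codon[:pos] + alt + codon[pos + 1:]
                if CODON_TABLE[mutant] == CODON_TABLE[codon]:
                    syn += 1
                else:
                    nonsyn += 1
    return nonsyn, syn


def midpoint_ratio(upper, lower):
    """Average an upper and a lower bound on the passenger NS/S ratio."""
    return (upper + lower) / 2.0


def estimate_nsbmr(synonymous_rate, ns_to_s_ratio):
    """nsBMR as the product of the synonymous rate and the NS/S ratio."""
    return synonymous_rate * ns_to_s_ratio


nonsyn, syn = count_possible_changes("ATGCTG")  # Met-Leu toy sequence
print(nonsyn, syn)
print(estimate_nsbmr(0.5, midpoint_ratio(3.0, 1.0)))
```

The synonymous rate passed to `estimate_nsbmr` can be in any unit (for example, mutations per megabase); the result is in the same unit.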

Estimation of a background mutation rate can be significantly affected by mutation rate heterogeneity across different DNA contexts. For example, CpG dinucleotides have a much higher mutation rate (up to 6.44-fold higher than the overall mutation frequency in one colorectal dataset (Sjoblom et al. 2006)) compared with other DNA contexts. Owing to this context-dependence, it can be beneficial to partition mutations into multiple types to account for such variations (Ding et al. 2008; Greenman et al. 2007; Jones et al. 2008; TCGA 2008; Parsons et al. 2008; Sjoblom et al. 2006; Stephens et al. 2005; Wood et al. 2007). The relative mutation rates of the different DNA contexts are usually measured directly using mutations that are presumably non-functional, such as synonymous mutations or mutations observed in the least frequently mutated genes (TCGA 2008). DNA context groups can be defined either based on prior knowledge, such as the high mutation rate at CpG dinucleotides, or using data-driven methods, which may better capture the heterogeneity of the mutation rates across different nucleotide contexts (Ding et al. 2008; Jones et al. 2008; TCGA 2008; Parsons et al. 2008; Sjoblom et al. 2006; Wood et al. 2007). In a recent study of lung carcinoma, Ding et al. (2008) partitioned all of the observed mutations into 192 categories with consideration of all 12 possible mutation changes within 16 possible flanking dinucleotides (5′ and 3′). Observed mutation rates for each category were calculated and low frequency categories were then collapsed if they did not show statistically distinct mutation rates (P < 0.05, Fisher exact test). This process resulted in 18 distinct categories. Recently, the JHU group and the TCGA group each published a glioblastoma study in which they reported two quite different background mutation rates: 0.38–1.02 (estimated lower and upper bounds) and 3.70 ± 0.57, respectively (TCGA 2008; Parsons et al. 2008). This discrepancy is likely due to the heterogeneity between the two sample populations and gene sets, as evidenced by the difference in the observed synonymous mutation rate (0.37 and 1.29). Further study might suggest that the most effective method is to use gene-specific nsBMRs to reduce the disparities between separate studies, but a large amount of data is necessary to use this method effectively.
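The count of 192 context categories follows from 12 possible base changes multiplied by 16 possible pairs of flanking bases. The Python sketch below enumerates them and assigns an observed mutation to its category. The tuple layout and function names are illustrative assumptions, and the subsequent step of collapsing categories with Fisher's exact test is deliberately omitted.

```python
from itertools import product

BASES = "ACGT"


def mutation_categories():
    """Enumerate (5' base, reference, alternate, 3' base) categories."""
    categories = []
    for five_prime, ref, three_prime in product(BASES, repeat=3):
        for alt in BASES:
            if alt != ref:
                categories.append((five_prime, ref, alt, three_prime))
    return categories


def classify(trinucleotide, alt):
    """Map a reference trinucleotide and the mutant middle base to a category."""
    five_prime, ref, three_prime = trinucleotide.upper()
    if alt == ref:
        raise ValueError("alternate base must differ from the reference")
    return (five_prime, ref, alt, three_prime)


categories = mutation_categories()
print(len(categories))        # 12 changes x 16 flanking pairs
print(classify("ACG", "T"))   # a C>T change in a CpG context
```

Tallying observed mutations per category and dividing by the number of sequenced sites in that context would give the per-category rates that feed a context-aware background model.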

Several studies identify novel putative cancer genes using statistical methods, some of which have stirred controversies (Forrest and Cavet 2007; Getz et al. 2007; Rubin and Green 2007). Most of these studies applied the one-tailed binomial test to identify significantly mutated genes, followed by a false discovery rate procedure to control for multiple testing (Benjamini and Hochberg 1995). As mentioned above, a single fixed background mutation rate was used in the simplest version, while a more complex approach took into account the DNA context of each mutation to adjust for the heterogeneity of mutation rates under different DNA contexts. As shown in TCGA's analysis, the context-specific method is more sensitive when a smaller set of samples is analyzed (TCGA 2008). It is also worth noting that a custom method was used by the JHU group to incorporate their unique two-stage experimental design (discovery and validation screens) into the analysis (Jones et al. 2008; Parsons et al. 2008; Wood et al. 2007). Greenman et al. (2007) used a different strategy of directly modeling the NS/S ratio based on the rationale that if non-synonymous mutations yield amino acid changes with a selective advantage, a higher ratio of NS/S may be observed. Therefore, the significance of the results can be measured as a function of the degree of deviation from the expected 2:1 ratio, which had been used in several early studies (Bardelli et al. 2003; Samuels et al. 2004; Wang et al. 2004). With this model, the selective pressure can be estimated for various gene sets via maximum likelihood by considering the deviation from the expected ratio of non-synonymous to synonymous mutations. The selection pressure can be calculated on different levels, from a single gene to the whole mutation dataset, from which candidate driver genes can be predicted and the number of driver mutations can be estimated.
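The simplest version of this procedure, a one-tailed binomial test per gene with a single fixed background rate followed by Benjamini-Hochberg adjustment, can be sketched as follows. This is an illustration under stated assumptions, not the pipeline of any cited study: each gene is summarized by a hypothetical pair (observed mutations, number of sequenced bases summed over samples), the background rate is per base, and the gene names and numbers are invented.

```python
import math


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1, summed in log space."""
    if k <= 0:
        return 1.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k, n + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1)
                    - math.lgamma(n - i + 1) + i * log_p + (n - i) * log_q)
        term = math.exp(log_term)
        total += term
        # Past the mode the terms only shrink, so stop once they are negligible.
        if i > n * p and term <= total * 1e-16:
            break
    return min(total, 1.0)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def significant_genes(gene_counts, background_rate, alpha=0.05):
    """gene_counts maps gene -> (observed mutations, bases x samples)."""
    names = list(gene_counts)
    pvalues = [binomial_upper_tail(k, n, background_rate)
               for k, n in (gene_counts[g] for g in names)]
    adjusted = benjamini_hochberg(pvalues)
    return [g for g, q in zip(names, adjusted) if q <= alpha]


# Hypothetical toy data: GENE_A is heavily mutated, GENE_B not at all.
toy = {"GENE_A": (10, 100_000), "GENE_B": (0, 100_000)}
print(significant_genes(toy, background_rate=1e-6))
```

A context-aware variant would replace the single `background_rate` with per-category rates, so the exact null distribution would no longer be a simple binomial.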

Another factor that could greatly affect the result of a frequency-based study is the sample size present in the study. A wide range of sample sizes has been processed in the published large-scale sequencing efforts, ranging from 11 in breast and colon cancers to about 200 in lung cancer. The authors of the TCGA study have examined the effects of sample size by randomly selecting subsets of the original 72 samples (TCGA 2008). They found that with as few as 48 samples, all eight of the cancer genes that were identified in their complete set of 72 samples could still be discovered. When the sample size was further reduced, only a fraction of the eight genes could still be identified as significant.

Somatic point mutations may account for only a fraction of the genetic alterations required for tumorigenesis. Integration with other genomic data would greatly enhance the possibility of identifying genes and biological pathways involved in tumor development. In the glioblastoma study, Parsons et al. (2008) integrated mutation analysis with genomic copy number analysis and identified three major signaling pathways with critical genes mutated in a majority of the studied tumors. They also found a mutually exclusive pattern for the alterations within each pathway. The same pattern was also reported in TCGA's glioblastoma paper (TCGA 2008). Ding et al. (2008) found that mutations in known tumor suppressor genes such as PTEN, APC and TP53 were correlated with copy number loss and mutations in proto-oncogenes, such as EGFR, HCK, KRAS and EPHB1. Therefore, an integrative approach to analyze all types of genetic alterations in a pathway context could provide greater insight into the genetic mechanisms of cancer development.

Many of the somatic mutations identified in the recent cancer exon re-sequencing studies are novel and rare mutations, often observed in only a single sample. This implies that a large number of samples are required to establish the statistical significance of potential cancer driver genes. The rapid development of sequencing technology will eventually allow us to expand beyond the current focus on coding regions to the whole human genome and therefore make it possible to identify all of the genetic alterations underlying individual cancers. At the same time, it presents an even bigger statistical challenge, since many more mutations need to be analyzed. Despite these challenges, the current large-scale cancer studies have successfully identified many novel cancer genes and provided more insight into the complex genetic basis of cancer (Table 1).

Bioinformatics analysis of amino acid substitutions

The frequency-based approaches reviewed above are contingent upon either an assumption of a background mutation rate or the availability of a large number of mutations in the dataset to calculate an empirical background mutation rate for the sample of interest. Furthermore, these methods cannot be used for independently evaluating the potential function of an individual mutation. Because many genes are infrequently mutated and large disease-specific datasets are often not available, other approaches may be more suitable for identification of functional mutations on an individual gene basis. A set of methods for predicting the functional effects of specific amino acid substitutions can fill this niche (Table 2). These methods look at the actual amino acid change occurring in missense mutations and can be used for the analysis of mutations on a case-by-case basis. Such methods have also been effective in the study of natural human genetic variation. Bioinformatics methods can help prioritize which of the more than 60,000 estimated non-synonymous SNPs (Livingston et al. 2004) in the human population are likely to have a functional impact and warrant additional investigation. Furthermore, the ability of such methods to evaluate the functional impact of individual changes makes them useful for directing mutagenesis efforts, so that mutations that are most likely to produce a phenotype can be examined first (Henikoff and Comai 2003). We will review general features of substitution-based methods and focus specifically on their application towards cancer mutation research.

Table 2 List of selected amino acid substitution prediction tools

Amino acid-change-based prediction methods are primarily based on an observation that functional mutations appear to be distributed in a non-random manner across protein sequences and structures (Miller and Kumar 2001; Sunyaev et al. 2000; Wang and Moult 2001). Based only on sequence analysis, Miller and Kumar (2001) observed that disease-associated mutations in seven genes were particularly concentrated in conserved amino acid positions. This observation is consistent with the notion that conserved residues are more likely to be functional, since the conservation at that position is likely due to purifying selection throughout evolution. By adding structural data to their analysis, Sunyaev et al. (2000) found that ~70% of disease-related mutations they studied were located in structural sites more likely to be functionally important, such as active sites, interaction sites, or positions buried within the protein and inaccessible to solvent. In a similar manner, Wang and Moult (2001) modeled the effects of disease-associated SNPs on protein stability and found that 83% of such substitutions affected protein stability.

Given the observations concerning disease-related mutations, algorithms that predict the functionality of amino acid substitutions do so based on sequence information, structure information, or a combination of the two. In sequence-based methods, a substitution will be evaluated based on its sequence context. The widely used SIFT algorithm (Sorting Intolerant From Tolerant) employs a multiple sequence alignment of homologous proteins to identify conserved regions in the protein of interest; each possible substitution can then be scored according to the conservation observed at each position (Ng and Henikoff 2001). Similarly, Clifford et al. describe a tool that takes advantage of known Pfam protein motifs to identify conserved regions in protein domains (Clifford et al. 2004; Finn et al. 2006). Jiang et al. (2007) developed a method consisting of 20 modules, each of which was optimized using a subset of sequence features specific to a particular starting residue. This method was shown to outperform other general methods such as SIFT. More recently, Hon et al. (2009) examined mutations within signal peptide regions and used outputs from the SignalP program to identify mutations that could affect signal peptide function. The authors found that combining SIFT with specific signal peptide information could accurately identify functional mutations within signal peptides. This context-specific approach was also adopted by Radivojac et al., who developed a model using the output of the phosphorylation-site predictor DisPhos to assess the probability of losing or gaining phosphorylation sites due to mutation. The application of this model to a cancer genome dataset (Greenman et al. 2007) revealed that cancer somatic mutations are enriched for mutations that affect phosphorylation sites (Radivojac et al. 2008).
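The general idea of scoring a substitution by the conservation seen in an alignment column can be illustrated with a deliberately simplified Python sketch. This is not SIFT's actual scoring scheme; it only estimates how often the mutant residue appears at the aligned position, using a flat pseudocount. The toy alignment, the pseudocount value, and the function name are all assumptions made for illustration.

```python
from collections import Counter

AMINO_ACID_COUNT = 20


def substitution_score(alignment, position, mutant, pseudocount=1.0):
    """Smoothed frequency of `mutant` in one alignment column.

    Low values mean the residue is rarely or never seen at a position
    among homologs, hinting that the substitution may not be tolerated.
    Gap characters ("-") are ignored.
    """
    column = [seq[position] for seq in alignment if seq[position] != "-"]
    counts = Counter(column)
    total = len(column) + pseudocount * AMINO_ACID_COUNT
    return (counts.get(mutant, 0) + pseudocount) / total


# Hypothetical alignment of four homologous fragments.
toy_alignment = ["MKV", "MKI", "MRV", "MKV"]

conserved = substitution_score(toy_alignment, 0, "M")  # residue always seen
novel = substitution_score(toy_alignment, 0, "L")      # residue never seen
print(conserved, novel)
```

With few homologous sequences, the pseudocount dominates and all residues score similarly, which mirrors the limitation of sequence-based methods when alignments are shallow.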

Structure-based amino acid substitution prediction methods rely on an ability to map mutations of interest to a structure. In these methods, the first step is to find a suitable structure for the protein of interest and then identify the possible structural effects that the given amino acid substitution may have. For example, changes that affect solvent accessibility or occur at sites of protein–protein interactions are more likely to be functional. PolyPhen is a rule-based system that uses structural information and annotation data to identify functionally important sites and predict the potential functional effect of a substitution (Sunyaev et al. 2001).

Just as frequency-based methods are heavily dependent upon available mutation information, substitution-based algorithms are limited by available sequence and structure information. For instance, three-dimensional structures are only available for a small fraction of all proteins, and not in all functional conformations. In this case, applying a structure-based algorithm to a protein with only limited structural information would not produce accurate results (Chasman and Adams 2001; Yue et al. 2005; Yue and Moult 2006). In an analogous manner, sequence-based methods can be limited by the number of available sequences homologous to the protein of interest. Predictions will be less accurate in cases where an inadequate number of sequences are used to identify conserved residues. Even with a large set of homologous sequences, however, it has been shown that many disease-causing mutations are in positions that are not highly conserved across species and could therefore be subject to less accurate analysis by sequence-based methods (Torkamani and Schork 2007b).

The previously mentioned methods can be generally applied towards analysis of amino acid substitutions. However, cancer driver mutations have particular characteristics that Kaminker et al. (2007a) exploited with the CanPredict algorithm to distinguish cancer mutations from others based on annotation and sequence features. CanPredict applied specific knowledge of cancer mutations to distinguish them from other disease mutations in a manner similar to how disease mutations can be distinguished from non-functional mutations. In particular, CanPredict incorporates sequence-based predictions from two previously mentioned methods (SIFT and Pfam-based LogR.E-value) (Clifford et al. 2004; Ng and Henikoff 2001) as well as annotation information from Gene Ontology (Ashburner et al. 2000) into a random forest classifier (Breiman 2001). This classifier quantifies the differences in these features between cancer-related mutations and others and is then able to provide a call for whether or not a given mutation is likely to be a causal mutation in cancer.

Subsequently, Torkamani and Schork (2008) reported an SVM-based classifier to distinguish cancer driver mutations. A collection of sequence and structure features was used in their model, including sequence conservation measured with the SubPSEC score (Thomas et al. 2003; Thomas and Kejariwal 2004), the wild-type and mutant amino acid identity (Torkamani and Schork 2007b), changes in five amino acid metrics (Atchley et al. 2005), changes in hydropathy, water/octanol partition energy, hydrophobicity, polarity, charge and volume, protein domain information, protein secondary structure, amino acid solvent accessibility and structure flexibility predicted by Wiggle (Gu et al. 2006). Only mutations within protein kinase families were analyzed in this study, which allowed the incorporation of two protein kinase-specific features into the model. The subgroup annotation of the specific protein kinase was used as the first feature, since the distributions of disease and non-disease mutations within different protein kinase groups are significantly different. The second feature is the subdomain predictor of whether a given mutation falls within the N-terminal or the C-terminal lobe, since disease mutations have a tendency to cluster within the C-terminal lobe rather than the N-terminal lobe. The authors showed that this context-specific method outperforms CanPredict on the kinase mutations. In a different study, Torkamani and Schork reported that their method is also superior to other popular methods (SIFT, PolyPhen, PMut, and SNPs3D) applied to germline variants within protein kinases (Torkamani and Schork 2007a). They attributed much of their success to the context-specific training data, where specific protein kinase features such as the group membership can be used.

Protein sequence- and structure-based analyses have also been applied in the recent large-scale cancer genome projects in order to help prioritize genes and somatic mutations for further validation (Ding et al. 2008; Jones et al. 2008; TCGA 2008; Parsons et al. 2008; Wood et al. 2007). Ding et al. (2008) used SIFT and PolyPhen to evaluate the potential impact on protein function for 811 missense mutations. SIFT predicted 430 missense mutations as deleterious while PolyPhen predicted 438 mutations as probably/possibly damaging. Taken together, 579 mutations were identified as likely to affect protein function. Wood et al. (2007) used two sequence analysis tools, SIFT and logR.E, to prioritize mutations for further analysis. After projecting mutations onto protein structures, they observed that some somatic mutations clustered around active sites of proteins or near interface residues. In the glioblastoma and pancreatic cancer studies by the same research group, a machine learning classifier using a random forest algorithm, LSMUT, was developed to predict the functional impact of the non-synonymous mutations (Jones et al. 2008; Parsons et al. 2008). Fifty-eight features based on the sequence and structural information of amino acids involved in the alterations were used as the predictive features for the classifier. The classifier was trained on common SNPs as the negative dataset (common SNPs are assumed to be tolerated and therefore not disease-causing) and cancer mutations in the COSMIC database as the positive dataset. The distribution of LSMUT scores of the missense mutations in the top-ranked CAN genes is significantly different from the scores in a set of randomly generated mutations. Approximately 15% and 17.3% of the missense mutations that can be predicted by the classifier were predicted to affect protein function in the glioblastoma and pancreatic cancer datasets, respectively. Furthermore, using protein structure information, they discovered that over 10% of the mutations (35 in glioblastoma and 55 in pancreatic cancer) are located close to a domain interface or substrate-binding site and thus are likely to affect protein function.

The frequency-based analysis approaches and amino acid substitution prediction methods have been compared in a few papers. The two methods were found to produce results that correlate well with each other. Indeed, mutations in genes that have a high CaMP score also tend to be classified as cancer-associated using CanPredict (Hon et al. 2008). Similarly, functional scores computed by combining the predictions of PolyPhen, PMut, SIFT and SNPs3D correlated with the odds ratios identified in association studies (Zhu et al. 2008). More importantly, these two approaches may work together in a complementary manner. For example, CanPredict can capture the functional effect of the known driver mutation BRAF V600E (Kaminker et al. 2007a), which was missed by frequency analysis due to an insufficient number of samples (Sjoblom et al. 2006). Many prediction tools have been developed in recent years to identify the potential functionality of amino acid substitutions (Table 2). However, these tools often produce inconsistent results, making interpretation of the individual results more difficult. Chan et al. (2007) show that when the predictions of four different methods are in agreement, the prediction accuracy is significantly improved. Others have proposed a metaserver to enable end users to more easily access consensus prediction results from different prediction servers (Karchin 2009; Ng and Henikoff 2006). With high-throughput, next generation sequencing technology becoming increasingly reliable and affordable, future cancer genome studies are likely to sequence the entire genome in large collections of cancer samples. Therefore, the power of frequency-based analysis will be increased accordingly. At the same time, amino acid-based bioinformatics analysis will become even more critical as more rare mutations are identified.
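A consensus rule of the kind discussed above, in which a call is reported only when every method agrees, can be sketched in Python as follows. The dictionary layout, the boolean encoding of each tool's output, and the example values are hypothetical; real tools report scores or categorical labels that would first have to be mapped to a binary call.

```python
def consensus_call(predictions):
    """Return the shared call if all tools agree, otherwise None.

    `predictions` maps a tool name to True (predicted functional)
    or False (predicted neutral).
    """
    calls = set(predictions.values())
    if len(calls) == 1:
        return calls.pop()
    return None


# Hypothetical binary calls for two mutations from four predictors.
agreeing = {"SIFT": True, "PolyPhen": True, "PMut": True, "SNPs3D": True}
conflicting = {"SIFT": True, "PolyPhen": False, "PMut": True, "SNPs3D": True}

print(consensus_call(agreeing))     # unanimous call
print(consensus_call(conflicting))  # no consensus
```

Requiring unanimity trades coverage for confidence: mutations on which the tools disagree receive no call and would need another form of evidence.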

It is well known that some recurrent cancer “hotspot” mutations can be observed in many samples; for example, BRAF V600E is reported in over 4,000 samples according to the COSMIC database. Moreover, cancer mutations are also found at locations that are analogous to those of other known mutations. For example, the T790M mutation in EGFR occurs at the same residue in the kinase domain as other known mutations in BCR-ABL, PDGFRA and KIT (Kobayashi et al. 2005). Marks et al. (2007) reported a novel mutation in the kinase domain of FGFR4 that is located at a position analogous to a known cancer mutation in ERBB2, which led to the development of the “Mutagrator” web site (http://cbio.mskcc.org/mutagrator/) to capture a few other analogous mutation clusters in the protein kinase domain. Wood et al. also discovered a number of cancer mutations that occurred at locations identical to those of mutations in genes involved in human germline diseases. Based on this concept, one may develop a mutation cluster analysis tool to identify analogous mutation clusters between cancer and germline disease mutations. Such a tool would be a valuable addition to the existing bioinformatics tools for identifying functional mutations.
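One way such a mutation cluster analysis could work is to project mutated residues from different genes onto the columns of a shared domain alignment and report columns hit in more than one gene. The sketch below assumes a pre-computed gapped alignment; the gene names, sequences and positions are invented for illustration and do not correspond to real kinases.

```python
def residue_to_column(aligned_seq):
    """Map 1-based residue numbers to 0-based alignment columns."""
    mapping = {}
    residue = 0
    for column, char in enumerate(aligned_seq):
        if char != "-":  # gaps do not consume a residue number
            residue += 1
            mapping[residue] = column
    return mapping


def analogous_clusters(alignment, mutations):
    """Group mutations from different genes by alignment column.

    alignment: gene -> gapped aligned sequence (all of equal length)
    mutations: iterable of (gene, 1-based residue number)
    Returns {column: sorted [(gene, residue), ...]} for columns that are
    mutated in at least two distinct genes.
    """
    column_maps = {gene: residue_to_column(seq) for gene, seq in alignment.items()}
    by_column = {}
    for gene, residue in mutations:
        column = column_maps[gene].get(residue)
        if column is not None:
            by_column.setdefault(column, set()).add((gene, residue))
    return {
        column: sorted(hits)
        for column, hits in by_column.items()
        if len({gene for gene, _ in hits}) >= 2
    }


# Toy alignment and mutation list (hypothetical data).
alignment = {"GENE_A": "MK-TLV", "GENE_B": "MKQT-V"}
mutations = [("GENE_A", 3), ("GENE_B", 4), ("GENE_A", 5), ("GENE_B", 1)]
print(analogous_clusters(alignment, mutations))
```

Working in alignment columns rather than raw residue numbers is what allows residue 3 of one protein and residue 4 of another to be recognized as the same structural position.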

Analysis and significance of non-coding functional variants

We have thus far focused exclusively on reviewing the analysis of non-synonymous coding mutations, but there are several ways in which single-nucleotide mutations may give rise to abnormal function and therefore potentially lead to a disease phenotype. In addition to amino acid substitutions, single-nucleotide changes may also result in irregular gene expression through modification of regions of the genome important for regulation of transcription, such as transcription factor-binding sites (Knight 2005; Pastinen and Hudson 2004). In addition, most human genes require post-transcriptional processing before resulting in a mature mRNA, so modifications in splice sites or polyadenylation may also lead to altered protein function. Finally, mutations in any of the many regulatory RNAs that the genome encodes could result in an undesired phenotype through abnormal regulation of gene expression. Here, we will review experimental evidence demonstrating that mutations in these non-translated regions can have an effect on human health and give an overview of methods and resources that can be applied towards large-scale functional characterization of these non-coding features.

Regulatory SNPs

Gene expression is a tightly regulated cellular process, so mutations that affect gene expression can have a profound phenotypic effect or dramatically increase disease risk. Depending on the gene in question, variations that either increase or decrease its expression can have deleterious effects. One example of overexpression increasing disease susceptibility is in the MDM2 gene, where a single nucleotide change known as SNP309 alters transcription factor binding (Bond et al. 2004). Individuals with the T-to-G change show a substantially increased risk for developing colorectal cancer. MDM2 acts as an inhibitor of the p53 tumor suppressor pathway, and the evidence suggests that a guanine at SNP309 results in overexpression of MDM2, which in turn leads to increased suppression of the p53 pathway and increased risk of cancer development (Bond et al. 2005; Bond and Levine 2007). Another regulatory SNP of interest to cancer researchers is the 938C/A polymorphism present in the promoter region of the BCL-2 anti-apoptosis gene. The adenine (A) variant of this polymorphism is thought to reduce the expression of BCL-2 relative to wild-type expression, which would in turn lower the risk of cancer development. In fact, studies have shown that BCL-2 938A is associated with decreased risk in prostate cancer and squamous cell carcinoma (Chen et al. 2007; Kidd et al. 2006). However, an additional study with a small patient set could not find an association between protein levels of BCL-2 and any laboratory or clinical features of chronic lymphocytic leukemia (Majid et al. 2008). In one further example that is distinct from the MDM2 and BCL-2 SNPs above, rs6983267 is an SNP in an intergenic region of 8q24 that is hundreds of kilobases away from the nearest functional gene. Multiple independent studies have shown that the guanine allele at this position is associated with several cancers, most prominently colorectal cancer (Haiman et al. 2007; Tomlinson et al. 2007; Tuupanen et al. 2008; Zanke et al. 2007). In this case, it is unclear whether rs6983267 is functional by itself through disruption of a long-range enhancer element or if it is tightly linked to another functional variant.

In a manner analogous to the use of GWAS to discover SNPs correlated with complex traits, several studies have attempted large-scale efforts to associate SNPs with gene expression phenotypes (Cheung et al. 2003; Cheung and Spielman 2002; Cheung et al. 2005; Morley et al. 2004; Spielman et al. 2007). Rather than treating a disease condition as the phenotype, these studies use specific gene expression levels as the phenotype of interest, and they have discovered a large number of genetic markers that are tightly associated with expression phenotypes. The results of these studies effectively comprise a list of candidate regulatory SNPs. As with more traditional GWAS, it is not apparent without additional experimentation which of the associated SNPs are actually causal, as opposed to simply correlated. Even with the identification of many associated SNPs, experiments are likely to be needed to filter for the genotypes that directly contribute to the gene expression phenotype. For example, although an association of rs6983267 with cancer has been found, detailed experimentation would be required to resolve whether or not the SNP is causal, and if so whether it is tissue-specific. Such experiments are often performed with reporter assays in which promoters containing the candidate alleles are used to drive expression of a reporter gene such as luciferase (Cheung et al. 2005; Ogasawara et al. 2008), so that allele-specific expression can be quantified. Even so, identifying mutations that alter expression of key genes may still not reveal the causal factor in disease, since it would have to be shown that the modified expression results in the observed phenotype. In cases such as rs6983267, the causal factor is extremely difficult to detect: the SNP is not near any gene, so even if it is affecting expression, identifying the genes that it directly affects can be a difficult problem.
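Treating expression as the phenotype reduces, in its simplest form, to testing whether expression level varies with allele dosage. The following sketch fits an ordinary least-squares line of expression on genotype dosage (0, 1 or 2 copies of the minor allele) and reports the slope and Pearson correlation. The data are invented, and published studies additionally handle covariates, population structure and multiple testing, none of which is modeled here.

```python
from math import sqrt


def dosage_expression_association(dosages, expression):
    """Least-squares slope and Pearson r of expression on allele dosage.

    dosages: minor-allele counts per individual (0, 1 or 2)
    expression: expression level of one gene in the same individuals
    """
    n = len(dosages)
    if n != len(expression) or n < 3:
        raise ValueError("need matched vectors with at least 3 individuals")
    mean_x = sum(dosages) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    syy = sum((y - mean_y) ** 2 for y in expression)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, expression))
    if sxx == 0 or syy == 0:
        raise ValueError("no variation in genotype or expression")
    slope = sxy / sxx          # expression change per extra minor allele
    r = sxy / sqrt(sxx * syy)  # strength of the linear association
    return slope, r


# Hypothetical genotypes and expression values for six individuals.
dosages = [0, 0, 1, 1, 2, 2]
expression = [1.0, 1.2, 1.9, 2.1, 3.0, 3.1]
slope, r = dosage_expression_association(dosages, expression)
print(f"slope={slope:.2f}, r={r:.2f}")
```

A strong association found this way identifies a candidate regulatory SNP only; it does not distinguish a causal variant from one in linkage with it.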

In one case where significant experimental evidence has provided a strong theory for a regulatory SNP being responsible for driving disease, De Gobbi et al. (2006) characterized an SNP associated with the blood disorder α-thalassemia. This SNP does not alter protein function directly, but instead it lies in an intergenic region within the α-globin gene cluster, which has been associated with α-thalassemia onset. The disease-associated variant is in fact a gain-of-function mutation that results in a new transcriptional promoter for the GATA-1 transcription factor being created in the midst of the gene cluster. Activation of transcription at this new promoter appears to result in suppressed expression of downstream α-globin genes, which leads to α-thalassemia (De Gobbi et al. 2006).

A number of computational tools have been developed to help with the analysis of SNPs (Mooney 2005), and many of them have specific features for the identification of regulatory SNPs. A recent review on methods for annotating SNPs provides a detailed description of many of these resources (Karchin 2009). With regard to functional analysis of regulatory SNPs, the general paradigm is to use existing databases of transcription factor-binding sites, such as TRANSFAC and JASPAR, to identify SNPs that may map to important regulatory regions (Matys et al. 2003; Sandelin et al. 2004). Transcription factor-binding site mapping is primarily accomplished through identification of genomic elements that match known sequence motifs or position weight matrices, so there can be a quantitative measure of how close a given sequence is to the canonical binding site. One way to predict whether or not a nucleotide change will have an effect on gene expression is to score both the wild-type and the variant sequences for transcription factor binding and look for differences (GuhaThakurta et al. 2006). Any perturbation of a binding site could result in a functional effect, since reducing transcription factor binding will cause deregulation of the target gene, whereas introducing a new site could cause undesired transcription (De Gobbi et al. 2006). For example, JASPAR sequence analysis of two breast-cancer susceptibility SNPs (rs7895676 and rs2981578) in the FGFR2 locus shows that they are likely to affect transcription factor binding. In each case, JASPAR scores one allele with a high similarity to the known transcription factor-binding site, whereas the other receives a score below the default 0.80 relative profile score threshold. Detailed experimental evidence confirms that the minor allele C in rs7895676 disrupts binding of C/EBPβ while the minor allele G in rs2981578 increases binding affinity of Runx2 (Meyer et al. 2008).
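The wild-type versus variant comparison can be sketched with a small position weight matrix. The count matrix below is a made-up four-base motif, not a JASPAR or TRANSFAC profile, and the pseudocount, uniform background and min-max definition of the relative score are assumptions chosen for illustration; only the 0.80 threshold is taken from the text.

```python
from math import log2

BASES = "ACGT"


def to_log_odds(counts, pseudocount=0.8, background=0.25):
    """Convert per-position base counts into log-odds weights."""
    pwm = []
    for column in counts:
        total = sum(column[b] for b in BASES) + pseudocount
        pwm.append({
            b: log2((column[b] + pseudocount * background) / total / background)
            for b in BASES
        })
    return pwm


def relative_score(pwm, site):
    """Scale the raw score to [0, 1] between the worst and best possible site."""
    if len(site) != len(pwm):
        raise ValueError("site length must equal motif length")
    raw = sum(column[base] for column, base in zip(pwm, site.upper()))
    lowest = sum(min(column.values()) for column in pwm)
    highest = sum(max(column.values()) for column in pwm)
    return (raw - lowest) / (highest - lowest)


def allele_effect(pwm, ref_site, alt_site, threshold=0.80):
    """Score both alleles and flag whether the SNP crosses the threshold."""
    ref, alt = relative_score(pwm, ref_site), relative_score(pwm, alt_site)
    return ref, alt, (ref >= threshold) != (alt >= threshold)


# Hypothetical motif with consensus ACGT, built from 20 aligned sites.
counts = [
    {"A": 17, "C": 1, "G": 1, "T": 1},
    {"A": 1, "C": 17, "G": 1, "T": 1},
    {"A": 1, "C": 1, "G": 17, "T": 1},
    {"A": 1, "C": 1, "G": 1, "T": 17},
]
pwm = to_log_odds(counts)
ref, alt, crosses = allele_effect(pwm, "ACGT", "ACTT")
print(f"ref={ref:.2f} alt={alt:.2f} crosses_threshold={crosses}")
```

The same function covers both directions of effect: a variant that drops below the threshold suggests loss of a site, while one that rises above it suggests creation of a new site.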

Post-transcriptional processing SNPs

Single-nucleotide mutations can affect the cell in ways even beyond amino acid substitution and transcriptional regulation. There are several cases of SNPs affecting protein function through alterations in splicing rather than affecting protein structure through amino acid substitution (Pagani and Baralle 2004; Srebrow and Kornblihtt 2006). Furthermore, gene expression can be regulated through means such as micro RNAs (miRNAs) or polyadenylation rather than through genomic regulatory regions. These factors can also be influenced through mutation and result in abnormal phenotypes.

Splicing mechanisms in humans are relatively well known, and with the genome tools available today it has become possible to systematically identify genomic sites important in splicing, which in turn allows the identification of variants that may affect splicing. Several large-scale efforts have attempted to identify SNPs that may have an effect on splicing (ElSharawy et al. 2006; Hull et al. 2007; Nembaware et al. 2008), and a number of tools exist for identification of splicing-related genomic elements (Cartegni et al. 2003; Thierry-Mieg and Thierry-Mieg 2006). There are multiple accounts of splicing SNPs with an impact on cancer risk, with many of these being identified in the breast-cancer susceptibility gene BRCA1 (Mazoyer et al. 1998; Pettigrew et al. 2005). Splicing mutations and polymorphisms can affect phenotype in several ways. Modification of a splicing donor or acceptor site could affect splicing efficiency, which can result in unwanted constitutive splicing or reduced splicing. Altering splicing enhancers or silencers could have a similar effect on splicing efficiency. Alternative splicing is a well-known phenomenon that could also be affected by genetic variation. The large amount of transcript data now available suggests that alternative splicing is an extremely prevalent process in humans, and misregulation of this process could produce undesirable phenotypes (Blencowe 2006). Several splicing-related mutations have been identified in cis-acting sequences that affect cancer-related genes and drive cancer formation. Li-Fraumeni syndrome and Peutz–Jeghers syndrome are hereditary genetic disorders that substantially increase cancer risk, and both have been linked to splicing-related polymorphisms (Hastings et al. 2005; Warneford et al. 1992).

In a non-cancer example related to post-transcriptional regulation, Uitte de Willige et al. (2007) found that a single SNP in the alternatively spliced fibrinogen gamma (FGG) gene leads to increased risk for deep-venous thrombosis. Linkage studies found that a particular haplotype of FGG was associated with increased disease risk and reduced protein levels. The C10034T SNP was discovered to be primarily responsible for this phenotype, and sequence evidence suggested that this SNP could be disrupting the normal polyadenylation signal and in fact increasing polyadenylation of one isoform relative to another and disrupting the ratios of protein production. This phenomenon is similar to rSNP variations resulting in misregulated gene expression except that it occurs at the post-transcriptional level, demonstrating that disruption of proper protein production at any stage could lead to deleterious effects.

In another example in which disruption of post-transcriptional regulation of gene expression contributes to disease risk, numerous studies have linked mutations in the Hmga2 gene to cancer risk (Fedele et al. 2001; Lee and Dutta 2007; Mayr et al. 2007). Many of these mutations truncate the open-reading frame (ORF) of the Hmga2 gene, but a subset leave the ORF intact and instead truncate only the 3′ untranslated region (UTR). The Hmga2 3′ UTR contains several conserved binding sites for the let-7 miRNA, and experimental evidence suggests that let-7 is involved in post-transcriptional repression of Hmga2 production. Further studies confirmed that truncation of the Hmga2 3′ UTR was indeed responsible for loss of let-7 repression, which in turn leads to oncogenesis (Lee and Dutta 2007; Mayr et al. 2007). In a manner similar to Hmga2/let-7, there is evidence that miRNA SNPs can also be involved in altering drug response. For example, the C829T SNP in the 3′ UTR of dihydrofolate reductase (DHFR) appears to affect miRNA-dependent regulation of DHFR expression. DHFR is the target of the commonly used chemotherapeutic agent methotrexate, and studies have shown that SNP C829T causes loss of miR24 miRNA binding and results in DHFR overexpression, which in turn drives resistance to methotrexate (Mishra et al. 2007).
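Loss of a miRNA-binding site through a 3′ UTR variant can be screened for with a simple seed-match check. The sketch below searches for the reverse complement of miRNA nucleotides 2 to 8, a common simplification of target recognition that ignores site context, pairing outside the seed and conservation. The let-7a sequence is the mature miRNA; the two UTR fragments are invented and are not taken from Hmga2 or DHFR.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def to_rna(seq):
    """Upper-case a sequence and write it in the RNA alphabet."""
    return seq.upper().replace("T", "U")


def seed_match_site(mirna):
    """Return the mRNA site complementary to miRNA nucleotides 2-8."""
    seed = to_rna(mirna)[1:8]
    return "".join(COMPLEMENT[base] for base in reversed(seed))


def seed_site_lost(mirna, ref_utr, alt_utr):
    """True when the reference UTR has a seed match that the variant lacks."""
    site = seed_match_site(mirna)
    return site in to_rna(ref_utr) and site not in to_rna(alt_utr)


LET7A = "UGAGGUAGUAGGUUGUAUAGUU"   # mature let-7a
ref_utr = "AAACUACCUCAGG"          # hypothetical UTR fragment with a site
alt_utr = "AAACUACGUCAGG"          # same fragment with one substitution

print(seed_match_site(LET7A), seed_site_lost(LET7A, ref_utr, alt_utr))
```

Swapping the reference and variant arguments turns the same check into a test for creation of a new site.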

There are many tools available for predicting the potential functional impact of SNPs, whether they affect the coding sequence of a protein, the sequences regulating the expression of a gene, or other aspects of protein expression. In addition, there are several databases available that catalog known SNPs and provide tools for selecting SNPs under specified criteria. A number of resources have been developed specifically to assist in analysis of potential regulatory SNPs (Table 3). For instance, rSNP_Guide (Ponomarenko et al. 2002, 2003) and SNP@Promoter (Kim et al. 2008) both store known SNPs, but specifically attempt to associate them with known transcription factor-binding sites for identification of potential regulatory SNPs. MAPPER is a companion tool to the SNPper retrieval system and database that locates computationally predicted transcription factor-binding sites (Riva and Kohane 2002). Both PolyMAPr (Freimuth et al. 2005) and PupaSNP Finder (Conde et al. 2006) use computational methods to find possible exonic splicing enhancer sites that can then be mapped to SNP locations to find SNPs that potentially affect splicing. Beyond mapping SNPs to potential transcription factor-binding sites or promoter regions to find SNPs that may have regulatory significance, these tools also use functional information provided by databases such as HGMD (Stenson et al. 2008) or OMIM (Amberger et al. 2009) to provide a functional annotation for some SNPs. Figure 1 shows a schematic of the various sorts of functional mutations that are possible and the tools that are available for analysis of each type of functional region. The greatest number of methods exists for analysis of non-synonymous coding mutations, but there are tools available that attempt to identify each of the non-coding functional regions, such as exonic splicing enhancers (Cartegni et al. 2003), splice junctions (Stamm et al. 2006; Thierry-Mieg and Thierry-Mieg 2006), polyadenylation sites (Lambert et al. 2004; Tabaska and Zhang 1999) (http://www.imtech.res.in/raghava/polyapred/), transcription enhancers (Hallikas et al. 2006; Palin et al. 2006), micro-RNA-binding sites (Enright et al. 2003; Griffiths-Jones et al. 2008; Wang 2008), and transcription factor-binding sites (Matys et al. 2003; Sandelin et al. 2004). Furthermore, a recent study has shown that some regions of the genome have evolutionarily conserved three-dimensional DNA structures that correlate with non-coding functional genomic regions, thereby providing another method for identification of important substitutions (Parker et al. 2009). In each of these cases, even if no tool specifically predicts the potential functional impact of a sequence alteration, simply examining the difference in scores between the wild-type and mutated sequences is a method that can be universally applied. Together, these algorithms could provide a comprehensive means of functional prediction for all kinds of genetic variation.
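The mapping step shared by many of these resources, assigning each SNP to the predicted functional elements that contain it, can be sketched as an interval lookup. The feature coordinates and SNP identifiers below are hypothetical, and the linear scan per chromosome is chosen for clarity; a genome-scale pipeline would more plausibly use an interval tree or a sorted index.

```python
from collections import defaultdict
from typing import NamedTuple


class Feature(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    kind: str   # e.g. "TFBS", "ESE", "polyA_signal"


def annotate_snps(snps, features):
    """Return {snp_id: sorted list of feature kinds overlapping the SNP}.

    snps: {snp_id: (chrom, 1-based position)}
    features: iterable of Feature records from any prediction tool
    """
    by_chrom = defaultdict(list)
    for feature in features:
        by_chrom[feature.chrom].append(feature)
    annotations = {}
    for snp_id, (chrom, position) in snps.items():
        kinds = {
            f.kind
            for f in by_chrom.get(chrom, [])
            if f.start <= position <= f.end
        }
        annotations[snp_id] = sorted(kinds)
    return annotations


# Hypothetical predicted elements and SNP positions.
features = [
    Feature("chr1", 100, 112, "TFBS"),
    Feature("chr1", 105, 110, "ESE"),
    Feature("chr2", 500, 505, "polyA_signal"),
]
snps = {"snp_1": ("chr1", 107), "snp_2": ("chr1", 300), "snp_3": ("chr2", 505)}
print(annotate_snps(snps, features))
```

SNPs that fall in no annotated element are retained with an empty list, so the output can be joined back to the full SNP set without losing records.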

Table 3 List of resources available for high-throughput SNP annotation and selection
Fig. 1 Genomic regions that are subject to functional alteration through single-nucleotide substitutions. Select computational tools that could be used for mapping or analysis of the various kinds of sequence elements are listed under each category. Methods for analysis of amino acid substitutions are roughly separated into those that incorporate protein structure information and those that are purely sequence based

Although the examples of functional non-coding SNPs presented above are all naturally occurring polymorphisms, it is likely that somatically acquired mutations would have similar functional effects. The focus of most large-scale cancer sequencing projects to date has been on finding mutations in coding regions, with the majority of sequencing projects focusing on transcript sequence. Because of this data-generation bias, the amount of sequence data available for non-coding regions of the human genome is substantially smaller than for coding regions, but the extensive evidence from SNP data presented above implies that some cancers may be at least partially driven by regulatory, splicing, or miRNA mutations. With the release of the first completely sequenced cancer genome and other cancer sequencing projects with coverage beyond the coding regions (Ley et al. 2008), the data are becoming available to fully explore the extent of non-coding mutations in cancer.

Conclusion

The field of cancer mutation research is accelerating dramatically as the rate of data generation increases with advancements in sequencing technology. With the success of targeted cancer therapeutics, it appears that there is a significant benefit to be gained from continued efforts to identify genes and mutations that drive cancer development. The recent cancer genome sequencing projects are just the beginning of what is likely to be a continued flood of mutation data as next generation sequencing technologies continue to increase throughput and decrease cost, enabling the examination of both more regions of the genome and more samples. The combination of these factors emphasizes the need for robust computational pipelines for analysis of mutation data. Frequency-based methods are best equipped to leverage the statistical benefits of large datasets, but they may be subject to some weaknesses that amino acid-change-based methods can mitigate. Methods targeted to specific types of proteins (such as kinases and their targets) have also been shown to be more effective than generalized tools because they can incorporate more specific models and reduce noise through prior knowledge (Radivojac et al. 2008; Torkamani and Schork 2008). Integration of other datasets, such as genome-wide expression and copy number analysis, will also be crucial in providing the best candidates for focused analysis (TCGA 2008).

Many of the issues that resulted from small sample sizes or a candidate gene approach in early studies will be mitigated by the decreasing cost of sequencing. However, the rapidly approaching dream of inexpensive complete genome sequencing will also bring about a new set of analysis challenges. Current cancer genome sequencing projects are able to constrain their analysis by focusing on protein-coding regions and non-synonymous coding mutations, since these account for the majority of generated data. Next generation sequencing efforts will likely include the copious amounts of non-coding genomic sequence in the human genome that have not yet been examined by most existing sequencing efforts. The results from GWAS and small-scale experiments have already demonstrated that non-coding mutations can have a significant impact on cellular function, but novel analysis methods are needed to leverage these new data.

The exponentially increasing ability to sequence has enabled experiments that were previously prohibitively expensive. This new technology leads to an exciting time in the field of cancer mutation research, but puts the burden on the computational tools to provide the greatest value from the generated data. Next generation tools will have to be both accurate and fast to process the large amounts of incoming data, and it will require multilateral efforts to fully mine each dataset.