Background

Expressed Sequence Tags (ESTs) are single-pass, partial sequences of cDNA clones derived from a large number of diseased and normal tissues [1]. ESTs have been used extensively for gene discovery and transcript mapping in a wide range of organisms, including human and mouse [1, 2]. ESTs have also been used for SNP identification [3–5], gene expression analysis and transcriptome analysis [6, 7]. Currently there are more than 4 million human ESTs in the GenBank dbEST database, and the number is still growing.

Susceptibility to common, complex diseases is in part genetically determined [8–11], although the genetic contribution might vary greatly depending on the disease. Single nucleotide polymorphisms (SNPs) are the most common genetic variation in the human genome, and the number of SNPs identified experimentally is growing rapidly. Currently, dbSNP (build 106) contains more than 2.7 million unique SNPs. These data provide a vital resource for studying the role of specific sequence alterations in disease susceptibility as well as drug resistance/sensitivity. In recent years SNPs have been favored as more tractable genotypic markers [12]. As genetic markers, SNPs have several advantages over microsatellite sequence repeats, including abundance (one every 750–1000 bp) [13], stability, and suitability for high-throughput analysis. As a consequence, SNPs are being used with increasing frequency as markers in human genetic analysis, such as studies of comparative population variation [14–16] and candidate gene association analysis [17–21]. Finally, the combination of SNP analysis with new approaches to investigate profiles of gene expression and proteomics should lead to fundamental insights into the biological importance of common genetic variations in the human genome [22].

Cancer is a polygenic, complex disease caused by the interaction of many genetic and environmental factors [23, 24]. Presently, fewer than 10% of tumor cases are attributable to the inheritance of mutations in a single gene, such as BRCA1/2, BRAF and p53. A mutation in any one gene in the polygenic pathway may have a small effect on the risk of developing cancer in a particular individual, but may still make a substantial contribution to cancer incidence within the population if the mutation is present at high frequency [22, 23]. Careful study of the huge number of single nucleotide polymorphisms (SNPs) will eventually provide new insights into carcinogenic mechanisms [19, 25, 26].

In this report, we detail a novel approach that uses the publicly available dbEST and dbSNP datasets (http://www.ncbi.nlm.nih.gov) to identify SNPs located in genes potentially involved in tumor development.

Methods

SNP classification by clustering with dbEST and in silico genotyping

All unique SNP and EST records were obtained from the NCBI databases (dbSNP build 106, http://www.ncbi.nlm.nih.gov/SNP; dbEST release 092002, http://www.ncbi.nlm.nih.gov/dbEST/). EST sequences and their associated tissue library information were extracted and organized in a relational database (Sybase SQL Server Release 11.0; Sybase Inc., CA). The EST cDNA libraries were manually curated and cataloged into tumor and non-tumor libraries. A total of 4153 tumor and 2178 non-tumor cDNA libraries were identified.

EST and SNP sequences were clustered using a "common tag" method as previously described [7]. SNP sequences that formed contigs with ESTs were, for the purposes of this study, assumed to map to exons and were therefore designated coding SNPs (cSNPs). SNP sequences that did not align with ESTs were excluded from further analysis. For each cSNP, sequence alignment was performed against dbEST using BLASTN [27]. To eliminate false clustering, 95% identity over 50 bp was selected as the minimum homology threshold. The genotype of each EST at the SNP position was read from the BLAST alignment. For SNP sequences with at least 50 EST hits, ESTs were grouped further by their tissue sources. The cutoff of 50 ESTs was chosen arbitrarily, and corresponds to an average representation of 8 distinct tissues. One SNP allele was assigned to each tissue if it was present in more than 80% of the ESTs from that tissue. A tissue was designated heterozygous if both SNP alleles were present in an equal number of ESTs.
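The filtering and per-tissue allele assignment described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the function names and the input layout (one `(tissue, allele)` pair per EST hit) are assumptions, and the text does not say how tissues with an intermediate allele split (for example 60/40) were handled, so the sketch simply leaves them uncalled.

```python
from collections import Counter, defaultdict

MIN_IDENTITY = 95.0    # minimum percent identity (from the text)
MIN_ALIGN_LEN = 50     # minimum aligned length in bp (from the text)
MIN_EST_HITS = 50      # minimum number of EST hits per SNP (from the text)
MAJOR_FRACTION = 0.80  # an allele must exceed this share of a tissue's ESTs


def passes_blast_filter(percent_identity, alignment_length):
    """Keep only hits with at least 95% identity over at least 50 bp."""
    return percent_identity >= MIN_IDENTITY and alignment_length >= MIN_ALIGN_LEN


def call_tissue_genotypes(est_hits):
    """Assign one call per tissue for a single SNP.

    est_hits: iterable of (tissue, allele) pairs, one per EST covering the SNP
    (hypothetical input layout). Returns {tissue: call}, where call is an
    allele, "heterozygous", or None when neither rule applies (assumption).
    Returns {} when the SNP has fewer than MIN_EST_HITS hits.
    """
    est_hits = list(est_hits)
    if len(est_hits) < MIN_EST_HITS:
        return {}

    by_tissue = defaultdict(Counter)
    for tissue, allele in est_hits:
        by_tissue[tissue][allele] += 1

    calls = {}
    for tissue, counts in by_tissue.items():
        total = sum(counts.values())
        ranked = counts.most_common(2)
        top_allele, top_count = ranked[0]
        if top_count / total > MAJOR_FRACTION:
            calls[tissue] = top_allele
        elif len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            calls[tissue] = "heterozygous"
        else:
            calls[tissue] = None  # intermediate split: left uncalled here
    return calls
```

Under these assumptions, a tissue with 30 of 30 ESTs carrying allele A is called A, a tissue with a 5/5 split is called heterozygous, and a tissue with a 6/4 split receives no call.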

SNP distribution analysis in normal vs. tumor cases

The major and minor allele frequencies for each SNP were calculated for tumor and normal tissue. Fisher's exact test was used to test whether the occurrence of the SNP genotype differed significantly between the two tissue types. Fisher's exact test is calculated as follows:

Let there be two variables X and Y, with m and n observed states, respectively. Form an m × n matrix in which the entry a_ij is the number of observations in which x = i and y = j. Calculate the row and column sums R_i and C_j, respectively, and the total sum N of the matrix.
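The probability expression that this set-up leads to is the standard hypergeometric form of Fisher's exact test. It is given here as a reminder in the notation of the preceding paragraph, not as a transcription of the authors' own equation. Conditional on the row and column sums, the probability of observing a particular matrix is:

```latex
P = \frac{\left(\prod_{i=1}^{m} R_i!\right)\left(\prod_{j=1}^{n} C_j!\right)}
         {N!\,\prod_{i=1}^{m}\prod_{j=1}^{n} a_{ij}!},
\qquad
N = \sum_{i=1}^{m} R_i = \sum_{j=1}^{n} C_j .
```

The two-sided P value is the sum of these probabilities over all matrices with the same row and column sums that are no more probable than the observed matrix.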

All SNPs meeting the Fisher's exact test significance threshold (P < 0.05) were further analyzed for amino acid codon conservation.
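For a 2 × 2 table (for example, two alleles by two tissue classes), the test can be sketched with the Python standard library alone. The table layout and the example counts are assumptions made for illustration; the paper does not give the counts for any individual SNP.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Illustrative layout (assumption): rows are tumor and normal tissues,
    columns are counts of tissues assigned allele 1 and allele 2.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, given the margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of all tables no more probable than the observed one;
    # the small tolerance guards against floating-point ties.
    return sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


# Hypothetical counts: allele 1 in 8 tumor tissues, allele 2 in 8 normal tissues.
p_value = fisher_exact_2x2(8, 0, 0, 8)
is_significant = p_value < 0.05
```

Exact integer binomial coefficients are used so that the probabilities stay accurate for the small tissue counts typical of this kind of analysis.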

Codon conservation analysis for cSNPs with P < 0.05

SNP sequences were subjected to BLAST analysis against the GenBank nr database (ftp://ftp.ncbi.nih.gov/genbank/) using BLASTX. The top protein hit with greater than 95% identity over a 30-amino-acid window was further analyzed to determine whether the SNP resulted in an amino acid change. In the codon containing the IUPAC code of the SNP, that code was replaced with each of the nucleotides it represents (A, T, C or G), and the resulting codons were tested for an amino acid change.
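The IUPAC expansion step can be sketched as follows, assuming the standard genetic code and a codon already extracted in the correct reading frame. The function names are hypothetical, and the example codons are illustrative only; they are not taken from the genes discussed in this paper.

```python
from itertools import product

# Standard genetic code, listed with bases in TCAG order at each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# IUPAC nucleotide codes and the bases each one represents.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_codon(codon):
    """Return every concrete codon encoded by a codon with IUPAC codes."""
    options = [IUPAC[base] for base in codon.upper()]
    return ["".join(bases) for bases in product(*options)]


def amino_acids_for(codon):
    """Map each concrete codon to its amino acid ('*' marks a stop codon)."""
    return {c: CODON_TABLE[c] for c in expand_codon(codon)}


def changes_amino_acid(codon):
    """True if the alleles at the SNP position encode different residues."""
    return len(set(amino_acids_for(codon).values())) > 1


# Illustrative codons: R = A/G at the first position, R = A/G at the third.
nonsynonymous = changes_amino_acid("RAA")  # AAA (Lys) vs GAA (Glu)
synonymous = changes_amino_acid("GCR")     # GCA (Ala) vs GCG (Ala)
```

A SNP would be retained by this step only when `changes_amino_acid` returns true for its codon.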

Results and Discussions

A large number of studies have focused on investigating genetic polymorphisms in individual genes in order to estimate the genetic contribution to the development of cancer [28]. Cancer susceptibility SNPs have been identified among genes with known activity in cell cycle maintenance and DNA repair as well as those encoding phase I and phase II enzymes [28]. Recent advances in large-scale SNP genotyping have made genome-wide SNP association analysis possible [29, 30]. However, despite large efforts to identify SNPs in genes previously identified as candidates for cancer susceptibility, genome-wide identification and characterization of SNPs among cancer patients or tumor tissue has not been reported.

This report describes a comprehensive allele frequency analysis of ~2.7 million unique SNPs in tumor vs normal tissues. The goal of the study was to identify SNPs over-represented in tumor-derived ESTs using dbEST tissue library information. Initially, all SNPs from dbSNP build 106 were downloaded from NCBI. A total of 741,244 (27.5%) SNPs mapped to transcribed regions by clustering to the dbEST database (release 092002) (Table 1). SNPs overlapping with an EST were further subjected to allele frequency analysis using dbEST tissue information. Fisher's exact test identified 4865 (0.66%) SNPs with allele frequencies that were significantly different between tumor and normal tissue. A less conservative significance threshold of P < 0.05 was used in this study. Multiple testing correction was not performed, as we were more willing to accept false positives than false negatives. Multiple testing correction, including the Bonferroni correction (which assumes independent markers), would markedly overcorrect for the inflated false-positive rate and thereby throw away valid information. This is especially true given the large number of tests involved in this study and the relatively small P values obtained due to the low count of tissues for each SNP.
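To make the trade-off concrete, the sketch below shows how a Bonferroni-corrected threshold shrinks as the number of tests grows. The test counts are hypothetical, because the number of SNPs that actually entered the Fisher's exact test is not stated in this section.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


ALPHA = 0.05  # uncorrected threshold used in this study

# Hypothetical numbers of tested SNPs, for illustration only.
for n_tests in (100, 10_000, 100_000):
    threshold = bonferroni_threshold(ALPHA, n_tests)
    print(f"{n_tests:>7} tests: per-test threshold = {threshold:.1e}")
```

With only a handful of tissues per SNP, the smallest P value that Fisher's exact test can produce may lie above such a corrected threshold, which is the sense in which the correction could discard real signals.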

Table 1. SNP counts at each analytical step. SNP sequences that overlap with ESTs are referred to as cSNPs.

Table 2 [see Additional file 1, table2.pdf] summarizes the cSNPs identified by the present analysis that have significantly different allele frequencies in tumor and normal tissue and that result in an amino acid change. Many of the corresponding genes are known to be involved in tumor development.

HLA/MHC gene SNPs represented approximately 15% (50/327) of the cSNPs identified as differing significantly between tumor and normal tissue. It has been previously reported that tumor cells undergo changes in the major histocompatibility complex (MHC) class I locus during tumor development [31, 32]. These HLA losses produce tumor cells that are able to escape anti-tumor T cell immune responses. Defects in the antigen processing machinery and in HLA class I antigens in malignant cells may have a significant impact on the clinical course of malignant diseases and on the outcome of T cell-based immunotherapy [32]. In addition, MHC class I loss or downregulation in cancer cells is a major immune escape route used by a large variety of human tumors to evade anti-tumor immune responses mediated by cytotoxic T lymphocytes. Multiple mechanisms are responsible for such HLA class I alterations. These data suggest that SNPs in HLA genes might be another important mechanism causing loss of function, affecting the role of HLA in presenting immunogenic peptides to T cells.

Glutathione S-transferases (GSTs) constitute a large multigene family of phase II enzymes involved in the detoxification of potentially genotoxic chemicals. Total or partial deletions or SNPs in alleles encoding GSTM1, GSTM3, GSTP1, GSTT1 and GSTZ1 are associated with reduced enzymatic activity toward several substrates of different GST isoenzymes. In addition, molecular epidemiological studies indicate that a single SNP in glutathione S-transferase appears to be a moderate lung cancer risk factor. However, the risk is higher when interactions with additional GST polymorphisms and other risk factors (e.g. cigarette smoking) occur. Individuals with a decreased rate of detoxification or with "high risk" glutathione S-transferase genotypes have a slightly higher level of carcinogen-DNA adducts and more cytogenetic damage [28, 33]. Blackburn et al. have reported that an A/G transition at position 94 of GSTZ1, which results in a Lys to Glu change in the encoded peptide, displayed differences in activity towards several substrates [34]. This SNP (rs7975) was also found to display different allele frequencies in normal compared to tumor tissue in this study. The present study also identified a SNP (rs1065411) in GSTM1, causing a Lys to Gln change at residue 173. Based on the association of the SNP in GSTZ1 with increased cancer risk, further analysis of the variability in GSTM1 is warranted.

The protein kinase PITSLRE is part of the large family of p34cdc2-related kinases whose functions appear to be linked to the control of cell division and possibly programmed cell death [35]. Evidence also suggests that one or more PITSLRE kinase isoforms may be tumor suppressors [36]. It has been suggested that one PITSLRE isoform, the p110 protein kinase, is cleaved in vivo by multiple caspases during Fas-mediated cell death at several sites within the amino-terminal domain, and that caspase cleavage of this protein is affected by its phosphorylation [37]. This study identified one SNP (rs1059828) in PITSLRE kinase (amino acid 401 on CDC2L1, NP_277021.1 and amino acid 396 on CDC2L2, NP_284922.1) which yields a Ser->Leu amino acid alteration with significantly different allele frequencies in normal compared to tumor tissues. Feng et al. [38] discovered a similar mutation (C/T at nucleotide location 97 of exon 7, Ser->Leu) in PITSLRE CDC2L1 in the melanoma cell line UACC903. While their exact role remains to be tested, the potential of these two independently identified mutations to alter phosphorylation sites on PITSLRE kinase suggests that they may be important in tumor development.

Finally, mitochondria have been reported to play a key role in various apoptotic processes, including cell death induced by cytotoxic agents [39, 40]. Mitochondria undergoing permeability transition release apoptogenic proteins such as cytochrome c and apoptosis-inducing factor from the mitochondrial intermembrane space into the cytosol, where they can activate caspases and endonucleases [39, 40]. This analysis has identified several mitochondrial genes, including dUTP pyrophosphatase and ATP synthase, with significantly different SNP allele frequencies. While their role in apoptosis remains to be determined, the large number of SNPs in mitochondrial genes revealed by this analysis suggests that such mutations may contribute to tumor development.

The approach described here has several limitations. Due to the continually evolving nature of the human protein catalog, SNPs located in previously unannotated coding regions were not included in this analysis. A complete list of all significant SNPs is available [see Additional file 2, all_snp.xls], allowing the analysis to be repeated as the protein catalog is updated. In silico analyses are also limited by the quantity and quality of the data present in the databases used. The data in dbEST are not well annotated with regard to the precise origin of the source tissue used in cDNA library construction. It is possible, for example, that EST data from multiple tissues sourced from the same donor were used in the present analysis. This lack of diversity could artificially bias the significance of any particular allelic imbalance observed. The homogeneity of the tissue characterized as tumor-derived is another potential source of error. Actual tumor tissue samples might contain a large portion of the normal tissue into which the tumor infiltrates. For those reasons, the number of ESTs that contain a particular SNP and the diversity of source tissues that contain those SNPs will affect the quality of the analysis. In addition, the limited representation of low-abundance transcripts in dbEST has likely introduced a bias towards SNPs present in genes which display widespread tissue distribution, or which are expressed in tissue types overrepresented in the database. SNPs present in genes expressed at low levels are underrepresented in this analysis, as they were likely to fall short of the protocol thresholds. Another drawback is that somatic mutations might be excluded from the list, since the majority of dbSNP entries represent chromosomal mutations and therefore primarily represent inherited polymorphisms. Somatic mutations that cause cancer in some genes (e.g. BRAF) [19] might not be detected if the same mutation is not stably inherited.
Lastly, bias was introduced by using EST tissue library information to assess allele frequencies. This limited the present analysis to SNPs present in known or predicted amino acid coding sequences, excluding common, functional intronic or promoter region SNPs which may result in splicing or expression changes.

Large-scale genotyping of samples from patients will lead to important breakthroughs in understanding the mechanisms of gene-environment and gene-gene interactions in common polygenic cancers. Efforts have been initiated in large sequencing laboratories to carry out comprehensive SNP analysis in all disease candidate genes. However, this is a labor-intensive, lengthy and very costly effort. The in silico analysis described here provides a quick and economical approach to screening the large number of identified SNPs in the human genome to pinpoint possible cancer susceptibility genes, using the rich tissue and library information present in the public dbEST database. Nevertheless, the positive associations of SNPs with cancers reported here are very preliminary and are subject to interpretation and careful experimental validation. Only when studies in different populations produce similar results can a SNP be accepted as a genuine cancer risk factor.

Although we did not validate all the tumor-related genes identified in this report, the approach taken here identified numerous hits in DNA repair genes, genes encoding phase I and phase II enzymes and other tumor-related genes, some of which are already under scrutiny by the cancer research community. A couple of the SNPs revealed in this analysis have been suggested to have roles in tumor development in previously published studies [33]. As a complement to other disease gene and SNP association studies, this approach can help to prioritize the genes that need to be validated and further help to elucidate the genetic contribution to the development of cancer. This method can also help to identify new genes or SNPs that might be crucial to tumor development. Additional genome-wide screens of cancer cell DNA for somatic mutations will ultimately provide a more complete picture of the number and patterns of mutations underlying human oncogenesis.