Background

A growing body of data suggests that selenium deficiency is a risk factor for certain cancers, neurodegenerative disorders and complications from diabetes [1-4]. Selenium is required for normal immune function, and selenium deficiency can be associated with enhanced infectious disease severity [1, 5]. Selenium deficiency impairs the expression and production of selenium-containing enzymes, known as selenoproteins, resulting in enhanced susceptibility to oxidative stress. In addition, functional polymorphisms in selenoprotein genes might also influence selenoenzyme expression, stability or activity, modifying disease outcomes in a manner similar to that observed with selenium deficiency.

The 6 genes selected for re-sequencing in this project play an important role in antioxidant defense; they include selenoprotein P (SEPP1), thioredoxin reductase 1 (TXNRD1), and 4 selenium-containing glutathione peroxidase genes, GPX1, GPX2, GPX3 and GPX4 [6-8]. The glutathione peroxidase family is the largest of the selenoprotein gene families. Glutathione peroxidases are named for their ability to use glutathione as a reducing substrate. GPX1 and GPX2 appear to have similar substrate specificity, catalyzing the reduction of hydrogen peroxide to water, but differ in their tissue distribution, with GPX1 expression being particularly abundant in erythrocytes and GPX2 expression being restricted primarily to the gastrointestinal tract [9, 10]. GPX1 knockout mice have a normal phenotype, but are highly sensitive to oxidative stressors [11]. Some epidemiologic studies have correlated low GPX1 activity or particular GPX1 polymorphisms with enhanced risk of cancer, although these correlations have not been consistently observed in all populations [12-17]. Mice with combined disruption of GPX1 and GPX2 develop bacteria-associated ileocolitis and intestinal cancers [9]. GPX3 (extracellular or plasma) is a circulating plasma selenoprotein and is able to utilize thioredoxin reductase, thioredoxin or glutaredoxin as reductants [18]. GPX4 reduces phospholipid hydroperoxides, localizes to the mitochondria or to the nucleus and the cytosol, and appears to be essential for survival [19, 20]. GPX4 expression is particularly high in various endocrine tissues, especially the testis. Moreover, in mature spermatozoa, GPX4 functions as a structural protein that helps anchor the helix of mitochondria in the midpiece of spermatozoa, suggesting a possible mechanism by which selenium deficiency might impair fertility [21, 22]. SEPP1 is a major plasma selenoprotein and, along with GPX3, accounts for the majority of plasma selenium [23].
SEPP1 is a secreted protein that likely functions as a selenium delivery molecule and perhaps as an extracellular antioxidant with glutathione peroxidase-like activity [24]. Unique among the selenoproteins, SEPP1 has 10 in-frame UGA codons, each encoding the selenium-containing amino acid selenocysteine [25]; the other known selenoproteins generally have only one UGA codon [26]. Cytosolic thioredoxin reductase (TXNRD1) is one of the most abundant selenium-containing proteins and is able to catalyze the reduction of thioredoxin in a reaction that uses electrons from NADPH [27]. TXNRD1 is a major antioxidant redox regulator and supports the function of p53. Its expression may be regulated in a contrasting pattern to GPX1 in certain cancer systems, and disruption of its expression may reverse the phenotype and carcinogenicity of lung cancer cells [28].

The primary goal of this study was to characterize genetic variation across 6 selenoprotein genes. Specifically, re-sequence analysis was performed in a multiethnic population to determine common single nucleotide polymorphisms (SNPs) and estimate haplotypes for use in large genetic association studies or for future functional studies. Sequence analysis targeted exons, regulatory regions and the sequence motifs characteristic of selenoproteins; the latter include an in-frame UGA "stop" codon that is recoded to allow insertion of the selenium-containing amino acid selenocysteine [26]. Both cis-acting features, including a 3' UTR RNA stem loop known as a selenocysteine insertion sequence (SECIS), and trans-acting factors (including tRNA-selenocysteine (TRSP), a selenocysteine-tRNA-specific elongation factor (EEFSEC) and SECIS binding protein 2 (SECISBP2)) are required for efficient selenoprotein translation [29-32]. Lastly, the selenoprotein SNPs and fine haplotype maps described in this report will be valuable resources for future functional studies and for population-specific genetic studies designed to comprehensively explore the role of selenoprotein genetic variants in the etiology of human diseases.

Results

Polymorphism analysis

Six selenoprotein genes (GPX1, GPX2, GPX3, GPX4, SEPP1 and TXNRD1) were re-sequenced using the SNP500 polymorphism discovery resource (Table 1), a panel of 102 DNA samples obtained from lymphoblastoid cell lines from 4 ethnically diverse control groups: Caucasian (CA, n = 31), African American (AA, n = 24), Pacific Rim/Asian (PR, n = 24), and Hispanic (HI, n = 23). In all, the re-sequencing project covered 58,251 base pairs of genomic sequence per sample, for a total of >5.9 million sequenced base pairs. The mean number of base pairs sequenced per gene was 9,709 (range, 7,007 to 13,880). On average we sequenced 3,320 bases 5' of the ATG and 3,282 bases 3' of the stop codon. In each case the re-sequencing spanned all exonic regions and the 3' UTR SECIS region. The re-sequencing of the SEPP1 locus was extended to include the exons and 5' region of an antisense transcript that overlaps the 3' UTR of the SEPP1 locus. Of the 235 segregating sites, the numbers of SNPs with a rare allele frequency ≥0.05 or ≥0.1 were 103 and 92, respectively. In this regard, we observed a small number of rare variants (Additional Files 1 to 6).
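The coverage totals quoted above follow from simple arithmetic. The short Python sketch below recomputes them from the figures given in the text; the variable names are illustrative and are not taken from the study's own analysis pipeline.

```python
# Figures quoted in the text; variable names are illustrative only.
BASES_PER_SAMPLE = 58_251   # genomic base pairs covered across the 6 loci
N_GENES = 6
group_sizes = {"CA": 31, "AA": 24, "PR": 24, "HI": 23}

n_samples = sum(group_sizes.values())        # DNA samples in the panel
n_chromosomes = 2 * n_samples                # diploid samples
total_bases = BASES_PER_SAMPLE * n_samples   # total sequenced base pairs
mean_per_gene = BASES_PER_SAMPLE / N_GENES   # mean base pairs per gene

print(n_samples, n_chromosomes, total_bases, mean_per_gene)
```

The total is consistent with the ">5.9 million" base pairs reported, and the mean per gene rounds to the reported 9,709.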

Table 1 Details of Sequence Analysis of 6 Selenoprotein Genes in a 102 Person Multi-ethnic Population Performed to Identify Single Nucleotide Polymorphisms

The analysis of the possible sites of heterozygosity in the coding regions revealed several interesting observations. Of the 235 SNPs determined across the 6 genes, our analysis identified 5 non-synonymous variants, 6 synonymous variants and 224 non-coding SNPs. The coding region SNPs identified were located in the GPX1 (P75R, L91L, A192T, and P198L), GPX3 (L13L), GPX4 (L193L), TXNRD1 (L55L, L80L, and C383C), and SEPP1 (K19E, A234T) loci. Since sequence variation at the RNA level could in theory influence translation read through efficiency at the UGA selenocysteine codon, synonymous variants might be of particular functional relevance in selenoproteins; however, none of the identified synonymous substitutions were in the immediate vicinity of a selenocysteine codon. No putative coding region SNPs were identified in the antisense transcript that overlaps the 3' UTR of SEPP1. Identified non-coding SNPs included two SECIS region SNPs, both located within the GPX4 locus. One of these is a previously reported high frequency SNP, of possible functional significance, located 44 bp from the stop codon and just before the SECIS stem loop (stop +35 to +128) [33]. The other is a rare variant, identified in a single individual of African American/African heritage; this SNP (stop +103) is located in the vicinity of the highly conserved SECIS core. SNP density varied from 1.945 SNPs/kbp of genomic sequence in SEPP1 to 6.124 SNPs/kbp at the GPX3 locus. The mean number of SNPs/kbp for all 6 gene loci was 4.034. Perhaps reflecting greater functional constraint, the mean number of SNPs/kbp was lower in coding regions at 2.161. Within the coding region, GPX1 had the most SNPs/kbp (6.568) while GPX2 had no SNPs. 
Additional variation is present at the GPX1 and SEPP1 loci in the form of a variable number alanine repeat polymorphism within the first exon of GPX1 and a complex variable repeat polymorphism in the promoter of SEPP1, neither of which could be accurately resolved from our sequence tracings [34, 35].
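As a consistency check on the counts above, the coding variants listed in the text can be tallied by type. The sketch below assumes the usual amino acid substitution notation (reference residue, position, variant residue), so a label whose first and last letters match is synonymous; the helper name is hypothetical.

```python
# Coding-region variants exactly as listed in the text.
coding_variants = {
    "GPX1": ["P75R", "L91L", "A192T", "P198L"],
    "GPX3": ["L13L"],
    "GPX4": ["L193L"],
    "TXNRD1": ["L55L", "L80L", "C383C"],
    "SEPP1": ["K19E", "A234T"],
}
TOTAL_SNPS = 235  # all SNPs identified across the 6 genes


def is_synonymous(label):
    """A label such as L91L keeps the same residue, so it is synonymous."""
    return label[0] == label[-1]


labels = [v for variants in coding_variants.values() for v in variants]
synonymous = [v for v in labels if is_synonymous(v)]
nonsynonymous = [v for v in labels if not is_synonymous(v)]
noncoding = TOTAL_SNPS - len(labels)

print(len(nonsynonymous), len(synonymous), noncoding)
```

The tally agrees with the 5 non-synonymous, 6 synonymous and 224 non-coding SNPs reported.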

Evolutionary analysis

We determined two measures of sequence diversity at the 6 selenoprotein loci (Table 2): the population mutation parameter (Θ) and nucleotide diversity (π). The two differ in that Θ is a measure of the number of variant sites and π is a measure of the observed heterozygosity per base pair. More specifically, nucleotide diversity measures the degree of polymorphism within a population; it is defined as the average number of nucleotide differences per site between any two DNA sequences chosen randomly from the sample population. The population mutation parameter is based on the observed number of variant sites, normalized to the number of chromosomes studied and the total sequence length, which corrects for sample size [36]. For the 6 genes the mean value for nucleotide diversity was 7.2 × 10⁻⁴. The greatest nucleotide diversity (11.0 × 10⁻⁴) was observed at the GPX3 locus, and the least at the TXNRD1 locus (3.7 × 10⁻⁴). In general, sequence diversity as measured by nucleotide diversity was similar to that measured by the population mutation parameter. For the 6 genes, the mean value for the population mutation parameter was 7.3 × 10⁻⁴. Under the infinite-sites model of DNA sequence evolution, if the nucleotide sequence variation among haplotypes at a locus is neutral and the sample population is in equilibrium with respect to drift and mutation, then the degree of polymorphism estimated by the nucleotide diversity and by the population mutation parameter should be equal. Departure from this expectation is assessed statistically using Tajima's D (DT) statistic [37]. A strongly negative Tajima's D is suggestive of positive selection. In the Asian population at the GPX1 locus there was a strongly negative DT value (-1.760); however, this test did not achieve statistical significance (0.05 < P < 0.10).
Using an alternative neutrality test, the DF and F statistics of Fu and Li [38], however, we did detect possible evidence of selection at the GPX1 locus. Although non-significant for the individual subpopulations, the values for DF (-2.495) and F (-2.319) in the combined population are significant at the P < 0.05 level. We also observed significantly positive (P < 0.05) Tajima's D values at the GPX4 (2.249) and the SEPP1 (2.056) loci, in the Hispanic and Caucasian populations, respectively. Although a positive D value might be indicative of balancing selection (heterozygote advantage), a very plausible explanation for the positive values in this case is the presence of a significant degree of genetic admixture within one or both of the control populations [39].
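To make the relationship between π, Θ and Tajima's D concrete, the following Python sketch computes all three from a set of aligned haplotypes using the standard textbook formulas (Watterson's estimator for Θ and Tajima's 1989 variance constants). It is an illustration only: the function name and the toy haplotypes are hypothetical, and the sketch does not reproduce the software or the data used in this study.

```python
from itertools import combinations
from math import sqrt


def diversity_stats(haplotypes):
    """Return S, pi and theta (both per site) and Tajima's D for aligned haplotypes."""
    n = len(haplotypes)           # number of chromosomes sampled
    length = len(haplotypes[0])   # aligned sequence length in bp

    # S: number of segregating (variant) sites.
    S = sum(1 for column in zip(*haplotypes) if len(set(column)) > 1)

    # k: mean number of nucleotide differences over all pairs of sequences.
    pairs = list(combinations(haplotypes, 2))
    k = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)

    # Harmonic sums that correct the variant-site count for sample size.
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))

    # Tajima's variance constants.
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)

    variance = e1 * S + e2 * S * (S - 1)
    tajima_d = (k - S / a1) / sqrt(variance) if variance > 0 else float("nan")

    return {"S": S, "pi": k / length, "theta": S / (a1 * length), "tajima_d": tajima_d}


# Hypothetical toy alignment: 4 chromosomes, 10 bp, 3 variant sites.
toy_haplotypes = ["AAAAAAAAAA", "AAAATAAAAA", "AAAATAAAGA", "CAAATAAAGA"]
stats = diversity_stats(toy_haplotypes)
print(stats)
```

When π and Θ agree, as expected under neutrality, the numerator of D is close to zero; an excess of rare variants inflates Θ relative to π and drives D negative.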

Table 2 Sequence Diversity and Evolutionary Analysis of 6 Selenoprotein Loci Stratified by Estimated Population

Confirmation of recent positive selection using data from the HapMap Project

To confirm recent positive selection at the GPX1 locus, we used the web application Haplotter, developed in the Pritchard laboratory, to query a map of recent positive selection in the human genome. The input SNP data for this map are derived from the Phase 1 International HapMap Project [40]. A strong iHS (integrated haplotype score) signal supports the hypothesis that the GPX1 locus has undergone a recent selective sweep in the Asian population (Figure 1) [41]. Strong signatures of positive selection were not observed at the GPX2, GPX3, GPX4, TXNRD1 or SEPP1 loci in any of the subpopulations.

Figure 1

Confirmation of recent positive selection at the GPX1 locus (3p21). To confirm recent positive selection at the GPX1 locus, we used Haplotter to query the results of a scan for positive selection in the human genome developed using SNP data from the International HapMap project [41]. The vertical line indicates the location of the GPX1 locus. The strong iHS (integrated haplotype score) signal in the Asian (ASN) population at this locus is highly suggestive of recent positive selection. Data are based on the analysis of unrelated individuals from 3 populations: ASN (Han Chinese and Japanese, n = 89), CEU (Northern and Western European, n = 60), and YRI (Sub-Saharan Africans from the Yoruban population, n = 60).

Genetic difference between sample groups

Fst, the proportion of the total genetic variance attributable to differences between subpopulations, was calculated for each locus (Table 3). At the GPX1 locus the estimate of population subdivision between the Pacific Rim/Asian and the African American/African populations was 0.2418, and between the Pacific Rim/Asian and the Caucasian populations it was 0.2682. Altogether, the data from the re-sequencing of the SNP500Cancer population suggest that there are differences in genotype distribution between the different ethnic groups, especially at the GPX1 locus.
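As an illustration of what Fst measures, the sketch below computes Wright's Fst for a single biallelic site as (HT - HS)/HT, where HT is the expected heterozygosity of the pooled population and HS is the mean expected heterozygosity within subpopulations. This is a simplified textbook form with equal weighting of subpopulations and hypothetical allele frequencies; it is an assumption for illustration and is not necessarily the estimator used to produce Table 3.

```python
def fst_biallelic(freqs):
    """Wright's Fst for one biallelic site from per-subpopulation allele frequencies."""
    mean_p = sum(freqs) / len(freqs)
    h_total = 2 * mean_p * (1 - mean_p)                          # HT: pooled heterozygosity
    h_within = sum(2 * p * (1 - p) for p in freqs) / len(freqs)  # HS: mean within-group
    if h_total == 0:
        return 0.0  # site is monomorphic everywhere, so there is no variance to partition
    return (h_total - h_within) / h_total


# Hypothetical frequencies of one allele in two subpopulations.
fst_diverged = fst_biallelic([0.8, 0.2])
fst_identical = fst_biallelic([0.4, 0.4])
print(fst_diverged, fst_identical)
```

Identical allele frequencies give an Fst of zero, and the value rises toward 1 as the subpopulations diverge.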

Table 3 Estimation of Population Subdivision (Fst) at 6 Selenoprotein Loci

Haplotype structure

The most probable PHASED haplotypes derived using SNPs with minimum rare allele frequencies of ≥5% are presented as supplementary data (Additional Files 7 to 12). Estimates for linkage disequilibrium (LD) and location of major haplotype blocks across each of the 6 selenoprotein loci are provided for the total population in Figures 2 to 7 and for each of the individual ethnic groups in the supplementary data files (Additional File 13). For the data set, the haplotype diversity is restricted and the number of unique haplotypes varied by gene locus from 16 (GPX1) to 51 (GPX3), with a mean of 28.2. In most cases, the African American population had the greatest number of unique haplotypes (mean 14.8), whereas the Pacific Rim/Asian population had the fewest (mean 9.2). The number of common haplotypes with a frequency of ≥0.05 ranged from 3 (GPX3) to 5 (GPX1, SEPP1 and TXNRD1). Examined in a population specific manner, we also noted variation in the frequency of these major haplotypes. At the GPX1 locus, for example, haplotype number 1 was observed in 63% of individuals of Pacific Rim/Asian heritage, whereas the frequency of this major haplotype was much lower in the other populations (AA 0.17, CA 0.13 and HI 0.30). Similarly, at the TXNRD1 locus haplotype 1 had a frequency of 65% in the Pacific Rim/Asian population, but was observed less often in the other populations (AA 0.29, CA 0.19, and HI 0.28). Although the functional significance of the various imputed haplotypes remains to be determined, it is of interest to note that key SNPs of possible functional consequences segregate with particular haplotypes. For example, the T variant of a common GPX4 SECIS region SNP (Stop +44) is found in haplotype 1 but not in any of the 8 next most common GPX4 haplotypes. 
Similarly, for the GPX1 P198L variant, the proline variant (C) resides on the 4 most common GPX1 haplotypes, whereas the leucine variant (T) is only observed on the backbone of several rarer haplotypes (5, 6, 9, 10, 13, and 14); these rarer haplotypes are relatively uncommon among individuals of Pacific Rim/Asian heritage. In addition, there is a common non-synonymous (A234T) variant in SEPP1, located between 2 histidine-rich regions. This variant is a major distinguishing feature between the most common SEPP1 haplotype (0.36) and the next most common haplotype (0.18). Again it is notable that the T234-encoding haplotypes (2, 6 and 10) are rare in the Pacific Rim/Asian population, with respective frequencies of only 0.04, 0.02 and 0.

Figure 2

Estimates for linkage disequilibrium (LD) and location of major haplotype blocks across 6 selenoprotein loci. Pairwise plots (D') across 6 selenoprotein loci based on genotype data obtained from re-sequencing the 102 person multiethnic SNP500 DNA population, which is composed of individuals of AA, CA, HI and PR heritage. LD plots for the various ethnic subpopulations are available as supplementary data. Re-sequenced genes include GPX1 (Figure 2), GPX2 (Figure 3), GPX3 (Figure 4), GPX4 (Figure 5), SEPP1 (Figure 6), and TXNRD1 (Figure 7). SNP identifiers are indicated on the abscissas. Numbers within cells correspond to LD values (D'). The LD color scheme is stratified according to the logarithm of the odds (LOD) score and D': LOD <2 (white for D'<1 and blue for D' = 1) or LOD >2 (shades of pink/red for D'<1 and bright red for D' = 1). Haplotype blocks were created using the algorithm of Gabriel et al. [76]. 95% confidence bounds on D' were generated and each comparison was called "strong LD", "inconclusive" or "strong recombination". A block was created if 95% of informative comparisons were "strong LD".
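For readers unfamiliar with D', the following Python sketch shows the standard calculation for one pair of biallelic SNPs from phased haplotype counts. The function name and the counts are hypothetical, and the sketch covers only the point estimate of D'; it does not compute the LOD score or the 95% confidence bounds that the block definition relies on.

```python
def d_prime(n_AB, n_Ab, n_aB, n_ab):
    """Normalized LD coefficient D' from counts of the four two-locus haplotypes."""
    total = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / total            # frequency of the A-B haplotype
    p_A = (n_AB + n_Ab) / total    # frequency of allele A at the first SNP
    p_B = (n_AB + n_aB) / total    # frequency of allele B at the second SNP

    d = p_AB - p_A * p_B           # raw disequilibrium coefficient D

    # Largest |D| attainable given the allele frequencies.
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    if d_max == 0:
        raise ValueError("at least one site is monomorphic; D' is undefined")
    return abs(d) / d_max


# Hypothetical counts over 100 chromosomes.
complete_ld = d_prime(50, 0, 10, 40)    # one haplotype absent
partial_ld = d_prime(40, 10, 20, 30)    # all four haplotypes present
print(complete_ld, partial_ld)
```

D' reaches 1 whenever at least one of the four possible haplotypes is absent, which is why rare SNPs can show D' = 1 with little statistical support and why the plots also stratify cells by LOD score.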

Figure 3

Figure 4

Figure 5

Figure 6

Figure 7

Discussion

Selenium deficiency impairs the production of selenium-containing proteins and may be a risk factor for cancer, infectious disease severity and enhanced susceptibility to oxidant stressors. Recently, selenium has emerged as one of the most promising cancer chemoprevention agents and is the focus of a large clinical trial (SELECT) that has enrolled 35,000 men to determine if selenium supplementation prevents prostate cancer [3, 4, 42]. It is possible that the anticancer properties of selenium are mediated through selenoproteins, many of which have antioxidant properties. An alternative hypothesis, however, suggests that the anticancer property of selenium compounds occurs at doses beyond those that are required to ensure maximal selenoprotein production [43, 44]. If selenoproteins play a direct role in cancer chemoprevention, then it is possible that genetic variation in selenoprotein activity or expression might also modify susceptibility to genome-damaging environmental exposures such as cigarette smoke or dietary carcinogens. Similarly, it is also possible that inter-individual variation in selenoprotein expression could modify disease outcomes by influencing major antioxidant pathways, such as the glutathione cycle or thioredoxin metabolism, which are relevant not only to cancer susceptibility but also to chemotherapy-induced toxicities [45] and infectious disease severity (e.g., viral myocarditis, malaria, and septic shock syndrome) [46-48]. We therefore explored the genetic variation in 6 selenoprotein genes in order to provide the foundation for the comprehensive analysis of selenoprotein genetic variation in candidate gene association studies. In this regard we have re-sequenced 6 of the 25 known human selenoprotein genes to identify common SNPs and haplotypes and to explore the selective processes acting on these loci.
The genes selected for re-sequencing and evolutionary analysis are among the best-studied selenoproteins and all have important antioxidant properties; they include 4 glutathione peroxidases (GPX1-4), SEPP1 and TXNRD1 [6-8].

In total, we sequenced approximately 5.9 million base pairs of DNA from 102 individuals, representative of 4 ethnic populations common within the United States: Caucasian (CA, n = 31), African American (AA, n = 24), Pacific Rim/Asian (PR, n = 24), and Hispanic (HI, n = 23). We identified 235 SNPs, of which 103 had a rare allele frequency of at least 0.05. For the 6 selenoprotein genes the mean value for nucleotide diversity was 7.2 × 10⁻⁴, which is similar to the value of 6.7 × 10⁻⁴ obtained by the Environmental Genome Project, which recently re-sequenced 213 genes in 90 individuals [49]. Particularly interesting SNPs (with minimum rare allele frequency ≥0.05), of potential functional importance, include the GPX1 P75R and P198L variants, a high frequency GPX4 SECIS region SNP, and an A234T non-synonymous variant in SEPP1. The GPX1 P198L and GPX4 SECIS SNPs have both been previously described [16, 33]. In addition, we identified a rare GPX4 SECIS SNP adjacent to the SECIS core. SECIS SNPs are of particular interest, as this RNA stem loop structure is required for the translational incorporation of the amino acid selenocysteine. In the absence of a functional SECIS, translation will terminate prematurely at the UGA-selenocysteine codon. At this point, the functional significance of the identified SNPs and haplotypes remains largely uncharacterized. Although there are data suggesting that each GPX1-L198 allele decreases red cell glutathione peroxidase activity by about 5%, attempts to correlate enzyme activities with specific genotypes have provided inconsistent results, perhaps reflecting the observation that selenium status may influence selenoprotein expression or enzymatic activity [16, 33, 50]. Moreover, it is possible that haplotype analysis may provide a better means for correlating genotype with enzymatic activity or serum selenium levels, especially if this is done in individuals maintained on a diet containing optimal supplemental selenium.

Overall, the pattern of the observed genetic variation was consistent with the expectations of the neutral equilibrium model of evolution for 5 genes, but at the GPX1 locus we found evidence for selection. At the GPX1 locus, the DF and F statistics of Fu and Li were strongly negative. A significantly negative D value indicates an excess of rare alleles, which is inconsistent with neutral processes in a stable population but consistent with either demographic or selective processes [38]. The fact that a similar phenomenon is not observed at the other loci suggests that it is not simply the result of a demographic process such as a recent population expansion. Additional support for selection at the GPX1 locus is provided by the negative value for Tajima's D (-1.760) in the Pacific Rim/Asian population, which just missed achieving statistical significance (0.05 < P < 0.10). Of further interest, we also found evidence for differences in genotype distribution between different ethnic groups, especially at the GPX1 locus. The relatively high Fst values of 0.2418 (Pacific Rim/Asian vs. African American/African) and of 0.2682 (Pacific Rim/Asian vs. Caucasian) suggest that there is substantial genetic differentiation between these populations. Inspection of the major GPX1 haplotypes in the Pacific Rim/Asian population reveals that the P198-containing haplotypes predominate and that the L198 variant is rarely observed. Moreover, this is consistent with reports that the L198 variant was not observed among individuals of Chinese heritage [51].

Whether the relative absence of L198 haplotypes within the Asian population is the result of a recent selective sweep, perhaps in response to an environmental or infectious exposure, cannot be determined from our data set. However, strong confirmation for a recent selective sweep involving chromosome region 3p21, which includes the GPX1 locus, is provided by analysis of SNP data from the International HapMap project [40, 41]. The strong iHS signal observed in the Asian population at this locus is one of the highest observed on chromosome 3 and is highly suggestive of recent positive selection (Figure 1) [41]. A selective sweep at the GPX1 locus may explain an earlier observation that there is significantly less variation in red cell glutathione peroxidase activity among individuals of Asian heritage compared to what is observed in Occidental populations [52]. Understanding whether functional variants of GPX1, or other genes at the 3p21 locus, confer protection or susceptibility in disease populations may provide insight into the selective pressures responsible for this recent selective sweep.

The genomic locations of several selenoprotein genes are of particular interest. For example, there is strong LD between the GPX1 P198L variant and variants in the nearby gene RHOA. Since RHOA belongs to the ras oncogene family and studies both in vitro and in vivo suggest that its overexpression may lead to cancer [53, 54], it is possible that observed associations between the L198 variant and an increased risk of cancer may in fact be due to LD between this variant and as yet unidentified variants within RHOA or another nearby gene [51]. Similarly, it is of great interest that the 3p21 genomic region also includes the gene for α-dystroglycan (DAG1), which encodes a peripheral membrane protein used as a cellular receptor for arenaviruses, the causative agents of fatal hemorrhagic fevers, and also as the Schwann cell receptor for M. leprae [55, 56]. Likewise, it is also worth noting that SEPP1 is located at chromosome position 5p13.1, close to chromosomal regions that contain the genes for the growth hormone receptor and alpha-methylacyl-CoA racemase, both of potential relevance to cancer susceptibility [57, 58]. We also note here the presence of an antisense transcript that overlaps the 3' UTR of SEPP1. Since some antisense transcripts post-transcriptionally regulate the expression of the overlapping transcript, we extended our re-sequencing at the SEPP1 locus to include the antisense transcript. Future studies utilizing these data will be able to explore whether this antisense transcript plays a role in the regulation of SEPP1.

Conclusion

Genetic variation across selenoprotein genes could be of great interest not only to association testing strategies but also to strategies to investigate the pattern of molecular evolution in a group of genes with a distinctive feature, the incorporation of the amino acid selenocysteine. The 6 genes re-sequenced in this project include some of the best characterized selenoproteins, most of which have important antioxidant properties. It is likely that additional selenoproteins also play a role in pathways relevant to cancer and disease susceptibility, such as endoplasmic reticulum stress response and inflammation [59, 60]. The potential importance of selenoproteins in a wide array of human diseases, including cancer, heart disease, aging and infections, coupled with the promise of selenium as a chemoprevention agent, warrants further investigation of the role of these and other selenoproteins in human disease. We believe that the study of selenoproteins provides a unique model system for exploring the complex interaction between genes and environmental exposures. The fine haplotype maps described in this report will be useful for exploring associations between selenoprotein variants and diseases, studying selenoprotein loss of heterozygosity in tumor samples, or for correlating selenoprotein genotypes with serum selenium levels or selenoenzyme activity in patients enrolled in clinical trials using selenium as a chemoprevention agent [61].

Methods

Population

The control population used for re-sequencing is the SNP500Cancer DNA panel, which represents a subset of the available DNA Polymorphism Discovery Resource [62]. The SNP500Cancer set consists of DNAs from 102 lymphoblastoid cell lines from 4 ethnically diverse groups: 31 Caucasian-Americans (CA), 24 African/African-Americans (AA), 24 Pacific Rim/Asian-Americans (PR), and 23 Hispanic-Americans (HI). The use of these publicly available panels for re-sequencing was deemed exempt from Institutional Review Board (IRB) approval by the Johns Hopkins University IRB, because the panels are anonymized except for information about ethnic group and gender. Genotype data and validated genotyping assays, both for selected haplotype-tagging SNPs identified through this re-sequencing project and for additional unrelated loci, are publicly available as part of the Cancer Genome Anatomy Project at the SNP500 website [63].

PCR primers and sequencing

For each selenoprotein gene, the full coding sequence and approximately 3000 bases of the 5' promoter and 3' UTR were re-sequenced. Primers for overlapping PCR products of approximately 500 bases were designed using Primer 3 (Additional File 14) [64]. Each forward primer was tagged with a universal M13 forward sequence (5'-TGTAAAACGACGGCCAGT-3') and each reverse primer was tagged with a universal M13 reverse sequence (5'-CAGGAAACAGCTATGACC-3'). The reliability of the sequencing data was ensured by sequencing in both directions, and for most high frequency SNPs the results were further confirmed by an independent genotyping method performed through the SNP500 genotyping project [63]. Primers were designed to include all exons, intron/exon borders, the 5' UTR and the 3' UTR, including SECIS elements. For some small regions, we were unable to obtain good quality sequence despite multiple attempts at primer redesign and optimization. PCR and DNA sequencing reactions were run on MJ Research Tetrad thermal cyclers. Big Dye Terminator chemistry sequencing reactions were run in either 96 well or 384 well format on ABI 3700 capillary automatic sequencers. Forward and reverse sequence traces were aligned in Sequencher 4.2 (Gene Codes, Ann Arbor, MI) and SNPs were determined by visual inspection. SNP data were placed in "prettybase" format, and summary statistics and Hardy-Weinberg equilibrium calculations were performed using software available through the Innate Immunity PGA [65]. Prettybase files, the reference sequences used to assign prettybase SNP locations, and gaps in sequence coverage are available as supplementary data for each gene (Additional Files 15, 16, 17, 18, 19, 20, 21).
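
As an illustration of the Hardy-Weinberg equilibrium check mentioned above, the following minimal Python sketch applies the textbook chi-square goodness-of-fit test to the genotype counts of one biallelic SNP. It is not the Innate Immunity PGA software, whose exact method is not described here; the function name and the example counts are hypothetical.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test of Hardy-Weinberg proportions for one biallelic SNP.

    Arguments are observed genotype counts (homozygote, heterozygote,
    homozygote). Returns (expected_counts, chi_square). The statistic has
    1 degree of freedom; values above about 3.84 correspond to p < 0.05.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of the first allele
    q = 1.0 - p                       # frequency of the second allele
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    # Skip classes with zero expectation (monomorphic sites).
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return expected, chi2


# Hypothetical counts for illustration only.
expected, chi2 = hwe_chi_square(30, 40, 30)
print(expected, chi2)
```

With groups of only 23 to 31 individuals, expected counts for rare homozygotes are small, so an exact test is often preferred over the chi-square approximation in practice.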

Mapping data

For the purpose of mapping SNPs and primer locations we used the May 2004 assembly of the human genome (Build 35, NCBI). Genomic sequences between the most 5' forward PCR primer and the most 3' reverse PCR primer were obtained using the UCSC In-Silico PCR program [66]. The location of each SNP was mapped onto the gene structure relative to the following Entrez RefSeq curated mRNA sequences (Additional Files 22, 23, 24, 25, 26, 27): GPX1 (NM_000581), GPX2 (NM_002083), GPX3 (NM_002084), GPX4 (NM_002085), TXNRD1 (NM_003330), and SEPP1 (NM_005410). SNPs 5' of the ATG are represented as a negative number relative to the first base of the start codon; SNPs 3' of the stop codon are represented as a positive number relative to the last base of the stop codon; SNPs within an intron are represented as the intron number plus the number of bases from the first base of the intron; SNPs within an exon are represented as either synonymous (SYN) or non-synonymous (NSYN), and the amino acid position is provided. In the case of GPX1, the resequencing in the 5' direction extended into the coding region of a neighboring gene, ras homolog gene family member A (RHOA). Of note, there is an uncharacterized, phylogenetically conserved transcript (BC039102) overlapping the 3' end of SEPP1 in an antisense orientation. Resequencing at the SEPP1 locus was expanded to include putative exons and the promoter region corresponding to this antisense transcript because of the possibility that overlapping transcripts might post-transcriptionally regulate each other's expression [67, 68]. TXNRD1 exhibits alternative splicing at the 5' end. Our re-sequencing corresponded to the exons of TXNRD1 transcript variant 1 (NM_003330) and also included the published promoter region, which is conserved between mouse and human [69].
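
The SNP naming convention described above can be sketched in a few lines of Python. This is an illustrative reading of the rules, not the script used in the study. The gene model is hypothetical, coordinates are assumed to be 1-based and oriented along the coding strand, and the first base of an intron is assumed to be numbered 1. Classifying an exonic SNP as SYN or NSYN requires translating the affected codon and is omitted.

```python
def label_snp(pos, exons, cds_start, cds_end):
    """Label a SNP position relative to a gene structure.

    pos        position of the SNP (1-based, coding-strand orientation)
    exons      sorted list of (start, end) exon intervals, inclusive
    cds_start  position of the first base of the start codon
    cds_end    position of the last base of the stop codon
    """
    if pos < cds_start:
        return str(pos - cds_start)          # negative: 5' of the ATG
    if pos > cds_end:
        return "+" + str(pos - cds_end)      # positive: 3' of the stop codon

    coding_bases = 0                         # coding bases in earlier exons
    for i, (start, end) in enumerate(exons):
        lo, hi = max(start, cds_start), min(end, cds_end)
        if start <= pos <= end:
            offset = coding_bases + (pos - lo)   # 0-based offset in the CDS
            return "exon %d, amino acid %d" % (i + 1, offset // 3 + 1)
        if lo <= hi:
            coding_bases += hi - lo + 1
        if i + 1 < len(exons) and end < pos < exons[i + 1][0]:
            return "intron %d +%d" % (i + 1, pos - end)
    raise ValueError("position not covered by the gene model")


# Hypothetical two-exon gene used only to exercise the function.
exons = [(101, 200), (301, 400)]
for p in (120, 151, 154, 250, 301, 351):
    print(p, label_snp(p, exons, cds_start=151, cds_end=350))
```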

Evolutionary analysis

To compare the sequence diversity between genes, the heterozygosity per nucleotide site was estimated by calculating nucleotide diversity (π) and the population mutation parameter (Θ) [70]. To determine whether the observed variation was consistent with the expectations of the neutral equilibrium model of evolution, neutrality was tested using Tajima's D statistic (DT) and Fu and Li's D and F statistics (DF and F) [37, 38]. The most probable imputed PHASED haplotypes were used as input sequence for evolutionary analyses in the software program DNASP [71]. PHASED haplotypes were estimated using the Bayesian statistical method in PHASE 2.0, run either locally or through the Innate Immunity web site [65, 72]. PHASE output was transformed into the proper DNASP input format using the perl script phasetodnasp-v2.1.pl, written and kindly provided by Eduardo Tarazona Santos (Section of Genomic Variation, Pediatric Oncology Branch, NCI, NIH, Bethesda, MD). Genomic regions for which sequence data were not available were excluded from the population genetic analyses (Additional File 21). Evidence for specific differences in genotype distribution between the various ethnic groups was explored by calculating the allele identity F-statistic (FST) for all population pairs using GENEPOP on the Web, developed from the Genepop DOS versions 3.3/3.4 [73]. FST is the proportion of the total genetic variance contained in a subpopulation (s) relative to the total genetic variance (t). Values can range from 0 to 1, and a high FST implies a considerable degree of differentiation among populations. GENEPOP is a population genetics software package originally designed by Michel Raymond and Francois Rousset at the Laboratoire de Génétique et Environnement, Montpellier, France. Transformation of data from prettybase format to GENEPOP format was facilitated by the perl script report_prettybase.pl, written by Fares Z. Najar (revised by James D. White) at the Advanced Center for Genome Technology, University of Oklahoma.
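
For readers unfamiliar with these summary statistics, the sketch below computes nucleotide diversity (π), Watterson's estimator of Θ, and Tajima's D from a set of aligned, phased haplotypes using the standard published formulas. It is an illustration only and is not the DNASP implementation; the function name and the toy haplotypes are hypothetical, and gaps or missing data are not handled.

```python
from itertools import combinations
from math import sqrt


def diversity_stats(haplotypes):
    """Return pi, Watterson's theta (both per site) and Tajima's D."""
    n = len(haplotypes)
    length = len(haplotypes[0])
    if n < 2 or any(len(h) != length for h in haplotypes):
        raise ValueError("need at least 2 aligned haplotypes of equal length")

    # S: number of segregating (polymorphic) sites.
    seg = sum(1 for column in zip(*haplotypes) if len(set(column)) > 1)

    # k: mean number of pairwise differences between haplotypes.
    pairs = list(combinations(haplotypes, 2))
    k = sum(sum(a != b for a, b in zip(h1, h2)) for h1, h2 in pairs) / len(pairs)

    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))

    tajima_d = None  # undefined when there are no segregating sites
    if seg > 0:
        b1 = (n + 1) / (3.0 * (n - 1))
        b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
        c1 = b1 - 1.0 / a1
        c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
        e1 = c1 / a1
        e2 = c2 / (a1 ** 2 + a2)
        tajima_d = (k - seg / a1) / sqrt(e1 * seg + e2 * seg * (seg - 1))

    return {"S": seg, "k": k, "pi": k / length,
            "theta_w": seg / a1 / length, "tajima_d": tajima_d}


# Toy haplotypes for illustration only.
print(diversity_stats(["ACGT", "ACGA", "ACTT", "ACTT"]))
```

A negative D indicates an excess of rare variants relative to neutral equilibrium expectations, and a positive D indicates an excess of intermediate-frequency variants.
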
To confirm recent positive selection at the GPX1 locus, we used Haplotter to query the results of a scan for positive selection in the human genome developed using SNP data from the International HapMap project [40, 41, 74]. The integrated haplotype score (iHS) is a recently developed test for detecting recent positive selection, developed by the Pritchard laboratory and based on the extended haplotype homozygosity (EHH) statistic proposed by Sabeti et al. [75].
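
The EHH statistic that underlies iHS can be illustrated with a short Python sketch. EHH is the probability that two randomly chosen chromosomes carrying a given core allele are identical across all markers from the core site out to a chosen distance. The code below is a simplified illustration with hypothetical names and toy data. It is not the Haplotter implementation, and it does not compute iHS itself, which additionally integrates EHH decay for the ancestral and derived alleles and standardizes the log ratio.

```python
from collections import Counter
from math import comb


def ehh(haplotypes, core_index, core_allele, end_index):
    """Extended haplotype homozygosity from the core site to end_index.

    haplotypes   list of equal-length phased haplotype strings
    core_index   0-based index of the core SNP
    core_allele  allele at the core SNP defining the carrier chromosomes
    end_index    0-based index of the most distant marker to include
    """
    carriers = [h for h in haplotypes if h[core_index] == core_allele]
    n = len(carriers)
    if n < 2:
        raise ValueError("need at least 2 chromosomes carrying the core allele")
    lo, hi = sorted((core_index, end_index))
    # Group carriers by their extended haplotype over the interval.
    groups = Counter(h[lo:hi + 1] for h in carriers)
    # Fraction of carrier pairs that are identical over the interval.
    return sum(comb(c, 2) for c in groups.values()) / comb(n, 2)


# Toy data: three chromosomes carry allele "A" at the core site (index 0).
haps = ["AAT", "AAT", "AAC", "GCC"]
print(ehh(haps, 0, "A", 1), ehh(haps, 0, "A", 2))
```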

Haplotype structure and patterns of Linkage Disequilibrium (LD)

For each gene, the most probable PHASED haplotype was determined, as described above, using only those SNPs with a minor allele frequency ≥0.05. Using the Chimp BLAT Search at the UCSC Genome Bioinformatics Site, we aligned the human locus of interest and the corresponding locus from the chimp genome (Nov. 2003 assembly) to infer a chimp haplotype [66]. LD (D') between pairs of variants (minor allele frequency ≥0.05) was computed using the software program Haploview 3.2, with the most probable PHASED haplotypes as the input sequence. Haplotype blocks were created in Haploview using the algorithm of Gabriel et al. [76]: 95% confidence bounds on D' were generated, each comparison was called "strong LD", "inconclusive" or "strong recombination", and a block was created if 95% of informative comparisons were "strong LD". To identify a set of htSNPs for each gene, we used Haploview's Tagger feature with the following default settings: pairwise tagging only, r² threshold of 0.8 and LOD threshold for multi-marker tests of 3.0. Of note, htSNPs are selected on a block-by-block basis; therefore, the final set of htSNPs is not necessarily the most parsimonious one for the entire data set, but it is more likely to capture variation in a new, larger data set that was not observed in the initial data set. Non-synonymous SNPs and SECIS region SNPs with a minor allele frequency ≥0.05 were force-included as tag SNPs.
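
The two pairwise LD measures referred to above, D' and r², can be computed directly from phased haplotypes. The Python sketch below uses the standard definitions for a pair of biallelic SNPs. It is illustrative only, with a hypothetical function name and toy haplotypes, and it omits the confidence bounds on D' that Haploview uses to define blocks.

```python
def pairwise_ld(haplotypes, i, j):
    """Return (D', r2) for biallelic sites i and j of phased haplotypes."""
    n = len(haplotypes)
    a = haplotypes[0][i]          # arbitrary reference allele at site i
    b = haplotypes[0][j]          # arbitrary reference allele at site j
    p_a = sum(h[i] == a for h in haplotypes) / n
    p_b = sum(h[j] == b for h in haplotypes) / n
    p_ab = sum(h[i] == a and h[j] == b for h in haplotypes) / n
    if p_a in (0.0, 1.0) or p_b in (0.0, 1.0):
        raise ValueError("both sites must be polymorphic")

    d = p_ab - p_a * p_b          # raw disequilibrium coefficient D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d_prime, r2


# Toy haplotypes: only three of the four possible two-site haplotypes occur,
# so D' is 1 even though r2 is well below the 0.8 tagging threshold.
print(pairwise_ld(["AC", "AC", "AC", "GT", "GC"], 0, 1))
```

The toy example shows why tagging uses r² rather than D': a pair of SNPs can show no evidence of historical recombination (D' of 1) while one SNP remains a poor proxy for the other.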