Abstract
We studied several methods for selecting single-nucleotide polymorphisms (SNPs) in a disease association study. The two major categories of analytical strategy are the univariate and the set selection approaches. The univariate approach evaluates each SNP marker one at a time, while the set selection approach tests disease association of a set of SNP markers simultaneously. We examined various test statistics that can be utilized in testing disease association and also reviewed several multiple testing procedures that can properly control the familywise error rate when the univariate approach is applied to multiple markers. The set association methods were then briefly reviewed. Finally, we applied these methods to data from the Collaborative Study on the Genetics of Alcoholism (COGA).
Background
Due to the abundance and utility of single-nucleotide polymorphism (SNP) markers in the fine-mapping of complex traits, a growing amount of current genetic research focuses on the analysis of SNP data. Such analyses typically involve association, in which differences in allele or genotype frequencies of SNPs near or within candidate genes between affected and unaffected individuals are tested. To localize disease susceptibility genes (loci), thousands of SNPs are usually investigated and the main question is how to identify disease-associated SNP markers among a large pool.
A simple approach that is commonly used is to evaluate one SNP at a time. In this analytical strategy, each SNP is tested with an appropriate testing procedure, such as Pearson's chi-square test or the Cochran-Armitage (CA) trend test, and those SNP markers with a significant disease association are identified. Current technology, however, can genotype on the order of 100,000 SNPs at a time. Even with a preliminary genome scan, such as linkage analysis, which can restrict the chromosomal region to reduce the number of SNPs for investigation, a large number of SNPs are often tested simultaneously. Therefore, investigators are at great risk of false-positive findings. Various methods for marker selection with consideration of multiple comparisons are available. Dudoit et al. [1, 2] summarized a number of procedures that control different type I error rates, such as the familywise error rate (FWER) and the false discovery rate (FDR) [3].
For a complex trait, however, several markers, each with a rather small effect, might act together to contribute to disease susceptibility. In this case, marker-by-marker approaches often fail to find significance. Recently, several investigators incorporated the multigenic nature of complex traits in selecting SNPs for association [4, 5]. One promising approach has been proposed by Hoh et al. [6], which performs a simultaneous significance test on a set of possibly interacting SNP markers while controlling the genome-wide significance level via permutation procedures.
In this study, we describe different strategies for selecting SNPs in a disease association study and apply them to the Collaborative Study on the Genetics of Alcoholism (COGA) data.
Methods
Measures for disease association
Allelic association and Hardy-Weinberg disequilibrium
To measure the extent of the association for a given SNP, Hoh et al. [6] proposed a statistic that combines several sources of information, such as allelic association (AA) and Hardy-Weinberg disequilibrium (HWD). In a 2 × 2 table with rows corresponding to cases and controls, and columns corresponding to SNP alleles, the χ^{2} statistic can be utilized as a measure of AA. HWD can also be computed using χ^{2} for deviation from Hardy-Weinberg equilibrium based on the affected individuals only. Let a_{i} and u_{i} be the AA statistic and HWD for association of the i^{th} SNP, respectively. The product of these two statistics, a_{i} × u_{i}, is used to measure the effects of AA and HWD for association. We denote this test statistic as AA × HWD. Hoh et al. [6] used trimming for markers with extremely high values of HWD. They first find the number d of largest HWD values (for example, using the 99^{th} percentile of the χ^{2} distribution) based on control individuals, and these d HWD values are set to zero in the further analysis.
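As a rough illustration of the AA × HWD statistic described above, the following Python sketch computes the two χ^{2} components from genotype counts and multiplies them. The function names and the (n_aa, n_ab, n_bb) count layout are assumptions made for this example only, and the trimming of extreme HWD values is not included.

```python
def allelic_chi2(case_geno, ctrl_geno):
    """Pearson chi-square on the 2 x 2 table of allele counts (cases vs controls).

    Each argument is a hypothetical (n_aa, n_ab, n_bb) tuple of genotype counts.
    """
    def allele_counts(geno):
        n_aa, n_ab, n_bb = geno
        return (2 * n_aa + n_ab, n_ab + 2 * n_bb)

    table = [allele_counts(case_geno), allele_counts(ctrl_geno)]
    row = [sum(r) for r in table]
    col = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / total
            if expected > 0:
                chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2


def hwd_chi2(geno):
    """Chi-square for deviation from Hardy-Weinberg proportions in one group."""
    n_aa, n_ab, n_bb = geno
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # estimated frequency of allele a
    expected = [n * p * p, 2 * n * p * (1 - p), n * (1 - p) ** 2]
    return sum((o - e) ** 2 / e for o, e in zip(geno, expected) if e > 0)


def aa_times_hwd(case_geno, ctrl_geno):
    """Product a_i x u_i, with HWD computed from the affected individuals only."""
    return allelic_chi2(case_geno, ctrl_geno) * hwd_chi2(case_geno)
```

With made-up counts, `aa_times_hwd((10, 10, 0), (0, 10, 10))` multiplies an allelic χ^{2} of 20 by an HWD χ^{2} of 20/9. Because the product does not follow a standard reference distribution, its p-value would have to come from permutation, as in the analysis reported here.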
Robust linear trend tests: MERT and MAX
Two robust tests, the maximin efficiency robust test (MERT) and the maximal test (MAX), are useful in detecting disease-associated markers when the underlying genetic model is unknown. Suppose we have a family of optimal test statistics {Z_{i} : i ∈ Λ}, where Λ = {1, 2, ..., k} is an index of k underlying models. For example, using the CA trend test, Z_{x}, x = 0, 1/2, 1, are optimal test statistics for the recessive, additive, and dominant models, respectively [7]. Assume that under the null hypothesis, each Z_{i} asymptotically follows a standard normal distribution and that their correlation matrix under the null hypothesis of no disease association is given by {ρ_{i,j}}. Closed forms of the test statistics and correlations for the CA trend test in case-control studies can be found in Freidlin et al. [8]. From Gastwirth [9], MERT can be written as a linear combination of the two tests with the minimum correlation. Suppose that the minimum correlation, ρ_{0}, is reached at the two tests Z_{i_{1}} and Z_{i_{2}}, i_{1}, i_{2} ∈ Λ. Then MERT is the linear combination of the extreme pair given by
Z_{MERT} = (Z_{i_{1}} + Z_{i_{2}}) / [2(1 + ρ_{0})]^{1/2},
which asymptotically follows a standard normal distribution under the null hypothesis.
When the minimum correlation ρ_{0} is small, MERT may not be powerful. Freidlin et al. [10] suggested the use of a maximal statistic (MAX) when ρ_{0} < 0.50 and showed that MAX and MERT have similar power when ρ_{0} ≥ 0.75. Several versions of MAX tests are possible, but here we focus on Z_{MAX} = max(Z_{i_{1}}, Z_{MERT}, Z_{i_{2}}) for a one-sided test and Z_{MAX} = max(|Z_{i_{1}}|, |Z_{MERT}|, |Z_{i_{2}}|) for a two-sided test.
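The robust statistics can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation, and it rests on several assumptions:

- Genotype counts are ordered by the number of risk alleles (0, 1, 2).
- The extreme pair is the recessive and dominant CA trend tests.
- MERT takes the standard form (Z_{i_{1}} + Z_{i_{2}}) / [2(1 + ρ_{0})]^{1/2}.
- The null correlation of the extreme pair is computed from the pooled genotype frequencies.

All function names are hypothetical.

```python
import math


def ca_trend_z(cases, controls, x):
    """Cochran-Armitage trend statistic with genotype scores (0, x, 1).

    x = 0, 1/2, 1 correspond to the recessive, additive, and dominant models.
    Raises ZeroDivisionError if the scored genotypes show no variation.
    """
    scores = (0.0, x, 1.0)
    r, s = sum(cases), sum(controls)
    n = r + s
    totals = [c + d for c, d in zip(cases, controls)]
    # Score-weighted excess of case counts over their expectation under the null.
    u = sum(w * (c - t * r / n) for w, c, t in zip(scores, cases, totals))
    mean = sum(w * t for w, t in zip(scores, totals)) / n
    var_score = sum(w * w * t for w, t in zip(scores, totals)) / n - mean ** 2
    return u / math.sqrt(r * s / n * var_score)


def rho_recessive_dominant(cases, controls):
    """Null correlation between the recessive and dominant trend tests."""
    totals = [c + d for c, d in zip(cases, controls)]
    n = sum(totals)
    p0, p2 = totals[0] / n, totals[2] / n
    return math.sqrt(p0 * p2 / ((1 - p0) * (1 - p2)))


def mert_and_max(cases, controls, two_sided=True):
    """Return (Z_MERT, Z_MAX, rho_0) built from the recessive/dominant pair."""
    z_rec = ca_trend_z(cases, controls, 0.0)
    z_dom = ca_trend_z(cases, controls, 1.0)
    rho0 = rho_recessive_dominant(cases, controls)
    z_mert = (z_rec + z_dom) / math.sqrt(2 * (1 + rho0))
    trio = (z_rec, z_mert, z_dom)
    z_max = max(abs(z) for z in trio) if two_sided else max(trio)
    return z_mert, z_max, rho0
```

Z_{MERT} can be referred to the standard normal distribution, whereas Z_{MAX} has no simple null distribution, so its p-value would typically be approximated by simulation or permutation.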
Multiple testing
Dudoit et al. [1] provided multiple testing procedures which strongly control the FWER for gene expression data and which are directly applicable to disease association data with multiple markers. The Bonferroni single-step adjusted p-value is a well-known procedure for dealing with multiple testing. While it is easy to calculate, this method is extremely conservative. An improvement in power can be achieved by stepwise procedures such as Holm's procedure. To take into account the dependence structure between test statistics, Westfall and Young's [11] step-down minP or step-down maxT adjusted p-values are useful. Since the joint distribution of the test statistics is usually unknown, resampling methods can be used to estimate these adjusted p-values.
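To make the three kinds of adjustment concrete, here is a small Python sketch of Bonferroni, Holm, and step-down maxT adjusted p-values. It is an illustration under assumptions, not code from the study. In particular, `perm_stats` is a hypothetical list of rows, each holding the test statistics recomputed after one permutation of the case/control labels, and larger statistics are assumed to indicate stronger evidence of association.

```python
def bonferroni(pvalues):
    """Single-step Bonferroni adjusted p-values."""
    m = len(pvalues)
    return [min(1.0, m * p) for p in pvalues]


def holm(pvalues):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The multiplier shrinks from m to 1; the running max keeps monotonicity.
        running_max = max(running_max, min(1.0, (m - rank) * pvalues[i]))
        adjusted[i] = running_max
    return adjusted


def stepdown_maxt(observed, perm_stats):
    """Step-down maxT adjusted p-values estimated from permutation statistics."""
    m = len(observed)
    order = sorted(range(m), key=lambda i: observed[i], reverse=True)
    counts = [0] * m
    for row in perm_stats:
        successive_max = float("-inf")
        # Walk from the least to the most significant observed statistic.
        for rank in range(m - 1, -1, -1):
            successive_max = max(successive_max, row[order[rank]])
            if successive_max >= observed[order[rank]]:
                counts[rank] += 1
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, counts[rank] / len(perm_stats))
        adjusted[i] = running_max
    return adjusted
```

For the unadjusted p-values 0.01, 0.04, 0.03, and 0.005, Bonferroni gives 0.04, 0.16, 0.12, and 0.02, while Holm gives 0.03, 0.06, 0.06, and 0.02, which shows how the step-down procedure is never more conservative than the single-step one.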
Set association approach
Hoh et al. [6] provided a method that tests the disease association of a set of markers instead of testing each SNP separately. In their method, the sum of test statistics over a suitable set of markers is first formed to combine the evidence for association. Permutation procedures are then used to evaluate the p-values associated with each sum and the overall type I error. The following summarizes the set association approach of Hoh et al. [6].
1) Order test statistics t_{ i }, i = 1, ..., m, so that t_{(1)} ≥ t_{(2)} ≥ ... ≥ t_{(m)}.
2) For a fixed N ≤ m, take sums with an increasing number of terms, starting with the most significant markers, such that S(n = 1) = t_{(1)}, S(n = 2) = t_{(1)} + t_{(2)}, ..., S(n = N) = t_{(1)} + ... + t_{(N)}.
3) Generate the permutation samples from the original sample (permuting labels of cases and controls) under the null hypothesis of no association and evaluate the p-value of each sum. Take the minimum p-value (minP).
4) Generate other permutation samples from the original sample under the null hypothesis of no association. To obtain the p-value corresponding to each permutation sample, repeat the above 3 steps by regarding each permutation sample as the original.
5) Evaluate the overall significance level of (minP).
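The five steps above can be sketched in Python as follows. To stay short, the sketch reuses one set of permutations both for the p-values of the sums and for the overall significance of minP. This is a computational shortcut assumed here and is not the nested permutation scheme of steps 3 and 4. The name `perm_stats` is hypothetical: it stands for a list of rows, each containing the m marker statistics recomputed after one shuffle of the case/control labels.

```python
def partial_sums(stats, max_terms):
    """Sums S(1), ..., S(N) of the N largest statistics, largest first."""
    ordered = sorted(stats, reverse=True)[:max_terms]
    sums, running = [], 0.0
    for t in ordered:
        running += t
        sums.append(running)
    return sums


def set_association(observed, perm_stats, max_terms):
    """Return (best n, minP, overall p-value of minP).

    observed   -- test statistics t_1, ..., t_m from the original sample
    perm_stats -- one row of m statistics per label permutation
    max_terms  -- the fixed N <= m of step 2
    """
    obs_sums = partial_sums(observed, max_terms)
    perm_sums = [partial_sums(row, max_terms) for row in perm_stats]
    n_perm = len(perm_sums)

    def sum_pvalues(sums):
        # Share of permutation sums at least as large as the given sum, per n.
        return [
            sum(ps[k] >= sums[k] for ps in perm_sums) / n_perm
            for k in range(max_terms)
        ]

    obs_p = sum_pvalues(obs_sums)
    min_p = min(obs_p)
    best_n = obs_p.index(min_p) + 1  # first n that attains the minimum
    # Treat each permutation sample as if it were the original sample.
    perm_min_p = [min(sum_pvalues(ps)) for ps in perm_sums]
    overall = sum(pm <= min_p for pm in perm_min_p) / n_perm
    return best_n, min_p, overall
```

Each marker is re-ranked within every permutation sample, so the sums compare the largest statistics with the largest statistics regardless of which markers produced them. With a small number of permutations the estimated p-values are coarse and can equal zero, so a large number of replicates would be needed in practice.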
Study subjects and genetic markers
The COGA data provide alcoholism diagnosis on 1,614 individuals from 143 families. We focus on two categories for the alcoholism diagnosis (aldx1), "affected" as a case and "purely unaffected" as a control, and we used all 609 cases and 261 controls whose SNP data were available. From the preliminary genome scan by linkage analysis (Lin and Wu [12]), one candidate gene cluster, alcohol dehydrogenase, on chromosome 4 was identified. Alcohol dehydrogenase catalyzes the rate-determining reaction in ethanol metabolism. Genetic studies of diverse ethnic groups have firmly demonstrated significant allelic associations between alcohol dehydrogenase genes and alcoholism. Therefore, we restrict our analysis to SNPs located near this gene cluster. Because the SNPs are evenly distributed across the entire genome rather than densely genotyped near any genes, we found only two SNPs (rs749407, rs980972) within the cluster, and we selected two additional SNPs (rs1037475, rs1491233) flanking each side from the Illumina SNP data.
Results
Table 1 presents the results from the univariate method for testing association using four test statistics, χ^{2}, AA × HWD, MERT, and MAX. The unadjusted p-values for AA × HWD were obtained via permutation with 20,000 replicates and the p-values for MAX were calculated based on 20,000 simulations. In Hoh et al. [6], unusually large HWD values were trimmed based on HWD in control individuals. Because we did not find any SNP markers whose HWD value was larger than their suggested cutoff value (the 99^{th} percentile of a χ^{2} distribution with 1 degree of freedom), we did not need trimming in our analysis. The disease association of rs1037475 is significant based on most of the test statistics after correction for multiple testing. The smallest correlations between linear trend tests for recessive and dominant models for all four SNP markers were less than 0.4, and therefore MAX may be more efficient than MERT [10]. As expected, Westfall and Young's step-down method is less conservative than Holm's method, which in turn is less conservative than the Bonferroni correction. One exception is found when we used AA × HWD. We found that even though rs1037475 has the maximum observed test statistic (19.685), other markers have a larger chance of having a test statistic greater than 19.685 in the permutation samples. We do not know why this happened, but it shows that the test statistic AA × HWD is rather unstable in the permutation procedure. The SNP marker rs1037475 shows a significant disease association using the χ^{2} and MAX tests. The other three markers failed to show a significant association.
Figure 1 summarizes the result from the set association approach. Because there were only four markers under investigation, we considered the sum of test statistics up to all four SNP markers. We performed 20,000 permutations to obtain the corresponding p-values for each of 10,000 permutation samples. The order of SNP markers included in the sum statistics based on the univariate test statistics is rs1037475, rs980972, rs1491233, rs749407, except for MERT, where rs1491233 and rs749407 are switched. Using χ^{2}, MERT, and MAX, the smallest p-value is reached at S(n = 2), which is the sum statistic of rs1037475 and rs980972. For AA × HWD, the smallest p-value is obtained at S(n = 1). The overall significance levels of these smallest p-values (adjusted for multiplicity) are 0.0396, 0.0097, 0.0839, and 0.0225 for χ^{2}, AA × HWD, MERT, and MAX, respectively. Only MERT failed to reach the global significance level. Using univariate analyses, rs980972 has a rather negligible effect. However, the effect of rs980972 combined with rs1037475 became significant using the set association approach.
We carried out an additional analysis on a total of 8 SNPs in the nearest area, including the above four SNPs. Using the univariate method with Bonferroni and Holm's methods, only AA × HWD found rs1037475 to be significant. None of the methods found significant markers based on Westfall and Young's method. In the set association approach, the smallest p-values were reached at S(n = 1) using χ^{2} and AA × HWD, and at S(n = 2) using MERT and MAX, where S(n = 1) corresponds to rs1037475 and S(n = 2) is the sum of rs1037475 and rs980972. The overall significance levels of these smallest p-values were 0.094, 0.022, 0.226, and 0.074, respectively. Again, only AA × HWD reached the overall significance at α = 0.05. When we included more SNPs in the analysis (a total of 28), none of the methods found significant markers. By adding SNPs which may not be in linkage disequilibrium with the mutation, the method became extremely conservative.
Conclusion
In this paper, we studied different strategies to select disease-associated SNP markers when multiple markers are tested. Various test statistics can be utilized to measure the degree of individual association, and using these statistics, the univariate approach combined with an appropriate correction for multiple testing can identify significant markers. However, if several markers are acting together to contribute to the susceptibility of the disease, the set association approach may be useful. In the application to the COGA data, we observed different results using the univariate and set association approaches; that is, a SNP marker with a rather negligible effect using the univariate approach is picked up by the set association approach. An added advantage of the set association methods is their ability to detect interacting loci, though we do not investigate that property here. For a rigorous comparison of the performance of the different approaches, further investigation with simulated data would be necessary.
We used only four SNPs in our analysis. In principle, these procedures can also be applied to testing thousands of SNPs, as in a genome-wide association study. However, for testing a very large number of SNPs, these procedures can be extremely conservative and computationally intensive. As we include more SNPs in the analysis, the methods tend to become very conservative and fail to find any significance. Reducing the number of tests by restricting areas of investigation is one common approach to address the multiple testing problems in genome-wide association studies, and the methods described here may be optimal with the reduced data. To take full advantage of the abundant information from a genome-wide SNP map, alternative approaches such as a method for controlling FDR and a sequential type analysis [13] are possible.
The choice of test statistics has a great impact on the testing results. The CA trend test is usually preferable to the χ^{2} test [14, 15], and two robust tests, MERT and MAX, provide protection against model misspecification [7, 8]. AA × HWD [6] showed quite consistent results using different numbers of SNPs in the analysis. However, its performance was unstable in the permutation procedure. The properties of these test statistics under a variety of genetic models may need further investigation.
The case-control dataset used in this study is a family dataset in which cases and controls could be biologically correlated. The positive correlation between family members inflates the variance of the test statistics. Therefore, without considering this factor, inflation in type I error rates may result. In one of our studies using the same dataset [16], we applied the method of Slager and Schaid [17] with modification, in which the correlations of related individuals are incorporated into the CA trend test. While adjusting for the correlations is desirable, we found that the variance inflation is rather minor, and thus in this study, we ignored family structure. The test statistics which incorporate the correlations between family members can also be utilized in the univariate and set association approaches described in this study.
Abbreviations
AA: Allelic association
CA: Cochran-Armitage
COGA: Collaborative Study on the Genetics of Alcoholism
FDR: False discovery rate
FWER: Familywise error rate
HWD: Hardy-Weinberg disequilibrium
MAX: Maximal test
MERT: Maximin efficiency robust test
SNP: Single-nucleotide polymorphism
References
1. Dudoit S, Yang YW, Callow MJ, Speed TP: Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Stat Sinica. 2002, 12: 111-139.
2. Dudoit S, Shaffer JP, Boldrick JC: Multiple hypothesis testing in microarray experiments. Stat Sci. 2003, 18: 71-103. 10.1214/ss/1056397487.
3. Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc B Met. 1995, 57: 289-300.
4. Stoesz MR, Cohen JC, Marcovina S, Guerra R: Extension of the Haseman-Elston method to multiple alleles and multiple loci: theory and practice for candidate genes. Ann Hum Genet. 1997, 61: 263-274. 10.1017/S0003480097006179.
5. Nelson MR, Kardia SLR, Ferrell RE, Sing CF: A combinatorial partitioning method to identify multilocus genotypic partitions that predict quantitative trait variation. Genome Res. 2001, 11: 458-470. 10.1101/gr.172901.
6. Hoh J, Wille A, Ott J: Trimming, weighting, and grouping SNPs in human case-control association studies. Genome Res. 2001, 11: 2115-2119. 10.1101/gr.204001.
7. Zheng G, Freidlin B, Li Z, Gastwirth JL: Choice of scores in trend tests for case-control studies of candidate-gene associations. Biometrical J. 2003, 45: 335-348. 10.1002/bimj.200390016.
8. Freidlin B, Zheng G, Li Z, Gastwirth JL: Trend tests for case-control studies of genetic markers: power, sample size and robustness. Hum Hered. 2002, 53: 146-152. 10.1159/000064976.
9. Gastwirth JL: The use of maximin efficiency robust tests in combining contingency tables and survival analysis. J Am Stat Assoc. 1985, 80: 380-384. 10.2307/2287901.
10. Freidlin B, Podgor MJ, Gastwirth JL: Efficiency robust tests for survival or ordered categorical data. Biometrics. 1999, 55: 883-886. 10.1111/j.0006-341X.1999.00264.x.
11. Westfall PH, Young SS: Resampling-based Multiple Testing. 1993, New York: John Wiley & Sons.
12. Lin JP, Wu C: Bivariate genome scans incorporating factor and principal component analyses to identify common genetic components of alcoholism, event-related potential, and electroencephalogram phenotypes. BMC Genet. 2005, 6 (Suppl 1): S114. 10.1186/1471-2156-6-S1-S114.
13. Province M: A single, sequential, genome-wide test to identify simultaneously all promising areas in a linkage scan. Genet Epidemiol. 2000, 19: 301-322. 10.1002/1098-2272(200012)19:4<301::AID-GEPI3>3.0.CO;2-G.
14. Sasieni PD: From genotypes to genes: doubling the sample size. Biometrics. 1997, 53: 1253-1261. 10.2307/2533494.
15. Slager SL, Schaid DJ: Case-control studies of genetic markers: power and sample size approximations for Armitage's test for trend. Hum Hered. 2001, 52: 149-153. 10.1159/000053370.
16. Tian X, Joo J, Zheng G, Lin JP: Robust trend tests for association in case-control studies using family data. BMC Genet. 2005, 6 (Suppl 1): S107. 10.1186/1471-2156-6-S1-S107.
17. Slager SL, Schaid DJ: Evaluation of candidate genes in case-control studies: a statistical method to account for related subjects. Am J Hum Genet. 2001, 68: 1457-1462. 10.1086/320608.
Authors' contributions
JJ was involved in the design of the study, performed statistical analysis and interpretation of data, and drafted the manuscript. XT, GZ, JPL, and NLG were involved in the design of the study, statistical analysis, interpretation of data, and revising the manuscript. All authors read and approved the final manuscript.
Rights and permissions
Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Joo, J., Tian, X., Zheng, G. et al. Selection of single-nucleotide polymorphisms in disease association data. BMC Genet 6 (Suppl 1), S93 (2005). https://doi.org/10.1186/1471-2156-6-S1-S93