Background

One limitation of genome-wide association studies is that population stratification can act as a confounder. Population stratification occurs when allele frequencies differ systematically between case subjects and control subjects because of ancestry. It arises mainly from the demographic history of a population, natural selection, and random fluctuations resulting from admixture. If not taken into account, population stratification can cause false-positive and/or false-negative findings [1] and can produce spurious associations [2]. Principal components analysis, which infers genetic ancestry from the genotype data, can be used to correct for population stratification [3]. In this paper we examine the statistical properties of association analyses that adjust for principal components (PCs) computed across the whole genome. Another approach is to use local PC adjustment [4], but the Genetic Analysis Workshop 17 (GAW17) genotype data are not sufficiently extensive to consider this strategy.

The GAW17 data set is composed of mini-exome simulated data using 697 unrelated subjects from the 1000 Genomes Project. The quantitative phenotypes Q1 and Q2 are generated as normally distributed phenotypes. We document the p-value of the test of the coefficient of a genotype with and without adjusting for population stratification in selected genes known not to cause the phenotypes Q1 and Q2. We compare the power of the regression coefficient test when using PCs for ancestry adjustment with the power when using the seven populations given as ancestry controls for all the genes known to cause phenotypes Q1 and Q2. We study two types of phenotype, quantitative and dichotomized, test all the single-nucleotide polymorphisms (SNPs) that cause Q1 and Q2, and examine selected noncausal SNPs for these two traits.

Methods

A SNP that causes a trait is one that is specified in the function used to simulate the trait [5, 6]. Any other SNP is called noncausal. SNPs on chromosomes 12, 21, and 22 are used as SNPs not causing Q1. SNPs on chromosomes 21 and 22 are used as SNPs not causing Q2. Table 1 lists the distribution of the minor allele frequencies (MAFs) of the SNPs in the genes studied.

Table 1 Distribution of minor allele frequencies of SNPs in the genes studied

We dichotomize the quantitative measures Q1 and Q2 so that, within each of the 200 replicates, the top 25% of values are scored as affected (1) and the rest as unaffected (0). The independent variables in these analyses are selected from the number of minor alleles in the ith SNP genotype (SNP_i), the participant’s age (Age) and smoking status (Smoking), six indicator variables of the populations (POP1, …, POP6), and the 10 ancestry adjustment PC scores (GPC1, …, GPC10). We use the FamCC software [7] to calculate these 10 PCs. All 24,487 SNPs are used in the calculations.
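The dichotomization step can be illustrated with a minimal sketch. The function name, the example trait values, and the rounding rule (taking the floor of 25% of the sample size, with ties broken by input order) are assumptions made for illustration; the source does not specify how ties or fractional counts were handled.

```python
def dichotomize_top_quartile(values, top_fraction=0.25):
    """Score the highest `top_fraction` of values as 1 (affected), the rest as 0."""
    # Assumption: the number of affected subjects is floor(n * top_fraction).
    n_affected = int(len(values) * top_fraction)
    # Subject indices ordered from the largest to the smallest trait value.
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    affected = set(order[:n_affected])
    return [1 if i in affected else 0 for i in range(len(values))]


# Hypothetical trait values for one replicate of eight subjects.
q1_replicate = [0.3, -1.2, 2.1, 0.8, -0.5, 1.7, 0.0, -2.0]
status = dichotomize_top_quartile(q1_replicate)
print(status)  # [0, 0, 1, 0, 0, 1, 0, 0]
```

In the setting described here, the same function would be applied separately to each of the 200 replicates, so that the threshold is replicate specific.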

We use the PLINK software [8] to fit three logistic regression models to assess the association between each SNP in the genes studied and the dichotomized phenotype. The ith SNP is considered associated with the phenotype when the permutation p-value of the coefficient of SNP_i reported in the PLINK logistic regression analysis is less than 0.05. Because Q1 is affected by age and smoking, the models considered are the following: (1) the SNP model, in which each SNP is adjusted for age and smoking; (2) the population adjustment model, in which each SNP is adjusted for the populations, age, and smoking; and (3) the PC adjustment model, in which each SNP is adjusted for age, smoking, and ancestry adjustment PCs. The models are defined as follows:

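SNP model:

$$\operatorname{logit}\,\Pr(\text{affected}) = \beta_0 + \beta_1\,\mathrm{SNP}_i + \beta_2\,\mathrm{Age} + \beta_3\,\mathrm{Smoking} \tag{1}$$

Population adjustment model:

$$\operatorname{logit}\,\Pr(\text{affected}) = \beta_0 + \beta_1\,\mathrm{SNP}_i + \beta_2\,\mathrm{Age} + \beta_3\,\mathrm{Smoking} + \sum_{j=1}^{6} \gamma_j\,\mathrm{POP}_j \tag{2}$$

PC adjustment model:

$$\operatorname{logit}\,\Pr(\text{affected}) = \beta_0 + \beta_1\,\mathrm{SNP}_i + \beta_2\,\mathrm{Age} + \beta_3\,\mathrm{Smoking} + \sum_{k=1}^{10} \delta_k\,\mathrm{GPC}_k \tag{3}$$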
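To illustrate what a permutation p-value for a SNP coefficient measures, the following is a minimal sketch of a generic label-shuffling test. It is not PLINK's permutation procedure: it uses a simple least-squares slope with no covariates, and the function names, the number of permutations, and the example data are all assumptions chosen for illustration.

```python
import random


def slope(x, y):
    """Least-squares slope of y on x (x must not be constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def permutation_p_value(genotype, phenotype, n_perm=999, seed=0):
    """Two-sided permutation p-value for the slope of phenotype on genotype."""
    rng = random.Random(seed)
    observed = abs(slope(genotype, phenotype))
    shuffled = list(phenotype)
    hits = 0
    for _ in range(n_perm):
        # Shuffling the phenotype breaks any genotype-phenotype association.
        rng.shuffle(shuffled)
        if abs(slope(genotype, shuffled)) >= observed:
            hits += 1
    # The +1 terms count the observed data as one of the permutations.
    return (hits + 1) / (n_perm + 1)


# Hypothetical minor-allele counts and trait values for 12 subjects.
genotype = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
phenotype = [0.1, 0.2, 0.0, 0.3, 1.1, 0.9, 1.2, 1.0, 2.1, 1.9, 2.2, 2.0]
p_value = permutation_p_value(genotype, phenotype)
print(p_value < 0.05)
```

With covariates such as Age and Smoking in the model, a permutation scheme has to account for them as well, which this sketch deliberately leaves out.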

For the population adjustment model, only six indicators are needed to represent seven populations. The Luhya population is the reference population for the dichotomized phenotype, and the CEU population (European-descended residents of Utah) is the reference population for the quantitative phenotype. Because Q2 is not associated with either age or smoking, the covariates Age and Smoking are not used in the models for Q2. We also fit the three models to the continuous phenotypes Q1 and Q2 using PLINK. Each model is fitted to the 200 replicates provided.
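The covariate sets of the three models, including the coding of seven populations by six indicators, can be sketched as follows. The population labels P1 to P7, the subject record, and the function names are hypothetical; only the structure (SNP count, optional Age and Smoking, then either six indicators or 10 PC scores) follows the description above.

```python
def population_indicators(population, populations, reference):
    """Return k - 1 indicators (0/1) for k populations, omitting the reference."""
    if population not in populations:
        raise ValueError(f"unknown population: {population}")
    return [1 if population == p else 0 for p in populations if p != reference]


def covariate_row(subject, model, populations, reference, include_age_smoking=True):
    """Build the predictor values for one subject under one of the three models."""
    row = [subject["snp"]]
    if include_age_smoking:  # used for Q1; dropped for Q2
        row += [subject["age"], subject["smoking"]]
    if model == "population":
        row += population_indicators(subject["population"], populations, reference)
    elif model == "pc":
        row += list(subject["pcs"])
    elif model != "snp":
        raise ValueError(f"unknown model: {model}")
    return row


# Hypothetical labels standing in for the seven populations.
POPULATIONS = ["P1", "P2", "P3", "P4", "P5", "P6", "P7"]
# Hypothetical subject; the PC scores are placeholders, not computed values.
subject = {"snp": 1, "age": 42, "smoking": 0, "population": "P3", "pcs": [0.0] * 10}

rows = {
    model: covariate_row(subject, model, POPULATIONS, reference="P1")
    for model in ("snp", "population", "pc")
}
print({model: len(row) for model, row in rows.items()})
```

A subject from the reference population receives zeros for all six indicators, which is why the seventh indicator is not needed.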

Results

The type I error rate (i.e., false-positive rate) for noncausal genes is the fraction of tests of noncausal SNPs with a permutation p-value less than 0.05. Table 2 contains the type I error rates for Q1 and Q2. For Q1, the PC adjustment model has a type I error rate closer to 0.05 than the SNP model and the population adjustment model. For Q2, the type I error rates are relatively close to the nominal value of 0.05 for each model.
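As a minimal sketch of this definition, the empirical type I error rate is the share of null tests rejected at the chosen level. The function name and the p-values below are hypothetical and are not taken from the GAW17 results.

```python
def type_i_error_rate(p_values, alpha=0.05):
    """Fraction of p-values from noncausal SNPs that fall below alpha."""
    if not p_values:
        raise ValueError("at least one p-value is required")
    return sum(p < alpha for p in p_values) / len(p_values)


# Hypothetical permutation p-values for 20 tests of noncausal SNPs.
null_p_values = [
    0.50, 0.03, 0.72, 0.20, 0.91, 0.04, 0.33, 0.64, 0.08, 0.47,
    0.15, 0.88, 0.26, 0.59, 0.12, 0.77, 0.39, 0.95, 0.61, 0.22,
]
rate = type_i_error_rate(null_p_values)
print(rate)
```

A rate well above the nominal level for a model suggests that the model leaves some confounding uncorrected.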

Table 2 Type I error rates for Q1 and Q2 using all noncausal SNPs in noncausal genes

Tables 3 and 4 contain the results for Q1 and Q2, respectively, using all causal and noncausal SNPs in the causal genes for each trait. For noncausal SNPs in causal genes for both Q1 and Q2, the PC adjustment model has permutation type I error rates that are closest to 0.05, although they remain slightly above this nominal value. For Q1, the PC adjustment model has the lowest power for causal SNPs, possibly because of better control of the type I error rate. For Q2, where all type I error rates are relatively close to the nominal rate of 0.05, the power for causal SNPs is roughly the same for the three models.

Table 3 Type I error rates and power for Q1 using all SNPs in causal genes
Table 4 Type I error rates and power for Q2 using all SNPs in causal genes

Discussion

Because the disease status of interest is dichotomous in many studies, we study dichotomized phenotypes as well as the quantitative ones. Chromosomes 21 and 22 have no causal SNPs for either Q1 or Q2; therefore we define the SNPs on these two chromosomes as noncausal SNPs. Because other GAW17 participants have reported highly significant association between SNPs on chromosome 12 and Q1, we include the SNPs on chromosome 12 in our set of noncausal SNPs for Q1. For the genes reported here, the PC adjustment model has an empirical type I error rate that is closer to the nominal level than that of the other two models, both for SNPs in genes not causing the phenotype and for noncausal SNPs in causal genes, especially for genes determining Q1. The type I error rates for Q2 are much closer to the nominal level of 0.05 for each of the three models. This may be due to the way that the Q2 phenotype was generated. Although the PC adjustment model successfully controls the type I error rate, adjusting for the actual population of origin does not. For noncausal SNPs in causal genes, the type I error rates in the PC adjustment model of the continuous Q1 measure are slightly higher than the nominal level in two of the four SNP strata. This may be due to the association between the noncausal SNPs and causal SNPs within the gene resulting from linkage disequilibrium. It may also result from multiple testing.

The power of the PC adjustment model is relatively strong and increases as the MAF increases, as expected. The power of regression modeling for the quantitative phenotype is greater than the power of logistic regression modeling of the dichotomized phenotype for both Q1 and Q2.

In this study, we compare the PC adjustment model with a model including population of origin as a factor. The PC adjustment model has both a type I error rate closer to the nominal level of 0.05 and high power. This is because, in general, PCs calculated using all SNPs contain more information about demographic history, natural selection, and random fluctuation in admixture than the population to which a participant is assigned. That is, a participant’s genotypes may still hold information that distinguishes the participant from other members of the population of origin.

The data used here were simulated rather than real. We set our significance level to 0.05 because the number of replicates is 200. As a result, the expected number of null rejections is 10, which allows for meaningful statistical comparison. We also studied a nominal significance level of 0.01 (data not shown) and found similar control of the type I error rate except for SNPs with MAF < 0.005, where the type I error rate was 0.036, somewhat higher than expected. We could not study the type I error rate using typical genome-wide significance levels, such as 10⁻⁸.
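The arithmetic behind the choice of significance level can be sketched as follows. This is a simplified illustration that assumes a single null SNP tested independently in each of the 200 replicates; the function names are hypothetical, and the binomial standard error is a standard approximation rather than a quantity reported in the study.

```python
import math


def expected_rejections(n_replicates, alpha):
    """Expected number of null rejections across independent replicates."""
    return n_replicates * alpha


def binomial_standard_error(n_replicates, alpha):
    """Standard error of the estimated rejection rate under the null."""
    return math.sqrt(alpha * (1 - alpha) / n_replicates)


N_REPLICATES = 200
for alpha in (0.05, 0.01, 1e-8):
    print(
        alpha,
        expected_rejections(N_REPLICATES, alpha),
        binomial_standard_error(N_REPLICATES, alpha),
    )
```

At a level of 0.05 the expected count is 10 rejections, whereas at 10⁻⁸ it is far below one, so 200 replicates cannot distinguish a well-calibrated test from a miscalibrated one at genome-wide levels.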

Conclusions

The PC adjustment model with permutation p-values controls the type I error rate for the GAW17 Q1 and Q2 phenotypes. The power of the regression analysis of the quantitative phenotype is greater than the power of the analysis of the dichotomized phenotype. There is a slight decrease in power for the PC adjustment model even when MAF < 0.005.