Genotypic data
Genotypes of publicly available Hapmap draft release 3 data (http://hapmap.ncbi.nlm.nih.gov/downloads/genotypes/2010-05_phaseIII/plink_format/) (Altshuler et al. 2010) and 1KG data (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521) (Auton et al. 2015) were used for the simulation analyses. In the Hapmap sample, analyses were restricted to subjects with Northern or Western European background (CEU) or African ancestry (YRI) while in the 1KG data, analyses were restricted to the CEU subjects and Americans of African Ancestry in SouthWest USA (ASW).
Calculation of FST
For 3,173,374 and 30,102,059 SNPs in the Hapmap and 1KG datasets, respectively, population divergence between the ancestral European and African populations was quantified using the fixation index (FST). We used the unbiased estimator of the FST statistic of Weir and Cockerham (1984):
$$F_{ST} = \frac{MSP - MSG}{MSP + \left( n_{c} - 1 \right) MSG} \tag{1}$$
where,
$$n_{c} = \sum_{i} n_{i} - \frac{\sum_{i} n_{i}^{2}}{\sum_{i} n_{i}}$$
$$MSP = \sum_{i} n_{i} \left( p_{i} - p \right)^{2}$$
$$MSG = \frac{1}{\sum_{i} \left( n_{i} - 1 \right)} \sum_{i} n_{i} p_{i} \left( 1 - p_{i} \right)$$
\(n_{i}\) is the sample size in population i (i = 1, 2), \(p_{i}\) is the frequency of the given allele in population i, and p is the average frequency of the allele across the populations, while MSP and MSG denote the population variance and the genetic variance, respectively.
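As an illustration of Eq. (1), the estimator can be sketched for a single SNP as follows. This is a minimal sketch rather than the code used in the analyses: the function name `fst_wc` is hypothetical, and the average frequency p is assumed here to be the sample-size-weighted mean of the population frequencies, which the text does not specify.

```python
import math

def fst_wc(n, p):
    """Eq. (1) for one SNP.

    n: sample sizes per population, p: allele frequencies per population.
    Hypothetical helper written for illustration only.
    """
    n_tot = sum(n)
    # Assumption: p-bar is the sample-size-weighted mean allele frequency.
    p_bar = sum(ni * pi for ni, pi in zip(n, p)) / n_tot
    n_c = n_tot - sum(ni ** 2 for ni in n) / n_tot
    msp = sum(ni * (pi - p_bar) ** 2 for ni, pi in zip(n, p))
    msg = sum(ni * pi * (1 - pi) for ni, pi in zip(n, p)) / sum(ni - 1 for ni in n)
    denom = msp + (n_c - 1) * msg
    if denom == 0:
        # Monomorphic in every population: FST is not well defined.
        return math.nan
    return (msp - msg) / denom

# Sample sizes of the final Hapmap sample; the frequencies are made up.
example = fst_wc([112, 147], [0.15, 0.85])
```

SNPs for which the function returns `nan` correspond to those without a well-defined FST value. When the populations share the same allele frequency the estimate is slightly negative, which is expected for this estimator.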
Data quality checks
Initially, the Hapmap data included 1,397 individuals and 1,457,897 SNPs. Analyses were restricted to 259 founders with CEU and YRI ethnic backgrounds. Next, genotype and individual missingness was controlled by (i) excluding SNPs with high levels of missingness (>0.2); and (ii) excluding individuals and SNPs with missingness >0.05. SNPs with minor allele frequency (MAF) <0.01 were excluded. Finally, analyses were limited to SNPs with well-defined FST values. These quality checks resulted in a final sample of 259 individuals (112 CEU and 147 YRI) and 1,304,792 SNPs.
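The filtering order described above can be sketched on a genotype matrix as follows. This is only an illustration, not the commands actually run (which are not given here): the function name `qc_filter`, the matrix layout (individuals in rows, SNPs in columns, genotypes coded 0/1/2 with NaN for missing), and the choice to filter individuals before SNPs in step (ii) are assumptions.

```python
import numpy as np

def qc_filter(G, snp_miss_loose=0.2, miss_strict=0.05, maf_min=0.01):
    """Apply the missingness and MAF filters to G (individuals x SNPs).

    Returns the filtered matrix and the indices of retained individuals
    and SNPs. Hypothetical helper; thresholds follow the text.
    """
    G = np.asarray(G, dtype=float)
    ind_idx = np.arange(G.shape[0])
    snp_idx = np.arange(G.shape[1])

    # (i) exclude SNPs with high missingness (> 0.2)
    keep = np.isnan(G).mean(axis=0) <= snp_miss_loose
    G, snp_idx = G[:, keep], snp_idx[keep]

    # (ii) exclude individuals, then SNPs, with missingness > 0.05
    # (the order within this step is an assumption)
    keep = np.isnan(G).mean(axis=1) <= miss_strict
    G, ind_idx = G[keep, :], ind_idx[keep]
    keep = np.isnan(G).mean(axis=0) <= miss_strict
    G, snp_idx = G[:, keep], snp_idx[keep]

    # exclude SNPs with minor allele frequency < 0.01
    freq = np.nanmean(G, axis=0) / 2.0
    maf = np.minimum(freq, 1.0 - freq)
    keep = maf >= maf_min
    return G[:, keep], ind_idx, snp_idx[keep]

# Toy data: SNP 1 has 30% missing calls, SNP 2 is monomorphic.
toy = np.zeros((40, 4))
toy[:, 0] = np.arange(40) % 3
toy[:12, 1] = np.nan
toy[:, 3] = (np.arange(40) + 1) % 3
toy_qc, kept_inds, kept_snps = qc_filter(toy)
```

In the toy example only SNPs 0 and 3 survive, and all 40 individuals are retained.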
The downloaded 1KG VCF chromosome files were transformed into PLINK format data and merged into a single binary PLINK file. This file included 1,029 individuals and 38,151,414 SNPs. Analyses were restricted to 146 founders with CEU and ASW ethnic backgrounds. The same quality checks were performed as for the Hapmap sample, but as an additional step, variants with one or more multi-character allele codes were excluded with the --snps-only option in PLINK. These QC steps resulted in a final sample of 146 individuals (85 CEU and 61 ASW) and 5,131,518 SNPs.
Calculation of MDS components
Twenty multi-dimensional scaling (MDS) components were calculated in PLINK with the options --cluster --mds-plot 20. The MDS components were included as covariates in the genetic association analyses to correct for population stratification. We initially corrected for 10 components and increased the number when the test statistics remained inflated.
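To make the idea behind the MDS components concrete, the following sketch applies classical (Torgerson) MDS to a pairwise distance matrix derived from identity-by-state (IBS) sharing. It is an illustration under stated assumptions and not a reimplementation of PLINK, whose internal procedure may differ in detail. The function names `ibs_distance` and `classical_mds` are hypothetical, and the data are simulated with made-up allele frequencies.

```python
import numpy as np

def ibs_distance(G):
    """Pairwise distance = 1 - mean IBS sharing, for genotypes coded 0/1/2."""
    G = np.asarray(G, dtype=float)
    return np.abs(G[:, None, :] - G[None, :, :]).mean(axis=2) / 2.0

def classical_mds(D, k=2):
    """Classical MDS: double-centre the squared distances, then eigendecompose."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:k]          # k largest eigenvalues
    vals, vecs = vals[order], vecs[:, order]
    return vecs * np.sqrt(np.clip(vals, 0.0, None))

# Two simulated populations with strongly divergent allele frequencies.
rng = np.random.default_rng(0)
f1 = rng.uniform(0.1, 0.9, 300)
f2 = 1.0 - f1
geno = np.vstack([rng.binomial(2, f1, (30, 300)),
                  rng.binomial(2, f2, (30, 300))])
components = classical_mds(ibs_distance(geno), k=2)
```

With frequencies this divergent, the first component separates the two simulated groups, which is why such components can serve as covariates that absorb ancestry differences.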
Investigation of type-I error rate
Quantitative phenotypes were simulated for the CEU and YRI samples of the Hapmap data, and for the CEU and ASW samples of the 1KG data, under the assumption of strong population trait divergence (e.g., significant differential trait scores) across ethnic populations. Quantitative phenotypes were drawn from normal distributions with means of 0 and 3, respectively, and standard deviations of 1. To investigate the impact of population stratification on type-I error rate (prior to and after MDS correction), the inflation in test statistics and p-values was investigated by calculating the average and standard error of the genomic inflation factor based on 100 simulations. Quantile–quantile (QQ) plots of observed and expected p-values were created for one of these simulations.
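The logic of this experiment can be sketched in a single simulated replicate. This is a minimal illustration, not the analysis pipeline: genotypes are simulated with hypothetical allele frequencies rather than taken from Hapmap or 1KG, a population indicator stands in for the MDS components, and the function names are invented for the example. The genomic inflation factor is computed as the median 1-df test statistic divided by the median of the 1-df chi-square distribution (about 0.4549).

```python
import numpy as np

CHI2_1DF_MEDIAN = 0.4549364  # median of the chi-square distribution with 1 df

def assoc_chi2(G, y, covar=None):
    """Squared t-statistic per SNP from linear regression of y on genotype.

    Covariates (if any) are regressed out of both y and G first.
    """
    n = len(y)
    X = np.ones((n, 1)) if covar is None else np.column_stack([np.ones(n), covar])
    Q, _ = np.linalg.qr(X)
    y_res = y - Q @ (Q.T @ y)
    G_res = G - Q @ (Q.T @ G)
    num = (G_res * y_res[:, None]).sum(axis=0) ** 2
    r2 = num / ((G_res ** 2).sum(axis=0) * (y_res ** 2).sum())
    dof = n - X.shape[1] - 1
    return dof * r2 / (1.0 - r2)

def simulate(n1=112, n2=147, n_snps=2000, mean_shift=3.0, seed=0):
    """Null SNPs in two populations whose trait means differ by mean_shift."""
    rng = np.random.default_rng(seed)
    f1 = rng.uniform(0.1, 0.9, n_snps)                      # hypothetical frequencies
    f2 = np.clip(f1 + rng.normal(0, 0.15, n_snps), 0.05, 0.95)
    G = np.vstack([rng.binomial(2, f1, (n1, n_snps)),
                   rng.binomial(2, f2, (n2, n_snps))]).astype(float)
    pop = np.r_[np.zeros(n1), np.ones(n2)]
    y = rng.normal(0, 1, n1 + n2) + mean_shift * pop        # N(0,1) vs N(3,1)
    return G, y, pop

G, y, pop = simulate()
lam_raw = np.median(assoc_chi2(G, y)) / CHI2_1DF_MEDIAN
lam_adj = np.median(assoc_chi2(G, y, covar=pop)) / CHI2_1DF_MEDIAN
```

Repeating this over many seeds and averaging `lam_raw` and `lam_adj` would mirror the mean and standard error over 100 simulations described above. In this idealized setting the ancestry covariate is known exactly, so the adjusted factor should lie close to 1, which need not hold with estimated MDS components in real data.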
Here we consider the mean test statistic (for a quantitative trait) on the non-causal variants to investigate the sources of inflation. Assuming p causal variants and a total of L non-causal SNPs, the mean \({\chi }^{2}\) statistic on the non-causal variants can be derived (Sham and Purcell 2014) from the non-centrality parameter (NCP) as follows:
$$\lambda_{mean} = 1 + \frac{1}{L} \sum_{j=1}^{p} \sum_{k=1}^{m\left( j \right)} \frac{N h_{j}^{2} r_{jk}^{2}}{1 - h_{j}^{2} r_{jk}^{2}} \tag{2}$$
Here N is the sample size, \(h_{j}^{2}\) is the (non-zero) heritability attributable to the causal variant j, and \(r_{jk}^{2}\) is the linkage disequilibrium between the causal variant j and the (non-causal) SNP k (Spencer et al. 2009). Note that the NCP for SNP k in LD with causal variant j decays with LD and is given by \(\frac{Nh_{j}^{2}r_{jk}^{2}}{1-h_{j}^{2}r_{jk}^{2}}\). Thus the genomic inflation factor increases with the sample size N in a linear manner and may be influenced by longer-range (admixture) linkage disequilibrium. As has been noted, polygenicity [i.e., a large number p of contributing variants] may also result in an inflated distribution of the test statistic (Bulik-Sullivan et al. 2015).
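A short numerical sketch of Eq. (2) makes the dependence on N explicit. The function name `lambda_mean` and all input values are hypothetical and chosen only for illustration.

```python
def lambda_mean(N, L, h2, r2):
    """Eq. (2): mean chi-square statistic over L non-causal SNPs.

    h2[j] is the heritability attributable to causal variant j;
    r2[j] lists the LD (r^2) values between variant j and the
    non-causal SNPs in LD with it.
    """
    total = 0.0
    for h2_j, r2_j in zip(h2, r2):
        for r2_jk in r2_j:
            total += N * h2_j * r2_jk / (1.0 - h2_j * r2_jk)
    return 1.0 + total / L

# One causal variant explaining 1% of the variance, in LD with two SNPs.
lam_1000 = lambda_mean(N=1000, L=100, h2=[0.01], r2=[[1.0, 0.5]])
lam_2000 = lambda_mean(N=2000, L=100, h2=[0.01], r2=[[1.0, 0.5]])
```

Doubling N doubles the excess over 1 (`lam_2000 - 1` is twice `lam_1000 - 1`), which is the linear dependence on sample size noted above.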
Investigation of statistical power
Quantitative phenotypes were simulated for the CEU and YRI/ASW samples under the assumption of strong population trait divergence for a SNP with a high value of FST. In Hapmap, the selected causal SNP (rs711274) has an FST value of 0.74. For 1KG, the selected causal SNP (rs7530465) has an FST value of 0.65. In each analysis, 10,000 simulations were performed. Phenotypes were simulated to be associated with ethnicity and the SNP genotype according to the following model, in which \({N}_{j}\) represents the number of risk alleles {0,1,2} for the causal SNP j:
$$\text{CEU}:y\sim{}N(0,1)+\text{ SNPeff}*{{N}_{j}}$$
$$\text{YRI/ASW}:y\sim{}N(4,1)+\text{ SNPeff}*{{N}_{j}}$$
The theoretical value of the statistical power was calculated by performing 10,000 simulations in the full samples in the absence of population trait divergence. The effect sizes (SNPeff) in Hapmap and 1KG data were set at 0.2 and 0.3, respectively; a larger effect size was applied in 1KG data to compensate for the smaller sample size. The impact of including the MDS components on statistical power was assessed by repeating the association tests while correcting for the same number of MDS components as in the tests performed in the presence of population trait divergence. This allowed us to quantify the loss in power due to the application of the MDS correction (whether or not there was any difference in phenotype between the source populations) and any additional loss in power from the correction in the presence of population trait divergence.
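The power simulation can be sketched as follows. This is an illustration under several assumptions that are not taken from the text: genotypes at the causal SNP are simulated from hypothetical allele frequencies (0.1 and 0.9) instead of using the observed genotypes, a population indicator stands in for the MDS components, the significance threshold is set to 0.05, and the number of replicates is reduced to keep the example fast. The function names are invented for the example.

```python
import numpy as np

CHI2_1DF_CRIT_05 = 3.841459  # 95th percentile of chi-square with 1 df (assumed alpha = 0.05)

def snp_chi2(g, y, covar=None):
    """Squared t-statistic for one SNP, optionally adjusting for covariates."""
    n = len(y)
    X = np.ones((n, 1)) if covar is None else np.column_stack([np.ones(n), covar])
    Q, _ = np.linalg.qr(X)
    g_res = g - Q @ (Q.T @ g)
    y_res = y - Q @ (Q.T @ y)
    r2 = (g_res @ y_res) ** 2 / ((g_res @ g_res) * (y_res @ y_res))
    return (n - X.shape[1] - 1) * r2 / (1.0 - r2)

def estimate_power(n_ceu=112, n_other=147, f_ceu=0.1, f_other=0.9,
                   snp_eff=0.2, pop_shift=4.0, adjust=True,
                   n_sims=500, seed=1):
    """Fraction of replicates in which the causal SNP is significant."""
    rng = np.random.default_rng(seed)
    pop = np.r_[np.zeros(n_ceu), np.ones(n_other)]
    hits = 0
    for _ in range(n_sims):
        g = np.r_[rng.binomial(2, f_ceu, n_ceu),
                  rng.binomial(2, f_other, n_other)].astype(float)
        # CEU: y ~ N(0,1) + SNPeff * N_j; other group: y ~ N(pop_shift,1) + SNPeff * N_j
        y = rng.normal(0, 1, len(pop)) + pop_shift * pop + snp_eff * g
        stat = snp_chi2(g, y, covar=pop if adjust else None)
        hits += int(stat > CHI2_1DF_CRIT_05)
    return hits / n_sims

power_adjusted = estimate_power(snp_eff=0.2, adjust=True)
power_no_divergence = estimate_power(snp_eff=0.2, pop_shift=0.0, adjust=False)
```

Comparing `power_no_divergence` (no trait divergence, no correction) with `power_adjusted` gives a rough sense of the power lost to the ancestry correction for a highly differentiated SNP, because most of the genotype variance at such a SNP lies between populations and is removed by the covariate.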
Gene-based tests
Finally, we investigated inflation in gene-based tests, which may include multiple correlated SNPs and may therefore be more strongly affected by population stratification. We mapped the SNPs to the genomic intervals of genes (as defined by GENCODE) (Harrow et al. 2006) using the “intersectBed” command in the bedtools suite (Quinlan and Hall 2010). SNPs that mapped to multiple genes were excluded. The gene-defined intervals defined the sets on which gene-based tests were performed. For these gene-based tests, analysis was restricted to sets that include at least two variants and was then repeated for genes that include at least five variants to investigate the impact of the number of SNPs in a set on inflation. Gene-based tests were performed with --set-test in PLINK, using 10,000 permutations per set and the following options: --set-r2 0.1 --set-p 0.10 --set-max 10.