
Genome-Wide Association Studies (GWASs) have provided a powerful approach for identifying associations between genetic variants and a single phenotype. An alternative and complementary approach to query genotype-phenotype associations is the Phenome-Wide Association Study (PheWAS) [1]. With PheWAS, associations between a specific genetic variant and a wide range of phenotypes can be explored. PheWASs are well suited to identifying new associations between SNPs and phenotypes as well as SNPs with pleiotropy [2,3,4]. The PheWAS approach was mainly pioneered by investigators at Vanderbilt University [1] and flourished in various hospital-based cohorts by scanning phenomic data in electronic medical records for genetic associations [1,4,5,6] as well as by meta-analyzing data collected in observational cohort studies like the Population Architecture using Genomics and Epidemiology (PAGE) study [2].

As of January 2017, GWASs have identified ~ 44,000 SNPs important for various human phenotypes as summarized in the GWAS catalog [7], which makes it possible to reveal pleiotropic effects and genetic mechanisms shared by different traits. Conducting PheWASs using SNPs which were reported to be associated with one or more traits is an efficient method for replication of previous results and identification of pleiotropic effects.

In this study, we used the REasons for Geographic And Racial Differences in Stroke (REGARDS) Study to examine associations between 4956 GWAS catalog SNPs (Additional file 1) that are included on the Infinium HumanExome-12v1-2_A (exome chip) array from Illumina and a rich collection of phenotypes. The REGARDS study is a population-based, longitudinal study including 30,000 participants (~ 40% African Americans), sampled from the continental US [8]. Among 12,000 African American participants, 7726 were genotyped with the exome chip. Since most PheWAS studies have considered individuals of European ancestry and cross-sectional phenotypes, REGARDS is an excellent resource for both cross-racial validation and identifying pleiotropic effects.


We tested for association between 4956 GWAS catalog SNPs and 67 phenotypes. Genomic inflation factors (λ), computed for each phenotype across all SNPs, ranged from 0.95 to 1.12, indicating that the models were well calibrated. Table 1 summarizes the 29 associations that passed the significance threshold of P < 1.5E-7. Additional file 2 compares the REGARDS results for the significant PheWAS SNPs with the corresponding results extracted from the GWAS catalog. The significant associations fall into several major phenotype groups: C reactive protein, lipid profile, diabetes, cystatin C, heart event risk, heart rate, and height. We classified the significant SNPs in two ways: (1) the SNP was associated with a phenotype matching previous publications; (2) the SNP was associated with a phenotype related to the previously reported phenotype (Additional file 2).
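As an illustration of how a genomic inflation factor can be obtained from a set of association P values, the sketch below uses the common definition of λ as the median observed 1-df chi-square statistic divided by its expected median under the null. This is an assumption about the calculation, not a description of the software used in this study; the function name `genomic_inflation` and the input values are hypothetical.

```python
from statistics import NormalDist, median

def genomic_inflation(p_values):
    """Median observed 1-df chi-square statistic divided by its null median."""
    nd = NormalDist()
    # Convert each two-sided P value to a 1-df chi-square statistic (z squared).
    # Note: extremely small P values would need a higher-precision approach.
    chi2 = [nd.inv_cdf(1.0 - p / 2.0) ** 2 for p in p_values]
    # Median of a 1-df chi-square distribution under the null (about 0.455).
    expected_median = nd.inv_cdf(0.75) ** 2
    return median(chi2) / expected_median

# Hypothetical, evenly spaced P values standing in for well-calibrated results.
null_like = [(i + 0.5) / 1000 for i in range(1000)]
lam = genomic_inflation(null_like)
print(round(lam, 3))
```

Values of λ close to 1, such as the 0.95 to 1.12 range reported here, suggest little systematic inflation of the test statistics.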

Table 1 Summary of identified significant associations in REGARDS study

Validation of known genetic associations of phenotypes

Among the 29 significant genotype and phenotype associations, 17 have been previously reported for the same phenotype (Table 1 and Additional file 2). The effect directions of the 17 associations were the same as those in the previous reports. For eight of these phenotype-genotype associations, our study represents the first validation in an African American population (see section below). These replications support the reliability of our PheWAS analysis approach. We confirmed that C reactive protein level was related to rs2794520 (P = 3.9E-34), rs7553007 (P = 6.6E-34) and rs876537 (P = 8.0E-33), which are located near the CRP gene (Table 1). Five SNPs located near the CETP gene were associated with HDL cholesterol, including rs173539 (P = 1.2E-19), rs1800775 (P = 1.5E-29), rs247616 (P = 4.9E-19), rs3764261 (P = 1.8E-30), and rs7499892 (P = 1.4E-19). Two SNPs were significantly associated with heart rate: rs12110693 near the LOC644502 gene (P = 4.3E-11) and rs9398652 near the GJA1 gene (P = 1.2E-11). We also reproduced the association between rs1173727 near the NPR3 gene and height with P = 9.9E-8. Three SNPs were significantly associated with LDL cholesterol, including rs12740374 in the SORT1/PSRC1/CELSR2 cluster (P = 1.6E-10), rs6511720 in LDLR (P = 1.2E-10), and rs7412 in APOE (P = 2.2E-65). Rs10096633 in the LPL gene (P = 4.9E-10) and rs326 in the C8orf35/SLC18A1/LPL cluster (P = 8.2E-9) were associated with total cholesterol. Apart from the 17 previously reported associations, the other 12 associations involved phenotypes that are closely related to previously published associations indexed in the GWAS catalog (Table 1 and Additional file 2).

Cross-racial validation

Eight of our findings were reported in other races previously but not in African Americans. Observed associations of rs173539, rs1800775, rs247616, and rs7499892 with HDL had not been previously reported in African Americans. The other new cross-ethnic validations from our study included rs1173727 with height, rs911119 with cystatin C, rs247616 with the Framingham risk score, and rs646776 with dyslipidemia (Table 1 and Additional file 2). Interestingly, we saw even more significant results for the association between rs247616 and HDL with P = 4.88E-52 and beta value = 4.3 (mg/dL) in REGARDS, compared to P = 9.7E-24 and beta value = 3.0 (mg/dL) in the GWAS catalog report [9] (Additional file 2).

SNPs associated with multiple traits

The 29 significant genotype and phenotype associations involved 20 SNPs, and 11 of these were associated with multiple traits (P < 1.0E-7 for the first trait and P < 3.7E-5 for the second trait) (Additional file 3). Additional file 3 also lists the genome-wide significant SNPs for one trait that were suggestively associated with another trait at nominal P < 0.05. Figure 1 shows those 11 SNPs and another 8 SNPs that were significantly associated with the first trait (P < 1.0E-7) and nominally associated with another trait (P < 0.05). Generally, the pleiotropic effects arose from one SNP being associated with multiple correlated phenotypes. In the conditional analysis, the associations between the second top traits and the corresponding SNPs were no longer significant after including the top trait as a covariate. For example, rs7412 was associated with LDL (P = 7.64E-62) and cystatin C (P = 1.80E-04) because of a significant association between these two phenotypes (P = 6.48E-06).

Fig. 1

Heatmap of -log10(P) for associations between SNPs and different traits. Colors show the association P values for SNPs that are associated with a first trait at P < 1.00E-7 and a second trait at P < 0.05. Stars indicate the primary trait associated with each SNP


Our PheWAS examined associations of 4956 SNPs with 67 phenotypes using a subset of African Americans from the REGARDS study. Our study validated 29 previous GWAS associations, of which eight associations were reported for the first time in African Americans (AAs). Among our findings, 11 SNPs were associated with multiple traits.

We identified 29 significant genotype and phenotype associations, 17 of which have been reported previously. The phenotypes of the other 12 associations were related to those previously reported but not exactly the same. For instance, rs911119, located in the CST3/CST4/CST9 gene cluster, was previously reported to be associated with chronic kidney disease in a European population [10]. Our current study found that in African Americans allele C of rs911119 was negatively associated with the level of cystatin C, which is a biomarker for kidney function (P = 6.2E-8). Rs7903146 in the TCF7L2 gene was reported to be associated with type 2 diabetes in several different populations [11], which agrees with our current results (P = 2.3E-12). Rs247616 in the CETP gene was significantly associated with the Framingham CHD Hard Event Risk Score (Fram_CHD: Risk of Coronary Death or MI over 10 Years) with P = 3.8E-9. While this SNP has not been previously associated with the Framingham risk score, it has been associated with its components as well as related phenotypes including blood metabolite levels, cardiovascular disease risk factors, and lipoprotein-associated phospholipase A2 mass and activity only in Europeans [9, 12, 13]. Rs7412 in the APOE gene was associated with Fram_CHD (P = 3.0E-12), total cholesterol (P = 2.9E-37), lipidemia (P = 6.2E-33) and Ideal7 (the American Heart Association’s “Life’s Simple Seven” score, i.e., total number of ideal risk behaviors or metrics for each of the seven) (P = 3.3E-14). Our findings were consistent with previous studies, which showed that rs7412 was associated with several lipid-related phenotypes including LDL cholesterol, lipid metabolism phenotypes, lipid traits, and response to statin therapy [14,15,16,17]. Here, we also found that rs629301 (in CELSR2, PSRC1 and SORT1), rs646776 (in CELSR2, PSRC1 and SORT1) and rs6511720 (in LDLR) are associated with dyslipidemia. This is in alignment with previous findings: associations of rs629301 with total cholesterol and LDL cholesterol [18]; associations of rs646776 with total cholesterol, LDL cholesterol, lipid metabolism phenotypes, coronary artery disease, myocardial infarction (early onset), and response to statin therapy in Europeans [19, 20]; associations of rs6511720 with total cholesterol, LDL cholesterol, lipid metabolism phenotypes, lipoprotein-associated phospholipase A2 activity and mass, and cardiovascular disease risk factors [18]. Rs12740374 in the CELSR2/PSRC1/SORT1 cluster was associated with two lipid traits in our study, total cholesterol and dyslipidemia, which are closely related to the previously reported associations with LDL cholesterol and lipoprotein-associated phospholipase A2 activity and mass [21, 22].

We validated eight associations in AAs for the first time. Because genetic variation differs between African Americans and other races [23], it is of interest whether variants reported in other races are associated with the same traits in AAs. When SNPs replicate across diverse populations, the gene’s importance in the disease process is emphasized, and consistency of findings may indicate genes that are especially important for future functional validation. Importantly, the effects of the eight variants in AAs were in the same direction as in the other reported races.


In this study, we leveraged the rich phenotype collection and the exome chip data in 7726 REGARDS AA participants, and examined the associations between 4956 GWAS catalog SNPs and 67 phenotypes. We validated 29 previous GWAS associations, of which eight associations were reported for the first time in AAs.


Study population and design

The REGARDS Study is a prospective, longitudinal population-based cohort study [8] of European American and African American adults aged 45 and older. A detailed description of the objectives and design of this study has been published [8]. The baseline telephone interview and separate in-home visit were conducted between 2003 and 2007 [24]. Baseline data collection resulted in a broad range of demographic, diet, and clinical information as well as banked biospecimens which were used to extract DNA and assess multiple clinical measurements [8]. Participants continue to be contacted every 6 months by telephone to identify stroke events and other incident outcomes [8]. The REGARDS study protocol was approved by the institutional review boards of each participating institution, and written informed consent was obtained from all participants. The current study examined phenotypes available in REGARDS participants to explore their association with exome-chip SNP genotypes. A total of 7726 self-reported African Americans with exome chip data were included in our study. The average age of participants was 64.6 years (standard deviation = 9.0), and 4770 (61.7%) were female.

SNP selection and genotyping

Genotyping was conducted using the Infinium HumanExome-12v1-2_A from Illumina (San Diego, CA, USA). The Illumina exome chip provides genotype data on > 240,000 putative functional variants selected based on over 12,000 individual exome and whole-genome sequences derived from individuals of European, African, Chinese, and Hispanic ancestry. Raw genotyping data were called by GenomeStudio (version 2.0). Variant quality control removed SNPs with call rate < 95%, monoallelic SNPs, multiallelic SNPs, and SNPs that had mapping errors. After further removing first- and second-degree relatives, samples with technical issues, and samples with mismatched sex, 7726 samples were available for analysis. In total, 4956 autosomal SNPs with minor allele frequency > 0.05, aligned to the GRCh37 reference sequence, were matched to SNPs in the GWAS catalog V1.0.1 that were reported to be associated with at least one trait with P < 1.0E-5 (Additional file 1) [7, 25].
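The call-rate, monoallelic, and minor allele frequency criteria above can be sketched as a simple per-SNP filter. The example below is a minimal sketch and not the pipeline used in this study: it assumes genotypes are coded as 0/1/2 copies of one allele with `None` for missing calls, the function name `passes_qc` is hypothetical, and the multiallelic and mapping-error checks are not covered.

```python
def passes_qc(genotypes, min_call_rate=0.95, min_maf=0.05):
    """Return True if a SNP meets the call-rate and allele-frequency criteria.

    genotypes: list of 0/1/2 allele counts, with None for missing calls
    (an assumed coding, chosen for illustration).
    """
    called = [g for g in genotypes if g is not None]
    if not genotypes or not called:
        return False
    # Remove SNPs with call rate below 95%.
    call_rate = len(called) / len(genotypes)
    if call_rate < min_call_rate:
        return False
    # Frequency of the counted allele; the minor allele is the rarer of the two.
    allele_freq = sum(called) / (2 * len(called))
    maf = min(allele_freq, 1.0 - allele_freq)
    # Monoallelic SNPs have MAF of zero; selected SNPs need MAF > 0.05.
    return maf > min_maf

# Hypothetical genotype vectors for four SNPs.
common_snp = [0, 1, 2, 1] * 25          # fully called, allele frequency 0.5
monoallelic_snp = [0] * 100             # no variation
low_call_snp = [1] * 90 + [None] * 10   # call rate 0.90
rare_snp = [0] * 97 + [1] * 3           # allele frequency 0.015

print([passes_qc(s) for s in (common_snp, monoallelic_snp, low_call_snp, rare_snp)])
```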


The phenotypes included in this study are listed in Table 2 and Table 3. They comprised both baseline measures and incident events among the 7726 African Americans. Baseline information included medical history, personal history, demographic data, socioeconomic status, cognitive screening, laboratory assays, urine, height, weight, waist circumference, blood pressure, pulse, electrocardiography, and medications in the past 2 weeks [8]. Follow-up events included stroke, coronary heart disease (CHD), myocardial infarction, infection, sepsis, end-stage renal disease, and death. All the phenotypes were binary or continuous variables. In total, 26 binary and 41 continuous phenotypes were included in the current study [26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68]. The binary variables follow a binomial distribution, and their frequencies for each category were calculated. Most of the continuous variables followed a normal distribution. For variables with large skewness or kurtosis, a logarithm or square root transformation was performed. Obvious outliers, with values more than 10 standard deviations away from the mean, were redefined as missing.
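The outlier rule and transformation step for continuous phenotypes can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, missing values are assumed to be coded as `None`, and the text does not state what level of skewness or kurtosis triggered a transformation, so no cutoff is applied here.

```python
import math
from statistics import mean, pstdev, stdev

def mask_outliers(values, n_sd=10.0):
    """Set values more than n_sd standard deviations from the mean to None."""
    observed = [v for v in values if v is not None]
    m, s = mean(observed), stdev(observed)
    return [None if v is None or abs(v - m) > n_sd * s else v for v in values]

def skewness(values):
    """Moment-based sample skewness, ignoring missing values."""
    observed = [v for v in values if v is not None]
    m, s = mean(observed), pstdev(observed)
    return sum((v - m) ** 3 for v in observed) / (len(observed) * s ** 3)

def log_transform(values):
    """Natural-log transform of positive values, keeping None as missing."""
    return [None if v is None else math.log(v) for v in values]

# Hypothetical phenotype: 200 ordinary values plus one extreme measurement.
raw = [1.0, 2.0] * 100 + [1000.0]
cleaned = mask_outliers(raw)
print(cleaned.count(None), round(skewness(cleaned), 3))
```

Note that a single extreme value also inflates the standard deviation, so a 10 SD rule only flags points in reasonably large samples, which is consistent with its use on a cohort of several thousand participants.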

Table 2 List of binary phenotypes
Table 3 The list of continuous phenotypes of this study

Statistical methods

Single-SNP linear or logistic regressions were performed in PLINK for continuous or binary phenotypes, respectively, using an additive genetic model. The top 10 principal components determined by EIGENSTRAT [69], age, and gender were used as covariates for all phenotypes. Additional covariates were used for cholesterol, high-density lipoprotein (HDL), low-density lipoprotein (LDL), triglyceride, glucose, and insulin. Those covariates included whether the participants were fasting at examination, whether they had self-reported diabetes and took insulin or glucose-lowering pills, and whether they had self-reported dyslipidemia and took lipid-lowering medication.
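To make the additive genetic model concrete, the sketch below estimates a per-allele effect for a continuous phenotype by regressing it on allele dosage (0, 1, or 2 copies). It is a deliberately simplified illustration rather than the PLINK analysis described above: it omits the principal components, age, gender, and other covariates, and the function name `additive_effect` and the data are hypothetical.

```python
from statistics import mean

def additive_effect(dosages, phenotype):
    """Least-squares slope of phenotype on allele dosage (per-allele effect)."""
    # Keep only individuals with both a genotype and a phenotype.
    pairs = [(g, y) for g, y in zip(dosages, phenotype)
             if g is not None and y is not None]
    g_mean = mean(g for g, _ in pairs)
    y_mean = mean(y for _, y in pairs)
    sxy = sum((g - g_mean) * (y - y_mean) for g, y in pairs)
    sxx = sum((g - g_mean) ** 2 for g, _ in pairs)
    return sxy / sxx

# Hypothetical data in which each extra allele copy raises the trait by 2 units.
dosages = [0, 1, 2, 0, 1, 2]
phenotype = [1.0, 3.0, 5.0, 1.0, 3.0, 5.0]
beta = additive_effect(dosages, phenotype)
print(beta)
```

Under this coding the slope is interpreted as the change in the phenotype per additional copy of the counted allele, which is how the beta values in the Results (for example, in mg/dL for HDL) are read.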

Choosing a significance threshold for PheWASs is not straightforward, and multiple approaches have been used in other PheWAS studies [2,3,4]. The PAGE study used five population-based studies representing major racial/ethnic groups, and their threshold was “P<0.01 observed in two or more PAGE studies for the same SNP, phenotype class, and race/ethnicity, and consistent direction of effect” [2]. The Environmental Architecture for Genes Linked to Environment (EAGLE) study used a similar threshold with additional conditions of allele frequency > 0.01 and sample size > 200 [4]. The Norfolk Island study performed a principal component analysis of phenotypes and used principal components as the final phenotypes. A P value of 1.84E-7 was considered the threshold for a significant association between a component and a SNP [3]. In our study, the criterion for a significant association between a single SNP and a single phenotype was defined with Bonferroni correction as P = \( \frac{0.05}{4956 \times 67} \) = 1.5E-7. The significant genotype and phenotype associations involved 20 SNPs. Therefore, the significance threshold for a second trait of a pleiotropic effect was P = \( \frac{0.05}{67 \times 20} \) = 3.7E-5.
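The two Bonferroni thresholds can be reproduced with a few lines of arithmetic. The sketch below uses only the counts stated in the text (4956 SNPs, 67 phenotypes, 20 significant SNPs); the variable names and the helper `meets_pleiotropy_criteria` are hypothetical and are included only to show how the two thresholds combine.

```python
ALPHA = 0.05
N_SNPS = 4956
N_PHENOTYPES = 67
N_SIGNIFICANT_SNPS = 20

# Threshold for a single SNP and a single phenotype.
primary_threshold = ALPHA / (N_SNPS * N_PHENOTYPES)
# Threshold for a second trait among the SNPs already significant for one trait.
secondary_threshold = ALPHA / (N_PHENOTYPES * N_SIGNIFICANT_SNPS)

def meets_pleiotropy_criteria(p_first, p_second):
    """True if the first trait passes the primary threshold and the
    second trait passes the less stringent secondary threshold."""
    return p_first < primary_threshold and p_second < secondary_threshold

print(f"{primary_threshold:.1E}", f"{secondary_threshold:.1E}")
# P values reported in the text for rs7412 with LDL and total cholesterol.
print(meets_pleiotropy_criteria(2.2e-65, 2.9e-37))
```

The secondary threshold is less stringent because only the 20 SNPs that already passed the primary threshold are tested against the 67 phenotypes.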