Background

Understanding the genetic architecture underlying complex phenotypes is crucial in decoding disease mechanisms as well as treatment and drug development. Genome-wide association studies (GWAS) have contributed significantly to the identification of associated variants with numerous traits. Although the sample size requirement of GWAS is high, the proportion of the disease risk explained by a single variant always remains low. However, availability of longitudinal data on multiple phenotypes might carry more information in identifying associated variants. The Genetics of Lipid Lowering Drugs and Diet Network (GOLDN) study provides a data set with triglyceride (TG) and high-density lipoprotein (HDL) levels at 4 time points for a fixed set of families with few missing observations during follow-up. Consequently, given a moderate sample size, reduction of the multiple testing burden and/or use of longitudinal information is required. Moreover, with available data on multiple interacting phenotypes, it is informative to study both the inherent environmental correlation and the genetic correlation.

Increased levels of TG and decreased HDL levels are well-known causes of heart disease. So we developed one model that captures the genome-wide genetic association with interacting phenotypes, that is, TG and HDL, over time by introducing both a temporal covariance structure and a genetic covariance structure between phenotype measurements. Our method is expected to increase power as a result of the inclusion of more information through genetic and environmental correlation structures.

On the other hand, fenofibrate is an antiinflammatory drug, well known for TG-lowering effects. Some studies report modulation of a lipid response to fenofibrate as the result of genetic variants involved in lipid metabolism [1], but the response to treatment by fenofibrate varies across individuals in a population [2]. Some GWAS ventured to find associated variants behind such a response, but met no success in the sense of identifying variants significantly associated with fenofibrate response [3]. The reasons might be low sample size, noncomparable baseline lipid profiles, and environmental exposures of the individuals. Thus, along with simple GWAS, we studied multiloci association of response to fenofibrate treatment with the genomic regions, which reduces the chance of missing a moderately associated locus. We also examined the association of single variants using multiple TG and HDL phenotypes in the GOLDN study. We found that 1 single nucleotide polymorphism (SNP) is associated with the TG and HDL phenotypes, although we did not find any significant SNP or gene that is associated with the drug response. It is important to note that because the sample in this study is not very large, we report a few top significant loci that might be associated with phenotype and response to drug.

Methods

We analyzed the real data provided by GAW20 (ie, GOLDN study) data set [4]. The data set contains data on age, smoking status, etc. as covariates, pedigree information, and genome-wide genetic variation, as well as TG and HDL levels measured before and after the drug at 4 time points. Information on genetic variation was available for 822 individuals, while other information, except kinship structure, had a sample size of 1105 individuals. Pedigree data was available for 4151 individuals. The kinship structure, covariates, and TG and HDL phenotypes were available for all 822 individuals, but genotype information was missing for 1 individual. Consequently, in the subsequent analysis we used the remaining 821 subjects. Next, the missing genotypes and monomorphic SNPs were removed from the analysis. For variants with only 2 observed genotypes, we eliminated the SNP if 1 genotype frequency was < 5%. We imputed missing phenotype data using a mixed-model approach under the null model, and used log–log transformation of the phenotype variables for the entire analysis. This transformation made the data normally distributed and, hence, the resultant test statistic followed a standard distribution under a null hypothesis (H0) of no association. During imputation, we assumed constant heritability, and this value was from an existing study [5]. We calculated p values using the asymptotic distribution of test statistics under H0 after Benjamini-Hochberg (BH) correction.

To meet the objectives as stated in the previous section, we first do a GWAS based on longitudinal data with TG and HDL together as phenotypes. We use a mixed model that includes environmental as well as genetic correlation structure.

With TG and HDL at all time points as response vector, our model is:

$$ \left(\begin{array}{c}{Y}^{TG}\\ {}{Y}^{HDL}\end{array}\right)=\left(\begin{array}{c}{X}^{TG}{\beta}^{TG}+{Z}^{TG}{u}^{TG}\\ {}{X}^{HDL}{\beta}^{HDL}+{Z}^{HDL}{u}^{HDL}\end{array}\right)+\left(\begin{array}{c}{\varepsilon}^{TG}\\ {}{\varepsilon}^{HDL}\end{array}\right) $$
(1)

where \( {Y}^{TG}={\left({Y}_1^{TG},{Y}_2^{TG},{Y}_3^{TG},{Y}_4^{TG}\right)}^{\hbox{'}} \) and \( {Y}^{HDL}={\left({Y}_1^{HDL},{Y}_2^{HDL},{Y}_3^{HDL},{Y}_4^{HDL}\right)}^{\hbox{'}} \) denote TG and HDL respectively, at 4 time points, for n individuals. In this model, XTG and XHDL are fixed effects design matrices, where XTG = XHDL = [I4 ⊗ 1n  14 ⊗ gn], gn is the genotype vector for n individuals at a single marker locus, uTG and uHDLare random effect vectors for n individuals, and ZTG = ZHDL = 14 ⊗ In is the corresponding matrix. Here \( {\beta}^{TG}={\left({\beta}_1^{TG},{\beta}_2^{TG},{\beta}_3^{TG},{\beta}_4^{TG},{\beta}_5^{TG}\right)}^{\hbox{'}} \), where the first 4 components are temporal effects for 4 different time points that includes the drug effect and \( {\beta}_5^{TG} \) is the effect of the SNP. We assume that, \( Var\left({u}^{TG}\right)={\sigma}_{u, TG}^2K, Var\left({u}^{HDL}\right)={\sigma}_{u, HDL}^2K \) where K is the kinship matrix, Var(εTG) = ΣTG ⊗ In, Var(εHDL) = ΣHDL ⊗ In,

$$ {\Sigma}_{TG}={\sigma}_{TG}^2\left(\begin{array}{cccc}1& {\rho}_{1, TG}& {\rho}_{1, TG}{\rho}_{2, TG}& {\rho}_{1, TG}{\rho}_{2, TG}{\rho}_{3, TG}\\ {}{\rho}_{1, TG}& 1& {\rho}_{2, TG}& {\rho}_{2, TG}{\rho}_{3, TG}\\ {}{\rho}_{1, TG}{\rho}_{2, TG}& {\rho}_{2, TG}& 1& {\rho}_{3, TG}\\ {}{\rho}_{1, TG}{\rho}_{2, TG}{\rho}_{3, TG}& {\rho}_{2, TG}{\rho}_{3, TG}& {\rho}_{3, TG}& 1\end{array}\right) $$

where \( {\sigma}_{TG}^2 \) is the common variance of TG at all 4 time points and the correlation coefficient matrix is parameterized by three parameters: ρ1, TG, ρ2, TG and ρ3, TG. Similarly we define βTG and ΣHDL. Now, denoting ρg and ρε as genetic and environmental correlations respectively, we assume the correlation structure as:

\( Var\left(\begin{array}{c}{Y}^{TG}\\ {}{Y}^{HDL}\end{array}\right)=Z\left({\Sigma}_{u, TG, HDL}\otimes K\right){Z}^{\prime }+\left({\Sigma}_{\varepsilon, TG, HDL}\otimes {I}_n\right) \) where \( Z=\left(\begin{array}{cc}{Z}^{TG}& 0\\ {}0& {Z}^{HDL}\end{array}\right), \) \( {\Sigma}_{\varepsilon, TG, HDL}=\left(\begin{array}{cc}{\Sigma}_{TG}& {\rho}_{\varepsilon }{\Sigma}_{TG}^{\frac{1}{2}}{\Sigma}_{HDL}^{\frac{1}{2}}\\ {}{\rho}_{\varepsilon }{\Sigma}_{HDL}^{\frac{1}{2}}{\Sigma}_{TG}^{\frac{1}{2}}& {\Sigma}_{HDL}\end{array}\right), \) \( {\Sigma}_{u, TG, HDL}=\left(\begin{array}{cc}{\sigma}_{u, TG}^2& {\rho}_g{\sigma}_{u, TG}{\sigma}_{u, HDL}\\ {}{\rho}_g{\sigma}_{u, HDL}{\sigma}_{u, TG}& {\sigma}_{u, HDL}^2\end{array}\right). \).

We test the null hypothesis of no genetic association of an SNP with TG and HDL using a likelihood ratio test. The asymptotic distribution of the log-likelihood ratio statistic can be shown to follow a \( {\chi}_2^2 \) distribution. We apply this test at the GWAS level after appropriate multiple-testing correction.

To address our second objective (ie, to test association of response to fenofibrate treatment), we model our data using Henderson’s mixed-model approach with adequate modification. Incorporating correlation structure among family members, we propose our model as

$$ Y= X\beta + Zu+\varepsilon $$
(2)

where Y is the vector of changes in phenotype (measured as log-log of TG) before and after the drug treatment; X is the design matrix of covariates, namely, age, smoking status, and SNP/genotypes in a genomic region (not in linkage disequilibrium); β is the associated fixed effect parameter; Z is the design matrix of random components; u is the random effect of the family; and ε is the error component. We assume \( u\sim N\left(0,{\sigma}_g^2K\right) \), where K is the kinship matrix and \( \varepsilon \sim N\left(0,{\sigma}_e^2R\right) \) independently of u. Hence, \( V(Y)={\sigma}_g^2 ZK{Z}^{\prime }+{\sigma}_e^2R \). Note that because we are dealing with family data, a non-diagonal positive definite matrix R appears in the variance-covariance matrix of ε.

During analysis, we use 778 individuals after removing those with no response either before or after drug treatment. But in case of missing response at one of the time points before (after) drug treatment, we impute it with the other response value. To calculate kinship matrix we use R package “kinship2” with the entire family data. To find association of genomic regions, we first identify the genomic regions and then remove the SNPs that are in linkage disequilibrium (r2 > 0.5). The genomic regions are basically (a) the genic regions, and (b) the intergenic regions lying between 2 consecutive genes, that overlap the genotyped SNPs in our data. We use Henderson’s iterative procedure for mixed model approach [6] after substantial modification and after adjusting for age, smoking status as fixed effects, and random genetic effect within a family. We use the restricted maximum likelihood (REML) approach to test our H0 of no association, adopting the expectation-maximization (EM) algorithm for parameter estimation.

Maximization of joint likelihood of Y and u and eq. (3) [6] provide the best linear unbiased predictors (BLUPs) for the random component under normal assumption of the response variable.

$$ \left(\begin{array}{c}{X}^{\prime }{R}^{-1}X\widehat{\beta}+{X}^{\prime }{R}^{-1}Z\widehat{u}\\ {}{ZR}^{-1}X\widehat{\beta}+\left({Z}^{\prime }{R}^{-1}Z+{D}^{-1}\right)\widehat{u}\end{array}\right)=\left(\begin{array}{c}{X}^{\prime }{R}^{-1}y\\ {}{Z}^{\prime }{R}^{-1}y\end{array}\right) $$
(3)

So to test the association of (a) whole-genome SNPs and (b) genomic regions, with response to the drug treatment, our null hypothesis will be, H0 :  = 0 for an arbitrary p × q matrix M with rank(M) = p. Thus, if n be the number of observations and rank(M) = p, the test statistic [7],

$$ F=\frac{{\left(M\widehat{\beta}\right)}^{\hbox{'}}{\left(M{\left({X}^{\prime }{\widehat{V}}^{-1}X\right)}^{-1}{M}^{\prime}\right)}^{-1}\left(M\widehat{\beta}\right)}{\mathit{\operatorname{rank}}(M)}\sim F\left(p,n-q\right),\mathrm{under}\ {H}_0 $$
(4)
$$ \mathrm{where}\ \widehat{V}=\widehat{Var\left(\widehat{\beta}\right)}={X}^{\prime }{\widehat{V}}_y^{-1}X\ \mathrm{and}\ {\widehat{V}}_y=\widehat{Var}(y). $$

We developed the above procedure to test association of multiple SNPs (genomic regions), which reduces the multiple-testing burden. However, this test can be seen as a single-marker association test that we have used to perform our GWAS study with appropriate multiple-testing correction.

Results

With longitudinal information using multiple phenotypes we identified only 1 significant SNP (Fig. 1) after BH correction. The SNP is rs2880301, located at TPTE2 in an intron. rs2880301 is reported to be associated with HDL particle diameter and low-density lipoprotein (LDL) particle diameter [8] and is also known to confer protection against hepatocellular carcinoma [9]. However, we think that there might be other SNPs that remain unidentified as a consequence of small sample size. Hence we report the top 20 SNPs based on p value in Table 1. rs752273 is reported to be associated with cardiovascular diseases [10] while rs2896368 is known to be associated with α1-antitrypsin level [11].

Fig. 1
figure 1

Manhattan plot of genome-wide p values of SNPs on interacting phenotypes, namely, TG and HDL

Table 1 Top 20 SNPs associated with TG and HDL

To test the null hypothesis of no association with drug response, we examined 243,593 whole-genome SNPs and 18,266 genomic regions. The genomic regions in our study are (a) genic, that overlap the genotyped SNPs, and (b) intergenic, that lie between 2 consecutive genes and overlap the genotyped SNPs in the data set. After BH correction, none of the SNPs nor genomic regions showed significant association with the drug response (Fig. 2). However, we report the top 20 SNPs (Table 2) and top 20 genomic regions (Table 3). The moderately small sample size of the data compared to most of the GWAS might be a reason behind this result. Application of our methods to larger GWAS and further functional validation of the reported top loci might provide some directive for studying inflammatory biomarkers and response to fenofibrate treatment.

Fig. 2
figure 2

Manhattan plot of genome wide p values corresponding to response of drug treatment on TGs

Table 2 Top 20 SNPs associated with fenofibrate treatment response
Table 3 Top 20 genomic regions associated with fenofibrate treatment

Discussion and conclusions

In this paper, we developed novel methods for (a) GWAS using longitudinal data and (b) GWAS/genomic region association with response to fenofibrate treatment based on a family-based design. These methods are agnostic to the choice of phenotype and can be generalized to any such study. Although we could not detect any novel biologically relevant locus that is significantly associated with response to fenofibrate treatment, we identified a few loci that are associated with TG and HDL levels. Our belief is that the primary reason for obtaining only a small number of significant association findings is the much smaller sample size in our analyses as compared to conventional GWAS. Validation in a larger sample might throw more light on the roles of the top few significantly associated SNPs and/or genomic regions in controlling TG and HDL levels. Nevertheless, this study emphasizes the effect of administering fenofibrate to individuals with specific genetic profiles.

We pruned the available set of SNPs to an independent set of SNPs in our GWAS, primarily to reduce the multiple-testing burden. Because many studies impute SNPs, and hence augment the number of available SNPs, to explore association findings for previously reported SNPs that have not been genotyped, our strategy has a caveat in the sense of reduction in the overall power of the GWAS. Similarly, while our proposed method involves simultaneous testing of multiple SNPs within a gene in order to evaluate association at the gene level, it may yield lower powers compared to the usual single SNP analyses in GWAS.

We imputed the missing phenotype data using a known heritability value [5] and have applied the EM algorithm. Although studies show that such imputation may lead to some loss of power and hence seems to be a limitation of our current method, the intuition behind the imputation strategy was to use the maximum phenotype data in our analyses. A more general model that includes the genotype data can be developed in a likelihood framework for testing association, but this would increase substantial computational complexity while calculating the test statistic.

Association findings based on any real data set are susceptible to being false positives. If these findings validate previous reports of association, they are more likely to be true positives. In case of novel significant findings, it is necessary to either validate them in an independent data set or, alternatively, to perform extensive simulations under similar genotype and phenotype structures to evaluate the false-positive rates of the underlying test procedures.