Background

The Genetic Analysis Workshop 18 (GAW18) real data are family-based, consisting of cleaned single-nucleotide polymorphism (SNP) genotypes, sex, age at the time of examination, hypertension status, and smoking for up to 4 time points. Data on 157 unrelated individuals are also extracted from the families and made available for analysis. Previous studies have examined more than 50 genes for their associations with hypertension, and the number is growing [1]. Moreover, hypertension is also considered to be age-dependent; the chance of being hypertensive rises with age and the risk after midlife (eg, more than 50 years of age) is considerable [2]. Hence, in this paper, we aim to identify both genetic and environmental factors that are associated with high blood pressure, with the effects potentially varying at different ages.

This contribution concerns a haplotype-based method because haplotype procedures can be more powerful than a single SNP analysis if there are multiple causal variants interacting in cis-fashion, or if only SNPs in linkage disequilibrium with causal SNPs are genotyped. Rare haplotypes can result even when only common SNPs are considered. Thus, novel methods are needed not only to take the varying effects of haplotypes and environmental factors into account, but also to deal with the anomaly of rare variants. The particular environmental factor of interest is smoking, as it has been shown to be a potential risk factor for hypertension [3]; thus, we include smoking in our model in addition to sex and age.

Because the GAW18 data are collected prospectively, we first formulate a prospective likelihood. We then borrow the idea from the logistic Bayesian LASSO (LBL) approach to penalize parameters (regression coefficients) by setting up proper prior distributions [4]. We chose to follow the LBL idea because it has been shown to be capable of detecting rare associated haplotypes, albeit under a retrospective setting with fixed, rather than varying, coefficients. As such, the LBL time-varying coefficient (LBL-tvc) method developed in this paper can be considered as a generalization of the original LBL. The haplotype and environmental effect coefficients are obtained by using Markov chain Monte Carlo (MCMC) methods. By using the proper percentiles of the sampled parameters, we can also construct hypothesis tests to determine whether a haplotype or an environmental covariate is associated with the disease.

Methods

Data

We considered data from 153 unrelated individuals. Blood pressure measurements, age, and smoking status were available for up to 4 time points. Specifically, 40, 41, 45, and 31 individuals had measurements at 1, 2, 3, and 4 time points, respectively. Binary hypertension status is as defined in the original study: An individual is labeled as hypertensive if the systolic blood pressure is greater than 140 mm Hg, or the diastolic blood pressure is greater than 90 mm Hg, or if the individual is on antihypertensive medication at the time of examination. Individuals with incomplete genotype data for the SNPs under consideration were excluded from the analysis. However, individuals with measurements at less than 4 time points were all included because such individuals can be accommodated by our model (see below).

Selection of 4 regions in the MAP4 gene

To provide a focused analysis, we only considered the MAP4 gene, which was associated with hypertension in previous studies. First we carried out preliminary analysis to find regions that provide at least weak evidence of association in MAP4 by single SNP analysis to reduce the computational burden of LBL-tvc. Specifically, we performed a logistic regression analysis for each SNP in the MAP4 gene and included age, sex, and smoking in the model as follows:

logit ( p ( Y 1 = 1 | covariates ) ) = β 0 + β 1 SN P k + β 2 SMOK E 1 + β 3 AG E 1 + β 4 SEX
(1)

where Y 1 , SMOK E 1 , AG E 1 , and SEX are the hypertension status (1 if hypertension), smoking status, age, and sex at the first examination, respectively, and S N P k is the genotype at the kth SNP. A chi-square analysis of variance (ANOVA) test is performed and the p value is recorded.

We choose SNPs corresponding to the 4 smallest p values as the anchors of our 4 regions for further analysis. Each chosen SNP and its 4 adjacent SNPs (2 on each side) form a 5-SNP-haplotype block. We used the Hapassoc software (http://cran.r-project.org/web/packages/hapassoc/index.html) to estimate haplotype frequencies, which were then used as the starting values for our MCMC analysis.

Prospective likelihood formulation

Let Y i , j be the affection status and E i , j the smoking status of the ith individual at the jth examination (i = 1,2,..., n; j = 1,2,3, j i ≤ 4), where n is the number of individuals with observed data. Further, let Z i be the phased (missing) haplotype pair and S i the sex of the ith individual. The probability of a certain haplotype pair can be written as

P ( Z | λ ) = a z ( λ ) = P ( Z = z k / z k | λ ) = δ k k d f k + ( 2 - δ k k ) ( 1 - d ) f k f k ,
(2)

where f 1 , , f m denote the m haplotype frequencies, δ k k is the indicator function that equals to 1 if k = k , and d ∈ (−1,1) is the inbreeding coefficient that can capture the excess or reduction of homozygosity [5]. When d = 0, Hardy-Weinberg equilibrium holds. The vector λ is the collection of all parameters. Suppose the disease statuses Y i , j are independent, conditional on the mean effect at time point j. In addition, assume that smoking status and haplotype are independent. Let Ψ be the collection of haplotype, smoking, sex, and age effect plus the parameters associated with haplotype frequencies, from which the mean effect can be constructed (see B-splines section). Then, the complete data prospective likelihood can be written as

L c ( Ψ ) = i , j P ( Y i , j | E i , j , Z i , S i , Ψ ) i P ( E i , 1 , Ψ ) j 2 P ( E i , j | E i , j - 1 , Ψ ) i P ( Z i | Ψ ) i P ( S i | Ψ ) .
(3)

Let θ z , E = P ( Y = 1 | Z , E , S ) / P ( Y = 0 | Z , E , S ) . A logistic regression model leads to log ( θ Z , E ) = β 0 + β h ( t ) X h + β E X E ( t ) + β E , h X E , h + β s X s , where β h ( t ) is the vector of haplotype effects at age t; β E is the smoking effect; β E , h is the interaction effect; and β s is the sex effect. Furthermore, X h is the design vector that gives the number of copies of each haplotype in Z; X E ( t ) is the smoking status at age t (age at examination); X E , h ( t ) is the interaction of smoking and haplotype; and X s is the sex.

B-splines

We consider the natural cubic B-spline to express the haplotype effects over time. We write β h ( t ) = l = 4 L + 4 β h l B l ( t ) , where β h l ( t ) is the natural cubic B-spline basis function, and L is the number of interior knots. Because the age range is from 22 to 97, we let L = 2 and choose interior knots to be (40, 60) and the boundary knots to be (20, 100). We then rewrite the logistic model as

log ( θ Z , E ) = β 0 l = 1 L + 4 β h l B l ( t ) X h + β E X E ( t ) + β E , h X E , h ( t ) + β s X s = β 0 + β ̃ X ̃ ( t ) ,
(4)

where β ̃ = ( β h l , β E , β E , h , β s ) , X ̃ ( t ) = ( B l ( t ) X h , X E , h ( t ) , X s ) . The likelihood function is now completely specified in terms of the parameter vector Ψ.

MCMC estimation of parameters

We follow the LBL [4] methodology for estimating the parameters. A double exponential distribution with mean 0 is set to be the prior distribution for each parameter in β ̃ , with the intensity parameter set to be gamma, to control shrinkage. Uninformative priors are set for haplotype frequencies and inbreeding coefficient. We use MCMC methods to sample the parameters from the appropriate posterior distributions. If it is feasible to sample directly from the conditional distribution of a parameter, then we use the Gibbs sampler; otherwise, we use the Metropolis-Hastings algorithm with appropriate proposal distribution.

Results

LBL-tvc was applied to each of the four 5-SNP-haplotype blocks/regions in the MAP4 gene. For each region, at least 1 haplotype shows significant effect at the age range of 55 to 85 years (Table 1). It is interesting to see that the effect over time is consistent over all 4 regions. Specifically, the associated haplotypes do not confer risk until an individual turns 55 to 60 years of age. The risk continues to rise and reaches the maximum at an age of 70 to 75 years. To see this more clearly, we have plotted the haplotype effects over time (Figure 1) for the associated haplotype GGTCC in region 2, which spans the region from 47964587 to 47985074 base pairs (bp) on chromosome 3, covering 2 introns and 1 exon. The haplotype appears to be protective at a young age and gradually becomes a risk haplotype when an individual reaches 60 years of age. The effect is the strongest at 65 to 70 years of age, with an estimated odds ratio of 2.52, indicating a fairly strong association. The effect appears to diminish at an older age. However, note that the number of subjects in the young (25 to 40 years) and the old (90 to 95 years) categories are very small, and thus the corresponding results in these categories should be interpreted with great caution. This phenomenon also occurs in the other haplotypes. Also note that for the identified haplotype CGAGG in region 47911271 to 47915368 bp, which covers 3 exons and 3 introns, the estimated haplotype frequency is less than 0.05, considered to be a rare haplotype. The effect of this haplotype would have been overlooked had one used a pooling method by combining this haplotype with other rare haplotypes or with a common similar haplotype [4]. Although sex and smoking are also included in the model, neither effect is deemed to be significant according to our model.

Table 1 Significant haplotypes and their effect estimates in 4 regions of the MAP4 gene on chromosome 3
Figure 1
figure 1

Effect of Haplotype GGTCC (47964587 to 47985074 bp) over the specified age range. The solid line is the odds ratio (OR) of the effects and the 2 red dotted lines are upper and lower bounds of 95% credible intervals.

Discussion

The longitudinal nature of the GAW18 data calls for methodology that is able to take the correlated measurements into account. Furthermore, there is a great deal of treasures that are yet to be mined from the common SNP data collected in genome-wide association studies. To this end, we have proposed LBL-tvc, a logistic regression model, to handle the correlated measurements over 4 time points. LBL-tvc considers the effects of haplotypes, which can be rare even if all the underlying SNPs are common. Application of LBL-tvc to the MAP4 gene yielded results that are consistent in all 4 regions of the MAP4 gene and appear to be useful. As one may expect, the effect of an associated haplotype would confer risk only when an individual reaches the age of 55 to 60 years, when hypertension typically strikes. The results further demonstrate the utility of the methodology for its ability to detect the effects of rare associated haplotypes.

To evaluate the performance of LBL-tvc, we carried out a preliminary simulation study with the effect mimicking that of what we see in the real data. More specifically, the simulation model considers a 5-SNP-haplotype block in which there are 5 common haplotypes and 2 rare haplotypes, with 1 of each type being associated with the hypertensive status. The strength of association across the age range of 20 to 90 years varies in a fashion similar to the pattern in the fitted real data. We also entertained an interaction effect between smoking and the common risk haplotype. Affection and smoking status are simulated at 4 time points for 250 individuals. The results, based on 100 replications, show that the type I error is well controlled, and there is overwhelming power (>90%) for detecting the common haplotype effects in the mid-age range. The power is much lower (approximately 50%), although still reasonable, for the rare haplotype effect. The power for detecting the haplotype-smoking interaction is also very high (>90%); we note, however, that the power will likely be much smaller had the interaction been with a rare haplotype. Overall, the simulation results are encouraging and to some extent validate our findings in the real data. Nevertheless, further investigation is needed to fully evaluate the properties of the method.

Because MCMC is applied for estimating the parameters, the procedure is computationally intensive. For example, analysis of each simulation replicate on a 5-SNP-haplotype block with 250 individuals and data on 4 time points as described above took about 35 minutes to complete. Therefore, our method should be primarily used for follow-up studies in interesting gene/regions. In our real data analysis, we simply use a prescreening procedure to find single SNP signals to form haplotypes with 4 neighboring SNPs. This construction of haplotype block is somewhat arbitrary. An alternative would be to select additional SNPs based on linkage disequilibrium plots.