Background

Hypertension is a common chronic medical condition characterized by elevated arterial blood pressure. High blood pressure is associated with an increased risk of stroke, heart attack, and other serious diseases. Age, gender, tobacco smoking, alcohol consumption, and high body mass index are established risk factors for hypertension [1]. A genetic component has also been postulated. Individuals with a family history of hypertension have, on average, a higher blood pressure than individuals without a family history. Yanek et al. found a 44% higher prevalence of hypertension in siblings of affected persons than in the general reference population [2]. In a Canadian study, standardized risk ratios of hypertension were higher for first-degree relatives than for spouses of probands with hypertension [3]. In genetic studies, a large number of polymorphisms have been associated with hypertension and validated in independent samples; 14 loci have been identified (as of 2010) and many genetic studies are currently in progress [4–8].

The relationship between inherited genetic polymorphisms and a binary response variable (with/without hypertension) can be investigated using logistic regression models that simultaneously consider the effects of multiple risk factors. Standard methods used to estimate the parameters of logistic regression models (for example, iteratively reweighted least squares) are limited by their sensitivity to a few observations that depart from the majority of the data. This contrasts with the purpose of genetic risk models, which aim to predict a particular health outcome that holds for the bulk of individuals and to identify persons with a deviating high risk of disease. We use data from the Genetic Analysis Workshop 18 (GAW18) to explore the possible benefit of robust parameter estimates in logistic regression models for the genetic prediction of hypertension risk.

Methods

The analysed data (real phenotypes) were derived from 142 unrelated individuals who participated in the San Antonio Family Heart or Family Diabetes/Gallbladder studies. Longitudinal information on hypertension, age, gender, and current tobacco smoking was measured up to 4 times per individual; the present analyses relied on the first available measurement. Further information is provided in several articles [9–12].

The original data were filtered according to the following criteria: (a) individuals needed at least 1 measurement with complete information on hypertension and age; (b) monomorphic variants were excluded, and each polymorphism had to be represented by at least 2 individuals; (c) individuals with more than 5% missing genotypes were excluded; and, finally, (d) variants with missing data in any individual were removed.

The relationship between hypertension and age, gender, and current tobacco smoking was first investigated by χ² tests. Covariates significantly associated at the 5% significance level were added to the intercept-only model to build the baseline model. Subsequently, standard logistic regression (iteratively reweighted least squares) was used to identify possible hypertension-associated single-nucleotide polymorphisms (SNPs) with minimal deviance, taking into account associated covariates. The deviance is defined as minus twice the logarithm of the likelihood. Genotypes were coded according to an additive penetrance model; that is, 0, 1, and 2. Departing observations (outliers) according to standard logistic regression were identified based on the Cook's distance in the baseline model. The Cook's distance for observation i is defined as

$$D_i = \frac{\sum_{j=1}^{n} \left( \hat{y}_j - \hat{y}_{j(i)} \right)^2}{q \cdot \mathrm{MSE}}$$

where $\hat{y}_j$ denotes the full regression model prediction for observation $j$, $\hat{y}_{j(i)}$ represents the regression model prediction for observation $j$ estimated omitting observation $i$, and MSE indicates the mean square error of the regression model with $q$ explanatory variables.
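The leave-one-out structure of the Cook's distance defined above can be illustrated with a minimal Python sketch. To stay self-contained it refits an ordinary least-squares line rather than the logistic baseline model used in the study, it takes q as the number of fitted coefficients (intercept and slope), and the data values are invented for illustration; none of this is the authors' code.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def cooks_distances(xs, ys):
    """Cook's distance of every observation, by refitting without it."""
    n = len(xs)
    q = 2  # assumption: q counts the fitted coefficients (intercept, slope)
    a, b = fit_line(xs, ys)
    fitted = [a + b * x for x in xs]
    mse = sum((y - f) ** 2 for y, f in zip(ys, fitted)) / (n - q)
    distances = []
    for i in range(n):
        # Refit the model with observation i left out.
        a_i, b_i = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        # Squared shift of all n predictions caused by dropping observation i.
        shift = sum((f - (a_i + b_i * x)) ** 2 for x, f in zip(xs, fitted))
        distances.append(shift / (q * mse))
    return distances


# Invented example: the last observation departs from the linear trend.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 12.0]
distances = cooks_distances(xs, ys)
most_influential = distances.index(max(distances))
```

In this toy data set the last observation combines a large residual with high leverage, so it receives the largest distance; a threshold such as the 0.05 used later in the text would then be applied to `distances`.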

To investigate the possible benefit of robust parameter estimates in logistic regression, model coefficients were also estimated by solving

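$$\sum_{i=1}^{n} \Psi(y_i; \mu_i) = \sum_{i=1}^{n} \left[ \nu(y_i; \mu_i)\, w(x_i)\, \mu_i' - \alpha(\beta) \right] = 0$$

where $\nu(y_i; \mu_i) = \psi_c(\epsilon_i) / V^{1/2}(\mu_i)$ with the Pearson residuals $\epsilon_i$ and the Huber function

$$\psi_c(r_i) = \begin{cases} r_i & \text{for } |r_i| \le c \\ c \, \operatorname{sign}(r_i) & \text{for } |r_i| > c, \end{cases}$$

$w(x_i) = (1 - h_{ii})^{1/2}$ with $h_{ii}$ the $i$th diagonal element of the matrix $H = X (X^T X)^{-1} X^T$, $\mu_i' = \partial \mu_i / \partial \beta$, and $\alpha(\beta) = \frac{1}{n} \sum_{i=1}^{n} E[\nu(y_i; \mu_i)]\, w(x_i)\, \mu_i'$.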

This estimator is based on a quasi-likelihood and is asymptotically normally distributed and Fisher consistent [13]. The purpose of the Huber function is to downweight the influence of outliers while assigning inliers the usual weight. Variable selection under robust logistic regression relied on the minimal quasideviance described by Cantoni and Ronchetti, which is a robust test statistic for model selection [13]. The quasideviance between 2 nested models is defined as

$$\Lambda_{QM} = 2 \left[ \sum_{i=1}^{n} Q_M(y_i, \hat{\mu}_i) - \sum_{i=1}^{n} Q_M(y_i, \dot{\mu}_i) \right]$$

where

$$Q_M(y_i, \mu_i) = \int_{\tilde{s}}^{\mu_i} \nu(y_i, t)\, w(x_i)\, dt - \frac{1}{n} \sum_{j=1}^{n} \int_{\tilde{t}}^{\mu_j} E[\nu(y_j, t)]\, w(x_j)\, dt$$

with $\tilde{s}$ such that $\nu(y_i, \tilde{s}) = 0$ and $\tilde{t}$ such that $E[\nu(y_i, \tilde{t})] = 0$. The estimated linear predictor $\hat{\mu}$ is associated with the estimate $\hat{\beta}$ of $\beta$, and $\dot{\mu}$ is associated with $\dot{\beta}$, which is the estimate of $(\beta_{(1)}, 0)$. Linkage disequilibrium was not accounted for during variant selection, either for standard logistic regression or for robust logistic regression.
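The ingredients of the robust estimating equation above (Huber function, Pearson residuals, the expectation used in the consistency correction, and leverage-based weights) can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' R implementation: the tuning constant c = 1.345 is a commonly used value that the text does not specify, the leverage formula is written for a design with an intercept and a single covariate, and the ages and fitted probability are invented.

```python
import math


def huber_psi(r, c):
    """Huber function: identity for |r| <= c, clipped to +/- c beyond that."""
    return r if abs(r) <= c else c * math.copysign(1.0, r)


def pearson_residual(y, mu):
    """Pearson residual of a binary outcome y with fitted probability mu."""
    return (y - mu) / math.sqrt(mu * (1.0 - mu))


def nu(y, mu, c):
    """nu(y; mu) = psi_c(residual) / V(mu)^(1/2), with V(mu) = mu * (1 - mu)."""
    return huber_psi(pearson_residual(y, mu), c) / math.sqrt(mu * (1.0 - mu))


def expected_nu(mu, c):
    """E[nu(y; mu)] for y ~ Bernoulli(mu), as needed for alpha(beta)."""
    return mu * nu(1, mu, c) + (1.0 - mu) * nu(0, mu, c)


def leverage_weights(xs):
    """w(x_i) = sqrt(1 - h_ii) for a design with intercept and one covariate."""
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return [math.sqrt(1.0 - (1.0 / n + (x - mean_x) ** 2 / sxx)) for x in xs]


C = 1.345  # assumed tuning constant, not stated in the text

# Hypothetical normotensive individual (y = 0) with fitted probability 0.9.
residual = pearson_residual(0, 0.9)
clipped = huber_psi(residual, C)

# Hypothetical ages; extreme ages have high leverage and thus lower weight.
ages = [20, 38, 52, 81, 95]
weights = leverage_weights(ages)
```

Here the Pearson residual of about -3 is clipped to -1.345, which bounds the contribution of this observation to the score. For a very large c no residual is clipped, `expected_nu` is zero, and the estimating equation reduces to the usual likelihood score apart from the leverage weights.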

Our comparison of the performance of standard and robust logistic regression was based on several statistics. First, standard and robust estimates of age effects were used to exemplify the potential influence of departing observations. Because outliers are handled differently, it was expected that different age-genotype models would be selected under standard and robust logistic regression. Consequently, the areas under the receiver operating characteristic curves (AUCs) were compared in order to investigate the discriminative performance of the selected models. Comparisons were conducted for the complete data set and after exclusion of potential outliers.
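As an illustration of the AUC comparison described above, the following Python sketch computes the AUC as the proportion of case-control pairs in which the case receives the higher predicted probability (ties count one half). The function name and the probabilities are hypothetical; the original analyses were run in R.

```python
def auc(probs, labels):
    """AUC as the fraction of case-control pairs ranked correctly."""
    cases = [p for p, y in zip(probs, labels) if y == 1]
    controls = [p for p, y in zip(probs, labels) if y == 0]
    score = 0.0
    for p_case in cases:
        for p_control in controls:
            if p_case > p_control:
                score += 1.0  # case ranked above control
            elif p_case == p_control:
                score += 0.5  # ties count one half
    return score / (len(cases) * len(controls))


# Invented predicted probabilities for 2 cases (1) and 3 controls (0).
labels = [1, 1, 0, 0, 0]
probs = [0.6, 0.4, 0.5, 0.2, 0.1]
auc_all = auc(probs, labels)

# Recompute after excluding a hypothetical outlier (here the third individual).
keep = [i for i in range(len(labels)) if i != 2]
auc_without = auc([probs[i] for i in keep], [labels[i] for i in keep])
```

Comparing `auc_all` with `auc_without` mirrors the comparison made for the complete data set and after exclusion of potential outliers.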

In addition, concordance, sensitivity, specificity, clinical net benefit, and AUCs were estimated for age-genotype models using a leave-one-out cross-validation approach [14]. Concordance was defined as the proportion of correctly estimated hypertension statuses using several cutoff values for the predicted affection probability. The clinical net benefit (NB) was defined by

$$\mathrm{NB}(c) = \frac{\text{True positive counts}}{\text{Sample size}} - \frac{\text{False positive counts}}{\text{Sample size}} \cdot \frac{c}{1-c} = \text{Sensitivity} \cdot P(\text{Hypertensive}) - (1 - \text{Specificity}) \cdot P(\text{Normotensive}) \cdot \frac{c}{1-c}$$

where c is the chosen threshold for allocating an individual to the cases based on the logistic regression probability estimate, and P(Hypertensive) and P(Normotensive) are the proportions of hypertensive and normotensive individuals in the sample. The net benefit therefore depends on the hypertension prevalence in the study population. The standard and robust logistic regression models were also compared based on the integrated discrimination index (IDI) estimated by cross-validation

$$\mathrm{IDI} = \left( \frac{1}{n_{\text{cases}}} \sum_{i=1}^{n_{\text{cases}}} \hat{p}_{\text{rob},i} - \frac{1}{n_{\text{contr}}} \sum_{j=1}^{n_{\text{contr}}} \hat{p}_{\text{rob},j} \right) - \left( \frac{1}{n_{\text{cases}}} \sum_{i=1}^{n_{\text{cases}}} \hat{p}_{\text{stand},i} - \frac{1}{n_{\text{contr}}} \sum_{j=1}^{n_{\text{contr}}} \hat{p}_{\text{stand},j} \right)$$

where $\hat{p}_{\text{rob},i}$, $\hat{p}_{\text{rob},j}$, $\hat{p}_{\text{stand},i}$, and $\hat{p}_{\text{stand},j}$ denote the probability estimates from the robust and standard logistic regression models for cases and controls [15]. This index represents the difference in the discrimination slopes of the 2 compared models. A positive IDI indicates that the robust model discriminates better between hypertensive and normotensive individuals than the standard model. Statistical analyses were carried out using the statistical language R, version 2.15.1 [16].
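The performance measures defined above (concordance, sensitivity, specificity, clinical net benefit, and IDI) can be written down in a few lines. The sketch below uses Python for illustration only and is not the authors' R code; the function names are hypothetical, the probabilities are invented, and allocating an individual to the cases when the predicted probability is at least c (rather than strictly greater) is an assumption.

```python
def performance(probs, labels, c):
    """Concordance, sensitivity, specificity and net benefit at cutoff c."""
    # Assumption: an individual is allocated to the cases when p >= c.
    tp = sum(1 for p, y in zip(probs, labels) if p >= c and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= c and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < c and y == 1)
    tn = sum(1 for p, y in zip(probs, labels) if p < c and y == 0)
    n = tp + fp + fn + tn
    return {
        "concordance": (tp + tn) / n,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "net_benefit": tp / n - fp / n * c / (1.0 - c),
    }


def discrimination_slope(probs, labels):
    """Mean predicted probability in cases minus that in controls."""
    cases = [p for p, y in zip(probs, labels) if y == 1]
    controls = [p for p, y in zip(probs, labels) if y == 0]
    return sum(cases) / len(cases) - sum(controls) / len(controls)


def idi(p_rob, p_stand, labels):
    """Difference in discrimination slopes, robust minus standard."""
    return discrimination_slope(p_rob, labels) - discrimination_slope(p_stand, labels)


# Invented cross-validated probabilities for 2 cases (1) and 3 controls (0).
labels = [1, 1, 0, 0, 0]
p_stand = [0.6, 0.4, 0.5, 0.2, 0.1]
p_rob = [0.7, 0.5, 0.4, 0.2, 0.1]

stand_at_half = performance(p_stand, labels, 0.5)
rob_at_half = performance(p_rob, labels, 0.5)
idi_value = idi(p_rob, p_stand, labels)
```

In a leave-one-out setting, each entry of `p_stand` and `p_rob` would be the prediction for an individual from a model fitted without that individual. Note that the IDI is computed from the predicted probabilities directly and does not involve the cutoff c.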

Results

χ² tests revealed no influence of gender (p = 0.95) or tobacco smoking (p = 1.00) on hypertension risk. Hence, only age was included as a covariate in the logistic regression models. The filter criteria resulted in 130 individuals (43 cases and 87 controls) with complete genotype and phenotype information. The age of the individuals ranged between 20 and 95 years, with a median age of 52 years. The total number of measured SNPs on chromosome 3 in the investigated GAW18 data set was 35,045.

A plot of Cook's distances under the age-only standard logistic regression model revealed several observations (Figure 1) that departed from the majority of the sample. Considering a threshold of 0.05 for the Cook's distance, 4 observations could be defined as outliers. Information on disease status and age of deviating individuals is shown in Table 1. Individuals 62, 58, and 24 were older than 80 years and normotensive. Individual number 60 was affected by the condition early in life, at 38 years of age. Table 1 shows the influence of the 4 identified outliers on standard and robust parameter estimates of age effects. For example, the exclusion of individual 62 resulted in an 11.2% increase of the excess risk of hypertension per year according to standard logistic regression, compared to a 7.8% increase for robust logistic regression. Table 2 shows the odds of hypertension by age interval.

Figure 1

Cook's distances from the age-only standard logistic regression model. The 4 most prominent outliers are indicated by their observation number.

Table 1 Estimated odds ratios per year of age
Table 2 Overall odds of hypertension per age interval

Standard logistic regression identified SNP rs3934103 located in the ULK4 gene as the variant that most improved the model fit. Robust logistic regression identified SNP rs11918360 in RP11-408H1.3 as the variant with the strongest association signal. Under both standard and robust regression, model selection clearly favored the 2 identified SNPs as represented in Figure 2. The pairwise r² between SNP rs3934103 and SNP rs11918360 was 0.003.

Figure 2

Quantile-quantile plots from the age-genotype standard and robust logistic regression models. The 2 selected SNPs are indicated by their reference SNP ID number.

Table 3 shows the influence of the 4 outliers on the AUCs from the standard and robust logistic regression models. Robust and standard AUCs for the age-only models were identical. For the age-genotype models, the AUCs were slightly smaller and also slightly less outlier-dependent for robust logistic regression than for standard logistic regression.

Table 3 Area under the receiver operating characteristic curve (AUC)

Table 4 summarizes the results from the leave-one-out cross-validation. The concordance was better for the robust logistic regression model at every cutoff probability. Both models allocated best at probability 0.5 and almost identically at probability 0.3 (the investigated population included 43 cases and 87 controls; that is 33% hypertension prevalence). At a probability of 0.3, sensitivities were identical and the specificity was slightly higher under robust regression. Standard and robust estimates showed similar discriminative performances supported by an IDI of −0.07 at every cutoff probability. AUCs were also almost identical. The clinical net benefit was slightly larger for the robust logistic regression model in the probability range between 0.2 and 0.6.

Table 4 Concordance, sensitivity, specificity, clinical net benefit, and overall AUCs.

Discussion

The present results confirmed that single individuals (1/130 = 0.8% of the observations) with a departing risk of hypertension may substantially affect the overall risk estimates in the baseline model, causing up to an 11.2% change in the estimated excess risk of hypertension per year according to standard logistic regression in the present exercise.

The identification of outliers is relatively straightforward using routine diagnostic plots, but outlier management is extremely challenging. For example, the specification of thresholds for outlier definition is often arbitrary. Robust statistics aim to generate estimates that hold for the majority of the population using complete data. The unequal weighting of outliers by standard and robust regression resulted in prediction models that included different genetic variants.

Although robust estimates of age effects and AUCs for age-genotype models were less sensitive to outliers than standard estimates in the investigated sample, cross-validation AUCs based on standard and robust logistic regression, as well as IDI, were almost identical. The other investigated performance characteristics (concordance, sensitivity, specificity, and clinical net benefit) were equal or better for robust logistic regression around the probability that reflects the case-control ratio.

The standard logistic regression model selected 1 variant in the ULK4 gene. Variants in this gene have previously been shown to be associated with hypertension [4, 17]. Among others, 4 variants (rs2272007, rs3774372, rs1716975, rs1052501) mentioned in the 2 publications were also genotyped in the GAW18 sample, and we found them to be in linkage disequilibrium (r² values 0.83, 0.73, 0.83, and 0.83, respectively) with the associated SNP rs3934103.

Conclusions

Preliminary findings suggest some advantage of robust statistics in the context of genetic association studies. However, present results were limited to a given sample size, as well as to particular genetic effect sizes and proportions of outliers. Additional analyses based on both real data and more general simulated scenarios should be conducted to validate initial findings.