Practical investigation of the performance of robust logistic regression to predict the genetic risk of hypertension

Kesselmeier, Miriam; Legrand, Carine; Peil, Barbara; Kabisch, Maria; Fischer, Christine; Hamann, Ute; Lorenzo Bermejo, Justo

doi:10.1186/1753-6561-8-S1-S65

Proceedings
Open access
Published: 17 June 2014

Volume 8, article number S65, (2014)

BMC Proceedings

Miriam Kesselmeier¹,⁴,
Carine Legrand¹,
Barbara Peil¹,
Maria Kabisch²,
Christine Fischer³,
Ute Hamann² &
Justo Lorenzo Bermejo¹

Abstract

Logistic regression is usually applied to investigate the association between inherited genetic variants and a binary disease phenotype. A limitation of standard methods used to estimate the parameters of logistic regression models is their strong dependence on a few observations deviating from the majority of the data.

We used data from the Genetic Analysis Workshop 18 to explore the possible benefit of robust logistic regression to estimate the genetic risk of hypertension. The comparison between standard and robust methods relied on the influence of departing hypertension profiles (outliers) on the estimated odds ratios, areas under the receiver operating characteristic curves, and clinical net benefit.

Our results confirmed that single outliers may substantially affect the estimated genotype relative risks. The ranking of variants by probability values was different in standard and in robust logistic regression. For cutoff probabilities between 0.2 and 0.6, the clinical net benefit estimated by leave-one-out cross-validation in the investigated sample was slightly larger under robust regression, but the overall area under the receiver operating characteristic curve was larger for standard logistic regression. The potential advantage of robust statistics in the context of genetic association studies should be investigated in future analyses based on real and simulated data.

Background

Hypertension is a common chronic medical condition characterized by elevated arterial blood pressure. High blood pressure is associated with an increased risk of stroke, heart attack, and other serious diseases. Age, gender, tobacco smoking, alcohol consumption, and high body mass index are established risk factors for hypertension [1]. A genetic component has also been postulated. Individuals with a family history of hypertension have, on average, a higher blood pressure than individuals without a family history. Yanek et al. found a 44% higher prevalence of hypertension in siblings of affected persons than in the general reference population [2]. In a Canadian study, standardized risk ratios of hypertension were higher for first-degree relatives than for spouses of probands with hypertension [3]. In genetic studies, a large number of polymorphisms have been associated with hypertension and validated in independent samples; 14 loci had been identified as of 2010, and many genetic studies are currently in progress [4–8].

The relationship between inherited genetic polymorphisms and a binary response variable (with/without hypertension) can be investigated using logistic regression models that simultaneously consider the effects of multiple risk factors. Standard methods used to estimate the parameters of logistic regression models (for example, iteratively reweighted least squares) are limited by their dependence on a few observations departing from the majority of the data. This contrasts with the purpose of genetic risk models, which aim to predict a particular health outcome that holds for the bulk of individuals and to identify persons with a deviating high risk of disease. We use data from the Genetic Analysis Workshop 18 (GAW18) to explore the possible benefit of robust parameter estimates in logistic regression models for the genetic prediction of hypertension risk.

Methods

The analysed data (real phenotypes) were derived from 142 unrelated individuals who participated in the San Antonio Family Heart or Family Diabetes/Gallbladder studies. Longitudinal information on hypertension, age, gender, and current tobacco smoking was measured up to 4 times per individual; the present analyses relied on the first available measurement. Further information is provided in several articles [9–12].

The original data were filtered according to the following criteria: (a) at least 1 measurement with complete information on hypertension and age was required, (b) monomorphisms were excluded and each polymorphism had to be represented by at least 2 individuals, (c) individuals with more than 5% missing genotypes were excluded, and, finally, (d) variants with missing data in any individual were removed.
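The genotype-related criteria (b) to (d) can be illustrated with a minimal Python sketch. It is not the authors' code. The function name `filter_genotypes`, the toy matrix, and the order in which the steps are applied are assumptions made for illustration, and "represented by at least 2 individuals" is interpreted here as at least 2 carriers of the counted allele. Criterion (a), which concerns phenotypes, is left out.

```python
from typing import List, Optional, Tuple

Genotype = Optional[int]  # allele count 0, 1 or 2; None marks a missing genotype


def filter_genotypes(
    geno: List[List[Genotype]],
    max_missing: float = 0.05,
    min_carriers: int = 2,
) -> Tuple[List[int], List[int]]:
    """Return indices of retained individuals (rows) and variants (columns)."""
    n_var = len(geno[0]) if geno else 0

    # (b) drop monomorphic variants and variants with too few carriers
    # (assumed reading of "represented by at least 2 individuals")
    variants = []
    for j in range(n_var):
        observed = [row[j] for row in geno if row[j] is not None]
        carriers = sum(1 for g in observed if g > 0)
        if len(set(observed)) > 1 and carriers >= min_carriers:
            variants.append(j)

    # (c) drop individuals with more than max_missing missing genotypes
    individuals = []
    for i, row in enumerate(geno):
        missing = sum(1 for j in variants if row[j] is None)
        if variants and missing / len(variants) <= max_missing:
            individuals.append(i)

    # (d) drop variants with any missing genotype among retained individuals
    variants = [
        j for j in variants if all(geno[i][j] is not None for i in individuals)
    ]
    return individuals, variants


# Hypothetical toy data: 4 individuals x 4 variants
toy = [
    [0, 1, 0, None],
    [0, 2, 1, 1],
    [0, 0, 1, 0],
    [0, 1, None, 1],
]
kept_individuals, kept_variants = filter_genotypes(toy)
```

With the 5% threshold, the two toy individuals with a missing genotype are excluded before step (d); with a laxer threshold they are kept and the affected variants are dropped instead, which shows that the order of the steps matters.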

The relationship between hypertension and age, gender, and current tobacco smoking was first investigated by χ² tests. Covariates significantly associated at the 5% significance level were added to the intercept-only model to build the baseline model. Subsequently, standard logistic regression (iteratively reweighted least squares) was used to identify possible hypertension-associated single-nucleotide polymorphisms (SNPs) with minimal deviance, taking into account associated covariates. The deviance is defined as minus twice the logarithm of the likelihood. Genotypes were coded according to an additive penetrance model; that is, 0, 1, and 2. Departing observations (outliers) according to standard logistic regression were identified based on Cook's distance in the baseline model. Cook's distance for observation $i$ is defined as

D_{i} = \frac{\sum_{j = 1}^{n} {({\hat{y}}_{j} - {\hat{y}}_{j (i)})}^{2}}{q \cdot \mathrm{MSE}}

where ${\hat{y}}_{j}$ denotes the full regression model prediction for observation $j$ , ${\hat{y}}_{j (i)}$ represents the regression model prediction for observation $j$ estimated omitting observation $i$ , and MSE indicates the mean square error of the regression model with $q$ explanatory variables.
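The definition can be made concrete with a small Python sketch that refits the model once per omitted observation. This is an illustration under simplifying assumptions, not the analysis performed in the article: it uses ordinary least squares with a single covariate instead of the logistic baseline model, takes $q$ as the number of fitted coefficients (intercept and slope), and computes the MSE with $n - q$ degrees of freedom. The helper names `fit_line` and `cooks_distances` and the data are hypothetical.

```python
from typing import List, Tuple


def fit_line(x: List[float], y: List[float]) -> Tuple[float, float]:
    """Least-squares intercept and slope for y = a + b * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def cooks_distances(x: List[float], y: List[float], q: int = 2) -> List[float]:
    """Cook's distance of every observation, by explicit leave-one-out refits."""
    n = len(x)
    a, b = fit_line(x, y)
    fitted = [a + b * xi for xi in x]
    # Mean square error of the full model (assumed n - q degrees of freedom)
    mse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / (n - q)

    distances = []
    for i in range(n):
        # Refit without observation i
        a_i, b_i = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        # Sum of squared changes in all n predictions caused by omitting i
        shift = sum((fitted[j] - (a_i + b_i * x[j])) ** 2 for j in range(n))
        distances.append(shift / (q * mse))
    return distances


# Hypothetical data: the last point departs from the trend of the others
x_demo = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_demo = [1.1, 1.9, 3.2, 3.9, 5.1, 1.0]
d_demo = cooks_distances(x_demo, y_demo)
most_influential = max(range(len(d_demo)), key=lambda i: d_demo[i])
```

In this toy example the last observation receives by far the largest distance, which is the kind of pattern a threshold on Cook's distance is meant to flag.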

To investigate the possible benefit of robust parameter estimates in logistic regression, model coefficients were also estimated by solving

\sum_{i = 1}^{n} Ψ (y_{i}; μ_{i}) = \sum_{i = 1}^{n} \left[ v (y_{i}; μ_{i}) w (x_{i}) μ_{i}^{'} - α (β) \right] = 0

where $v (y_{i}; μ_{i}) = \frac{ψ_{c} (ϵ_{i})}{V^{1 / 2} (μ_{i})}$ with the Pearson residuals $ϵ_{i} = (y_{i} - μ_{i}) / V^{1 / 2} (μ_{i})$ and the Huber function

ψ_{c} (r_{i}) = \begin{cases} r_{i} & \text{for } |r_{i}| \leq c \\ c \, \operatorname{sign} (r_{i}) & \text{for } |r_{i}| > c, \end{cases}

$w (x_{i}) = {(1 - h_{i i})}^{1 / 2}$ with $h_{i i}$ the $i$th diagonal element of the matrix $H = X {(X^{T} X)}^{- 1} X^{T}$, $μ_{i}^{'} = \frac{\partial μ_{i}}{\partial β}$, and $α (β) = \frac{1}{n} \sum_{i = 1}^{n} E [v (y_{i}; μ_{i})] w (x_{i}) μ_{i}^{'}$.
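How the Huber function bounds the contribution of a large residual can be sketched in a few lines of Python. The tuning constant `c = 1.345` is a conventional choice and an assumption here, since the article does not report the value used. The function names and the example individual are hypothetical, and the variance function is taken as $V (μ) = μ (1 - μ)$ for a binary outcome.

```python
import math


def huber_psi(r: float, c: float = 1.345) -> float:
    """Huber function: identity for |r| <= c, clipped to +/- c beyond that."""
    if abs(r) <= c:
        return r
    return c if r > 0 else -c


def huber_weight(r: float, c: float = 1.345) -> float:
    """Implied weight psi_c(r) / r: 1 for inliers, less than 1 for outliers."""
    return 1.0 if r == 0 else huber_psi(r, c) / r


def pearson_residual(y: int, mu: float) -> float:
    """Pearson residual of a binary outcome with fitted probability mu."""
    return (y - mu) / math.sqrt(mu * (1.0 - mu))


def leverage_weight(h_ii: float) -> float:
    """Design weight w(x_i) = sqrt(1 - h_ii) from the hat-matrix diagonal."""
    return math.sqrt(1.0 - h_ii)


# Hypothetical individual: normotensive (y = 0) with fitted probability 0.9
r_demo = pearson_residual(0, 0.9)   # about -3.0
psi_demo = huber_psi(r_demo)        # clipped to -1.345
w_demo = huber_weight(r_demo)       # about 0.45, so the point is downweighted
```

An observation whose Pearson residual stays within $\pm c$ keeps weight 1, so the robust and standard estimating equations treat it alike up to the leverage weight and the consistency correction $α (β)$.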

This estimator is based on a quasi-likelihood, asymptotically normally distributed and Fisher consistent [13]. The objective of the Huber function is to downweight the influence of outliers and to assign inliers the usual weight. Variable selection under robust logistic regression relied on the minimal quasideviance as described by Cantoni and Ronchetti, which is a robust test statistic for model selection [13]. The quasideviance between 2 nested models is defined as

Λ_{Q M} = 2 [\sum_{i = 1}^{n} Q_{M} (y_{i}, {\hat{μ}}_{i}) - \sum_{i = 1}^{n} Q_{M} (y_{i}, {\dot{μ}}_{i})]

where $Q_{M} (y_{i}, μ_{i}) = \int_{\tilde{s}}^{μ_{i}} v (y_{i}, t) w (x_{i}) d t - \frac{1}{n} \sum_{j = 1}^{n} \int_{\tilde{t}}^{μ_{j}} E [v (y_{j}, t) w (x_{j})] d t$ with $\tilde{s}$ such that $v (y_{i}, \tilde{s}) = 0$ and $\tilde{t}$ such that $E [v (y_{i}, \tilde{t})] = 0$. The estimate $\hat{μ}$ is associated with the estimate $\hat{β}$ of $β$ under the larger model, and $\dot{μ}$ is associated with $\dot{β}$, the estimate of $(β_{(1)}, 0)$ under the nested model. Linkage disequilibrium was not accounted for during variant selection under either standard or robust logistic regression.

Our comparison of the performance of standard and robust logistic regression was based on different statistics. First, standard and robust estimates of age effects were used to exemplify the potential influence of departing observations. Because of a different handling of outliers, it was expected that different age-genotype models were selected under standard and robust logistic regression. Consequently, the areas under the receiver operating characteristic curves (AUCs) were subsequently compared in order to investigate the discriminative performance of the selected models. Comparisons were conducted for the complete data set and after exclusion of potential outliers.
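For reference, the AUC used in these comparisons can be computed directly from predicted probabilities as the proportion of case-control pairs in which the case receives the higher prediction, with ties counted as one half. The following Python sketch shows this pairwise (Mann-Whitney) form; the function name `auc` and the input values are hypothetical, and the article does not state which implementation was used.

```python
from typing import Sequence


def auc(p_cases: Sequence[float], p_controls: Sequence[float]) -> float:
    """Area under the ROC curve as the share of concordant case-control pairs."""
    wins = 0.0
    for pc in p_cases:
        for pn in p_controls:
            if pc > pn:
                wins += 1.0      # case ranked above control
            elif pc == pn:
                wins += 0.5      # tie counts as half a concordant pair
    return wins / (len(p_cases) * len(p_controls))


# Hypothetical predicted probabilities for 2 cases and 2 controls
auc_demo = auc([0.8, 0.6], [0.4, 0.6])  # 3.5 concordant pairs out of 4 -> 0.875
```

Because the AUC depends only on the ranking of predictions, two models with different coefficient estimates can share the same AUC whenever they order individuals identically.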

In addition, concordance, sensitivity, specificity, clinical net benefit, and AUCs were estimated for age-genotype models using a leave-one-out cross-validation approach [14]. Concordance was defined as the proportion of correctly estimated hypertension statuses using several cutoff values for the predicted affection probability. The clinical net benefit (NB) was defined by

\begin{aligned} NB (c) & = \frac{\text{True positive counts}}{\text{Sample size}} - \frac{\text{False positive counts}}{\text{Sample size}} \cdot \frac{c}{1 - c} \\ & = \text{Sensitivity} \cdot (\% \text{ Hypertensive}) - (1 - \text{Specificity}) \cdot (\% \text{ Normotensive}) \cdot \frac{c}{1 - c} \end{aligned}

where $c$ is the chosen threshold for allocating an individual to the cases based on the logistic regression probability estimate. Note that the net benefit depends on the hypertension prevalence in the study population. The standard and robust logistic regression models were also compared based on the integrated discrimination index (IDI) estimated by cross-validation

IDI = (\frac{1}{n_{cases}} \sum_{i = 1}^{n_{cases}} {\hat{p}}_{rob, i} - \frac{1}{n_{contr}} \sum_{j = 1}^{n_{contr}} {\hat{p}}_{rob, j}) - (\frac{1}{n_{cases}} \sum_{i = 1}^{n_{cases}} {\hat{p}}_{stand, i} - \frac{1}{n_{contr}} \sum_{j = 1}^{n_{contr}} {\hat{p}}_{stand, j})

where ${\hat{p}}_{rob, i}$ , ${\hat{p}}_{rob, j}$ , ${\hat{p}}_{stand, i}$ , and ${\hat{p}}_{stand, j}$ denote the probability estimates from the robust and standard logistic regression models for cases and controls [15]. This index represents the difference in the discrimination slopes of the 2 compared models. A positive IDI indicates that the robust model discriminates better between hypertensive and normotensive individuals than the standard model. Statistical analyses were carried out using the statistical language R, version 2.15.1 [16].
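The IDI defined above is a difference of two discrimination slopes, each being the mean predicted probability among cases minus the mean among controls. A minimal Python sketch follows; it is written in Python for illustration only and is not the R code used for the analyses. The function names and the probabilities are hypothetical.

```python
from typing import Sequence


def discrimination_slope(p_cases: Sequence[float], p_controls: Sequence[float]) -> float:
    """Mean predicted probability in cases minus mean in controls."""
    return sum(p_cases) / len(p_cases) - sum(p_controls) / len(p_controls)


def idi(
    p_rob_cases: Sequence[float],
    p_rob_controls: Sequence[float],
    p_stand_cases: Sequence[float],
    p_stand_controls: Sequence[float],
) -> float:
    """Integrated discrimination index: robust slope minus standard slope."""
    return discrimination_slope(p_rob_cases, p_rob_controls) - discrimination_slope(
        p_stand_cases, p_stand_controls
    )


# Hypothetical cross-validated predictions for 2 cases and 2 controls
idi_demo = idi(
    p_rob_cases=[0.7, 0.5],
    p_rob_controls=[0.2, 0.4],
    p_stand_cases=[0.6, 0.6],
    p_stand_controls=[0.3, 0.5],
)  # slopes 0.3 and 0.2, so the IDI is approximately 0.1
```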

Results

χ² tests revealed no significant association of gender (p = 0.95) or tobacco smoking (p = 1.00) with hypertension. Hence, only age was included in the logistic regression models as a covariate. Filter criteria resulted in 130 individuals (43 cases and 87 controls) with complete genotype and phenotype information. The age of the individuals ranged between 20 and 95 years with a median age of 52 years. The total number of measured SNPs on chromosome 3 in the investigated GAW18 data set was 35,045.

A plot of Cook's distances under the age-only standard logistic regression model revealed several observations (Figure 1) that departed from the majority of the sample. Considering a threshold of 0.05 for the Cook's distance, 4 observations could be defined as outliers. Information on disease status and age of deviating individuals is shown in Table 1. Individuals 62, 58, and 24 were older than 80 years and normotensive. Individual number 60 was affected by the condition early in life, at 38 years of age. Table 1 shows the influence of the 4 identified outliers on standard and robust parameter estimates of age effects. For example, the exclusion of individual 62 resulted in an 11.2% increase of the excess risk of hypertension per year according to standard logistic regression, compared to a 7.8% increase for robust logistic regression. Table 2 shows the odds of hypertension by age interval.

Table 1 Estimated odds ratios per year of age

Table 2 Overall odds of hypertension per age interval

Standard logistic regression identified SNP rs3934103 located in the ULK4 gene as the variant that most improved the model fit. Robust logistic regression identified SNP rs11918360 in RP11-408H1.3 as the variant with the strongest association signal. Under both standard and robust regression, model selection clearly favored the 2 identified SNPs as represented in Figure 2. The pairwise r² between SNP rs3934103 and SNP rs11918360 was 0.003.
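A pairwise r² close to zero means that the genotypes at the two SNPs carry almost no information about each other. The article does not state how r² was computed; one common genotype-based approximation, sketched below in Python, is the squared Pearson correlation of the additive allele counts (0, 1, 2). The function name `genotype_r2` and the genotype vectors are hypothetical.

```python
from typing import Sequence


def genotype_r2(g1: Sequence[int], g2: Sequence[int]) -> float:
    """Squared Pearson correlation between two vectors of allele counts."""
    n = len(g1)
    m1, m2 = sum(g1) / n, sum(g2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(g1, g2))
    var1 = sum((a - m1) ** 2 for a in g1)
    var2 = sum((b - m2) ** 2 for b in g2)
    return cov * cov / (var1 * var2)


# Hypothetical genotypes of 4 individuals at two SNPs
r2_unlinked = genotype_r2([0, 0, 2, 2], [0, 2, 0, 2])  # no correlation: 0.0
r2_linked = genotype_r2([0, 1, 2, 1], [0, 1, 2, 1])    # identical genotypes
```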

Table 3 shows the influence of the 4 outliers on the AUCs from the standard and robust logistic regression models. Robust and standard AUCs for the age-only models were identical. For the age-genotype models, the AUCs were slightly smaller and also slightly less outlier-dependent for robust logistic regression than for standard logistic regression.

Table 3 Area under the receiver operating characteristic curve (AUC)

Table 4 summarizes the results from the leave-one-out cross-validation. The concordance was better for the robust logistic regression model at every cutoff probability. Both models allocated best at probability 0.5 and almost identically at probability 0.3 (the investigated population included 43 cases and 87 controls; that is 33% hypertension prevalence). At a probability of 0.3, sensitivities were identical and the specificity was slightly higher under robust regression. Standard and robust estimates showed similar discriminative performances supported by an IDI of −0.07 at every cutoff probability. AUCs were also almost identical. The clinical net benefit was slightly larger for the robust logistic regression model in the probability range between 0.2 and 0.6.
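The cutoff-dependent measures reported in Table 4 all follow from the same 2 × 2 classification at a given cutoff probability. The Python sketch below shows the arithmetic, using the net benefit definition given in Methods. The function name, the rule that a predicted probability equal to the cutoff counts as a case, and the toy data are assumptions for illustration; none of the values come from the GAW18 sample.

```python
from typing import Dict, Sequence


def classification_summary(
    y: Sequence[int], p: Sequence[float], cutoff: float
) -> Dict[str, float]:
    """Concordance, sensitivity, specificity and net benefit at one cutoff."""
    pairs = list(zip(y, p))
    tp = sum(1 for yi, pi in pairs if yi == 1 and pi >= cutoff)
    fp = sum(1 for yi, pi in pairs if yi == 0 and pi >= cutoff)
    tn = sum(1 for yi, pi in pairs if yi == 0 and pi < cutoff)
    fn = sum(1 for yi, pi in pairs if yi == 1 and pi < cutoff)
    n = len(pairs)
    return {
        "concordance": (tp + tn) / n,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        # NB(c) = TP/n - FP/n * c / (1 - c)
        "net_benefit": tp / n - fp / n * cutoff / (1.0 - cutoff),
    }


# Hypothetical outcomes (1 = hypertensive) and cross-validated predictions
y_demo = [1, 1, 1, 0, 0, 0, 0, 0]
p_demo = [0.9, 0.6, 0.2, 0.7, 0.4, 0.3, 0.1, 0.1]
summary_demo = classification_summary(y_demo, p_demo, cutoff=0.5)
```

At a cutoff of 0.5 the toy data give 2 true positives and 1 false positive among 8 individuals, so the net benefit is 2/8 − 1/8 = 0.125; lowering the cutoff shrinks the penalty factor c/(1 − c) applied to false positives.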

Table 4 Concordance, sensitivity, specificity, clinical net benefit, and overall AUCs.

Discussion

The present results confirmed that single individuals (1/130 = 0.8% of the observations) with a departing risk of hypertension may substantially affect the overall risk estimates in the baseline model, causing up to an 11.2% change in the estimated excess risk of hypertension per year according to standard logistic regression in the present exercise.

The identification of outliers is relatively straightforward using routine diagnostic plots, but outlier management is extremely challenging. For example, the specification of thresholds for outlier definition is often arbitrary. Robust statistics aim to generate estimates that hold for the majority of the population using complete data. The unequal weighting of outliers by standard and robust regression resulted in prediction models that included different genetic variants.

Although robust estimates of age effects and AUCs for age-genotype models were less sensitive to outliers than standard estimates in the investigated sample, cross-validation AUCs based on standard and robust logistic regression, as well as IDI, were almost identical. The other investigated performance characteristics (concordance, sensitivity, specificity, and clinical net benefit) were equal or better for robust logistic regression around the probability that reflects the case-control ratio.

The standard logistic regression model selected 1 variant in the ULK4 gene. It was previously shown that variants in this gene are associated with hypertension [4, 17]. Among others, 4 variants (rs2272007, rs3774372, rs1716975, rs1052501) mentioned in the 2 publications were also genotyped in the GAW18 collective, and we found them to be in linkage disequilibrium (r² values 0.83, 0.73, 0.83, and 0.83, respectively) with the associated SNP rs3934103.

Conclusions

Preliminary findings suggest some advantage of robust statistics in the context of genetic association studies. However, present results were limited to a given sample size, as well as to particular genetic effect sizes and proportions of outliers. Additional analyses based on both real data and more general simulated scenarios should be conducted to validate initial findings.

References

1. Jonas BS, Franks P, Ingram DD: Are symptoms of anxiety and depression risk factors for hypertension? Longitudinal evidence from the National Health and Nutrition Examination Survey I Epidemiologic Follow-up Study. Arch Fam Med. 1997, 6: 43-49. 10.1001/archfami.6.1.43.
2. Yanek LR, Moy TF, Blumenthal RS, Raqueño JV, Yook RM, Hill MN, Becker LC, Becker DM: Hypertension among siblings of persons with premature coronary heart disease. Hypertension. 1998, 32: 123-128. 10.1161/01.HYP.32.1.123.
3. Katzmarzyk PT, Rankinen T, Pérusse L, Rao DC, Bouchard C: Familial risk of high blood pressure in the Canadian population. Am J Hum Biol. 2001, 13: 620-625. 10.1002/ajhb.1100.
4. Levy D, Ehret GB, Rice K, Verwoert GC, Launer LJ, Dehghan A, Glazer NL, Morrison AC, Johnson AD, Aspelund T, et al: Genome-wide association study of blood pressure and hypertension. Nat Genet. 2009, 41: 677-687. 10.1038/ng.384.
5. Newton-Cheh C, Johnson T, Gateva V, Tobin MD, Bochud M, Coin L, Najjar SS, Zhao JH, Heath SC, Eyheramendy S, et al: Genome-wide association study identifies eight loci associated with blood pressure. Nat Genet. 2009, 41: 666-676. 10.1038/ng.361.
6. Wang Y, O'Connell JR, McArdle PF, Wade JB, Dorff SE, Shah SJ, Shi X, Pan L, Rampersaud E, Shen H, et al: Whole-genome association study identifies STK39 as a hypertension susceptibility gene. Proc Natl Acad Sci U S A. 2009, 106: 226-231. 10.1073/pnas.0808358106.
7. Ehret GB: Genome-wide association studies: contribution of genomics to understanding blood pressure and essential hypertension. Curr Hypertens Rep. 2010, 12: 17-25. 10.1007/s11906-009-0086-6.
8. Padmanabhan S, Newton-Cheh C, Dominiczak AF: Genetic basis of blood pressure and hypertension. Trends Genet. 2012, 28: 397-408. 10.1016/j.tig.2012.04.001.
9. Mitchell BD, Kammerer CM, Blangero J, Mahaney MC, Rainwater DL, Dyke B, Hixson JE, Henkel RD, Sharp RM, Comuzzie AG, et al: Genetic and environmental contributions to cardiovascular risk factors in Mexican Americans: The San Antonio Family Heart Study. Circulation. 1996, 94: 2159-2170. 10.1161/01.CIR.94.9.2159.
10. Duggirala R, Blangero J, Almasy L, Dyer TD, Williams KL, Leach RJ, O'Connell P, Stern MP: Linkage of type 2 diabetes mellitus and of age at onset to a genetic location on chromosome 10q in Mexican Americans. Am J Hum Genet. 1999, 64: 1127-1140. 10.1086/302316.
11. Hunt KJ, Lehman DM, Arya R, Fowler S, Leach RJ, Göring HH, Almasy L, Blangero J, Dyer TD, Duggirala R, et al: Genome-wide linkage analyses of type 2 diabetes in Mexican Americans: the San Antonio Family Diabetes/Gallbladder Study. Diabetes. 2005, 54: 2655-2662. 10.2337/diabetes.54.9.2655.
12. Almasy L, Dyer TD, Peralta JM, Jun G, Fuchsberger C, Almeida MA, Kent JW, Fowler S, Duggirala R, Blangero J: Data for Genetic Analysis Workshop 18: human whole genome sequence, blood pressure, and simulated phenotypes in extended pedigrees. BMC Proc. 2014, 8 (suppl 2): S2-
13. Cantoni E, Ronchetti E: Robust inference for generalized linear models. J Am Stat Assoc. 2001, 96: 1022-1030. 10.1198/016214501753209004.
14. Vickers AJ, Elkin EB: Decision curve analysis: a novel method for evaluating prediction models. Med Decis Making. 2006, 26: 565-574. 10.1177/0272989X06295361.
15. Pencina MJ, D'Agostino RB, D'Agostino RB, Vasan RS: Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond. Stat Med. 2008, 27: 157-172. 10.1002/sim.2929.
16. R Core Team: R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. 2012, [http://www.R-project.org/]
17. Ho JE, Levy D, Rose L, Johnson AD, Ridker PM, Chasman DI: Discovery and replication of novel blood pressure genetic loci in the Women's Genome Health Study. J Hypertens. 2011, 29: 62-69. 10.1097/HJH.0b013e3283406927.

Acknowledgements

This work was supported by the Deutsche Forschungsgemeinschaft (DFG) grant SFB/TRR77 (Project Z2).

The GAW18 whole genome sequence data were provided by the T2D-GENES Consortium, which is supported by NIH grants U01 DK085524, U01 DK085584, U01 DK085501, U01 DK085526, and U01 DK085545. The other genetic and phenotypic data for GAW18 were provided by the San Antonio Family Heart Study and San Antonio Family Diabetes/Gallbladder Study, which are supported by NIH grants P01 HL045222, R01 DK047482, and R01 DK053889. The Genetic Analysis Workshop is supported by NIH grant R01 GM031575.

This article has been published as part of BMC Proceedings Volume 8 Supplement 1, 2014: Genetic Analysis Workshop 18. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcproc/supplements/8/S1. Publication charges for this supplement were funded by the Texas Biomedical Research Institute.

Author information

Authors and Affiliations

Institute of Medical Biometry and Informatics, University of Heidelberg, Im Neuenheimer Feld 305, 69120, Heidelberg, Germany
Miriam Kesselmeier, Carine Legrand, Barbara Peil & Justo Lorenzo Bermejo
Molecular Genetics of Breast Cancer, Deutsches Krebsforschungszentrum (DKFZ), Im Neuenheimer Feld 580, 69120, Heidelberg, Germany
Maria Kabisch & Ute Hamann
Institute of Human Genetics, University Hospital Heidelberg, Im Neuenheimer Feld 366, 69120, Heidelberg, Germany
Christine Fischer
Clinical Epidemiology, Integrated Research and Treatment Center, Center for Sepsis Control and Care (CSCC), Jena University Hospital, Erlanger Allee 101, 07747, Jena, Germany
Miriam Kesselmeier

Corresponding author

Correspondence to Justo Lorenzo Bermejo.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

MK analysed and interpreted the data, created figures and tables, searched literature, and drafted the manuscript. CL identified relevant literature, supported data analysis and interpretation, and reviewed the manuscript. BP and MKa supported data analysis and interpretation and reviewed the manuscript. CF and UH supported interpretation and reviewed the manuscript. JLB formulated study goals, supported data analysis and interpretation, and reviewed the manuscript. All authors read and approved the final version.

About this article

Cite this article

Kesselmeier, M., Legrand, C., Peil, B. et al. Practical investigation of the performance of robust logistic regression to predict the genetic risk of hypertension. BMC Proc 8 (Suppl 1), S65 (2014). https://doi.org/10.1186/1753-6561-8-S1-S65

