Regression Modeling Strategies, pp 219-274

# Binary Logistic Regression

## Abstract

Binary responses are commonly studied in many fields. Examples include the presence or absence of a particular disease, death during surgery, and a consumer purchasing a product. Often one wishes to study how a set of predictor variables *X* is related to a dichotomous response variable *Y*. The predictors may describe such quantities as treatment assignment, dosage, risk factors, and calendar time. For convenience we define the response to be *Y* = 0 or 1, with *Y* = 1 denoting the occurrence of the event of interest. A dichotomous outcome can often be studied by calculating certain proportions, for example, the proportion of deaths among females and the proportion among males. In many situations, however, there are multiple descriptors, or one or more of the descriptors are continuous. Without a statistical model, studying patterns such as the relationship between age and the occurrence of a disease would require creating arbitrary age groups to estimate disease prevalence as a function of age.
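As a rough illustration of the last point, the sketch below fits a one-predictor logistic model, P(Y = 1 | age) = 1 / (1 + exp(-(b0 + b1·age))), so that the estimated probability of disease varies smoothly with age and no age groups are needed. The data, the function names `fit_logistic` and `predict`, and the choice of plain Newton-Raphson iterations are assumptions made for this example only; they are not taken from the chapter.

```python
import math

def predict(b0, b1, x):
    """Estimated P(Y = 1 | x) under the logistic model."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def fit_logistic(x, y, iterations=25):
    """Maximum likelihood fit of intercept b0 and slope b1 by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        p = [predict(b0, b1, xi) for xi in x]
        # Score vector: derivatives of the log likelihood w.r.t. b0 and b1.
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Information matrix (2 x 2), built from weights p * (1 - p).
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2 x 2 system explicitly.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical data: age in years and disease indicator (Y = 1 if diseased).
ages = [25, 30, 35, 40, 45, 50, 55, 60, 65, 70]
events = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(ages, events)
for age in (30, 50, 70):
    print(age, round(predict(b0, b1, age), 3))
```

The fitted curve gives a probability at any age, including ages not observed in the sample, which is what a set of grouped proportions cannot do without an arbitrary choice of cut points.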

## Keywords

Logistic Model, Binary Logistic Regression, Spline Function, Wald Statistic, Brier Score