Logistic Regression

  • Chapter
Multivariate Analysis

Abstract

In many problems in science and practice, the following questions arise: Which of two or more alternative states is present, or which event will occur? Which factors are suitable for the decision or prognosis, and what influence do they have on the occurrence of a state or event? Often only two alternative states or events are involved, as in the question of whether a patient has a certain disease or not. Logistic regression can be used to answer such questions. With regard to the problem definition, logistic regression is similar to discriminant analysis. The main difference between the two methods is that logistic regression directly provides probabilities for the occurrence of the alternative states or for membership in the individual groups.

Notes

  1.

    Cf. Hastie et al. (2011, pp. 2, 300). The data set “Spambase” contains information on 4601 emails and is publicly available at https://archive.ics.uci.edu.

  2.

    Such a variable is called a Bernoulli variable, and the events can be seen as outcomes of a Bernoulli trial. The resulting probability distribution is called the Bernoulli distribution. The name goes back to Jakob Bernoulli (1655–1705). The simplest example of a Bernoulli trial is the tossing of a coin, with the expected value E(Y) = π = 0.5 and the variance V(Y) = π(1 – π) = 0.25. The Bernoulli distribution is a special case of the binomial distribution for N = 1 trial. The binomial distribution results from a sequence of N independent Bernoulli trials. Correspondingly, the buying frequency (sum of buyers) is binomially distributed with sample size N. With increasing N, the binomial distribution converges to the normal distribution.
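The moments quoted in this note can be checked with a few lines of code. The following is a minimal sketch in Python; the function names are illustrative and not taken from the chapter.

```python
from math import comb

def bernoulli_moments(pi):
    """Return (mean, variance) of a Bernoulli variable with success probability pi."""
    return pi, pi * (1.0 - pi)

def binomial_pmf(k, n, pi):
    """Probability of exactly k successes in n independent Bernoulli trials."""
    return comb(n, k) * pi**k * (1.0 - pi) ** (n - k)

def binomial_moments(n, pi):
    """Return (mean, variance) of the number of successes in n trials."""
    return n * pi, n * pi * (1.0 - pi)

# Fair coin: E(Y) = 0.5 and V(Y) = 0.25
print(bernoulli_moments(0.5))

# With n = 1 the binomial distribution reduces to the Bernoulli distribution
print(binomial_pmf(1, 1, 0.5))
```

For a single trial (n = 1), `binomial_moments` returns the same values as `bernoulli_moments`, which mirrors the statement that the Bernoulli distribution is the special case N = 1.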

  3.

    This is the reason for the broad usage and importance of the logistic function: it is much easier to handle than the distribution function of the normal distribution, which can only be expressed as an integral and is therefore difficult to calculate. The logistic function was developed by the Belgian mathematician Pierre-François Verhulst (1804–1849) to describe and predict population growth as an improved alternative to the exponential function. The constant e ≈ 2.71828 is Euler’s number, which also serves as the base of the natural logarithm.
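To illustrate the difference in handling, the sketch below evaluates the logistic function in closed form and, for comparison, the standard normal distribution function via the error function from Python's standard library. The function names and the evaluation points are assumptions chosen for illustration only.

```python
from math import erf, exp, sqrt

def logistic(z):
    """Logistic function: a closed-form S-shaped curve with values in (0, 1)."""
    return 1.0 / (1.0 + exp(-z))

def normal_cdf(z):
    """Standard normal distribution function, expressed through the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Both curves are S-shaped and pass through 0.5 at z = 0
for z in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(z, round(logistic(z), 4), round(normal_cdf(z), 4))
```

The logistic function needs only one exponential, whereas the normal distribution function relies on a numerically approximated integral (here hidden inside `erf`).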

  4.

    Categorical independent variables with more than two categories must be decomposed into binary variables, as in linear regression analysis.
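The decomposition into binary variables can be sketched as follows. This is a minimal illustration in Python; the function name, the choice of reference category, and the example categories are assumptions, not part of the chapter.

```python
def dummy_code(values, reference=None):
    """Decompose a categorical variable into 0/1 variables.

    One category serves as the reference and gets no column of its own,
    so K categories yield K - 1 binary variables.
    """
    categories = sorted(set(values))
    if reference is None:
        reference = categories[0]
    if reference not in categories:
        raise ValueError("reference category does not occur in the data")
    kept = [c for c in categories if c != reference]
    rows = [[1 if v == c else 0 for c in kept] for v in values]
    return kept, rows

# Hypothetical variable with three categories
columns, coded = dummy_code(["bus", "car", "tram", "car"], reference="bus")
print(columns)
print(coded)
```

Observations in the reference category are coded with zeros in all columns, so the coefficients of the binary variables are interpreted relative to that category.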

  5.

    Within the framework of generalized linear models (GLM), logit(π) forms a so-called link function by means of which a linear relationship is established between the expected value of a dependent variable and the systematic component of the model. The logit link is used in particular when a binomial distribution of the dependent variable is assumed (cf. Agresti 2013, pp. 112–122; Fox 2015, p. 418 ff.).
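Written out, the logit link described in this note takes the following form. The symbols \(b_0, b_j, x_j\) and \(z\) are assumed here as notation for the constant term, the coefficients, the independent variables and the systematic component; the chapter's own numbering of equations is not reproduced.

```latex
\[
  \operatorname{logit}(\pi) \;=\; \ln\!\left(\frac{\pi}{1-\pi}\right) \;=\; z ,
  \qquad z \;=\; b_0 + \sum_{j=1}^{J} b_j x_j
\]
\[
  \pi \;=\; \frac{1}{1 + e^{-z}}
\]
```

The first line maps the probability \(\pi \in (0, 1)\) to the whole real line, where it can be modeled linearly; the second line is its inverse, the logistic function.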

  6.

    On the website www.multivariate-methods.info, we provide supplementary material (e.g., Excel files) to deepen the reader’s understanding of the methodology.

  7.

    More information on the linear probability model can be found in Agresti (2013, p. 117; 1996, p. 74); Hosmer and Lemeshow (2000, p. 5).

  8.

    These groups (classes) must be distinguished from the category groups of the dependent variable Y.

  9.

    The concept of the ROC curve originates from communications engineering. It was originally developed during the Second World War for the detection of radar signals or enemy objects and is used today in many scientific fields (see, e.g., Agresti 2013, pp. 224 ff.; Hastie et al. 2011, pp. 313 ff.; Hosmer et al. 2013, pp. 173 ff.). SPSS offers a procedure for creating ROC curves for given classification probabilities or discriminant values. The above ROC curve was created with Excel.
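How a ROC curve and its AUC follow from given classification probabilities can be sketched in a few lines of Python. This is not the SPSS or Excel procedure mentioned in the note; the function names and the six observations are hypothetical.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs.

    Each distinct score is used once as cutoff; an observation is
    classified as positive if its score is >= the cutoff.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both groups must be present")
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= cutoff and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= cutoff and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical data: 1 = event, 0 = no event, with estimated probabilities
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auc(roc_points(labels, scores)))
```

In this example, 8 of the 9 possible pairs of one event and one non-event are ranked correctly by the scores, so the AUC equals 8/9.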

  10.

    We also get the same value for AUC if we apply discriminant analysis to our data. Alternatively, one can create the ROC curve based on discriminant values or classification probabilities.

  11.

    Another danger of “false negatives” is the risk that sick persons may spread an infectious disease. The high rate of “false negative” test results contributed to the rapid spread of the corona virus at the beginning of the pandemic in 2020 (cf. Watson et al. 2020).

  12.

    The principle of the ML method goes back to Daniel Bernoulli (1700–1782), a nephew of Jakob Bernoulli. Ronald A. Fisher (1890–1962) analyzed the statistical properties of the ML method and paved the way for its practical application and dissemination. Alongside the least-squares method, the ML method is the most important statistical estimation method.

  13.

    For logistic regression, quasi-Newton methods are primarily used, which converge quite quickly. These methods are based on Newton’s method for finding the zero of a function. They use the first and second partial derivatives of the LL function with respect to the unknown parameters to find the optimum. The derivatives are approximated differently depending on the method. Special methods are the Gauss–Newton method and its further development, the Newton–Raphson method. The method of Iteratively Reweighted Least Squares (IRLS) is now also widely used. Cf., e.g., Agresti (2013, pp. 149 ff.); Fox (2015, pp. 431 ff.); Press et al. (2007, pp. 521 ff.).
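The idea of using first and second derivatives of the LL function can be made concrete with a plain Newton–Raphson iteration for a model with one independent variable. The following Python sketch is an illustration under simplifying assumptions (exact derivatives, no step-size control, hypothetical data); it is not the algorithm implemented in SPSS or any other package.

```python
from math import exp

def fit_logistic_newton(x, y, iterations=25, tol=1e-10):
    """Estimate intercept b0 and slope b1 by Newton-Raphson on the LL function."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        # First derivatives (g) and negative second derivatives (h) of LL
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        # Newton step: solve the 2x2 system h * step = g
        det = h00 * h11 - h01 * h01
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) < tol and abs(step1) < tol:
            break
    return b0, b1

# Hypothetical data in which the two groups overlap, so that finite estimates exist
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 1, 0, 1, 0, 1, 1]
print(fit_logistic_newton(x, y))
```

At the optimum, both first derivatives are zero, which provides a simple check of the result. If the groups can be separated perfectly by the independent variable, the estimates diverge and this simple iteration fails.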

  14.

    McFadden (1974) has shown that with a linear systematic component of the logistic model, the LL function is globally concave, which makes maximization much easier.

  15.

    The term “odds” is used only in the plural. The concept of odds and its usefulness was described by the Italian mathematician and physician Gerolamo Cardano (1501–1576), who had to support himself by gambling. In his “Book on Games of Chance” he wrote the first treatise on probability. The theory of probability emerged only later, in the seventeenth century, with the works of the scientists Pierre de Fermat (1601–1665), Blaise Pascal (1623–1662) and Jakob Bernoulli (1655–1705).

  16.

    The name “logit” was introduced by Joseph Berkson in 1944, who used it as an abbreviation for “logistic unit”, in analogy to the abbreviation “probit” for “probability unit”. Berkson contributed strongly to the development and popularization of logistic regression.

  17.

    That is the reason why we used the “equal by definition” sign in Eqs. (5.33) and (5.34).

  18.

    Alternatively, we may calculate the odds ratios with Eq. (5.33):

    \(\text{OR}_{m} = e^{b_{2}} = e^{1.751} = 5.76\) and \(\text{OR}_{w} = e^{-b_{2}} = e^{-1.751} = 0.174\).
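The two odds ratios in this note can be reproduced directly from the coefficient. The short Python sketch below uses the value 1.751 quoted in the note; the function name is illustrative.

```python
from math import exp

def odds_ratio(coefficient):
    """Odds ratio implied by a logistic regression coefficient."""
    return exp(coefficient)

b2 = 1.751  # coefficient quoted in the note
print(round(odds_ratio(b2), 2))    # 5.76
print(round(odds_ratio(-b2), 3))   # 0.174
```

Because the two exponents differ only in sign, the two odds ratios are reciprocals of each other.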

  19.

    In common language, the term risk is associated with negative events, such as accidents, illness or death. Here the term risk refers to the probability of any uncertain event.

  20.

    This can be the case in so-called case-control studies where groups are not formed by random sampling. Thus the size of the groups cannot be used for the estimation of probabilities. Such studies are often carried out for the analysis of rare events, e.g. in epidemiology, medicine or biology (cf. Agresti 2013, pp. 42–43; Hosmer et al. 2013, pp. 229–230).

  21.

    Thus, in SPSS the LLR statistic is denoted as chi-square. For the likelihood ratio test statistic see, e.g., Agresti (2013, p. 11); Fox (2015, pp. 346–348).

  22.

    For a brief summary of the basics of statistical testing see Sect. 1.3.

  23.

    We can calculate the p-value with Excel by using the function CHISQ.DIST.RT(x;df). Here, we get CHISQ.DIST.RT(9.35;2) = 0.009.
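The same p-value can be obtained without Excel. For an even number of degrees of freedom, the upper tail of the chi-square distribution has a closed form, which the following Python sketch uses; the function name is an assumption, and the restriction to even df is a simplification made to stay within the standard library.

```python
from math import exp, factorial

def chi2_upper_tail_even_df(x, df):
    """Upper-tail probability P(X >= x) of a chi-square variable with even df.

    For even df the tail equals exp(-x/2) * sum_{k < df/2} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch only covers positive, even df")
    half = x / 2.0
    return exp(-half) * sum(half**k / factorial(k) for k in range(df // 2))

# Values from the note: test statistic 9.35 with 2 degrees of freedom
print(round(chi2_upper_tail_even_df(9.35, 2), 3))  # 0.009
```

For arbitrary degrees of freedom, a statistics library such as SciPy (`scipy.stats.chi2.sf`) can be used instead.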

  24.

    Both tests are used in SPSS, but the LR test is only used in the NOMREG procedure for multinomial logistic regression, not in binary Logistic Regression.

  25.

    Named after the Hungarian mathematician Abraham Wald (1902–1950). For the Wald test see Agresti (2013, p. 10); Hosmer et al. (2013, pp. 42–44).

  26.

    The reason is that the standard error becomes too large, especially if the absolute value of the coefficient is large. This makes the Wald statistic too small and the p-value too large (as found by Hauck and Donner 1977). Agresti (2013, p. 169) points out that the likelihood ratio test uses more information than the Wald test and is therefore preferable.

  27.

    The user can find all Excel files used in this chapter on the website www.multivariate-methods.info.

  28.

    Binary logistic regression can also be performed using the SPSS syntax as shown in Sect. 5.4.4 (Fig. 5.42).

  29.

    An alternative is to center the parameters so that their sum across the two categories is zero.

  30.

    For this example we used a second data set with 50 observations.

  31.

    In SPSS (procedure NOMREG) the user can choose any category as the reference category and thus determine the odds using the baseline logit model. This is done in the dialog box by choosing the option “Reference Category” and “Custom”. By default, the last category G is chosen. The category with the lowest coding is chosen if the user chooses the category order “Descending” (the default setting is “Ascending”).

  32.

    For X2 = 0, the p-value has to be 1.0, but it cannot be calculated as there are no degrees of freedom for this model. It just serves to demonstrate the principle of the calculation. The predicted (expected) probabilities here are equal to the relative frequencies of the observed values in the respective subpopulation, i.e. for men or for women.

  33.

    We will use the dataset introduced in Chap. 4 for discriminant analysis in order to better illustrate similarities and differences between the two methods.

  34.

    Missing values are a frequent and unfortunately unavoidable problem when conducting surveys (e.g. because people cannot or do not want to answer some question(s), or as a result of mistakes by the interviewer). The handling of missing values in empirical studies is discussed in Sect. 1.5.2.

  35.

    There is a third group of variables that vary over persons and alternatives, e.g., the perceived attributes that we encountered in the case study.

  36.

    Logit choice models became popular through the work of Daniel McFadden (1974), who laid the foundations for these models and their applications. In 2000 he won the Nobel Prize in economics. More information on these models can be found in the books of Ben-Akiva and Lerman (1985), Hensher et al. (2015), and Train (2009). Examples of their application are the choice of transport alternatives (e.g. car, tram, bus, bicycle, walking; McFadden 1984) or the interpretation of market data derived from scanner panels (e.g. Guadagni and Little 1983; Jain et al. 1994).

  37.

    SPSS has no special procedure for logit choice analysis but the procedure COXREG for Cox-Regression may be used for this calculation.

  38.

    DA assumes that the independent variables follow a multivariate normal distribution, whereas LRA assumes that the dependent variable follows a binomial or multinomial distribution.

References

  • Agresti, A. (2013). Categorical data analysis. New Jersey: John Wiley.

  • Ben-Akiva, M., & Lerman, S. (1985). Discrete choice analysis. Cambridge: MIT Press.

  • Fox, J. (2015). Applied regression analysis and generalized linear models. Los Angeles: SAGE.

  • Gigerenzer, G. (2002). Calculated Risks. How to know when numbers deceive you. New York: Simon & Schuster.

  • Guadagni, P., & Little, J. (1983). A logit model of brand choice calibrated on scanner data. Marketing Science, 2(3), 203–238.

  • Hastie, T., Tibshirani, R., & Friedman, J. (2011). The elements of statistical learning. New York: Springer.

  • Hauck, W., & Donner, A. (1977). Wald’s test as applied to hypotheses in logit analysis. Journal of the American Statistical Association, 72, 851–853.

  • Hensher, D., Rose, J., & Greene, W. (2015). Applied choice analysis. Cambridge: Cambridge University Press.

  • Hosmer, D., & Lemeshow, S. (2000). Applied logistic regression. New York: Wiley.

  • Hosmer, D., Lemeshow, S., & Sturdivant, R. (2013). Applied logistic regression. New York: Wiley.

  • James, G., Witten, D., Hastie, T., & Tibshirani, R. (2014). An Introduction to Statistical Learning. New York: Springer.

  • Jain, D., Vilcassim, N., & Chintagunta, P. (1994). A random-coefficients logit brand-choice model applied to panel data. Journal of Business & Economic Statistics, 13(3), 317–326.

  • Lim, T., Loh, W., & Shih, Y. (2000). A comparison of predicting accuracy, complexity, and training time of thirty-three old and new classification algorithms. Machine Learning, 40(3), 203–229.

  • Louviere, J., Hensher, D., & Swait, J. (2000). Stated choice methods. Cambridge: Cambridge University Press.

  • McFadden, D. (1974). Conditional logit analysis of qualitative choice behavior. In P. Zarembka (Ed.), Frontiers in econometrics, 40 (pp. 105–142). Cambridge: Academic Press.

  • McFadden, D. (1984). Econometric analysis of qualitative response models. In Handbook of econometrics, Volume II (Chapter 24). Elsevier Science Publishers BV.

  • Michie, D., Spiegelhalter, D., & Taylor, C. (1994). Machine learning, neural and statistical classification. Chichester: Ellis Horwood Limited.

  • Pearl, J. (2018). The Book of Why. The new science of cause and effect. New York: Basic Books.

  • Press, W., Flannery, B., Teukolsky, S., & Vetterling, W. (2007). Numerical recipes – The art of scientific computing. Cambridge: Cambridge University Press.

  • Train, K. (2009). Discrete choice methods with simulation. Cambridge: Cambridge University Press.

  • Watson, J., Whiting, P., & Brush, J. (2020). Practice pointer: Interpreting a covid-19 test result. British Medical Journal, 369, m1808.

Further Reading

  • IBM Corporation. (2017). IBM SPSS Regression 25. Armonk, NY: IBM Corporation.

  • Hair, J., Black, W., Babin, B., & Anderson, R. (2010). Multivariate data analysis. Englewood Cliffs: Pearson.

  • Maddala, G. (1983). Limited-dependent and qualitative variables in econometrics. Cambridge: Cambridge University Press.

  • McCullagh, P., & Nelder, J. (1989). Generalized linear models. London: Chapman and Hall.

Author information

Corresponding author

Correspondence to Klaus Backhaus.

Copyright information

© 2021 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Fachmedien Wiesbaden GmbH, part of Springer Nature

About this chapter

Cite this chapter

Backhaus, K., Erichson, B., Gensler, S., Weiber, R., Weiber, T. (2021). Logistic Regression. In: Multivariate Analysis. Springer Gabler, Wiesbaden. https://doi.org/10.1007/978-3-658-32589-3_5

  • DOI: https://doi.org/10.1007/978-3-658-32589-3_5

  • Publisher Name: Springer Gabler, Wiesbaden

  • Print ISBN: 978-3-658-32588-6

  • Online ISBN: 978-3-658-32589-3

  • eBook Packages: Business and Economics (German Language)
