Regression Analysis

A chapter in the book Multivariate Analysis

Abstract

Regression analysis is one of the most flexible and most frequently used multivariate methods. It is employed to analyze relationships between a metrically scaled dependent variable and one or more metrically scaled independent variables. In particular, it is used to describe and explain relationships quantitatively and, on this basis, to estimate or predict values of the dependent variable. Regression analysis is of great importance in both research and practice.


Notes

  1.

    Galton (1886) investigated the relationship between the body heights of parents and their adult children. He “regressed the height of children on the height of parents”.

  2.

    Sales can also depend on environmental factors such as competition, socio-economic influences, or the weather. Another difficulty is that advertising itself is a complex bundle of factors that cannot simply be reduced to expenditures. The impact of advertising depends on its quality, which is difficult to measure, and it also depends on the media that are used (e.g., print, radio, television, internet). These and other reasons make it very difficult to measure the effect of advertising.

  3.

    On the website www.multivariate-methods.info we provide supplementary material (e.g., Excel files) to deepen the reader’s understanding of the methodology.

  4.

    See Sects. 2.2.3.3 and 2.2.5.

  5.

    In regression analysis we encounter the problem of multicollinearity. We will deal with this problem in Sect. 2.2.5.7.

  6.

    The terms association and correlation are widely and often interchangeably used in data analysis, but there are differences. Association refers to any kind of relation between variables: two variables are said to be associated if the values of one variable tend to change in some systematic way along with the values of the other variable, so that a scatterplot of the variables shows a systematic pattern. Correlation is a more specific term. It refers to associations in the form of a linear trend and is a measure of the strength of this association. Pearson's correlation coefficient measures the strength of a linear trend, i.e., how close the points lie to a straight line. Spearman's rank correlation measures the strength of a monotonic association and can therefore also be used for non-linear (but monotonic) trends.

  7.

    These basic statistics can easily be calculated with the Excel functions AVERAGE(range) for the mean, STDEV.S(range) for the standard deviation, and CORREL(range1;range2) for the correlation.
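
    For readers who prefer working outside Excel, the following is a minimal Python sketch of the same computations with numpy (the values of x and y are illustrative placeholders, not data from this chapter):

    import numpy as np

    x = [10.0, 12.0, 9.0, 15.0, 11.0]   # illustrative values only
    y = [2.1, 2.6, 1.9, 3.2, 2.4]

    mean_x = np.mean(x)              # AVERAGE(range)
    std_x = np.std(x, ddof=1)        # STDEV.S(range): sample standard deviation
    r_xy = np.corrcoef(x, y)[0, 1]   # CORREL(range1;range2): Pearson correlation

    print(mean_x, std_x, r_xy)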

  8.

    Blalock (1964, p. 51) writes: “A large correlation merely means a low degree of scatter …. It is the regression coefficients which give us the laws of science.”

  9.

    With the optimization tool Solver of MS Excel it is easy to find this solution without differential calculus or knowing any formulas. One chooses the cell that contains the value of SSR (the sum at the bottom of the rightmost column in Table 2.5) as the target cell (objective). The cells that contain the parameters a and b are chosen as the changing cells. Then minimizing the objective will yield the least-squares estimates of the parameters within the changing cells.
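
    The same idea can be sketched outside Excel; the following Python snippet minimizes SSR numerically with scipy.optimize (the data are illustrative placeholders, not the values from Table 2.5):

    import numpy as np
    from scipy.optimize import minimize

    # Illustrative data, standing in for the x- and y-values of the example
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    def ssr(params):
        a, b = params                    # constant a and slope b of y_hat = a + b*x
        residuals = y - (a + b * x)
        return np.sum(residuals ** 2)    # sum of squared residuals (SSR)

    # Minimizing SSR over (a, b) reproduces the least-squares estimates numerically
    result = minimize(ssr, x0=[0.0, 0.0])
    a_hat, b_hat = result.x
    print(a_hat, b_hat)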

  10.

    Carl Friedrich Gauß (1777–1855) used the method in 1795 at the age of only 18 years for calculating the orbits of celestial bodies. This method was also developed independently by the French mathematician Adrien-Marie Legendre (1752–1833). G. Udny Yule (1871–1951) first applied it to regression analysis.

  11.

    When using matrix algebra for the calculation, the constant term is treated as the coefficient of a fictitious variable whose values are all equal to 1. In this way it can be computed in the same way as the other coefficients, which simplifies the calculation.
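
    A small numerical sketch of this trick in Python (illustrative data; numpy's least-squares routine stands in for solving the normal equations explicitly):

    import numpy as np

    # Illustrative data with two regressors
    x1 = np.array([1.0, 2.0, 3.0, 4.0])
    x2 = np.array([0.5, 0.3, 0.8, 0.9])
    y  = np.array([2.0, 2.5, 3.7, 4.1])

    # Prepend a column of ones: its "coefficient" is the constant term
    X = np.column_stack([np.ones_like(x1), x1, x2])

    # Solve the least-squares problem (equivalently, the normal equations X'Xb = X'y)
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    print(b)   # b[0] is the constant, b[1] and b[2] are the regression coefficients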

  12.

    This holds only for linear models and LS estimation. The principle is also of central importance for the analysis of variance (ANOVA, cf. Chap. 3) and for discriminant analysis (cf. Chap. 4).

  13.

    This is called inferential statistics and has to be distinguished from descriptive statistics. Inferential statistics makes inferences and predictions about a population based on a sample drawn from the studied population.

  14.

    For a simple coefficient of correlation r, we get \(F_{\text{emp}} = \frac{r^{2}}{(1 - r^{2})/(N - 2)}\), as J = 1.

  15.

    With Excel we can calculate the p-value by using the function F.DIST.RT(Femp;df1;df2). We get: F.DIST.RT(31.50;3;8) = 0.00009 or 0.009%.
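
    Equivalently, the right-tail F probability can be obtained in Python with scipy.stats (a sketch, assuming the value of Femp and the degrees of freedom quoted above):

    from scipy import stats

    # Right-tail p-value of the F distribution, the analogue of Excel's F.DIST.RT
    F_emp, df1, df2 = 31.50, 3, 8
    p_value = stats.f.sf(F_emp, df1, df2)
    print(p_value)   # approximately 0.00009, matching F.DIST.RT(31.50;3;8)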

  16.

    The reader should be aware that other values for α are also possible. α = 5% is a kind of “gold standard” in statistics that goes back to Sir R. A. Fisher (1890–1962), after whom the F-distribution is named. But the researcher must also consider the consequences (costs) of making a wrong decision.

  17.

    Other criteria for model assessment and selection are the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). See, e.g., Agresti (2013, p. 212); Greene (2012, p. 179); Hastie et al. (2011, pp. 219–257).

  18.

    For a brief summary of the basics of statistical testing see Sect. 1.3.

  19.

    With Excel we can calculate the critical value \(t_{\alpha /2}\) for a two-tailed t-test by using the function T.INV.2T(α;df). We get: T.INV.2T(0.05;8) = 2.306.

  20.

    The p-values can be calculated with Excel by using the function T.DIST.2T(ABS(temp);df). For the variable price we get: T.DIST.2T(3.20;8) = 0.0126 or 1.3%.

  21.

    With Excel we can calculate the critical value \(t_{\alpha }\) for a one-tailed t-test by using the function T.INV(1 – α;df). We get: T.INV(0.95;8) = 1.860.

  22.

    With Excel we can calculate the p-value for the right tail by the function T.DIST.RT(temp;df). For the variable advertising we get: T.DIST.RT(5.89;8) = 0.00018 or 0.018%.

  23.

    Using Excel, we can calculate the critical value for a lower-tail t-test by T.INV(α;df). We get: T.INV(0.05;8) = –1.860.
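
    The t-distribution functions used in the preceding notes have direct counterparts in Python's scipy.stats; the following sketch reproduces the values quoted above (df = 8, as in the example):

    from scipy import stats

    df = 8
    t_crit_two = stats.t.ppf(1 - 0.05 / 2, df)   # cf. T.INV.2T(0.05;8)  -> approx. 2.306
    p_two = 2 * stats.t.sf(3.20, df)             # cf. T.DIST.2T(3.20;8) -> approx. 0.0126
    t_crit_upper = stats.t.ppf(0.95, df)         # cf. T.INV(0.95;8)     -> approx. 1.860
    p_right = stats.t.sf(5.89, df)               # cf. T.DIST.RT(5.89;8) -> approx. 0.00018
    t_crit_lower = stats.t.ppf(0.05, df)         # cf. T.INV(0.05;8)     -> approx. -1.860
    print(t_crit_two, p_two, t_crit_upper, p_right, t_crit_lower)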

  24.

    Cf. e.g., Kmenta (1997, p. 392); Fox (2008, p. 105); Greene (2012, p. 92); Wooldridge (2016, p. 79); Gelman and Hill (2018, p. 45). You will find slight differences between the formulations of the different authors.

  25.

    This follows from the Gauss-Markov theorem. See e.g. Fox (2008, p. 103); Kmenta (1997, p. 216).

  26.

    The central limit theorem plays an important role in statistical theory. It states that the sum or mean of n independent random variables tends toward a normal distribution if n is sufficiently large, even if the original variables themselves are not normally distributed. This is the reason why a normal distribution can be assumed for many phenomena.

  27.

    Anscombe and Tukey (1963) demonstrated the power of graphical techniques in data analysis.

  28.

    In an experiment the researcher actively manipulates the independent variable X and observes the resulting changes in the dependent variable Y while, as far as possible, excluding any other influences on Y. For the design of experiments see e.g. Campbell and Stanley (1966); Green et al. (1988).

  29.

    Switzerland was the top performer in both chocolate consumption and the number of Nobel Prizes. See Messerli, F. H. (2012). Chocolate consumption, cognitive function, and Nobel laureates. The New England Journal of Medicine, 367(16), 1562–1564.

  30.

    For causal inference in regression see Freedman (2012); Pearl and Mackenzie (2018, p. 72). Problems like this one are covered by path analysis, originally developed by Sewall Wright (1889–1988), and structural equation modeling (SEM), cf. e.g. Kline (2016); Hair et al. (2014).

  31.

    “Mistaking a mediator for a confounder is one of the deadliest sins in causal inference.” (Pearl and Mackenzie 2018, p. 276).

  32.

    The expression goes back to Francis Galton (1886), who called it “regression towards mediocrity”. Galton wrongly interpreted it as a causal effect in human heredity. It is ironic that the first and most important method of multivariate data analysis got its name from something that means the opposite of what regression analysis actually intends to do. Cf. Kahneman (2011, p. 175); Pearl and Mackenzie (2018, p. 53).

  33.

    Cf. Freedman et al. (2007, p. 169). In econometric analysis this effect is called least squares attenuation or attenuation bias. Cf., e.g., Kmenta (1997, p. 346); Greene (2012, p. 280); Wooldridge (2016, p. 306).

  34.

    In psychology great efforts have been undertaken, beginning with Charles Spearman in 1904, to measure empirically the reliability of measurement methods and thus derive corrections for attenuation. Cf., e.g., Hair et al. (2014, p. 96); Charles (2005).

  35.

    An overview of this test and other tests is given by Kmenta (1997, p. 292); Maddala and Lahiri (2009, p. 214).

  36.

    From a Durbin-Watson table we derive the values dL = 0.97 and dU = 1.33 and thus 1.33 < DW < 2.67 (no autocorrelation).
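
    For illustration, the Durbin–Watson statistic itself is easy to compute from a vector of regression residuals; a minimal Python sketch (the residuals shown are illustrative, not those of the chapter example):

    import numpy as np

    def durbin_watson(residuals):
        # DW = sum of squared first differences of the residuals / sum of squared residuals
        e = np.asarray(residuals)
        return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

    # Illustrative residuals; values near 2 indicate no first-order autocorrelation
    e = np.array([0.5, -0.3, 0.2, -0.4, 0.1, 0.3, -0.2, -0.1, 0.4, -0.5])
    print(durbin_watson(e))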

  37.

    If the errors are normally distributed, the y-values, which contain the errors as additive elements, are also normally distributed. And since the least-squares estimators form linear combinations of the y-values, the parameter estimates are normally distributed, too.

  38.

    Numerical significance tests for normality include the Kolmogorov–Smirnov test and the Shapiro–Wilk test.

  39.

    The matrix X′X is singular and cannot be inverted.

  40.

    Numerically these areas can be expressed by their sums of squares: \(SS_{Y} = \sum \left( y_{k} - \overline{y} \right)^{2}\) and \(SS_{X_{j}} = \sum \left( x_{jk} - \overline{x}_{j} \right)^{2}\).

  41.

    See Belsley et al. (1980, p. 93).

  42.

    Very small tolerance values can lead to computational problems. By default, SPSS will not allow variables with Tj < 0.0001 to enter the model.
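
    The tolerance of a regressor can be computed directly from the data; the following Python sketch (with an illustrative regressor matrix) regresses each variable on the remaining ones and reports T_j = 1 − R_j² together with the variance inflation factor VIF_j = 1/T_j:

    import numpy as np

    def tolerance(X, j):
        # T_j = 1 - R_j^2, where R_j^2 stems from regressing X_j on the other regressors
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        Z = np.column_stack([np.ones(len(y)), others])   # include a constant term
        coef = np.linalg.lstsq(Z, y, rcond=None)[0]
        resid = y - Z @ coef
        r2 = 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
        return 1 - r2

    # Illustrative regressor matrix (columns = variables)
    X = np.array([[1.0, 2.0, 1.5],
                  [2.0, 3.9, 2.1],
                  [3.0, 6.1, 2.9],
                  [4.0, 8.0, 4.2],
                  [5.0, 9.9, 5.1]])
    for j in range(X.shape[1]):
        T_j = tolerance(X, j)
        print(j, T_j, 1 / T_j)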

  43.

    Another method to counter multicollinearity, which is beyond the scope of this text, is ridge regression. By this method one trades a small amount of bias in the estimators for a large reduction in variance. See Fox (2008, p. 325); Kmenta (1997, p. 440); Belsley et al. (1980, p. 219).

  44.

    Excellent treatments of this topic may be found in Belsley et al. (1980); Fox (2008, p. 246). SPSS provides numerous diagnostic statistics for this purpose.

  45.

    Before doing a regression analysis one can use exploratory techniques of data analysis, like box plots (box-and-whisker plots), for checking the data and detecting possible outliers. But these methods do not show the effects on regression.

  46.

    This may be different when the number of variables is large. In this case the detection of multivariate outliers by scatterplots can be difficult (see Belsley et al. 1980, p. 17).

  47.

    With Excel we can calculate: p(|z| ≥ 1.59) = 2*(1 − NORM.S.DIST(1.59;1)) = 0.112.

  48.

    A modified measure of this distance is the centered leverage \(h^{\prime}_{i} = h_{i} - \frac{J + 1}{N}\) with \(0 \le h^{\prime}_{i} \le 1\).
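
    For illustration, the leverage values h_i are the diagonal elements of the hat matrix H = X(X′X)⁻¹X′; a minimal Python sketch with illustrative data (one regressor plus the constant, i.e. J = 1):

    import numpy as np

    # Illustrative data: the last x-value lies far from the others
    x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
    X = np.column_stack([np.ones_like(x), x])     # constant plus one regressor

    # Leverage values h_i = diagonal of the hat matrix H = X (X'X)^(-1) X'
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)

    J, N = X.shape[1] - 1, X.shape[0]
    h_centered = h - (J + 1) / N                  # centered leverage as defined above
    print(h)            # the outlying x-value receives a clearly larger leverage
    print(h_centered)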

  49.

    By using s(−i) instead of standard error s, the numerator and the denominator in the formula for the studentized deleted residuals become stochastically independent. See Belsley et al. (1980, p. 14).

  50.

    See Fox (2008, p. 246); Belsley et al. (1980, p. 20).

  51.

    With Excel we can calculate: p(|t| ≥ 2.46) = T.DIST.2T(2.46;9) = 0.036.

  52.

    Missing values are a frequent and unfortunately unavoidable problem when conducting surveys (e.g., because people cannot or do not want to answer some questions, or as a result of mistakes by the interviewer). The handling of missing values in empirical studies is discussed in Sect. 1.5.2.

  53.

    Since Albert Einstein (1879–1955) we know that this is not quite true. Relativity theory tells us that time slows down with increasing speed and even comes to a standstill at the speed of light. But for our problems we can neglect this.

References

  • Agresti, A. (2013). Categorical data analysis. New Jersey: Wiley.

  • Anscombe, F. J., & Tukey, J. W. (1963). The examination and analysis of residuals. Technometrics, 5(2), 141–160.

  • Belsley, D., Kuh, E., & Welsch, R. (1980). Regression diagnostics. New York: Wiley.

  • Blalock, H. M. (1964). Causal inferences in nonexperimental research. New York: The Norton Library.

  • Campbell, D. T., & Stanley, J. C. (1966). Experimental and quasi-experimental designs for research. Chicago: Rand McNally.

  • Charles, E. P. (2005). The correction for attenuation due to measurement error: Clarifying concepts and creating confidence sets. Psychological Methods, 10(2), 206–226.

  • Cook, R. D. (1977). Detection of influential observations in linear regression. Technometrics, 19, 15–18.

  • Fox, J. (2008). Applied regression analysis and generalized linear models. Los Angeles: Sage.

  • Freedman, D. (2002). From association to causation: Some remarks on the history of statistics (Technical Report No. 521). Berkeley: University of California.

  • Freedman, D. (2012). Statistical models: Theory and practice. Cambridge: Cambridge University Press.

  • Freedman, D., Pisani, R., & Purves, R. (2007). Statistics (4th ed.). New York: Norton & Company.

  • Galton, F. (1886). Regression towards mediocrity in hereditary stature. Journal of the Anthropological Institute of Great Britain and Ireland, 15, 246–263.

  • Gelman, A., & Hill, J. (2018). Data analysis using regression and multilevel/hierarchical models. Cambridge: Cambridge University Press.

  • Green, P. E., Tull, D. S., & Albaum, G. (1988). Research for marketing decisions (5th ed.). Englewood Cliffs: Prentice Hall.

  • Greene, W. H. (2012). Econometric analysis (7th ed.). Essex: Pearson.

  • Greene, W. H. (2020). Econometric analysis (8th ed.). Essex: Pearson.

  • Hair, J. F., Black, W. C., Babin, B. J., & Anderson, R. E. (2010). Multivariate data analysis (7th ed.). Englewood Cliffs: Pearson.

  • Hair, J. F., Hult, G. T., Ringle, C. M., & Sarstedt, M. (2014). A primer on partial least squares structural equation modeling (PLS-SEM). Los Angeles: Sage.

  • Hastie, T., Tibshirani, R., & Friedman, J. (2011). The elements of statistical learning. New York: Springer.

  • Izenman, A. J. (2013). Modern multivariate statistical techniques. New York: Springer.

  • Kahneman, D. (2011). Thinking, fast and slow. London: Penguin.

  • Kline, R. B. (2016). Principles and practice of structural equation modeling. New York: Guilford Press.

  • Kmenta, J. (1997). Elements of econometrics (2nd ed.). New York: Macmillan.

  • Leeflang, P., Wittink, D., Wedel, M., & Naert, P. (2000). Building models for marketing decisions. Boston: Kluwer Academic Publishers.

  • Little, J. D. C. (1970). Models and managers: The concept of a decision calculus. Management Science, 16(8), 466–485.

  • Maddala, G., & Lahiri, K. (2009). Introduction to econometrics (4th ed.). New York: Wiley.

  • Pearl, J., & Mackenzie, D. (2018). The book of why: The new science of cause and effect. New York: Basic Books.

  • Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology, 15(1), 72–101.

  • Stigler, S. M. (1997). Regression towards the mean, historically considered. Statistical Methods in Medical Research, 6, 103–114.

  • Wooldridge, J. (2016). Introductory econometrics: A modern approach (6th ed.). Cincinnati: Thomson.

Further Reading

  • Fahrmeir, L., Kneib, T., Lang, S., & Marx, B. (2009). Regression: Models, methods and applications. Heidelberg: Springer.

  • Hanke, J. E., & Wichern, D. (2013). Business forecasting (9th ed.). Upper Saddle River: Prentice-Hall.

  • Härdle, W., & Simar, L. (2012). Applied multivariate statistical analysis. Heidelberg: Springer.

  • Stigler, S. M. (1986). The history of statistics. Cambridge: Harvard University Press.
