Regression Analysis

A chapter in the book Multivariate Analysis

Abstract

Regression analysis is one of the most flexible and most frequently used multivariate methods. It is employed to analyze relationships between a metrically scaled dependent variable and one or more metrically scaled independent variables. In particular, it is used to describe and explain relationships quantitatively and, on this basis, to estimate or predict values of the dependent variable. Regression analysis is of great importance in both research and practice.


Notes

  1.

    Galton (1886) investigated the relationship between the body heights of parents and their adult children. He “regressed the height of children on the height of parents”.

  2.

    Sales can also depend on environmental factors such as competition, socio-economic influences, or the weather. Another difficulty is that advertising itself is a complex bundle of factors that cannot simply be reduced to expenditures. The impact of advertising depends on its quality, which is difficult to measure, and it also depends on the media that are used (e.g., print, radio, television, internet). These and other reasons make it very difficult to measure the effect of advertising.

  3.

    On the website www.multivariate-methods.info we provide supplementary material (e.g., Excel files) to deepen the reader’s understanding of the methodology.

  4.

    See Sects. 2.2.3.3 and 2.2.5.

  5.

    In regression analysis we encounter the problem of multicollinearity. We will deal with this problem in Sect. 2.2.5.7.

  6.

    The terms association and correlation are widely and often interchangeably used in data analysis, but there are differences. Association refers to any kind of relation between variables: two variables are said to be associated if the values of one variable tend to change in some systematic way along with the values of the other variable, so that a scatterplot of the variables shows a systematic pattern. Correlation is a more specific term. It refers to associations in the form of a linear trend and is a measure of the strength of this association. Pearson's correlation coefficient measures the strength of a linear trend, i.e., how close the points lie to a straight line. Spearman's rank correlation measures the strength of a monotonic association and can therefore also be used for non-linear (but monotonic) trends.

  7.

    These basic statistics can easily be calculated with the Excel functions AVERAGE(range) for the mean, STDEV.S(range) for the standard deviation, and CORREL(range1;range2) for the correlation.
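
    For readers who prefer working outside Excel, the following is a minimal Python sketch of the same computations with numpy (the values of x and y are illustrative placeholders, not data from this chapter):

    import numpy as np

    x = [10.0, 12.0, 9.0, 15.0, 11.0]   # illustrative values only
    y = [2.1, 2.6, 1.9, 3.2, 2.4]

    mean_x = np.mean(x)              # AVERAGE(range)
    std_x = np.std(x, ddof=1)        # STDEV.S(range): sample standard deviation
    r_xy = np.corrcoef(x, y)[0, 1]   # CORREL(range1;range2): Pearson correlation

    print(mean_x, std_x, r_xy)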

  8.

    Blalock (1964, p. 51) writes: “A large correlation merely means a low degree of scatter …. It is the regression coefficients which give us the laws of science.”

  9.

    With the optimization tool Solver of MS Excel it is easy to find this solution without differential calculus or knowing any formulas. One chooses the cell that contains the value of SSR (the sum at the bottom of the rightmost column in Table 2.5) as the target cell (objective). The cells that contain the parameters a and b are chosen as the changing cells. Then minimizing the objective will yield the least-squares estimates of the parameters within the changing cells.
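
    The same idea can be sketched outside Excel; the following Python snippet minimizes SSR numerically with scipy.optimize (the data are illustrative placeholders, not the values from Table 2.5):

    import numpy as np
    from scipy.optimize import minimize

    # Illustrative data, standing in for the x- and y-values of the example
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    def ssr(params):
        a, b = params                    # constant a and slope b of y_hat = a + b*x
        residuals = y - (a + b * x)
        return np.sum(residuals ** 2)    # sum of squared residuals (SSR)

    # Minimizing SSR over (a, b) reproduces the least-squares estimates numerically
    result = minimize(ssr, x0=[0.0, 0.0])
    a_hat, b_hat = result.x
    print(a_hat, b_hat)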

  10.

    Carl Friedrich Gauß (1777–1855) used the method in 1795 at the age of only 18 years for calculating the orbits of celestial bodies. This method was also developed independently by the French mathematician Adrien-Marie Legendre (1752–1833). G. Udny Yule (1871–1951) first applied it to regression analysis.

  11.

    When using matrix algebra for the calculation, the constant term is treated as the coefficient of a fictitious variable whose values are all equal to 1. In this way it can be computed in the same way as the other coefficients, which simplifies the calculation.
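
    A small numerical sketch of this trick in Python (illustrative data; numpy's least-squares routine stands in for solving the normal equations explicitly):

    import numpy as np

    # Illustrative data with two regressors
    x1 = np.array([1.0, 2.0, 3.0, 4.0])
    x2 = np.array([0.5, 0.3, 0.8, 0.9])
    y  = np.array([2.0, 2.5, 3.7, 4.1])

    # Prepend a column of ones: its "coefficient" is the constant term
    X = np.column_stack([np.ones_like(x1), x1, x2])

    # Solve the least-squares problem (equivalently, the normal equations X'Xb = X'y)
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    print(b)   # b[0] is the constant, b[1] and b[2] are the regression coefficients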

  12.

    This holds only for linear models and LS estimation. The principle is also of central importance for the analysis of variance (ANOVA, cf. Chap. 3) and for discriminant analysis (cf. Chap. 4).

  13.

    This is called inferential statistics and has to be distinguished from descriptive statistics. Inferential statistics makes inferences and predictions about a population based on a sample drawn from the studied population.

  14.

    For a simple coefficient of correlation r, we get \(F_{\text{emp}} = \frac{r^{2}}{(1 - r^{2})/(N - 2)}\), as J = 1.

  15.

    With Excel we can calculate the p-value by using the function F.DIST.RT(Femp;df1;df2). We get: F.DIST.RT(31.50;3;8) = 0.00009 or 0.009%.
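
    Equivalently, the right-tail F probability can be obtained in Python with scipy.stats (a sketch, assuming the value of Femp and the degrees of freedom quoted above):

    from scipy import stats

    # Right-tail p-value of the F distribution, the analogue of Excel's F.DIST.RT
    F_emp, df1, df2 = 31.50, 3, 8
    p_value = stats.f.sf(F_emp, df1, df2)
    print(p_value)   # approximately 0.00009, matching F.DIST.RT(31.50;3;8)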

  16.

    The reader should be aware that other values for α are also possible. α = 5% is a kind of “gold standard” in statistics that goes back to Sir R. A. Fisher (1890–1962), after whom the F-distribution is named. But the researcher must also consider the consequences (costs) of making a wrong decision.

  17.

    Other criteria for model assessment and selection are the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). See, e.g., Agresti (2013, p. 212); Greene (2012, p. 179); Hastie et al. (2011, pp. 219–257).

  18.

    For a brief summary of the basics of statistical testing see Sect. 1.3.

  19.

    With Excel we can calculate the critical value \(t_{\alpha /2}\) for a two-tailed t-test by using the function T.INV.2T(α;df). We get: T.INV.2T(0.05;8) = 2.306.

  20.

    The p-values can be calculated with Excel by using the function T.DIST.2T(ABS(temp);df). For the variable price we get: T.DIST.2T(3.20;8) = 0.0126 or 1.3%.

  21.

    With Excel we can calculate the critical value \(t_{\alpha }\) for a one-tailed t-test by using the function T.INV(1 – α;df). We get: T.INV(0.95;8) = 1.860.

  22.

    With Excel we can calculate the p-value for the right tail by the function T.DIST.RT(temp;df). For the variable advertising we get: T.DIST.RT(5.89;8) = 0.00018 or 0.018%.

  23.

    Using Excel, we can calculate the critical value for a lower-tail t-test by T.INV(α;df). We get: T.INV(0.05;8) = –1.860.
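
    The t-distribution functions used in the preceding notes have direct counterparts in Python's scipy.stats; the following sketch reproduces the values quoted above (df = 8, as in the example):

    from scipy import stats

    df = 8
    t_crit_two = stats.t.ppf(1 - 0.05 / 2, df)   # cf. T.INV.2T(0.05;8)  -> approx. 2.306
    p_two = 2 * stats.t.sf(3.20, df)             # cf. T.DIST.2T(3.20;8) -> approx. 0.0126
    t_crit_upper = stats.t.ppf(0.95, df)         # cf. T.INV(0.95;8)     -> approx. 1.860
    p_right = stats.t.sf(5.89, df)               # cf. T.DIST.RT(5.89;8) -> approx. 0.00018
    t_crit_lower = stats.t.ppf(0.05, df)         # cf. T.INV(0.05;8)     -> approx. -1.860
    print(t_crit_two, p_two, t_crit_upper, p_right, t_crit_lower)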

  24.

    Cf. e.g., Kmenta (1997, p. 392); Fox (2008, p. 105); Greene (2012, p. 92); Wooldridge (2016, p. 79); Gelman and Hill (2018, p. 45). You will find slight differences between the formulations of the different authors.

  25.

    This follows from the Gauss-Markov theorem. See e.g. Fox (2008, p. 103); Kmenta (1997, p. 216).

  26.

    The central limit theorem plays an important role in statistical theory. It states that the sum or mean of n independent random variables tends toward a normal distribution if n is sufficiently large, even if the original variables themselves are not normally distributed. This is the reason why a normal distribution can be assumed for many phenomena.

  27.

    Anscombe and Tukey (1963) demonstrated the power of graphical techniques in data analysis.

  28.

    In an experiment the researcher actively manipulates the independent variable X and observes the resulting changes in the dependent variable Y while, as far as possible, excluding any other influences on Y. For the design of experiments see e.g. Campbell and Stanley (1966); Green et al. (1988).

  29.

    Switzerland was the top performer in both chocolate consumption and the number of Nobel Prizes. See Messerli, F. H. (2012). Chocolate consumption, cognitive function, and Nobel laureates. The New England Journal of Medicine, 367(16), 1562–1564.

  30.

    For causal inference in regression see Freedman (2012); Pearl and Mackenzie (2018, p. 72). Problems like this one are covered by path analysis, originally developed by Sewall Wright (1889–1988), and structural equation modeling (SEM), cf. e.g. Kline (2016); Hair et al. (2014).

  31.

    “Mistaking a mediator for a confounder is one of the deadliest sins in causal inference.” (Pearl and Mackenzie 2018, p. 276).

  32.

    The expression goes back to Francis Galton (1886), who called it “regression towards mediocrity”. Galton wrongly interpreted it as a causal effect in human heredity. It is ironic that the first and most important method of multivariate data analysis got its name from something that means the opposite of what regression analysis actually intends to do. Cf. Kahneman (2011, p. 175); Pearl and Mackenzie (2018, p. 53).

  33.

    Cf. Freedman et al. (2007, p. 169). In econometric analysis this effect is called least squares attenuation or attenuation bias. Cf., e.g., Kmenta (1997, p. 346); Greene (2012, p. 280); Wooldridge (2016, p. 306).

  34.

    In psychology great efforts have been undertaken, beginning with Charles Spearman in 1904, to measure empirically the reliability of measurement methods and thus derive corrections for attenuation. Cf., e.g., Hair et al. (2014, p. 96); Charles (2005).

  35.

    An overview of this test and other tests is given by Kmenta (1997, p. 292); Maddala and Lahiri (2009, p. 214).

  36.

    From a Durbin-Watson table we derive the values dL = 0.97 and dU = 1.33 and thus 1.33 < DW < 2.67 (no autocorrelation).
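
    For illustration, the Durbin–Watson statistic itself is easy to compute from a vector of regression residuals; a minimal Python sketch (the residuals shown are illustrative, not those of the chapter example):

    import numpy as np

    def durbin_watson(residuals):
        # DW = sum of squared first differences of the residuals / sum of squared residuals
        e = np.asarray(residuals)
        return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

    # Illustrative residuals; values near 2 indicate no first-order autocorrelation
    e = np.array([0.5, -0.3, 0.2, -0.4, 0.1, 0.3, -0.2, -0.1, 0.4, -0.5])
    print(durbin_watson(e))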

  37.

    If the errors are normally distributed, the y-values, which contain the errors as additive elements, are also normally distributed. And since the least-squares estimators form linear combinations of the y-values, the parameter estimates are normally distributed, too.

  38.

    Numerical significance tests for normality include the Kolmogorov–Smirnov test and the Shapiro–Wilk test.

  39.

    The matrix X′X is singular and cannot be inverted.

  40.

    Numerically these areas can be expressed by their sums of squares: \(SS_{Y} = \sum \left( y_{k} - \overline{y} \right)^{2}\) and \(SS_{X_{j}} = \sum \left( x_{jk} - \overline{x}_{j} \right)^{2}\).

  41.

    See Belsley et al. (1980, p. 93).

  42.

    Very small tolerance values can lead to computational problems. By default, SPSS will not allow variables with Tj < 0.0001 to enter the model.
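
    The tolerance of a regressor can be computed directly from the data; the following Python sketch (with an illustrative regressor matrix) regresses each variable on the remaining ones and reports T_j = 1 − R_j² together with the variance inflation factor VIF_j = 1/T_j:

    import numpy as np

    def tolerance(X, j):
        # T_j = 1 - R_j^2, where R_j^2 stems from regressing X_j on the other regressors
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        Z = np.column_stack([np.ones(len(y)), others])   # include a constant term
        coef = np.linalg.lstsq(Z, y, rcond=None)[0]
        resid = y - Z @ coef
        r2 = 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
        return 1 - r2

    # Illustrative regressor matrix (columns = variables)
    X = np.array([[1.0, 2.0, 1.5],
                  [2.0, 3.9, 2.1],
                  [3.0, 6.1, 2.9],
                  [4.0, 8.0, 4.2],
                  [5.0, 9.9, 5.1]])
    for j in range(X.shape[1]):
        T_j = tolerance(X, j)
        print(j, T_j, 1 / T_j)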

  43.

    Another method to counter multicollinearity, which is beyond the scope of this text, is ridge regression. By this method one trades a small amount of bias in the estimators for a large reduction in variance. See Fox (2008, p. 325); Kmenta (1997, p. 440); Belsley et al. (1980, p. 219).

  44.

    Excellent treatments of this topic may be found in Belsley et al. (1980); Fox (2008, p. 246). SPSS provides numerous diagnostic statistics for this purpose.

  45.

    Before doing a regression analysis one can use exploratory techniques of data analysis, like box plots (box-and-whisker plots), for checking the data and detecting possible outliers. But these methods do not show the effects on regression.

  46.

    This may be different when the number of variables is large. In this case the detection of multivariate outliers by scatterplots can be difficult (see Belsley et al. 1980, p. 17).

  47.

    With Excel we can calculate: p(|z| ≥ 1.59) = 2*(1 − NORM.S.DIST(1.59;1)) = 0.112.

  48.

    A modified measure of this distance is the centered leverage \(h^{\prime}_{i} = h_{i} - \frac{J + 1}{N}\) with \(0 \le h^{\prime}_{i} \le 1\).
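
    For illustration, the leverage values h_i are the diagonal elements of the hat matrix H = X(X′X)⁻¹X′; a minimal Python sketch with illustrative data (one regressor plus the constant, i.e. J = 1):

    import numpy as np

    # Illustrative data: the last x-value lies far from the others
    x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
    X = np.column_stack([np.ones_like(x), x])     # constant plus one regressor

    # Leverage values h_i = diagonal of the hat matrix H = X (X'X)^(-1) X'
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)

    J, N = X.shape[1] - 1, X.shape[0]
    h_centered = h - (J + 1) / N                  # centered leverage as defined above
    print(h)            # the outlying x-value receives a clearly larger leverage
    print(h_centered)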

  49.

    By using s(−i) instead of standard error s, the numerator and the denominator in the formula for the studentized deleted residuals become stochastically independent. See Belsley et al. (1980, p. 14).

  50.

    See Fox (2008, p. 246); Belsley et al. (1980, p. 20).

  51.

    With Excel we can calculate: p(|t| ≥ 2.46) = T.DIST.2T(2.46;9) = 0.036.

  52.

    Missing values are a frequent and unfortunately unavoidable problem when conducting surveys (e.g., because people cannot or do not want to answer some questions, or as a result of mistakes by the interviewer). The handling of missing values in empirical studies is discussed in Sect. 1.5.2.

  53.

    Since Albert Einstein (1879–1955) we know that this is not quite true. Relativity theory tells us that time slows down with increasing speed and even comes to a standstill at the speed of light. But for our problems we can neglect this.

References

  • Agresti, A. (2013). Categorical data analysis. New Jersey: Wiley.

  • Anscombe, F. J., & Tukey, J. W. (1963). The examination and analysis of residuals. Technometrics, 5(2), 141–160.

  • Belsley, D., Kuh, E., & Welsch, R. (1980). Regression diagnostics. New York: Wiley.

  • Blalock, H. M. (1964). Causal inferences in nonexperimental research. New York: The Norton Library.

  • Campbell, D. T., & Stanley, J. C. (1966). Experimental and quasi-experimental designs for research. Chicago: Rand McNally.

  • Charles, E. P. (2005). The correction for attenuation due to measurement error: Clarifying concepts and creating confidence sets. Psychological Methods, 10(2), 206–226.

  • Cook, R. D. (1977). Detection of influential observations in linear regression. Technometrics, 19, 15–18.

  • Fox, J. (2008). Applied regression analysis and generalized linear models. Los Angeles: Sage.

  • Freedman, D. (2002). From association to causation: Some remarks on the history of statistics (Technical Report No. 521). Berkeley: University of California.

  • Freedman, D. (2012). Statistical models: Theory and practice. Cambridge: Cambridge University Press.

  • Freedman, D., Pisani, R., & Purves, R. (2007). Statistics (4th ed.). New York: Norton & Company.

  • Galton, F. (1886). Regression towards mediocrity in hereditary stature. Journal of the Anthropological Institute of Great Britain and Ireland, 15, 246–263.

  • Gelman, A., & Hill, J. (2018). Data analysis using regression and multilevel/hierarchical models. Cambridge: Cambridge University Press.

  • Green, P. E., Tull, D. S., & Albaum, G. (1988). Research for marketing decisions (5th ed.). Englewood Cliffs: Prentice Hall.

  • Greene, W. H. (2012). Econometric analysis (7th ed.). Essex: Pearson.

  • Greene, W. H. (2020). Econometric analysis (8th ed.). Essex: Pearson.

  • Hair, J. F., Black, W. C., Babin, B. J., & Anderson, R. E. (2010). Multivariate data analysis (7th ed.). Englewood Cliffs: Pearson.

  • Hair, J. F., Hult, G. T., Ringle, C. M., & Sarstedt, M. (2014). A primer on partial least squares structural equation modeling (PLS-SEM). Los Angeles: Sage.

  • Hastie, T., Tibshirani, R., & Friedman, J. (2011). The elements of statistical learning. New York: Springer.

  • Izenman, A. J. (2013). Modern multivariate statistical techniques. New York: Springer.

  • Kahneman, D. (2011). Thinking, fast and slow. London: Penguin.

  • Kline, R. B. (2016). Principles and practice of structural equation modeling. New York: Guilford Press.

  • Kmenta, J. (1997). Elements of econometrics (2nd ed.). New York: Macmillan.

  • Leeflang, P., Wittink, D., Wedel, M., & Naert, P. (2000). Building models for marketing decisions. Boston: Kluwer Academic Publishers.

  • Little, J. D. C. (1970). Models and managers: The concept of a decision calculus. Management Science, 16(8), 466–485.

  • Maddala, G., & Lahiri, K. (2009). Introduction to econometrics (4th ed.). New York: Wiley.

  • Pearl, J., & Mackenzie, D. (2018). The book of why: The new science of cause and effect. New York: Basic Books.

  • Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology, 15(1), 72–101.

  • Stigler, S. M. (1997). Regression towards the mean, historically considered. Statistical Methods in Medical Research, 6, 103–114.

  • Wooldridge, J. (2016). Introductory econometrics: A modern approach (6th ed.). Cincinnati: Thomson.

Further Reading

  • Fahrmeir, L., Kneib, T., Lang, S., & Marx, B. (2009). Regression: Models, methods and applications. Heidelberg: Springer.

  • Hanke, J. E., & Wichern, D. (2013). Business forecasting (9th ed.). Upper Saddle River: Prentice-Hall.

  • Härdle, W., & Simar, L. (2012). Applied multivariate statistical analysis. Heidelberg: Springer.

  • Stigler, S. M. (1986). The history of statistics. Cambridge: Harvard University Press.
