Abstract
Regression analysis is one of the most flexible and most frequently used multivariate methods. It is employed to analyze relationships between a metrically scaled dependent variable and one or more metrically scaled independent variables. In particular, it is used to describe relationships quantitatively and to explain them. As a result, we can estimate or predict values of the dependent variable. Regression analysis is of eminent importance for science and practice.
Notes
- 1.
Galton (1886) investigated the relationship between the body heights of parents and their adult children. He “regressed the height of children on the height of parents”.
- 2.
Sales can also depend on environmental factors like competition, socio-economic influences, or weather. Another difficulty is that advertising itself is a complex bundle of factors that cannot simply be reduced to expenditures. The impact of advertising depends on its quality, which is difficult to measure, and it also depends on the media that are used (e.g., print, radio, television, internet). These and other reasons make it very difficult to measure the effect of advertising.
- 3.
On the website www.multivariate-methods.info we provide supplementary material (e.g., Excel files) to deepen the reader’s understanding of the methodology.
- 4.
- 5.
In regression analysis we encounter the problem of multicollinearity. We will deal with this problem in Sect. 2.2.5.7.
- 6.
The terms association and correlation are widely and often interchangeably used in data analysis. But there are differences. Association of variables refers to any kind of relation between variables. Two variables are said to be associated if the values of one variable tend to change in some systematic way along with the values of the other variable. A scatterplot of the variables will show a systematic pattern. Correlation is a more specific term. It refers to associations in the form of a linear trend and it measures the strength of this association. Pearson's correlation coefficient measures the strength of a linear trend, i.e. how close the points lie to a straight line. Spearman's rank correlation can also be used for monotonic non-linear trends.
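The distinction can be made concrete with a small Python sketch (illustrative, not part of the chapter's Excel material): for a perfectly monotonic but non-linear relation, Spearman's rank correlation equals 1 while Pearson's coefficient stays below 1.

```python
import math

def pearson(x, y):
    # Pearson correlation: strength of a linear trend
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) *
                           sum((b - my) ** 2 for b in y))

def ranks(v):
    # rank the values from 1..n (no ties in this illustration)
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0] * len(v)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    # Spearman = Pearson correlation of the ranks; captures monotonic trends
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]     # perfectly monotonic but non-linear
print(pearson(x, y))         # below 1: points do not lie on a straight line
print(spearman(x, y))        # 1.0: the monotonic association is perfect
```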
- 7.
These basic statistics can be easily calculated with the Excel functions AVERAGE(range) for the mean, STDEV.S(range) for the standard deviation, and CORREL(range1;range2) for the correlation.
- 8.
Blalock (1964, p. 51) writes: “A large correlation merely means a low degree of scatter …. It is the regression coefficients which give us the laws of science.”
- 9.
With the optimization tool Solver of MS Excel it is easy to find this solution without differential calculus or knowing any formulas. One chooses the cell that contains the value of SSR (the sum at the bottom of the rightmost column in Table 2.5) as the target cell (objective). The cells that contain the parameters a and b are chosen as the changing cells. Then minimizing the objective will yield the least-squares estimates of the parameters within the changing cells.
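The same idea can be sketched outside Excel. Below is a minimal Python sketch (not from the chapter; the data are illustrative) in which a crude gradient-descent loop plays the role of Solver and minimizes SSR over the parameters a and b:

```python
def ssr(a, b, x, y):
    # sum of squared residuals for the line y = a + b*x
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

def minimize_ssr(x, y, steps=20000, lr=0.01):
    # crude gradient descent standing in for Excel's Solver:
    # repeatedly nudge a and b downhill on the SSR surface
    a = b = 0.0
    n = len(x)
    for _ in range(steps):
        ga = sum(-2 * (yi - (a + b * xi)) for xi, yi in zip(x, y)) / n
        gb = sum(-2 * (yi - (a + b * xi)) * xi for xi, yi in zip(x, y)) / n
        a -= lr * ga
        b -= lr * gb
    return a, b

x, y = [1, 2, 3, 4, 5], [2, 4, 5, 4, 5]   # illustrative data
a, b = minimize_ssr(x, y)
# agrees with the least-squares formulas b = Sxy/Sxx, a = mean(y) - b*mean(x)
```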
- 10.
Carl Friedrich Gauß (1777–1855) used the method in 1795 at the age of only 18 years for calculating the orbits of celestial bodies. This method was also developed independently by the French mathematician Adrien-Marie Legendre (1752–1833). G. Udny Yule (1871–1951) first applied it to regression analysis.
- 11.
When using matrix algebra for calculation, the constant term is treated as the coefficient of a fictitious variable whose values all equal 1. In this way, the constant can be computed in the same way as the other coefficients and the calculation becomes easier.
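A minimal Python sketch of this trick (illustrative, for the one-regressor case): appending a column of ones to X makes the intercept just another coefficient, and the resulting 2×2 normal equations can be solved directly:

```python
def ols_with_intercept(x, y):
    # treat the constant as the coefficient of a fictitious variable
    # with all values equal to 1, i.e. X = [1, x] and b = (X'X)^(-1) X'y
    n = len(x)
    s1, sx, sxx = n, sum(x), sum(xi * xi for xi in x)
    sy, sxy = sum(y), sum(xi * yi for xi, yi in zip(x, y))
    det = s1 * sxx - sx * sx          # determinant of the 2x2 matrix X'X
    a = (sxx * sy - sx * sxy) / det   # intercept
    b = (s1 * sxy - sx * sy) / det    # slope
    return a, b

a, b = ols_with_intercept([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])  # illustrative data
```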
- 12.
- 13.
This is called inferential statistics and has to be distinguished from descriptive statistics. Inferential statistics makes inferences and predictions about a population based on a sample drawn from the studied population.
- 14.
For a simple coefficient of correlation r, we get \(F_{{\text{emp}}} = \frac{{r^{2} }}{{(1 - r^{2} )/(N - 2)}}\), as J = 1.
- 15.
With Excel we can calculate the p-value by using the function F.DIST.RT(Femp;df1;df2). We get: F.DIST.RT(31.50;3;8) = 0.00009 or 0.009%.
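Without Excel, the same quantities can be approximated in plain Python. The sketch below (not from the chapter) computes the empirical F value from R², J and N, and approximates the right-tail probability of the F-distribution by numerically integrating its density; the integration bound and step count are assumptions chosen so that the neglected tail is negligible:

```python
import math

def f_emp(r2, J, N):
    # empirical F value for a regression with J regressors and N cases
    return (r2 / J) / ((1 - r2) / (N - J - 1))

def f_pdf(x, d1, d2):
    # density of the F(d1, d2) distribution, via log-gamma for stability
    log_beta = math.lgamma(d1 / 2) + math.lgamma(d2 / 2) - math.lgamma((d1 + d2) / 2)
    log_f = ((d1 / 2) * math.log(d1) + (d2 / 2) * math.log(d2)
             + (d1 / 2 - 1) * math.log(x)
             - ((d1 + d2) / 2) * math.log(d2 + d1 * x) - log_beta)
    return math.exp(log_f)

def f_dist_rt(x, d1, d2, upper=5000.0, n=100000):
    # right-tail probability by trapezoidal integration (cf. Excel's F.DIST.RT);
    # assumption: the tail beyond `upper` is negligible for these df
    h = (upper - x) / n
    s = 0.5 * (f_pdf(x, d1, d2) + f_pdf(upper, d1, d2))
    for i in range(1, n):
        s += f_pdf(x + i * h, d1, d2)
    return s * h

print(round(f_dist_rt(31.50, 3, 8), 5))  # ≈ 0.00009, as in the note
```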
- 16.
The reader should be aware that other values for α are also possible. α = 5% is a kind of "gold standard" in statistics that goes back to Sir R. A. Fisher (1890–1962), who also created the F-distribution. But the researcher must also consider the consequences (costs) of making a wrong decision.
- 17.
- 18.
For a brief summary of the basics of statistical testing see Sect. 1.3.
- 19.
With Excel we can calculate the critical value \(t_{\alpha /2}\) for a two-tailed t-test by using the function T.INV.2T(α;df). We get: T.INV.2T(0.05;8) = 2.306.
- 20.
The p-values can be calculated with Excel by using the function T.DIST.2T(ABS(temp);df). For the variable price we get: T.DIST.2T(3.20;8) = 0.0126 or 1.3%.
- 21.
With Excel we can calculate the critical value \(t_{\alpha }\) for a one-tailed t-test by using the function T.INV(1 – α;df). We get: T.INV(0.95;8) = 1.860.
- 22.
With Excel we can calculate the p-value for the right tail by the function T.DIST.RT(temp;df). For the variable advertising we get: T.DIST.RT(5.89;8) = 0.00018 or 0.018%.
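Notes 20 and 22 can be reproduced without Excel. Below is a Python sketch (not from the chapter) that approximates T.DIST.RT and T.DIST.2T by numerically integrating the density of Student's t-distribution; the integration bound is an assumption chosen so that the neglected tail is negligible:

```python
import math

def t_pdf(t, df):
    # density of Student's t-distribution with df degrees of freedom
    c = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2)) / math.sqrt(df * math.pi)
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def t_dist_rt(t, df, upper=1000.0, n=100000):
    # right-tail probability by trapezoidal integration (cf. Excel's T.DIST.RT)
    h = (upper - t) / n
    s = 0.5 * (t_pdf(t, df) + t_pdf(upper, df))
    for i in range(1, n):
        s += t_pdf(t + i * h, df)
    return s * h

def t_dist_2t(t, df):
    # two-tailed probability for t > 0 (cf. Excel's T.DIST.2T)
    return 2 * t_dist_rt(t, df)

print(round(t_dist_2t(3.20, 8), 4))   # ≈ 0.0126, as in note 20
print(round(t_dist_rt(5.89, 8), 5))   # ≈ 0.00018, as in note 22
```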
- 23.
Using Excel, we can calculate the critical value for a lower-tail t-test by T.INV(α;df). We get: T.INV(0.05;8) = –1.860.
- 24.
- 25.
- 26.
The central limit theorem plays an important role in statistical theory. It states that the sum or mean of n independent, identically distributed random variables tends toward a normal distribution if n is sufficiently large, even if the original variables themselves are not normally distributed. This is the reason why a normal distribution can be assumed for many phenomena.
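The theorem can be illustrated with a short simulation in Python (an illustrative sketch, not part of the chapter): the underlying draws are uniform, yet the distribution of their means is approximately normal with mean 0.5 and standard deviation sqrt(1/(12n)):

```python
import random
import statistics

random.seed(42)

# means of n uniform(0, 1) draws: each draw is clearly non-normal,
# but by the CLT their mean is approximately N(0.5, 1/(12 n))
n, trials = 50, 5000
means = [statistics.fmean(random.random() for _ in range(n)) for _ in range(trials)]

print(round(statistics.fmean(means), 3))   # close to 0.5
print(round(statistics.stdev(means), 3))   # close to sqrt(1/(12*50)) ≈ 0.041
```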
- 27.
Anscombe and Tukey (1963) demonstrated the power of graphical techniques in data analysis.
- 28.
- 29.
Switzerland was the top performer in chocolate consumption and number of Nobel Prizes. See Messerli, F. H. (2012). Chocolate Consumption, Cognitive Function, and Nobel Laureates. The New England Journal of Medicine, 367(16), 1562–1564.
- 30.
- 31.
“Mistaking a mediator for a confounder is one of the deadliest sins in causal inference.” (Pearl and Mackenzie 2018, p. 276).
- 32.
The expression goes back to Francis Galton (1886), who called it “regression towards mediocrity”. Galton wrongly interpreted it as a causal effect in human heredity. It is ironic that the first and most important method of multivariate data analysis got its name from something that means the opposite of what regression analysis actually intends to do. Cf. Kahneman (2011, p. 175); Pearl and Mackenzie (2018, p. 53).
- 33.
- 34.
- 35.
- 36.
From a Durbin-Watson table we derive the values dL = 0.97 and dU = 1.33 and thus 1.33 < DW < 2.67 (no autocorrelation).
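The underlying statistic itself is easy to compute. A minimal Python sketch (the residual values are illustrative) of the standard Durbin-Watson formula DW = Σ(e_t − e_{t−1})² / Σe_t², whose values near 2 indicate the absence of first-order autocorrelation:

```python
def durbin_watson(e):
    # DW = sum of squared successive residual differences
    #      divided by the sum of squared residuals;
    # values near 2 suggest no first-order autocorrelation
    num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    den = sum(r ** 2 for r in e)
    return num / den

residuals = [0.5, -0.3, 0.2, -0.4, 0.1, 0.3, -0.2, -0.1, 0.4, -0.5]  # illustrative
dw = durbin_watson(residuals)
```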
- 37.
If the errors are normally distributed, the y-values, which contain the errors as additive elements, are also normally distributed. And since the least-squares estimators form linear combinations of the y-values, the parameter estimates are normally distributed, too.
- 38.
Numerical significance tests of normality include the Kolmogorov-Smirnov test and the Shapiro-Wilk test.
- 39.
The matrix X′X is singular and cannot be inverted.
- 40.
Numerically these areas can be expressed by their sums of squares: \(SS_{Y} = \sum \left( y_{k} - \overline{y} \right)^{2}\) and \(SS_{X_{j}} = \sum \left( x_{jk} - \overline{x}_{j} \right)^{2}\).
- 41.
See Belsley et al. (1980, p. 93).
- 42.
Very small tolerance values can lead to computational problems. By default, SPSS will not allow variables with Tj < 0.0001 to enter the model.
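For the two-regressor case the tolerance is easy to compute by hand, since \(R_j^2\) is then just the squared correlation between the two predictors, i.e. \(T_j = 1 - r^2\). A small Python sketch (the data are illustrative, deliberately nearly collinear):

```python
import math

def pearson(x, y):
    # Pearson correlation coefficient
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) *
                           sum((b - my) ** 2 for b in y))

def tolerance_two_predictors(x1, x2):
    # with only two regressors, R_j^2 equals the squared correlation
    # between them, so T_j = 1 - r^2 for both variables
    r = pearson(x1, x2)
    return 1 - r * r

x1 = [1, 2, 3, 4, 5]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1]          # nearly collinear with x1
tol = tolerance_two_predictors(x1, x2)    # close to 0: severe multicollinearity
```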
- 43.
- 44.
- 45.
Before doing a regression analysis one can use exploratory techniques of data analysis, like box plots (box-and-whisker plots), for checking the data and detecting possible outliers. But these methods do not show the effects on regression.
- 46.
This may be different when the number of variables is large. In this case the detection of multivariate outliers by scatterplots can be difficult (see Belsley et al. 1980, p. 17).
- 47.
With Excel we can calculate: p(abs(z) ≥ 1.59) = 2*(1-NORM.S.DIST(1.59;1)) = 0.112.
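The same probability can be obtained in Python without any statistics library, because the two-sided tail of the standard normal distribution equals the complementary error function: 2·(1 − Φ(z)) = erfc(z/√2). A short sketch:

```python
import math

def two_sided_p(z):
    # P(|Z| >= z) for a standard normal Z;
    # identical to Excel's 2*(1-NORM.S.DIST(z;1))
    return math.erfc(z / math.sqrt(2))

print(round(two_sided_p(1.59), 3))  # 0.112, as in the note
```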
- 48.
A modified measure of this distance is the centered leverage \(h^{\prime}_{i} = h_{i} - \frac{J + 1}{N}\) with \(0 \le h^{\prime}_{i} \le 1\).
- 49.
By using s(−i) instead of standard error s, the numerator and the denominator in the formula for the studentized deleted residuals become stochastically independent. See Belsley et al. (1980, p. 14).
- 50.
- 51.
With Excel we can calculate: p(abs(t) ≥ 2.46) = T.DIST.2T(2.46;9) = 0.036.
- 52.
Missing values are a frequent and unfortunately unavoidable problem when conducting surveys (e.g., because people cannot or do not want to answer some questions, or as a result of mistakes by the interviewer). The handling of missing values in empirical studies is discussed in Sect. 1.5.2.
- 53.
Since Albert Einstein (1879–1955) we know that this is not quite true. Relativity theory tells us that time slows down with increasing speed and even comes to a standstill at the speed of light. But for our problems we can neglect this.
References
Agresti, A. (2013). Categorical Data Analysis. New Jersey: John Wiley.
Anscombe, F. J., & Tukey, J. W. (1963). The Examination and Analysis of Residuals. Technometrics, 5(2), 141–160.
Belsley, D., Kuh, E., & Welsch, R. (1980). Regression diagnostics. New York: Wiley.
Blalock, H. M. (1964). Causal inferences in nonexperimental research. New York: The Norton Library.
Campbell, D. T., & Stanley, J. C. (1966). Experimental and Quasi-experimental designs for research. Chicago: Rand McNally.
Charles, E. P. (2005). The correction for attenuation due to measurement error: Clarifying concepts and creating confidence sets. Psychological Methods, 10(2), 206–226.
Cook, R. D. (1977). Detection of Influential Observations in Linear Regression. Technometrics, 19, 15–18.
Fox, J. (2008). Applied regression analysis and generalized linear models. Los Angeles: Sage.
Freedman, D. (2002). From association to causation: Some remarks on the history of statistics (Technical Report No. 521). Berkeley: University of California.
Freedman, D. (2012). Statistical models: Theory and practice. Cambridge: Cambridge University Press.
Freedman, D., Pisani, R., & Purves, R. (2007). Statistics (4th ed.). New York: Norton & Company.
Galton, F. (1886). Regression towards mediocrity in hereditary stature. Journal of the Anthropological Institute of Great Britain and Ireland, 15, 246–263.
Gelman, A., & Hill, J. (2018). Data analysis using regression and multilevel/hierarchical models. Cambridge: Cambridge University Press.
Green, P. E., Tull, D. S., & Albaum, G. (1988). Research for Marketing Decisions (5th ed.). Englewood Cliffs: Prentice Hall.
Greene, W. H. (2012). Econometric analysis (7th ed.). Essex: Pearson.
Greene, W. H. (2020). Econometric analysis (8th ed.). Essex: Pearson.
Hair, J. F., Black, W. C., Babin, B. J., & Anderson, R. E. (2010). Multivariate data analysis (7th ed.). Englewood Cliffs: Pearson.
Hair, J. F., Hult, G.T., Ringle, C. M., & Sarstedt, M. (2014). A primer on Partial Least Squares Structural Equation Modelling (PLS-SEM). Los Angeles: Sage.
Hastie, T., Tibshirani, R., & Friedman, J. (2011). The Elements of Statistical Learning. New York: Springer.
Izenman, A. L. (2013). Modern multivariate statistical techniques. New York: Springer (Springer Texts in Statistics).
Kahneman, D. (2011). Thinking, fast and slow. London: Penguin.
Kline, R. B. (2016). Principles and practice of structural equation modeling. New York: Guilford Press.
Kmenta, J. (1997). Elements of econometrics (2nd ed.). New York: Macmillan.
Leeflang, P., Witting, D., Wedel, M., & Naert, P. (2000). Building models for marketing decisions. Boston: Kluwer Academic Publishers.
Little, J. D. C. (1970). Models and managers: The concept of a decision calculus. Management Science, 16(8), 466–485.
Maddala, G., & Lahiri, K. (2009). Introduction to econometrics (4th ed.). New York: Wiley.
Pearl, J., & Mackenzie, D. (2018). The book of why—The new science of cause and effect. New York: Basic Books.
Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology, 15(1), 72–101.
Stigler, S. M. (1997). Regression towards the mean, historically considered. Statistical Methods in Medical Research, 6, 103–114.
Wooldridge, J. (2016). Introductory econometrics: A modern approach (6th ed.). Cincinnati: Thomson.
Further Reading
Fahrmeir, L., Kneib, T., Lang, S., & Marx, B. (2009). Regression—models, methods and applications. Heidelberg: Springer.
Hanke, J. E., & Wichern, D. (2013). Business forecasting (9th ed.). Upper Saddle River: Prentice-Hall.
Härdle, W., & Simar, L. (2012). Applied multivariate analysis. Heidelberg: Springer.
Stigler, S. M. (1986). The history of statistics. Cambridge: Harvard University Press.
Copyright information
© 2021 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Fachmedien Wiesbaden GmbH, part of Springer Nature
Cite this chapter
Backhaus, K., Erichson, B., Gensler, S., Weiber, R., Weiber, T. (2021). Regression Analysis. In: Multivariate Analysis. Springer Gabler, Wiesbaden. https://doi.org/10.1007/978-3-658-32589-3_2
Publisher Name: Springer Gabler, Wiesbaden
Print ISBN: 978-3-658-32588-6
Online ISBN: 978-3-658-32589-3