## Abstract

Linear regression is the statistical workhorse of the social and physical sciences. At its core, regression is a fairly simple method that fits a line (or a curve) through a set of data points. But the technique can be much more than just fitting a line. It can be thought of as a way to quantify how a variable \(y\) (an *outcome*, *dependent*, or *target* variable) tends to change as *input variables* \(X\) (the *explanatory* or *independent* variables) change.
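As a minimal sketch of the idea, one can simulate data with a known linear relationship and recover the line with R's `lm()` function. (The data here are made up for illustration; they are not the chapter's dataset.)

```r
# Simulate data with a known relationship: intercept 2, slope 3
set.seed(101)
x <- runif(100, 0, 10)           # input (independent) variable
y <- 2 + 3 * x + rnorm(100)      # outcome variable, plus random noise

# Fit a line through the points by ordinary least squares
fit <- lm(y ~ x)
coef(fit)                        # estimates should be close to 2 and 3
```

The fitted coefficients quantify how \(y\) tends to change as \(x\) changes: the slope estimate is the expected change in \(y\) for a one-unit change in \(x\).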

## Notes

- 1.
Some quantitative researchers may opt for the word “adjust” rather than “control”. As we will later see, including a variety of factors in a regression *adjusts* the model to account for sources of information.

- 2.
In fact, the underlying dataset is a standard “toy” dataset used for teaching statistics and regression.

- 3.
Recall the classic definition of slope: rise over run (or the change in \(y\) divided by the change in \(x\)).

- 4.
Regression jargon: You will generally hear/see “intercept” rather than “\(y\)-intercept”.

- 5.
If the goal is instead to minimize the mean absolute error, then OLS is no longer the best choice. Change the objective function and the algorithm will find the new winner.

- 6.
Also recall that the standard deviation is simply the square root of the variance. The (sample) variance of a variable \(x = \{x_1,\, \ldots ,\, x_n\}\) is \(s^2_x = \text {Var}(x) = \frac{1}{n-1}\sum _{i=1}^{n}(x_i - \overline{x})^2\) (where \(\overline{x} = \frac{1}{n}\sum _{i=1}^n x_i\), i.e., the mean of \(x\)). Thus, the standard deviation of \(x\) is \(s_x = \sqrt{\text {Var}(x)}\). The correlation between two variables \(x\) and \(y\) is given by \(\text {Cor}(x,\,y) = \frac{\text {Cov}(x,\,y)}{s_x s_y}\), where \(\text {Cov}(x,\,y) = \frac{\sum _{i = 1}^n (x_i - \overline{x})(y_i - \overline{y})}{n-1}\) is the covariance of the two variables. It is also worth remembering that correlation is bounded between \(-1\) (a strongly linear negative relationship) and \(1\) (a strongly linear positive relationship), whereas covariance is unbounded.
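These identities are easy to verify numerically. A small R sketch, using made-up numbers:

```r
x <- c(1, 3, 5, 7, 9)
y <- c(2, 4, 7, 8, 12)
n <- length(x)

# Sample variance: squared deviations from the mean, divided by n - 1
v <- sum((x - mean(x))^2) / (n - 1)
all.equal(v, var(x))           # matches R's built-in var()
all.equal(sqrt(v), sd(x))      # standard deviation is sqrt of variance

# Correlation is covariance rescaled by both standard deviations
all.equal(cov(x, y) / (sd(x) * sd(y)), cor(x, y))
```

Because the rescaling divides out the units of both variables, the correlation always lands in \([-1, 1]\), while the covariance can be any real number.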

- 7.
- 8.
For example, you would write `lm(sale.price ~ -1 + gross.square.feet, data = sale_df)` if you wanted to estimate the relationship between the sale price and the property size without an intercept.

- 9.
Which is still amazingly small.

- 10.
When assembling these data, we restricted the sample to sales between $10,000 and $10,000,000 so as to avoid gifted properties and extreme values.

- 11.
The null hypothesis in this case is that \(\beta _2 = 0\), i.e., that there is no quadratic relationship.

- 12.
We actually already ran a multiple regression model when we regressed the price on footage and footage squared.
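A sketch of that footage-and-footage-squared regression in R. The variable names follow the chapter's examples, but the `sale_df` data frame here is simulated for illustration, not the chapter's sales data:

```r
# Simulated stand-in for the sales data: price rises with footage,
# with a genuine quadratic component built in
set.seed(202)
sale_df <- data.frame(gross.square.feet = runif(200, 500, 5000))
sale_df$sale.price <- 1e5 + 150 * sale_df$gross.square.feet +
  0.01 * sale_df$gross.square.feet^2 + rnorm(200, sd = 5e4)

# I() protects ^ inside a formula so R literally squares the variable
fit <- lm(sale.price ~ gross.square.feet + I(gross.square.feet^2),
          data = sale_df)
summary(fit)$coefficients   # the t-test on the squared term tests beta_2 = 0
```

Because the squared term enters the model as just another regressor, this is already a multiple regression: two explanatory variables, one of which happens to be a transformation of the other.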

- 13.
While we have included Adjusted \(R^2\) in this text and R reports it in regression outputs, friends do not let friends use Adjusted \(R^2\). Mills and Prasad (1992) show that the Adjusted \(R^2\) is a poor measure for model selection.

- 14.
We used the stargazer package to co-locate the four models in a regression table much like those in academic papers.

- 15.
When differencing a series, we automatically lose one observation. For example, taking a first difference (\(d=1\)) leaves \(n-1\) observations.

- 16.
Because differencing yields \(n-1\) records, we add an `NA` value at the beginning of the differenced series so the transformed series can be assigned to a column in `uk`.
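A minimal R sketch of that padding step, using a made-up series (the chapter's `uk` data frame is not reproduced here):

```r
# diff() returns n - 1 values, so pad with NA to keep the
# column length equal to the original series
x  <- c(100, 103, 101, 106, 110)
d1 <- diff(x)          # first differences: 3, -2, 5, 4 (length n - 1)
padded <- c(NA, d1)    # prepend NA so lengths match
length(padded)         # 5, same as the original series
```

With the lengths aligned, `padded` can be assigned directly as a new data-frame column alongside the original series.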

- 17.
There are simple functions for estimating and producing forecasts, but we have chosen to show the underlying process to give a look inside the seemingly black box.


## Copyright information

© 2021 Springer Nature Switzerland AG

## About this chapter

### Cite this chapter

Chen, J.C., Rubin, E.A., Cornwall, G.J. (2021). Regression Analysis. In: Data Science for Public Policy. Springer Series in the Data Sciences. Springer, Cham. https://doi.org/10.1007/978-3-030-71352-2_7


Print ISBN: 978-3-030-71351-5

Online ISBN: 978-3-030-71352-2

eBook Packages: Mathematics and Statistics (R0)