Regression Analysis

Part of the Springer Series in the Data Sciences book series (SSDS)


Linear regression is the statistical workhorse of the social and physical sciences. At its core, regression is a fairly simple method that fits a line (or a curve) through a set of data points. But the technique can be much more than just fitting a line. It can be thought of as a way to quantify how a variable \(y\) (an outcome, dependent, or target variable) tends to change as the input variables \(X\) (the explanatory or independent variables) change.
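As a minimal sketch of what "fitting a line through a set of data points" looks like in R, the block below simulates a small dataset and fits it with lm(). The data, the true intercept (2), and the true slope (0.5) are illustrative assumptions and do not come from the chapter.

```r
# Simulated data; the intercept (2) and slope (0.5) are illustrative assumptions.
set.seed(42)
x <- runif(100, min = 0, max = 10)
y <- 2 + 0.5 * x + rnorm(100, sd = 1)

# Fit a line through the points by ordinary least squares.
fit <- lm(y ~ x)

# The estimated slope describes how y tends to change per one-unit change in x.
coef(fit)
```

The estimated slope should land close to the value used in the simulation, which is the sense in which regression quantifies how \(y\) tends to change as \(x\) changes.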

  • DOI: 10.1007/978-3-030-71352-2_7
  • Chapter length: 26 pages
  • ISBN: 978-3-030-71352-2


  1. Some quantitative researchers may opt for the word “adjust” rather than “control”. As we will later see, including a variety of factors in a regression adjusts the model to account for sources of information.

  2. In fact, the underlying dataset is a standard “toy” dataset used for teaching statistics and regression.

  3. Recall the classic definition of slope: rise over run (or the change in \(y\) divided by the change in \(x\)).

  4. Regression jargon: You will generally hear/see “intercept” rather than “\(y\)-intercept”.

  5. If the goal is to minimize the mean absolute error, then OLS is not best. Change the objective function and the algorithm will find the new winner.

  6. Also recall that the standard deviation is simply the square root of the variance. The (sample) variance of a variable \(x = \{x_1,\, \ldots ,\, x_n\}\) is \(s^2_x = \text {Var}(x) = \frac{1}{n-1}\sum _{i=1}^{n}(x_i - \overline{x})^2\), where \(\overline{x} = \frac{1}{n}\sum _{i=1}^n x_i\) is the mean of \(x\). Thus, the standard deviation of \(x\) is \(s_x = \sqrt{\text {Var}(x)}\). The correlation between two variables \(x\) and \(y\) is given by \(\text {Cor}(x,\,y) = \frac{\text {Cov}(x,\,y)}{s_x s_y}\), where \(\text {Cov}(x,\,y) = \frac{\sum _{i = 1}^n (x_i - \overline{x})(y_i - \overline{y})}{n-1}\) is the covariance of the two variables. It is also worth remembering that correlation is bounded between \(-1\) (a perfect negative linear relationship) and \(1\) (a perfect positive linear relationship), whereas covariance is unbounded.

  7. For the custom_theme used to generate Figure 7.6, see Chapter 6.

  8. For example, you would write lm(sale.price ~ -1 + gross.square.feet, data = sale_df) if you wanted to estimate the relationship between the sale price and the property size without an intercept.

  9. Which is still amazingly small.

  10. When assembling these data, we restricted the sample to sales between $10,000 and $10,000,000 so as to avoid gifted properties and extreme values.

  11. The null hypothesis in this case is that \(\beta _2 = 0\), which means there is no quadratic relationship.

  12. We actually already ran a multiple regression model when we regressed the price on footage and footage squared.

  13. While we have included Adjusted \(R^2\) in this text and R reports it in regression outputs, friends do not let friends use Adjusted \(R^2\). Mills and Prasad (1992) show that the Adjusted \(R^2\) is a poor measure for model selection.

  14. We used the stargazer package to place the four models side by side in a single regression table, much like those in academic papers.

  15. When differencing a series, we automatically lose one observation. For example, taking a first difference (\(d=1\)) leaves \(n-1\) observations.

  16. Because differencing yields \(n-1\) records, we add an NA value at the beginning of the differenced series so the transformed series can be assigned to a column in uk.

  17. There are simple functions for estimating and producing forecasts, but we have chosen to show the underlying process to give a look inside the seemingly black box.
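The point in note 5 about changing the objective function can be sketched as follows. This is a rough illustration on simulated heavy-tailed data, not the chapter's own code: the data, the helper names sse and sae, and the use of optim() to minimize absolute error are all assumptions made for the example.

```r
# Simulated data with heavy-tailed noise (illustrative assumption).
set.seed(7)
x <- runif(100, 0, 10)
y <- 1 + 2 * x + rt(100, df = 2)

# Two objective functions over b = (intercept, slope); names are hypothetical.
sse <- function(b) sum((y - b[1] - b[2] * x)^2)   # sum of squared errors
sae <- function(b) sum(abs(y - b[1] - b[2] * x))  # sum of absolute errors

# OLS minimizes squared error.
ols <- coef(lm(y ~ x))

# Minimize absolute error numerically, starting the search from the OLS fit.
lad <- optim(ols, sae)$par

# Each line wins on its own objective.
c(ols = sse(ols), lad = sse(lad))
c(ols = sae(ols), lad = sae(lad))
```

The OLS line has the smaller squared error, while the line found by minimizing absolute error does at least as well as OLS on that criterion.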
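The definitions recalled in note 6 can be checked by hand on a tiny example. The vectors below are made-up values chosen so that the arithmetic is easy to follow. The sample variance averages the squared deviations from the mean over \(n-1\).

```r
# Small illustrative vectors (made-up values, not the chapter's data).
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 2, 5, 4)
n <- length(x)

# Sample variance and standard deviation from the definitions.
var_x <- sum((x - mean(x))^2) / (n - 1)
sd_x  <- sqrt(var_x)
sd_y  <- sqrt(sum((y - mean(y))^2) / (n - 1))

# Sample covariance and correlation from the definitions.
cov_xy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)
cor_xy <- cov_xy / (sd_x * sd_y)

# Compare with R's built-in functions.
all.equal(var_x, var(x))
all.equal(cov_xy, cov(x, y))
all.equal(cor_xy, cor(x, y))
```

Here the variance of x is 10, the covariance is 4, and the correlation is 0.8, which sits inside the \([-1,\,1]\) bounds even though the covariance does not.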
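To see what the -1 term in note 8 does, the sketch below fits the model with and without an intercept. The sale_df built here is a simulated stand-in that reuses the column names from the note. Its values are assumptions and not the chapter's sales data.

```r
# Hypothetical stand-in for sale_df; values are simulated, not the chapter's data.
set.seed(1)
sale_df <- data.frame(gross.square.feet = runif(200, 500, 5000))
sale_df$sale.price <- 50000 + 400 * sale_df$gross.square.feet +
  rnorm(200, sd = 100000)

# Default specification: an intercept is included automatically.
fit_int <- lm(sale.price ~ gross.square.feet, data = sale_df)

# Intercept suppressed with -1, as in the note.
fit_noint <- lm(sale.price ~ -1 + gross.square.feet, data = sale_df)

names(coef(fit_int))    # "(Intercept)" "gross.square.feet"
names(coef(fit_noint))  # "gross.square.feet"
```

Writing 0 + gross.square.feet on the right-hand side is an equivalent way to drop the intercept in an R formula.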
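Notes 15 and 16 describe losing one observation when differencing and padding the result with NA. A minimal sketch, assuming a toy data frame named uk with a made-up series y (the column names y and dy are hypothetical):

```r
# Hypothetical stand-in for the uk data frame; the series values are made up.
uk <- data.frame(y = c(100, 103, 101, 108, 112))

# First difference: n = 5 observations become n - 1 = 4.
d1 <- diff(uk$y, differences = 1)
length(d1)  # 4

# Pad with a leading NA so the differenced series lines up with the rows of uk.
uk$dy <- c(NA, d1)
uk$dy  # NA 3 -2 7 4
```

Without the leading NA, assigning the four differences to a column of the five-row data frame would fail with a length-mismatch error.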

Author information

Corresponding author

Correspondence to Jeffrey C. Chen.

Copyright information

© 2021 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Chen, J.C., Rubin, E.A., Cornwall, G.J. (2021). Regression Analysis. In: Data Science for Public Policy. Springer Series in the Data Sciences. Springer, Cham.