Statistical Learning as a Regression Problem

Statistical Learning from a Regression Perspective

Part of the book series: Springer Texts in Statistics (STS)

Abstract

This chapter makes four introductory points: (1) regression analysis is defined by the conditional distribution of Y |X, not by a conventional linear regression model; (2) different forms of regression analysis are properly viewed as approximations of the true relationships, which is a game changer; (3) statistical learning can be just another kind of regression analysis; and (4) properly formulated regression approximations can have, asymptotically, most of the desirable estimation properties. The emphasis on regression analysis is justified in part by the rebranding, by some, of least squares regression as a form of supervised machine learning. Once these points are made, the chapter turns to several key statistical concepts needed for statistical learning: overfitting, data snooping, loss functions, linear estimators, linear basis expansions, the bias–variance tradeoff, resampling, algorithms versus models, and others.

Notes

  1.

    Regularization will have a key role in much of the material ahead. Its goals and features will be addressed as needed.

  2.

    “Realized” here means produced through a random process. Random sampling from a finite population is an example. Data generated by a correct linear regression model can also be said to be realized. After this chapter, we will proceed almost exclusively with a third way in which data can be realized.

  3.

    The data, birthwt, are from the MASS package in R.
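
    For readers who want to follow along, a minimal sketch for loading and inspecting these data (assuming the MASS package is installed):

        library(MASS)                        # birthwt ships with MASS
        data(birthwt)
        str(birthwt)                         # 189 observations on 10 variables
        table(Low = birthwt$low, Smoke = birthwt$smoke)  # low birth weight by smoking status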

  4.

    The \(\chi^2\) test assumes that the marginal distributions of both variables are fixed in repeated realizations of the data. Only the distribution of counts within cells can change. Whether this is a plausible assumption depends on how the data were generated. If the data are a random sample from a well-defined population, the assumption of fixed marginal distributions is not plausible. Both marginal distributions would almost certainly change in new random samples. The spine plot and the mosaic plot were produced using the R package vcd, which stands for “visualizing categorical data.” Its authors are Meyer et al. (2007).
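
    A minimal sketch of how such displays and the \(\chi^2\) test can be produced, using the birthwt variables smoke and low purely for illustration (the vcd package must be installed):

        library(MASS)                 # for the birthwt data
        library(vcd)                  # spine() and mosaic() displays
        tab <- xtabs(~ smoke + low, data = birthwt)
        chisq.test(tab)               # chi-square test of independence
        spine(tab)                    # spine plot
        mosaic(tab)                   # mosaic plot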

  5.

    Although there are certainly no universal naming conventions, “predictors” can be seen as variables that are of subject-matter interest, and “covariates” can be seen as variables that improve the performance of the statistical procedure being applied. Then, covariates are not of subject-matter interest. Whatever the naming conventions, the distinction between variables that matter substantively and variables that matter procedurally is important. An example of the latter is a covariate included in an analysis of randomized experiments to improve statistical precision.

  6.

    A crime is “cleared” when the perpetrator is arrested. In some jurisdictions, a crime is cleared when the perpetrator has been identified, even if there has been no arrest.

  7.

    But nature can certainly specify different predictor values for different students.

  8.

    By “asymptotics,” one loosely means what happens to the properties of an estimate as the number of observations increases without limit. Sometimes, for example, bias in the estimate shrinks to zero, which means that in sufficiently large samples the bias will likely be small. Thus, the desirable estimation properties of logistic regression only materialize asymptotically. This means that one can get very misleading results from logistic regression in small samples if one is working at Level II.

  9.

    This is sometimes called “the fallacy of accepting the null” (Rozeboom 1960).

  10.

    Model selection in some disciplines is called variable selection, feature selection, or dimension reduction. These terms will be used interchangeably.

  11.

    Actually, it can be more complicated. For example, if the predictors are taken to be fixed, one is free to examine the predictors alone. Model selection problems surface when associations with the response variable are examined as well. If the predictors are taken to be random, the issues are even more subtle.

  12.

    If one prefers to think about the issues in a multiple regression context, the single predictor can be replaced by the predictor adjusted, as usual, for its linear relationships with all other predictors.

  13.

    Recall that x is fixed and does not change from dataset to dataset. The new datasets result from variation around the true conditional means.

  14.

    We will see later that by increasing the complexity of the mean function estimated, one has the potential to reduce bias with respect to the true response surface. But an improved fit in the data on hand is no guarantee that one is more accurately representing the true mean function. One complication is that greater mean function complexity can promote overfitting.

  15.

    The next several pages draw heavily on Berk et al. (2019) and Buja et al. (2019a,b).

  16.

    Each case is composed of a set (i.e., vector) of values for the random variables that are included.

  17.

    The notation may seem a little odd. In a finite population, these would be matrices or vectors, and the font would be bold. But the population is of limitless size because it constitutes what could be realized from the joint probability distribution. These random variables are really conceptual constructs. Bold font might have seemed odder still. Another notational scheme could have been introduced for these statistical constructs, but that seems a bit too precious and, in context, unnecessary.

  18.

    For exposition, working with conditional expectations is standard, but there are other options such as conditional probabilities when the predictor is categorical. This will be important in later chapters.

  19.

    They are deviations around a mean, or more properly, an expected value.

  20.

    For example, experiences in the high school years will shape variables such as the high school GPA, the number of advanced placement courses taken, the development of good study habits, an ability to think analytically, and performance on the SAT or ACT test, which, in turn, can be associated with college grades in the freshman year. One can easily imagine representing these variables in a joint probability distribution.

  21.

    We will see later that some “weak” forms of dependence are allowed.

  22.

    This intuitively pleasing idea has in many settings solid formal justification (Efron and Tibshirani 1993: chapter 4).

  23.

    There is no formal way to determine how large is large enough because such determinations are dataset specific.

  24.

    Technically, a prediction interval is not a confidence interval. A confidence interval provides coverage for a parameter such as a mean or regression coefficient. A prediction interval provides coverage for a response variable value. Nevertheless, prediction intervals are often called confidence intervals.
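
    The numerical difference is easy to see with predict() for a linear model; a small sketch with simulated data (not data from the text):

        set.seed(1)
        x <- rnorm(100)
        y <- 2 + 3 * x + rnorm(100)
        fit <- lm(y ~ x)
        new <- data.frame(x = 1)
        predict(fit, new, interval = "confidence")   # interval for the conditional mean
        predict(fit, new, interval = "prediction")   # wider interval for a new response value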

  25.

    The use of split samples means that whatever overfitting or data snooping might result from the fitting procedure applies to the first split and does not taint the residuals from the second split. Moreover, there will typically be no known or practical way to do proper statistical inference that includes uncertainty from the training data and fitting procedure when there is data snooping.

  26.

    This works because the data are assumed to be IID, or at least exchangeable. Therefore, it makes sense to consider the interval in which forecasted values fall with a certain probability (e.g., .95) in limitless IID realizations of the forecasting data.

  27.

    Because of the random split, the way some of the labels line up in the plot may not be quite right when the code is run again. But that is easily fixed.

  28.

    The use of split samples can be a disadvantage. As discussed in some detail later, many statistical learning procedures are sample-size dependent when the fitting is undertaken. Smaller samples lead to fitted values and forecasts that can have more bias with respect to the true response surface. But in trade, no additional assumptions need be made when the second split is used to compute residuals.

  29.

    If the sampling were without replacement, the existing data would simply be reproduced unless the sample size was smaller than N. More will be said about this option in later chapters based on work by Buja and Stuetzle (2006).
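
    A two-line sketch of the contrast (with an arbitrary sample size):

        set.seed(2)
        N <- 10
        sort(sample(1:N, N, replace = TRUE))    # with replacement: duplicates and omissions
        sort(sample(1:N, N, replace = FALSE))   # without replacement: the original N indices reappear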

  30.

    The boot procedures stem from the book by Davison (1997). The code was written by Angelo Canty and Brian Ripley.
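
    As a minimal, hypothetical illustration of the boot package (bootstrapping a sample mean; the data and statistic are invented for this sketch):

        library(boot)
        set.seed(3)
        dat <- data.frame(y = rnorm(50))
        mean_stat <- function(d, i) mean(d$y[i])   # statistic recomputed on each resample
        out <- boot(data = dat, statistic = mean_stat, R = 1000)
        boot.ci(out, type = "perc")                # percentile bootstrap confidence interval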

  31.

    The second-order conditions differ substantially from conventional linear regression because the 1s and 0s are a product of Bernoulli draws (McCullagh and Nelder 1989: Chapter 4). It follows that unlike least squares regression for linear models, logistic regression depends on asymptotics to obtain desirable estimation properties.

  32.

    Some treatments of machine learning include logistic regression as a form of supervised learning. Whether in these treatments logistic regression is seen as a model or an algorithm is often unclear. But it really matters, which will be more apparent shortly.

  33.

    As a categorical statement, this is a little too harsh. Least squares itself is an algorithm that can in fact be used on some statistical learning problems. But regression analysis formulated as a linear model incorporates many additional features that have little or nothing to do with least squares. This will become clearer shortly.

  34.

    There is also “semisupervised” statistical learning that typically concentrates on Y |X, but for which there are more observations for the predictors than for the response. The usual goal is to fit the response better using not just the set of observations for which both Y and X are observed, but also the observations for which only X is observed. Because an analysis of Y |X will often need to consider the joint probability distribution of X as well, the extra data on X alone can be very useful.

  35.

    Recall that because we treat the predictors that constitute X as random variables, the disparities between the approximation and the truth are also random, which allows them to be incorporated in ε.

  36.

    Some academic disciplines like to call the columns of X “inputs,” and Y  an “output” or a “target.” Statisticians typically prefer to call the columns of X “predictors” and Y  a “response.” By and large, the terms predictor (or occasionally, regressor) and response will be used here except when there are links to computer science to be made. In context, there should be no confusion.

  37.

    In later chapters, several procedures will be discussed that can help one consider the “importance” of each input and how inputs are related to outputs.

  38.

    A functional is a function that takes one or more functions as arguments.

  39.

    An estimand is a feature of the joint probability distribution whose value(s) are of primary interest. An estimator is a computational procedure that can provide estimates of the estimand. An estimate is the actual numerical value(s) produced by the estimator. For example, the expected value of a random variable may be the estimand. The usual expression for the mean in an IID dataset can be the estimator. The value of the mean obtained from the sample is the estimate. These terms apply to Level II statistical learning, but with a more complicated conceptual scaffolding.

  40.

    Should a linear probability model be chosen for binomial regression, one could write G = f(X) + ε, which, unlike logistic regression, can be estimated by least squares. However, it has several undesirable properties, such as sometimes returning fitted values larger than 1.0 or smaller than 0.0 (Hastie et al. 2009: section 4.2).
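
    A small simulated example (illustrative only) makes the contrast concrete: least squares applied to a binary response can return fitted values outside [0, 1], whereas logistic regression cannot.

        set.seed(4)
        x <- rnorm(200)
        p <- 1 / (1 + exp(-(-1 + 2 * x)))     # true conditional probabilities
        g <- rbinom(200, 1, p)                # binary response
        lpm <- lm(g ~ x)                      # linear probability model by least squares
        logit <- glm(g ~ x, family = binomial)
        range(fitted(lpm))                    # can fall below 0 or above 1
        range(fitted(logit))                  # always strictly between 0 and 1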

  41.

    The term “mean function” can be a little misleading when the response variable is G. It would be more correct to use “proportion function” or “probability function.” But “mean function” seems to be standard, and we will stick with it.

  42.

    Clustering algorithms have been in use since the 1940s (Cattell 1943), long before the discipline of computer science was born. When rebranded as unsupervised learning, these procedures are just old wine in new bottles. There are other examples of conceptual imperialism, many presented as a form of supervised learning. A common instance is logistic regression, which dates back to at least the 1950s (Cox 1958). Other academic disciplines also engage in rebranding. The very popular difference-in-differences estimator claimed by economists as their own (Abadie 2005) was initially developed by educational statisticians a generation earlier (Linn and Slinde 1977), and the associated research design was formally proposed by Campbell and Stanley (1963).

  43.

    Tuning is somewhat like setting the dials on a coffee grinder by trial and error to determine how fine the grind should be and how much ground coffee should be produced.

  44.

    For linear models, several in-sample solutions have been proposed for data snooping (Berk et al. 2014a,b; Lockhart et al. 2014; Lei et al. 2018), but they are not fully satisfactory, especially for statistical learning.

  45.

    The issues surrounding statistical inference are more complicated. The crux is that uncertainty in the training data and in the algorithm is ignored when performance is gauged solely with test data. This is addressed in the chapters ahead.

  46.

    Loss functions are also called “objective functions” or “cost functions”.

  47.

    In R, many estimation procedures have a prediction procedure that can easily be used with test data to arrive at test data fitted values.
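
    For example, with a random training/test split and a linear model (simulated data, for illustration):

        set.seed(5)
        n <- 200
        dat <- data.frame(x = rnorm(n))
        dat$y <- 1 + 2 * dat$x + rnorm(n)
        train <- sample(1:n, n / 2)                        # indices of the training split
        fit <- lm(y ~ x, data = dat[train, ])              # fit with training data only
        yhat <- predict(fit, newdata = dat[-train, ])      # test-data fitted values
        mean((dat$y[-train] - yhat)^2)                     # test mean squared error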

  48.

    As noted earlier, one likely would be better off using evaluation data to determine the order of the polynomial.

  49.

    The transformation is linear because the \(\hat{y}_i\) are a linear combination of the \(y_i\). This does not mean that the relationships between X and y are necessarily linear.
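
    The point can be checked directly: the fitted values equal \(Hy\), where \(H = X(X'X)^{-1}X'\) is the hat matrix. A short sketch with simulated data:

        set.seed(6)
        x <- rnorm(25)
        y <- 1 + 2 * x + rnorm(25)
        X <- cbind(1, x)                           # design matrix with an intercept column
        H <- X %*% solve(t(X) %*% X) %*% t(X)      # hat matrix
        all.equal(as.vector(H %*% y),
                  unname(fitted(lm(y ~ x))))       # TRUE: the fitted values are linear in y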

  50.

    This is a carry-over from conventional linear regression in which X is fixed. When X is random, Eq. (1.16) does not change. There are, nevertheless, important consequences for estimation that we have begun to address. One may think of the \(\hat{y}_i\) as estimates of the population approximation, not of the true response surface.

  51.

    Emphasis in the original.

  52.

    The residual degrees of freedom can then be computed by subtraction (see also Green and Silverman 1994: Sect. 3.3.4).
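
    As a sketch of the arithmetic, using the hat matrix of a simple linear fit as the smoother matrix (for illustration only):

        set.seed(7)
        x <- rnorm(30)
        y <- 1 + 2 * x + rnorm(30)
        X <- cbind(1, x)
        S <- X %*% solve(t(X) %*% X) %*% t(X)   # smoother (hat) matrix for this fit
        sum(diag(S))                            # effective degrees of freedom: trace of S (2 here)
        length(y) - sum(diag(S))                # residual degrees of freedom by subtraction (28 here)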

  53.

    The symbol I denotes an indicator function. The result is equal to 1 if the argument in brackets is true and equal to 0 if the argument in brackets is false. The 1s and 0s constitute an indicator variable. Sometimes indicator variables are called dummy variables.
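
    In R, such an indicator variable can be built in one line; an arbitrary numeric example:

        z <- c(2, 5, 7, 1)
        as.numeric(z > 3)    # I(z > 3) returns 1 when true, 0 when false: 0 1 1 0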

  54.

    To properly employ a Level II framework, lots of hard thought would be necessary. For example, are the observations realized independently, as the joint probability distribution approach requires? And if not, then what?

References

  • Abadie, A. (2005). Semiparametric difference-in-differences estimators. Review of Economic Studies, 72(1), 1–19.

  • Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In B. N. Petrov & F. Csáki (Eds.), International symposium on information theory (pp. 267–281). Budapest: Akadémiai Kiadó.

  • Angrist, J. D., & Pischke, J. (2009). Mostly harmless econometrics. Princeton: Princeton University Press.

  • Barber, D. (2012). Bayesian reasoning and machine learning. Cambridge: Cambridge University Press.

  • Berk, R. A. (2003). Regression analysis: A constructive critique. Newbury Park, CA: SAGE.

  • Berk, R. A. (2005). New claims about executions and general deterrence: Déjà vu all over again? Journal of Empirical Legal Studies, 2(2), 303–330.

  • Berk, R. A., & Freedman, D. A. (2003). Statistical assumptions as empirical commitments. In T. Blomberg & S. Cohen (Eds.), Law, punishment, and social control: Essays in honor of Sheldon Messinger, Part V (pp. 235–254). Aldine de Gruyter; first published November 1995, revised in the second edition, 2003.

  • Berk, R. A., Kriegler, B., & Ylvisaker, D. (2008). Counting the homeless in Los Angeles County. In D. Nolan & T. Speed (Eds.), Probability and statistics: Essays in honor of David A. Freedman. Monograph series of the Institute of Mathematical Statistics.

  • Berk, R. A., Brown, L., & Zhao, L. (2010). Statistical inference after model selection. Journal of Quantitative Criminology, 26, 217–236.

  • Berk, R. A., Brown, L., Buja, A., Zhang, K., & Zhao, L. (2014a). Valid post-selection inference. Annals of Statistics, 41(2), 802–837.

  • Berk, R. A., Brown, L., Buja, A., George, E., Pitkin, E., Zhang, K., et al. (2014b). Misspecified mean function regression: Making good use of regression models that are wrong. Sociological Methods and Research, 43, 422–451.

  • Berk, R. A., Buja, A., Brown, L., George, E., Kuchibhotla, A. K., Su, W., et al. (2019). Assumption lean regression. The American Statistician. Published online, April 12, 2019.

  • Bishop, C. M. (2006). Pattern recognition and machine learning. New York: Springer.

  • Bolen, C. (2019). Goldman banker snared by AI as U.S. government embraces new tech. Bloomberg Government. Posted July 8, 2019.

  • Bound, J., Jaeger, D. A., & Baker, R. M. (1995). Problems with instrumental variables estimation when the correlation between the instruments and the endogenous explanatory variable is weak. Journal of the American Statistical Association, 90(430), 443–450.

  • Box, G. E. P. (1976). Science and statistics. Journal of the American Statistical Association, 71(356), 791–799.

  • Breiman, L. (2001b). Statistical modeling: Two cultures (with discussion). Statistical Science, 16, 199–231.

  • Buja, A., & Stuetzle, W. (2006). Observations on bagging. Statistica Sinica, 16(2), 323–352.

  • Buja, A., Berk, R., Brown, L., George, E., Pitkin, E., Traskin, M., et al. (2019a). Models as approximations—Part I: A conspiracy of random regressors and model deviations against classical inference in regression. Statistical Science, 34(4), 523–544.

  • Buja, A., Berk, R., Brown, L., George, E., Kuchibhotla, A. K., & Zhao, L. (2019b). Models as approximations—Part II: A general theory of model-robust regression. Statistical Science, 34(4), 545–565.

  • Campbell, D. T., & Stanley, J. C. (1963). Experimental and quasi-experimental designs for research. Boston: Cengage Learning.

  • Cattell, R. B. (1943). The description of personality: Basic traits resolved into clusters. Journal of Abnormal and Social Psychology, 38(4), 476–506.

  • Cochran, W. G. (1977). Sampling techniques (3rd edn.). New York: Wiley.

  • Cook, R. D., & Weisberg, S. (1999). Applied regression including computing and graphics. New York: Wiley.

  • Cox, D. R. (1958). The regression analysis of binary sequences (with discussion). Journal of the Royal Statistical Society, Series B, 20(2), 215–242.

  • Cristianini, N., & Shawe-Taylor, J. (2000). Support vector machines. Cambridge, UK: Cambridge University Press.

  • Dasu, T., & Johnson, T. (2003). Exploratory data mining and data cleaning. New York: Wiley.

  • Davison, A. C. (1997). Bootstrap methods and their application. Cambridge, UK: Cambridge University Press.

  • Edgington, E. S., & Onghena, P. (2007). Randomization tests (4th edn.). New York: Chapman & Hall.

  • Efron, B. (1986). How biased is the apparent error rate of a prediction rule? Journal of the American Statistical Association, 81(394), 461–470.

  • Efron, B., & Tibshirani, R. (1993). An introduction to the bootstrap. New York: Chapman & Hall.

  • Eicker, F. (1963). Asymptotic normality and consistency of the least squares estimators for families of linear regressions. Annals of Mathematical Statistics, 34, 447–456.

  • Eicker, F. (1967). Limit theorems for regressions with unequal and dependent errors. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability (Vol. 1, pp. 59–82).

  • Faraway, J. J. (2014). Does data splitting improve prediction? Statistics and Computing, 26(1–2), 49–60.

  • Freedman, D. A. (1981). Bootstrapping regression models. Annals of Statistics, 9(6), 1218–1228.

  • Freedman, D. A. (1987). As others see us: A case study in path analysis (with discussion). Journal of Educational Statistics, 12, 101–223.

  • Freedman, D. A. (2004). Graphical models for causation and the identification problem. Evaluation Review, 28, 267–293.

  • Freedman, D. A. (2009a). Statistical models. Cambridge, UK: Cambridge University Press.

  • Freedman, D. A. (2009b). Diagnostics cannot have much power against general alternatives. International Journal of Forecasting, 25, 833–839.

  • Freedman, D. A. (2012). On the so-called ‘Huber sandwich estimator’ and ‘robust standard errors.’ The American Statistician, 60(4), 299–302.

  • Geisser, S. (1993). Predictive inference: An introduction. New York: Chapman & Hall.

  • Green, P. J., & Silverman, B. W. (1994). Nonparametric regression and generalized linear models. New York: Chapman & Hall.

  • Hall, P. (1997). The bootstrap and Edgeworth expansion. New York: Springer.

  • Hand, D., Mannila, H., & Smyth, P. (2001). Principles of data mining. Cambridge, MA: MIT Press.

  • Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning (2nd edn.). New York: Springer.

  • Huber, P. J. (1967). The behavior of maximum likelihood estimates under nonstandard conditions. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability (Vol. 1, pp. 221–233).

  • Janson, L., Fithian, W., & Hastie, T. (2015). Effective degrees of freedom: A flawed metaphor. Biometrika, 102(2), 479–485.

  • Jöreskog, K. G. (1979). Advances in factor analysis and structural equation models. Cambridge, MA: Abt Books.

  • Kaufman, S., & Rosset, S. (2014). When does more regularization imply fewer degrees of freedom? Sufficient conditions and counterexamples from the lasso and ridge regression. Biometrika, 101(4), 771–784.

  • Leamer, E. E. (1978). Specification searches: Ad hoc inference with nonexperimental data. New York: Wiley.

  • Leeb, H., & Pötscher, B. M. (2005). Model selection and inference: Facts and fiction. Econometric Theory, 21, 21–59.

  • Leeb, H., & Pötscher, B. M. (2006). Can one estimate the conditional distribution of post-model-selection estimators? The Annals of Statistics, 34(5), 2554–2591.

  • Leeb, H., & Pötscher, B. M. (2008). Model selection. In T. G. Anderson, R. A. Davis, J.-P. Kreiß, & T. Mikosch (Eds.), The handbook of financial time series (pp. 785–821). New York: Springer.

  • Lei, J., G’Sell, M., Rinaldo, A., Tibshirani, R. J., & Wasserman, L. (2018). Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523), 1094–1111.

  • Linn, R. L., & Slinde, J. A. (1977). The determination of the significance of change between pre- and post-testing periods. Review of Educational Research, 47, 121–150.

  • Lockhart, R., Taylor, J., Tibshirani, R. J., & Tibshirani, R. (2014). A significance test for the lasso (with discussion). Annals of Statistics, 42(2), 413–468.

  • Mallows, C. L. (1973). Some comments on \(C_p\). Technometrics, 15(4), 661–675.

  • Marsland, S. (2014). Machine learning: An algorithmic perspective (2nd edn.). New York: Chapman & Hall.

  • McCullagh, P., & Nelder, J. A. (1989). Generalized linear models (2nd edn.). New York: Chapman & Hall.

  • Meyer, D., Zeileis, A., & Hornik, K. (2007). The strucplot framework: Visualizing multi-way contingency tables with vcd. Journal of Statistical Software, 17(3), 1–48.

  • Michelucci, P., & Dickinson, J. L. (2016). The power of crowds: Combining humans and machines to help tackle increasingly hard problems. Science, 351(6268), 32–33.

  • Murdoch, D., Tsai, Y., & Adcock, J. (2008). P-values are random variables. The American Statistician, 62, 242–245.

  • Murphy, K. P. (2012). Machine learning: A probabilistic perspective. Cambridge, MA: MIT Press.

  • Nagin, D. S., & Pepper, J. V. (2012). Deterrence and the death penalty. Washington, D.C.: National Research Council.

  • Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716.

  • Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2(11), 559–572.

  • Rice, J. A. (2007). Mathematical statistics and data analysis (3rd edn.). Belmont, CA: Duxbury Press.

  • Rozeboom, W. W. (1960). The fallacy of null-hypothesis significance tests. Psychological Bulletin, 57(5), 416–428.

  • Rubin, D. B. (1986). Which ifs have causal answers. Journal of the American Statistical Association, 81, 961–962.

  • Rubin, D. B. (2008). For objective causal inference, design trumps analysis. Annals of Applied Statistics, 2(3), 808–840.

  • Rummel, R. J. (1988). Applied factor analysis. Evanston, IL: Northwestern University Press.

  • Ruppert, D., Wand, M. P., & Carroll, R. J. (2003). Semiparametric regression. Cambridge, UK: Cambridge University Press.

  • Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464.

  • Stigler, S. M. (1981). Gauss and the invention of least squares. The Annals of Statistics, 9(3), 465–474.

  • Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning (2nd edn.). Cambridge, MA: MIT Press.

  • Torgerson, W. (1958). Theory and methods of scaling. New York: Wiley.

  • Weisberg, S. (2013). Applied linear regression (4th edn.). New York: Wiley.

  • White, H. (1980a). Using least squares to approximate unknown regression functions. International Economic Review, 21(1), 149–170.

  • Witten, I. H., & Frank, E. (2000). Data mining. San Francisco: Morgan Kaufmann.


Copyright information

© 2020 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Berk, R.A. (2020). Statistical Learning as a Regression Problem. In: Statistical Learning from a Regression Perspective. Springer Texts in Statistics. Springer, Cham. https://doi.org/10.1007/978-3-030-40189-4_1
