An Introduction to Statistical Learning from a Regression Perspective

Abstract

Statistical learning is a loose collection of procedures in which key features of the final results are determined inductively. There are clear historical links to exploratory data analysis. There are also clear links to techniques in statistics such as principal components analysis, clustering, and smoothing, and to long-standing concerns in computer science, such as pattern recognition and edge detection. But statistical learning would not exist were it not for recent developments in raw computing power, computer algorithms, and theory from statistics, computer science, and applied mathematics. It can be very computationally intensive. Extensive discussions of statistical learning can be found in Hastie et al. (2009) and Bishop (2006). Statistical learning is also sometimes called machine learning or reinforcement learning, especially when discussed in the context of computer science.


Notes

  1.

    Statisticians commonly define regression analysis so that the aim is to understand “as far as possible with the available data how the conditional distribution of some response variable varies across subpopulations determined by the possible values of the predictor or predictors” (Cook and Weisberg 1999: 27). Interest centers on the distribution of the response variable conditioning on one or more predictors. Often the conditional mean is the key parameter.
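
    As a small illustration of this definition, the following R sketch (with simulated data and purely illustrative variable names) examines how the conditional distribution of a response varies across subpopulations defined by a single categorical predictor; the conditional means are one convenient summary.

    ## Regression as the study of conditional distributions: a minimal sketch.
    ## The data and variable names are made up for illustration only.
    set.seed(1)
    x <- sample(c("low", "medium", "high"), 500, replace = TRUE)
    y <- rnorm(500, mean = c(low = 1, medium = 2, high = 4)[x], sd = 1)

    tapply(y, x, mean)      # conditional means within each subpopulation
    tapply(y, x, quantile)  # a fuller view of each conditional distribution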

  2.

    Under these circumstances, statistical learning is in the tradition of principal components analysis and clustering.

  3.

    This may seem much like standard numerical methods, such as when the Newton–Raphson algorithm is applied to logistic regression. We will see that it is not. In logistic regression, for example, the form of the relationships between the predictors and the response is determined before the algorithm is launched. In statistical learning, one important job of the algorithm is to determine those relationships.
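
    The contrast can be made concrete with a short, hedged R sketch. The data frame dat, binary response fail, and predictors age and priors are hypothetical; glm() and rpart() are standard R tools for the two approaches.

    library(rpart)

    ## Logistic regression: the linear-in-the-parameters form of the
    ## predictor-response relationship is fixed before the iterative
    ## (Newton-Raphson/IWLS) fitting begins.
    logit_fit <- glm(fail ~ age + priors, family = binomial, data = dat)

    ## A classification tree: the algorithm searches for the splits itself,
    ## so the form of the relationship is determined inductively.
    tree_fit <- rpart(fail ~ age + priors, data = dat, method = "class")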

  4.

    To provide exact instructions would require formal mathematical notation or the actual computer code. That level of precision is probably not necessary for this chapter and can be found elsewhere (e.g., in the source code for the procedure randomForest in R). There are necessarily a few ambiguities, therefore, in the summary of this algorithm and the two to follow. Also, only the basics are discussed. There are extensions and variants to address special problems.
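
    For readers who want to experiment, a minimal call to the randomForest procedure mentioned above might look as follows. The data frame dat and its variables are hypothetical, and only a few of the many arguments are shown.

    library(randomForest)

    rf_fit <- randomForest(fail ~ age + priors + gender,
                           data = dat,
                           ntree = 500,        # number of trees in the ensemble
                           importance = TRUE)  # retain predictor importance measures

    print(rf_fit)       # out-of-bag performance summary
    importance(rf_fit)  # variable importance measures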

  5.

    Shrinkage estimators have a long history, starting with empirical Bayes methods and ridge regression. The basic idea is to force a collection of estimated values (e.g., a set of conditional means) toward a common value. For regression applications, the estimated regression coefficients are “shrunk” toward zero. A small amount of bias is introduced into the estimates in exchange for a substantial reduction in their variance, and the result can be an overall reduction in mean squared error. Shrinkage can also be used for model selection when some regression coefficients are shrunk to zero and others are not. Shrinkage is discussed at some length in Hastie et al. (2009: 61–69) and Berk (2008: 61–69, 167–174), and is closely related to “regularization” methods.
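
    In R, one common way to apply shrinkage in regression is the glmnet package; a minimal sketch follows, assuming a hypothetical numeric predictor matrix X and response vector y.

    library(glmnet)

    ridge_fit <- glmnet(X, y, alpha = 0)  # ridge: coefficients shrunk toward zero
    lasso_fit <- glmnet(X, y, alpha = 1)  # lasso: some coefficients shrunk exactly to zero

    ## Cross-validation to choose the amount of shrinkage (the penalty lambda):
    cv_lasso <- cv.glmnet(X, y, alpha = 1)
    coef(cv_lasso, s = "lambda.min")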

  6.

    Recall that when predictors in a regression analysis are correlated, the covariate adjustments (“partialling”) can cause the predictors most strongly related to the response to dominate the fitting process, while predictors only weakly related to the response play almost no role. This can be a particular problem for nonlinear response functions, because predictors that are weakly related to the response overall may be critical for characterizing a small but essential part of the nonlinear function. By sampling predictors, random forests allows the set of candidate predictors to vary across splits and across trees, so that some of the time weak predictors have to “compete” only with other weak predictors.
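
    In the randomForest implementation, this predictor sampling is controlled by the mtry argument, the number of predictors drawn at random as candidates for each split. A brief sketch, again with a hypothetical data frame:

    library(randomForest)

    ## Smaller mtry gives weakly related predictors more chances to define a
    ## split; the default for classification is roughly the square root of the
    ## number of predictors. (dat must contain at least 6 predictors for the
    ## second call.)
    rf_small_mtry <- randomForest(fail ~ ., data = dat, mtry = 2)
    rf_large_mtry <- randomForest(fail ~ ., data = dat, mtry = 6)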

  7.

    Overfitting occurs when a statistical procedure responds to idiosyncratic features of the data. As a result, the patterns found do not generalize well; with a new data set, the story changes. In regression, overfitting becomes more serious as the number of parameters being estimated increases for a fixed sample size. The most widely appreciated example is stepwise regression, in which a new set of regression coefficients is estimated at each step. Overfitting can also be a problem in conventional linear regression as the number of regression coefficient estimates approaches the sample size, and it can occur because of data snooping, where the researcher, rather than some algorithm, is the guilty party. The best way to take overfitting into account is to do the analysis on training data and evaluate the results using test data; ideally, the training data and test data are random samples from the same population. One might well imagine that overfitting would be a problem for random forests. Breiman (2001a) proves that it is not.
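
    The training/test strategy is easy to implement; a minimal sketch follows, with a hypothetical data frame dat and binary response fail.

    library(randomForest)

    set.seed(123)
    train_rows <- sample(nrow(dat), size = floor(0.7 * nrow(dat)))
    train <- dat[train_rows, ]
    test  <- dat[-train_rows, ]

    rf_fit <- randomForest(fail ~ ., data = train)

    ## Evaluate on observations the algorithm never saw:
    preds <- predict(rf_fit, newdata = test)
    table(observed = test$fail, predicted = preds)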

  8.

    Table 34.1 raises several other important issues, but they are beyond the scope of this review (see Berk et al. 2009a).

  9.

    Although one can construct partial dependence plots for categorical predictors, all one can see is a bar chart. For each category, the height of the bar is the average response value. The order of the bars along the horizontal axis and their distance from one another are necessarily arbitrary.
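
    The randomForest package will produce such a bar chart through its partialPlot() function; a sketch follows, with a hypothetical data frame dat and factor predictor neighborhood.

    library(randomForest)

    rf_fit <- randomForest(fail ~ ., data = dat)

    ## For a factor predictor, one bar per category is drawn; bar heights are
    ## averaged predictions, and the ordering of the categories is arbitrary.
    partialPlot(rf_fit, pred.data = dat, x.var = "neighborhood")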

  10.

    A coding of 1 and 0 can work too. But the 1 or −1 coding leads to the convenient result that the sign of the fitted value determines class membership.
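
    A toy illustration (not the boosting algorithm itself) of why the 1/−1 coding is convenient: with any numeric fitting procedure, the sign of the fitted value can be read directly as the assigned class. The data below are simulated.

    set.seed(2)
    x <- rnorm(200)
    y <- ifelse(x + rnorm(200) > 0, 1, -1)  # classes coded 1 and -1

    fit <- lm(y ~ x)                        # any numeric fit will do here
    class_assignment <- sign(fitted(fit))   # +1 or -1 is the assigned class
    table(y, class_assignment)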

  11.

    Recall that the loss function of least squares regression, for example, is the sum of the squared residuals.
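
    For a conventional regression fit in R, that loss can be computed directly; the simulated data below are illustrative only.

    set.seed(3)
    x <- rnorm(100)
    y <- 2 * x + rnorm(100)
    fit <- lm(y ~ x)
    sum(residuals(fit)^2)  # the sum of squared residuals that lm() minimizes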

  12.

    Stochastic gradient boosting samples the training data in the same spirit as random forests.
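
    In the gbm package, this sampling is governed by the bag.fraction argument; a hedged sketch with a hypothetical data frame dat and a binary response fail coded 0/1:

    library(gbm)

    boost_fit <- gbm(fail ~ age + priors + gender,
                     data = dat,
                     distribution = "bernoulli",  # binary response coded 0/1
                     n.trees = 2000,
                     interaction.depth = 3,
                     shrinkage = 0.01,
                     bag.fraction = 0.5)          # each tree fit to a random half of the training data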

  13.

    In quantile regression, the fitted values are conditional quantiles, not conditional means. For example, the fitted values for the 50th quantile are conditional medians (Koenker 2005). If, say, the 75th quantile is used, underestimates are treated as three times more costly (0.75/0.25) than overestimates.
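
    A minimal sketch using the quantreg package that accompanies Koenker (2005); the data frame dat, response cost, and predictor caseload are hypothetical.

    library(quantreg)

    ## tau = 0.50 fits the conditional median; tau = 0.75 fits the conditional
    ## 75th percentile, so that underestimates are penalized three times as
    ## heavily as overestimates.
    median_fit <- rq(cost ~ caseload, tau = 0.50, data = dat)
    q75_fit    <- rq(cost ~ caseload, tau = 0.75, data = dat)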

References

  • Berk RA (2003) Regression analysis: a constructive critique. Sage Publications, Newbury Park, CA
  • Berk RA (2008) Statistical learning from a regression perspective. Springer, New York
  • Berk RA, Sherman L, Barnes G, Kurtz E, Lindsay A (2009a) Forecasting murder within a population of probationers and parolees: a high stakes application of statistical forecasting. J R Stat Soc Ser A 172(part 1):191–211
  • Berk RA, Brown L, Zhao L (2009b) Statistical inference after model selection. Working paper, Department of Statistics, University of Pennsylvania
  • Bishop CM (2006) Pattern recognition and machine learning. Springer, New York
  • Breiman L (1996) Bagging predictors. Mach Learn 24:123–140
  • Breiman L (2001a) Random forests. Mach Learn 45:5–32
  • Breiman L (2001b) Statistical modeling: two cultures (with discussion). Stat Sci 16:199–231
  • Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Wadsworth Press, Monterey, CA
  • Buja A, Stuetzle W, Shen Y (2005) Loss functions for binary class probability estimation and classification: structure and applications. Unpublished manuscript, Department of Statistics, The Wharton School, University of Pennsylvania
  • Bühlmann P, Yu B (2006) Sparse boosting. J Mach Learn Res 7:1001–1024
  • Cook RD, Weisberg S (1999) Applied regression including computing and graphics. Wiley, New York
  • Efron B, Tibshirani R (1993) An introduction to the bootstrap. Chapman & Hall, New York
  • Freedman DA (2004) Graphical models for causation and the identification problem. Eval Rev 28:267–293
  • Freedman DA (2005) Statistical models: theory and practice. Cambridge University Press, Cambridge
  • Freund Y, Schapire R (1996) Experiments with a new boosting algorithm. In: Machine learning: proceedings of the 13th international conference. Morgan Kaufmann, San Francisco, pp 148–156
  • Friedman JH (2001) Greedy function approximation: a gradient boosting machine. Ann Stat 29:1189–1232
  • Friedman JH (2002) Stochastic gradient boosting. Comput Stat Data Anal 38:367–378
  • Friedman JH, Hastie T, Tibshirani R (2000) Additive logistic regression: a statistical view of boosting (with discussion). Ann Stat 28:337–407
  • Friedman JH, Hastie T, Rosset S, Tibshirani R, Zhu J (2004) Discussion of boosting papers. Ann Stat 32:102–107
  • Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning, 2nd edn. Springer, New York
  • Holland P (1986) Statistics and causal inference. J Am Stat Assoc 81:945–960
  • Koenker R (2005) Quantile regression. Cambridge University Press, Cambridge
  • Kriegler B, Berk RA (2009) Estimating the homeless population in Los Angeles: an application of cost-sensitive stochastic gradient boosting. Working paper, Department of Statistics, UCLA
  • Leeb H, Pötscher BM (2006) Can one estimate the conditional distribution of post-model-selection estimators? Ann Stat 34(5):2554–2591
  • Lin Y, Jeon Y (2006) Random forests and adaptive nearest neighbors. J Am Stat Assoc 101:578–590
  • Mannor S, Meir R, Zhang T (2002) The consistency of greedy algorithms for classification. In: Kivinen J, Sloan RH (eds) COLT 2002. LNAI, vol 2375, pp 319–333
  • McCullagh P, Nelder JA (1989) Generalized linear models, 2nd edn. Chapman & Hall, New York
  • Mease D, Wyner AJ (2008) Evidence contrary to the statistical view of boosting. J Mach Learn Res 9:1–26
  • Mease D, Wyner AJ, Buja A (2007) Boosted classification trees and class probability/quantile estimation. J Mach Learn Res 8:409–439
  • Morgan SL, Winship C (2007) Counterfactuals and causal inference: methods and principles for social research. Cambridge University Press, Cambridge
  • Schapire RE (1999) A brief introduction to boosting. In: Proceedings of the 16th international joint conference on artificial intelligence
  • Schapire RE (2002) The boosting approach to machine learning: an overview. In: MSRI workshop on nonlinear estimation and classification
  • Tibshirani RJ (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B 58:267–288
  • Traskin M (2008) The role of bootstrap sample size in the consistency of the random forest algorithm. Technical report, Department of Statistics, University of Pennsylvania
  • Vapnik V (1996) The nature of statistical learning theory. Springer, New York
  • Wyner AJ (2003) Boosting and exponential loss. In: Bishop CM, Frey BJ (eds) Proceedings of the 9th annual conference on AI and statistics, Jan 3–6, Key West, FL
  • Zhang T, Yu B (2005) Boosting with early stopping: convergence and consistency. Ann Stat 33(4):1538–1579


Acknowledgments

Work on this paper was supported in part by a grant from the National Science Foundation: SES-0437169, “Ensemble Methods for Data Analysis in the Behavioral, Social and Economic Sciences.” That support is gratefully acknowledged.


Copyright information

© 2010 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Berk, R. (2010). An Introduction to Statistical Learning from a Regression Perspective. In: Piquero, A., Weisburd, D. (eds) Handbook of Quantitative Criminology. Springer, New York, NY. https://doi.org/10.1007/978-0-387-77650-7_34
