An Introduction to Statistical Learning from a Regression Perspective

Abstract

Statistical learning is a loose collection of procedures in which key features of the final results are determined inductively. There are clear historical links to exploratory data analysis. There are also clear links to techniques in statistics such as principal components analysis, clustering, and smoothing, and to long-standing concerns in computer science, such as pattern recognition and edge detection. But statistical learning would not exist were it not for recent developments in raw computing power, computer algorithms, and theory from statistics, computer science, and applied mathematics. It can be very computationally intensive. Extensive discussions of statistical learning can be found in Hastie et al. (2009) and Bishop (2006). Statistical learning is also sometimes called machine learning or reinforcement learning, especially when discussed in the context of computer science.


Notes

  1.

    Statisticians commonly define regression analysis so that the aim is to understand “as far as possible with the available data how the conditional distribution of some response variable varies across subpopulations determined by the possible values of the predictor or predictors” (Cook and Weisberg 1999: 27). Interest centers on the distribution of the response variable conditioning on one or more predictors. Often the conditional mean is the key parameter.
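
    As a small illustration of this definition, the following R sketch (with simulated data and purely illustrative variable names) examines how the conditional distribution of a response varies across subpopulations defined by a single categorical predictor; the conditional means are one convenient summary.

    ## Regression as the study of conditional distributions: a minimal sketch.
    ## The data and variable names are made up for illustration only.
    set.seed(1)
    x <- sample(c("low", "medium", "high"), 500, replace = TRUE)
    y <- rnorm(500, mean = c(low = 1, medium = 2, high = 4)[x], sd = 1)

    tapply(y, x, mean)      # conditional means within each subpopulation
    tapply(y, x, quantile)  # a fuller view of each conditional distribution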

  2.

    Under these circumstances, statistical learning is in the tradition of principal components analysis and clustering.

  3.

    This may seem much like standard numerical methods, such as when the Newton–Raphson algorithm is applied to logistic regression. We will see that it is not. In logistic regression, for example, the form of the relationships between the predictors and the response is determined before the algorithm is launched. In statistical learning, one important job of the algorithm is to determine those relationships.
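
    The contrast can be made concrete with a short, hedged R sketch. The data frame dat, binary response fail, and predictors age and priors are hypothetical; glm() and rpart() are standard R tools for the two approaches.

    library(rpart)

    ## Logistic regression: the linear-in-the-parameters form of the
    ## predictor-response relationship is fixed before the iterative
    ## (Newton-Raphson/IWLS) fitting begins.
    logit_fit <- glm(fail ~ age + priors, family = binomial, data = dat)

    ## A classification tree: the algorithm searches for the splits itself,
    ## so the form of the relationship is determined inductively.
    tree_fit <- rpart(fail ~ age + priors, data = dat, method = "class")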

  4.

    To provide exact instructions would require formal mathematical notation or the actual computer code. That level of precision is probably not necessary for this chapter and can be found elsewhere (e.g., in the source code for the procedure randomForest in R). There are necessarily a few ambiguities, therefore, in the summary of this algorithm and the two to follow. Also, only the basics are discussed. There are extensions and variants to address special problems.
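
    For readers who want to experiment, a minimal call to the randomForest procedure mentioned above might look as follows. The data frame dat and its variables are hypothetical, and only a few of the many arguments are shown.

    library(randomForest)

    rf_fit <- randomForest(fail ~ age + priors + gender,
                           data = dat,
                           ntree = 500,        # number of trees in the ensemble
                           importance = TRUE)  # retain predictor importance measures

    print(rf_fit)       # out-of-bag performance summary
    importance(rf_fit)  # variable importance measures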

  5.

    Shrinkage estimators have a long history, starting with empirical Bayes methods and ridge regression. The basic idea is to force a collection of estimated values (e.g., a set of conditional means) toward a common value. For regression applications, the estimated regression coefficients are “shrunk” toward zero. A small amount of bias is introduced into the estimates in exchange for a substantial reduction in their variance, and the result can be an overall reduction in mean squared error. Shrinkage can also be used for model selection when some regression coefficients are shrunk to zero and others are not. Shrinkage is discussed at some length in Hastie et al. (2009: 61–69) and Berk (2008: 61–69, 167–174), and is closely related to “regularization” methods.
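
    In R, one common way to apply shrinkage in regression is the glmnet package; a minimal sketch follows, assuming a hypothetical numeric predictor matrix X and response vector y.

    library(glmnet)

    ridge_fit <- glmnet(X, y, alpha = 0)  # ridge: coefficients shrunk toward zero
    lasso_fit <- glmnet(X, y, alpha = 1)  # lasso: some coefficients shrunk exactly to zero

    ## Cross-validation to choose the amount of shrinkage (the penalty lambda):
    cv_lasso <- cv.glmnet(X, y, alpha = 1)
    coef(cv_lasso, s = "lambda.min")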

  6.

    Recall that when predictors in a regression analysis are correlated, the covariate adjustments (“partialling”) can cause the predictors most strongly related to the response to dominate the fitting process, while predictors only weakly related to the response play almost no role. This can be a particular problem for nonlinear response functions, because predictors that are weakly related to the response overall may be critical for characterizing a small but essential part of the nonlinear function. By sampling predictors, random forests allows the set of candidate predictors to vary across splits and across trees, so that some of the time weak predictors have to “compete” only with other weak predictors.
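
    In the randomForest implementation, this predictor sampling is controlled by the mtry argument, the number of predictors drawn at random as candidates for each split. A brief sketch, again with a hypothetical data frame:

    library(randomForest)

    ## Smaller mtry gives weakly related predictors more chances to define a
    ## split; the default for classification is roughly the square root of the
    ## number of predictors. (dat must contain at least 6 predictors for the
    ## second call.)
    rf_small_mtry <- randomForest(fail ~ ., data = dat, mtry = 2)
    rf_large_mtry <- randomForest(fail ~ ., data = dat, mtry = 6)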

  7.

    Overfitting occurs when a statistical procedure responds to idiosyncratic features of the data. As a result, the patterns found do not generalize well; with a new data set, the story changes. In regression, overfitting becomes more serious as the number of parameters being estimated increases for a fixed sample size. The most widely appreciated example is stepwise regression, in which a new set of regression coefficients is estimated at each step. Overfitting can also be a problem in conventional linear regression as the number of regression coefficient estimates approaches the sample size, and it can occur because of data snooping, where the researcher, rather than some algorithm, is the guilty party. The best way to take overfitting into account is to do the analysis on training data and evaluate the results using test data; ideally, the training data and test data are random samples from the same population. One might well imagine that overfitting would be a problem for random forests. Breiman (2001a) proves that it is not.
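
    The training/test strategy is easy to implement; a minimal sketch follows, with a hypothetical data frame dat and binary response fail.

    library(randomForest)

    set.seed(123)
    train_rows <- sample(nrow(dat), size = floor(0.7 * nrow(dat)))
    train <- dat[train_rows, ]
    test  <- dat[-train_rows, ]

    rf_fit <- randomForest(fail ~ ., data = train)

    ## Evaluate on observations the algorithm never saw:
    preds <- predict(rf_fit, newdata = test)
    table(observed = test$fail, predicted = preds)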

  8.

    Table 34.1 raises several other important issues, but they are beyond the scope of this review (see Berk et al. 2009a).

  9.

    Although one can construct partial dependence plots for categorical predictors, all one can see is a bar chart. For each category, the height of the bar is the average response value. The order of the bars along the horizontal axis and their distance from one another are necessarily arbitrary.
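
    The randomForest package will produce such a bar chart through its partialPlot() function; a sketch follows, with a hypothetical data frame dat and factor predictor neighborhood.

    library(randomForest)

    rf_fit <- randomForest(fail ~ ., data = dat)

    ## For a factor predictor, one bar per category is drawn; bar heights are
    ## averaged predictions, and the ordering of the categories is arbitrary.
    partialPlot(rf_fit, pred.data = dat, x.var = "neighborhood")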

  10.

    A coding of 1 and 0 can work too. But the 1 or −1 coding leads to the convenient result that the sign of the fitted value determines class membership.
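
    A toy illustration (not the boosting algorithm itself) of why the 1/−1 coding is convenient: with any numeric fitting procedure, the sign of the fitted value can be read directly as the assigned class. The data below are simulated.

    set.seed(2)
    x <- rnorm(200)
    y <- ifelse(x + rnorm(200) > 0, 1, -1)  # classes coded 1 and -1

    fit <- lm(y ~ x)                        # any numeric fit will do here
    class_assignment <- sign(fitted(fit))   # +1 or -1 is the assigned class
    table(y, class_assignment)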

  11.

    Recall that the loss function of least squares regression, for example, is the sum of the squared residuals.
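
    For a conventional regression fit in R, that loss can be computed directly; the simulated data below are illustrative only.

    set.seed(3)
    x <- rnorm(100)
    y <- 2 * x + rnorm(100)
    fit <- lm(y ~ x)
    sum(residuals(fit)^2)  # the sum of squared residuals that lm() minimizes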

  12.

    Stochastic gradient boosting samples the training data in the same spirit as random forests.
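
    In the gbm package, this sampling is governed by the bag.fraction argument; a hedged sketch with a hypothetical data frame dat and a binary response fail coded 0/1:

    library(gbm)

    boost_fit <- gbm(fail ~ age + priors + gender,
                     data = dat,
                     distribution = "bernoulli",  # binary response coded 0/1
                     n.trees = 2000,
                     interaction.depth = 3,
                     shrinkage = 0.01,
                     bag.fraction = 0.5)          # each tree fit to a random half of the training data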

  13.

    In quantile regression, the fitted values are conditional quantiles, not conditional means. For example, the fitted values for the 50th quantile are conditional medians (Koenker 2005). If, say, the 75th quantile is used, underestimates are treated as three times more costly (0.75/0.25) than overestimates.
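
    A minimal sketch using the quantreg package that accompanies Koenker (2005); the data frame dat, response cost, and predictor caseload are hypothetical.

    library(quantreg)

    ## tau = 0.50 fits the conditional median; tau = 0.75 fits the conditional
    ## 75th percentile, so that underestimates are penalized three times as
    ## heavily as overestimates.
    median_fit <- rq(cost ~ caseload, tau = 0.50, data = dat)
    q75_fit    <- rq(cost ~ caseload, tau = 0.75, data = dat)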

References

  • Berk RA (2003) Regression analysis: a constructive critique. Sage Publications, Newbury Park, CA
  • Berk RA (2008) Statistical learning from a regression perspective. Springer, New York
  • Berk RA, Sherman L, Barnes G, Kurtz E, Lindsay A (2009a) Forecasting murder within a population of probationers and parolees: a high stakes application of statistical forecasting. J R Stat Soc Ser A 172(part 1):191–211
  • Berk RA, Brown L, Zhao L (2009b) Statistical inference after model selection. Working paper, Department of Statistics, University of Pennsylvania
  • Bishop CM (2006) Pattern recognition and machine learning. Springer, New York
  • Breiman L (1996) Bagging predictors. Mach Learn 24:123–140
  • Breiman L (2001a) Random forests. Mach Learn 45:5–32
  • Breiman L (2001b) Statistical modeling: two cultures (with discussion). Stat Sci 16:199–231
  • Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Wadsworth Press, Monterey, CA
  • Buja A, Stuetzle W, Shen Y (2005) Loss functions for binary class probability estimation and classification: structure and applications. Unpublished manuscript, Department of Statistics, The Wharton School, University of Pennsylvania
  • Bühlmann P, Yu B (2006) Sparse boosting. J Mach Learn Res 7:1001–1024
  • Cook RD, Weisberg S (1999) Applied regression including computing and graphics. Wiley, New York
  • Efron B, Tibshirani R (1993) An introduction to the bootstrap. Chapman & Hall, New York
  • Freedman DA (2004) Graphical models for causation and the identification problem. Eval Rev 28:267–293
  • Freedman DA (2005) Statistical models: theory and practice. Cambridge University Press, Cambridge
  • Freund Y, Schapire R (1996) Experiments with a new boosting algorithm. In: Machine learning: proceedings of the 13th international conference. Morgan Kaufmann, San Francisco, pp 148–156
  • Friedman JH (2001) Greedy function approximation: a gradient boosting machine. Ann Stat 29:1189–1232
  • Friedman JH (2002) Stochastic gradient boosting. Comput Stat Data Anal 38:367–378
  • Friedman JH, Hastie T, Tibshirani R (2000) Additive logistic regression: a statistical view of boosting (with discussion). Ann Stat 28:337–407
  • Friedman JH, Hastie T, Rosset S, Tibshirani R, Zhu J (2004) Discussion of boosting papers. Ann Stat 32:102–107
  • Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning, 2nd edn. Springer, New York
  • Holland P (1986) Statistics and causal inference. J Am Stat Assoc 81:945–960
  • Koenker R (2005) Quantile regression. Cambridge University Press, Cambridge
  • Kriegler B, Berk RA (2009) Estimating the homeless population in Los Angeles: an application of cost-sensitive stochastic gradient boosting. Working paper, Department of Statistics, UCLA
  • Leeb H, Pötscher BM (2006) Can one estimate the conditional distribution of post-model-selection estimators? Ann Stat 34(5):2554–2591
  • Lin Y, Jeon Y (2006) Random forests and adaptive nearest neighbors. J Am Stat Assoc 101:578–590
  • Mannor S, Meir R, Zhang T (2002) The consistency of greedy algorithms for classification. In: Kivinen J, Sloan RH (eds) COLT 2002. LNAI, vol 2375, pp 319–333
  • McCullagh P, Nelder JA (1989) Generalized linear models, 2nd edn. Chapman & Hall, New York
  • Mease D, Wyner AJ (2008) Evidence contrary to the statistical view of boosting. J Mach Learn Res 9:1–26
  • Mease D, Wyner AJ, Buja A (2007) Boosted classification trees and class probability/quantile estimation. J Mach Learn Res 8:409–439
  • Morgan SL, Winship C (2007) Counterfactuals and causal inference: methods and principles for social research. Cambridge University Press, Cambridge
  • Schapire RE (1999) A brief introduction to boosting. In: Proceedings of the 16th international joint conference on artificial intelligence
  • Schapire RE (2002) The boosting approach to machine learning: an overview. In: MSRI workshop on nonlinear estimation and classification
  • Tibshirani RJ (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B 58:267–288
  • Traskin M (2008) The role of bootstrap sample size in the consistency of the random forest algorithm. Technical report, Department of Statistics, University of Pennsylvania
  • Vapnik V (1996) The nature of statistical learning theory. Springer, New York
  • Wyner AJ (2003) Boosting and exponential loss. In: Bishop CM, Frey BJ (eds) Proceedings of the 9th annual conference on AI and statistics, Jan 3–6, Key West, FL
  • Zhang T, Yu B (2005) Boosting with early stopping: convergence and consistency. Ann Stat 33(4):1538–1579


Acknowledgments

Work on this paper was supported in part by a grant from the National Science Foundation: SES-0437169, “Ensemble Methods for Data Analysis in the Behavioral, Social and Economic Sciences.” That support is gratefully acknowledged.


Copyright information

© 2010 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Berk, R. (2010). An Introduction to Statistical Learning from a Regression Perspective. In: Piquero, A., Weisburd, D. (eds) Handbook of Quantitative Criminology. Springer, New York, NY. https://doi.org/10.1007/978-0-387-77650-7_34
