Abstract
Conventional statistical inference requires that a model of how the data were generated be known before the data are analyzed. Yet in criminology, and in the social sciences more broadly, a variety of model selection procedures are routinely undertaken, followed by statistical tests and confidence intervals computed for a “final” model. In this paper, we examine such practices and show how they are typically misguided. The parameters being estimated are no longer well defined, and post-model-selection sampling distributions are mixtures with properties that are very different from what is conventionally assumed. Confidence intervals and statistical tests do not perform as they should. We examine in some detail the specific mechanisms responsible. We also offer some suggestions for better practice and show through a criminal justice example using real data how proper statistical inference in principle may be obtained.
Notes
“Thus sample data, x, are assumed to arise from observing a random variable X defined on a sample space, \(\Re\). The random variable X has a probability distribution \(p_{\theta}(x)\) which is assumed known except for the value of the parameter Θ. The parameter Θ is some member of a specified parameter space Ω; x (and the random variable X) and Θ may have one or many components” (Barnett 1983: 121). Emphasis in the original. Some minor changes in notation have been made.
There are also the well-known difficulties that can follow from undertaking a large number of statistical tests. With every 20 statistical tests a researcher undertakes, for example, one null hypothesis will on the average be rejected at the .05 level even if all the null hypotheses are true. Remedies for the multiplicity problem are currently a lively research area in statistics (e.g., Benjamini and Yekutieli 2001; Efron 2007).
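The arithmetic behind the 20-test example can be checked with a short sketch. The function names are hypothetical, introduced only for illustration. The expected count holds whether or not the tests are independent; the probability of at least one false rejection additionally assumes independent tests, an assumption made here and not stated in the note.

```python
def expected_false_rejections(n_tests: int, alpha: float) -> float:
    """Expected number of true null hypotheses rejected across n_tests tests at level alpha."""
    return n_tests * alpha


def prob_at_least_one_rejection(n_tests: int, alpha: float) -> float:
    """Probability of at least one false rejection, ASSUMING the tests are independent."""
    return 1.0 - (1.0 - alpha) ** n_tests


if __name__ == "__main__":
    # Twenty tests at the .05 level with every null hypothesis true.
    print(expected_false_rejections(20, 0.05))    # one rejection on average
    print(prob_at_least_one_rejection(20, 0.05))  # roughly 0.64 under independence
```

Under these assumptions, a researcher running 20 such tests has well over an even chance of at least one spurious rejection.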
The context in which Brown is working is rather different from the present context. But the same basic principles apply.
There is an alternative formulation that mathematically amounts to the same thing. In the real world, nature generates the data through a stochastic process characterized by a regression model. Inferences are made to the parameters of this model, not to the parameters of a regression in a well-defined population. Data on hand are not a random sample from that population, but are a random realization of a stochastic process, and there can be a limitless number of independent realizations. These two formulations and others are discussed in far more detail elsewhere. The material to follow is effectively the same under either account, but the random sampling approach is less abstract and easier to build upon.
Setting some regression coefficients to 0 is perhaps the most common kind of restriction imposed on regression coefficients, but others can also lead to interesting models (e.g., \(\beta_{1} = \beta_{2}\), which would mean that \(y_{i} = \beta_{0} + \beta_{1}[x_{i} + z_{i}] + \varepsilon_{i}\)). Also, in the interest of simplicity, we are being a little sloppy with notation. We should be using different symbols for the regression coefficients in the two equations because they are in different models and are, therefore, defined differently. But the added notational complexity is probably not worth it.
The problems that follow can materialize regardless of the estimation procedure applied: least squares, maximum likelihood, generalized method of moments, and so on. Likewise, the problems can result from any of the usual model selection procedures.
There are two constants because the constants are model dependent. For example, they can depend on the number of regressors in the model and which ones they are. There are two values for the mean squared error, because the fit of the two models will likely differ. To make this more concrete, \(\hat{\beta}_{1} /\widehat{\hbox{SE}}_{1} \geq C_{1}\) may be nothing more than a conventional t-test. Written this way, the random variation is isolated on the left-hand side and the fixed quantity is isolated on the right-hand side. It is how the random variables behave that is the focus of this paper.
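A selection rule of the form \(\hat{\beta}_{1} /\widehat{\hbox{SE}}_{1} \geq C_{1}\) can be sketched in code. The example below is a simplification, not the procedure used in the paper: it assumes a single regressor fit by least squares, the function names are hypothetical, and the threshold `c = 2.0` is an illustrative stand-in for the model-dependent constant. The rule is one-sided, as written in the note.

```python
import math


def slope_t_ratio(x, y):
    """Least squares slope, its estimated standard error, and their ratio
    for a regression of y on a single regressor x (with an intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    mse = rss / (n - 2)          # mean squared error of the fitted model
    se = math.sqrt(mse / sxx)    # estimated standard error of the slope
    return b1, se, b1 / se       # undefined (division by zero) for a perfect fit


def keep_regressor(x, y, c=2.0):
    """Random quantity on the left, fixed constant on the right."""
    _, _, t_ratio = slope_t_ratio(x, y)
    return t_ratio >= c


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5]
    print(keep_regressor(x, [2.1, 3.9, 6.2, 7.8, 10.1]))  # strong slope
    print(keep_regressor(x, [1, -1, 1, -1, 1]))           # no slope
```

Because the ratio on the left depends on the sample, whether the regressor is retained is itself a random outcome, which is the point of the note.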
Simulations much like those in section “Simulations of model-selection” can be used to produce virtually identical results starting with appropriate raw data and Eqs. 2 and 3.
The selection mechanism would then not have been ancillary.
The correlations in the numerator are replaced by partial correlations controlling for all other predictors, and the correlation in the denominator is replaced by the multiple correlation of the predictor in question with all other predictors.
Because in regression x and z are usually treated as fixed, \(s_{x}\) and \(r_{xz}^{2}\) are not random variables and not treated as estimates.
When there are more than two regressors, the only change in Eq. 6 is that \(r_{{xz}}^{2}\) is replaced by the square of the multiple correlation coefficient between the given regressor and all other regressors.
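The role of the squared correlation among regressors can be illustrated with the standard variance inflation factor, \(1/(1 - R^{2})\). Eq. 6 is not reproduced in these notes, so the sketch below is not a transcription of it; it only shows how the squared (multiple) correlation between one regressor and the others inflates the sampling variance of that regressor's coefficient. The function names are hypothetical.

```python
import math


def variance_inflation(r_squared: float) -> float:
    """Factor by which the coefficient's sampling variance grows, given the
    squared (multiple) correlation of its regressor with the other regressors."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("r_squared must lie in [0, 1)")
    return 1.0 / (1.0 - r_squared)


def se_inflation(r_squared: float) -> float:
    """Corresponding factor for the standard error (square root of the above)."""
    return math.sqrt(variance_inflation(r_squared))


if __name__ == "__main__":
    for r2 in (0.0, 0.5, 0.75, 0.9):
        print(r2, variance_inflation(r2), se_inflation(r2))
```

With two regressors the input is \(r_{xz}^{2}\); with more, it is the squared multiple correlation described in the note.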
If there were, there would be no need to do the research.
This would include all necessary interaction effects.
All of the usual caveats would still apply. For example, if the model specified does not properly represent how the data were generated, the regression estimates will be biased, and statistical tests will not have their assumed properties.
The logarithm of zero is undefined. So, a value of .5 (i.e., about two weeks) was used instead. Other reasonable strategies led to results that for purposes of this paper were effectively the same. A suspended sentence of zero months can occur, for instance, if a sentencing judge gives sufficient credit for time served awaiting trial.
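The substitution described in this note can be sketched as follows. The function name and the choice of the natural logarithm are assumptions made for illustration; the note does not state which base was used.

```python
import math


def log_sentence_months(months: float, zero_substitute: float = 0.5) -> float:
    """Log of sentence length in months, with a sentence of zero months
    replaced by zero_substitute (.5, about two weeks) before the log is taken."""
    if months < 0:
        raise ValueError("sentence length cannot be negative")
    return math.log(months if months > 0 else zero_substitute)


if __name__ == "__main__":
    for months in (0, 0.5, 6, 12):
        print(months, log_sentence_months(months))
```

Changing `zero_substitute` is a simple way to check, as the note reports, whether other reasonable choices lead to effectively the same results.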
The model selection was done with the procedure regsubsets in R.
A scattering of a few other minor crimes is in the baseline.
References
Barnett V (1983) Comparative statistical inference, 2nd edn. Wiley, New York
Benjamini Y, Yekutieli D (2001) The control of the false discovery rate in multiple testing under dependency. Ann Stat 29(4):1165–1188
Berk RA (2003) Regression analysis: a constructive critique. Sage Publications, Newbury Park
Blumstein A, Cohen J, Martin SE, Tonry MH (eds) (1983) Research on sentencing: the search for reform, vols 1 and 2. National Academy Press, Washington, DC
Box GEP (1976) Science and statistics. J Am Stat Assoc 71:791–799
Breiman L (2001) Statistical modeling: two cultures (with discussion). Stat Sci 16:199–231
Brown LD (1967) The conditional level of Student’s t test. Ann Math Stat 38(4):1068–1071
Brown LD (1990) An ancillarity paradox which appears in multiple linear regression. Ann Stat 18(2):471–493
Candes E, Tao T (2007) The Dantzig selector: statistical estimation when p is much larger than n. Ann Stat 35(6):2313–2331
Cook DR, Weisberg S (1999) Applied regression including computing and graphics. Wiley, New York
Davies G, Dedel K (2006) Violence screening in community corrections. Criminol Public Policy 5(4):743–770
Efron B, Hastie T, Tibshirani R (2007) Discussion: the Dantzig selector: statistical estimation with p much larger than n. Ann Stat 35(6):2358–2364
Efron B (2007) Correlation and large-scale simultaneous significance testing. J Am Stat Assoc 102(477):93–103
Freedman DA (1987) As others see us: a case study in path analysis (with discussion). J Educ Stat 12:101–223
Freedman DA (2004) Graphical models for causation and the identification problem. Eval Rev 28:267–293
Freedman DA (2005) Statistical models: theory and practice. Cambridge University Press, Cambridge
Freedman DA, Navidi W, Peters SC (1988) On the impact of variable selection in fitting regression equations. In: Dijkstra TK (ed) On model uncertainty and its statistical implications. Springer, Berlin, pp 1–16
Greene WH (2003) Econometric methods, 5th edn. Prentice Hall, New York
Johnson BD (2006) The multilevel context of criminal sentencing: integrating judge- and county-level influences. Criminology 44(2):235–258
Lalonde RJ, Cho RM (2008) The impact of incarceration in state prison on the employment prospects of women. J Quant Criminol 24:243–265
Leeb H, Pötscher BM (2005) Model selection and inference: facts and fiction. Econ Theory 21:21–59
Leeb H, Pötscher BM (2006) Can one estimate the conditional distribution of post-model-selection estimators? Ann Stat 34(5):2554–2591
Leeb H, Pötscher BM (2008) Model selection. In: Andersen TG, Davis RA, Kreiss J-P, Mikosch T (eds) The handbook of financial time series. Springer, New York, pp 785–821
Leamer EE (1978) Specification searches: ad hoc inference with non-experimental data. Wiley, New York
Manski CF (1990) Nonparametric bounds on treatment effects. Am Econ Rev Pap Proc 80:319–323
McCullagh P, Nelder JA (1989) Generalized linear models, 2nd edn. Chapman & Hall, New York
Morgan SL, Winship C (2007) Counterfactuals and causal inference: methods and principles for social research. Cambridge University Press, Cambridge
Morris N, Tonry M (1990) Prison and probation: intermediate punishment in a rational sentencing system. Oxford University Press, New York
Olshen RA (1973) The conditional level of the F-test. J Am Stat Assoc 68(343):692–698
Ousey GC, Wilcox P, Brummel S (2008) Déjà vu all over again: investigating temporal continuity of adolescent victimization. J Quant Criminol 24:307–335
Petersilia J (1997) Probation in the United States. Crime Justice 22:149–200
Rubin DB (1986) Which ifs have causal answers. J Am Stat Assoc 81:961–962
Sampson RJ, Raudenbush SW (2004) Seeing disorder: neighborhood stigma and the social construction of broken windows. Soc Psychol Q 67(4):319–342
Schroeder RD, Giordano PC, Cernkovich SA (2007) Drug use and desistance processes. Criminology 45(1):191–222
Wooldredge J, Griffin T, Rauschenberg F (2005) (Un)anticipated effects of sentencing reform on disparate treatment of defendants. Law Soc Rev 39(4):835–874
Acknowledgments
Richard Berk’s work on this paper was funded by a grant from the National Science Foundation: SES-0437169, “Ensemble methods for Data Analysis in the Behavioral, Social and Economic Sciences.” The work by Lawrence Brown and Linda Zhao was supported in part by NSF grant DMS-07-07033. Thanks also go to Andreas Buja, Sam Preston, Jasjeet Sekhon, Herb Smith, Phillip Stark, and three reviewers for helpful suggestions about the material discussed in this paper.
Cite this article
Berk, R., Brown, L. & Zhao, L. Statistical Inference After Model Selection. J Quant Criminol 26, 217–236 (2010). https://doi.org/10.1007/s10940-009-9077-7
DOI: https://doi.org/10.1007/s10940-009-9077-7