
Working with Misspecified Regression Models

Original Paper · Journal of Quantitative Criminology

Abstract

Objectives

Conventional statistical modeling in criminology assumes proper model specification. Very strong and unrebutted criticisms of this assumption have existed for decades. Some respond that although the criticisms are correct, for observational data there is no alternative. In this paper, we provide an alternative.

Methods

We draw on work in econometrics and statistics from several decades ago, updated with the most recent thinking, to provide a way to work properly with misspecified models.

Results

We show how asymptotically unbiased regression estimates can be obtained, along with valid standard errors. Conventional statistical inference can then follow.

Conclusions

If one is prepared to work with explicit approximations of a “true” model, defensible analyses can be obtained. The alternative is working with models about which all of the usual criticisms hold.


Notes

  1. This section draws heavily on Berk’s textbook on statistical learning (2016: Section 1.3). There is not a lot of flexibility in how one characterizes linear regression.

  2. In this context, the predictors are treated as fixed variables if, in new realizations of the data, their values do not change. This is the approach in conventional regression. It simplifies the mathematics, but at a substantial interpretative price: the regression results can only be generalized to new observations produced by nature in the same fashion with exactly the same x-values. In contrast, predictors are treated as random variables if, in new realizations of the data, their values change in an unsystematic manner (e.g., through the equivalent of random sampling). This complicates the mathematics, but one gains the ability to generalize the regression results to new observations produced by nature in the same fashion but with different x-values. To take a cartoon illustration, if a predictor is age, and the values in the dataset are ages 24, 25, 30, 31, 32, and 35, these are the only ages to which generalizations are permitted even if the true relationship is really linear. Should one want to apply results to, say, a 26-year-old, one has to alter the mathematics to allow for realizations of ages that were not in the data. In other words, one has to allow for the x-values to have been different. This introduces a new source of uncertainty not addressed in the usual, fixed-x regression formulation. If one’s regression model is correctly specified, the impact of the additional uncertainty can in practice be small. But as we shall see, it matters a great deal if one wants to allow properly for model misspecification (Freedman 1981).

  3. Some readers may be incredulous because they know that these requirements will almost certainly not be met in practice. They already understand the need to work with misspecified models.

  4. If, on substantive grounds, one allows nature to set more than one value for any given predictor and defendant, a temporal process may be implied. There is then systematic temporal variation to build into the regression equation. This can be done, but the formulation is more complicated, requires that nature be still more cooperative, and, for the points to be made here, adds unnecessary complexity.

  5. The expectation is essentially the mean of Eq. 1 over a limitless number of independent realizations of the data, conditional on the x-values in the dataset. In the expectation, the regression coefficients take their mean values, and the disturbance term takes the value 0.0. The left-hand side is then the mean of Y for each configuration of predictor values in the original dataset.
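
In symbols, a sketch, assuming Eq. 1 is the conventional linear model with disturbance term \(\varepsilon\) (the equation itself appears in the main text):

$$
\mathbb{E}(Y \mid x_1,\ldots,x_p) \;=\; \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p,
\qquad \mathbb{E}(\varepsilon \mid x_1,\ldots,x_p) \;=\; 0 .
$$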

  6. The term “asymptotics” in this context refers to the performance of regression estimates (e.g., the regression coefficients) when the number of observations increases without limit. Often this mathematical exercise shows that estimation biases decrease with larger sample sizes, and disappear with a limitless number of observations. Good asymptotic performance can be a fallback position for statistical procedures whose estimates are otherwise biased. Then, if the number of observations is far larger than the number of predictors, estimation biases are likely to be small.
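
As a sketch, for an estimator \(\hat{\beta}_n\) computed from \(n\) observations, asymptotic unbiasedness means

$$
\lim_{n \to \infty} \mathbb{E}\big[\hat{\beta}_n\big] \;=\; \beta ,
$$

so any estimation bias shrinks toward zero as the number of observations grows.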

  7. Other work by White (1980a) and others, to be addressed shortly, allows for asymptotically valid tests when the mean function is misspecified. But that work does not apply to the conventional linear regression model. By “valid” one means that the probabilities computed for statistical tests and confidence intervals have the properties they are supposed to have. For example, the 95% confidence interval really does cover the value of the population parameter in 95% of possible realized datasets.

  8. For example, if the null hypothesis for a given regression coefficient is 0.0, there will almost always be many reasonable null values close to 0.0 that would also not be rejected. And even a coefficient value close to 0.0 can meaningfully change the model specification and the estimated values of the other regression coefficients. A predictor with a small regression coefficient may be strongly correlated with other predictors so that their estimated regression coefficients will vary substantially depending on whether that variable is included in the regression.

  9. The regression subset selection problem in some disciplines is called model selection, variable selection, feature selection, or dimension reduction.

  10. These problems and more carry over to formal meta-analyses (Berk 2007). For example, the set of studies being summarized is not a probability sample of anything, and the studies are not realized in an independent fashion. Indeed, one of the key features of the scientific enterprise is that later studies build on earlier studies. As a result, all statistical tests and confidence intervals are likely to be bogus. The one exception is when all of the studies are randomized experiments, but then the inferential formulation is somewhat different. Within that framework, one can have valid statistical inference.

  11. There is nothing special about the cubic function except its relative simplicity. We could have used here virtually any nonlinear function, but the price would have been a more difficult exposition.

  12. This is one way to think about the underlying mathematical formulation. One imagines that with access to all of the data in the finite population, a least squares linear regression is applied.
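
In the random-X notation of Buja et al. (2016), a sketch of that target: the population least squares coefficients solve

$$
\beta^{*} \;=\; \operatorname*{arg\,min}_{b}\; \mathbb{E}\big[(Y - \mathbf{X}^{\top} b)^{2}\big]
\;=\; \mathbb{E}\big[\mathbf{X}\mathbf{X}^{\top}\big]^{-1}\,\mathbb{E}\big[\mathbf{X}\,Y\big],
$$

which is the best linear approximation of the true response surface whether or not that surface is linear.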

  13. The framework to follow applies to any parametric approximation of the true response surface, not just a linear approximation. But working with a linear function makes the exposition much easier.

  14. For the approximation, the intercept is the estimated mean of Y when the value of X is 0.0, and the slope is the change in the mean of Y for a one-unit change in X. When there is more than one predictor, the usual descriptive interpretations carry over as well.

  15. There is a technical subtlety here. Each vector observation corresponds to a different individual, within a population of individuals having a single, unknown probability distribution. We assume that the observed individual vectors are independent of each other. This is needed for the validity of our approach and is the only stringent assumption behind our assumption-lean formulation. (The coordinate values within a given individual vector are in general correlated with each other.) This independence assumption can be replaced by exchangeability, as would occur in sampling without replacement from a finite population. Then, a standard, finite-sample adjustment to the sandwich formula is needed. An early version of this type of assumption-lean reasoning in the finite sampling context can be found in Cochran (1977, Chapter 7). Further generalization of this assumption seems possible, but is beyond the scope of the treatment in the present manuscript.
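
For reference, the sandwich covariance estimator referred to here, in White’s (1980b) heteroskedasticity-consistent form, is

$$
\widehat{V}_{\text{sand}} \;=\; (X^{\top}X)^{-1}
\Big(\sum_{i=1}^{n} \hat{e}_i^{\,2}\, x_i x_i^{\top}\Big)
(X^{\top}X)^{-1},
$$

where the \(\hat{e}_i\) are the least squares residuals; finite-sample adjustments of the kind mentioned above amount to rescaling the middle (“meat”) term.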

  16. Skewness is not essential. All one requires is that potential distributions of X have different expected values.

  17. The reliance on asymptotics is widespread in statistical and econometric applications. For example, even if the mean function for a logistic regression is correct, estimates of the regression coefficients are only unbiased asymptotically.

  18. For example, the sandwich standard errors can be computed with a procedure in the car library. Then t values follow as usual, and the associated probabilities can be computed with a procedure for the t-distribution.
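
A minimal sketch in R, assuming a hypothetical data frame dat with response y and predictors x1 and x2; hccm() is the relevant car procedure:

```r
library(car)                         # provides hccm() for sandwich covariances

fit <- lm(y ~ x1 + x2, data = dat)   # dat, y, x1, x2 are hypothetical names
V   <- hccm(fit, type = "hc3")       # heteroscedasticity-consistent covariance
se  <- sqrt(diag(V))                 # sandwich standard errors

# t values follow as usual, with probabilities from the t distribution.
t_vals <- coef(fit) / se
p_vals <- 2 * pt(abs(t_vals), df = fit$df.residual, lower.tail = FALSE)
```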

  19. The same difficulties arise if regression is replaced by matching.

  20. For groups, an ATE formally is the difference between their two response variable means. Whether that difference can be interpreted as a causal effect depends on the research design and, in particular, whether there is an intervention subject to manipulation. This requirement is met in randomized experiments and strong quasi-experiments. It can be very problematic in observational studies.
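
In symbols, for treatment and comparison groups with response means \(\bar{Y}_T\) and \(\bar{Y}_C\), the difference in question is

$$
\widehat{\text{ATE}} \;=\; \bar{Y}_T - \bar{Y}_C ,
$$

with the causal reading contingent on the design considerations just described.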

  21. A thorough discussion of what the weights are can be found in Buja and his colleagues (2016: Section 10). Perhaps the most important conclusion is that although the weights are formally required, they further complicate how an average causal effect is interpreted.

  22. Time actually served can differ, sometimes dramatically.

  23. As described earlier, if, as in survey sampling, one has a finite population from which the data are a probability sample, statistical inference can follow automatically, with no need for the more abstract formulation.

  24. Data examples from other contexts can show much more noticeable differences between the standard t values and the proper t values derived via the sandwich. See the examples in Buja et al. (2016) and McCarthy et al. (2017). For strongly nonlinear or very heteroscedastic situations, the difference in values can be quite striking and can lead to very different conclusions from hypothesis tests and confidence intervals. See especially McCarthy et al. (2017).

  25. It is all confounders because proper model specification requires that all possible confounders be included in the model.

  26. There are well-known interpretative problems treating gender as a cause because it is not manipulable, but that is often overlooked when causal interpretations are provided for regression results. Causal interpretations for race have the same problem (Berk 2003: Chapter 5). Also, one should not confuse identification with interpretation. Identification refers to the ability to properly estimate a parameter of interest with the data on hand. To take a very simple example, an estimate of a crime rate depends on a numerator of the number of crimes and a denominator of the size of the population at risk. If there is no information on the size of the population at risk, the estimate of the crime rate is not identified. Whether an identified parameter is given a causal interpretation depends on subject matter knowledge and/or on how the data were generated (e.g., with a randomized experiment).

  27. In conventional linear regression, “holding constant” has a very specific formal meaning. Any linear dependence between a given predictor and all other predictors is removed.

  28. The formal meaning of “holding constant” is the same as for conventional linear regression. Adjustments are made for linear dependence between predictors included in the model.
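
A minimal sketch in R of what removing linear dependence amounts to (the Frisch–Waugh–Lovell logic), using the same hypothetical dat, y, x1, and x2 as before:

```r
# Full regression: the coefficient for x1 "holding x2 constant".
fit_full <- lm(y ~ x1 + x2, data = dat)

# Equivalent: strip the linear dependence of x1 on x2, then regress
# y on what remains of x1.
x1_resid <- resid(lm(x1 ~ x2, data = dat))
fit_part <- lm(y ~ x1_resid, data = dat)

coef(fit_full)["x1"]         # these two slopes are identical
coef(fit_part)["x1_resid"]
```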

  29. Language matters too. One must be careful about using verbs like “affect,” “impact,” or “influence,” which can be read as implying causality.

  30. A discussion of forecasting is beyond the scope of this paper, but what matters for forecasts is their accuracy, which can be estimated with test data—new random realizations from the same joint probability distribution. There is no need to have a model with subject-matter interpretations. A popular example is time series analysis (Box and Jenkins 1970).
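
A minimal sketch in R of estimating forecast accuracy with test data, using a random split of the hypothetical dat as a stand-in for new realizations:

```r
set.seed(1)                                   # reproducible split
train_rows <- sample(nrow(dat), floor(0.7 * nrow(dat)))
train <- dat[train_rows, ]
test  <- dat[-train_rows, ]

fit  <- lm(y ~ x1 + x2, data = train)         # fit on training data only
pred <- predict(fit, newdata = test)          # forecast the held-out cases
mse  <- mean((test$y - pred)^2)               # out-of-sample accuracy estimate
```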

  31. One might wonder why the generalized additive model was not used instead of linear regression. The generalized additive model is an inductive procedure that adapts empirically to the data through a tuning parameter. This constitutes a form of model selection that introduces significant complications for all statistical inference (Berk 2016, Chapter 2). A discussion of these issues is well beyond the scope of this paper.

  32. For example, under the generalized linear model “best” relies on minimizing the deviance.

  33. The partial correlation is not used much any more despite having an impressive pedigree (Fisher 1924). It is just the usual Pearson correlation, but between two variables from which any linear dependence with other specified variables has been removed, much as in multiple regression.
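
A sketch in R: the partial correlation of y and x1 given x2 is the Pearson correlation of the corresponding residuals (hypothetical names as before):

```r
r_y  <- resid(lm(y  ~ x2, data = dat))   # y with linear dependence on x2 removed
r_x1 <- resid(lm(x1 ~ x2, data = dat))   # x1 likewise
cor(r_y, r_x1)                           # the partial correlation
```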

  34. In a natural experiment, nature provides a good approximation of a randomized experiment or quasi-experiment.

  35. There are a number of subtle issues when using regression diagnostics with explicitly misspecified models that are beyond the scope of this paper. But generally, visual and graphical tools can be properly employed (e.g., a plot of Cook’s distance against each observation number). Formal tests are likely to be problematic.
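
For instance, a sketch in R of the Cook’s distance plot mentioned here, with the same hypothetical model:

```r
fit <- lm(y ~ x1 + x2, data = dat)       # hypothetical model, as before
plot(cooks.distance(fit), type = "h",
     xlab = "Observation number", ylab = "Cook's distance")
```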

References

  • Angrist J, Pischke S (2008) Mostly harmless econometrics: an empiricist’s companion. Princeton University Press, Princeton

  • Berk RA (2003) Regression analysis: a constructive critique. Sage, Newbury Park

  • Berk RA (2007) Meta-analysis and statistical inference (with commentary). J Exp Criminol 3(3):247–297

  • Berk RA (2016) Statistical learning from a regression perspective, 2nd edn. Springer, New York

  • Berk RA, Brown L, Zhao L (2010) Statistical inference after model selection. J Quant Criminol 26:217–236

  • Berk RA, Brown L, Buja A, Zhang K, Zhao L (2014a) Valid post-selection inference. Ann Stat 41(2):802–837

  • Berk RA, Brown L, Buja A, George E, Pitkin E, Zhang K, Zhao L (2014b) Misspecified mean function regression: making good use of regression models that are wrong. Sociol Methods Res 43:422–451

  • Box GEP (1976) Science and statistics. J Am Stat Assoc 71(356):791–799

  • Box GEP, Jenkins G (1970) Time series analysis: forecasting and control. Holden-Day, San Francisco

  • Buja A, Berk RA, Brown L, George E, Pitkin E, Traskin M, Zhao L, Zhang K (2016) Models as approximations: a conspiracy of random regressors and model violations against classical inference in regression. Working paper

  • Bushway S, Morrison Piehl A (2001) Judging judicial discretion: legal factors and racial discrimination in sentencing. Law Soc Rev 35(4):733–764

  • Cochran WG (1977) Sampling techniques, 3rd edn. Wiley, New York

  • Cook RD, Weisberg S (1999) Applied regression including computing and graphics. Wiley, New York

  • Fisher RA (1924) The distribution of the partial correlation coefficient. Metron 3:329–332

  • Freedman DA (1981) Bootstrapping regression models. Ann Stat 9(6):1218–1228

  • Freedman DA (1987) As others see us: a case study in path analysis (with discussion). J Educ Stat 12:101–223

  • Freedman DA (2004) Graphical models for causation and the identification problem. Eval Rev 28:267–293

  • Freedman DA (2009) Statistical models: theory and practice. Cambridge University Press, Cambridge

  • Goodman SN (2016) Aligning statistical and scientific reasoning. Science 352(6290):1180–1181

  • Harris CR (2012) Is the replicability crisis overblown? Three arguments examined. Perspect Psychol Sci 7(6):531–536

  • Imbens G, Lemieux T (2008) Regression discontinuity designs: a guide to practice. J Econom 142:615–635

  • Ioannidis JPA (2012) Why science is not necessarily self-correcting. Perspect Psychol Sci 7(6):645–654

  • Leamer EE (1978) Specification searches: ad hoc inference with non-experimental data. Wiley, New York

  • Leeb H, Pötscher BM (2005) Model selection and inference: facts and fiction. Econom Theory 21:21–59

  • Leeb H, Pötscher BM (2006) Can one estimate the conditional distribution of post-model-selection estimators? Ann Stat 34(5):2554–2591

  • Leeb H, Pötscher BM (2008) Model selection. In: Anderson TG, Davis RA, Kreib J-P, Mikosch T (eds) The handbook of financial time series. Springer, New York, pp 785–821

  • McCarthy D, Zhang K, Berk RA, Brown L, Buja A, George E, Zhao L (2017) Calibrated percentile double bootstrap for robust linear regression inference. Stat Sin (forthcoming)

  • Neyman J (1923) On the application of probability theory to agricultural experiments: essays on principles, Section 9. Roczniki Nauk Rolniczych Tom X [in Polish]; translated in Stat Sci 5:588–606, 1990

  • Open Science Collaboration (2015) Estimating the reproducibility of psychological science. Science 349(6251):aac4716

  • Rosenbaum PR (2002) Observational studies. Springer, New York

  • Rozeboom WW (1960) The fallacy of the null hypothesis significance test. Psychol Bull 57:416–428

  • Rubin DB (1986) Which ifs have causal answers. J Am Stat Assoc 81:961–962

  • Rubin DB (2008) For objective causal inference, design trumps analysis. Ann Appl Stat 2(3):808–840

  • Rubin DB, Imbens GW (2015) Causal inference for statistics, social, and biomedical sciences. Cambridge University Press, Cambridge

  • Starr SB (2015) Estimating gender disparities in federal criminal cases. Am Law Econ Rev 17(1):127–159

  • Steffensmeier D, Kramer J, Streifel C (1993) Gender and imprisonment decisions. Criminology 31:411–446

  • Ulmer JT, Bradley MS (2006) Variation in trial penalties among serious violent offenders. Criminology 44:631–670

  • Weisberg S (2013) Applied linear regression, 4th edn. Wiley, New York

  • White H (1980a) Using least squares to approximate unknown regression functions. Int Econ Rev 21(1):149–170

  • White H (1980b) A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica 48(4):817–838


Author information

Correspondence to Richard Berk.


Cite this article

Berk, R., Brown, L., Buja, A. et al. Working with Misspecified Regression Models. J Quant Criminol 34, 633–655 (2018). https://doi.org/10.1007/s10940-017-9348-7
