What You Can Learn from Wrong Causal Models

Abstract

It is common for social science researchers to provide estimates of causal effects from regression models imposed on observational data. The many problems with such work are well documented and widely known. The usual response is to claim, with little real evidence, that the causal model is close enough to the “truth” that sufficiently accurate causal effects can be estimated. In this chapter, a more circumspect approach is taken. We assume that the causal model is a substantial distance from the truth and then consider what can be learned nevertheless. To that end, we distinguish between how nature generated the data, a “true” model representing how this was accomplished, and a working model that is imposed on the data. The working model will typically be “wrong.” Nevertheless, unbiased or asymptotically unbiased estimates from parametric, semiparametric, and nonparametric working models can often be obtained in concert with appropriate statistical tests and confidence intervals. However, the estimates are not of the regression parameters typically assumed. Estimates of causal effects are not provided. Correlation is not causation. Nor is partial correlation, even when dressed up as regression coefficients. However, we argue that insights about causal effects do not require estimates of causal effects. We also discuss what can be learned when our alternative approach is not persuasive.

Notes

  1.

    This definition can apply to categorical response variables, manifest or latent response variables, and response variables whose conditional distributions are related to one another. So, for example, the generalized linear model is covered as well as multiple equation models.

  2.

    The properties of ε i can be formulated other ways. For example, the causes of Y can be organized into two groups: regressors with large causal effects and regressors with small causal effects. In an early treatment that is representative, Hanushek and Jackson (1977: 82) distinguish similarly between “important” predictors and others. The variables with small causal effects are taken to be far more numerous than the variables with large causal effects and to be independent of one another. Nature sets the values of the many small causal variables too, but in the aggregate, the result is disturbances that are effectively independent of the causal variables with large effects. For formal results, this account is too imprecise. Rather, it is common to assume “sparsity.” Sparsity requires that some predictors have true regression coefficients exactly equal to zero (not just small) after conditioning on all other predictors in the model. Then as a theoretical matter, a common question is whether a given model selection procedure will correctly identify which predictors have such regression coefficients (e.g., Leeb and Pötscher 2008b). The “real-world” sources of the disturbances are not addressed.
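    To make the sparsity setup concrete, here is a minimal sketch, not taken from the chapter: most true coefficients are exactly zero, and one asks whether a selection procedure, here the lasso (Tibshirani 1996) with an arbitrarily chosen penalty, recovers which ones. All data-generating values are illustrative assumptions.

      # Illustrative sparsity sketch: 3 nonzero coefficients among 20 predictors.
      import numpy as np
      from sklearn.linear_model import Lasso

      rng = np.random.default_rng(4)
      n, p = 500, 20
      X = rng.normal(size=(n, p))
      beta = np.zeros(p)
      beta[:3] = [2.0, -1.5, 1.0]              # the few "large" causal effects
      y = X @ beta + rng.normal(size=n)        # everything else lands in the disturbance

      fit = Lasso(alpha=0.1).fit(X, y)
      print("selected predictors:", np.flatnonzero(fit.coef_))  # ideally {0, 1, 2}

    Whether a procedure of this kind selects correctly with high probability is exactly the question studied by Leeb and Pötscher (2008b).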

  3.

    In practice, this summary of how nature functions would need to be fleshed out with specific subject-matter knowledge. For example, why does nature work with a linear combination of predictors, and how exactly does it do that? Still, at least a bit of mathematical license (e.g., a limitless number of independent realizations of the data) will be required so that theorems of interest can be proved.

  4.

    In mathematical statistics, data on hand that might be seen as a population are sometimes treated as a random realization from a population of all possible realizations of the data that nature could generate. Such populations are sometimes called “superpopulations.” Although this formulation allows certain mathematical operations to play through, the scientific payoff is obscure unless one has a credible theory for how the superpopulation is generated and why the data to be analyzed are a random realization from that superpopulation. But if one has such a theory, and if it is of the same form as Eq. (19.1), the approach is essentially the same as the one just described.

  5.

    Consider a simple example. Suppose that for a response variable Y the causal model has two predictors that enter additively: X and log(Z). Because this is the correct model, E(ε_i) = 0, so the disturbances and the regressors are unrelated. Now suppose the researcher does not know about Z, and it is not included in the working model. If it is still true that E(ε_i) = 0, the working-model least-squares regression coefficient for X will be unbiased. Somewhat different reasoning applies if the researcher mistakenly employs, say, Z instead of log(Z). Even if E(ε_i) = 0, the working-model least-squares coefficient for X will be affected unless both Z and log(Z) are uncorrelated with X (i.e., mean independent). Still other reasoning applies if X is measured imperfectly. Even random measurement errors with a mean of zero imply that E(ε_i) ≠ 0. For example, education may be measured in years of schooling, but years of schooling is only a proxy for what may really matter: increases in human capital. Biased estimates follow.
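    A minimal simulation sketch of the wrong-functional-form case just described (the data-generating values are illustrative assumptions, not the chapter's): nature uses log(Z), the working model enters Z untransformed, and because Z is correlated with X, the least-squares coefficient for X drifts away from its value under the correct specification.

      # Nature's model: Y = 1.0*X + 2.0*log(Z) + eps, with X and Z correlated.
      import numpy as np

      rng = np.random.default_rng(0)
      n = 100_000
      u = rng.normal(size=n)
      x = u + rng.normal(size=n)
      z = np.exp(0.5 * u + rng.normal(scale=0.5, size=n))   # Z > 0, correlated with X
      y = 1.0 * x + 2.0 * np.log(z) + rng.normal(size=n)

      def ls_coefs(columns, response):
          """Least-squares coefficients with an intercept prepended."""
          design = np.column_stack([np.ones(len(response))] + columns)
          return np.linalg.lstsq(design, response, rcond=None)[0]

      print(ls_coefs([x, np.log(z)], y)[1])   # coefficient on X, correct form: about 1.0
      print(ls_coefs([x, z], y)[1])           # coefficient on X, Z untransformed: shifted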

  6.

    There are statistical procedures, such as instrumental variables, that under ideal conditions can overcome the confounding of predictors and disturbances. These ideal conditions are difficult to meet with observational data. In effect, an auxiliary model is required that has to be right. So, this escape clause too can be hard to exercise.

  7.

    Implied is that if one denotes the disparities over realizations between any μ_i and its y_i by ε_i, then E(ε_i) = 0.

  8.

    If the predictors are treated as fixed, one cannot formally generalize the results to values of the predictors not found in the data. Forecasting is also a problem because, with fixed predictor values, there is no account of how new predictor values are generated.

  9.

    If the conditional means of the joint distribution really do have a linear relationship with the predictors in the working model, the linear approximation is no longer an approximation. There is, then, no bias in the least-squares estimates with respect to the joint probability distribution. This is an unrealistic scenario in practice because even if the linear approximation were actually correct, there would be no way to definitively know it. All one has is a realization from the joint probability distribution.

  10.

    We change the notation for the regression model to underscore that we are no longer trying to estimate the “true” conditional means or “true” regression coefficients. Our estimates are for the linear approximation.

  11.

    All one requires is that \({[\mathrm{E}(\mathbf{X}^{T}\mathbf{X})]}^{-1}\) and \(\mathrm{E}(\mathbf{X}^{T}Y)\) exist. The asymptotics assume that the number of predictors is fixed as the number of observations increases without limit.
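    Written out (a standard way of stating the estimand for the linear approximation, not an equation reproduced from the chapter), the least-squares estimator converges to the population projection coefficients:

      $$\beta^{\ast} = {\left[\mathrm{E}\!\left(\mathbf{X}^{T}\mathbf{X}\right)\right]}^{-1}\mathrm{E}\!\left(\mathbf{X}^{T}Y\right),\qquad \hat{\beta} = {\left(\mathbf{X}^{T}\mathbf{X}\right)}^{-1}\mathbf{X}^{T}\mathbf{y}\;\stackrel{p}{\longrightarrow}\;\beta^{\ast}.$$

    These are the coefficients of the best linear approximation within nature’s joint probability distribution, the estimation target discussed in the text.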

  12.

    The subscripts i and k differ because the denominator is calculated first as a normalizing constant. Gelman and Park (2008: 3) have an expression that is similar to Eq. (19.6).

  13.

    The papers by Freedman and Mammen were in a general way anticipated by Fisher in 1924.

  14.

    This actually is a little tricky. If there is an indicator variable for each x-value and there is an intercept in the model, one of the indicator variables must be deleted. Otherwise, the predictor cross-product matrix cannot be inverted in the usual manner. The problem disappears if there is no intercept, but then the regression coefficients do not have their usual meaning.
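    A small numerical sketch of the point (the data are made up): with an indicator for every distinct x-value plus an intercept, the cross-product matrix is rank deficient; dropping one indicator restores invertibility.

      # Dummy-variable redundancy: the intercept equals the sum of the indicators.
      import numpy as np

      x = np.array([1, 1, 2, 2, 3, 3, 3])
      levels = np.unique(x)
      D = (x[:, None] == levels[None, :]).astype(float)       # one indicator per x-value

      X_full = np.column_stack([np.ones(len(x)), D])          # intercept + all indicators
      X_drop = np.column_stack([np.ones(len(x)), D[:, 1:]])   # intercept + all but one

      print(np.linalg.matrix_rank(X_full.T @ X_full))   # 3, not 4: cannot be inverted
      print(np.linalg.matrix_rank(X_drop.T @ X_drop))   # 3 = number of columns: fine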

  15.

    Categorical predictors would already be included as one or more indicator variables.

  16.

    In f′(t), the t is just a placeholder because when there is more than one predictor, there can be several sensible ways to represent the fitted values (Hastie et al. 2009: 165–167).

  17.

    The trace of the smoother matrix is the “effective degrees of freedom” used by the smoothing procedure, which plays the same role as the model degrees of freedom in conventional regression.
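    A brief sketch of what the trace means in practice. A Gaussian kernel smoother stands in here for the chapter's smoothing spline (an assumption for illustration); for any linear smoother with fitted values ŷ = S y, trace(S) is the effective degrees of freedom, and more flexible fits use more of them.

      # Effective degrees of freedom = trace of the smoother matrix S (y_hat = S @ y).
      import numpy as np

      rng = np.random.default_rng(1)
      x = np.sort(rng.uniform(0, 10, size=200))
      y = np.sin(x) + rng.normal(scale=0.3, size=200)

      def smoother_matrix(x, bandwidth):
          K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / bandwidth) ** 2)
          return K / K.sum(axis=1, keepdims=True)

      for h in (0.2, 1.0, 3.0):
          S = smoother_matrix(x, h)
          print(f"bandwidth {h}: effective df = {np.trace(S):.1f}")
      # Smaller bandwidths track the data more closely and use more effective df.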

  18.

    There are many kinds of linear smoothers including local means, local linear fits, and local polynomials that can be employed within kernel functions. The LOWESS procedure (Cleveland 1979) is one popular example. We focus on smoothing splines here because it is a natural extension of least-squares regression, commonly available, and effective in practice. Readers seeking a more extensive treatment of smoothing should consult Hastie and colleagues (2009: Chaps. 3, 5, and 6).
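    For readers who want to try a linear smoother directly, one readily available option (an illustration, not code from the chapter) is the LOWESS implementation in the Python statsmodels package:

      # LOWESS (Cleveland 1979) via statsmodels; frac is the span of each local fit.
      import numpy as np
      from statsmodels.nonparametric.smoothers_lowess import lowess

      rng = np.random.default_rng(2)
      x = rng.uniform(0, 10, size=300)
      y = np.sin(x) + rng.normal(scale=0.3, size=300)

      fitted = lowess(y, x, frac=0.3)   # array sorted by x: columns are [x, smoothed y]
      print(fitted[:5])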

  19.

    The estimation target is the nonlinear approximation within nature’s joint probability distribution.

  20.
    1.

      Initialize: \(\alpha = \mathrm{ave}(y_{i})\) and \({f}_{j} = {f}_{j}^{0},\;j = 1,\ldots,p\), with linear functions.

    2.

      Cycle: \(j = 1,\ldots,p,\;1,\ldots,p,\;\ldots\)

      $${f}_{j} = {\mathbf{S}}_{j}\left(y - \alpha -\sum\limits_{k\neq j}{f}_{k}\;\Big\vert\;{x}_{j}\right)\!,$$

      where \(\mathbf{S}_{j}\) is a smoother matrix.

    3.

      Continue #2 until the individual functions don’t change.
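    A minimal backfitting sketch in Python, following the three steps above. A Gaussian kernel smoother plays the role of S_j (the chapter uses smoothing splines), the functions start at zero rather than at linear fits, and a fixed number of cycles stands in for the convergence check; all of these are simplifying assumptions for illustration.

      # Backfitting for y = alpha + f1(x1) + f2(x2) + noise, kernel smoother as S_j.
      import numpy as np

      def kernel_smooth(x, r, bandwidth=0.5):
          """Apply a simple linear smoother to the partial residuals r."""
          K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / bandwidth) ** 2)
          return (K / K.sum(axis=1, keepdims=True)) @ r

      def backfit(X, y, n_cycles=20):
          n, p = X.shape
          alpha = y.mean()                      # Step 1: alpha = ave(y_i)
          f = np.zeros((n, p))                  # f_j initialized at zero
          for _ in range(n_cycles):             # Step 2: cycle j = 1, ..., p, 1, ..., p, ...
              for j in range(p):
                  others = [k for k in range(p) if k != j]
                  partial = y - alpha - f[:, others].sum(axis=1)
                  f[:, j] = kernel_smooth(X[:, j], partial)
                  f[:, j] -= f[:, j].mean()     # center each f_j so alpha stays identified
          return alpha, f                       # Step 3 replaced by a fixed cycle count

      rng = np.random.default_rng(3)
      X = rng.uniform(-2, 2, size=(300, 2))
      y = 1.0 + np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.2, size=300)
      alpha, f = backfit(X, y)
      print(round(alpha, 2))                    # close to the sample mean of y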

  21.

    For the backfitting binomial and Poisson variants, penalized maximum likelihood estimation is applied to each nonparametric regression term. In practice, this leads to the usual iteratively reweighted least-squares algorithm but with the penalty term included.
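    One standard way to write the penalized step (a textbook formulation in the spirit of Hastie and Tibshirani 1990, not an equation from the chapter): at each iteration, with working responses \(z_{i}\) and weights \(w_{i}\) from the usual IRLS construction, each smooth term solves

      $$\min_{f}\;\sum\limits_{i=1}^{n} w_{i}{\left(z_{i} - f(x_{i})\right)}^{2} + \lambda \int {\left(f''(t)\right)}^{2}\,dt,$$

    so the penalty simply rides along with the reweighted least-squares fit.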

  22.

    There are two implementations of GAM in R, one contained within the library gam and one contained within the library mgcv.

  23.

    Binary response variables are not a problem because the associated probabilities can be transformed into logits, which are quantitative.
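    For completeness (the formula is standard, not quoted from the chapter), the transformation in question is

      $$\mathrm{logit}(p) = \log\!\left(\frac{p}{1 - p}\right)\!,$$

    which maps probabilities in (0, 1) onto the whole real line.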

  24.

    Thin plate splines fit a two-dimensional surface to the data (Hastie et al. 2009: Sect. 5.7).

  25.

    The software provided a joint test for the null hypothesis that none of the predictors was related to the log of the number of homeless. The null hypothesis was rejected at well below conventional p-values. As already discussed, however, the meaning of such tests is obscure in this context.

  26.

    When the relationships with a response are linear in both dimensions, and when there are no interactions, the fitted values form a plane. Along either dimension, the slope does not change with the values of the other dimension. Interactions cause the plane to be torqued. The same reasoning applies when either or both of the relationships with the response are nonlinear (here, especially for median income). When the surface is torqued, the function for one dimension changes with values along the other dimension.

  27.

    It is not clear how to show uncertainty bands in three dimensions without making a plot unreadable.

  28.

    The adjustments for related predictors are approximations too. There is no direct correspondence to post-stratification as there is in conventional linear regression.

References

  • Angrist, J. D., & Pischke, J.-S. (2009). Mostly harmless econometrics: An empiricist's companion. Princeton: Princeton University Press.

  • Berk, R. A. (2003). Regression analysis: A constructive critique. Newbury Park: Sage Publications.

  • Berk, R. A., Kriegler, B., & Ylvisaker, D. (2008). Counting the homeless in Los Angeles County. In D. Nolan & T. Speed (Eds.), Probability and statistics: Essays in honor of David A. Freedman (Monograph series). Beachwood: Institute of Mathematical Statistics.

  • Berk, R. A., Brown, L., & Zhao, L. (2010). Statistical inference after model selection. Journal of Quantitative Criminology, 26, 217–236.

  • Berk, R. A., Brown, L., Buja, A., George, E., Pitkin, E., Traskin, M., Zhang, K., & Zhao, L. (2011). Regression with a random design matrix (Working paper). Philadelphia: Department of Statistics, University of Pennsylvania.

  • Box, G. E. P. (1979). Robustness in the strategy of scientific model building. In R. L. Launer & G. N. Wilkinson (Eds.), Robustness in statistics. New York: Academic Press.

  • Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74, 829–836.

  • Cook, R. D., & Weisberg, S. (1999). Applied regression including computing and graphics. New York: Wiley.

  • Duncan, O. D. (1975). Introduction to structural equation models. New York: Academic Press.

  • Eicker, F. (1963). Asymptotic normality and consistency of the least squares estimators for families of linear regressions. Annals of Mathematical Statistics, 34, 447–456.

  • Eicker, F. (1967). Limit theorems for regressions with unequal and dependent errors. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1, 59–82.

  • Fisher, R. A. (1924). The distribution of the partial correlation coefficient. Metron, 3, 329–332.

  • Freedman, D. A. (1981). Bootstrapping regression models. Annals of Statistics, 9(6), 1218–1228.

  • Freedman, D. A. (2009). Diagnostics cannot have much power against general alternatives. International Journal of Forecasting, 25(4), 833–839.

  • Gelman, A., & Park, D. K. (2008). Splitting a predictor at the upper quarter or third and the lower quarter or third. The American Statistician, 62(4), 1–8.

  • Goldberger, A. S., & Duncan, O. D. (1973). Structural equation models in the social sciences. New York: Seminar Press.

  • Greene, W. H. (2003). Econometric analysis (5th ed.). New York: Prentice Hall.

  • Hanushek, E. A., & Jackson, J. E. (1977). Statistical methods for social scientists. New York: Academic Press.

  • Hastie, T. J., & Tibshirani, R. J. (1990). Generalized additive models. New York: Chapman & Hall.

  • Hastie, T. J., Tibshirani, R. J., & Friedman, J. (2009). The elements of statistical learning (2nd ed.). New York: Springer.

  • Huber, P. J. (1967). The behavior of maximum likelihood estimates under nonstandard conditions. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1, 221–233.

  • Kaplan, D. (2009). Structural equation modeling: Foundations and extensions (2nd ed.). Los Angeles: Sage Publications.

  • Leeb, H., & Pötscher, B. M. (2006). Can one estimate the conditional distribution of post-model-selection estimators? The Annals of Statistics, 34(5), 2554–2591.

  • Leeb, H., & Pötscher, B. M. (2008a). Model selection. In T. G. Andersen, R. A. Davis, J.-P. Kreiß, & T. Mikosch (Eds.), The handbook of financial time series (pp. 785–821). New York: Springer.

  • Leeb, H., & Pötscher, B. M. (2008b). Sparse estimators and the oracle property, or the return of Hodges' estimator. Journal of Econometrics, 142, 201–211.

  • Mammen, E. (1993). Bootstrap and wild bootstrap for high dimensional linear models. Annals of Statistics, 21(1), 255–285.

  • Rosenbaum, P. (2009). Design of observational studies. New York: Springer.

  • Rosenbaum, P. (2010). Observational studies (2nd ed.). New York: Springer.

  • Thompson, S. (2002). Sampling (2nd ed.). New York: Wiley.

  • Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1), 267–288.

  • White, H. (1980). Using least squares to approximate unknown regression functions. International Economic Review, 21(1), 149–170.

  • Zellner, A. (1984). Basic issues in econometrics. Chicago: University of Chicago Press.

Author information

Correspondence to Richard A. Berk.


Copyright information

© 2013 Springer Science+Business Media Dordrecht

Cite this chapter

Berk, R.A. et al. (2013). What You Can Learn from Wrong Causal Models. In: Morgan, S. (eds) Handbook of Causal Analysis for Social Research. Handbooks of Sociology and Social Research. Springer, Dordrecht. https://doi.org/10.1007/978-94-007-6094-3_19
