Model Evaluation in the Presence of Categorical Data: Bayesian Model Checking as an Alternative to Traditional Methods

Abstract

Statistical analysis of categorical data often relies on multiway contingency tables; yet, as the number of categories and/or variables increases, the number of table cells with few (or zero) observations also increases. Unfortunately, sparse contingency tables invalidate the use of standard goodness-of-fit statistics. Limited-information fit statistics and bootstrapping procedures offer valuable solutions to this problem, but they present an additional concern in their strict reliance on the (potentially misleading) observed data. To address both of these issues, we demonstrate the Bayesian model checking technique, which yields insightful, useful, and comprehensive evaluations of specific properties of a given model. We illustrate this technique using item response data from a patient-reported psychopathology screening questionnaire, and we provide annotated R code to promote dissemination of this informative method in other prevention science modeling scenarios.
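
The annotated R code referenced above appears in the full article rather than on this page. As a rough orientation only, the following is a minimal, self-contained sketch of a posterior predictive model check for binary item response data, in the spirit of Gelman, Meng, and Stern (1996); the data, the "posterior draws," and the item-pair log odds ratio discrepancy (cf. Chen & Thissen, 1997) are hypothetical stand-ins, not the authors' analysis of the screening questionnaire.

## Posterior predictive check sketch in base R -- illustrative only
set.seed(1)
n_persons <- 500
n_items   <- 5

## "Observed" data: fabricated 0/1 responses from a hypothetical 2PL model
theta_true <- rnorm(n_persons)            # latent traits
a_true     <- runif(n_items, 0.8, 2.0)    # hypothetical discriminations
b_true     <- rnorm(n_items)              # hypothetical difficulties
p_true <- plogis(sweep(outer(theta_true, b_true, "-"), 2, a_true, "*"))
y_obs  <- matrix(rbinom(n_persons * n_items, 1, p_true), n_persons, n_items)

## Stand-ins for posterior draws of the item parameters; in a real analysis
## these would come from an MCMC fit of the Bayesian IRT model
n_draws <- 200
post_a <- matrix(rnorm(n_draws * n_items, rep(a_true, each = n_draws), 0.1), n_draws, n_items)
post_b <- matrix(rnorm(n_draws * n_items, rep(b_true, each = n_draws), 0.1), n_draws, n_items)

## Discrepancy measure: log odds ratio for one item pair (sensitive to local dependence)
log_odds_ratio <- function(y, i, j) {
  tab <- table(factor(y[, i], levels = 0:1), factor(y[, j], levels = 0:1)) + 0.5
  log((tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1]))
}
obs_stat <- log_odds_ratio(y_obs, 1, 2)

## For each posterior draw, simulate a replicated data set under the model and
## recompute the discrepancy
rep_stat <- numeric(n_draws)
for (s in seq_len(n_draws)) {
  theta_rep <- rnorm(n_persons)   # latent traits from the assumed N(0, 1) population
  p_rep <- plogis(sweep(outer(theta_rep, post_b[s, ], "-"), 2, post_a[s, ], "*"))
  y_rep <- matrix(rbinom(n_persons * n_items, 1, p_rep), n_persons, n_items)
  rep_stat[s] <- log_odds_ratio(y_rep, 1, 2)
}

## Posterior predictive p-value for this feature of the data; values near 0 or 1
## flag misfit, while values near 0.5 indicate the model reproduces the feature well
ppp <- mean(rep_stat >= obs_stat)
ppp

The same loop can be run over other discrepancy measures (item proportions, summed-score distributions, other pairwise associations), each targeting a specific property of the fitted model.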

Notes

  1. Although we largely use default settings here, we also note that subjective prior settings are a helpful way of incorporating specific opinions, theory, or knowledge into the estimation process. We describe this element of subjectivity in more detail in the “Discussion” section.

  2. Thank you to Dr. Waguih IsHak of the Geffen School of Medicine at UCLA for providing this data set.

  3. One important consideration when specifying priors is the possibility of prior-data conflict. If the priors are misaligned with the evidence in the data, then both the posterior distributions and the resulting goodness-of-fit assessments can be distorted. This problem is especially likely when informative, but “inaccurate,” priors are used; we therefore employed non-informative priors to avoid it. For more on prior-data conflict, please see Evans and Moshonov (2006) and the toy sketch below.
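
To make the prior-data conflict idea concrete, here is a toy prior predictive check in base R, loosely in the spirit of Evans and Moshonov (2006); it is not part of the article's analysis, and the sample size, observed count, and Beta prior below are all hypothetical.

## Toy prior-data conflict check -- illustrative only
set.seed(2)
n     <- 100    # hypothetical number of respondents
y_obs <- 78     # hypothetical observed count of symptom endorsements

## A deliberately "inaccurate" informative prior: Beta(2, 20) concentrates the
## endorsement rate near 0.09, far from the roughly 0.78 rate suggested by the data
n_sim <- 10000
p_sim <- rbeta(n_sim, 2, 20)        # draws from the prior
y_sim <- rbinom(n_sim, n, p_sim)    # prior predictive draws of the count

## Upper-tail prior predictive probability of a count at least as large as the
## one observed; a value near zero signals prior-data conflict
mean(y_sim >= y_obs)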

References

  • Ackerman, T. A. (1991). The use of unidimensional parameter estimates of multidimensional items in adaptive testing. Applied Psychological Measurement, 15, 13–24.

  • Ansley, T. N., & Forsyth, R. A. (1985). An examination of the characteristics of unidimensional IRT parameter estimates derived from two-dimensional data. Applied Psychological Measurement, 9, 37–48.

  • Bartholomew, D. J., & Tzamourani, P. (1999). The goodness of fit of latent trait models in attitude measurement. Sociological Methods & Research, 27, 525–546.

  • Bayes, T. (1764). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53, 370–418.

  • Béguin, A. A., & Glas, C. A. (2001). MCMC estimation and some model-fit analysis of multidimensional IRT models. Psychometrika, 66(4), 541–561.

  • Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 397–472). Reading, MA: Addison-Wesley.

  • Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46, 443–459.

  • Bolt, D. M. (1999). Evaluating the effects of multidimensionality on IRT true-score equating. Applied Measurement in Education, 12, 383–407.

  • Bonifay, W. (2015). An illustration of the two-tier item factor analysis model. In S. P. Reise & D. A. Revicki (Eds.), Handbook of item response theory modeling (pp. 207–225). Routledge.

  • Bonifay, W., & Cai, L. (2017). On the complexity of item response theory models. Multivariate Behavioral Research, 52, 465–484.

  • Box, G. E. (1976). Science and statistics. Journal of the American Statistical Association, 71, 791–799.

  • Cai, L. (2010). Metropolis-Hastings Robbins-Monro algorithm for confirmatory item factor analysis. Journal of Educational and Behavioral Statistics, 35, 307–335.

  • Cai, L. (2020). flexMIRT R version 3.6: Flexible multilevel multidimensional item analysis and test scoring [Computer software]. Chapel Hill, NC: Vector Psychometric Group.

  • Cai, L., Chung, S. W., & Lee, T. (in press). Incremental model fit assessment in the case of categorical data: Tucker-Lewis Index for item response theory. Prevention Science.

  • Cai, L., & Hansen, M. (2013). Limited‐information goodness‐of‐fit testing of hierarchical item factor models. British Journal of Mathematical and Statistical Psychology, 66, 245–276.

  • Castel, S., Rush, B., Kennedy, S., Fulton, K., & Toneatto, T. (2007). Screening for mental health problems among patients with substance use disorders: Preliminary findings on the validation of a self-assessment instrument. The Canadian Journal of Psychiatry, 52, 22–27.

  • Chen, W. H., & Thissen, D. (1997). Local dependence indexes for item pairs using item response theory. Journal of Educational and Behavioral Statistics, 22, 265–289.

  • Depaoli, S., Yang, Y., & Felt, J. (2017). Using Bayesian statistics to model uncertainty in mixture models: A sensitivity analysis of priors. Structural Equation Modeling: A Multidisciplinary Journal, 24, 198–215.

  • Depaoli, S., & Van de Schoot, R. (2017). Improving transparency and replication in Bayesian statistics: The WAMBS-Checklist. Psychological Methods, 22, 240.

  • Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7, 1–26.

  • Evans, M., & Moshonov, H. (2006). Checking for prior-data conflict. Bayesian Analysis, 1, 893–914. https://doi.org/10.1214/06-BA129

  • Fox, J. P. (2010). Bayesian item response modeling: Theory and applications. Springer Science & Business Media.

  • Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian data analysis. CRC Press.

  • Gelman, A., Meng, X. L., & Stern, H. (1996). Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinica, 6, 733–760.

  • Gelman, A., & Shalizi, C. R. (2013). Philosophy and the practice of Bayesian statistics. British Journal of Mathematical and Statistical Psychology, 66, 8–38.

  • Gibbons, R. D., Rush, A. J., & Immekus, J. C. (2009). On the psychometric validity of the domains of the PDSQ: An illustration of the bi-factor item response theory model. Journal of Psychiatric Research, 43, 401–410.

  • Guttman, I. (1967). The use of the concept of a future observation in goodness-of-fit problems. Journal of the Royal Statistical Society: Series B (Methodological), 29, 83–100.

  • Hayduk, L., Cummings, G., Boadu, K., Pazderka-Robinson, H., & Boulianne, S. (2007). Testing! testing! one, two, three–Testing the theory in structural equation models! Personality and Individual Differences, 42, 841–850.

  • Hoff, P. D. (2009). A first course in Bayesian statistical methods (Vol. 580). Springer.

  • Houben, M., Claes, L., Vansteelandt, K., Berens, A., Sleuwaegen, E., & Kuppens, P. (2017). The emotion regulation function of nonsuicidal self-injury: A momentary assessment study in inpatients with borderline personality disorder features. Journal of Abnormal Psychology, 126, 89–95.

  • Kadane, J. B. (2015). Bayesian methods for prevention research. Prevention Science, 16, 1017–1025.

  • Kaplan, D. (2014). Bayesian statistics for the social sciences. Guilford Press.

  • Kass, R. E., & Wasserman, L. (1996). The selection of prior distributions by formal rules. Journal of the American Statistical Association, 91, 1343–1370.

  • Langeheine, R., Pannekoek, J., & Van de Pol, F. (1996). Bootstrapping goodness-of-fit measures in categorical data analysis. Sociological Methods & Research, 24, 492–516.

  • Levy, R. (2011). Posterior predictive model checking for conjunctive multidimensionality in item response theory. Journal of Educational and Behavioral Statistics, 36, 672–694.

  • Li, Z., & Cai, L. (2018). Summed score likelihood–based indices for testing latent variable distribution fit in item response theory. Educational and Psychological Measurement, 78(5), 857–886.

  • Lim, H., & Wells, C. S. (2020). irtplay: An R package for online item calibration, scoring, evaluation of model fit, and useful functions for unidimensional IRT. Applied Psychological Measurement. https://doi.org/10.1177/0146621620921247

  • MacCallum, R. C., Browne, M. W., & Sugawara, H. M. (1996). Power analysis and determination of sample size for covariance structure modeling. Psychological Methods, 1, 130–149.

  • Marsh, H. W., Balla, J. R., & McDonald, R. P. (1988). Goodness-of-fit indexes in confirmatory factor analysis: The effect of sample size. Psychological Bulletin, 103(3), 391–410.

  • Marsh, H. W., & Balla, J. (1994). Goodness of fit in confirmatory factor analysis: The effects of sample size and model parsimony. Quality and Quantity, 28, 185–217.

  • Maydeu-Olivares, A. (2013). Goodness-of-fit assessment of item response theory models. Measurement, 11, 71–101.

  • McNeish, D. (2018). Thanks coefficient alpha, we’ll take it from here. Psychological Methods, 23, 412–433.

  • McNeish, D., & Wolf, M. G. (2020). Thinking twice about sum scores. Behavior Research Methods, 1–19.

  • Monroe, S. (2021). Testing latent variable distribution fit in IRT using posterior residuals. Journal of Educational and Behavioral Statistics, 46(3), 374–398.

  • Orlando, M., & Thissen, D. (2000). New item fit indices for dichotomous item response theory models. Applied Psychological Measurement, 24, 50–64.

  • Ory, D. T., & Mokhtarian, P. L. (2010). The impact of non-normality, sample size and estimation technique on goodness-of-fit measures in structural equation modeling: Evidence from ten empirical models of travel behavior. Quality & Quantity, 44, 427–445.

  • R Core Team (2017). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing.

  • Reise, S. R., Cook, K. F., & Moore, T. M. (2015). Evaluating the impact of multidimensionality on unidimensional item response theory model parameters. In S. P. Reise & D. A. Revicki (Eds.), Handbook of item response theory modeling (pp. 13–40). Routledge.

  • Roberts, S., & Pashler, H. (2000). How persuasive is a good fit? A comment on theory testing. Psychological Review, 107, 358–367.

  • Rubin, D. B. (1981). The Bayesian bootstrap. The Annals of Statistics, 9, 130–134.

  • Rubin, D. B. (1984). Bayesianly justifiable and relevant frequency calculations for the applied statistician. The Annals of Statistics, 12, 1151–1172.

  • Rush, A. J., Fava, M., Wisniewski, S. R., Lavori, P. W., Trivedi, M. H., Sackeim, H. A., & Niederehe, G. (2004). Sequenced treatment alternatives to relieve depression (STAR*D): Rationale and design. Controlled Clinical Trials, 25, 119–142.

  • Sinharay, S. (2006). Bayesian item fit analysis for unidimensional item response theory models. British Journal of Mathematical and Statistical Psychology, 59(2), 429–449.

  • Stone, C. A., & Zhu, X. (2015). Bayesian analysis of item response theory models using SAS®. SAS Institute Inc.

  • van de Schoot, R., Kaplan, D., Denissen, J., Asendorpf, J. B., Neyer, F. J., & Van Aken, M. A. (2014). A gentle introduction to Bayesian analysis: Applications to developmental research. Child Development, 85, 842–860.

  • van Erp, S., Mulder, J., & Oberski, D. (2018). Prior sensitivity analysis in default Bayesian structural equation modeling. Psychological Methods, 23, 363–388.

  • Way, W. D., Ansley, T. N., & Forsyth, R. A. (1988). The comparative effects of compensatory and noncompensatory two-dimensional data on unidimensional IRT estimates. Applied Psychological Measurement, 12, 239–252.

  • Zhu, X., & Stone, C. A. (2012). Bayesian comparison of alternative graded response models for performance assessment applications. Educational and Psychological Measurement, 72(5), 774–799.

  • Zimmerman, M., & Mattia, J. I. (2001). A self-report scale to help make psychiatric diagnoses: The Psychiatric Diagnostic Screening Questionnaire. Archives of General Psychiatry, 58, 787–794.

Funding

The research reported here was supported by the Institute of Education Sciences, US Department of Education, through Grant R305D210032.

Author information

Corresponding author

Correspondence to Wes Bonifay.

Ethics declarations

Ethics Approval

All procedures performed in the STAR*D trial were approved by the institutional review board of the STAR*D National Coordinating Center at the University of Texas Southwestern Medical Center, in accordance with the ethical standards laid down in the 1964 Declaration of Helsinki and its later amendments or comparable ethical standards.

Disclaimer

The opinions expressed are those of the authors and do not represent views of the Institute or the US Department of Education.

Consent to Participate

Informed consent was obtained from all the participants.

Conflict of Interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Bonifay, W., Depaoli, S. Model Evaluation in the Presence of Categorical Data: Bayesian Model Checking as an Alternative to Traditional Methods. Prev Sci 24, 467–479 (2023). https://doi.org/10.1007/s11121-021-01293-w
