Abstract
Statistical analysis of categorical data often relies on multiway contingency tables; yet, as the number of categories and/or variables increases, the number of table cells with few (or zero) observations also increases. Unfortunately, sparse contingency tables invalidate the use of standard goodness-of-fit statistics. Limited-information fit statistics and bootstrapping procedures offer valuable solutions to this problem, but they present an additional concern in their strict reliance on the (potentially misleading) observed data. To address both of these issues, we demonstrate the Bayesian model checking technique, which yields insightful, useful, and comprehensive evaluations of specific properties of a given model. We illustrate this technique using item response data from a patient-reported psychopathology screening questionnaire, and we provide annotated R code to promote dissemination of this informative method in other prevention science modeling scenarios.
Notes
Although we largely use default settings here, we also note that subjective prior settings are a helpful way of incorporating specific opinions, theory, or knowledge into the estimation process. We describe this element of subjectivity in more detail in the “Discussion” section.
Thank you to Dr. Waguih IsHak of the Geffen School of Medicine at UCLA for providing this data set.
One important aspect when deciding on the specification of priors is the issue of prior-data disagreement. If the priors are misaligned with the evidence in the data, then both the posteriors and goodness-of-fit (GOF) assessments can be unduly influenced by the priors. This issue is particularly common when informative, but "inaccurate," priors are implemented. We used non-informative priors to avoid this issue. For more on prior-data conflict, please see Evans and Moshonov (2006).
References
Ackerman, T. A. (1991). The use of unidimensional parameter estimates of multidimensional items in adaptive testing. Applied Psychological Measurement, 15, 13–24.
Ansley, T. N., & Forsyth, R. A. (1985). An examination of the characteristics of unidimensional IRT parameter estimates derived from two-dimensional data. Applied Psychological Measurement, 9, 37–48.
Bartholomew, D. J., & Tzamourani, P. (1999). The goodness of fit of latent trait models in attitude measurement. Sociological Methods & Research, 27, 525–546.
Bayes, T. (1764). An essay toward solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53, 370–418.
Béguin, A. A., & Glas, C. A. (2001). MCMC estimation and some model-fit analysis of multidimensional IRT models. Psychometrika, 66(4), 541–561.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 397–472). Reading, MA: Addison-Wesley.
Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46, 443–459.
Bolt, D. M. (1999). Evaluating the effects of multidimensionality on IRT true-score equating. Applied Measurement in Education, 12, 383–407.
Bonifay, W. (2015). An illustration of the two-tier item factor analysis model. In S. P. Reise & D. A. Revicki (Eds.), Handbook of item response theory modeling (pp. 207–225). Routledge.
Bonifay, W., & Cai, L. (2017). On the complexity of item response theory models. Multivariate Behavioral Research, 52, 465–484.
Box, G. E. (1976). Science and statistics. Journal of the American Statistical Association, 71, 791–799.
Cai, L. (2010). Metropolis-Hastings Robbins-Monro algorithm for confirmatory item factor analysis. Journal of Educational and Behavioral Statistics, 35, 307–335.
Cai, L. (2020). flexMIRT R version 3.6: Flexible multilevel multidimensional item analysis and test scoring [Computer software]. Chapel Hill, NC: Vector Psychometric Group.
Cai, L., Chung, S. W., & Lee, T. (in press). Incremental model fit assessment in the case of categorical data: Tucker-Lewis Index for item response theory. Prevention Science.
Cai, L., & Hansen, M. (2013). Limited‐information goodness‐of‐fit testing of hierarchical item factor models. British Journal of Mathematical and Statistical Psychology, 66, 245–276.
Castel, S., Rush, B., Kennedy, S., Fulton, K., & Toneatto, T. (2007). Screening for mental health problems among patients with substance use disorders: Preliminary findings on the validation of a self-assessment instrument. The Canadian Journal of Psychiatry, 52, 22–27.
Chen, W. H., & Thissen, D. (1997). Local dependence indexes for item pairs using item response theory. Journal of Educational and Behavioral Statistics, 22, 265–289.
Depaoli, S., Yang, Y., & Felt, J. (2017). Using Bayesian statistics to model uncertainty in mixture models: A sensitivity analysis of priors. Structural Equation Modeling: A Multidisciplinary Journal, 24, 198–215.
Depaoli, S., & Van de Schoot, R. (2017). Improving transparency and replication in Bayesian statistics: The WAMBS-Checklist. Psychological Methods, 22, 240.
Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7, 1–26.
Evans, M., & Moshonov, H. (2006). Checking for prior-data conflict. Bayesian Analysis, 1, 893–914. https://doi.org/10.1214/06-BA129
Fox, J. P. (2010). Bayesian item response modeling: Theory and applications. Springer Science & Business Media.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian data analysis. CRC Press.
Gelman, A., Meng, X. L., & Stern, H. (1996). Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinica, 6, 733–760.
Gelman, A., & Shalizi, C. R. (2013). Philosophy and the practice of Bayesian statistics. British Journal of Mathematical and Statistical Psychology, 66, 8–38.
Gibbons, R. D., Rush, A. J., & Immekus, J. C. (2009). On the psychometric validity of the domains of the PDSQ: An illustration of the bi-factor item response theory model. Journal of Psychiatric Research, 43, 401–410.
Guttman, I. (1967). The use of the concept of a future observation in goodness-of-fit problems. Journal of the Royal Statistical Society. Series B (Methodological), 83–100.
Hayduk, L., Cummings, G., Boadu, K., Pazderka-Robinson, H., & Boulianne, S. (2007). Testing! testing! one, two, three–Testing the theory in structural equation models! Personality and Individual Differences, 42, 841–850.
Hoff, P. D. (2009). A first course in Bayesian statistical methods (Vol. 580). Springer.
Houben, M., Claes, L., Vansteelandt, K., Berens, A., Sleuwaegen, E., & Kuppens, P. (2017). The emotion regulation function of nonsuicidal self-injury: A momentary assessment study in inpatients with borderline personality disorder features. Journal of Abnormal Psychology, 126, 89–95.
Kadane, J. B. (2015). Bayesian methods for prevention research. Prevention Science, 16, 1017–1025.
Kaplan, D. (2014). Bayesian statistics for the social sciences. Guilford Press.
Kass, R. E., & Wasserman, L. (1996). The selection of prior distributions by formal rules. Journal of the American Statistical Association, 91, 1343–1370.
Langeheine, R., Pannekoek, J., & Van de Pol, F. (1996). Bootstrapping goodness-of-fit measures in categorical data analysis. Sociological Methods & Research, 24, 492–516.
Levy, R. (2011). Posterior predictive model checking for conjunctive multidimensionality in item response theory. Journal of Educational and Behavioral Statistics, 36, 672–694.
Li, Z., & Cai, L. (2018). Summed score likelihood–based indices for testing latent variable distribution fit in item response theory. Educational and Psychological Measurement, 78(5), 857–886.
Lim, H., & Wells, C. S. (2020). irtplay: An R package for online item calibration, scoring, evaluation of model fit, and useful functions for unidimensional IRT. Applied Psychological Measurement. https://doi.org/10.1177/0146621620921247
MacCallum, R. C., Browne, M. W., & Sugawara, H. M. (1996). Power analysis and determination of sample size for covariance structure modeling. Psychological Methods, 1, 130–149.
Marsh, H. W., Balla, J. R., & McDonald, R. P. (1988). Goodness-of-fit indexes in confirmatory factor analysis: The effect of sample size. Psychological Bulletin, 103(3), 391–410.
Marsh, H. W., & Balla, J. (1994). Goodness of fit in confirmatory factor analysis: The effects of sample size and model parsimony. Quality and Quantity, 28, 185–217.
Maydeu-Olivares, A. (2013). Goodness-of-fit assessment of item response theory models. Measurement: Interdisciplinary Research and Perspectives, 11, 71–101.
McNeish, D. (2018). Thanks coefficient alpha, we’ll take it from here. Psychological Methods, 23, 412–433.
McNeish, D., & Wolf, M. G. (2020). Thinking twice about sum scores. Behavior Research Methods, 1–19.
Monroe, S. (2021). Testing latent variable distribution fit in IRT using posterior residuals. Journal of Educational and Behavioral Statistics, 46(3), 374–398.
Orlando, M., & Thissen, D. (2000). New item fit indices for dichotomous item response theory models. Applied Psychological Measurement, 24, 50–64.
Ory, D. T., & Mokhtarian, P. L. (2010). The impact of non-normality, sample size and estimation technique on goodness-of-fit measures in structural equation modeling: Evidence from ten empirical models of travel behavior. Quality & Quantity, 44, 427–445.
R Core Team (2017). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing.
Reise, S. R., Cook, K. F., & Moore, T. M. (2015). Evaluating the impact of multidimensionality on unidimensional item response theory model parameters. In S. P. Reise & D. A. Revicki (Eds.), Handbook of item response theory modeling (pp. 13–40). Routledge.
Roberts, S., & Pashler, H. (2000). How persuasive is a good fit? A comment on theory testing. Psychological Review, 107, 358–367.
Rubin, D. B. (1981). The Bayesian bootstrap. The Annals of Statistics, 9, 130–134.
Rubin, D. B. (1984). Bayesianly justifiable and relevant frequency calculations for the applied statistician. The Annals of Statistics, 12, 1151–1172.
Rush, A. J., Fava, M., Wisniewski, S. R., Lavori, P. W., Trivedi, M. H., Sackeim, H. A., & Niederehe, G. (2004). Sequenced treatment alternatives to relieve depression (STAR* D): Rationale and design. Controlled Clinical Trials, 25, 119–142.
Sinharay, S. (2006). Bayesian item fit analysis for unidimensional item response theory models. British Journal of Mathematical and Statistical Psychology, 59(2), 429–449.
Stone, C. A., & Zhu, X. (2015). Bayesian analysis of item response theory models using SAS®. SAS Institute Inc.
van de Schoot, R., Kaplan, D., Denissen, J., Asendorpf, J. B., Neyer, F. J., & Van Aken, M. A. (2014). A gentle introduction to Bayesian analysis: Applications to developmental research. Child Development, 85, 842–860.
van Erp, S., Mulder, J., & Oberski, D. (2018). Prior sensitivity analysis in default Bayesian structural equation modeling. Psychological Methods, 23, 363–388.
Way, W. D., Ansley, T. N., & Forsyth, R. A. (1988). The comparative effects of compensatory and noncompensatory two-dimensional data on unidimensional IRT estimates. Applied Psychological Measurement, 12, 239–252.
Zhu, X., & Stone, C. A. (2012). Bayesian comparison of alternative graded response models for performance assessment applications. Educational and Psychological Measurement, 72(5), 774–799.
Zimmerman, M., & Mattia, J. I. (2001). A self-report scale to help make psychiatric diagnoses: The Psychiatric Diagnostic Screening Questionnaire. Archives of General Psychiatry, 58, 787–794.
Funding
The research reported here was supported by the Institute of Education Sciences, US Department of Education, through Grant R305D210032.
Ethics declarations
Ethics Approval
All procedures performed in the STAR*D trial were approved by the institutional review board of the STAR*D National Coordinating Center at the University of Texas Southwestern Medical Center, in accordance with the ethical standards laid down in the 1964 Declaration of Helsinki and its later amendments or comparable ethical standards.
Disclaimer
The opinions expressed are those of the authors and do not represent views of the Institute or the US Department of Education.
Consent to Participate
Informed consent was obtained from all the participants.
Conflict of Interest
The authors declare that they have no conflict of interest.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Bonifay, W., Depaoli, S. Model Evaluation in the Presence of Categorical Data: Bayesian Model Checking as an Alternative to Traditional Methods. Prev Sci 24, 467–479 (2023). https://doi.org/10.1007/s11121-021-01293-w