Volume 84, Issue 1, pp 285–309

High-Stakes Testing Case Study: A Latent Variable Approach for Assessing Measurement and Prediction Invariance

  • Steven Andrew Culpepper
  • Herman Aguinis
  • Justin L. Kern
  • Roger Millsap


The existence of differences in prediction systems involving test scores across demographic groups continues to be a thorny and unresolved scientific, professional, and societal concern. Our case study uses a two-stage least squares (2SLS) estimator to jointly assess measurement invariance and prediction invariance in high-stakes testing. Accordingly, we examined differences across groups based on latent rather than observed scores, using data from The College Board for 176 colleges and universities. Results showed that measurement invariance was rejected for the SAT mathematics (SAT-M) subtest at the 0.01 level for 74.5% and 29.9% of cohorts for Black versus White and Hispanic versus White comparisons, respectively. Also, on average, Black students with the same standing on a common factor had observed SAT-M scores that were nearly a third of a standard deviation lower than those of comparable White students. We also found evidence that group differences in SAT-M measurement intercepts may partly explain the well-known finding of observed differences in prediction intercepts. Additionally, results provided evidence that nearly a quarter of the statistically significant observed intercept differences were not statistically significant at the 0.05 level once predictor measurement error was accounted for using the 2SLS procedure. Our joint measurement and prediction invariance approach based on latent scores opens the door to a new high-stakes testing research agenda whose goal is not simply to assess whether observed group-based differences exist and, if so, their size and direction. Rather, the goal of this research agenda is to assess the causal chain starting with underlying theoretical mechanisms (e.g., contextual factors, differences in latent predictor scores) that affect the size and direction of any observed differences.
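The role of predictor measurement error in observed intercept differences can be illustrated with a small simulation. The sketch below is not the authors' latent variable 2SLS procedure; it is a generic textbook instrumental-variables example under stated assumptions: simulated data, two error-prone indicators of one latent score (the second used as an instrument for the first), no true group difference in the prediction intercept, and hypothetical function names (`ols`, `two_stage_least_squares`, `simulate`).

```python
import numpy as np


def ols(X, y):
    """Least-squares coefficients for y regressed on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def two_stage_least_squares(y, x_endog, X_exog, z):
    """Generic 2SLS: x_endog is the error-prone predictor, z its instrument.

    Returns coefficients ordered as [columns of X_exog..., x_endog].
    """
    # Stage 1: project the error-prone predictor on exogenous columns + instrument.
    W = np.column_stack([X_exog, z])
    x_hat = W @ ols(W, x_endog)
    # Stage 2: regress the outcome on exogenous columns + fitted predictor.
    return ols(np.column_stack([X_exog, x_hat]), y)


def simulate(n=20000, seed=0):
    """Simulated (assumed) data: groups differ on the latent score only."""
    rng = np.random.default_rng(seed)
    group = rng.integers(0, 2, size=n).astype(float)
    latent = rng.normal(loc=-0.5 * group, scale=1.0, size=n)
    x1 = latent + rng.normal(scale=0.7, size=n)  # observed score (with error)
    x2 = latent + rng.normal(scale=0.7, size=n)  # second indicator (instrument)
    # Outcome depends on the latent score only: true intercept difference is 0.
    y = 0.5 * latent + rng.normal(scale=1.0, size=n)
    return y, x1, x2, group


if __name__ == "__main__":
    y, x1, x2, group = simulate()
    X_exog = np.column_stack([np.ones_like(group), group])
    b_ols = ols(np.column_stack([X_exog, x1]), y)
    b_iv = two_stage_least_squares(y, x1, X_exog, x2)
    print("OLS  [intercept, group, slope]:", np.round(b_ols, 3))
    print("2SLS [intercept, group, slope]:", np.round(b_iv, 3))
```

In this setup, ordinary least squares on the observed score attenuates the slope and, because the groups differ on the latent score, produces a nonzero group coefficient even though none exists in the data-generating model. The 2SLS estimate should recover a slope near 0.5 and a group coefficient near zero.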


measurement invariance, prediction invariance, instrumental variables, high-stakes testing


Supplementary material

Supplementary material 1: 11336_2018_9649_MOESM1_ESM.csv (CSV, 1757 KB)
Supplementary material 2: 11336_2018_9649_MOESM2_ESM.csv (CSV, 188 KB)



Copyright information

© The Psychometric Society 2019

Authors and Affiliations

  1. Department of Statistics, University of Illinois at Urbana–Champaign, Champaign, USA
  2. Department of Psychology, Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana–Champaign, Urbana, USA
  3. Department of Management, School of Business, George Washington University, Washington, USA
  4. Department of Educational Psychology, University of Illinois at Urbana–Champaign, Champaign, USA
  5. Department of Psychology, Arizona State University, Tempe, USA
