Abstract
In using organizational surveys for decision-making, it is essential to consider measurement equivalence/invariance (ME/I), which addresses whether score differences are attributable to differences in the latent variable we intend to measure or to confounding differences in measurement properties. Because null results tend to remain unpublished, most articles have focused on findings of, and reasons for, violations of ME/I. Little guidance is available to practitioners and researchers concerning situations in which ME/I can be expected to hold. This gap is especially disconcerting because the null is the desired result in such analyses and is what permits unfettered observed-score comparisons. This special issue presents a unique opportunity to provide such a discussion using real-world examples from an organizational culture survey. In doing so, we hope to clear up confusion surrounding the concept of ME/I, when it can be expected, and how it relates to actual differences in scores. First, we review the basic tenets of ME/I and past findings, and describe the item response theory differential item functioning framework used here. Next, we show ME/I being upheld in organizational survey data where violations would reasonably not be expected (i.e., the null hypothesis was predicted and supported), and we simulate the consequences of ignoring ME/I. Finally, we suggest a set of conditions under which ME/I is likely to hold.
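The central idea of the abstract can be made concrete with a small numerical sketch. The example below is not the authors' analysis; it is a minimal illustration, with hypothetical item parameters, of how a violation of ME/I (here, uniform differential item functioning under a two-parameter logistic model) produces different expected observed scores for two groups even when their latent standing is identical.

```python
import math

def p_2pl(theta, a, b):
    """Probability of item endorsement under the 2PL IRT model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical 5-item scale: (discrimination a, difficulty b) pairs.
# Both groups share all parameters except item 3, whose difficulty is
# shifted upward for the focal group (uniform DIF).
ref_items = [(1.2, -1.0), (0.8, -0.5), (1.5, 0.0), (1.0, 0.5), (0.9, 1.0)]
foc_items = list(ref_items)
foc_items[2] = (1.5, 0.8)  # same a, but the item is harder for the focal group

def expected_score(theta, items):
    """Expected observed total score at a given latent trait level."""
    return sum(p_2pl(theta, a, b) for a, b in items)

# Identical latent standing (theta = 0) in both groups, yet the expected
# observed scores differ purely because of the measurement difference —
# exactly the confound that ME/I analyses are designed to rule out.
theta = 0.0
ref_total = expected_score(theta, ref_items)
foc_total = expected_score(theta, foc_items)
print(round(ref_total, 3), round(foc_total, 3))
```

When the shifted item's parameters are restored to the reference values, the two expected scores coincide, which is the situation the article argues is often reasonable to predict in advance: the null (no DIF) holds and observed scores can be compared directly.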
Notes
A fully freed model was fit to ensure that item and person parameters were on the same scale.
Carter, N.T., Kotrba, L.M. & Lake, C.J. Null Results in Assessing Survey Score Comparability: Illustrating Measurement Invariance Using Item Response Theory. J Bus Psychol 29, 205–220 (2014). https://doi.org/10.1007/s10869-012-9283-4