Volume 82, Issue 1, pp 210–232

Monitoring Countries in a Changing World: A New Look at DIF in International Surveys

  • Robert J. Zwitser
  • S. Sjoerd F. Glaser
  • Gunter Maris


This paper discusses the issue of differential item functioning (DIF) in international surveys. DIF is likely to occur in international surveys. What is needed is a statistical approach that takes DIF into account, while at the same time allowing for meaningful comparisons between countries. Some existing approaches are discussed and an alternative is provided. The core of this alternative approach is to define the construct as a large set of items, and to report in terms of summary statistics. Since the data are incomplete, measurement models are used to complete the incomplete data. For that purpose, different models can be used across countries. The method is illustrated with PISA’s reading literacy data. The results indicate that this approach fits the data better than the current PISA methodology; however, the league tables are nearly identical. The implications for monitoring changes over time are discussed.
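The idea of completing incomplete data with a country-specific measurement model and then reporting a summary statistic over the full item set can be illustrated with a minimal sketch. It is not the paper's actual procedure: it assumes a Rasch model, treats each student's ability `theta` and each country's item difficulties as already known, and fills every unobserved response with its model-expected value. All names and numbers below are hypothetical toy values.

```python
import math

def rasch_prob(theta, b):
    """Rasch probability of a correct response for ability theta and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def completed_score(responses, theta, difficulties):
    """Score on the full item set: observed responses are kept as they are,
    missing ones (None) are replaced by their model-expected value."""
    total = 0.0
    for x, b in zip(responses, difficulties):
        total += x if x is not None else rasch_prob(theta, b)
    return total

def country_mean(students, difficulties):
    """Mean completed score over (responses, theta) pairs for one country."""
    scores = [completed_score(r, t, difficulties) for r, t in students]
    return sum(scores) / len(scores)

# Hypothetical toy data: three items; the third item is easier in country B (DIF),
# so each country gets its own difficulty parameters (assumed known here).
difficulties = {"A": [-1.0, 0.0, 1.0], "B": [-1.0, 0.0, 0.0]}
students = {
    "A": [([1, None, 0], 0.0), ([1, 1, None], 1.0)],
    "B": [([None, 1, 1], 0.0), ([0, None, None], 0.0)],
}

# The reported statistic is defined on the same item set for every country,
# even though the model used to complete the data differs per country.
summary = {c: country_mean(students[c], difficulties[c]) for c in students}
print(summary)
```

Because the summary is a statistic over the full item set rather than a model parameter, the countries stay comparable even when their completion models differ. In practice the abilities and difficulties would have to be estimated from the data, which this sketch leaves out.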


differential item functioning; DIF; ranking robustness; educational testing; programme for international student assessment; PISA; Rasch models; reading literacy



Copyright information

© The Psychometric Society 2016

Authors and Affiliations

  • Robert J. Zwitser (1)
  • S. Sjoerd F. Glaser (1)
  • Gunter Maris (1, 2)

  1. University of Amsterdam, Amsterdam, The Netherlands
  2. Cito Institute for Educational Measurement, Arnhem, The Netherlands
