Analyses of Model Fit and Robustness. A New Look at the PISA Scaling Model Underlying Ranking of Countries According to Reading Literacy

Abstract

This paper addresses methodological issues that concern the scaling model used in the international comparison of student attainment in the Programme for International Student Assessment (PISA), specifically with reference to whether PISA’s ranking of countries is confounded by model misfit and differential item functioning (DIF). To determine this, we reanalyzed the publicly accessible data on reading skills from the 2006 PISA survey. We also examined whether the ranking of countries is robust in relation to the errors of the scaling model. This was done by studying invariance across subscales, and by comparing ranks based on the scaling model and ranks based on models where some of the flaws of PISA’s scaling model are taken into account. Our analyses provide strong evidence of misfit of the PISA scaling model and very strong evidence of DIF. These findings do not support the claims that the country rankings reported by PISA are robust.

References

  • Adams, R.J. (2003). Response to ‘Cautions on OECD’s recent educational survey (PISA)’. Oxford Review of Education, 29, 379–389. Note: Publications from PISA can be found at http://www.oecd.org/pisa/pisaproducts/.

  • Adams, R., Berezner, A., & Jakubowski, M. (2010). Analysis of PISA 2006 preferred items ranking using the percent-correct method. Paris: OECD. http://www.oecd.org/pisa/pisaproducts/pisa2006/44919855.pdf.

  • Adams, R.J., Wilson, M., & Wang, W. (1997). The multidimensional random coefficients multinomial logit model. Applied Psychological Measurement, 21, 1–23.

  • Adams, R.J., Wu, M.L., & Carstensen, C.H. (2007). Application of multivariate Rasch models in international large-scale educational assessments. In M. Von Davier & C.H. Carstensen (Eds.), Multivariate and mixture distribution Rasch models (pp. 271–280). New York: Springer.

  • Andersen, E.B. (1973). A goodness of fit test for the Rasch model. Psychometrika, 38, 123–140.

  • Brown, G., Micklewright, J., Schnepf, S.V., & Waldmann, R. (2007). International surveys of educational achievement: how robust are the findings? Journal of the Royal Statistical Society. Series A. General, 170, 623–646.

  • Dorans, N.J., & Holland, P.W. (1993). DIF detection and description: Mantel-Haenszel and standardization. In P.W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 35–66). Hillsdale: Lawrence Erlbaum Associates.

  • Fischer, G.H. & Molenaar, I.W. (Eds.) (1995). Rasch models—foundations, recent developments, and applications. Berlin: Springer.

  • Glass, G.V., & Hopkins, K.D. (1995). Statistical methods in education and psychology. Boston: Allyn & Bacon.

  • Goldstein, H. (2004). International comparisons of student attainment: some issues arising from the PISA study. Assessment in Education, 11, 319–330.

  • Goodman, L.A., & Kruskal, W.H. (1954). Measures of association for cross classifications. Journal of the American Statistical Association, 49, 732–764.

  • Hopmann, S.T., Brinek, G., & Retzl, M. (Eds.) (2007). PISA zufolge PISA. PISA according to PISA. Wien: Lit Verlag. http://www.univie.ac.at/pisaaccordingtopisa/pisazufolgepisa.pdf.

  • Kelderman, H. (1984). Loglinear Rasch model tests. Psychometrika, 49, 223–245.

  • Kelderman, H. (1989). Item bias detection using loglinear IRT. Psychometrika, 54, 681–697.

  • Kirsch, I., de Jong, J., Lafontaine, D., McQueen, J., Mendelovits, J., & Monseur, C. (2002). Reading for change: performance and engagement across countries. Results from PISA 2000. Paris: OECD.

  • Kreiner, S. (1987). Analysis of multidimensional contingency tables by exact conditional tests: techniques and strategies. Scandinavian Journal of Theoretical Statistics, 14, 97–112.

  • Kreiner, S. (2011a). A note on item-restscore association in Rasch models. Applied Psychological Measurement, 35, 557–561.

  • Kreiner, S. (2011b). Is the foundation under PISA solid? A critical look at the scaling model underlying international comparisons of student attainment. Research report 11/1, Dept. of Biostatistics, University of Copenhagen. https://ifsv.sund.ku.dk/biostat/biostat_annualreport/images/c/ca/ResearchReport-2011-1.pdf.

  • Kreiner, S., & Christensen, K.B. (2007). Validity and objectivity in health-related scales: analysis by graphical loglinear Rasch models. In M. Von Davier & C.H. Carstensen (Eds.), Multivariate and mixture distribution Rasch models (pp. 271–280). New York: Springer.

  • Kreiner, S., & Christensen, K.B. (2011). Exact evaluation of bias in Rasch model residuals. Advances in Mathematics Research, 12, 19–40.

  • Molenaar, I.W. (1983). Some improved diagnostics for failure of the Rasch model. Psychometrika, 48, 49–72.

  • OECD (2000). Measuring student knowledge and skills: the PISA 2000 assessment of reading, mathematical and scientific literacy. Paris: OECD. http://www.oecd.org/dataoecd/44/63/33692793.pdf.

  • OECD (2006). PISA 2006. Technical report. Paris: OECD. http://www.oecd.org/dataoecd/0/47/42025182.pdf.

  • OECD (2007). PISA 2006. Volume 2: data. Paris: OECD.

  • OECD (2009). PISA data analysis manual: SPSS (2nd ed.). Paris: OECD. http://www.oecd-ilibrary.org/education/pisa-data-analysis-manual-spss-second-edition_9789264056275-en.

  • Prais, S.J. (2003). Cautions on OECD’s recent educational survey (PISA). Oxford Review of Education, 29, 139–163.

  • Rosenbaum, P. (1989). Criterion-related construct validity. Psychometrika, 54, 625–633.

  • Smith, R.M. (2004). Fit analysis in latent trait measurement models. In E.V. Smith & R.M. Smith (Eds.), Introduction to Rasch measurement (pp. 73–92). Maple Grove: JAM Press.

  • Schmitt, A.P., & Dorans, N.J. (1987). Differential item functioning on the scholastic aptitude test. Research memorandum No. 87-1. Princeton NJ: Educational Testing Service.

Author information

Corresponding author

Correspondence to Svend Kreiner.

Appendices

Appendix A. Information on Countries

Table A.1 provides information on (a) average scores and the number of students with complete responses to 20 items, (b) DIF equated scores with Azerbaijan as reference country (Kreiner and Christensen 2007), and (c) overall tests of fit of the Rasch model in 56 countries.

Table A.1. Average total and DIF equated scores on 20 items in 56 countries and conditional likelihood ratio (CLR) tests of the Rasch model comparing item parameters estimated for students with raw scores below and above the median raw score in the country.

Appendix B. Analyses of Data from All Booklets with Reading Items

In addition to Booklet 6 with 28 reading items, reading items can also be found in Booklets 2, 7, 9, 11, 12, and 13. Each of these booklets contained 14 items. Booklets 9, 11, and 13 had items from reading units R055, R104, and R111. We refer to these booklets, together with Booklet 6, which also had these reading units, as Booklet set 1. Booklet set 2 consists of Booklets 2, 6, 7, and 12 with reading units R067, R102, R219, and R220.

Table B.1 shows the overall CLR tests of the Rasch model for the two booklet sets as a whole and for the individual booklets. In addition to DIF relative to country, the CLR tests also provided evidence of DIF relative to booklets (Booklet set 1: CLR=1669.0, df=42, p<0.00005; Booklet set 2: CLR=3260.3, df=27, p<0.00005).
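
For orientation, the conditional likelihood ratio statistic of Andersen (1973) has the following standard form. The notation is introduced here for illustration and is not taken from the paper: \(g=1,\ldots,G\) indexes the groups being compared (score groups, countries, or booklets), \(\ell_{C}\) is the conditional log-likelihood of the item parameters \(\beta\) given the total scores, and \(\hat{\beta}^{(g)}\) is the conditional maximum likelihood estimate within group \(g\).

\[
\mathrm{CLR} = 2\left(\sum_{g=1}^{G} \ell_{C}^{(g)}\bigl(\hat{\beta}^{(g)}\bigr) - \ell_{C}\bigl(\hat{\beta}\bigr)\right)
\]

If the Rasch model holds with the same item parameters in all groups, CLR is asymptotically \(\chi^{2}\) distributed with degrees of freedom equal to the difference in the number of free item parameters between the two fits; for \(I\) dichotomous items all answered in every group this is \((G-1)(I-1)\). The degrees of freedom reported above depend on the item types and on which items appear in each booklet, so they need not match this simple count.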

Additional information on these analyses is available from the authors on request.

Table B.1. Overall fit statistics for Booklet sets 1 and 2 and for Booklets 2, 7, 9, 11, 12, and 13.

Appendix C. Assessment of Ranking Error

Let \(Y_{cvi}\) be the score on item \(i\) by person \(v\) from country \(c\) (\(c=1,\ldots,C\); \(v=1,\ldots,N_{c}\); \(i=1,\ldots,I\)) and let \(A\) be the indices of a subset of items, \(A\subset\{1,\ldots,I\}\). The total score on all items is \(S_{cv} = \sum_{i = 1}^{I} Y_{cvi}\) and the subscore over items in \(A\) is \(T_{cv} = \sum_{i \in A} Y_{cvi}\). This Appendix is concerned with errors when countries are ranked according to averages \(S_{c} = \frac{1}{N_{c}}\sum_{v} S_{cv}\) and \(T_{c} = \frac{1}{N_{c}}\sum_{v} T_{cv}\).

We assume that item responses fit a Rasch model with a latent variable \(\Theta\). The distribution of \(\Theta\) may be nonparametric or parametric. In the nonparametric case, the population parameters of interest are the score probabilities \(P(S_{cv}=s)\) and \(P(T_{cv}=t)\). The marginal distribution of \(T_{cv}\) is given by \(P(T_{cv}=t)=\sum_{s} P(T_{cv}=t \mid S_{cv}=s)\,P(S_{cv}=s)\). Under the Rasch model, \(P(T_{cv}=t \mid S_{cv}=s)\) depends on item parameters, but not on \(\Theta\). Under such a model, it is consequently easy to calculate estimates of the subscore probabilities \(P(T_{cv}=t)\) in the nonparametric case if consistent estimates of the item parameters and the score probabilities \(P(S_{cv}=s)\) are available.
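
The nonparametric calculation can be sketched as follows. This is a minimal illustration, not the authors' code, and it assumes dichotomous items with \(P(Y_{cvi}=1 \mid \Theta=\theta)=\exp(\theta-\beta_{i})/(1+\exp(\theta-\beta_{i}))\); polytomous items would need a generalized recursion. Under that assumption \(P(T_{cv}=t \mid S_{cv}=s)=\gamma_{t}(A)\,\gamma_{s-t}(A^{c})/\gamma_{s}\), where \(\gamma_{r}\) denotes the elementary symmetric function of order \(r\) of the values \(\exp(-\beta_{i})\). The function names and the example parameters are hypothetical.

```python
import numpy as np

def elementary_symmetric(eps):
    """Return gamma_0..gamma_n, the elementary symmetric functions of eps."""
    gamma = np.zeros(len(eps) + 1)
    gamma[0] = 1.0
    for k, e in enumerate(eps, start=1):
        # Add one item at a time, updating from high order to low order.
        for r in range(k, 0, -1):
            gamma[r] += e * gamma[r - 1]
    return gamma

def subscore_given_total(beta, subset):
    """Matrix with entry [s, t] = P(T = t | S = s) (dichotomous Rasch items)."""
    beta = np.asarray(beta, dtype=float)
    in_a = np.zeros(len(beta), dtype=bool)
    in_a[list(subset)] = True
    eps = np.exp(-beta)                     # assumed parametrization
    g_a = elementary_symmetric(eps[in_a])   # items in A
    g_b = elementary_symmetric(eps[~in_a])  # items outside A
    g_all = elementary_symmetric(eps)
    n_items, n_a = len(beta), int(in_a.sum())
    cond = np.zeros((n_items + 1, n_a + 1))
    for s in range(n_items + 1):
        # t is bounded by the number of items inside and outside A.
        for t in range(max(0, s - (n_items - n_a)), min(s, n_a) + 1):
            cond[s, t] = g_a[t] * g_b[s - t] / g_all[s]
    return cond

def subscore_marginal(beta, subset, p_total):
    """P(T = t) = sum_s P(T = t | S = s) P(S = s)."""
    return np.asarray(p_total, dtype=float) @ subscore_given_total(beta, subset)

# Hypothetical example: 4 items, subscale A = items 0 and 2.
p_t = subscore_marginal([-1.0, 0.0, 0.5, 1.2], [0, 2], [0.1, 0.2, 0.4, 0.2, 0.1])
```

In practice `beta` would be replaced by consistent item parameter estimates and `p_total` by the observed total score proportions in a country.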

In the parametric case, we assume that the latent variables are normally distributed with means \(\xi_{c}\) and standard deviations \(\sigma_{c}\). Given these distributions, Monte Carlo methods provide simple estimates of the distributions of \(S_{cv}\) and \(T_{cv}\) based on estimates of item parameters together with estimates of \(\xi_{c}\) and \(\sigma_{c}\).

The country ranks according to \((S_{1},\ldots,S_{C})\) and \((T_{1},\ldots,T_{C})\) are expected to be similar under the Rasch model, but ranking errors will occur depending on the number of items and on sample sizes in different countries: the smaller the sample size and the smaller the number of items, the larger the ranking error. And, of course, the ranking errors also depend on \(\xi_{c}\) and \(\sigma_{c}\). The results reported in this paper are derived under the parametric model with Monte Carlo estimates of the distributions of \(S_{cv}\) and \(T_{cv}\) based on Monte Carlo samples of 10,000 random students from each country.
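
The Monte Carlo step for a single country might look like the sketch below. It is an illustration under stated assumptions rather than the authors' implementation: items are treated as dichotomous Rasch items, and the function name, the item parameters, and the values of the mean and standard deviation are hypothetical.

```python
import numpy as np

def simulate_scores(beta, subset, xi, sigma, n_students=10_000, rng=None):
    """Monte Carlo estimates of P(S = s) and P(T = t) for one country."""
    rng = np.random.default_rng(rng)
    beta = np.asarray(beta, dtype=float)
    # Latent abilities drawn from the assumed normal distribution.
    theta = rng.normal(xi, sigma, size=n_students)
    # Rasch response probabilities, one row per student, one column per item.
    prob = 1.0 / (1.0 + np.exp(-(theta[:, None] - beta[None, :])))
    y = rng.random(prob.shape) < prob
    s = y.sum(axis=1)                    # total score on all items
    t = y[:, list(subset)].sum(axis=1)   # subscore on the items in A
    p_s = np.bincount(s, minlength=len(beta) + 1) / n_students
    p_t = np.bincount(t, minlength=len(subset) + 1) / n_students
    return p_s, p_t

# Hypothetical example: 4 items, subscale A = items 0 and 1.
p_s, p_t = simulate_scores([-1.0, 0.0, 1.0, 0.5], [0, 1], xi=0.0, sigma=1.0, rng=1)
```

The estimated score distributions can then be used to study how country averages, and hence ranks, vary from sample to sample.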

To estimate the distribution of the country ranks based on country averages \(S_{c}\) and/or \(T_{c}\), we generated 100,000 random values of \(S_{c}\) and/or \(T_{c}\) and for each set ranked the countries according to these values. Given these estimates, it is easy to find both confidence intervals and probabilities of extreme rankings for the countries.
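
One way to carry out the ranking step is sketched below. The paper does not say how the random country averages were generated, so this sketch makes an assumption: each country's students form a simple random sample, so the score counts are multinomial with the estimated score probabilities. The function names and the example inputs are hypothetical, and ties between country averages are broken arbitrarily.

```python
import numpy as np

def rank_distribution(score_probs, n_students, n_rep=100_000, rng=None):
    """Simulated ranks (1 = highest average), shape (n_rep, n_countries)."""
    rng = np.random.default_rng(rng)
    n_countries = len(score_probs)
    means = np.empty((n_rep, n_countries))
    for c, (p, n) in enumerate(zip(score_probs, n_students)):
        p = np.asarray(p, dtype=float)
        # Number of students at each score value in each replication.
        counts = rng.multinomial(n, p, size=n_rep)
        means[:, c] = counts @ np.arange(len(p)) / n
    order = np.argsort(-means, axis=1)   # countries sorted by average
    ranks = np.empty_like(order)
    rows = np.arange(n_rep)[:, None]
    ranks[rows, order] = np.arange(1, n_countries + 1)
    return ranks

def rank_interval(ranks, level=0.95):
    """Equal-tailed interval for the rank of each country."""
    alpha = (1.0 - level) / 2.0
    return (np.quantile(ranks, alpha, axis=0),
            np.quantile(ranks, 1.0 - alpha, axis=0))

# Hypothetical example: three countries, scores 0..2, 500 students each.
probs = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7], [0.3, 0.4, 0.3]]
ranks = rank_distribution(probs, [500, 500, 500], n_rep=2_000, rng=0)
lower, upper = rank_interval(ranks)
# Probability that the third country is ranked first.
p_top = np.mean(ranks[:, 2] == 1)
```

Probabilities of extreme rankings follow directly from the simulated ranks, as in the last line of the example.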

About this article

Cite this article

Kreiner, S., Christensen, K.B. Analyses of Model Fit and Robustness. A New Look at the PISA Scaling Model Underlying Ranking of Countries According to Reading Literacy. Psychometrika 79, 210–231 (2014). https://doi.org/10.1007/s11336-013-9347-z

  • DOI: https://doi.org/10.1007/s11336-013-9347-z

Key words

  • differential item functioning
  • ranking
  • robustness
  • educational testing
  • programme for international student assessment
  • PISA
  • Rasch models
  • reading literacy