Using MFRM and SEM in the Validation of Analytic Rating Scales of an English Speaking Assessment

  • Jinsong Fan
  • Trevor Bond
Conference paper


This study reports a preliminary investigation into the construct validity of an analytic rating scale developed for a school-based English speaking test. Informed by the theory of the interpretative validity argument, this study examined the plausibility and accuracy of three warrants which were deemed essential to the construct validity of the rating scale. Methodologically, this study used the Many-Facets Rasch Model (MFRM) and Structural Equation Modeling (SEM) in conjunction to examine the three warrants and their respective rebuttals. Though the MFRM analysis largely supported the first two warrants, the results indicated that the category structure of the rating scale did not function as intended and hence needed further revision. In the SEM analysis, a multitrait-multimethod (MTMM) confirmatory factor analysis (CFA) approach was employed, whereby four MTMM models were specified, evaluated, and compared. The results lent support to the third warrant but raised legitimate concerns over common method bias. The study has implications for future revisions of the rating scale and the speaking assessment in the interest of improved validity. It also has methodological implications for performance assessment constructors and rating scale validators.


English speaking assessment; Construct validity; Many-Facets Rasch Model; Structural equation modeling



The study reported in this chapter was supported by the National Social Sciences Fund of the People’s Republic of China under the project title of “Development and Validation of Standards in Language Testing” (Grant No: 13CYY032), and the Research Project of National Foreign Language Teaching in Higher Education under the project title of “Teacher-, Peer-, and Self-assessment in Translation Teaching: A Many-Facets Rasch Modeling Approach” (Grant No: 2014SH0008A). Part of this research was published in the third issue of Foreign Language Education in China (Quarterly) in 2015.



Copyright information

© Springer Science+Business Media Singapore 2016

Authors and Affiliations

  1. Fudan University, Shanghai, People’s Republic of China
  2. The University of Melbourne, Melbourne, Australia
  3. James Cook University, Townsville, Australia
