Abstract
Scoring rubrics are known to be effective for assessing writing for both testing and classroom teaching purposes. How raters interpret the descriptors in a rubric can substantially affect the final score, and the descriptors may also color a rater's judgment of a student's writing quality. Little is known, however, about how peer raters use a teacher-developed scoring rubric in English as a Foreign Language (EFL) writing contexts. In the current study, Many-Facet Rasch Measurement (MFRM) was applied to examine a scoring rubric for EFL writing and to analyze the severity and consistency of rating behaviors between teacher and peer raters. The findings revealed four key points: (1) the scoring rubric differentiated students' writing skills and measured the construct as expected; (2) the four rating criteria were reasonably designed, but the scoring bands were not wide enough to separate levels of students' writing ability; (3) teachers were stricter raters than student peers, but both showed a central tendency effect; and (4) teachers had outstanding intra-rater reliability, while peers showed higher inter-rater reliability. These findings provide some evidence for the reliability and internal validity of the scoring rubric and indicate that teachers could use it for assessing EFL writing; however, the scoring bands would need to be widened to apply to a broader range of students' writing levels. Implications for introducing scoring rubrics in peer-mediated assessment for teaching writing, developing scoring rubrics in EFL writing assessment, and using MFRM to evaluate scoring rubrics are discussed.
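For orientation, MFRM analyses of rater-mediated writing assessment typically estimate a three-facet Rasch model of the form below. This is the standard formulation associated with Linacre's Facets program; the notation here is illustrative and is not drawn from the article itself:

\[
\log\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k
\]

where \(P_{nijk}\) is the probability that examinee \(n\) receives score category \(k\) rather than \(k-1\) on rating criterion \(i\) from rater \(j\); \(B_n\) is the examinee's writing ability, \(D_i\) the difficulty of the criterion, \(C_j\) the severity of the rater, and \(F_k\) the threshold between adjacent score categories. Under this model, a more severe rater (larger \(C_j\)) systematically lowers the probability of awarding higher categories, which is how differences in teacher versus peer severity, and effects such as central tendency, can be quantified on a common logit scale.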

Data availability
The author confirms that the data supporting the findings of this study are available within the article. Raw data are available from the corresponding author upon reasonable request.
Acknowledgements
I would especially like to thank the handling editor, Dr. Peng Peng, whose valuable suggestions guided two rounds of revisions; I am indebted to him for his professionalism. I am also grateful to Dr. Ling Shi, Dr. Paul Stapleton, and four anonymous reviewers for their detailed and incisive comments on earlier drafts of the paper.
Appendix I
Scoring rubrics for teacher- and peer-mediated assessment

Cite this article
Li, W. Scoring rubric reliability and internal validity in rater-mediated EFL writing assessment: Insights from many-facet Rasch measurement. Read Writ 35, 2409–2431 (2022). https://doi.org/10.1007/s11145-022-10279-1