
Scoring rubric reliability and internal validity in rater-mediated EFL writing assessment: Insights from many-facet Rasch measurement

Reading and Writing

Abstract

Scoring rubrics are known to be effective for assessing writing for both testing and classroom teaching purposes. How raters interpret the descriptors in a rubric can significantly affect the final score, and the descriptors may also color a rater's judgment of a student's writing quality. Little is known, however, about how peer raters use a teacher-developed scoring rubric in English as a Foreign Language (EFL) writing contexts. In the current study, Many-Facet Rasch Measurement (MFRM) was applied to examine a scoring rubric for EFL writing and to compare the severity and consistency of teachers' and peer raters' rating behaviors. The findings revealed four key points: (1) the scoring rubric measured the construct as expected and differentiated students' writing skills; (2) the four rating criteria were reasonably designed, but the scoring bands were too narrow to separate students' writing ability; (3) teachers were stricter raters than student peers, but both showed a central tendency effect; and (4) teachers had outstanding intra-rater reliability, while peers showed higher inter-rater reliability. The findings provide some evidence for the reliability and internal validity of the scoring rubric and indicate that teachers could use it for assessing EFL writing; however, the scoring bands would need to be widened to cover a broader range of students' writing levels. Implications for introducing scoring rubrics in peer-mediated assessment for teaching writing, developing scoring rubrics in EFL writing assessment, and using MFRM to evaluate scoring rubrics are discussed.
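For readers unfamiliar with MFRM, the sketch below illustrates the rating-scale form of the model that underlies analyses of this kind: the probability of a script receiving a given rubric band depends on student ability, rater severity, criterion difficulty, and the Rasch-Andrich thresholds between adjacent bands. This is a minimal illustration with hypothetical parameter values and function names; the study's actual estimates would come from dedicated MFRM software such as FACETS, not from code like this.

    # Minimal sketch of the many-facet Rasch rating-scale model.
    # Model: log(P_x / P_{x-1}) = theta - severity - difficulty - tau_x
    # All parameter values below are hypothetical, for illustration only.
    import numpy as np

    def mfrm_category_probs(theta, severity, difficulty, thresholds):
        """Category probabilities for one student x rater x criterion.

        theta      : student writing ability (logits)
        severity   : rater severity (logits; higher = harsher)
        difficulty : rating-criterion difficulty (logits)
        thresholds : Rasch-Andrich thresholds tau_1..tau_{K-1} between
                     the K adjacent score bands of the rubric
        """
        eta = theta - severity - difficulty
        # Log-numerators are cumulative sums of (eta - tau_k);
        # the lowest band's log-numerator is 0 by convention.
        log_num = np.concatenate(
            ([0.0], np.cumsum(eta - np.asarray(thresholds))))
        probs = np.exp(log_num - log_num.max())  # stabilise before normalising
        return probs / probs.sum()

    # Example: a mid-ability student on a 5-band rubric criterion, rated by
    # a strict teacher (severity +0.5) versus a lenient peer (severity -0.5).
    thresholds = [-1.5, -0.5, 0.5, 1.5]  # hypothetical band thresholds
    for label, sev in [("teacher", 0.5), ("peer", -0.5)]:
        p = mfrm_category_probs(theta=0.0, severity=sev, difficulty=0.0,
                                thresholds=thresholds)
        expected = np.dot(np.arange(len(p)), p)
        print(f"{label}: expected band = {expected:.2f}, "
              f"probs = {np.round(p, 2)}")

With these illustrative values, the stricter rater's expected band is about 1.5 versus about 2.5 for the lenient rater on the same script, which is exactly the kind of severity gap that an MFRM analysis quantifies and adjusts for.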

Data availability

The author confirms that the data supporting the findings of this study are available within the article. Raw data are available from the corresponding author upon reasonable request.

Acknowledgements

I would especially like to thank the handling editor, Dr. Peng Peng, whose valuable suggestions guided two rounds of revision; I appreciate his professionalism. I am also grateful to Dr. Ling Shi, Dr. Paul Stapleton, and four anonymous reviewers for their detailed and incisive comments on earlier drafts of the paper.

Author information

Correspondence to Wentao Li.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix I

Scoring rubrics for teacher- and peer-mediated assessment

Cite this article

Li, W. Scoring rubric reliability and internal validity in rater-mediated EFL writing assessment: Insights from many-facet Rasch measurement. Read Writ 35, 2409–2431 (2022). https://doi.org/10.1007/s11145-022-10279-1
