This study adopts an argument-based approach to validating the use of an automated writing evaluation (AWE) system, taking Pigai, a Chinese AWE program, as an example in English as a Foreign Language (EFL) writing assessment in China. First, an interpretive argument was developed for its use in the College English course. Second, three sub-studies were conducted to gather evidence for claims concerning score evaluation, score generalization, score explanation, score extrapolation, and feedback utilization. The major findings are: (1) Pigai yields scores that accurately indicate the quality of a test performance sample; (2) its scores are consistent across tasks of the same form; (3) its scoring features represent the construct of interest to some extent, yet problems of construct under-representation and construct-irrelevant features remain; (4) its scores are consistent with teachers’ judgments of students’ writing ability; and (5) its feedback has a positive, though limited, impact on the development of students’ writing ability. These results suggest that AWE can serve only as a supplement to human evaluation and cannot replace it.