Abstract
Automated Essay Scoring (AES) can reliably grade essays at scale, reducing human effort in both classroom and commercial settings. Three supervised learning paradigms currently dominate AES model building: feature-based, neural, and hybrid. While feature-based models are more explainable, neural models often achieve higher prediction accuracy. To obtain models that are both accurate and explainable, hybrid approaches combining neural and feature-based models are of increasing interest. We compare these three types of AES models along a different evaluation dimension, namely algorithmic fairness. We apply three definitions of AES fairness to a corpus of upper elementary students’ essays, each scored by different types of AES systems with respect to the students’ use of text evidence. Our results indicate that different AES models exhibit different types of bias, spanning students’ gender, race, and socioeconomic status. We conclude with a step towards mitigating AES bias once detected.
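One common way to operationalize a fairness check of this kind is to compare machine-human score differences across demographic subgroups. The sketch below is purely illustrative and is not the paper's implementation: the subgroup data are hypothetical, and the per-group "overall score difference" shown is just one of several fairness definitions used in the AES literature.

```python
# Illustrative fairness check (hypothetical data, not the paper's code):
# compare the mean machine-human score difference across two subgroups.
from statistics import mean

def overall_score_difference(human, machine):
    """Mean signed difference between machine and human scores."""
    return mean(m - h for h, m in zip(human, machine))

# Hypothetical human and machine scores on a 1-4 rubric for two subgroups.
group_a_human = [3, 2, 4, 3]
group_a_machine = [3, 3, 4, 3]
group_b_human = [2, 3, 3, 4]
group_b_machine = [2, 2, 2, 3]

osd_a = overall_score_difference(group_a_human, group_a_machine)  # +0.25
osd_b = overall_score_difference(group_b_human, group_b_machine)  # -0.75

# A large gap between per-group differences suggests the scorer treats
# the groups differently (here, it over-scores A and under-scores B).
print(osd_a - osd_b)  # 1.0
```

In practice such gaps would be tested for statistical significance rather than eyeballed, and the comparison would be repeated for each protected attribute of interest.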
The research reported here was supported, in whole or in part, by the Institute of Education Sciences, U.S. Department of Education, through Grant R305A160245 to the University of Pittsburgh. The opinions expressed are those of the authors and do not represent the views of the Institute or the U.S. Department of Education.
Notes
- 1. Analysis, Evidence, Organization, Style, Mechanics/Usage/Grammar/Spelling.
- 2. Students in our sample also identified as Hispanic (22.0%), Native American (11.5%), Asian (4.3%), Hawaiian (2.0%) and White (12.1%). These categories are not mutually exclusive. We focus on African American students in our study as this was the only subgroup that was large enough (had sufficient data) for our analyses.
- 3.
- 4.
- 5. We select one subset of features (from the set computed by the code release) that works for general AES purposes. Specifically, we introduce data from more prompts, including a second RTA prompt and eight prompts from the ASAP dataset (https://www.kaggle.com/c/asap-aes/). Then, we train models with only one combined hand-crafted feature for each prompt. Finally, we select features that significantly improve the base neural model on the development set for at least 6 (out of 10) prompts. The intuition is that we want to select multiple features and combine each into the best level of the model hierarchy to create a version of \(AES_{hybrid}\) that is robust, while still preserving a reasonable number of features for our experiment.
- 6.
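The feature-selection rule in note 5 can be sketched as follows. This is a hypothetical illustration, not the paper's code: the feature names and development-set scores are made up, and a fixed improvement margin stands in for the significance test the note describes.

```python
# Hypothetical sketch of the rule in note 5: keep a hand-crafted feature only
# if a model trained with it beats the base neural model on the development
# set for at least 6 of the 10 prompts. All scores below are made up.

BASE_DEV_SCORES = {f"prompt{i}": 0.70 for i in range(1, 11)}  # base neural model

def select_features(feature_dev_scores, base_scores, min_wins=6, margin=0.01):
    """Keep features that beat the base model by `margin` on >= min_wins prompts.

    `margin` is a placeholder for a real significance test.
    """
    selected = []
    for feature, per_prompt in feature_dev_scores.items():
        wins = sum(per_prompt[p] > base_scores[p] + margin for p in base_scores)
        if wins >= min_wins:
            selected.append(feature)
    return selected

# Made-up dev scores: "length" helps on 8 prompts, "spelling" on only 3.
feature_scores = {
    "length":   {f"prompt{i}": (0.75 if i <= 8 else 0.70) for i in range(1, 11)},
    "spelling": {f"prompt{i}": (0.75 if i <= 3 else 0.70) for i in range(1, 11)},
}
print(select_features(feature_scores, BASE_DEV_SCORES))  # ['length']
```

Requiring wins on a majority of prompts (rather than on average) favors features that help robustly across prompts, matching the note's stated goal of a robust \(AES_{hybrid}\) with a small feature set.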
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Litman, D., Zhang, H., Correnti, R., Matsumura, L.C., Wang, E. (2021). A Fairness Evaluation of Automated Methods for Scoring Text Evidence Usage in Writing. In: Roll, I., McNamara, D., Sosnovsky, S., Luckin, R., Dimitrova, V. (eds) Artificial Intelligence in Education. AIED 2021. Lecture Notes in Computer Science(), vol 12748. Springer, Cham. https://doi.org/10.1007/978-3-030-78292-4_21
Print ISBN: 978-3-030-78291-7
Online ISBN: 978-3-030-78292-4
eBook Packages: Computer Science (R0)