
A Fairness Evaluation of Automated Methods for Scoring Text Evidence Usage in Writing

Conference paper

Artificial Intelligence in Education (AIED 2021)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12748)

Abstract

Automated Essay Scoring (AES) can reliably grade essays at scale and reduce human effort in both classroom and commercial settings. There are currently three dominant supervised learning paradigms for building AES models: feature-based, neural, and hybrid. While feature-based models are more explainable, neural network models often outperform feature-based models in terms of prediction accuracy. To create models that are accurate and explainable, hybrid approaches combining neural network and feature-based models are of increasing interest. We compare these three types of AES models with respect to a different evaluation dimension, namely algorithmic fairness. We apply three definitions of AES fairness to an essay corpus scored by different types of AES systems with respect to upper elementary students’ use of text evidence. Our results indicate that different AES models exhibit different types of biases, spanning students’ gender, race, and socioeconomic status. We conclude with a step towards mitigating AES bias once detected.

The research reported here was supported, in whole or in part, by the Institute of Education Sciences, U.S. Department of Education, through Grant R305A160245 to the University of Pittsburgh. The opinions expressed are those of the authors and do not represent the views of the Institute or the U.S. Department of Education.


Notes

  1. Analysis, Evidence, Organization, Style, Mechanics/Usage/Grammar/Spelling.

  2. Students in our sample also identified as Hispanic (22.0%), Native American (11.5%), Asian (4.3%), Hawaiian (2.0%), and White (12.1%). These categories are not mutually exclusive. We focus on African American students in our study as this was the only subgroup that was large enough (had sufficient data) for our analyses.

  3. https://github.com/Rokeer/co-attention.

  4. https://github.com/Rokeer/hybrid.

  5. We select one subset of features (from the set computed by the code release) that works for general AES purposes. Specifically, we introduce data from more prompts, including a second RTA prompt and eight prompts from the ASAP dataset (https://www.kaggle.com/c/asap-aes/). Then, we train models with only one combined hand-crafted feature for each prompt. Finally, we select features that significantly improve the base neural model on the development set for at least 6 (out of 10) prompts. The intuition is that we want to select multiple features and combine each into the best level of the model hierarchy, so that the resulting version of \(AES_{hybrid}\) is robust while still preserving a reasonable number of features for our experiment (an illustrative sketch of this screening rule appears after these notes).

  6. Compared to the broader fairness literature, Loukina et al. [19] state that OSA is similar in spirit to predictive accuracy [31], OSD to standardized mean difference [34] and treatment equality [3], and CSD to conditional procedure equality [3] and differential feature functioning [39] (an illustrative sketch of these measures appears after these notes).
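
The two sketches below are illustrative additions, not the authors' released code, and all function, variable, and column names in them are hypothetical.

The first sketch corresponds to the feature-screening rule in note 5: a hand-crafted feature is kept only if adding it to the base neural model improves the development-set result on at least 6 of the 10 prompts. The significance test mentioned in note 5 is reduced here to a raw comparison of development-set scores (QWK is used purely as an example metric).

```python
from typing import Dict, List


def select_robust_features(
    dev_qwk_base: Dict[str, float],                     # prompt -> dev QWK of the base neural model
    dev_qwk_with_feature: Dict[str, Dict[str, float]],  # feature -> prompt -> dev QWK with that feature added
    min_prompts: int = 6,
) -> List[str]:
    """Keep a feature only if it beats the base model on at least min_prompts prompts."""
    selected = []
    for feature, per_prompt in dev_qwk_with_feature.items():
        improved = sum(
            1 for prompt, qwk in per_prompt.items()
            if qwk > dev_qwk_base.get(prompt, float("-inf"))
        )
        if improved >= min_prompts:
            selected.append(feature)
    return selected


if __name__ == "__main__":
    # Toy numbers for two prompts only, purely to show the call shape.
    base = {"prompt_1": 0.60, "prompt_2": 0.55}
    with_feature = {
        "readability": {"prompt_1": 0.63, "prompt_2": 0.58},
        "essay_length": {"prompt_1": 0.59, "prompt_2": 0.61},
    }
    print(select_robust_features(base, with_feature, min_prompts=2))  # ['readability']
```

The second sketch corresponds to note 6, assuming (following the descriptions in Loukina et al. [19]) that OSA is measured as machine-human agreement within each subgroup, OSD as a standardized difference between mean machine and human scores within each subgroup, and CSD as the machine-human difference computed separately at each human score level within each subgroup. The statistics actually reported in the paper may differ.

```python
import numpy as np
import pandas as pd


def fairness_report(df: pd.DataFrame, group_col: str = "group") -> dict:
    """df needs columns 'human' (gold score), 'machine' (AES score), and group_col."""
    report = {}
    pooled_sd = df["human"].std(ddof=1)  # common scale for the standardized difference
    for group, sub in df.groupby(group_col):
        osa = np.corrcoef(sub["human"], sub["machine"])[0, 1]     # agreement (Pearson r here)
        osd = (sub["machine"] - sub["human"]).mean() / pooled_sd  # standardized over/under-scoring
        csd = (sub["machine"] - sub["human"]).groupby(sub["human"]).mean().to_dict()  # per human score level
        report[group] = {"OSA": osa, "OSD": osd, "CSD": csd}
    return report


if __name__ == "__main__":
    # Tiny fabricated scores, only to show the expected input shape.
    toy = pd.DataFrame({
        "human":   [1, 2, 3, 4, 1, 2, 3, 4],
        "machine": [1, 2, 3, 3, 2, 2, 4, 4],
        "group":   ["A", "A", "A", "A", "B", "B", "B", "B"],
    })
    print(fairness_report(toy))
```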

References

  1. Amorim, E., Cançado, M., Veloso, A.: Automated essay scoring in the presence of biased ratings. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, (Long Papers), vol. 1, pp. 229–237 (2018)


  2. Attali, Y., Burstein, J.: Automated essay scoring with e-rater® v. 2. J. Technol. Learn. Assess. 4(3) (2006)


  3. Berk, R., Heidari, H., Jabbari, S., Kearns, M., Roth, A.: Fairness in criminal justice risk assessments: the state of the art. Sociol. Methods Res. 0049124118782533 (2018)


4. Bridgeman, B.: Human ratings and automated essay evaluation. In: Handbook of Automated Essay Evaluation: Current Applications and New Directions, p. 221 (2013)


  5. Chen, H., He, B.: Automated essay scoring by maximizing human-machine agreement. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1741–1752 (2013)


  6. Correnti, R., Matsumura, L.C., Hamilton, L., Wang, E.: Assessing students’ skills at writing analytically in response to texts. Elem. Sch. J. 114(2), 142–177 (2013)


  7. Correnti, R., Matsumura, L.C., Wang, E., Litman, D., Rahimi, Z., Kisa, Z.: Automated scoring of students’ use of text evidence in writing. Read. Res. Q. 55(3), 493–520 (2020)


  8. Dasgupta, T., Naskar, A., Dey, L., Saha, R.: Augmenting textual qualitative features in deep convolution recurrent neural network for automatic essay scoring. In: Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications, pp. 93–102 (2018)


  9. Dong, F., Zhang, Y., Yang, J.: Attention-based recurrent convolutional neural network for automatic essay scoring. In: Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pp. 153–162 (2017)


  10. Ghosh, D., Khanam, A., Han, Y., Muresan, S.: Coarse-grained argumentation features for scoring persuasive essays. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Short Papers), vol. 2, pp. 549–554 (2016)


  11. Jin, C., He, B., Hui, K., Sun, L.: TDNN: a two-stage deep neural network for prompt-independent automated essay scoring. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Long Papers), vol. 1, pp. 1088–1097 (2018)


  12. Johnson, D., VanBrackle, L.: Linguistic discrimination in writing assessment: how raters react to African American “errors,” ESL errors, and standard English errors on a state-mandated writing exam. Assess. Writ. 17(1), 35–54 (2012)


  13. Ke, Z., Ng, V.: Automated essay scoring: a survey of the state of the art. In: Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 6300–6308. AAAI Press (2019)


14. Kincaid, J.P., Fishburne Jr., R.P., Rogers, R.L., Chissom, B.S.: Derivation of new readability formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy enlisted personnel (1975)


  15. Kizilcec, R.F., Lee, H.: Algorithmic fairness in education. In: Holmes, W., Porayska-Pomsta, K. (eds.) Ethics in Artificial Intelligence in Education. Taylor and Francis (forthcoming)


16. Kung, C., Yu, R.: Interpretable models do not compromise accuracy or fairness in predicting college success. In: Proceedings of the Seventh ACM Conference on Learning @ Scale, pp. 413–416 (2020)


  17. Liu, J., Xu, Y., Zhao, L.: Automated essay scoring based on two-stage learning. arXiv preprint arXiv:1901.07744 (2019)

  18. Loukina, A., Evanini, K., Mulholland, M., Blood, I., Zechner, K.: Do face masks introduce bias in speech technologies? The case of automated scoring of speaking proficiency. In: Meng, H., Xu, B., Zheng, T.F. (eds.) Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Virtual Event, Shanghai, China, 25–29 October 2020, pp. 1942–1946. ISCA (2020)


  19. Loukina, A., Madnani, N., Zechner, K.: The many dimensions of algorithmic fairness in educational applications. In: Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 1–10. Association for Computational Linguistics, Florence (2019)


  20. Malouff, J.M., Thorsteinsson, E.B.: Bias in grading: a meta-analysis of experimental research findings. Aust. J. Educ. 60(3), 245–256 (2016)


  21. Matsumura, L.C., Correnti, R., Wang, E.: Classroom writing tasks and students’ analytic text-based writing. Read. Res. Q. 50(4), 417–438 (2015)


  22. Mayfield, E., Black, A.W.: Should you fine-tune BERT for automated essay scoring? In: Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 151–162 (2020)


  23. Nadeem, F., Nguyen, H., Liu, Y., Ostendorf, M.: Automated essay scoring with discourse-aware neural models. In: Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 484–493 (2019)


  24. Nguyen, H.V., Litman, D.J.: Argument mining for improving the automated scoring of persuasive essays. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)


  25. Östling, R., Smolentzov, A., Hinnerich, B.T., Höglin, E.: Automated essay scoring for Swedish. In: Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 42–47 (2013)


  26. Persing, I., Ng, V.: Modeling argument strength in student essays. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Long Papers), vol. 1, pp. 543–552 (2015)


  27. Phandi, P., Chai, K.M.A., Ng, H.T.: Flexible domain adaptation for automated essay scoring using correlated linear regression. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 431–439 (2015)


  28. Pitler, E., Nenkova, A.: Using syntax to disambiguate explicit discourse connectives in text. In: Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pp. 13–16 (2009)


  29. Rahimi, Z., Litman, D., Correnti, R., Wang, E., Matsumura, L.C.: Assessing students’ use of evidence and organization in response-to-text writing: using natural language processing for rubric-based automated scoring. Int. J. Artif. Intell. Educ. 27(4), 694–728 (2017). https://doi.org/10.1007/s40593-017-0143-2


  30. Rahimi, Z., Litman, D.J., Correnti, R., Matsumura, L.C., Wang, E., Kisa, Z.: Automatic scoring of an analytical response-to-text assessment. In: Trausan-Matu, S., Boyer, K.E., Crosby, M., Panourgia, K. (eds.) ITS 2014. LNCS, vol. 8474, pp. 601–610. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-07221-0_76


  31. Ramineni, C., Williamson, D.M.: Automated essay scoring: psychometric guidelines and practices. Assess. Writ. 18(1), 25–39 (2013)


  32. Tay, Y., Phan, M.C., Tuan, L.A., Hui, S.C.: SkipFlow: incorporating neural coherence features for end-to-end automatic text scoring. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)


  33. Uto, M., Xie, Y., Ueno, M.: Neural automated essay scoring incorporating handcrafted features. In: Proceedings of the 28th International Conference on Computational Linguistics, pp. 6077–6088 (2020)


  34. Williamson, D.M., Xi, X., Breyer, F.J.: A framework for evaluation and use of automated scoring. Educ. Meas. Issues Pract. 31(1), 2–13 (2012)


  35. Zesch, T., Wojatzki, M., Scholten-Akoun, D.: Task-independent features for automated essay grading. In: Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 224–232 (2015)


  36. Zhang, H., Litman, D.: Word embedding for response-to-text assessment of evidence. In: Proceedings of ACL 2017, Student Research Workshop, pp. 75–81 (2017)


  37. Zhang, H., Litman, D.: Co-attention based neural network for source-dependent essay scoring. In: Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 399–409 (2018)


  38. Zhang, H., et al.: eRevise: using natural language processing to provide formative feedback on text evidence usage in student writing. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 9619–9625 (2019)


  39. Zhang, M., Dorans, N., Li, C., Rupp, A.: Differential feature functioning in automated essay scoring. In: Test Fairness in the New Generation of Large-Scale Assessment, pp. 185–208 (2017)



Author information

Corresponding author

Correspondence to Diane Litman.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Litman, D., Zhang, H., Correnti, R., Matsumura, L.C., Wang, E. (2021). A Fairness Evaluation of Automated Methods for Scoring Text Evidence Usage in Writing. In: Roll, I., McNamara, D., Sosnovsky, S., Luckin, R., Dimitrova, V. (eds.) Artificial Intelligence in Education. AIED 2021. Lecture Notes in Computer Science, vol. 12748. Springer, Cham. https://doi.org/10.1007/978-3-030-78292-4_21


  • DOI: https://doi.org/10.1007/978-3-030-78292-4_21

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-78291-7

  • Online ISBN: 978-3-030-78292-4

