Abstract
Large language models (LLMs) have been a catalyst for the increased use of AI for automatic item generation on high-stakes assessments. Standard human review processes applied to human-generated content are also important for AI-generated content, because AI-generated content can reflect human biases. However, human reviewers have implicit biases and gaps in cultural knowledge, which may emerge when the test-taking population is diverse. Quantitative analyses of item responses via differential item functioning (DIF) can help to identify these unknown biases. In this paper, we present DIF results based on item responses from a high-stakes English language assessment, the Duolingo English Test (DET). We find that human- and AI-generated content, both of which were reviewed for fairness and bias by humans, show similar amounts of DIF overall but varying amounts across certain test-taker groups. This finding suggests that humans are unable to identify all biases beforehand, regardless of how item content is generated. To mitigate this problem, we recommend that assessment developers employ human reviewers who represent the diversity of the test-taking population. This practice may lead to more equitable use of AI in high-stakes educational assessment.
Notes
1. Test-taker operating system is a proxy for socioeconomic status (SES), with Mac correlating to higher SES.
2. This definition describes “intercept DIF” but not “slope DIF”. We focus on evaluating intercept DIF here to simplify the presentation of results. See Millsap [10] for a more general definition of DIF.
3. A score of .5 means that half of the damaged words in the passage were completed correctly.
References
1. Gierl, M.J., Haladyna, T.M. (eds.): Automatic Item Generation: Theory and Practice. Routledge, Abingdon (2012)
2. Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., Galstyan, A.: A survey on bias and fairness in machine learning. ACM Comput. Surv. (CSUR) 54(6), 1–35 (2021)
3. Zieky, M.J.: Fairness in test design and development. In: Dorans, N.J., Cook, L.L. (eds.) Fairness in Educational Assessment and Measurement, pp. 9–31. Routledge (2016)
4. Zieky, M.J.: Developing fair tests. In: Downing, S.M., Haladyna, T.M. (eds.) Handbook of Test Development, pp. 97–115. Routledge (2015)
5. Sherman, J.E.: Multiple levels of cultural bias in TESOL course books. RELC J. 41(3), 267–281 (2010)
6. Brown, T., et al.: Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901 (2020)
7. Cardwell, R., Naismith, B., LaFlair, G.T., Nydick, S.W.: Duolingo English Test: Technical Manual [Duolingo Research Report]. Duolingo (2023). https://go.duolingo.com/dettechnicalmanual
8. Council of Europe: Common European Framework of Reference for Languages: Learning, Teaching, Assessment. Press Syndicate of the University of Cambridge, UK (2001)
9. Attali, Y., et al.: The interactive reading task: transformer-based automatic item generation. Front. Artif. Intell. 5, 903077 (2022)
10. Millsap, R.E.: Statistical Approaches to Measurement Invariance. Routledge, Abingdon (2011)
11. Swaminathan, H., Rogers, H.J.: Detecting differential item functioning using logistic regression procedures. J. Educ. Meas. 27(4), 361–370 (1990)
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Belzak, W.C.M., Naismith, B., Burstein, J. (2023). Ensuring Fairness of Human- and AI-Generated Test Items. In: Wang, N., Rebolledo-Mendez, G., Dimitrova, V., Matsuda, N., Santos, O.C. (eds) Artificial Intelligence in Education. Posters and Late Breaking Results, Workshops and Tutorials, Industry and Innovation Tracks, Practitioners, Doctoral Consortium and Blue Sky. AIED 2023. Communications in Computer and Information Science, vol 1831. Springer, Cham. https://doi.org/10.1007/978-3-031-36336-8_108
DOI: https://doi.org/10.1007/978-3-031-36336-8_108
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-36335-1
Online ISBN: 978-3-031-36336-8