Abstract
This paper explores the application of deep learning to automating the scoring of open-ended candidate responses in pre-hire employment selection assessments. Using job-applicant text data from pre-employment virtual assessment center exercises, three algorithmic approaches were compared: a traditional bag-of-words (BoW) model, long short-term memory (LSTM) models, and a robustly optimized bidirectional encoder representations from transformers approach (RoBERTa). Measurement and assessment best practices guided the development of the candidate assessment items and the human labels (subject matter experts' (SME) ratings on job-relevant competencies), producing a rich data set for training the algorithms. The trained models were used to score candidates' textual responses on the given competencies, and their agreement with expert human raters was assessed. Using data from three companies hiring for three different occupations, and across seven competencies, the three approaches were evaluated on holdout samples: correlations between SME-scored and algorithmically scored competencies were very strong for the best-performing method, RoBERTa (avg r = 0.84), and nearly identical to the inter-rater reliability achieved by multiple expert raters following consensus (avg r = 0.85). Criterion-related validity, subgroup differences, and decision accuracy are examined for each algorithmic approach. Lastly, the impact of smaller training sample sizes on algorithm performance is explored.
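The evaluation workflow described above can be sketched in miniature: score each textual response with a model, then compute the Pearson correlation between model scores and SME consensus ratings on a holdout sample. The sketch below is illustrative only, since the paper's data are proprietary; the keyword "model" (`bow_score`, `VOCAB`) and the holdout examples are hypothetical stand-ins for a trained BoW scorer, not the authors' implementation.

```python
import math
from collections import Counter

# Hypothetical keyword list standing in for learned bag-of-words weights.
VOCAB = ["consulted", "reviewed", "gathered", "structured"]

def bow_score(text):
    """Count competency-relevant words — a toy stand-in for a trained BoW model."""
    counts = Counter(text.lower().split())
    return sum(counts[w] for w in VOCAB)

def pearson(xs, ys):
    """Pearson correlation between model scores and human ratings."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Synthetic holdout sample: (candidate response, SME consensus rating, 1-5 scale).
holdout = [
    ("i consulted peers and reviewed prior reports", 4.5),
    ("i gathered data and structured the work plan", 4.0),
    ("i reviewed one memo", 2.5),
    ("i acted alone and decided immediately", 1.5),
]

preds = [bow_score(text) for text, _ in holdout]
human = [rating for _, rating in holdout]
print(round(pearson(preds, human), 2))  # → 0.98
```

The same evaluation applies unchanged to any scorer (LSTM, RoBERTa): only `bow_score` would be replaced by the model's prediction function.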
References
Aghajanyan, A., Shrivastava, A., Gupta, A., Goyal, N., Zettlemoyer, L., & Gupta, S. (2020). Better fine-tuning by reducing representational collapse. https://arxiv.org/abs/2008.03156
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. American Educational Research Association.
Benaich, N., & Hogarth, I. (2020). State of AI report. https://www.stateof.ai/
Binning, J. F., & Barrett, G. V. (1989). Validity of personnel decisions: A conceptual analysis of the inferential and evidential bases. Journal of Applied Psychology, 74(3), 478. https://doi.org/10.1037/0021-9010.74
Booth, B. M., Hickman, L., Subburaj, S. K., Tay, L., Woo, S. E., & D'Mello, S. K. (2021, October). Bias and fairness in multimodal machine learning: A case study of automated video interviews. In Proceedings of the 2021 International Conference on Multimodal Interaction (pp. 268–277).
Campion, E. D., & Campion, M. A. (2020). Using computer-assisted text analysis (CATA) to inform employment decisions: Approaches, software, and findings. In Research in Personnel and Human Resources Management (Vol. 38, pp. 285–325). Emerald Publishing Limited. https://doi.org/10.1108/S0742-730120200000038010
Campion, M. C., Campion, M. A., Campion, E. D., & Reider, M. H. (2016). Initial investigation into computer scoring of candidate essays for personnel selection. Journal of Applied Psychology, 101, 958–975. https://doi.org/10.1037/apl0000108
Chollet, F. (2015). Keras. https://github.com/fchollet/keras
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 4171–4186). Minneapolis.
Douglas, E. F., McDaniel, M. A., & Snell, A. F. (1996). The validity of non-cognitive measures decays when applicants fake. Proceedings of the Academy of Management, 127–131.
Dudley, N. M., & Cortina, J. M. (2008). Knowledge and skills that facilitate the personal support dimension of citizenship. Journal of Applied Psychology, 93(6), 1249–1270. https://doi.org/10.1037/a0012572
Edwards, B. D., Day, E. A., Arthur Jr., W., & Bell, S. T. (2006). Relationships among team ability composition, team mental models, and team performance. Journal of Applied Psychology, 91(3), 727–736. https://doi.org/10.1037/0021-9010.91.3.727
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. Cambridge: The MIT Press.
Gray, S., Radford, A., & Kingma, D. P. (2017). GPU kernels for block-sparse weights. https://cdn.openai.com/blocksparse/blocksparsepaper.pdf
Hausknecht, J. P., Day, D. V., & Thomas, S. C. (2004). Applicant reactions to selection procedures: An updated model and meta-analysis. Personnel Psychology, 57(3), 639–683.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
LawGeex. (2019). AI vs. lawyers: The ultimate showdown. https://blog.lawgeex.com/resources/whitepapers/aivslawyer
Lievens, F., Sackett, P. R., Dahlke, J. A., Oostrom, J. K., & De Soete, B. (2019). Constructed response formats and their effects on minority–majority differences and validity. Journal of Applied Psychology, 104(5), 715.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv. https://arxiv.org/pdf/1907.11692.pdf
McKinney, S. M., Sieniek, M., Godbole, V., Godwin, J., Antropova, N., Ashrafian, H., Back, T., Chesus, M., Corrado, G. C., Darzi, A., Etemadi, M., Garcia-Vicente, F., Gilbert, F. J., Halling-Brown, M., Hassabis, D., Jansen, S., Karthikesalingam, A., Kelly, C. J., King, D., et al. (2020). International evaluation of an AI system for breast cancer screening. Nature, 577(7788), 89–94. https://doi.org/10.1038/s41586-019-1799-6
Mikolov, T., Chen, K., Corrado, G. S., & Dean, J. (2013). Efficient estimation of word representations in vector space. International Conference on Learning Representations. https://arxiv.org/pdf/1301.3781
Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., & Gao, J. (2022). Deep learning–based text classification: A comprehensive review. ACM Computing Surveys, 54(3), 1–40. https://doi.org/10.1145/3439726
de Oliveira, J. M., Zylka, M. P., Gloor, P. A., & Joshi, T. (2019). Mirror, mirror on the wall, who is leaving of them all: Predictions for employee turnover with gated recurrent neural networks. In Y. Song, F. Grippa, P. Gloor, & J. Leitão (Eds.), Collaborative innovation networks. Studies on entrepreneurship, structural change and industrial dynamics (pp. 43–59). Springer. https://doi.org/10.1007/978-3-030-17238-1_2
Pan, S. J., & Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10), 1345–1359. https://doi.org/10.1109/TKDE.2009.191
Pandey, S., & Pandey, S. K. (2019). Applying natural language processing capabilities in computerized textual analysis to measure organizational culture. Organizational Research Methods, 22(3), 765–797. https://doi.org/10.1177/1094428117745648
Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532–1543). https://doi.org/10.3115/v1/D14-1162
Phandi, P., Chai, K. M. A., & Ng, H. T. (2015). Flexible domain adaptation for automated essay scoring using correlated linear regression. In Proceedings of the 2015 conference on empirical methods in natural language processing (pp. 431–439). https://doi.org/10.18653/v1/D15-1049
Putka, D. J., Oswald, F. L., Landers, R. N., Beatty, A. S., Rodney, M. A., & Yu, M. C. (2022). Evaluating a natural language processing approach to estimating KSA and interest job analysis ratings. Journal of Business and Psychology. https://doi.org/10.1007/s10869-022-09824-0
Ramineni, C., & Williamson, D. M. (2013). Automated essay scoring: Psychometric guidelines and practices. Assessing Writing, 18(1), 25–39. https://doi.org/10.1016/j.asw.2012.10.004
Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). “Why should I trust you?” Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining (pp. 1135–1144).
Roch, S. G., Woehr, D. J., Mishra, V., & Kieszczynska, U. (2012). Rater training revisited: An updated meta-analytic review of frame-of-reference training. Journal of Occupational and Organizational Psychology, 85, 370–395.
Ruder, S. (2017). An overview of multi-task learning in deep neural networks. https://arxiv.org/abs/1706.05098
Rupp, D. E., Hoffman, B. J., Bischof, D., Byham, W., Collins, L., Gibbons, A., & Jackson, D. J. (2015). Guidelines and ethical considerations for assessment center operations. Journal of Management, 41(4), 1244–1273. https://doi.org/10.1177/0149206314567780
Sackett, P. R., Zhang, C., Berry, C. M., & Lievens, F. (2022). Revisiting meta-analytic estimates of validity in personnel selection: Addressing systematic overcorrection for restriction of range. Journal of Applied Psychology, 107, 2040–2068. https://doi.org/10.1037/apl0000994
Sajjadiani, S., Sojourner, A. J., Kammeyer-Mueller, J. D., & Mykerezi, E. (2019). Using machine learning to translate applicant work history into predictors of performance and turnover. Journal of Applied Psychology, 104(10), 1207–1225. https://doi.org/10.1037/apl0000405
Salgado, J. F., & Moscoso, S. (2019). Meta-analysis of interrater reliability of supervisory performance ratings: Effects of the appraisal purpose, range restriction, and scale type. Frontiers in Psychology, 10, 2281. https://doi.org/10.3389/fpsyg.2019.02281
Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124(2), 262.
Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A. Y., & Potts, C. (2013). Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing (pp. 1631–1642).
Speer, A. (2020). Scoring dimension-level job performance from narrative comments: Validity and generalizability when using natural language processing. Organizational Research Methods, 24(3), 572–594. https://doi.org/10.1177/1094428120930815
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15, 1929–1958.
Staudemeyer, R. C., & Morris, E. R. (2019). Understanding LSTM: A tutorial into long short-term memory recurrent neural networks. https://arxiv.org/abs/1909.09586
Sujan, H., Sujan, M., & Bettman, J. R. (1988). Knowledge structure differences between more effective and less effective salespeople. Journal of Marketing Research, 25(1), 81–86. https://doi.org/10.1177/002224378802500108
Taghipour, K., & Ng, H. T. (2016). A neural approach to automated essay scoring. In Proceedings of the 2016 conference on empirical methods in natural language processing (pp. 1882–1891). https://doi.org/10.18653/v1/d16-1193
Uniform guidelines on employee selection procedures. (1978). 43 Fed. Reg., 38290-38315.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. In Proceedings of the 2017 International Conference on Neural Information Processing Systems (pp. 6000–6010).
Wang, S., & Manning, C. D. (2012). Baselines and bigrams: Simple, good sentiment and topic classification. In 50th annual meeting of the association for computational linguistics: Short papers (Vol. 2, pp. 90–94).
Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., & Bowman, S. R. (2018). GLUE: A multi-task benchmark and analysis platform for natural language understanding. https://arxiv.org/abs/1804.07461
Williamson, D. M., Bennett, R. E., Lazer, S., Bernstein, J., Foltz, P. W., Landauer, T. K., & Sweeney, K. (2010). Automated scoring for the assessment of common core standards. White Paper. https://www.ets.org/research/policy_research_reports/publications/paper/2010/izph.html.
Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., & Fidler, S. (2015). Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the 2015 IEEE international conference on computer vision (ICCV) (pp. 19–27). https://doi.org/10.1109/ICCV.2015.11
Žliobaitė, I., Pechenizkiy, M., & Gama, J. (2016). An overview of concept drift applications. In N. Japkowicz & J. Stefanowski (Eds.), Big data analysis: New algorithms for a new society. Studies in big data (Vol. 16, pp. 91–114). Springer. https://doi.org/10.1007/978-3-319-26989-4_4
Ethics declarations
Competing Interests
The authors declare no competing interests.
Appendix
Data Transparency Appendix
Portions of this paper were presented at the 2020 Annual Conference of the Society for Industrial and Organizational Psychology. Data for this study are not available because they are proprietary. Three of the variables examined in the present article have some overlap with another article currently under review (see Data Transparency Table below). For Conducting Research, the same labels and responses were used in both manuscripts. Developing Networks and Leveraging Networks are described as partially overlapping because, although the same responses were leveraged, a different competency-based scoring procedure was used. In summary, of the 7905 scores used to train the various algorithms in the current manuscript, 958 were scored in an identical manner in the manuscript currently under review.
| Variable | MS (status = under review) |
|---|---|
| Conducting Research | Overlaps |
| Developing Networks | Partial overlap |
| Leveraging Networks | Partial overlap |
| Improving Services | No overlap |
| Reviews and Reflects | No overlap |
| Seeks Understanding | No overlap |
| Structures the Work | No overlap |
About this article
Cite this article
Thompson, I., Koenig, N., Mracek, D.L. et al. Deep Learning in Employee Selection: Evaluation of Algorithms to Automate the Scoring of Open-Ended Assessments. J Bus Psychol 38, 509–527 (2023). https://doi.org/10.1007/s10869-023-09874-y