In this section we describe the results of this systematic review, starting with the themes identified through the search. We conclude by applying the HIRE framework to each paper, stating clearly whether AI is deemed better than, worse than, or equal to humans in the paper’s specific context. To ease summarisation, we group the findings of each paper into themes based on their outcomes.
A total of 22 quantitative empirical studies met our inclusion criteria (see Fig. 1). The four themes that emerged from the paper groupings are efficiency, performance, diversity, and perceptions. Within the perceptions theme, sub-themes emerged consisting of ethical perceptions, organisational perceptions, perceptions of use, emotional perceptions, and additional perceptions. These themes accounted for all papers, with some papers falling into multiple themes. We evaluate the findings of each theme in line with the HIRE framework.
For the efficiency theme, we look at how AI can be used to simulate human hiring decisions. We define efficiency as the ability to reach an outcome using the least amount of resources possible; for this topic, the relevant resources are the time and cost of hiring. If AI makes decisions that are reported to be lower in time and/or cost, it is deemed better. If AI can simulate human hiring decisions, it is considered equal to human hiring. If AI is inaccurate in simulating human hiring, it is worse. We note that this is a cautious approach to drawing conclusions on the efficiency of AI, as an AI that simulates human hiring decisions will in theory be more efficient, given the rapid nature of algorithmic decision-making compared to human decision-making.
For the performance section, we evaluate the performance outcomes of candidates hired through AI or human methods. If AI results in better performance outcomes than human hiring, it is deemed as better. If the performance outcomes of candidates hired through AI or human methods are statistically indistinguishable, it is equal. If performance outcomes are worse in AI hiring, it is evaluated as worse.
In the diversity theme, we look at how diversity outcomes differ between AI and human hiring. AI is considered to be better if this type of hiring results in a more diverse group of hired employees. AI is considered equal to human hiring if diversity outcomes are statistically indistinguishable. Lastly, AI is considered worse if it promotes selection of less diverse hires.
For the perception sub-themes, we evaluate how people view AI hiring. AI is considered better if people have more positive perceptions of AI than of human hiring. If perceptions towards AI and human hiring are statistically indistinguishable, it is considered to be equal. Finally, if perceptions towards AI hiring are more negative, it is considered worse.
In all of the above themes, findings can also be considered unclear. This can occur when AI is not compared directly to a human hiring method, or the statistical findings are inconclusive or unable to be replicated.
Efficiency of AI hiring
The potential for AI to automate the hiring process and produce cost savings is contingent on the ability of AI to replicate human decisions in hiring. Four papers assess the efficiency of algorithmic hiring in terms of whether AI can simulate human hiring decisions (Naim et al. 2018; Stein 2018; Bergman et al. 2020; Horton 2017). An overview of the studies is shown in Table 2.
Naim et al. (2018) find that trained algorithms can predict human hiring decisions to a large extent. The authors developed and assessed machine learning algorithms that use verbal and nonverbal behaviours in job interviews to predict human interview scores and human-rated interview-specific traits. They trained support vector regression (SVR) and least absolute shrinkage and selection operator (LASSO) regression algorithms on prosodic, lexical, and facial features extracted from audio-visual recordings of university students interviewing for internships. An additional group of participants rated the interview videos in terms of interview performance and additional traits on a 7-point Likert scale. Prediction accuracy was measured by correlating the human ratings with the predicted ratings from the algorithm and by estimating the AUC (area under the curve), which assesses how well the model separates true positives from true negatives, as a proxy for how correct predictions are. The results indicated that the algorithms could predict overall interview score (r = 0.62, AUC > 0.76) and whether the applicant would be recommended for hiring (r ranged between 0.64–0.65, AUC > 0.78). Additionally, the model was able to predict some interview-specific traits, such as excitement (r ranged between 0.75–0.79, AUC > 0.88), engagement (r ranged between 0.74–0.75, AUC > 0.84), and friendliness (r ranged between 0.70–0.73, AUC > 0.80). Both the correlation and AUC values from these predictions are large in effect size. However, the algorithms did not perform as well in predicting other traits such as calmness (r ranged between 0.30–0.38, AUC > 0.60), level of stress (r = 0.26, AUC > 0.57), and eye contact (r ranged between 0.26–0.33, AUC > 0.62), which indicate small-to-medium effect sizes.
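For readers unfamiliar with the AUC metric used here, it can be computed as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case. A minimal sketch, with invented interview scores:

```python
from itertools import product

def auc(scores_pos, scores_neg):
    """Probability that a randomly chosen positive example outscores
    a randomly chosen negative example (ties count as half a win)."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(scores_pos, scores_neg)
    )
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical predicted interview scores, split by whether the human
# raters recommended the candidate for hiring.
recommended = [0.9, 0.8, 0.7, 0.6]
not_recommended = [0.65, 0.5, 0.4, 0.3]
print(auc(recommended, not_recommended))  # 0.9375
```

An AUC of 0.5 would indicate chance-level separation, while 1.0 would mean every recommended candidate outscored every non-recommended one.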
Stein (2018) shows that algorithms modelling cultural compatibility can make predictions which are strongly related to hiring outcomes, implying that they are good substitutes for humans. The author assesses whether the linguistic similarity of a candidate to the hired employees in a firm can predict the job applicant’s chance of being hired, and whether this similarity can be algorithmically modelled. Linguistic similarity is a measure of cultural compatibility and is measured through the frequencies of words people use. Among applicants and hired employees at a mid-sized technology firm, data from job application questions was used to model linguistic similarity through logistic regression. The author transformed job application responses into term frequency–inverse document frequency (TF-IDF) statistics and then measured the similarity of words between applicants and employees. They found that, depending on the control measures included, a one standard deviation increase in linguistic similarity increases a candidate’s odds of being hired by between 33–53% (odds ratios of 1.331–1.529, p < 0.05).
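Stein’s exact modelling pipeline is not reproduced in this review, but the general idea of TF-IDF-based linguistic similarity can be sketched as follows; the documents and the choice of averaging over incumbents are illustrative assumptions, not the paper’s specification:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Weight each term's raw count by log inverse document frequency."""
    tokenised = [doc.lower().split() for doc in docs]
    df = Counter(t for toks in tokenised for t in set(toks))
    n = len(docs)
    idf = {t: math.log(n / df[t]) for t in df}
    return [{t: c * idf[t] for t, c in Counter(toks).items()}
            for toks in tokenised]

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = lambda vec: math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return dot / (norm(u) * norm(v))

# Illustrative application responses: two incumbent employees, one applicant.
docs = ["we ship features fast and iterate",
        "move fast and break things",
        "I like to move fast"]
vecs = tfidf_vectors(docs)
# Applicant's average similarity to incumbents, a crude proxy for the
# paper's linguistic similarity measure.
similarity = (cosine(vecs[2], vecs[0]) + cosine(vecs[2], vecs[1])) / 2
```

Terms shared by everyone (here, "fast") receive zero IDF weight, so the measure rewards distinctive shared vocabulary rather than common words.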
Bergman et al. (2020) find that algorithms can modestly predict hiring outcomes, and that the data-driven method outperforms human recruiter predictions. Specifically, they train supervised learning (SL) and upper confidence bound (UCB) algorithms on applicants’ demographics, education, and work history to predict whether they will be hired by a human after an interview. The data come from a Fortune 500 company which had been interviewing candidates based on recruiters’ recommendations. In this company, 10% of interviewed applicants are hired; thus, the algorithmic approach was studied to see if it could increase this percentage and improve efficiency by interviewing fewer candidates. The algorithms produced a score for the likelihood of an applicant being hired which, when correlated with actual hiring outcomes, yielded small, positive, and significant correlations (r ranged between 0.214–0.233, all p < 0.001). Additionally, using algorithmic recommendations for the sample of candidates interviewed increased the hiring rate to 30%. This means that the company could use these algorithms over recruiter recommendations to find a suitable hire with the added bonus of conducting 20% fewer interviews, resulting in time savings.
Finally, Horton (2017) assessed how the introduction of algorithmic candidate recommendations on an online labour market impacts hiring outcomes. The algorithmic recommendations were computed by measuring a candidate’s relevance and ability. Relevance was calculated as the degree of overlap between the skills listed on a candidate’s page and the skills required for the job. Ability included the worker’s test score, feedback ratings, and past earnings. The actual algorithm was a “black box”, so the exact computational process is unknown. Employers were randomly assigned to a treatment group, where they were given algorithmic candidate recommendations for their job postings, or a control group, where they were not given any recommendations. Among the employers who were given algorithmic recommendations, the hiring rate increased by 20%. However, this pattern only emerged for technical jobs; algorithmic recommendations did not increase non-technical job fill rates. The author argues that this is likely because technical job openings typically have fewer applicants and are less price-sensitive than non-technical job openings. The algorithm helped technical jobs attract higher numbers of suitable applicants, promoting hiring for those roles. The algorithm also screens for past earnings as an aspect of candidate ability, so algorithmically recommended candidates are typically higher cost, and because technical jobs are less price-sensitive, employers posting them were more likely to hire these more expensive algorithmically recommended candidates. Thus, the use of AI hiring in this situation helped to increase efficiency for technical jobs by increasing the fill rate of positions.
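Since the platform’s relevance formula is a black box, it can only be illustrated by assumption. A minimal sketch, assuming a Jaccard-style overlap index:

```python
def relevance(candidate_skills, job_skills):
    """Share of skills in common relative to all skills mentioned
    (Jaccard index; an assumption, as the real algorithm is unknown)."""
    c, j = set(candidate_skills), set(job_skills)
    return len(c & j) / len(c | j) if c | j else 0.0

# Two of four distinct skills overlap, giving a relevance of 0.5.
score = relevance({"python", "sql", "statistics"}, {"python", "sql", "excel"})
print(score)  # 0.5
```

Any monotone overlap measure would serve the same role; the point is simply that listed and required skills are compared set-wise.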
The results of these four studies indicate that hiring algorithms can be designed to produce outputs which are equal to or better than human hiring. The ability of the algorithms to predict hiring outcomes varied in size, but even at the smallest effect size, in Bergman et al. (2020), AI still outperformed human prediction methods. From these results, it would seem that AI can be used to model or improve human hiring outcomes, thus producing time and cost savings in the recruitment process.
Performance outcomes of AI hiring
In order to consider if AI can make better hiring decisions in terms of the employee’s performance outcomes, two studies assess novel algorithmic methods in predicting workplace talent from an applicant pool (Sajjadiani et al. 2019; Bergman et al. 2020). Refer to Table 3 for an overview of these studies.
Sajjadiani et al. (2019) finds that algorithms can maximise selecting hires on the basis of performance outcomes. They develop a machine-learning hiring technique based on prior data from teaching position applications in a school district. The machine learning method utilised a naïve Bayes classifier to predict work outcomes from data on the applicant’s demographics, previous work experience, tenure history, and turnover reasons.
Work outcomes for the hired applicants included turnover and the performance variables of student evaluations, expert observations, and value-added as measured by students’ scores on standardized tests. The model was trained on 90% of the data and evaluated on the remaining 10%. The model was evaluated by examining how well the input variables predicted work outcomes, and Heckman regression indicated a number of significant coefficients, showing that the input variables could predict work outcomes. The model was compared to human selection methods by developing a list of recommended hires based on predicted performance and turnover outcomes and comparing these to the candidates who were actually hired. Depending on the outcome the model was maximized to predict, results showed that the overlap between algorithm-recommended hires and actual hires was between 11–29%. Since the algorithm maximized predicted performance outcomes, this suggests that the algorithmic hiring method differs from human hiring in that it may result in hires with higher performance.
Bergman et al. (2020) find that the computed candidate score from hiring algorithms is modestly and positively related to promotion and performance outcomes, and that it is an improvement over human methods. They correlated measures of job performance ratings and promotions with the algorithmic and human recommendation scores. Performance was rated on a 3-point scale, and promotions followed a binary yes/no outcome. They found that human recruiter recommendations are significantly negatively correlated with job performance ratings (r ranged between − 0.288 to − 0.309, p < 0.01), and that algorithm recommendations are significantly positively correlated with whether a new hire receives a promotion (r = 0.132, p < 0.01). This means that candidates selected by human recruiters have worse job performance and those selected by algorithms are more likely to be promoted, with these effects being small to medium in size. All other correlations were nonsignificant, with very small effect sizes; however, human recommendations were slightly negatively correlated with promotions and algorithm recommendations were slightly positively correlated with performance ratings.
Taken together, these two studies indicate that AI hiring methods are better than human ones at selecting applicants who will perform better on the job. As shown in Bergman et al. (2020), the ability of AI to identify candidates with positive job performance is limited, but it is still an improvement over human methods.
Diversity outcomes of AI hiring
Due to the potential of AI to reduce human cognitive biases in decision-making, it has been examined as a tool to promote diversity and inclusion. Seven studies explore how inclusive algorithmic hiring is and advance novel approaches to improving inclusivity metrics (Chen et al., 2018; Lambrecht & Tucker, 2019; Allred, 2019; Song, 2018; Sajjadiani et al. 2019; Bergman et al. 2020; Suhr et al. 2020). See Table 4 for an overview of the studies.
Sajjadiani et al. (2019) find that AI hiring can remove adverse impact for gender and ethnicity. In a Bayesian machine learning approach, they assessed the level of adverse impact in selection decisions. Adverse impact is evaluated by computing whether sensitive variables such as gender and ethnicity predict hiring outcomes. While the previous human hiring method showed that gender and ethnicity were in fact predictive of hiring (\(\beta\) female = 0.06, p < 0.05; \(\beta\) white = 0.11, p < 0.01), the machine learning method yielded non-significant predictions with coefficients close to zero, implying that no group would be disproportionately hired with the algorithmic method.
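Adverse impact checks of this kind compare selection outcomes across groups. The widely used "four-fifths rule" below is one simple operationalisation; note that the reviewed study instead tested regression coefficients, and the counts here are invented for illustration:

```python
def selection_rate_ratio(hired_focal, applicants_focal,
                         hired_ref, applicants_ref):
    """Ratio of the focal group's selection rate to the reference
    group's. Under the conventional four-fifths rule, a ratio below
    0.8 flags potential adverse impact."""
    focal_rate = hired_focal / applicants_focal
    ref_rate = hired_ref / applicants_ref
    return focal_rate / ref_rate

# Invented counts: 15% vs 20% selection rates give a ratio of 0.75.
ratio = selection_rate_ratio(hired_focal=30, applicants_focal=200,
                             hired_ref=50, applicants_ref=250)
flagged = ratio < 0.8
print(ratio)  # 0.75
```

A regression-based check, as in Sajjadiani et al. (2019), asks the related question of whether group membership carries any predictive weight for the hiring outcome once other inputs are accounted for.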
On the other hand, two studies (Chen et al., 2018; Suhr et al. 2020) look at how search engine algorithms can cause discrimination in candidate ranking. Chen et al. (2018) find that female candidates are ranked slightly lower than male candidates on search engines. Data were collected from three resume search engines for job titles in 20 U.S. cities. The types of algorithms used in the search engines are unspecified, so the authors could only assess the outcomes. On such websites, a higher ranking indicates a better candidate, and recruiters will look at highly ranked applicants first. Although sensitive features were removed from profiles for the rankings, Mixed Linear Model regressions showed that female candidates were ranked statistically lower than males (\(\beta\) = − 0.042, − 0.028, & − 0.071; p < 0.05). The effect was small, however, so in the top 10 rankings the gender difference was proportionally negligible. The authors also assessed how this effect shows up in specific job contexts. Mann–Whitney U tests revealed that 8.5–13.2% of job titles have significant group gender unfairness, in all cases with men being ranked higher than women (U ranged from 0.01–0.59, p < 0.05). Since the search engine websites use black box algorithms, the reason for this discrimination can only be speculated upon. The authors hypothesize a hidden feature in the algorithm that is correlated with gender, or that rankings are adjusted based on recruiter clicks and recruiters are biased towards favouring male candidates.
Suhr et al. (2020) find that gender discrimination in candidate selection can be reduced by using alternative ranking algorithms. They recruited online participants for a simulated job selection process and had them review ranked candidates and select four, in order of preference, to be recommended to a company. Participants were randomly assigned to see candidates grouped based on three ranking algorithms and three datasets. In the first ranking algorithm, candidates were ranked by their relevance score from a search engine website; in the second, candidates were ranked in a random order; and in the third, candidates were ranked by a fairness ranking algorithm (LinkedIn’s DetGreedy) which ensures proportional representation of underrepresented groups. They found that the fair ranking algorithm increased selection of female candidates, in comparison to the search engine’s relevance ranking, by 2.5–17.35 percentage points depending on the job context and the ranked position out of four, with higher gains when females were selected as the first recommendation. The random ranking algorithm also increased selection of female candidates, but in all cases by less than the fair ranking algorithm. This shows that the way individuals are ranked on hiring search engines can have diversity consequences in terms of actual hiring outcomes.
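The mechanics of a proportional-representation ranking can be shown in simplified form. The sketch below is not LinkedIn’s DetGreedy implementation, only a greedy re-ranker in the same spirit, with invented candidates and a 50/50 target:

```python
def fair_rerank(candidates, targets, k):
    """Greedy fairness ranking: at each position, if any group has
    fewer selections than its proportional minimum so far, serve that
    group first; otherwise take the highest-scoring candidate overall.
    candidates: list of (score, group); targets: {group: desired share}."""
    pools = {g: sorted([c for c in candidates if c[1] == g], reverse=True)
             for g in targets}
    ranked = []
    counts = {g: 0 for g in targets}
    for position in range(1, k + 1):
        behind = [g for g in targets
                  if pools[g] and counts[g] < int(targets[g] * position)]
        eligible = behind or [g for g in targets if pools[g]]
        g = max(eligible, key=lambda grp: pools[grp][0][0])
        ranked.append(pools[g].pop(0))
        counts[g] += 1
    return ranked

cands = [(0.9, "m"), (0.8, "m"), (0.7, "m"), (0.6, "f"), (0.5, "f")]
top4 = fair_rerank(cands, {"m": 0.5, "f": 0.5}, k=4)
# Pure score ordering would place only one "f" candidate in the top
# four; the re-ranker places two.
```

Because recruiters attend mainly to the top of the list, even a small re-ranking of this kind can shift who is actually selected, which is the mechanism Suhr et al. (2020) exploit.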
Lambrecht and Tucker (2019) tested an algorithm for showing job adverts for STEM roles which was designed to be gender neutral; however, they find that the advert was displayed to more men than women. The job advert was programmed to be gender neutral by setting the ad targeting settings to both genders. Despite the good intentions behind the algorithm, the job advert was displayed to 20% more men than women (R2 = 0.49, p < 0.001). Since an additional analysis showed the ad appealed similarly to both men and women, the authors traced the effect to economic factors: the price premium that an advertiser must pay to show ads is higher for women than for men, because women have a higher click-to-profit ratio, meaning they are more likely to purchase an item after clicking on it. This study emphasises that discriminatory effects can occur even when AI is designed to be fair, and underlines the importance of checking external factors which may impact the outcome of AI. It is important to note, however, that this algorithmic discrimination was not compared to human methods, so it is unclear if the gender discrimination outcomes are better or worse than the status quo.
The last three studies propose novel algorithmic methods to improve inclusion in hiring (Bergman et al. 2020; Allred, 2019; Song, 2018). Bergman et al. (2020) find that depending on the type of algorithm, hiring outcomes for underrepresented individuals can either be greatly improved or moderately worsened. In the static supervised learning (SL) algorithms, the model is trained on a dataset to predict a candidate’s likelihood of being hired. In the upper confidence bound (UCB) model, the algorithm values a reduction in uncertainty, and exploration bonuses are given for selecting candidates whose predictions have higher standard errors due to fewer observed outcomes. This means that the model favours selecting candidates with less outcome data, in order to build up the model to make better predictions for all individuals. Diversity outcomes of the models and the baseline human recruiter method were assessed. In the sample of all applicants, the majority racial groups are White and Asian (79%) and the minority groups are Black and Hispanic (21%). In the human recruiter selection, minority applicants represent 9% of selected candidates. For the SL models, minority representation decreases to only 3% of chosen applicants. However, for the UCB model, the proportion of minority groups increases to 24% of selected candidates. In a further extension study of the UCB model, where the model is blinded to race, minority groups drop to being selected 14% of the time. This means that, depending on the type of AI and what data are inputted, it can be much better or slightly worse than humans at selecting underrepresented groups for hire.
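The intuition behind the UCB exploration bonus can be shown with a small sketch. The functional form and the constant below are illustrative assumptions, not the paper’s specification:

```python
import math

def ucb_score(predicted, n_observed, n_total, c=1.0):
    """Predicted hiring likelihood plus an exploration bonus that
    grows when a candidate's profile has few observed outcomes in
    the data (form and constant c are illustrative assumptions)."""
    bonus = c * math.sqrt(math.log(n_total) / max(n_observed, 1))
    return predicted + bonus

# A candidate from a sparsely observed group can outrank one from a
# densely observed group despite a lower point estimate.
sparse = ucb_score(predicted=0.30, n_observed=20, n_total=10_000)
dense = ucb_score(predicted=0.40, n_observed=5_000, n_total=10_000)
```

This is why the UCB model selects more minority candidates: groups with little historical data receive larger bonuses, whereas a static SL model simply reproduces the sparse (and often pessimistic) point estimates.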
Allred (2019) finds that group differences in a cognitive test used for hiring can be substantially reduced through an algorithmic method. The author responds to the problem of racioethnic group differences in general cognitive ability (GCA) assessments, meaning that test scores differ based on one’s racial group. Despite this, GCA is utilized in many organisations because it is highly predictive of job outcomes, but this can result in lower selection rates for certain racial groups. To promote fairer selection, Allred (2019) proposes a metaheuristic algorithm to lower group differences in test scores. Specifically, they use an ant colony optimization algorithm to differentially weight items on the test so that racial group differences are minimized while the validity of the measure is maximized. The author used simulated data from archives and meta-analytic estimates of mean group differences in GCA. Compared to prior approaches to GCA interpretation, the metaheuristic algorithm produced smaller group differences in GCA scores. The effect sizes varied based on the level of job complexity, with group differences lowering from large (d = 0.840–0.845) to small (d = 0.349) in low job complexity, and from medium (d = 0.602–0.730) to small (d = 0.442–0.475) in medium and high job complexity. However, the diversity improvements came at the cost of a reduction in cross-validation validity, which was also dependent on job complexity. In low job complexity validity was 52% lower, in medium job complexity it was 30–31% lower, and in high job complexity it was 25% lower. Thus, at low job complexity, where the greatest diversity improvements occur, the greatest reduction in validity also occurs. This is known as the diversity-validity dilemma, which Song (2018) sought to correct.
Specifically, Song (2018) identifies shrinkage as the validity issue in pareto-optimal methods for personnel selection. Shrinkage occurs when a model is overfitted, so the variance explained in a new sample will be smaller than in the original training data. This means that if organisations employ pareto-optimal methods in their personnel selection process, the diversity and validity outcomes may not be as optimal as expected, because the model may lose its ability to predict who the best employees to hire are. This study examined how much shrinkage occurs with pareto-optimal methods in personnel selection by assessing cross-validity through a Monte Carlo simulation with the factors of sample size and job predictors. Job predictors included cognitive ability tests, biodata, conscientiousness, structured interviews, and integrity test results. Calibration and validation samples were generated for conditions which varied on sample size and job predictors; pareto-optimal weights for validity and diversity were then calculated for the calibration and validation samples and plotted against each other to observe the amount of shrinkage. The results showed that the validation curve fell beneath the calibration curve, meaning there was substantial shrinkage in validity and diversity. However, even with the shrinkage, diversity was still improved over the status-quo unit-weighted method when the sample size was at least 100. In an attempt to correct the shrinkage, the author developed an algorithm to achieve regularization, where optimization occurs for both local predictions and future generalizations. This is done in four steps: the end-points of the pareto-optimal curve where diversity or validity is maximized are detected; a phi trade-off matrix is built to estimate optimal predictor weights; linear interpolation creates evenly spaced solutions for pareto-optimal points; and finally a sequential least squares programming algorithm is used to find optimal weights for regularization.
Through observing a smaller gap in the pareto-optimal curves between calibration and validation samples, they found that diversity shrinkage was smaller for this solution as compared to original pareto-optimal results, showing that validity of such AI can be improved.
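Shrinkage of this kind can be demonstrated in miniature: below, predictor weights are tuned on a small calibration sample and then re-evaluated on a large validation sample. This is a toy simulation under invented data-generating assumptions, not Song’s pareto-optimal procedure:

```python
import random

random.seed(1)

def sample(n):
    """Two predictors and an outcome driven mainly by the first,
    plus noise (an invented data-generating process)."""
    rows, ys = [], []
    for _ in range(n):
        x1, x2 = random.gauss(0, 1), random.gauss(0, 1)
        rows.append((x1, x2))
        ys.append(0.5 * x1 + 0.1 * x2 + random.gauss(0, 1))
    return rows, ys

def validity(w, rows, ys):
    """Correlation between the weighted predictor composite and the outcome."""
    comp = [w[0] * x1 + w[1] * x2 for x1, x2 in rows]
    n = len(ys)
    mc, my = sum(comp) / n, sum(ys) / n
    sc = sum((c - mc) ** 2 for c in comp) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return sum((c - mc) * (y - my) for c, y in zip(comp, ys)) / (sc * sy)

cal_rows, cal_ys = sample(40)     # small calibration sample
val_rows, val_ys = sample(5000)   # large validation sample

# Grid search picks the weights that look best in calibration;
# their apparent validity typically shrinks out of sample.
grid = [(i / 20, 1 - i / 20) for i in range(21)]
best = max(grid, key=lambda w: validity(w, cal_rows, cal_ys))
shrinkage = validity(best, cal_rows, cal_ys) - validity(best, val_rows, val_ys)
```

The gap captured in `shrinkage` is the same phenomenon Song (2018) observes between the calibration and validation pareto curves, only in one dimension instead of a diversity-validity trade-off.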
Among the papers in this section which compare AI to human hiring diversity outcomes, each shows that AI is better at producing more diverse hiring outcomes. The exception is Bergman et al. (2020), where the AI diversity outcome can be worse than the human one when a static supervised learning algorithm is used, or better when a UCB algorithm is used. This points to the importance of using the correct type of AI for maximizing diversity. Two papers in this section do not compare AI to human methods (Chen et al., 2018; Lambrecht & Tucker, 2019) and show that AI recruitment methods can result in poor diversity outcomes. However, because these papers did not compare against status-quo human methods, it is unclear whether these findings would be better or worse than the diversity outcomes of human hiring.
Perceptions of AI hiring
Twelve of the papers included in this systematic review pertain to perceptions surrounding the use of AI in hiring (Suen et al. 2019; Newman et al. 2020; Lee 2018; Langer et al. 2018, 2019; Kaibel et al. 2019; Kodapanakkal et al. 2020; Oberst et al. 2020; Acikgoz et al. 2020; Noble et al. 2021; Warrenbrand 2021; Bigman 2020). These empirical papers cover a wide range of types of perceptions and will be discussed within the categories of ethical perceptions, organisational perceptions, perceptions of use, emotional perceptions, and additional perceptions.
There are ten papers which study ethical perceptions regarding the use of AI in the hiring process (Suen et al. 2019; Newman et al. 2020; Lee 2018; Langer et al. 2018; Kodapanakkal et al. 2020; Langer et al. 2019; Acikgoz et al. 2020; Noble et al. 2021; Warrenbrand 2021; Bigman 2020). The majority of experiments on this topic assess the perceived fairness of hiring using AI technologies. See Table 5 below for an overview of the studies.
Suen et al. (2019) and Newman et al. (2020) experimentally simulate a job hiring scenario to see how fair applicants perceive the use of AI in personnel selection to be. Suen et al. (2019) find that fairness ratings do not differ by whether an interview is conducted with a human or AI. They conducted structured job interviews in which participants were assigned to one of three conditions: synchronous video interviews (SVI), asynchronous video interviews (AVI), or asynchronous video interviews using an AI decision maker (AI-AVI). In the SVIs, communication is two-way, meaning the applicants interact with an interviewer over a video call. In contrast, the AVI is a one-way interview in which job applicants record interview answers to be evaluated at another time. All applicants completed a questionnaire following the interview in which they rated the perceived fairness of the interview process, answering questions adopted from Guchait et al. (2014) on a 5-point Likert scale; results indicated no significant differences in ratings between the three conditions (ηp2 = 0.004, p = 0.482). While this study does not find evidence that perceived fairness in a hiring context is significantly altered by an algorithmic versus human decision maker, Newman et al.’s (2020) study introduces an additional factor into this relationship and finds that fairness does moderately relate to the type of interview.
In Newman et al.’s (2020) experiment, undergraduate students participated in an asynchronous video interview where they were told, prior to the interview, that either a human or an algorithm would analyse their responses. The authors also manipulated transparency by randomly allocating participants to low or high transparency conditions. Participants in the high transparency condition were shown an additional paragraph about how the human or AI evaluates the interviews. Fairness measurement was adopted from Conlon et al.’s (2004) organisational justice scale, with questions answered on a 7-point Likert scale. The transparency factor was found to be significant in predicting fairness scores. When participants were given additional information on the interview process, algorithms were rated as less fair than humans to a medium extent (d = 0.5, p = 0.011). However, when participants were not given additional information about how the human or AI evaluates interviews, this finding was reversed and algorithms received higher fairness ratings than humans.
The findings of these studies could potentially be reconciled if Suen et al. (2019) had used a medium level of transparency, the point at which the fairness effect crosses over from low to high transparency and is null. However, inspection of that paper’s methodology reveals that the description of the instructions to participants aligns closely with Newman et al.’s (2020) low transparency condition. Other factors may be involved in these differing results, as there were other notable methodological differences. The job questions used in the interview differed, as did the level of seniority of the job: while Suen et al. (2019) hired for HR managers at a mid-senior level, Newman et al. (2020) recruited undergraduate students for an unspecified future job opportunity. The most prominent difference between the methodologies, however, is the time at which applicants filled out the questionnaire on the hiring process. Participants in the Suen et al. (2019) experiment completed the interview and then filled out the questionnaire, whereas applicants in Newman et al. (2020) were briefed on the process and filled out the questionnaire prior to the interview. Thus, it is possible that the experience of completing the actual interview resulted in different fairness ratings.
Another three papers on perceived fairness asked participants to assess a hypothetical hiring scenario (Lee 2018; Langer et al. 2018, 2019). While these experiments lack real stakes, such as participating in an actual job interview, they inform what potential applicants might think when evaluating a company prior to the application stage. Langer et al. (2018) find no significant effect of AI interviews on fairness, regardless of the level of transparency. They manipulated transparency similarly to Newman et al. (2020), providing additional details on how the AI program analyses video and audio information to participants in the high information condition and leaving this out for participants in the low information condition. Participants observed a job interview in which the interviewer was a virtual character interacting with a human applicant, and then answered a questionnaire on the process. The fairness measure was adapted from Warszta (2012), with questions answered on a 5-point Likert scale. The findings indicated a small but non-significant effect of higher fairness ratings in the high transparency condition (ηp2 = 0.03, p > 0.05).
Langer et al. (2019) also find no difference in fairness ratings between types of interview, but find a moderate effect of the level of interview stakes on fairness ratings. Participants watched either an automated interview or a videoconference; stakes were varied by saying the interview was for training and feedback in the low stakes condition and that the interview was real in the high stakes condition. The authors used the same fairness measure as Langer et al. (2018) and likewise found a small but non-significant effect of fairness being lower in automated interviews (ηp2 = 0.03, p > 0.05). There was a medium-sized significant effect of the level of stakes on fairness: participants who believed the interview was real rated the interview process as less fair (ηp2 = 0.07, p < 0.01).
On the other hand, Lee (2018) finds a large effect that human interview decisions are rated as more fair than AI ones. They recruited participants to evaluate a hypothetical hiring scenario and assessed perceptions around using an algorithm or a human for the initial recruitment stage, such as reviewing resumes and personal statements on a job website. Fairness measurement questions were adopted from Brockner et al. (1994) and Konovsky & Folger (1991), and participants answered questions on a 7-point Likert scale. Participants gave much higher fairness ratings when the decision maker was a human rather than an algorithm (d = 0.861, p < 0.0001).
Due to the mixed findings for fairness, more recent studies have broken down the meaning of fairness into different components (Acikgoz et al., 2020; Noble et al., 2021; Warrenbrand, 2021). All three studies looked at procedural justice, which in this context is the perceived fairness of the decision-making process for hiring, whether through human or AI means. All three utilised a vignette-style methodology in which participants read over a hiring scenario where a human or an AI made the hiring decision. Procedural justice was measured using Bauer et al.'s (2001) items on a 5-point Likert scale (Acikgoz et al., 2020), Bauer et al.'s (2001) items on a 7-point Likert scale (Noble et al., 2021), or Colquitt's (2001) 5-point Likert scale (Warrenbrand, 2021). Across the three studies, procedural justice was rated moderately to substantially higher in the condition with a human decision maker (d ranged from 0.27 to 0.80, p < 0.05).
Acikgoz et al. (2020) and Noble et al. (2021) also looked at the perceived fairness of how individuals are treated during the hiring process (i.e., interactional or interpersonal justice), measured using Bauer et al.'s (2001) 5-point or 7-point Likert scale. Interactional justice was rated as much higher for human decision makers (d ranged from 0.78 to 1.40, p < 0.001) (Acikgoz et al. 2020), and interpersonal justice as moderately higher for human decision makers (d ranged from 0.26 to 0.64, p < 0.05) (Noble et al. 2021). Furthermore, the treatment sub-factor mediated the influence of type of interview on litigation intentions (\(\beta\) = 0.09, p < 0.05) (Acikgoz et al. 2020). Litigation intentions here indicate how likely participants were to seek legal recourse over the hiring process: participants shown the automated hiring process felt they were treated worse, making them more likely to report an intention to seek legal recourse.
Warrenbrand (2021) also looked at distributive justice, which in this case refers to applicants' perceptions of the fairness of a hiring decision when comparing their experience to that of other applicants. Distributive justice was measured on Colquitt's (2001) 5-point Likert scale and was rated substantially higher when humans made the hiring decision (d = 1.13, p < 0.001).
The last three papers discussing ethical perceptions relate to morality and privacy concerns (Kodapanakkal et al. 2020; Langer et al. 2019; Bigman et al. 2020). Langer et al. (2019) find that people have slightly more privacy concern about AI interviews than human ones. Participants watched either a videoconference or an automated interview and then completed a questionnaire on the process. Privacy concern was measured with six items from previous research (Smith et al. 1996; Agarwal et al. 2004; Langer et al. 2017, 2018) on a 7-point Likert scale. Privacy concern was higher in the automated interview condition (\(\eta_p^2\) = 0.04, p < 0.05); however, the effect size is small. This may also inform other ethical aspects of algorithmic hiring, as Kodapanakkal et al. (2020) extend these findings to show that data protection drives the moral acceptability of a hiring algorithm. Moral acceptability was rated by participants on a 0–100 numeric scale. The authors manipulated outcome favourability by stating whether the algorithm would increase or decrease the chance of someone finding employment, data sharing by stating whether the data would be shared with no one, a private company, or academic researchers, and data protection by stating whether the data was encrypted and stored securely. Data protection was the driving factor in the moral acceptability of using such algorithms for hiring (z = 14.68, p < 0.001).
Finally, Bigman et al. (2020) found that people are less morally outraged when an algorithm discriminates than when a human does. Moral outrage was measured on a 7-point Likert scale from Sunstein et al. (1998), and attribution of moral outrage was measured through author-created questions, also on a 7-point Likert scale. Across three hiring scenarios in which discrimination occurred on the basis of race, age, or gender, algorithms were consistently rated as evoking less moral outrage, with effect sizes varying from small to large (d ranged from 0.34 to 0.80, p = 0.012). This effect was mediated by the attribution of bias, meaning the perceived motivations behind the discrimination (b = −0.30, p < 0.05). People attributed substantially higher levels of prejudiced motivation to the human decision than to the algorithmic hiring decision (d = 1.03, p < 0.001).
Overall, the current findings on ethical perceptions of AI are mixed, but most point towards perceptions of AI being equal to, or worse than, those of humans. Further research is therefore needed to identify other factors that influence ethical perceptions of AI in a hiring context. From the findings of Langer, Konig and Papathanasiou (2019) and Kodapanakkal et al. (2020), it is possible that concern about data privacy is a driving factor guiding the ethical perceptions of AI.
Four studies using hypothetical hiring scenarios assess how the use of algorithmic hiring affects how attractive an organisation is perceived to be (Kaibel et al. 2019; Langer, Konig and Fitili 2018; Langer, Konig, and Papathanasiou 2019; Acikgoz et al. 2020). See Table 6 for an overview of the studies.
Acikgoz et al. (2020) found a moderate effect whereby ratings of organisational attractiveness are lower for automated interviews. In their experiment, participants reviewed a vignette of a job hiring situation and then rated organisational attractiveness on a 5-point Likert scale from Highhouse et al. (2003). Participants in the automated interview condition gave moderately lower organisational attractiveness scores than participants in the human interviewer condition (r = 0.29, p < 0.001).
Kaibel et al. (2019) also find that organisations using AI hiring are considered slightly less attractive than those using human hiring. They asked participants to evaluate a hiring process in which the decision maker was either a human or an algorithm, with organisational attractiveness measured on Highhouse et al.'s (2003) scale. Across two studies, they found a small significant effect (d = 0.375–0.443, p < 0.05) of participants rating organisational attractiveness lower when an algorithm rather than a human made the hiring decision. It was also found that personal uniqueness negatively moderates this relationship: the more an individual considers themselves unique, the lower they rate organisational attractiveness when an algorithm makes the hiring decision (\(\beta\) = −0.46, p = 0.002). This suggests there are individual differences in how applicants view organisations that use algorithmic hiring.
Langer, Konig & Papathanasiou (2019) measured organisational attractiveness on the User Experience Questionnaire (UEQ; Laugwitz et al., 2008) 7-point Likert scale. They also found a small effect of attractiveness being lower for automated interviews (\(\eta_p^2\) = 0.04, p < 0.05). This effect was mediated by two factors: social presence, in that the lack of human interaction in automated interviews resulted in lower attractiveness scores, and fairness, in that perceptions of lower fairness in automated interviews also resulted in lower attractiveness ratings. These mediation effects explained 61% of the variance in organisational attractiveness scores (\(R^2\) = 0.61, p < 0.01).
Langer et al. (2018) extend these findings by testing the effect of transparency on organisational attractiveness in algorithmic hiring, and find mixed effects. They measured organisational attractiveness on a 5-point Likert scale adapted from Highhouse et al. (2003) and Warszta (2012). The amount of information known about the AI during the hiring process affected organisational attractiveness in two opposite ways: there was an indirect positive effect of information on organisational attractiveness through the factors of open treatment and information known, but also a negative direct effect of information on organisational attractiveness. This model explained 24% of the variance in organisational attractiveness scores. As the authors postulate, these opposing effects might be driven by applicants appreciating the honesty but being intimidated by the technological aspects of the selection process. It could also be that the amount of information provided was enough to make applicants sceptical, but not enough to explain the methodology fully and allay their concerns.
It is, however, important to point out that the positive indirect effect that Langer et al. (2018) describe may also be a statistical artefact. The main purpose of a mediation analysis is to explain the psychological mechanism behind the main effect of an intervention on a dependent variable of interest (Yzerbyt et al., 2018)—in this case, the influence of information level on organisational attractiveness. In that regard, obtaining a mediated (i.e., indirect) effect that is in the opposite direction to the main effect, as is the case in Langer et al. (2018), is statistically possible. However, because in this case the indirect effect does not conceptually explain the main effect, the indirect effect may be an artefact that occurred because of a spurious correlation between the mediators (i.e., information known and open treatment) and dependent variable (i.e., organisational attractiveness) (Fiedler et al., 2018; Yzerbyt et al., 2018).
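The structure of this argument can be made explicit. In a single-mediator model, using standard mediation notation rather than symbols from Langer et al. (2018), let X be the level of information, M a mediator such as open treatment, and Y organisational attractiveness:

```latex
M = i_1 + aX + e_1, \qquad
Y = i_2 + c'X + bM + e_2, \qquad
c = c' + ab
```

Here c is the total effect, c' the direct effect, and ab the indirect effect through the mediator. When ab and c' carry opposite signs, as in Langer et al. (2018), the indirect path partially offsets the direct one; such "inconsistent" mediation can reflect genuine suppression, but, as noted above, it can also arise from a spurious correlation between M and Y (Fiedler et al., 2018).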
These findings show that applicants perceive organisations that use AI in the hiring process as less attractive than those using human hiring; AI is therefore perceived to be worse for organisational attractiveness. Although Langer, Konig and Fitili (2018) do not compare AI to human hiring, their study shows that the level of transparency surrounding the hiring process may influence how attractive the organisation is perceived to be. In this case, negative organisational perceptions may be alleviated by giving applicants more information on how AI hiring works.
Perceptions of use
Three studies assess the perceptions of the usability of AI hiring technologies (Suen et al., 2019; Kodapanakkal et al., 2020; Oberst et al., 2020). See Table 7 for an overview of the studies.
Suen et al. (2019) found a large effect whereby applicants in an interview featuring an AI decision agent have less favourability towards the process than when the agent is a human (\(\eta_p^2\) = 0.391, p < 0.001). Favourability, meaning how beneficial the outcome is to the individual, was measured with 10 questions adopted from Guchait et al. (2014) on a 5-point Likert scale. Favourability was also found to be indicative of the decision to adopt or reject AI hiring technology in Kodapanakkal et al.'s (2020) experiment, where it was manipulated by stating whether the technology increases or decreases the chance of someone finding employment. When participants were given the choice regarding usage of AI hiring technology, they were more likely to embrace it if outcome favourability was high (z = 12.26, p < 0.001).
Oberst et al. (2020) extend these studies by looking at the perceptions of recruitment professionals, finding that they greatly prefer human judgements over algorithmic assessments in candidate selection decisions. The authors assessed how recruitment professionals make decisions in a fictitious candidate selection scenario in which they were given an algorithmic assessment of each candidate's suitability for the job with three levels: "sufficient", "satisfactory", and "good". They were also given a co-worker's recommendation on each candidate with three levels: "I recommend this person totally", "this person causes an excellent impression", and "this person does not inspire confidence". The analysis assessed the extent to which recruiters used the algorithmic assessment and the human judgement in their selection decisions, giving each factor an average utility score. Bayesian hierarchical analysis revealed that the average utility assigned to the co-worker's recommendation (M = 58.69, SD = 9.47) was higher than that assigned to the algorithm (M = 22.34, SD = 11.23). Thus, there is a large effect (d = 3.49) of recruiters relying on co-workers' recommendations in selection decisions to a greater extent than on algorithmic assessments.
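As a check on the reported effect size, the d reported by Oberst et al. (2020) can be approximately recovered from the means and standard deviations above. The calculation below assumes equal group sizes when pooling the standard deviations, a simplifying assumption on our part rather than a detail reported in the paper:

```python
from math import sqrt

def cohens_d(m1, sd1, m2, sd2):
    # Cohen's d with a pooled SD, weighting both groups equally
    # (assumption: equal group sizes, as n is not restated here).
    pooled_sd = sqrt((sd1 ** 2 + sd2 ** 2) / 2)
    return (m1 - m2) / pooled_sd

# Average utility scores reported by Oberst et al. (2020):
# co-worker recommendation vs. algorithmic assessment.
d = cohens_d(58.69, 9.47, 22.34, 11.23)
print(round(d, 2))  # ≈ 3.5, consistent with the reported d = 3.49
```

The near-match suggests the reported d was computed from these group statistics with a standard pooled-SD formula.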
Again, here we see that perceptions around the usability of AI are worse than those of human hiring. This reflects a possible obstacle in the adoption of AI hiring. As shown in Kodapanakkal et al. (2020), favourability of the hiring outcome was indicative of the decision to adopt AI. Thus, these negative perceptions surrounding usability may be driven by a fear of poor hiring outcomes.
Candidates' emotional perceptions evoked by AI hiring are assessed in two studies (Lee, 2018; Langer, Konig & Papathanasiou, 2019). See Table 8 for an overview of the studies.
Lee (2018) found that people trust humans much more than algorithms in hiring decisions, and have slightly more negative feelings towards AI hiring. Trust in the decision process was measured on a 7-point Likert scale, and emotional response through questions adapted from Larsson (1987) and Weiss et al. (1999), also on a 7-point Likert scale. There was a large effect of people trusting algorithms less than humans during hiring (d = 0.951, p < 0.0001), and a small effect of people having more negative feelings towards algorithmic hiring (d = 0.394, p < 0.05). In addition, Langer et al. (2019) measured the creepiness of algorithmic hiring using a 7-point scale from Langer and Konig (2018), finding a medium effect of automated interviews evoking feelings of "creepiness" (\(\eta_p^2\) = 0.06, p < 0.01).
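Because the studies in this review report a mixture of \(\eta_p^2\) and Cohen's d, the effects are easier to compare on a common scale. A rough conversion for a two-group design is d = 2√(η² / (1 − η²)); this is our heuristic for cross-study comparison, not a calculation reported by any of the authors:

```python
from math import sqrt

def eta2_to_d(eta2):
    # Cohen's (1988) relation for a two-group design:
    # f = sqrt(eta2 / (1 - eta2)), and d = 2f.
    return 2 * sqrt(eta2 / (1 - eta2))

# Langer et al.'s (2019) creepiness effect (eta_p^2 = 0.06) maps to
# roughly d = 0.51, i.e., medium-sized on the Cohen's d scale.
print(round(eta2_to_d(0.06), 2))
```

This conversion is only approximate for designs with covariates or more than two groups, but it shows why an \(\eta_p^2\) of 0.06 and a d near 0.5 are described with the same "medium" label.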
These studies provide preliminary evidence that people hold negative connotations surrounding the AI hiring process, with the size of the effect varying by the type of emotional perception; the largest effect is seen in the lack of trust in AI hiring. Thus, emotional perceptions are worse for AI hiring than for human hiring.
Five experiments have looked at the role of consistency in perceptions surrounding algorithmic hiring (Langer et al. 2018, 2019; Kaibel et al. 2019; Acikgoz et al. 2020; Noble et al. 2021). See Table 9 for an overview of the studies.
Kaibel et al. (2019) hypothesised that algorithms in the context of hiring would be viewed as more consistent than a human decision-making process. They measured consistency on a 5-point Likert scale developed by Bauer et al. (2001). Although they initially found a large effect (d = 1.16) of algorithmic hiring (M = 4.36, SD = 0.67) being rated as more consistent than human hiring (M = 3.36, SD = 1.02), this could not be replicated in their second study, which included additional contextual information, because scores were near the end-points of the scale (algorithms: M = 4.51, SD = 0.60; humans: M = 4.44, SD = 0.66). Furthermore, two other studies (Langer, Konig & Papathanasiou, 2019; Langer, Konig & Fitili, 2018), which measured consistency on a 5-point Likert scale adapted from Bauer et al. (2001) and Warszta (2012), found no significant effect of consistency (\(\eta_p^2\) = 0.00, p > 0.05), even when varying stakes (\(\eta_p^2\) = 0.03, p > 0.05) and level of information (\(\eta_p^2\) = 0.00, p > 0.05). However, the last two studies (Acikgoz et al., 2020; Noble, Foster, & Craig, 2021) found that consistency, also measured using Bauer et al.'s (2001) items on 5- or 7-point Likert scales, was rated significantly higher for AI hiring processes than for human hiring, with effect sizes ranging from small to large (d ranged from 0.36 to 0.90, p < 0.001). Thus, there is mixed evidence concerning the relationship between consistency and type of hiring. Given that the studies used the same measure of consistency yet effect sizes varied substantially, the ratings were likely sensitive to features of the experimental methodology.
Lastly, a few additional negative perceptions surrounding algorithmic hiring were identified. Kaibel et al. (2019) measured the personableness of the hiring process through a four-item scale adapted from Wilhelmy et al. (2019) and found a medium to large effect (d = 0.618–0.904, p < 0.001) of the AI selection process being less personable. Langer, Konig & Papathanasiou (2019) measured perceived behavioural control on a 7-point Likert scale adapted from Langer et al. (2017), finding a moderate effect of perceived behavioural control being lower for automated interviews (\(\eta_p^2\) = 0.10, p < 0.01). Finally, Newman et al. (2020) measured decontextualisation, i.e., the perception that the algorithm combines and weighs pieces of information without regard for their context, on a 7-point Likert scale, and found a medium effect of algorithms being perceived as more decontextualised in high transparency conditions (d = 0.53, p = 0.006).
These last findings show additional perceptual areas where AI is considered worse than humans. Thus, perceptions regarding AI hiring extend beyond the ethical, organisational, usage, and emotional. Further research findings may distinguish even more perceptual differences between AI and human hiring.