Introduction

Over the past decade, the number of domains in which AI is used to assist humans by providing recommendations for prediction problems has grown substantially. Examples of these AI recommendation systems include making bail decisions in a legal context (Kleinberg et al., 2018), detecting deception in consumer reviews (Ott et al., 2011), making medical decisions in diagnostic imaging (Esteva et al., 2017; Patel et al., 2019; Rajpurkar et al., 2020), recognizing faces in forensic analysis (Phillips et al., 2018), and classifying astronomical images (Wright et al., 2017). Such widespread adoption of AI decision aids has been accompanied by burgeoning interest in investigating the efficacy of AI assistance in collaborative decision-making settings (Yin et al., 2019; Park et al., 2019; Zhang et al., 2021; Poursabzi-Sangdeh et al., 2021; Buçinca et al., 2021; Kumar et al., 2021; Chong et al., 2022; Becker et al., 2022).

To investigate such AI-assisted decision-making, researchers have designed a variety of workflows. Some workflows require the human to provide an independent decision first and then display the AI’s advice, which the human can use to update their final decision (Yin et al., 2019; Poursabzi-Sangdeh et al., 2021; Chong et al., 2022). Other workflows present AI advice alongside the prediction problem, and the human can decide to follow the advice or ignore it (Rajpurkar et al., 2020; Sayres et al., 2019). Finally, a few studies force individuals to spend time thinking about the decision problem by artificially delaying the presentation of AI advice (Buçinca et al., 2021; Park et al., 2019) or making AI advice available only when it is requested (Kumar et al., 2021; Liang et al., 2022). In this work, we focus on two of these workflows of AI-assisted decision-making and refer to them as paradigms; a detailed illustration can be found in Fig. 1. We term the first the sequential paradigm: AI advice is displayed only after the human provides an independent judgment, and the human can then choose to revise that judgment. We term the second the concurrent paradigm: AI advice is displayed concurrently with the prediction problem.

The sequential paradigm provides direct insight into the human’s reliance on the AI based on two human judgments: the initial independent judgment and a final judgment after receiving the AI advice. This paradigm makes it easier for experimenters to disentangle the influence of AI advice on the human’s decision. However, in many real-world applications, the human user does not independently make a decision before AI assistance is provided, since presenting the AI’s recommendation immediately simplifies the workflow and can save time. The concurrent paradigm offers an alternative setting to study AI-assisted decision-making. One drawback of the concurrent paradigm is a fundamental ambiguity in data interpretation: it is unclear how one can assess the usefulness of the AI decision aid to the human user. Since there is no initial human judgment available before AI advice is offered, there is no direct empirical observation of any changes the human makes in their decision-making. In the concurrent paradigm, any observed agreement between the human and the AI could arise either because the human changed their judgment to follow the AI’s advice or because the human arrived at the same judgment independently of the AI. How, then, do we assess the impact of AI assistance on the human’s decision?

Fig. 1: Illustration of the sequential and concurrent paradigms for AI-assisted decision-making (top two rows). The no-AI assistance paradigm (bottom row) is used as a control condition for the concurrent paradigm

Our research has three main goals. First, we develop a computational cognitive model for AI-assisted decision-making in the concurrent paradigm. The cognitive model provides a principled way to infer the latent reliance of a human on the AI assistant even though switching behavior is not directly observed when a person is presented with AI advice. We empirically validate the computational model by collecting data from a behavioral study using both the sequential and concurrent paradigms. The data from the sequential paradigm offer a comparison to the concurrent paradigm and provide a test of the merit of the computational framework. We demonstrate that the model’s predictions of reliance behavior in the concurrent paradigm are qualitatively similar to the reliance behavior observed in the sequential paradigm. In addition, we demonstrate that the model can generalize to held-out trials in the concurrent paradigm.

For our second goal, we use the cognitive modeling approach to understand how a human’s reliance policy depends on a number of factors related to the human and the AI. Previous research has shown that a human’s confidence in their own decision influences the tendency to rely on AI assistance (Lu and Yin, 2021; Pescetelli et al., 2021; Wang et al., 2022). In addition, reliance on the AI is affected by the AI’s confidence in its decision (Zhang et al., 2020). Another contributing factor is the overall accuracy of the AI. In some previous research, only a single AI model with a fixed degree of accuracy was used; for example, an AI model with an accuracy comparable to human performance (Zhang et al., 2020) or above human performance (Lai and Tan, 2019; Pescetelli et al., 2021). A few studies have investigated the effect of varying AI accuracy on reliance strategy (Yin et al., 2019). In our empirical paradigm, we investigate how human reliance varies across multiple levels of AI accuracy, which allows for a more nuanced understanding of the impact of the AI aid’s accuracy on the human’s reliance behavior. In addition, we investigate how participant confidence and AI confidence scores affect the trial-by-trial reliance strategy used by participants.

For our third goal, we use the computational model to quantify the effectiveness of the reliance strategies employed by the human. In some instances, people adopt sub-optimal reliance policies when working with an AI. For example, it has been found that people prefer to use their own (less accurate) forecasts instead of an algorithm’s if they have seen the algorithm make mistakes (Dietvorst et al., 2015). In another study, people placed too much trust in an automated system (Cummings, 2017). Over- and under-reliance on AI advice may depend on the particular task domain and method of interaction (Promberger and Baron, 2006; Castelo et al., 2019; Logg, 2017). Whereas these previous studies assessed reliance at the aggregate level, our cognitive modeling approach enables us to estimate trial-by-trial variations in reliance depending on factors such as the confidence state of the participant and the level of confidence of the AI for particular problem instances. For particular combinations of self- and AI confidence (e.g., low self-confidence and high AI confidence) and particular combinations of human and AI overall accuracy, we can expect joint decision-making accuracy to be better than that of either the human or the AI alone (Steyvers et al., 2022). An empirical question is whether participants are able to adopt such a policy. We compare the reliance policies adopted by participants to optimal policies and show that, in our experiment, people were quite effective in their adoption of AI advice.

Cognitive Model

Before describing the computational model, we note some key aspects of the concurrent advice-taking paradigm that motivate the design of the model. In the experiment, participants have to predict the classification label for each image in a set and report a confidence level associated with their decision. Each participant alternates between two experimental conditions. In the control (no assistance) condition, participants indicate their predictions without help from the AI. In the AI assistance condition, we follow the concurrent approach; the AI provides a recommended set of predictions by highlighting the class labels according to the AI’s confidence scores. The participant can use these recommendations in any way they want in order to maximize their own accuracy (see Fig. 2 for an illustration of the user interface in the experiment). An important aspect of this condition is that the participant’s prediction reflects a combination of their own independent decision-making (which is not observable in this paradigm) and the AI prediction. In other words, the policy used by the participant to rely on and integrate AI predictions with their own predictions is not directly observable from their behavior.

Fig. 2: Illustration of the behavioral experiment interface in the AI assistance condition

The main goal of the computational model is to draw inferences about the latent advice-taking policies. The policy can be determined by a number of factors, such as the confidence state of the participant, the confidence scores of the AI, and the overall accuracy of the AI. We develop a hierarchical Bayesian model to draw inferences about the policies not only at the population level but also at the level of individual participants. In the first part of the model, a Bayesian Item-Response model (Fox, 2010) is applied to the no-assistance condition to infer individual differences in ability as well as differences in difficulty across items (i.e., prediction problems). In the AI-assistance part of the model, these latent person and item parameters are used to explain a participant’s observed prediction, which depends on their (unobservable) unaided prediction and on the advice-taking policy that determines how likely the participant is to switch to the AI prediction rather than stay with their own. Figure 3 shows the graphical model that explains the human predictions with and without AI assistance.

Fig. 3: Graphical model for the AI-assisted decision-making model. In the condition without assistance, \(r_{ij}\), \(x_{ij}\), and \(z_j\) are observed. In the condition where AI assistance is provided, \(r_{ij}\) and \(x_{ij}\) are latent and \(y_{ijk}\), \(z_j\), \(c_{jk}\), and \(\eta _{jk}\) are observed. For visual clarity, plate notation is omitted

Modeling Human Decisions Before Assistance

The computational model for human predictions without AI assistance is based on a Bayesian Item-Response Theory (IRT) model (Fox, 2010). The Item-Response model makes it convenient to model individual differences in accuracy as well as differences in item difficulty (where items refer to the individual images participants have to classify). To model the human predictions, we use an IRT model with person ability, item difficulty, and item discrimination parameters to capture the probability \(\theta _{i,j}\) that a correct response is made by person i on item j:

$$\log \left( \frac{\theta _{i,j}}{1-\theta _{i,j}} \right) = s_{j} a_i - d_{j}$$
(1)

The person parameter \(a_i\) is an ability parameter that determines the overall performance of the person across items. The item parameter \(d_j\) captures differences in the item difficulty while the item parameter \(s_j\) captures discrimination: the tendency of an item to discriminate between high and low ability individuals.
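
As a concrete illustration, the following minimal Python sketch evaluates Eq. 1; the function and variable names are ours and are not taken from the authors' implementation.

```python
import numpy as np

def p_correct(ability: float, difficulty: float, discrimination: float) -> float:
    """Probability theta_{i,j} that person i classifies item j correctly (Eq. 1)."""
    logit = discrimination * ability - difficulty
    return 1.0 / (1.0 + np.exp(-logit))

# A high-discrimination item separates low- and high-ability participants sharply.
print(p_correct(ability=-1.0, difficulty=0.5, discrimination=2.0))  # approx. 0.08
print(p_correct(ability=+1.0, difficulty=0.5, discrimination=2.0))  # approx. 0.82
```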

In a typical IRT model, the probability of making a correct response, \(\theta\), is used to sample the correctness of an answer. However, for our model, we code the responses from individuals in terms of the predicted label. Let \(x_{i,j}\) represent the prediction by person i for item j in the absence of AI assistance. Each prediction involves a choice from a set of L labels, i.e., \(x \in \{1,\ldots ,L\}\). Let \(z_j\) represent the true label for item j. We assume that person i produces the correct label \(z_j\) on item j with probability \(\theta _{i,j}\) and otherwise chooses uniformly from all other labels, as follows:

$$p( x_{i,j} = m ) = {\left\{ \begin{array}{ll} \theta _{i,j} & \text{ if } z_j = m \\ (1-\theta _{i,j})/(L-1) & \text{ if } z_j \ne m \end{array}\right. }$$
(2)

Various model extensions could be considered that allow for response biases such that some labels are preferred a priori over others.
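
For illustration, here is a small sketch of the label-generation process in Eq. 2 (hypothetical names; not the authors' code):

```python
import numpy as np

def sample_prediction(theta: float, true_label: int, n_labels: int,
                      rng: np.random.Generator) -> int:
    """Sample x_{i,j} per Eq. 2: return the true label with probability theta,
    otherwise choose uniformly among the remaining n_labels - 1 labels."""
    if rng.random() < theta:
        return true_label
    others = [m for m in range(n_labels) if m != true_label]
    return int(rng.choice(others))

rng = np.random.default_rng(0)
x = sample_prediction(theta=0.7, true_label=3, n_labels=16, rng=rng)
```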

Participants not only make a prediction but also express a confidence level, \(r_{i,j}\), associated with their prediction. In the experimental paradigm, confidence levels are chosen from a small set of labels, \(r_{i,j} \in \{\mathrm {low}, \mathrm {medium}, \mathrm {high}\}\). In the model, we assume that predictions associated with higher accuracy on average lead to higher confidence levels, but that at the item level, the mapping from accuracy to confidence is noisy. To capture the noisy relationship between accuracy and confidence, we use a simple generative model based on an ordered probit model:

$$r_{i,j} \sim \mathrm {OrderedProbit}( \theta _{i,j} , v_i , \sigma _i )$$
(3)

In this generative model, normally distributed noise with standard deviation \(\sigma _i\) is added to the probability of being correct \(\theta _{i,j}\). The resulting value is then compared against a set of intervals defined by parameters \(v_i\), and the interval which contains the value determines the resulting confidence level. Changes in \(v_i\) can lead the participant to different uses of the response scale (i.e., using one particular confidence level relatively often) while \(\sigma _i\) determines (inversely) the degree to which accuracy and confidence are related. Note that the parameters \(\sigma\) and v are person-specific to allow for individual differences in the confidence generating process. Appendix 1 provides more detail on the ordered probit model.
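
A minimal sketch of the confidence-generation step in Eq. 3, assuming two person-specific cutpoints that partition the real line into the three confidence levels; the cutpoint and noise values below are illustrative, not estimates from the paper.

```python
import numpy as np

def sample_confidence(theta: float, cutpoints: np.ndarray, sigma: float,
                      rng: np.random.Generator) -> int:
    """Ordered-probit confidence (0 = low, 1 = medium, 2 = high): add Gaussian
    noise to theta and report the interval that contains the noisy value."""
    noisy = theta + sigma * rng.standard_normal()
    return int(np.searchsorted(cutpoints, noisy))  # number of cutpoints below the value

rng = np.random.default_rng(1)
r = sample_confidence(theta=0.8, cutpoints=np.array([0.4, 0.75]), sigma=0.1, rng=rng)
```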

Modeling Human Decisions After Advice

In the model for human decisions in the presence of advice, let \(y_{i,j,k}\) represent the observed prediction made by person i on item j after AI advice is considered from AI algorithm k. We include a dependence on the type of algorithm because our empirical paradigm presents AI advice from different algorithms. In the advice-taking model, we assume that the participant initially makes their own prediction \(x_{i,j}\) independent of the AI advice but that their final decision \(y_{i,j,k}\) can be influenced by the AI advice. Note that in the no-assistance condition, the independent predictions \(x_{i,j}\) and associated confidence levels \(r_{i,j}\) are directly observable, but they are latent in the AI assistance condition. However, we can use the IRT model in the previous section to simulate the counterfactual prediction and confidence level that a person would have produced had AI advice not been provided. Specifically, we can use the generative model in Eqs. 1–3 to generate predictions for \(x_{i,j}\) and \(r_{i,j}\) on the basis of information about the participant’s overall skill (\(a_i\)) as well as information about the difficulty of the particular item (\(d_j\)) (Footnote 1).

In the advice-taking model, we assume that the participant will stay with their original decision \(x_{i,j}\) if it agrees with the AI’s recommendation, denoted by \(c_{j,k}\). However, when the original decision is not the same as the AI’s recommendation, we assume the participant switches to the AI’s recommendation with probability \(\alpha _{i,j,k}\). Therefore, we can model the probability that the participant chooses label m for their final prediction as follows:

$$p( y_{i,j,k} = m ) = {\left\{ \begin{array}{ll} \alpha _{i,j,k} & \text{ if } x_{i,j} \ne m \wedge c_{j,k} = m\\ 1-\alpha _{i,j,k} & \text{ if } x_{i,j} = m \wedge c_{j,k} \ne m\\ 1 & \text{ if } x_{i,j} = m \wedge c_{j,k} = m\\ 0 & \text{ if } x_{i,j} \ne m \wedge c_{j,k} \ne m \end{array}\right. }$$
(4)

The variable \(\alpha _{i,j,k}\) determines the tendency of participant i to trust the AI advice from algorithm k related to item j. In the next section, we describe how this latent variable can depend on factors such as the confidence state of the participant as well as the confidence score of the AI.
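
The switching process in Eq. 4 can be sketched in a few lines of Python (illustrative only, with our own names):

```python
import numpy as np

def final_prediction(x: int, ai_pred: int, alpha: float,
                     rng: np.random.Generator) -> int:
    """Final decision y_{i,j,k} per Eq. 4: keep x when it matches the AI advice,
    otherwise switch to the AI recommendation with probability alpha."""
    if x == ai_pred:
        return x
    return ai_pred if rng.random() < alpha else x
```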

Note that in this model, when the participant is provided with AI assistance, the independent prediction \(x_{i,j}\) is latent in our experimental paradigm. Instead of explicitly simulating the process of first sampling an independent prediction \(x_{i,j}\) and then a final prediction \(y_{i,j,k}\), we can simplify the generative process by marginalizing out \(x_{i,j}\):

$$p( y_{i,j,k} = m ) = {\left\{ \begin{array}{ll} \theta _{i,j} + (1-\theta _{i,j}) \alpha _{i,j,k} & \text{ if } z_{j} = m \wedge c_{j,k} = m\\ \theta _{i,j} ( 1-\alpha _{i,j,k} ) & \text{ if } z_{j} = m \wedge c_{j,k} \ne m\\ \frac{1-\theta _{i,j}}{L-1} + \left( 1-\frac{1-\theta _{i,j}}{L-1} \right) \alpha _{i,j,k} & \text{ if } z_{j} \ne m \wedge c_{j,k} = m\\ \frac{1-\theta _{i,j}}{L-1} ( 1-\alpha _{i,j,k} ) & \text{ if } z_{j} \ne m \wedge c_{j,k} \ne m \end{array}\right. }$$
(5)

In this equation, the probability that the participant selects label m is split into four cases, depending on whether m is the true label and whether it matches the AI recommendation. The first case reflects the probability that the participant makes the correct decision independently (which happens to agree with the AI recommendation) or makes an incorrect decision initially but then adopts the correct AI advice. The second case reflects the probability that the participant makes the correct decision independently and does not switch to the AI’s (incorrect) recommendation. The third case reflects the probability that the participant initially selects a particular incorrect label (which happens to agree with the AI recommendation) or makes a different decision but then adopts the incorrect AI advice. The fourth case reflects the probability that the participant makes an incorrect independent decision and does not switch to the AI’s recommendation.
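
The marginal distribution in Eq. 5 can be checked numerically; the sketch below (our notation, not the authors' code) enumerates the four cases and verifies that the probabilities sum to one.

```python
import numpy as np

def label_probs(theta: float, alpha: float, true_label: int, ai_pred: int,
                n_labels: int) -> np.ndarray:
    """Distribution over final predictions y_{i,j,k} with x_{i,j} marginalized out (Eq. 5)."""
    p_wrong_each = (1.0 - theta) / (n_labels - 1)
    probs = np.empty(n_labels)
    for m in range(n_labels):
        p_x_is_m = theta if m == true_label else p_wrong_each
        if m == ai_pred:
            # reach m independently, or reach another label and switch to the AI
            probs[m] = p_x_is_m + (1.0 - p_x_is_m) * alpha
        else:
            # must reach m independently and then decline to switch
            probs[m] = p_x_is_m * (1.0 - alpha)
    return probs

p = label_probs(theta=0.6, alpha=0.4, true_label=2, ai_pred=5, n_labels=16)
assert np.isclose(p.sum(), 1.0)
```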

Modeling Individual Differences in Advice-Taking

The key latent variable of interest in the model is \(\alpha _{i,j,k}\), which determines the willingness of the participant per item to switch to the AI’s recommended prediction if it differs from their own prediction. Generally, \(\alpha _{i,j,k}\) can depend on many characteristics related to the person, item, and classifier. Here, we will consider functions where \(\alpha\) depends on the confidence state of the participant for item j (\(r_{i,j}\)), the AI confidence score associated with item j (\(\eta _{j,k}\)), and the type of classifier k:

$$\alpha _{i,j,k} = f( r_{i,j} , \eta _{j,k} , k )$$
(6)

One way to specify the function f is with a linear model that captures main effects as well as interactions among these factors. However, to avoid specifying the exact functional form of f, we instead simplify the model and treat f as a lookup table that specifies the \(\alpha\) values for a small number of combinations of participant confidence, AI confidence, and classifier type. Specifically, we create a 3 × 4 × 3 lookup table that specifies the \(\alpha\) value based on 3 levels of participant confidence (“low,” “medium,” “high”), 4 levels of AI confidence, and 3 types of classifiers (k). We use a hierarchical Bayesian modeling approach to estimate individual differences in the policy \(\alpha\) (see Appendix 2 for details).
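
Concretely, the lookup-table form of f can be represented as a 3 × 4 × 3 array indexed by participant confidence, binned AI confidence, and classifier identity. The sketch below is only a schematic of this representation; the placeholder values are ours, and in the model each participant's table is estimated hierarchically (Appendix 2).

```python
import numpy as np

N_HUMAN_CONF, N_AI_CONF, N_CLASSIFIERS = 3, 4, 3  # low/medium/high, 4 AI-confidence bins, A/B/C

# Placeholder values; in the model these entries are latent parameters estimated from data.
alpha_table = np.full((N_HUMAN_CONF, N_AI_CONF, N_CLASSIFIERS), 0.5)

def reliance(human_conf: int, ai_conf_bin: int, classifier: int) -> float:
    """Look up alpha_{i,j,k} for one combination of the three factors (Eq. 6)."""
    return float(alpha_table[human_conf, ai_conf_bin, classifier])
```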

Experiments

To validate our cognitive model, we investigated human performance with and without AI assistance in two paradigms: the concurrent and the sequential paradigm. We apply the cognitive model to the concurrent paradigm to infer the AI reliance strategies of individual participants. The results from the sequential paradigm serve as a means to validate our cognitive model, as the sequential paradigm allows us to empirically analyze participant strategies when integrating AI assistance.

In both paradigms, participants have to classify noisy images into 16 different categories (see Fig. 2 for an example of the user interface). There were two experimental manipulations. First, the image noise was varied to produce substantial differences in classification difficulty (Fig. 4). Second, we varied the overall accuracy of the AI predictions across three conditions: classifier A, classifier B, and classifier C. Classifier A was designed to produce predictions that are, on average, less accurate than human performance. Classifiers B and C were designed to produce predictions that are, on average, as accurate as and more accurate than human performance, respectively. Each participant was paired with one type of classifier.

The main difference between the two paradigms is that in the concurrent paradigm, participants alternated between blocks of trials in which AI assistance was or was not provided. In the sequential paradigm, there were no alternating blocks. On each trial, the participant first made an independent prediction for an image classification problem and was then given an opportunity to revise their prediction after AI assistance was provided.

Fig. 4: Illustration of three images under different levels of phase noise. Original images (left) were not used in experiments and are shown only for illustrative purposes

Methods

Participants

A total of 60 and 75 participants were recruited using Amazon Mechanical Turk for the concurrent and sequential experiments, respectively. To ensure that participants understood the task, they were first given a set of instructions describing the experiment and what they would have to do. After reading the instructions, participants completed a comprehension quiz in which they classified five different noisy images with AI assistance turned off. To participate in the study, participants had to correctly classify four of the five quiz images and were given two opportunities to pass. Successful participants then proceeded with the rest of the experiment.

Images

All images used for this experiment come from the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012 validation dataset (Russakovsky et al., 2015). Following Geirhos et al. (2019), a subset of 256 images was selected, divided equally among 16 classes (chair, oven, knife, bottle, keyboard, clock, boat, bicycle, airplane, truck, car, elephant, bear, dog, cat, and bird). To manipulate classification difficulty, images were distorted by phase noise at each spatial frequency, where the phase noise is uniformly distributed in the interval \([-\omega ,\omega ]\) (Geirhos et al., 2019). Eight levels of phase noise, \(\omega \in \{0, 80, 95, 110, 125, 140, 155, 170\}\), were applied to the images, with each unique image assigned a single noise level, resulting in 2 unique images per category per noise level (see Fig. 4 for examples of the phase noise manipulation).
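
The sketch below illustrates one way to apply this kind of phase-noise distortion to a grayscale image, assuming \(\omega\) is specified in degrees. It does not enforce conjugate symmetry of the spectrum (it simply discards the small imaginary residual), so it approximates rather than reproduces the exact manipulation of Geirhos et al. (2019).

```python
import numpy as np

def add_phase_noise(image: np.ndarray, omega_deg: float,
                    rng: np.random.Generator) -> np.ndarray:
    """Add uniform phase noise in [-omega, omega] degrees to every spatial-frequency
    component of a grayscale image and transform back to the image domain."""
    spectrum = np.fft.fft2(image)
    amplitude, phase = np.abs(spectrum), np.angle(spectrum)
    noise = np.deg2rad(rng.uniform(-omega_deg, omega_deg, size=phase.shape))
    distorted = amplitude * np.exp(1j * (phase + noise))
    noisy = np.real(np.fft.ifft2(distorted))  # imaginary residual is discarded
    return np.clip(noisy, image.min(), image.max())
```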

AI Predictions

We used a convolutional neural network (CNN), based on the VGG-19 architecture (Simonyan and Zisserman, 2014), pretrained on the ImageNet dataset as the basis for the AI assistance. Our choice of VGG-19 was motivated by previous experiments (Steyvers et al., 2022) that showed that the performance of the VGG-19 model could be manipulated to produce above-human performance for the challenging image noise conditions in the experiment.

Three different levels of classifier performance were created by differentially fine-tuning the VGG-19 architecture to the phase noise used in our experiment. All models were trained with all levels of phase noise. However, to generate these different levels of performance, the models were fine-tuned for different periods of time. We used a pilot experiment with 145 participants to assess human performance at the different noise levels. Classifier A was produced by fine-tuning for less than one epoch (10% of batches of the first epoch) and produced a performance level that was on average below human performance. Classifier B was produced by fine-tuning for the entirety of one epoch and produced a performance level that was on average near human performance. Classifier C was fine-tuned for 10 epochs and produced a performance level above average human performance.
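
A hedged sketch of this kind of differential fine-tuning is shown below, assuming a recent torchvision release (for the pretrained-weights enum) and a hypothetical DataLoader `noisy_loader` over phase-noise-distorted training images; this is not the authors' training code.

```python
import torch
import torchvision as tv

# Start from an ImageNet-pretrained VGG-19 (assumes torchvision >= 0.13 for the weights enum).
model = tv.models.vgg19(weights=tv.models.VGG19_Weights.IMAGENET1K_V1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
loss_fn = torch.nn.CrossEntropyLoss()

def fine_tune(model: torch.nn.Module, noisy_loader, n_batches: int) -> None:
    """Fine-tune for a fixed number of batches; fewer batches yield a weaker classifier
    (e.g., classifier A), more batches a stronger one (classifiers B and C)."""
    model.train()
    for b, (images, labels) in enumerate(noisy_loader):
        if b >= n_batches:
            break
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()
```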

Procedure

In both the concurrent and sequential paradigms, participants were instructed to classify images as accurately as possible and to leverage AI assistance, when provided, to optimize performance. Each participant was assigned to a single classifier level (A, B, or C) at the start of the experiment and was only presented with AI assistance from that particular classifier; 20 participants were assigned to each classifier level in the concurrent paradigm, and 25 participants to each classifier level in the sequential paradigm. Participants were given no information about the accuracy of the classifier.

Concurrent paradigm

In the concurrent paradigm, there were 256 trials in total. Each trial presented a unique image randomly selected from the set of 256 images. The classification trials were separated into 4 blocks, where each block consisted of 48 consecutive trials in which AI assistance was turned on and 16 consecutive trials without AI assistance. The larger number of trials with AI assistance was used to better assess participants’ AI reliance strategies under different levels of AI confidence. Because of the random ordering of images across participants, a particular image was shown to some participants in the AI assistance condition and to other participants in the control condition without AI assistance. Each unique image was shown to a median of 15 participants in the control condition and 45 participants in the AI assistance condition.

On each trial, participants were shown an interface as illustrated in Fig. 2. Participants classified images into 16 categories by pressing response buttons that represented the categories with visual icons as well as labels (shown when the participant hovered the mouse over a button). For each classification, the participant provided a discrete confidence level (low, medium, or high). Finally, the rightmost column of the interface was used for AI assistance. When AI assistance was turned off, this column displayed nothing. When AI assistance was turned on, a grid of the 16 category options was shown with the same layout as the participant response options. Each of the 16 categories was highlighted on a gradient scale according to the probability that the AI classifier assigned to that category. The darker the hue of the highlighted category, the more confident the classifier was in that selection. In instances in which the classifier was extremely confident in a single category, only one category would be highlighted, with an extremely dark hue. In instances where the classifier was not confident in a classification, multiple categories would be highlighted with low hue levels. Participants were to use the AI assistance to aid their classification decisions so as to optimize their own performance on the task. At the end of each trial, feedback was provided to enable the participant to develop an AI reliance strategy tailored to the particular AI algorithm they were paired with. In the feedback phase, the correct response option was highlighted in blue. If the participant was incorrect, their incorrect response was highlighted in red.

Sequential paradigm

In the sequential paradigm, there were 192 trials in total. Each trial presented a unique image randomly selected from the set of 256 images. On each trial, participants were first tasked with classifying an image on their own and were shown the interface displayed in Fig. 2 but without AI assistance (the third column showing AI assistance was completely blank). After selecting their initial classification decision and submitting their response by selecting a confidence level, participants were then provided with AI assistance. The user interface at this stage looked exactly like Fig. 2, and the procedure for displaying AI confidence was the same as in the concurrent paradigm. With AI assistance turned on, participants then made a final classification decision for the image shown and submitted their response by selecting their confidence level. Once a final classification was made, participants were provided feedback for 3 s.

Results

Figure 5 shows the average accuracy across noise levels, AI classifier accuracy levels, AI assistance conditions, and the concurrent and sequential advice-taking paradigms. In both the concurrent and sequential procedures, substantial performance differences are observed as the level of image noise varies, ranging from near-ceiling performance at the zero noise level to close to chance-level performance (i.e., 1/16 = 0.0625) at the highest noise level. Across all classifier conditions, human performance improves with AI assistance, especially at intermediate levels of noise, as illustrated in Fig. 6. For classifiers B and C, the AI assistance produces performance levels comparable to the AI alone. For classifier A, the AI assistance improves human performance even though the AI’s accuracy is, on average, below human performance. Note that this result is possible when participants rely on AI assistance selectively, on trials where they are in a low confidence state and the classifier is in a relatively high confidence state (see Appendix 5 for an analysis of the relationship between human and AI confidence). Overall, these results show that participants are able to rely on AI assistance to produce complementarity: the joint human-AI accuracy is equal to or better than that of either the human or the AI alone.

The results are very similar across the concurrent and sequential paradigms. The average human accuracy with AI assistance for classifiers A, B, and C is 57%, 62%, and 68%, respectively, in the concurrent paradigm and 56%, 61%, and 65%, respectively, in the sequential paradigm. A Bayesian independent samples t-test showed no evidence for a difference in performance for any of the classifiers (i.e., all Bayes factors < 1) (Footnote 2). The consistency of these results across the concurrent and sequential experiments suggests that the advice-taking paradigm does not produce important differences in how humans rely on and integrate AI assistance.
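
One way to run this kind of Bayesian independent samples t-test, assuming per-participant accuracies have been aggregated into two arrays, is via the pingouin package, which reports a Bayes factor (BF10) alongside the frequentist statistics; the arrays below are placeholders, not the study data.

```python
import numpy as np
import pingouin as pg

rng = np.random.default_rng(0)
acc_concurrent = rng.normal(0.57, 0.05, size=20)  # placeholder per-participant accuracies
acc_sequential = rng.normal(0.56, 0.05, size=25)  # placeholder per-participant accuracies

result = pg.ttest(acc_concurrent, acc_sequential, paired=False)
print(result["BF10"])  # values below 1 favor the null hypothesis of no difference
```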

Fig. 5: Human accuracy with and without AI assistance as well as AI accuracy as a function of noise level (horizontal axis) across the concurrent and sequential paradigms (rows). Columns show different types of AI classifiers: classifier A’s accuracy is below average human accuracy, classifier B’s accuracy is comparable to average human accuracy, and classifier C’s accuracy is above average human accuracy. Error bars reflect the 95% confidence interval of the mean based on a binomial model

Fig. 6: Differences in accuracy with AI assistance relative to no AI assistance and AI only. Results are shown as a function of noise level (horizontal axis) and type of AI (columns) across the concurrent and sequential advice-taking paradigms. Error bars reflect the 95% confidence intervals

Model-Based Analysis

The empirical results showed that the concurrent and sequential advice-taking paradigms produce similar levels of accuracy across all experimental manipulations. In this section, we report the results of applying the cognitive model to the data from the concurrent paradigm.

We used a Markov chain Monte Carlo (MCMC) procedure to infer the parameters of the graphical model illustrated in Fig. 3 (see Appendix 2 for details). Generally, the model is able to capture all the qualitative trends in the concurrent paradigm (see Appendix 4 for an out-of-sample assessment of model fit). We focus our analysis on two key parameters estimated by the model: \(\beta\), the advice-taking policy at the population level, and \(\alpha\), the advice-taking policy for individual participants. In the next sections, we illustrate the inferred policies and compare the results against the empirically observed strategies from the sequential advice-taking paradigm. In addition, we analyze how effective the policies are relative to the set of all possible policies that participants could have adopted, ranging from the worst to the best policies.
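
As an illustration of how the no-assistance (IRT) component of such a model can be estimated with MCMC, here is a sketch in PyMC. The priors, variable names, and placeholder data arrays are our assumptions for illustration, and the sketch uses a Bernoulli correctness likelihood rather than the categorical label likelihood of Eq. 2; it does not reproduce the specification in Appendix 2.

```python
import numpy as np
import pymc as pm

# Placeholder data: person and item indices plus binary correctness per response.
rng = np.random.default_rng(0)
n_people, n_items, n_obs = 20, 256, 2000
person_idx = rng.integers(0, n_people, size=n_obs)
item_idx = rng.integers(0, n_items, size=n_obs)
is_correct = rng.integers(0, 2, size=n_obs)

with pm.Model() as irt_model:
    ability = pm.Normal("ability", 0.0, 1.0, shape=n_people)          # a_i
    difficulty = pm.Normal("difficulty", 0.0, 1.0, shape=n_items)     # d_j
    discrimination = pm.LogNormal("discrimination", 0.0, 0.5, shape=n_items)  # s_j
    theta = pm.Deterministic(
        "theta",
        pm.math.sigmoid(discrimination[item_idx] * ability[person_idx] - difficulty[item_idx]),
    )
    pm.Bernoulli("correct", p=theta, observed=is_correct)
    idata = pm.sample(1000, tune=1000, chains=2)
```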

Inferred Advice-Taking Policies

Figure 7, top row, shows the inferred advice-taking policy \(\beta\) as a function of classifier confidence, participant confidence, and type of classifier. These policies represent the behavior of an average participant at the population level of the model. Figure 8 shows examples of inferred advice-taking policies (\(\alpha\)) for a subset of individual participants. Overall, the probability of taking AI advice differs substantially across classifiers. Advice is more likely to be accepted when the participant is in a low confidence decision-state and the classifier provides high confidence recommendations. In addition, across the different levels of classifier accuracy, advice is more likely to be accepted from high-accuracy classifiers. Overall, these results show that advice-taking behavior depends on a number of factors and is not based on simple strategies that rely solely on the confidence level of the AI or the confidence level of the participant. In addition, the results show that advice-taking behavior is adjusted when the AI assistance becomes more accurate, from classifier A to classifier C, showing that participants are sensitive to AI accuracy.

Fig. 7: Advice-taking policies inferred from the advice-taking behavior in the concurrent paradigm (top row) and observed in the sequential paradigm (bottom row). The policy determines the probability of taking the AI advice as a function of human confidence (colors), classifier confidence (horizontal axis), and type of classifier (columns). The colored areas in the top row show 95% posterior credible intervals. The colored areas in the bottom row reflect the 95% confidence interval of the mean based on a binomial model. The inferred advice-taking parameters (\(\beta\)) are converted from log-odds to probabilities in this visualization

Fig. 8: Inferred advice-taking policies for a subset of 7 individual participants in the concurrent paradigm. The policy determines the probability (\(\alpha\)) of taking the classifier advice as a function of human confidence (colors), classifier confidence (horizontal axis), and type of classifier (rows). Colored areas show 95% posterior credible intervals

Figure 7, bottom row, shows the empirically observed reliance strategies for the sequential paradigm. This analysis focuses on the subset of trials where the participant’s initial prediction differs from the AI prediction (which is not yet shown) and calculates the proportion of those trials on which the participant switches to the AI prediction. Importantly, even though some quantitative differences can be observed between the reliance strategies in the two paradigms, the qualitative patterns are the same. Thus, the results from the sequential paradigm provide a key validation of the cognitive model. The latent strategies uncovered by the cognitive model in the concurrent paradigm are very similar to those observed in the sequential paradigm.

Effectiveness of the Advice-Taking Policies

We now address the question of how effective the participants’ advice-taking policies are. How much better (or worse) could participants have performed if they had changed their advice-taking strategy? Figure 9 shows the range of all possible outcomes across different instantiations of the advice-taking policies. The accuracies of the worst and best possible advice-taking strategies were inferred by an analysis that optimizes performance conditional on the performance of the participants (Appendix 3); these worst and best accuracies span the range of all possible outcomes. To understand where the average participants’ policies (\(\beta\)) fall in this range, we used a Monte Carlo sampling procedure to derive the accuracy distribution over all strategies (see Appendix 3 for details) and computed the percentile rank of the participant strategies in this distribution. These results show that the actual policies adopted by participants were highly effective, scoring in or near the top 10% of all possible strategies. Figure 10 shows the percentile rank for all individual participants when the effectiveness analysis is applied to the individual participant data. While a small subset of participants used suboptimal reliance strategies, the majority of participants used highly effective strategies.
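
To make the percentile-rank idea concrete, the following sketch compares the expected accuracy of a given policy against policies sampled at random. It simplifies the analysis in Appendix 3 by treating a policy as a per-trial switch probability rather than the full \(\alpha\) lookup table, and all names are ours.

```python
import numpy as np

def expected_accuracy(theta: np.ndarray, ai_correct: np.ndarray,
                      alpha: np.ndarray) -> float:
    """Expected proportion of correct final answers (Eq. 5 evaluated at m = z_j):
    theta      - per-trial probability of an unaided correct answer
    ai_correct - per-trial indicator of whether the AI recommendation is correct
    alpha      - per-trial switch probability implied by a policy"""
    p = np.where(ai_correct, theta + (1 - theta) * alpha, theta * (1 - alpha))
    return float(p.mean())

def percentile_rank(theta, ai_correct, alpha_actual, n_samples=10_000, seed=0) -> float:
    """Percentile of the actual policy's accuracy among randomly sampled policies."""
    rng = np.random.default_rng(seed)
    actual = expected_accuracy(theta, ai_correct, alpha_actual)
    sampled = np.array([
        expected_accuracy(theta, ai_correct, rng.uniform(size=theta.shape))
        for _ in range(n_samples)
    ])
    return 100.0 * float(np.mean(sampled <= actual))
```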

Fig. 9: Accuracy of the advice-taking policy at the population level relative to the best and worst possible advice-taking policies. The distributions show the accuracy of randomly sampled advice-taking policies. To quantify the participants’ performance levels, percentages show the percentile rank of their performance relative to the accuracy distribution over all possible policies

Fig. 10: Individual differences in the effectiveness of advice-taking strategies as assessed by the percentile rank relative to the distribution of all possible advice-taking policies

Discussion

Appropriate reliance on AI advice is critical to effective collaboration between humans and AI. Most research on AI-assisted decision-making has focused on gaining insight into the human’s reliance on AI through empirical observations based on trust ratings and comparisons of observed accuracy and final decisions by humans and AI. For instance, in work that uses trust as a proxy for reliance, individuals are required to report their trust in the AI assistant (Lee and See, 2004). However, self-reported trust is not a reliable indicator of trust (Schaffer et al., 2019). Researchers have also compared the accuracy of the human-AI team when AI assistance is provided to the accuracy without assistance (Lai and Tan, 2019). However, this difference in accuracy is directly correlated with the performance of the AI. Another method used to investigate reliance is based on analyzing the agreement between the human’s final decision and the AI’s prediction (Zhang et al., 2020). This approach is problematic when used in the concurrent paradigm: while agreement can occur because of an individual’s trust in the AI, it can also occur because the individual might have arrived at the same prediction as the AI even without the AI’s assistance. Finally, in experiments using the sequential paradigm, reliance can be assessed by the propensity of individuals to switch to the AI’s recommendation in those cases where their initial independent decision differs from the AI’s (Zhang et al., 2020; Yin et al., 2019). While this is a simple and straightforward procedure for gaining insight into a reliance strategy, it cannot be applied to the concurrent paradigm because the individual’s independent response is inherently unobservable.

Instead of using empirical measures to assess reliance, we developed a cognitive modeling approach that treats reliance as a latent construct. The modeling framework provides a principled way to reveal the latent reliance strategy of the individual by using a probabilistic model of advice-taking behavior in the concurrent paradigm. It can be used to infer the likelihood that a human would have made a correct decision for a particular item independently, even when their independent decision is not directly observed. The model is able to make this inference because it assumes that people at the same level of skill are likely to make the same prediction. The model allows us to investigate the difference between agreement with the AI and switching to AI advice (two metrics often used to assess trust) without explicitly asking the human to respond independently to each problem. In order to apply the model, empirical observations are needed that assess people’s independent decisions without the assistance of an AI.

We showed that the AI reliance strategy inferred by the cognitive model on the basis of the concurrent paradigm is qualitatively similar to the AI reliance strategy observed in the sequential paradigm. This demonstrates that a latent modeling approach can be used to investigate AI-assisted decision-making. The reliance strategy estimated by the model showed that participants relied on the AI discriminatively and varied their reliance from problem to problem. Participants were more likely to rely on the AI if they were less confident in their own decisions or when the AI was relatively confident. In addition, participants relied more heavily on AIs that were more accurate overall. This finding is consistent with Liang et al. (2022), who showed that people rely on AI assistance more when the task is difficult and when they are given feedback about their own performance and the AI’s performance.

The results also showed that participants were able to build very effective reliance strategies compared to the optimal reliance strategy. We believe that participants were able to achieve this for the following reasons. First, this is a simple image classification task, and most people are experts at identifying everyday objects in images. This enables people to have a good understanding of their own expertise and confidence for any presented image. Second, in our experiment, people received feedback after each trial, which gave them the opportunity to learn about the AI assistant’s accuracy and confidence calibration. This feedback allowed people to build reasonable mental models of the AI assistant when paired with any of the three classifiers.

Finally, our results showed that the concurrent and sequential AI assistance paradigms led to comparable accuracy. Some researchers have argued that the sequential paradigm is superior to the concurrent paradigm because the initial unassisted prediction encourages independent reflection, which could lead to retrieval of additional problem-relevant information (Green and Chen, 2019). However, consistent with our study, other studies have found no difference in overall performance between the concurrent and sequential paradigms (Buçinca et al., 2021). Another factor that could be relevant is the timing of AI assistance. The AI advice can be presented after some delay, which provides the decision-maker with additional time to reflect on the problem and improve their own decision-making accuracy (Park et al., 2019). Another possibility is to vary the amount of time available for people to process the AI prediction after it is shown, making it more likely that people detect AI errors (Rastogi et al., 2022). Overall, more research is needed to understand the effects of soliciting independent human predictions and of varying the timing of the AI recommendation.

Our empirical and theoretical work comes with a number of limitations. First, we provided trial-by-trial feedback to help participants build a suitable mental model of AI performance. However, feedback is not always available in real-world scenarios (Lu and Yin, 2021). Future research should investigate extensions that model the cognitive process when participants receive no feedback at all or receive feedback only after a delay. Second, while the cognitive model captured the general process of advice taking based on a latent reliance policy, it did not model the process of establishing the reliance policy over time. Therefore, one important model extension, which we leave for future research, is to model trial-by-trial adjustments of the reliance policy as a function of participants’ prior beliefs about the accuracy of AI algorithms, external signals of AI confidence and accuracy, and internally generated confidence signals.