1 Introduction

Since the 1970s, the use of Risk Assessment Instruments (RAIs) in high-stakes contexts such as medicine or criminal justice, together with their risks and benefits, has been a subject of debate across various disciplines. RAIs may increase the accuracy, robustness, and efficiency of decision making (Kleinberg et al. 2018); however, they can also lead to biased decisions and, consequently, to discriminatory outcomes (Angwin et al. 2016; Skeem et al. 2016). Understanding the performance of a RAI requires looking beyond the statistical properties of a predictive algorithm and considering the quality and reliability of the decisions made by humans using the RAI (Green 2021). This is because high-stakes decisions are rarely made by algorithms alone, and humans are almost invariably ‘in the loop’, i.e., involved to some extent in the decision-making process (Binns and Veale 2021). Indeed, the General Data Protection Regulation (GDPR) in Europe gives data subjects the right ‘not to be subject to a decision based solely on automated processing’ (Article 22), and the proposed Artificial Intelligence Act, in the draft published in April 2021 and approved by the European Parliament in 2024, considers criminal risk assessment a ‘high-risk’ application subject to stringent human oversight.

Fig. 1 Sequence of studies and number of participants. The figure shows the three experimental studies (one targeted and two crowdsourced) and the focus groups conducted during our study. Additional studies were conducted afterwards by changing the statements about violent recidivism

Our work involves a sequence of studies outlined in Fig. 1 and described in the next sections. We develop and test different user interfaces of a machine learning version of RisCanvi, the main RAI used by Catalonia’s criminal justice system. We ask participants to predict the re-incarceration risk based on the same factors used by RisCanvi, such as criminal history, which empirically affect the recidivism risk of individuals. Some participants are additionally shown the risk that the RAI predicts using the same factors. Our primary goal is to assess how the interaction with the studied RAI affects human predictions, their accuracy, and participants’ willingness to rely on a RAI for this task.

Like most previous studies on this topic, we partially rely on crowdsourced participants (Dressel and Farid 2018; Grgic-Hlaca et al. 2019; Green and Chen 2020; Fogliato et al. 2021). Controlled in-lab/survey experiments and crowdsourced experiments share the limitation that participants do not face the real-world consequences that professional decisions have on the lives of inmates. In addition, untrained crowdworkers may exhibit different decision-making behaviour than trained professionals. The former limitation can only be addressed through studies that analyze the real-world adoption of a RAI through observational methods (Berk 2017; Stevenson 2018; Stevenson and Doleac 2021). However, such studies usually face the difficulty of isolating the effect of RAI adoption from other changes that co-occur in the study period. The latter limitation can be addressed in an experimental setting by recruiting professional participants, as we do in this paper. To the best of our knowledge, most studies focus on crowdsourced participants; this might be the first study that, in addition to a crowdsourced study, runs a targeted study whose results are supported by a validation from focus groups. We recruited students and professionals of data science as well as domain experts (with a background in criminology and social work), including people who work within Catalonia’s criminal justice system and use RisCanvi in a professional capacity. Finally, we conducted a qualitative study with small sub-groups of the targeted user study, particularly professionals within the Justice Department of Catalonia, as well as data scientists. Our main contributions are:

  • We confirm previous results that show how accuracy in decision making slightly improves with algorithmic support, how participants adjust their own predictions in the direction of the algorithmic predictions, and how different scales in risk communication yield different levels of accuracy.

  • We describe differences between targeted participants and crowdsourced workers. Despite identical experimental conditions and tasks, we find that predictions differ between these groups, and that targeted participants outperform crowdsourced participants in terms of accuracy.

  • We provide insights from our focus groups into how professionals use RAIs in real-world applications. Our interviewees do not foresee using a fully automated system for criminal risk assessment, but they see benefits in using algorithmic support for training and standardization, and for fine-tuning and double-checking particularly difficult cases.

The remainder of this paper is structured as follows. First, we give an overview of related work (Sect. 2). Next, we describe our approach, including the variables of our study (Sect. 3), as well as the materials and procedures we employ (Sect. 4). We present the experiment setup for crowdsourced and targeted participants (Sect. 5), and the obtained results from both groups (Sect. 6). Then we present the results from the focus groups (Sect. 7). Finally, we discuss our findings (Sect. 8), as well as the limitations of this study and possible directions for future work (Sect. 9).

2 Related work

2.1 Risk assessment instruments (RAI) for criminal recidivism

Law enforcement systems increasingly use statistical algorithms, e.g., methods that predict the risk of arrestees to re-offend, to support their decision making (Goel et al. 2019; Chiusi et al. 2020). RAIs for criminal recidivism risk are in use in various countries including Austria (Rettenberger et al. 2010), Canada (Kröner et al. 2007), Germany (Dahle et al. 2014), Spain (Andrés-Pueyo et al. 2018), the U.K. (Howard and Dixon 2012), and the U.S. (Desmarais and Singh 2013). There are ethical and legal aspects to consider, as algorithms may exhibit biases, which are sometimes inherited from the data on which they are trained (Barocas and Selbst 2016). However, some argue that RAIs bear the potential for considerable welfare gains (Kleinberg et al. 2018). The literature shows that decisions based on RAIs’ scores are never made by an algorithm alone. Decisions in criminal justice are made by professionals (e.g., judges or case workers) (Bao et al. 2021), sometimes using RAIs (Stevenson and Doleac 2021). Consequently, algorithms aimed at supporting decision processes, especially in high-risk contexts such as criminal justice, cannot be developed without taking into account the influence that institutional, behavioural, and social aspects have on the decisions (Selbst et al. 2019). Furthermore, human factors such as biases, preferences, and deviating objectives can also influence the effectiveness of algorithm-supported decision making (Jahanbakhsh et al. 2020; Mallari et al. 2020). Experienced decision makers may be more inclined to deviate from an algorithmic recommendation, relying more on their own cognitive processes (Green and Chen 2020). Moreover, trained professionals, such as probation officers, may prefer to rely on their own judgment rather than on a single numerical RAI prediction. Any additional information that they consider may be used as a reason to deviate from what a RAI might recommend for a case (McCallum et al. 2017). There are other reasons why humans disagree with an algorithmic recommendation. For instance, the human’s objectives might be misaligned with the objective for which the algorithm is optimized (Green 2020), or the context may create incentives for the human decision maker not to follow the algorithm’s recommendation (Stevenson and Doleac 2021). Sometimes humans are unable to evaluate their own performance or that of the risk assessment, and engage in ‘disparate interactions’, reproducing the algorithm’s biased predictions (Green and Chen 2019). Another reason could be algorithm aversion, e.g., human decision makers may discontinue the use of an algorithm after observing it make a mistake, even if the algorithm is on average more accurate than they are (Dietvorst et al. 2015; Burton et al. 2020). In contrast, controlled user studies in criminal risk assessment indicate that crowdsourced participants tend to exhibit automation bias, i.e., a tendency to over-rely on the algorithm’s prediction (Dressel and Farid 2018; Bansak 2019).

Effective human-algorithm interaction depends on users’ training with the tool, on the experience of the human decision maker with the algorithm, and on the specific professional domain in which the decision is made. Therefore, some researchers have studied the impact of the adoption of RAIs in criminal justice decision-making in real-world applications (Berk 2017; Stevenson 2018; Stevenson and Doleac 2021). These observational studies yield valuable insights, but the conditions of adoption as well as the design of the RAI cannot be controlled, making it difficult to isolate the effect of the RAI on the studied outcome.

2.2 Controlled user studies and interaction design of RAIs

Algorithm-supported human decision making has also been studied in controlled experiments (Dressel and Farid 2018; Green and Chen 2019, 2020; Grgić-Hlača et al. 2019; Lin et al. 2020; Fogliato et al. 2021). Among these, an influential study by Dressel and Farid (2018) showed how crowdsourced users recruited from Amazon Mechanical Turk (AMT) were able to outperform the predictions of COMPAS, a RAI that has been subject to significant scrutiny since the seminal work of Angwin et al. (2016). Follow-up studies criticized Dressel and Farid’s study, noting that participants were shown the ground truth of each case (i.e., whether or not the person actually recidivated) immediately after every prediction they made, which does not correspond to how these instruments are used in practice. Without this feedback, human predictions that were not supported by algorithms performed worse than the algorithm under analysis (Lin et al. 2020).

The way risk assessments are communicated and integrated in the decision process plays a crucial role in the quality of the predictions. For instance, criminal forensics clinicians have a preference for (non-numerical) categorical statements (such as ‘low risk’ and ‘high risk’) over numerical risk levels. However, an experimental survey showed that a RAI providing numerical information elicits better predictive accuracy than one using categorical risk levels (Zoe Hilton et al. 2008). One issue with categorical expressions is that professionals tend to disagree about the limits of the categories and how these categories map to different numerical risk estimates (Hilton et al. 2015). However, numerical expressions introduce other challenges. For instance, participants in a study perceived violence risk as higher when the risk was presented in a frequency format instead of a percentage format (Hilton et al. 2015). Another question is whether numerical risks should be presented on an absolute or a relative scale. A study with clinicians showed that participants hardly distinguish between the absolute probability of violence and comparative risk (Zoe Hilton et al. 2008). Furthermore, besides showing only risk levels, risk assessments could include additional information about the nature of the crime, the factors of the RAI, and other factors that may have preventive effects on future re-offense (Heilbrun et al. 1999). Complementary and graphical information can improve the understanding of risk evaluations (Hilton 2017). However, it can also increase the overestimation of risk factors while other contextual information is ignored (Batastini 2019). Nevertheless, the use of different visualization methods remains largely unexplored.

Given the experience from previous work, we build our user interface to test and measure the performance of participants using different categorical risk levels and numerical expressions for risk, specifically absolute and relative risk scales. We conduct a recidivism prediction experiment with crowdsourced participants, but also complement it with targeted participants. One of the main novelties of our study resides in assessing how targeted participants, including domain experts and data scientists, perform differently than crowdsourced participants. Another key difference of this study with respect to previous work (Dressel and Farid 2018; Green and Chen 2019, 2020; Grgić-Hlača et al. 2019; Lin et al. 2020; Fogliato et al. 2021) is that we include experts and conduct focus groups to validate our findings and to understand participants’ rationale throughout their decision-making process. Additionally, focus groups and interviews with professionals provided valuable insights into how RAIs are perceived and used in practice.

3 Approach and research questions

This paper takes an experimental approach. Participants in our experiments are asked to determine the probability that an inmate will be re-arrested, based on a list of criminologically relevant characteristics of the case. We focus on three main outcome variables (Sect. 3.1): the accuracy of predictions, the changes that participants make to their predictions when given the chance to revise them after seeing the RAI’s recommendation, and their willingness to rely on RAIs. The main independent variables (Sect. 3.2) are the background of the participants, and the type of risk scale used. Our research questions (Sect. 3.3) are about the interaction of these variables.

3.1 Outcome variables

3.1.1 Predictive accuracy

The performance of predictive tools, including RAIs, is often evaluated in terms of the extent to which they lead to correct predictions. Due to the inherent class imbalance in this domain, as most people do not recidivate, most studies, including experimental ones (Dressel and Farid 2018; Harris et al. 2015; Green and Chen 2019), do not use the accuracy metric, which is the probability of issuing a correct prediction. Instead, it is more common to measure the area under the receiver operating characteristic curve (AUC-ROC or simply AUC). The AUC can be interpreted as the probability that a randomly drawn recidivist obtains a higher score than a randomly drawn non-recidivist.
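This rank-based interpretation can be computed directly. The sketch below is our own illustration (not the study’s code) on synthetic labels and scores; it compares the pairwise-ranking estimate of the AUC with scikit-learn’s implementation.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)            # 1 = recidivated, 0 = did not
y_score = rng.random(200) * 0.5 + y_true * 0.3   # toy risk scores, mildly informative

def auc_by_pairs(y, s):
    """AUC as the probability that a recidivist outranks a non-recidivist (ties count 1/2)."""
    pos, neg = s[y == 1], s[y == 0]
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

print(auc_by_pairs(y_true, y_score))   # pairwise-ranking estimate
print(roc_auc_score(y_true, y_score))  # same value from scikit-learn
```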

3.1.2 Prediction alignment with the RAI

In this work, we observe users’ reliance on the algorithmic support system indirectly by looking at changes in their predictions after observing an algorithmic prediction. We assume that if users change their initial predictions to align them with those of a RAI, they are implicitly signaling more reliance on that RAI than on their initial prediction. In general, the extent to which people are willing to trust and rely on a computer system is related to people’s engagement and confidence in it (van Maanen et al. 2007; Chancey et al. 2017; Lee and See 2004), and in the case of predictive algorithms, to their perceived and actual accuracy (Yin et al. 2019). Different types of information disclosure can elicit different levels of trust and reliance (Du et al. 2019). Performing joint decisions, i.e., being the human in the loop (De-Arteaga et al. 2020), can increase willingness to rely on a system (Zhang et al. 2020).

3.1.3 Preferred level of automation

The experience of interacting with an algorithm-based RAI may also affect the acceptability of similar algorithms in the future. Algorithm-based RAIs may operate in ways that differ by their level of automation (Cummings 2004). At the lowest level of automation, the human makes all decisions completely disregarding the RAI; at the highest level of automation, the RAI makes all decisions without human intervention; intermediate levels represent various types of automated interventions. In general, the level of automation chosen by a user should be proportionate to the performance of the automated system. Both algorithm aversion, or under-reliance (Burton et al. 2020), and automation bias, or over-reliance (Mosier et al. 1998), negatively affect the predictive accuracy of users.

3.2 Participant groups and conditions

In this section we describe the main independent variables that we tested in the experiments.

3.2.1 Participant’s educational and professional background

Most user studies on recidivism risk prediction rely on crowdsourced participants from online platforms. The background of participants may change the way they interact with a RAI. Data scientists and statisticians have training in statistics, probability, and predictive instruments. Domain experts with a background in psychology or criminology, or who work within the prison system, have a deeper knowledge of the factors driving criminal recidivism. Additionally, domain experts who use RAIs receive training on their usage, and they often have a fair amount of training in applied statistics.

Naturally, in real-world applications the decisions of case workers are far more consequential than those faced by crowdworkers in lab-like decision scenarios. Similar to previous work (Green and Chen 2019; Cheng et al. 2019; Yu et al. 2020; Dressel and Farid 2018), we add an incentive (in the form of a bonus payment) for correct predictions in the crowdsourced studies. However, this is to encourage appropriate effort, not to simulate a high-stakes scenario.

We consider three participant groups: (1) crowdsourced workers from unspecified backgrounds, (2) students and practitioners of data science, and (3) students of criminology and people with expertise in the prison system. Recruitment procedures are described in Sect. 4.3.

3.2.2 Risk scales

The literature on risk communication suggests that both numerical and categorical information are useful for different purposes (Zoe Hilton et al. 2008; Jung et al. 2013; McCallum et al. 2017; Storey et al. 2015). Categories alone can be misleading when similar cases are assigned to different categories despite only small differences in their risk (Jung et al. 2013). In our research, we initially used only a categorical scale, but then switched to scales that combine both categorical and numerical values; further, we test two different types of numerical scales. The first scale is based on the probability of recidivism, which we denote the ‘absolute scale’ as it expresses a probability in absolute terms. The second scale is based on quantiles of the risk distribution in the data, and we call it the ‘relative scale’ since it is relative to the risk levels of other cases in the data. We also use five categories, for easier comparison with the absolute scale.

Fig. 2 Risk scales used in our experiments (left: absolute scale, right: relative scale)

Both scales are depicted in Fig. 2. Other elements in that figure are discussed in the following sections.
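To make the distinction concrete, the sketch below is purely illustrative: the five-category cut-offs and the reference risk distribution are hypothetical placeholders, not RisCanvi’s actual bands. It maps a predicted recidivism probability onto an absolute scale (bands on the probability itself) and onto a relative, quantile-based scale (bands on the risk distribution in the data).

```python
import numpy as np

LABELS = ["very low", "low", "medium", "high", "very high"]

def absolute_category(prob, cutoffs=(0.1, 0.2, 0.35, 0.5)):
    """Absolute scale: bands defined directly on the probability of recidivism."""
    return LABELS[int(np.searchsorted(cutoffs, prob))]

def relative_category(prob, reference_scores):
    """Relative scale: bands defined by quantiles of the risk distribution in the data."""
    quintile_edges = np.quantile(reference_scores, [0.2, 0.4, 0.6, 0.8])
    return LABELS[int(np.searchsorted(quintile_edges, prob))]

reference = np.random.default_rng(1).beta(2, 6, size=1000)  # synthetic risk scores
print(absolute_category(0.37))             # e.g. "high" under the placeholder cut-offs
print(relative_category(0.37, reference))  # depends on where 0.37 falls among peers
```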

3.2.3 Additional variables

Many additional variables could have been included, but we were mindful of survey length and wanted to minimize survey drop-out. We included three additional variables: numeracy, decision-making style, and current emotional state. Numeracy is the ability to understand and manage numerical expressions. The decision confidence and the type of information that professionals rely on when using RAIs depend on their numerical proficiency (Scurich 2015). Ideally, professionals working with RAIs should have a fairly high level of numerical literacy, as interpreting RAIs requires an understanding of probabilities, which is not common knowledge. Other factors that have been shown to affect people’s decision-making behaviour are their decision-making style and current emotional state (Beale and Peter 2008; Lee and Selart 2012).

3.3 Research questions

Based on the variables we have presented, we pose the following research questions:

  • RQ1: Under which conditions do participants using a RAI to predict recidivism achieve the highest predictive accuracy?

  • RQ2: To what extent do participants rely on the RAI to predict recidivism?

4 Materials and methods

In this section, we describe the materials (Sect. 4.1) for our user study which consist of a risk prediction instrument based on RisCanvi (Sect. 4.1.1) and a selection of cases used for assessment (Sect. 4.1.2). Next, we present a description of the procedure followed by participants (Sect. 4.2), and the way in which they were recruited (Sect. 4.3).

4.1 Materials

4.1.1 RisCanvi

RisCanvi is one of several risk assessment tools used by the Justice Department of Catalonia since 2010 (Andrés-Pueyo et al. 2018). This tool is applied multiple times during an inmate’s time in prison; in most cases, once every six months.

RisCanvi consists of 43 items that are completed by professionals based on an inmate’s record and suitable interviews. Then, a team of professionals (with some overlaps with the various interviewers) makes a decision based on the values of the items and the output of RisCanvi’s algorithm. RisCanvi’s algorithm predicts the risks of four different outcomes: committing further violent offenses (violent recidivism), violence in the prison facilities towards other inmates or prison staff, self-injury, and breaking of prison permits.

We focus on violent recidivism, which is computed based on 23 of the 43 risk factors as used in RisCanvi, including criminal/penitentiary record, biographical factors, family/social factors, clinical factors, and attitude/personality factors.

The original RisCanvi uses integer coefficients determined by a group of experts; instead, we use a predictor of violent recidivism created using logistic regression, which has a better AUC (0.76) than the original RisCanvi (0.72) and is more accurate than models created using other machine learning methods such as random forests or neural networks (Karimi-Haghighi and Castillo 2021). This is done to reduce any effects of potential shortcomings of RisCanvi originating from its hand-picked integer coefficients, by instead using a state-of-the-art predictor based on the same items. In consultation with RisCanvi’s creators, we kept exactly the same features it uses, which are not only the result of a statistical analysis, but also of recommendations from a group of experts formed when RisCanvi was being developed. When training the logistic regression, we observed the effect of model multiplicity, where multiple models have similar accuracy (Black et al. 2022).
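A minimal sketch of this setup is shown below; it is our own illustration rather than the authors’ code, and the file name and column names are placeholders for the 23 RisCanvi items and the violent-recidivism label.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score

# Hypothetical file: one column per RisCanvi item plus the violent-recidivism label.
df = pd.read_csv("riscanvi_items.csv")
X = df.drop(columns=["violent_recidivism"])
y = df["violent_recidivism"]

model = LogisticRegression(max_iter=1000)
# Cross-validated probabilities avoid scoring the model on its own training data.
scores = cross_val_predict(model, X, y, cv=5, method="predict_proba")[:, 1]
print("AUC:", roc_auc_score(y, scores))  # the paper reports 0.76 for its logistic model
```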

4.1.2 Cases

In this study, we use a dataset of cases used in previous work (Karimi-Haghighi and Castillo 2021). It consists of the last RisCanvi protocol items for the inmates released between 2010 and 2013, and for whom recidivism was evaluated by the Department of Justice of Catalonia. Upon recommendation of our Ethics Review Board, we do not show participants the data of any individual inmate, but instead create semi-synthetic cases using a cross-over of cases having similar features and similar risk levels (for details, see Appendix 2).

We selected 14 cases which contain a mixture of recidivists and non-recidivists, combining cases in which the majority of humans make correct predictions and cases in which they tend to err, and cases in which the algorithm makes a correct prediction and cases in which it errs. In our first crowdsourcing experiment (referred to as R1 in the following) we observed that these cases were not representative of the performance of the algorithm on the overall dataset. Hence, for the second crowdsourcing experiment (R2 in the following) we exchanged 2 cases to bring the AUC from 0.61 to 0.75 which is closer to the AUC of the algorithm on the original data (0.76). Out of the 14 cases, 3 were used as examples during the ‘training’ phase of the experiments, while participants were asked to predict recidivism for the remaining 11 cases. All participants evaluate the same 11 cases, but in randomized order.

4.2 Procedure

The study obtained the approval of our university’s Ethics Review Board in December 2020. All user studies were conducted between December 2020 and July 2021, and were done remotely due to the COVID-19 pandemic. The survey is designed to be completed in less than 30 min and uses an interface hosted on our university’s server, created using standard web technologies (Python and Flask). The survey is structured as follows:

4.2.1 Landing page and consent form

The recruitment (Sect. 4.3) leads potential participants from different groups to different landing pages, which record which group the participant belongs to. There, participants learn about the research and we ask for their explicit consent to participate.

4.2.2 Demographics and additional variables

Consenting participants are asked three optional demographic questions: age (range), gender, and educational level. Then, three sets of questions are asked to capture the following additional variables (described in Sect. 3.2.3):

- Numeracy: We use a test by Lipkus et al. (2001), which has been used in previous work (Hilton 2017). It consists of three questions about probabilities, proportions, and percentages, such as ‘If a fair die is rolled 1,000 times, how many times will it come up even (2, 4, or 6)?’ (Answer: 500). We measure ‘numeracy’ as the number of correct answers (0 to 3).

- Decision making style: The General Decision Making Style (GDMS) (Scott and Bruce 1995) is a well known survey that identifies five types of individual decision making style: rational, intuitive, dependent, avoidant, and spontaneous.

- Current emotional state: We used a Visual Analogue Scale (VAS) to account for seven emotional states (happiness, sadness, anger, surprise, anxiety, tranquility, and vigor). This scale has been used in previous work (Portela and Granell-canut 2017).

4.2.3 Past experience and attitudes towards RAIs

Participants are asked about their knowledge of and experience with RAIs, as well as what they consider the three most determining features for predicting recidivism, out of the ones used by RisCanvi. The final question of this part asks about the level of automation they would prefer for determining the risk of recidivism (see Appendix 1).

4.2.4 Training

The training part consists of the risk assessment of three cases (two non-recidivists and one recidivist). The purpose of this part is to prepare participants for the actual evaluation part and to calibrate their assessments against a ground-truth reference. Therefore, unlike in the actual evaluation tasks, participants are shown the ground truth (recidivism or no recidivism) after each case.

4.2.5 Evaluation tasks

The evaluation tasks are the core part of the study and ask participants to predict the probability of violent recidivism for eleven cases. Participants see a list of 23 items that are used by RisCanvi to predict violent recidivism (see Appendix 3 for an illustrated reference), and they are asked to select a number, which can be a recidivism probability or a risk level, depending on the condition (see Fig. 2). Additionally, they are asked to select from the list of items the three items that they considered most important in their evaluation, and to indicate their confidence in their prediction on a 5-point scale.

Participants in the control group are shown just one screen per case to enter their prediction, while participants in a treatment group are shown a second screen for each case, displaying the algorithm’s prediction. This second screen also shows participants their initial prediction for comparison, and allows them to optionally change it. In both screens, participants indicate the confidence in their prediction before continuing.

4.2.6 Closing survey

The experiment ends with a final questionnaire and an evaluation of the entire process. This questionnaire repeats some of the questions asked at the beginning, such as the preferred level of automation, the emotional state, and the three features participants consider most important in predicting recidivism. Additionally, participants can leave a comment or feedback about the study.

4.3 Participant recruitment

Table 1 Demographics by study group

A summary of the participants’ demographics is shown in Table 1. The crowdsourced study consisted of three rounds (R1, R2 and R3), for which we recruited participants via Prolific. We selected residents of Catalonia, between 18 and 75 years old, and with more than 75% successful completion of other studies on the platform (a parameter suggested by Prolific as a quality-assurance method). Participants were paid a platform-standard rate of 7.5 GBP per hour for participating in the survey. They took an average of 20−25 min to complete the survey. Additionally, we offered a bonus payment of 1 GBP to those who achieved an AUC greater than 0.7. This is common practice and incentivizes conscientious completion of the survey (see, e.g., Green and Chen (2019), Cheng et al. (2019), Yu et al. (2020), Dressel and Farid (2018)).

For the targeted studies, participants were recruited through students’ mailing lists from two universities in Catalonia, as well as social media groups of data science professionals in countries having the same official language as Catalonia. Additionally, we invited professionals from the Justice Department of Catalonia to participate; the invitation was issued by their Department of Research and Training and the Centre for Legal Studies and Specialised Training of the Catalan regional government.

As a reference, the number of participants in previous crowdsourced user studies is usually a few hundred: 103 in Grgic-Hlaca et al. (2019), 202 in Cheng et al. (2019), 400 in Lin et al. (2020), 462 in Dressel and Farid (2018) and 600 in Green and Chen (2019).

In line with previous studies, we had 609 participants in total (541 crowdsourced and 68 targeted). Nevertheless, we performed a power analysis (for an independent-samples t-test) to assess our sample size. The analysis used alpha = 5%, power = 95%, and an effect size of 0.5, i.e., a ‘medium’ effect size. A power of 95% is obtained with a sample size of 88 participants per group (176 in total). These numbers are in line with the minimum number of participants in previous studies.
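The reported figure can be reproduced with a standard power calculation; the sketch below is our own illustration and assumes a one-sided independent-samples t-test (a two-sided test would require roughly 105 participants per group under the same parameters).

```python
# Sketch of the power analysis described above, assuming a one-sided
# independent-samples t-test with alpha = 0.05, power = 0.95, effect size d = 0.5.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.95,
                                    alternative="larger")
print(round(n_per_group))  # ~88 participants per group, i.e., 176 in total
```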

5 Participants and experimental setup

Throughout our study, we designed experiments with different kinds of participants (crowdsourced participants and targeted participants with specific knowledge). For brevity and clarity, we use a naming convention for the experimental groups, explained in Table 2, where the different groups can be compared in terms of their experimental setup.

Table 2 Naming for different experimental groups per type of treatment

5.1 Crowdsourced: first round (R1)

In the first round (R1) we compared two treatment groups, which were shown the machine prediction, against a control group, which was not. In treatment group G1, machine predictions are shown only as categorical information, while in G2 machine predictions are shown as categorical and numerical information. In this round, 247 participants completed the evaluation: 48 in the control group, 100 in treatment group G1, and 99 in treatment group G2. Additionally, 74 participants were excluded, either because they did not complete the survey or did not evaluate all of the eleven cases, or because they finished the experiment either too quickly (less than five minutes) or too slowly (more than one hour).

As described in Sect. 4.1.2, we used in R1 a set of cases for which the AUC of the machine predictions was 0.61. To bring this more in line with the observed AUC in the entire dataset (0.76), we exchanged two cases for the second round (R2), and the AUC measured on the new set of cases became 0.75.

5.2 Crowdsourced: second round (R2)

In the second round (R2) we again compared two treatment groups, which were shown the machine prediction, against a control group, which was not. In treatment group G1, machine predictions are shown on an absolute scale as categorical and numerical information (similar to R1G2), while in G2 we introduce machine predictions shown on a relative scale as categorical and numerical information. In this round, 146 participants completed the evaluation: 17 in the control group, 66 in treatment group G1, and 63 in treatment group G2. Additionally, 137 participants were excluded for the same reasons as in R1.

5.3 Crowdsourced: third round (R3)

In the third round (R3) we compared the same experimental groups as in R2 (G1 and G2), with the purpose of evaluating an iteration of the same interface that explicitly states on all screens that participants are evaluating violent recidivism (see Appendix 3). In this round, 148 participants completed the evaluation: 17 in the control group, 66 in treatment group G1, and 65 in treatment group G2.

5.4 Targeted study

The targeted study seeks to establish the effect (if any) of the participant’s background when interacting with the RAI. We used the same experimental setup and treatment groups as in the crowdsourced round R2. Due to the limited number of participants, we used the control group of R2 as a baseline for the targeted groups. We use the name ‘Targeted’ to refer to participants with specific professional expertise. We considered both students and professionals with a background either in data science, or in a field relevant to the prison system and the application of RisCanvi, such as psychology, criminology, or social work. For data science, we recruited 14 students at the undergraduate and graduate level, and 11 professionals. For a domain-specific background, we recruited 4 students at the graduate level (Master in Criminology students), and 25 professionals. An additional group of 14 professionals participated (the TargetV group) to contrast the crowdsourced R3 setting. All professionals were recruited through the Justice Department of Catalonia, and the samples for both targeted groups were drawn from the same population.

6 Results

In this section, we present our main findings. The main takeaways are:

  1. Human predictive accuracy improves after seeing the RAI’s suggestion.

  2. The improvement in accuracy is also visible over time, after evaluating several cases.

  3. Participants disagree on the relative importance of risk factors; this is validated qualitatively in the professionals’ focus groups.

  4. Acceptance of full automation is limited: all participants foresee some degree of automation, but with a clear preference for human discretion.

  5. Categorical scales are preferred over numerical scales, and result in higher human predictive accuracy.

6.1 Predictive accuracy

6.1.1 Accuracy

Fig. 3 Average AUC with 95% confidence interval by group. See Table 7 in Appendix 3 for details

Figure 3 shows the average AUC and corresponding confidence intervals for each experimental group. AUC values depicted in the figure can also be found in Appendix 3, Table 7.

Given the relatively small sample sizes in the experimental data, we test the statistical significance of the differences between the experimental groups using a permutation t-test with 999 permutations (see Table 9). For R1 we observe no difference in the initial predictions across control and treatment groups, which have AUCs from 0.58 to 0.60. However, for R2 we find a significant difference (\(p<0.1\)), with a higher AUC for the control group (0.65) than for the initial predictions of treatment group G1 (0.58) with the absolute scale.
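For reference, the following sketch shows a generic permutation test of the difference in mean AUC between two groups with 999 permutations; it is our own illustration, and the per-participant AUC arrays are synthetic placeholders rather than the study’s data.

```python
import numpy as np

def permutation_test(a, b, n_permutations=999, seed=0):
    """Two-sided permutation test for the difference in means between groups a and b."""
    rng = np.random.default_rng(seed)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)                      # random relabeling of group membership
        diff = pooled[:len(a)].mean() - pooled[len(a):].mean()
        if abs(diff) >= abs(observed):
            count += 1
    return (count + 1) / (n_permutations + 1)    # p-value with the usual +1 correction

# Hypothetical per-participant AUC values for a control and a treatment group.
auc_control = np.random.default_rng(1).normal(0.65, 0.10, 48)
auc_treated = np.random.default_rng(2).normal(0.58, 0.10, 66)
print(permutation_test(auc_control, auc_treated))
```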

Despite the small number of participants in the targeted group, we observe important differences compared to the previous groups. The predictive accuracy of the initial predictions is higher (+0.02 to +0.09 AUC points) than that of any crowdsourced group. For the targeted group G1 (absolute scale), this difference is significant at \(p<0.1\) against R2’s G2, and even at \(p<0.05\) against the initial predictions of the other crowdsourced groups (see Appendix 3). Participants from a data science background and domain experts have similar initial AUCs (see Table 7 in the Appendix).

We acknowledge a lower AUC for the initial predictions of both TargetV treatment groups. Given the reduced number of participants (N = 14), we do not consider this difference important; such results are inherently noisy for small samples, hence the large standard deviation observed. For instance, the R3 results lie within the error bars of TargetV (see Fig. 3). Moreover, the average AUC of the Targeted and TargetV groups together is 0.64 for the initial predictions and 0.71 for the revised ones, which is still higher than the results obtained in R2 and R3.

The resulting AUC is comparable to those of previous forensic studies, which achieved AUCs in the range of 0.65−0.78 on average using non-algorithmic RAIs (Desmarais et al. 2016; Douglas et al. 2003; Singh et al. 2011).

6.1.2 Self-reported importance of risk items

Having asked participants to select the top three items (risk factors) they considered in their risk prediction, we find that crowdsourced and targeted participants tend to select the same 10–11 (out of 23) items as more important than the rest. However, among these top 10 items, we find that domain experts prefer dynamic factors (i.e., factors that can change), such as ‘limited response to psychological treatment’, while data scientists and crowdsourced participants refer more often than domain experts to static factors (i.e., factors that cannot change), such as ‘history of violence’ (details are in Fig. 12 and Table 14 in Appendix 7).

6.2 Prediction changes due to the RAI

The observed probability of a participant changing a prediction after observing the machine prediction is 20% (19% in G1 and 21% in G2). Crowdsourced participants revised their prediction in about 26% of the cases they examined (27% in G1, 25% in G2). Domain experts revised their prediction in 37% of the cases, and data scientists in only 13% of the cases.

Fig. 4 Average difference between human and algorithm prediction by case order, absolute scale

Figure 4 shows the average difference between human and algorithm risk predictions for each case. Target and TargetV participants started with predictions that, in general, differed more from the machine predictions than those of the crowdsourced groups. As they progress through the evaluation tasks, participants tend to align their predictions (even the initial ones) more and more with the machine predictions, and the difference between initial and revised predictions diminishes. For the last three cases, the predictions of the R2 crowdsourced and Target groups, which are already close to the machine predictions, do not change, while the R3 and TargetV groups maintain a larger difference between initial and revised predictions. We also acknowledge a deviation in the last cases for TargetV, which might be explained by a lack of attention and could be the cause of the reduced AUC in our results. Nevertheless, this is only an interpretation of the plots and should be contrasted with more evidence.

In general we see that, when revising their predictions, participants improve their accuracy (Fig. 5). Comparing the average AUCs in Fig. 3 and Table 7, revised predictions from the crowdsourced groups tend to be more accurate than the initial ones in terms of AUC. This difference is significant for R1’s G2 (\(p<0.1\)) and for R2’s G1 (\(p<0.05\)), as shown in Table 9 (in Appendix 3). For the Target group, we see an improvement of +0.05 to +0.08 AUC points on average, while for TargetV the difference is much bigger (+0.03 to +0.15, respectively). In almost all cases, revised predictions by the treatment groups are more accurate than those of the control groups. However, few of these differences are statistically significant.

In general, the average self-reported confidence is in the range 3.5–3.9 out of 5.0 (1.0 = least confident, 5.0 = most confident), and essentially does not change from the initial to the final prediction. The self-reported confidence of crowdworkers is, by a small but statistically significant margin (\(p<0.001\)), higher than that of targeted participants (see Appendix 5).

6.3 Preferred level of automation

Fig. 5 AUC of participant predictions before and after algorithmic support, for participants who received algorithmic support (excludes control group). The p-value of the permutation t-test (999 permutations) is \(\ll 0.0001\)

Fig. 6 Distribution of answers about level of automation for participants who received algorithmic support (excludes control group). The p-value of the permutation t-test (999 permutations) is \(< 0.01\)

As shown in Fig. 6, most participants prefer an intermediate level of automation, between levels 2 and 4 on a 5-level scale. While data scientists initially accepted a broader range of automation levels (levels 1–4), domain experts limited their answers to a narrower set of intermediate choices (levels 2–3). The same figure also shows that most of the treated groups reduce their preferred level of automation after the experiment, meaning they prefer more expert involvement and less reliance on machine predictions.

On average, however, the preferred level of automation for the targeted groups concentrated in the lower-middle part of the scale: 32% of the data scientists and 48% of the domain experts selected level 3 (‘the computational system suggests one option, but the expert can provide an alternative option’). Level 2 (‘the computational system suggests some options, but it is the expert who defines the risk level’) was the option selected by 36% of the surveyed data scientists and 38% of the domain experts. Details can be found in Appendix 4, and the description of the automation levels can be found in Appendix 1.

6.4 Risk scales

6.4.1 Categorical versus numerical risk

According to Fig. 3, adding numerical values to the categorical scale barely changes the AUC. In G1, where only categorical information is shown, the AUC of the revised predictions is slightly higher than that of the revised predictions in G2, where categorical and numerical values are shown (0.62 against 0.61).

6.4.2 Categorical absolute versus relative scales

The results of R2 show that for the initial prediction, the absolute scale (G1) leads to slightly lower AUC compared to the relative scale (G2) (0.58 against 0.61 AUC). However, with algorithmic support, the absolute scale leads to higher AUC than the relative scale (0.67 against 0.62 AUC). Neither of these differences is statistically significant. Additionally, the average AUC of the R2 control group (0.65) is fairly high, and the only higher AUC observation is in the revised predictions using the absolute scale (0.67). The revised and some initial predictions of the targeted participants using the absolute scale significantly outperform all the R1 groups, as well as the R2 groups (\(p<0.05\), see Table 9 in Appendix 3).

6.4.3 Respondent characteristics

With respect to numeracy, over 60% of the crowdsourced participants answered 2 or 3 of the 3 test questions correctly. The targeted group had more respondents answering all 3 numeracy questions correctly than the crowdworkers, as shown in Table 1: 96% of data scientists obtained results in the highest scores (68% in the top score), while only 59% of domain experts obtained similar results (52% in the top score). We find no correlation between participants’ numeracy and their accuracy. Neither decision-making style nor emotional state correlates significantly with accuracy (results are in Appendix 6).

7 Qualitative study

The last study is a qualitative study using focus groups, i.e., groups of participants having a focused discussion on a particular topic (Morgan et al. 1998). The focus groups help us interpret the quantitative results from the targeted study, by listening to and learning from participants’ experiences and opinions.

7.1 Participants and procedure

Participants (9 women, 4 men) were recruited from the targeted experiment and, due to their busy schedules, divided into four groups (FG1–FG4) as follows: FG1 (N = 3), data scientists; FG2 (N = 4), domain experts, students of criminology at the undergraduate and master’s levels; FG3 (N = 2) and FG4 (N = 4), domain experts working with the Department of Justice, most of them psychologists.

While we did not want to impose too much structure on the conversation, so as to uncover new perspectives that we had not thought about, we did prepare a series of questions to stimulate discussion (available in Appendix 2). The questions address participants’ experience with algorithmic predictions and RAIs, their opinion about different scales and categorical/numerical presentation, their understanding of risk factors, and their desired level of automation. Each session lasted between 60 and 90 min and was held online. Following the protocol approved by our Ethics Review Board, participants were asked for their consent to participate and to have the meeting recorded and transcribed. The language of the focus groups was the local language of Catalonia; the quotes we present in the next section were taken from our transcriptions and paraphrased in English.

7.2 Findings

We focus on our research questions, but note that there were many other insightful comments during the focus groups.

7.2.1 Professional background

All participants were aware that some demographics are over- or under-represented among prison populations, and thus expected that a RAI trained on such data may lead to discriminatory outcomes. However, the way in which data science participants approached risk prediction was to a large extent based on considering a set of ‘anchors’ or prototypes (Scurich et al. 2012, p. 13): ‘I think about a maximum and a minimum risk. The minimum would be like a person who stole for the first time [...] the maximum would be a killer’ (FG1.1). In general, data scientists did not question the presented case characteristics, but domain experts did. Participants in FG3 and FG4 indicated that the risk items, which in RisCanvi only have three levels (Yes/Maybe/No), do not accurately represent the reality of inmates, and that they missed the ability to explore or negotiate the risk items during the case evaluations. Furthermore, they indicated that, during the assignment of levels to risk factors, they sometimes ‘compensate’ higher values in one item with lower values in other items, such that the final risk score matches what they would consider appropriate for the evaluated person. One participant (FG4.1) said that personal biases may also affect the coding of items, as some professionals adopt a more punitive approach, while others take a more protective or rehabilitative approach. Other domain experts agreed with this perspective. Therefore, most professionals expressed the need for team reviews and validation mechanisms for risk factor codings.

Among domain experts, the psychologists we interviewed were the most concerned about the evidence they collect and the representation of the actual risk. To them, RAIs are tools that add objectivity to their case reports, but their focus was on how to present evidence to judges, since judges might discard professional reports in favor of the RAI’s outcome. Overall, for domain experts, RAIs such as RisCanvi should be used by a group of experienced evaluators checking one another, and not by one professional alone.

7.2.2 Interpreting numbers

All participants had some training in statistics, and stated that they understand numerical expressions well. Generally, participants preferred a relative scale (e.g., 3.7/10.0) over an absolute scale (e.g., 37%). It is noteworthy how domain experts interpret probabilities.

First, extremely low risks were considered unlikely in practice, since almost everyone can commit a crime at some point.

Second, all interviewed domain experts stated that recidivism risk cannot be eliminated but it could be reduced to an acceptably low level (e.g., reducing the risk from 37% to 20%).

This emphasis on risk reduction is in line with the ‘interventions over predictions’ debate in the literature (Barabas et al. 2018). Third, domain experts consider a recidivism risk above 30% to be high, and a reason for concern. A risk above 50% was considered difficult, but not impossible, to reduce through treatment or interventions. Overall, domain experts thought of different ranges on the risk spectrum along which inmates are placed. Data scientists, too, considered different risk ranges, and for some of them even a 50% recidivism risk was not considered ‘high’.

7.2.3 Interaction with machine predictions and calibration

Many participants admitted that they went through the first few evaluations quickly and without giving them much thought. However, they also noticed that they slowed down to rethink when they felt contested by the algorithm, i.e., when their risk assessment was far from the algorithm’s prediction. Data scientists indicated that they reacted to such differences by simply adjusting the risk to halfway between their initial prediction and that of the algorithm. Domain experts indicated that they reacted similarly in some cases, but they also stressed that they kept their initial prediction when they felt confident about it.

Some of the domain experts believed that they were interacting with exactly the same RisCanvi algorithm they use, despite a clear indication in the introduction of the study that this was another algorithm. We believe their experience with the real RisCanvi affected their disposition to rely on the machine predictions we presented.

7.2.4 Preferred level of automation

Overall, domain experts and data scientists differed in the level of automation they would prefer, with data scientists being more open to automation. For instance, participant FG1.2 believed that an algorithm could improve enough to make almost-autonomous decisions ‘in the future’; this participant considered the errors that such an algorithm could make to be ‘acceptable’. In contrast, FG1.3 was sceptical about using an algorithm for automated decision-making because of the impossibility of solving all algorithm-specific errors.

All participants agreed that algorithmic support is useful in many instances, e.g., to contrast their own predictions, to give them a chance to rethink them, or to provide reassurance about them. Domain experts also considered it useful for training new case workers in writing evaluations. In that regard, participants from FG1 and FG2 expressed that the ‘objectivity’ of the algorithm could help reduce the effect of the ‘emotional’ response to the evidence by the professional who is evaluating.

Participants also acknowledged the risk of ‘relying too much’ on the algorithm, leading to reduced professional responsibility: ‘The decision you make is yours, it is yours with your biases and everything, which also brings experience because it sometimes helps you be aware and review your own prejudices’ (FG2.1). Another drawback of using a RAI noted by participants was the concern that it may reproduce potentially outdated societal prejudices. To address this concern, domain experts expected frequent updates to the algorithms.

8 Discussion

RQ1: Under which conditions do participants using a RAI to predict recidivism achieve the highest predictive accuracy? Overall, our findings suggest that human decision makers achieve higher accuracy in their risk assessments when they are supported by an algorithm. Almost all treatment groups achieve a higher AUC than their corresponding control group after the treatment, although these differences are not statistically significant, particularly in the case of crowdsourced participants (Figs. 3 and 5). Nevertheless, considering the evidence presented in the literature, further study of this phenomenon, possibly with larger populations, is needed. The algorithm also influences human predictions for each decision and over time, as shown in Fig. 4. This further suggests that algorithmic support establishes reference points for human predictions. In Fig. 11 we do not see an influence of the algorithm’s accuracy on the improvement of the human decisions. Instead, we consider the recurrent use of the tool by professionals as a way of improving their own practice. In practical terms, the implementation of RisCanvi or any other RAI may have an influence in the long term regardless of its accuracy; we consider that this should be studied in depth. The lower accuracy of the initial predictions of treatment group participants compared to control group participants is noteworthy. One possible explanation is that treated participants put less effort into their initial predictions in anticipation of algorithmic support and a potential opportunity to revise their initial prediction. The exposure to a particular tool is considered important in the field of automated decision-making. Many factors can affect predictive accuracy, as we discuss in Sect. 9 on limitations.

The finding that targeted participants (domain experts and data scientists) outperform crowdsourced participants contradicts the idea from previous work (see Sect. 2) that crowdsourced participants are comparable to domain experts or professionals when testing RAIs. This highlights the importance of testing RAIs in the context of professional knowledge, training and usage.

Finally, using an absolute rather than a relative scale leads to more accurate predictions (in our study, in the revised predictions). The focus groups further confirmed the preference of professionals for the absolute scale as the one closer to the real application. Our findings agree with Zoe Hilton et al. (2008), who found that risk categories are generally hard to agree upon across professions and individuals, and also with Hanson et al. (2017), who found that categories can be effective when there is common agreement on how they correspond to ranges of the absolute probability of recidivism. Thus, further studies should focus on how numerical information can help ground categorical distinctions in predictive risk assessment.

RQ2: To what extent do participants rely on the RAI to predict recidivism?

In line with previous studies (e.g., Tan et al. (2018)), humans and algorithms tend to agree on very low and very high risk cases (see Appendix 2, particularly Fig. 7), but there are cases that are difficult to predict for humans, for algorithms, or for both. A promising next step would be to identify cases that are clearly difficult for the machine and/or potentially difficult for humans. In these cases one could more safely defer to humans, or ask them to invest more time in a specific evaluation, improving efficiency in the design of human-algorithm decision processes. We suggest that any decision-support system should indicate its confidence for each case, to allow the human to make a more informed decision.

Our findings show that participants prefer partially automated assistance with a large degree of human discretion. This implies that easy-to-use mechanisms for overriding or changing the algorithm’s suggestion are needed, and professionals should be encouraged by their institutions to use them when appropriate. In addition, all experimental groups tend to downgrade the acceptable level of automation after the experiment (see Fig. 6). One explanation could be that the differences between human and machine predictions made participants realize that strong human oversight is more necessary than they initially thought.

Finally, the focus group discussions revealed that professionals’ reliance on an algorithm could increase when the algorithm providers ensure good prediction performance and frequent system updates corresponding to new societal and institutional developments. This suggests that RisCanvi, and possibly other RAIs, are elements of negotiation that should be handled with care and without assuming that their outcomes are objective, and that they need frequent updates and audits. So far, all discussions and feedback around the use and the improvement of the algorithm have been welcomed by its users. Thus, we recommend promoting spaces within organizations to hold sessions of discussion and feedback about the experience of using a RAI.

9 Limitations and future work

This paper has to be seen in light of some limitations. First, the dataset used for training the algorithm has some drawbacks. It contains only 597 cases, which may affect the algorithm’s accuracy; however, we note that its AUC-ROC is in line with that of most recidivism prediction tools. We also note that in this dataset the ground-truth label is re-incarceration and not re-offense. Re-arrest and re-incarceration are not necessarily good proxies for re-offense and further exhibit racial and geographical disparities (Fogliato et al. 2021). Since the focus of this study is the assessment of user behaviour (not the algorithm), we do not expect these drawbacks to notably affect our main results. Second, in line with previous work, this study focuses on accuracy as a measure of algorithmic performance. However, decision support algorithms can be evaluated in many different ways (Sambasivan et al. 2021). Third, Fig. 4 shows that participants are still calibrating their predictions after the training phase as they progress through the evaluation tasks, suggesting that the initial training phase may have been too short. The impact should be limited, as the majority of the cases are evaluated after this learning curve has flattened.

The generalization of this work to other contexts is restricted by other factors. Due to resource constraints (money to pay crowdsourcing participants and, critically, the time availability of domain expert participants), the findings draw from a study centered around 14 cases; a study with more cases would be an improvement, but would be more time-consuming for all participants. These constraints were addressed by selecting a variety of cases that represent different levels of difficulty for humans and for the algorithmic system. In addition, as is usual in experimental user studies, the crowdsourced participants are not representative of the overall population. Table 1 shows that most have university-level education and good numeracy. Further, we only recruited participants in a single country. Thus, the pool of users might not exhibit a large cultural diversity, a factor that could bias outcomes (Beale and Peter 2008; Lee and Selart 2012). However, we also remark that crime and recidivism differ across criminal systems and jurisdictions, and hence RAIs should be evaluated with careful attention to their context (Selbst et al. 2019). With the variations R3 and TargetV we tested whether explicitly repeating on each screen that we refer to violent recidivism affected the outcome and influenced the predictive performance of participants (see Appendix 3), and we did not notice any substantial change. Nevertheless, we acknowledge that the way recidivism is defined can have different effects in different contexts, and these results are not representative enough to reflect such differences. Sample size may be another limitation. While the size of our participant pool in the crowdsourced study (N = 247, N = 146) is in line with previous work, the number of participants in the targeted studies (N = 68) is relatively small. We speculated that, given that understanding data and probabilities is a complex task, data scientists might be ideal candidates for testing whether the results differ significantly from those of the crowdsourced and domain expert groups. As mentioned earlier, the numeracy test did not fulfill this expectation, because criminologists presented a level of numeracy as high as that of data scientists. Following this argument, the focus groups confirmed that the way data scientists treat information about criminal recidivism differs from that of professionals and domain experts. Future studies might also include impacted or adjacent voices. Despite these limitations in sample size, our results suggest consistent and in some cases statistically significant differences in the outcomes between crowdsourced and targeted participants.

It is also important to note that survey responses may incorporate some biases. For example, participants might feel pressure to report socially acceptable answers or be subject to the Hawthorne effect (participants know they are being observed). They might also feel pressured to answer in a short time; however, we required them to answer within a window of 30 min, and most of the participants did so in less than 20 min. All these effects are common when using surveys. Future research is needed to explore the reasons and conditions of these differences. This is particularly important in the public sector, where there is a lack of evidence on how algorithms affect public policies (Zuiderwijk et al. 2021). For example, in the recidivism prediction context, decision-making processes are open, negotiated, and mediated, and if a RAI is used to reduce inter-professional communication rather than to increase it, it can have adverse effects on decision quality. There is a clear need to pay attention to the usage contexts and the ways in which RAIs are deployed, to reduce the risks of automation and to better understand under which conditions the assistance of an algorithm can be most helpful.