
Evidence for Strong Performance of AI/ML in Healthcare and Health Science Problem Solving

There is a growing literature establishing the ability of AI/ML to solve complex problems in a variety of health domains, and comparing ML techniques among themselves, with traditional statistical methods, and occasionally with human experts.

In a meta-analysis of the ML-based neurosurgical outcome prediction literature involving 30 studies, it was found that ML models predicted outcomes after neurosurgery with excellent predictivity (median accuracy of 94.5% and median area under the receiver operating characteristic curve of 0.83), and significantly better than logistic regression (median absolute improvement in accuracy and area under the curve of 15% and 0.06, respectively). Some studies also demonstrated better performance of ML models compared with established prognostic indices and clinical experts [1].

In a systematic review of 27 studies applying machine learning to oral cavity cancer outcomes, it was found that the accuracy of models ranged from 0.85 to 0.97 for malignant transformation prediction, 0.78–0.91 for cervical lymph node metastasis prediction, 0.64–1.00 for treatment response prediction, and 0.71–0.99 for prognosis prediction. In general, most trained algorithms predicting these outcomes performed better than alternate methods of prediction. They also found that models including molecular markers in training data had better accuracy estimates for malignant transformation, treatment response, and prognosis prediction [2].

In a meta-analysis and systematic review of applications of machine learning algorithms to predict therapeutic outcomes in depression (20 studies), classification models were able to predict therapeutic outcomes with an overall accuracy of 0.82 (95% confidence interval of [0.77, 0.87]). Also, pooled estimates of classification accuracy were significantly greater (p < 0.01) in models informed by multiple data types (e.g., composite of phenomenological patient features and neuroimaging or peripheral gene expression data; pooled proportion [95% CI] = 0.93[0.86, 0.97]) when compared to models with lower-dimension data types (pooled proportion = 0.68[0.62,0.74] to 0.85[0.81,0.88]) [3].

In another systematic review and critical appraisal of ML applications in vascular surgery, over 212 studies were identified in which ML techniques were used for diagnosis, prognosis, and image segmentation in carotid stenosis, aortic aneurysm/dissection, peripheral artery disease, diabetic foot ulcer, venous disease, and renal artery stenosis. The median area under the receiver operating characteristic curve (AUROC) was 0.88 (range 0.61–1.00), with 79.5% (62/78) of studies reporting AUROC ≥0.80. Of the 22 studies comparing ML techniques to existing prediction tools, clinicians, or traditional regression models, 20 performed better and 2 performed similarly [4].

A systematic review of ML investigations evaluating suicidal behaviors, with 87 studies analyzed, found high levels of risk classification accuracy (>90%) and area under the curve (AUC) in the prediction of suicidal behaviors [5].

In a systematic review of 23 studies of applications of machine learning to undifferentiated chest pain in the emergency department (ED), it was found that multiple studies achieved high accuracy both in the diagnosis of acute myocardial infarction (AMI) in the ED setting and in predicting mortality and composite outcomes over various timeframes. ML outperformed existing risk stratification scores in all cases, and physicians in three out of four cases [6].

In a systematic review and meta-analysis comparing deep learning performance against health-care professionals in detecting diseases from medical imaging, based on 69 studies, it was established that ML models exhibited sensitivity ranging from 9.7% to 100.0% (mean 79.1%, SD 0.2) and specificity ranging from 38.9% to 100.0% (mean 88.3%, SD 0.1). Fourteen of these 69 studies compared the performance of ML models with that of health-care professionals. Restricting the analysis to the contingency table for each study reporting the highest accuracy yielded a pooled sensitivity of 87.0% (95% CI 83.0–90.2) for deep learning models and 86.4% (79.9–91.0) for health-care professionals, and a pooled specificity of 92.5% (95% CI 85.1–96.4) for deep learning models and 90.5% (80.6–95.7) for health-care professionals [7].

Finally, a systematic review of 42 studies evaluated applications of AI in pediatric oncology [8]. Of these 42 studies, 20 related to CNS tumors, 13 to solid tumors, and nine to leukemia. ML tasks included classification, prediction of treatment response, and dose optimization. The identified models matched or outperformed physician comparators in automated analysis and in predicting therapeutic response.

Quantitative Comparisons of AI/ML Versus Human Experts

In addition to the studies in the previous section [1, 4–8] that evaluated ML model performance not only in absolute terms but also in comparison to human experts, several more studies have focused specifically on comparing human and AI/ML problem-solving performance.

A large meta-analysis of 136 studies conducted between 1966 and 1988 compared the prediction performance of “mechanical procedures” (i.e., data science models of various forms: statistical models, actuarial tables, ML, or other) with that of human experts. The meta-analysis found that, given the same information about the cases, the mechanical procedures were substantially more accurate than clinical predictions in 33–47% of the studies, whereas human predictions were substantially more accurate than the mechanical ones in only 6–16% of the studies. In the remaining 37–61% of studies, humans and machines performed equally [9]. This shows that, even with the comparatively limited technology of the 1970s and 1980s, automated decision making was equal or superior to humans in 84–94% of the included studies.

Early AI computer-aided diagnosis (CAD or CADx) research produced similarly promising results. In an application for the diagnosis of abdominal pain, clinicians without access to the CAD tool arrived at the correct diagnosis with 71.6% accuracy, whereas with the aid of the CAD tool their accuracy reached 91.8% [10].

In addition to the DL vs human study of [7], another systematic review found similar results, namely that the performance of AI was on par with that of clinicians and exceeded that of clinicians with less experience [11].

Many of the above studies report model performance “in the lab”, that is, not embedded in a real-life clinical workflow or environment.

Pitfall 11.1

Even if a clinical AI system meets or exceeds expert-level performance in the lab, this does NOT mean that (i) the system can be readily adopted into clinical practice, (ii) will perform similarly when deployed in practice, or (iii) that the evaluation metrics used accurately reflect clinically impactful use of the AI model.

See the chapter entitled “The Development Process and Lifecycle of Clinical Grade and Other Safety and Performance-Sensitive AI/ML Models” for details of bringing models into practice.

We also note that the above comparisons refer to problem-solving tasks that both humans and machines can accomplish (albeit with different accuracy, ease, etc.). There exist problems that are currently entirely outside the capabilities of human decision making (e.g., making decisions using hundreds of thousands, or more, molecular and genetic factors, which is routine in molecular oncology ML models; or inferring complex causal relationships involving hundreds or thousands of variables by inspecting transcriptomic or other data).

Human Biases Versus Machine Biases

Humans and computers approach problems fundamentally differently. In the chapters “Foundations and Properties of AI/ML Systems”, “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare & Health Science”, and “Foundations of Causal ML” we reviewed the fundamental architecture and properties of AI and ML methods. We will assume that the reader is already familiar with the kinds of reasoning that humans can accomplish either in professional domains (i.e., health care practice, or health science) or in everyday living.

We will highlight a few important shortcomings of human decision making and examine to what extent machines can help overcome these shortcomings. Reviewing the theory of human learning, judgement and decision making at length is outside the scope of this book and the interested reader is referred to the highly informative and concise summary of human cognitive biases in [12], the prescriptive theory of medical decision making in [13], the several investigations on clinical decision making biases in [14] and the classic Nobel Prize-winning work of Kahneman [15] on human heuristics and biases.

We will also shed some light, from a technical perspective, on what machine biases are and how they arise or can be prevented.

Human Memory Is Not a Storage Device

Contrary to common belief, human memory is not a device that stores “material” (medical or science facts, experimental data, interpretations, patient information, thoughts, discussion points, feelings, etc.) and recalls them later. A study found that more than 90% of the points made in a discussion were forgotten. The mind builds a mental model of the “material” and fills in details through inference. Our confidence in how accurate our recollection is relates to our confidence in the inference rather than to our ability to recall facts [16].

Humans Are Influenced by Context

Since memory is inferred, it is context-dependent. Not only memory but most of human interpretation is context-dependent, even down to minute details. For example, in a list of items, we attribute more weight to the first item than to later items, and we remember the most recent (last) items better [17]. In some cases, where the contrast among multiple alternatives is too high, we can become unable to assess the middle alternative. In a similar vein, when a multitude of characteristics is evaluated simultaneously, these evaluations influence each other and the results correlate. This latter trait is exemplified by real-life “superheroes” who are perceived to be excellent in almost all character traits we care about [18].

Questions can create a context [19]. The way questions are formed can influence the answer. When open-ended questions are used, the respondent may not consider all alternative responses and may not think of the most appropriate response. Conversely, when closed questions are used, where the respondent has to select an answer from a list of alternatives, the listed alternatives can artificially increase the frequency of answers that would otherwise be uncommon. When the question concerns a measurement, the alternatives of a closed question can suggest a baseline, a “normal” value, against which the respondents measure themselves. Even the order in which questions are asked and the order in which potential answers (for closed questions) are presented can influence the answers. Fortunately, when the subject is knowledgeable about the topic of the question, such influences are smaller. To reduce bias in answers by respondents who are not knowledgeable, surveys often include a “don’t know” option. Finally, the wording of the question itself can influence the answer: answers can change depending on whether the question is framed in terms of gains or losses, for example, lives saved versus deaths.

Humans Are Not Inherently Rational Decision Makers

A fundamental normative model of human decision making is the maximum expected utility theory [15, 20, 21]. It describes decision making as following from assigning utilities to outcomes and choosing actions so that the expected utility will be maximized. Essential to such an endeavor are: (a) to describe with accuracy and completeness all relevant actions and outcomes in a problem of interest; (b) the ability to assign consistently individual preferences (measured by utility) to alternative outcomes; (c) the ability to calculate accurately the probability of outcomes given actions, including calculating probabilities of intermediate events outside the control of the decision maker; and (d) the ability to calculate accurately the action that maximizes the expected utility.
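To make components (a)–(d) concrete, the following is a minimal sketch, in Python, of expected-utility maximization for a hypothetical two-action treatment decision. The actions, outcome probabilities, and utilities are illustrative assumptions and are not taken from the literature cited here.

```python
# Minimal sketch of maximum-expected-utility decision making.
# All actions, outcome probabilities, and utilities below are hypothetical
# values chosen purely for illustration.

# (a) relevant actions and their outcome probabilities
actions = {
    "treat":            {"cured": 0.70, "side_effects": 0.20, "no_change": 0.10},
    "watchful_waiting": {"cured": 0.30, "side_effects": 0.00, "no_change": 0.70},
}

# (b) utilities assigned to outcomes (0 = worst, 100 = best)
utility = {"cured": 100, "side_effects": 40, "no_change": 60}

# (c) + (d) compute the expected utility of each action and pick the maximum
def expected_utility(outcome_probs):
    return sum(p * utility[o] for o, p in outcome_probs.items())

best_action = max(actions, key=lambda a: expected_utility(actions[a]))
for a, probs in actions.items():
    print(f"{a}: EU = {expected_utility(probs):.1f}")
print("Action maximizing expected utility:", best_action)
```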

Not surprisingly, several examples have been shown in the literature where humans make decisions that do not abide by components of expected utility-based reasoning. One important deviation is that losses are more important than gains to a part of the population. Kahneman et al. illustrate this through the following example. When decision makers are presented with a pair of alternatives, where one alternative is a sure loss of $500 and the other is a gamble of losing $1000 with 50% chance and losing $0 with the remaining 50% chance, subjects tend to choose the second option (i.e., avoid the sure loss). However, when the alternatives are framed in terms of gains, with the first alternative being a sure gain of $500 and the second a gamble with a 50% chance of winning $1000 and a 50% chance of winning nothing, people tend to select the sure gain.

Prospect theory [21] is a decision making model that uses the perceived value of gains and losses (as opposed to utility) as the basis of decision making and utilizes value functions that take the above asymmetry between the two into account (see the illustrative sketch after the list below). This model has the ability to describe “irrationalities” (deviations from the normative “rational” behavior that expected utility theory assumes) such as:

  (i) The diminishing value of gains and losses (the first $500 gain or loss is more important than the second).

  (ii) Certainty effects, such as removing the last bullet from a gun in Russian roulette being worth more (having more value) than removing one of four bullets, although the expected utility (reduction in probability of death) is the same.

  (iii) Framing, where the expected value differs based on the reference point against which gains and losses are computed, although the expected utility remains the same.

  (iv) Avoiding regret: humans are willing to give up “utility” to reduce the chance of feeling regret.
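The asymmetric treatment of gains and losses can be illustrated with a simple value function. The power form and the parameter values below (curvature parameters and a loss-aversion coefficient) follow the commonly cited Tversky-Kahneman formulation; they are shown only as an illustrative sketch, not as part of the models discussed above.

```python
# Illustrative prospect-theory value function (power form with loss aversion).
# Parameters alpha, beta, lam are commonly cited estimates used here only
# for illustration.

def value(x, alpha=0.88, beta=0.88, lam=2.25):
    """Perceived value of a gain (x >= 0) or loss (x < 0)."""
    if x >= 0:
        return x ** alpha                 # diminishing value of gains
    return -lam * ((-x) ** beta)          # losses loom larger than gains

# Diminishing sensitivity: the first $500 gained matters more than the second $500
print(value(500), value(1000) - value(500))
# Loss aversion: a $500 loss hurts more than a $500 gain pleases
print(value(500), value(-500))
```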

Humans Extensively Use Heuristics in Their Decision Making and Suffer From Related Biases

Because the human brain lacks the ability to execute complex calculations quickly, it has evolved to use approximate (so-called “heuristic”) decision strategies that provide a fast solution with a high likelihood of being correct (e.g., in an evolutionary context, to prevent loss of life from predators or other circumstances that require rapid decision making, as opposed to accurate but slow decisions). These heuristics have offered evolutionary benefits; however, they introduce significant biases into human decisions. Here we review some of the heuristics we use and refer the interested reader to [12, 15] for more complete and thorough treatments.

  • Humans tend to determine (incorrectly) the probability that an object came from a group by the “representativeness” of the object with respect to the group. Given two groups, one being more specific than the other, humans often attribute higher probability to the object belonging to the more detailed (more specific) group, which contradicts fundamental axioms of probability.

  • Humans follow a bias called the law of small numbers, in which they believe (erroneously) that small random sequences resemble large random sequences; in other words, that properties conferred by the law of large numbers in statistics also apply to small samples. This is related to the representativeness heuristic, entails difficulty distinguishing between random and non-random sequences, and leads to the gambler’s fallacy (believing that a small random sequence will exhibit the distribution characteristics of a large random sequence).

  • Humans suffer from attribution biases, e.g., attributing accidental (random) successes to skill and non-random failures to circumstances.

  • The frequency (or probability) of events is often estimated by how easy it is to recall an occurrence of that event (availability bias). While it is easier to recall more frequent events, our ability to recall events also depends on factors other than frequency. Such factors include how easy it is to imagine the event happening and also the desirability or undesirability of the outcome. We tend to “block out” and thus underestimate the probability of undesirable outcomes.

  • While assessing the probability of simple events is difficult, assessing the probability of compound events, which are conjunctions or disjunctions of simple events, is even more bias-prone.

  • Once people form an initial assessment of a probability, they are slow to adjust it (anchoring bias). They adjust it in the right direction, but to an insufficient extent. When the probability of an event is assessed relative to an anchor (in the form of being higher or lower than a certain value, the anchor), this anchor can bias the probability estimate upward or downward.

  • Although experts are less affected, risk assessment is even more bias-prone than assessing the probability (or rather, rate) of events. This is because non-experts define risk more broadly than “number of events per time period”.

  • A bias especially detrimental for medical and scientific reasoning is confusion of the inverse. In this bias, physicians confuse the sensitivity of a test for a disease with the posterior probability of having the disease given a positive test (see the worked example after this list). Similarly, scientists confuse the p-value (i.e., the probability, under the null hypothesis, of observing data at least as extreme as those obtained) with 1 minus the posterior probability that the alternative hypothesis is true, or interpret a 95% CI of an estimated quantity as containing the true value of the quantity with 95% probability.

  • Calibration biases are confidence errors, typically overconfidence errors, where humans believe that their probability of being correct is much higher than the true value.
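The confusion-of-the-inverse bias can be made concrete with Bayes’ theorem. In the sketch below, the sensitivity, specificity, and prevalence values are hypothetical and chosen only to show how far the posterior probability can be from the sensitivity.

```python
# Confusion of the inverse: P(test+ | disease) is not P(disease | test+).
# Sensitivity, specificity, and prevalence below are hypothetical illustrative values.

sensitivity = 0.90   # P(test+ | disease)
specificity = 0.95   # P(test- | no disease)
prevalence  = 0.01   # P(disease)

p_pos = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
posterior = sensitivity * prevalence / p_pos   # Bayes' theorem: P(disease | test+)

print(f"Sensitivity:                {sensitivity:.0%}")
print(f"P(disease | positive test): {posterior:.0%}")   # ~15%, far below 90%
```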

There are many more human decision making biases that affect most humans, including highly trained scientists and clinicians. We refer the reader to the references above for related discussion. The most important lesson is that there exist numerous cognitive biases and it is exceedingly difficult to remove them entirely from every decision humans make, even in the context of the highly specialized training received by clinicians and scientists. This represents a great value proposition of AI/ML, because none of the human cognitive biases affect ordinary AI/ML systems (unless the designer intentionally constructs the AI/ML models to exhibit such biases, e.g., to simulate and study human cognition).

AI/ML Biases

Recall from the chapter “Foundations and Properties of AI/ML Systems” that all ML systems have an inductive bias, which, as we explained, is not a defect in their capacity for problem solving but simply denotes what technical family of models they prefer (so that they can model the data better). As long as the inductive bias elements (i.e., ML model family, data fitting procedure, model performance function, model selection procedure) are accurate representations of the domain, the ML model will exhibit no negative “bias”.

One inadvertent negative bias, however, that can enter ML models is bias in the data provided for training. Consider the highly publicized case of racial bias in a patient care prioritization model [22], in which the model was supposed to prioritize care for high-risk patients, yet was found to prioritize white patients over black patients with the same risk. In brief, this model manifested a social/inequity bias because it was given the wrong data. Specifically, instead of presenting the actual severity of each patient, the data substituted it with the healthcare costs incurred for that patient. However, there is a systemic bias in which higher costs are associated with white patients than with black patients of the same risk. By training the ML algorithm with the wrong (biased) data, a model was produced that exhibited unwanted (socially/racially biased) behavior.
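A minimal simulation can illustrate the general mechanism described above: if the training label is a biased proxy (cost) rather than the quantity of interest (severity), the learned model inherits the bias. The data-generating assumptions below (group sizes, cost gap, feature set) are entirely synthetic and simplified; they are not a reconstruction of the model studied in [22].

```python
# Synthetic illustration of label (proxy) bias: training on cost instead of severity.
# All data-generating assumptions here are hypothetical and simplified.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 10_000
group = rng.integers(0, 2, n)                     # 0 = group A, 1 = group B
severity = rng.normal(0, 1, n)                    # true clinical need (the label we want)
# Systemic bias: at equal severity, group B accrues lower healthcare costs.
cost = severity - 0.5 * group + rng.normal(0, 0.3, n)

X = np.column_stack([severity, group])            # features available to the model
model = LinearRegression().fit(X, cost)           # trained on the biased proxy label

# At identical severity, the model assigns lower "risk" (predicted cost) to group B,
# so group B patients would be deprioritized despite equal need.
same_severity = np.array([[1.0, 0.0], [1.0, 1.0]])
print(model.predict(same_severity))
```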

We did not cover factors such as fatigue, distraction, illness, etc., because they are not cognitive biases. However, they are important and are discussed in the context of computer-human decision making next.

Computer-Human Decision Making

In this section, we examine the question of whether, and if so how, computer problem solving can be combined with human problem solving in order to improve performance.

We frame this question as follows:

Under what general conditions can AI/ML-Assisted Decision Making (i.e., computer + human) outperform both Autonomous AI/ML Decision Making (i.e., computer only) and Autonomous Human Decision Making (i.e., human only)?

Relative Strengths of Human and AI/ML Decision Making

Let us begin our discussion by comparing the relative strengths of human and AI/ML decision making.

Table 1 describes reasons why human decision making can be superior to AI/ML decision making and, conversely, reasons why AI/ML decision making can be superior to human decision making. The table shows that human and AI/ML decision making have complementary strengths, so combining them may offer benefits.

Table 1 Reasons why human decision making can be superior to AI/ML decision making (top part) and reasons why AI/ML decision making can be superior to human (bottom)

Assumption. For the discussion in this section, we assume that the human and the AI/ML model have access to the same data.

Combining human and AI/ML decision making can be viewed from several inter-related perspectives, which we discuss next.

Potential Complementariness of Errors Made by Humans and AI/ML

Figure 1 shows the three possible scenarios describing how the errors of AI-assisted decision making relate to the errors of the autonomous human and those of the autonomous AI/ML.

Fig. 1 Illustration of error relationships of computer vs human decision making. (a) Humans and computers make errors independently; (b) errors made by humans and computers always coincide; (c) errors made by humans and computers partly coincide

Both computers and humans each implement a function that maps from the problem input domain to the decision set. By the Error Domain of a computer model or human decision function, we refer to the subset of the input domain where the computer or the human (respectively) makes decision mistakes.

In each panel of Fig. 1, the blue circle represents the error domain of the human and the red circle represents the error domain of the AI/ML model.

There are three possible scenarios. In scenario (a) on the left, there is no overlap between the error domains of the human and the AI/ML model: they make independent mistakes. Some of these mistakes are predictable (the blue/red parts of the error domains), so when we encounter a case with input features in the human’s identifiable error domain we use the computer to make the decision. Conversely, when we encounter a case with input features in the AI/ML model’s identifiable error domain we use the human to make the decision. When the input does not belong to any identifiable error domain, we can pick either decision maker (a routing rule of this kind is sketched after the scenario descriptions below).

In scenario (b) in the middle, there is a perfect overlap between the error domains of the human and the AI/ML model. They make mistakes exactly for the same inputs, so we cannot easily (if at all) correct the mistake.

Scenario (c) is in between (a) and (b). There is overlap in the error domains, and decision mistakes can be corrected easily in the non-gray (identifiable) and non-overlapping portions of the error domains.
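The routing logic implied by scenarios (a) and (c) can be sketched as a simple decision rule. The predicates testing membership in the identifiable error domains are hypothetical placeholders (e.g., auxiliary models that predict where each decision maker tends to err); they are assumptions for illustration, not components specified in this chapter.

```python
# Sketch of error-domain routing: send each case away from the decision maker
# whose identifiable error domain contains it. The *_error_likely predicates
# are hypothetical placeholders for error-prediction models.
import random

def hybrid_decision(x, human_decide, computer_decide,
                    human_error_likely, computer_error_likely):
    if human_error_likely(x) and not computer_error_likely(x):
        return computer_decide(x)      # case falls in the human's error domain
    if computer_error_likely(x) and not human_error_likely(x):
        return human_decide(x)         # case falls in the AI/ML model's error domain
    # Outside any identifiable error domain: either decision maker may be used.
    return random.choice([human_decide, computer_decide])(x)
```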

Weak Learning Theory

Roughly, ensemble theory states that weak learners, i.e., learners that perform only minimally better than random, can be combined to form a highly performant ensemble [23]. Neither the human nor, most likely, the AI/ML model is a weak learner; in a clinical setting, we would expect both to perform substantially better than random. However, weak learning theory still allows us to combine these (actually strong) learners to form an even more performant system. We have already seen examples of methods to combine weak learners, such as gradient boosting, in the chapter “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare & Health Science”. Other methods, such as error-correcting output codes, also exist [23].
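As a reminder of how weak learners combine into a strong ensemble, the sketch below boosts decision stumps on a synthetic dataset. The dataset and hyperparameters are arbitrary illustrative choices, and a recent scikit-learn version is assumed (the base-estimator argument is named estimator in recent releases).

```python
# Boosting many weak learners (decision stumps) into a strong ensemble.
# Synthetic data and hyperparameters are arbitrary illustrative choices.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

stump = DecisionTreeClassifier(max_depth=1).fit(X_tr, y_tr)   # a single weak learner
ensemble = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1), n_estimators=200, random_state=0
).fit(X_tr, y_tr)

print("single stump accuracy:    ", round(stump.score(X_te, y_te), 3))
print("boosted ensemble accuracy:", round(ensemble.score(X_te, y_te), 3))
```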

Knowledge About the Target Function

While the human and the AI/ML model have access to the same data, the human can have knowledge of some regions of the target function that the AI/ML algorithm was unable to learn. Universal function approximation, optimal classification, and other theories operate in the sample limit (assuming access to an unlimited number of observations); however, most AI/ML endeavors face a (sometimes severe) limitation of samples. Therefore, it is possible that AI/ML cannot learn the target function correctly in some regions of the input space. In contrast, the human may have knowledge about the target function in those regions (which, of course, the human did not obtain from the data).

ROC Convex Hull

We have already seen examples in the “Evaluation” chapter (Figure 9.1.2) of different classifiers (with different inductive biases) having different performance characteristics, resulting in ROC curves that favored one classifier in one region of the curve and a different one in another region. Provost and Fawcett [24] proposed a method of combining classifiers so that the resulting ROC curve is the convex hull of the original ROC curves. Although such combinations can improve performance over the individual classifiers, this strategy is not yet optimal. Further developments in combining ROC curves (e.g., [25, 26]) successfully improved upon the optimality of the combined classifiers. By viewing human decision making as a black-box classifier, these techniques can be used to combine human and AI/ML decision making.
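A rough sketch of the Provost-Fawcett idea: pool the (FPR, TPR) operating points of the decision makers (here the human is treated as a single fixed operating point) and keep the convex hull; any point on the hull is achievable by randomizing between the classifiers/thresholds at its endpoints. The operating points below are made-up illustrative values.

```python
# ROC convex hull over the operating points of two decision makers.
# Operating points are made-up illustrative values; the human is treated as a
# black-box classifier with a single (FPR, TPR) point.
import numpy as np
from scipy.spatial import ConvexHull

ml_points   = np.array([[0.05, 0.40], [0.20, 0.75], [0.50, 0.90], [0.80, 0.97]])
human_point = np.array([[0.15, 0.80]])
trivial     = np.array([[0.0, 0.0], [1.0, 1.0]])   # always-negative / always-positive

points = np.vstack([ml_points, human_point, trivial])
hull_pts = points[ConvexHull(points).vertices]
# Sorting the hull vertices by FPR traces the achievable ROC convex hull;
# dominated operating points (here the ML point at FPR = 0.20) drop out.
print(hull_pts[np.argsort(hull_pts[:, 0])])
```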

Stacking

We have already discussed stacking in the “Ensemble Methods” section of the “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare and Health Science” chapter. Recall that stacking is a modeling approach where multiple models (base learners) are trained on the same data (or some variant of it, e.g., a bootstrap re-sample) and the outputs from these models are combined by another model (the meta-learner). Stacking strategies differ in the ways base models are trained and combined. Optimal stacking strategies have been explored in several domains [27, 28].

In the application of stacking to the problem of combining human and AI/ML decision making, base models can be (i) any AI/ML model, (ii) the human, or (iii) an AI/ML model “mimicking” the human. (“Mimicking” refers to approximating an unknown complex function with a simpler function, or several simpler functions, learnt from data, in some of the input regions.)
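The following is a minimal stacking sketch on a synthetic dataset. Here both base learners are ordinary ML models; in the human-AI/ML setting, one base learner could be replaced by the human’s recorded decisions or by a model mimicking them (HEST, discussed below). The dataset and choice of learners are illustrative assumptions.

```python
# Sketch of stacking: base learners combined by a meta-learner.
# Dataset and learner choices are illustrative only; in a human-AI/ML setting,
# one base learner could be the human (or a model mimicking the human).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=1)

stack = StackingClassifier(
    estimators=[("gbm", GradientBoostingClassifier(random_state=1)),
                ("lr", LogisticRegression(max_iter=1000))],
    final_estimator=LogisticRegression(max_iter=1000),   # the meta-learner
    cv=5,
)
print("stacked model CV accuracy:",
      cross_val_score(stack, X, y, cv=5).mean().round(3))
```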

Strategies for Human-AI/ML Hybrid Decision Making

The inputs to the decision making are described in the left half of Table 2. These include:

DATA: data that both the computer and the human have access to. There is no other data that either the human or the computer can access.

C: a fixed computer model that was constructed before attempting to build the combined decision support system.

HRT: “real-time” human decision making (i.e., per usual practice).

HEST: an “estimated” human decision function, i.e., a model previously constructed to accurately mimic human decision making. Typically, this model will be built from data collected under ideal human decision making conditions (an expert human with a superlative performance record, without distractions or fatigue, etc.).

The properties and priorities are described on the right side of the table. The column “Likely Pred. Opt.” denotes whether the combined decision making is likely to be predictively optimal, specifically from the perspective of exposing the hybrid learner to all available data and to interactions among features that may not have been captured by either the prior computer model C or the human H (and thus cannot be modeled by simply ensembling C and H).

The “Computer → Human Effects Risk” column indicates whether a risk of error due to human biases (rubber-stamping decisions of C, alert fatigue, automation bias, etc.) exists. Specifically, “±” denotes that these biases, if present, can be addressed through data design and workflow engineering, including by blinding the human to the computer’s decisions.

The column “Human Risk” refers to the susceptibility to unmanageable error because of fatigue, distractions, or work overload.

Finally, “Gaming Risk” refers to a conscious or subconscious tendency to “steer” the model’s output toward desired outcomes by modifying the human decisions accordingly.

The rows of Table 2 contain the actual strategies ranging from S1 to S10. The phrase “rand” refers to randomization, where the human (HRT) is randomly blinded (or not) to the computer predictions. This allows for estimating the influence of the computer output on the human decision maker.

Table 2 Concrete strategies for constructing Human-AI/ML hybrid decision making

The strategies in Table 2 range from pure AI/ML-only decision making (S1) to pure human-only decision making (S10).

S1–S8 are hybrid designs where we build models to learn accurate decisions using the inputs described in the columns under “Inputs for modeling”. For each decision function, the inputs are designated by the ‘+’ signs in the appropriate input columns. For example, S6 constructs a decision function using the data (DATA, denoted D), the pre-existing fixed AI/ML model (C), and the real-time human decision (HRT). This results in a decision function

F(instance i from D, computer model C’s output on instance i, human decision on instance i),

which is trained over all instances i in D such that the decision error is minimized.
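A minimal sketch of such an S6-style decision function F: train a model on the original features augmented with the fixed computer model’s output and the recorded (real-time) human decision. The dataset, the pre-existing model C, and the simulated human decisions below are hypothetical stand-ins for illustration only.

```python
# Sketch of strategy S6: learn F(data, C's output, human decision) on labeled data D.
# Dataset, pre-existing model C, and simulated human decisions are hypothetical.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=15, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2)

# A fixed, previously built computer model C (trained here only for convenience).
C = LogisticRegression(max_iter=1000).fit(X_tr[:1000], y_tr[:1000])

rng = np.random.default_rng(2)
def human_decisions(y_true):
    # Simulated HRT: correct ~85% of the time (a crude stand-in for a clinician).
    flip = rng.random(len(y_true)) > 0.85
    return np.where(flip, 1 - y_true, y_true)

def augment(X, y_true):
    # Original features + C's predicted probability + recorded human decision.
    return np.column_stack([X, C.predict_proba(X)[:, 1], human_decisions(y_true)])

F = LogisticRegression(max_iter=1000).fit(augment(X_tr, y_tr), y_tr)
print("hybrid F accuracy:", round(F.score(augment(X_te, y_te), y_te), 3))
```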

Note that strategies S5 and S6 do not require blinding the human to the computer’s decisions, since any negative effects of the human being exposed to a computer model’s decisions are eliminated by the absence of exposure in S5 and by purposeful randomization in S6.

Note that S7 and S8 also do not require blinding of the human to the computer, since any negative effects of the human being exposed to a computer model’s decisions are eliminated by the absence of exposure to them. However, at model application time, with real-time human decision making, the new model may negatively affect the human, and thus S8 models will require monitoring for such effects.

Human Risks in Human-AI/ML Hybrid Decision Making

Human risks include decision fatigue [29] and other factors such as distractions, illness, overwork, etc. As we saw, human decision making is sensitive to context. Thus a factor to consider is the alteration of human performance in the context of combined decision making with the aid of a computer system. One instance of this problem is automation bias, in which humans who are presented with many correct decisions by an AI model become more susceptible to just “rubber stamping” the model’s decisions without much critical thinking [30]. This is a critical factor that has influenced regulatory rationale, and which we discuss further in the chapter “Regulatory Aspects and Ethical Legal Societal Implications (ELSI)”. The converse is true as well: when a clinical decision support system burdens a clinician with wrong or unnecessary alerts, alert fatigue [31] arises and the human may ignore subsequent alerts.

Empirical Studies of Human-AI/ML Hybrid Decision Making

In terms of empirical demonstration of the potential to improve performance by combining human and AI/ML capabilities, image analysis is particularly conducive to AI-clinician collaboration. Computer-aided detection (CADe) systems do not aim to offer a diagnosis instead of the physician, but rather detect and highlight regions in the image that are suspicious and require closer inspection. In a study where skin lesions were classified as malignant vs non-malignant, experts alone achieved 66% accuracy, while experts aided by the CADe system achieved 72.1% (±0.9%) [11].

In two highly cited systematic reviews [32, 33], Haynes et al. reported that across 97 studies of the effectiveness of clinical decision support (CDS), strong evidence was found that the performance of human decision making was improved (although these improvements were not directly linked to patient outcomes) [32]. A more recent systematic review of 38 studies found that 66% of the studies reported positive provider performance (while none reported negative provider performance) and 61% reported positive patient outcomes (while none reported negative patient outcomes) [34].

Conclusions

In the present chapter, we presented systematic review and meta-analysis literature, covering hundreds of studies, that provides strong evidence that AI/ML in healthcare and health science problem solving exhibits strong performance, usually superior to human decision making (when direct comparisons were made). The chapter also addressed human biases and contrasted them with AI/ML biases; these are of fundamentally different natures, and we have different abilities to control them. In the final part of the chapter we provided an analysis of conditions under which humans + AI/ML can outperform both humans alone and AI/ML alone. We also provided broad guidance for implementing such combined decision making, and examples from the literature where AI/ML-assisted decision making provided performance benefits.

We close this chapter with a needed clarification: our approach to the topic was driven by a data science functional perspective in which there are several knowns and unknowns. Specifically, we know the exact inner workings and input-output behavior of the AI/ML models we have built; we can also measure the input-output behavior of humans; and much is known about human decision making biases and how machines overcome them. At the same time, little is known about the inner workings of the human brain’s decision making apparatus. From an empirical perspective, we can model human decision making as a “black box” decision function and analyze the error structures so that we can construct hybrid systems with smaller error.

As the phenomena of alert fatigue, decision fatigue, and automation bias show, the implementation of hybrid systems in practice is much more difficult and less controllable than the implementation of AI/ML models alone. Psychology, cognitive science, neuroscience, human-computer interaction, human factors engineering, and implementation science can all help with successful implementation. Here we have just scratched the surface of what is possible.

Key Messages and Concepts Discussed in This Chapter

AI systems, from both the first and second waves of AI in medicine, have achieved diagnostic performance on par with expert clinicians, often outperforming less experienced professionals.

These systems achieved the best success when AI was used to complement the professionals (e.g., when used for consulting or for computer-aided detection) rather than trying to offer diagnoses without input from the professional.

Adoption of these systems is slow for a variety of reasons that relate to two key themes: (i) lack of trust and (ii) lack of consideration for the clinical workflow.

The strengths of AI systems are complementary to those of human professionals.

Human decision making introduces biases in several ways which are largely different from the biases that AI systems introduce.

Pitfalls Discussed in This Chapter

Pitfall 11.1 Even if a clinical AI system meets or exceeds expert-level performance in the lab, this does NOT mean that (i) the system can be readily adopted into clinical practice, (ii) will perform similarly when deployed in practice, or (iii) that the evaluation metrics used accurately reflect clinically impactful use of the AI model.

Best Practices Discussed in This Chapter

Best Practice 11.1. Consider the possibility that a hybrid system may outperform human or computer decisions.

Best Practice 11.2. Examine the topology of errors in human and computer models.

Best Practice 11.3. Explore ensemble learning as a strategy for building hybrid decision models.

Best Practice 11.4. Work with implementation experts for bringing complex human/AI decision making into the clinical or scientific settings.

Classroom Assignments & Discussion Topics in This Chapter

  1. How would you ensure that an AI system versus a clinical expert comparison is carried out in a fair manner? How would you establish a gold standard? How would you ensure that the clinicians and the AI system work with the same information? Can you think of evaluation metrics that favor AI or the clinical expert? Can/should blinded comparisons be used?

  2. In models of human decision making, we described the difference between utility and value. What implications does this distinction have on the design of AI objective functions?

  3. We explained that the way questions are formed can influence the answer. Also, when the respondent knows(!) the answer, such effects are minimal. How could the way questions are asked influence clinical decisions made by a clinician alone, a clinician in a shared decision making framework with the patient, and by an AI system?

  4. We described four main groups of factors that introduce bias into human decision making. These were (i) memory, (ii) context, (iii) rational/irrational decision making, and (iv) heuristics.

    (a) Which of these factors present in a patient can influence the decision making?

    (b) Which of these factors present in a clinician can influence the decision making?

    (c) Which of these factors are relevant to healthcare research?

    (d) Which of these factors will impact a purely AI decision making? Think of potential impact during the development and the use of the system.

    (e) Which of these factors will influence an AI assisted decision making?

    (f) Which of these factors can be corrected by AI assisted decision making?