
Clinical-Grade and Other Mission-Critical AI/ML Models vs. Exploratory and Feasibility Models

As we have seen in chapter “Artificial Intelligence and Machine Learning for Healthcare and Health Sciences: the Need for Best Practices Enabling Trust in AI and ML”, AI/ML is widely applicable across industries, endeavors, and objectives. One of the differentiating characteristics of biomedical AI/ML is the high cost of wrong decisions, which is typically not shared by many other types of commercial AI/ML.

Examples of non-biomedical AI/ML tasks that typically have a low cost of errors include:

  • Recommendation systems in online e-commerce platforms,

  • Recommendation systems for digital media streaming,

  • Ad dollar allocation to increase sales,

  • Language filters in social media,

  • Email spam detectors,

  • Image recognition in search engines.

While some non-biomedical areas do have high-risk applications such as:

  • Autonomous vehicles,

  • Weapons and defense systems,

  • Algorithmic trading,

these are relatively rare compared to healthcare and the health sciences, which are replete with tasks that have a very high cost of failure, including:

  • Diagnosis of serious diseases,

  • Choice of treatment in oncology and other life-threatening diseases,

  • Differentiation of early signs of benign from malignant conditions or conditions that will progress to life-threatening stages or have other dire consequences if the right treatment is delayed,

  • Generating hypotheses for re-organization of healthcare services and other interventions that may severely affect the cost-effectiveness and outcomes of a health system.

Even health science tasks that on the surface may seem tolerant of errors can have serious unintended consequences, for example:

  • Discovery of drug targets which, if plagued by many false positives, will lead to failed novel pharmaceutical pipelines with up to multibillion-dollar losses,

  • Conflating correlative with causal factors leading to pursuing expensive and risky interventions that cannot possibly improve outcomes,

  • Low signal-to-noise ratio in the biomedical literature due to massive production of false results which disrupts progress in the health sciences,

  • Failure to discover novel effective treatments and practice interventions.

Because the requirements of high-stakes models are very different from those of models with less severe costs and consequences, we differentiate between the following classes of models and place strong emphasis on the safe and effective development of high-risk models. We first crystallize the related key concept definitions:

  • Exploratory AI/ML models: models that test scientific hypotheses or generate new hypotheses but without linking critical patient, health system, or health sciences decisions to the quality of such models.

  • Feasibility AI/ML models: models that test the feasibility of constructing a certain type of model but without linking critical patient, health system, or health sciences decisions to the quality of such models.

  • Pre-clinical AI/ML models: models that test the feasibility of constructing (at a later stage) patient or health system critical models.

  • Clinical-grade and mission-critical AI/ML models: models with performance characteristics that allow for effective and safe use for patient, population, health system or health sciences-level decisions [1].

The delineation between exploratory, feasibility, and clinical-grade models is fundamentally a risk assessment process. Within the categories of feasibility and clinical-grade (or other mission-critical) models, further risk analyses will typically occur. These risks may involve model-inaccuracy-related risks, legal risks, ethical risks, financial risks, etc.

Risk assessment of health AI/ML models can be greatly facilitated via application of the risk management framework provided by the ISO 14971 standard and/or by application of the FDA criteria for regulated medical devices (see chapter “Regulatory Aspects and Ethical Legal Societal Implications (ELSI)” for detailed discussion). Another useful high-level framework is the translational science spectrum [2] comprising steps T0 to T4. Exploratory and feasibility AI/ML would typically fall in stages T0–T1, whereas T2 and beyond correspond to clinical-grade AI/ML.

T0 research: Basic biomedical research, including preclinical and animal studies, not including interventions with human subjects.

T1 research: Translation to humans, including proof of concept studies, phase 1 clinical trials, and focus on new methods of diagnosis, treatment, and prevention in highly-controlled settings.

T2 research: Translation to patients, including phase 2 and 3 clinical trials, and controlled studies leading to clinical application and evidence-based guidelines.

T3 research: Translation to practice, including comparative effectiveness research, post-marketing studies, clinical outcomes research, as well as health services, and dissemination & implementation research; and

T4 research: Translation to communities, including population level outcomes research, monitoring of morbidity, mortality, benefits, and risks, and impacts of policy and change.

Notice that even if a model is not subject to regulatory oversight per se, it may still pose very significant risks. Consider, for example, a model for forecasting resource utilization/needs used by a hospital’s administration for business planning; or a model used by health insurers for contract pricing and reimbursability; or, finally, a model that helps a pharmaceutical manufacturer prioritize its drug pipeline. It is entirely conceivable for errors in such models to lead to very substantial financial losses and/or disruption of health services or of R&D and production, affecting the well-being not only of such organizations but of large populations of individuals as well.

Also notice that whereas a marketed/deployed AI/ML model may directly affect patients’ outcomes and require regulatory approval, the precursors of such models that investigate feasibility will typically not require such oversight (though they will be subject to other ethical and legal constraints).

The major (high-level) pitfalls that the present chapter addresses are:

Pitfall 6.1.1

Treating the development of clinical-grade or mission-critical AI/ML models as if they were exploratory, feasibility or pre-clinical ones.

Pitfall 6.1.2

Failing to establish and apply appropriate and sufficient criteria, and to enforce BPs for model development, validation, and lifecycle support, that will ensure safe and effective deployment in high-risk settings.

Best Practice 6.1.1

Define the goals and process of AI/ML model building as either feasibility/exploratory or as clinical-grade/mission-critical and apply appropriate quality and rigor criteria and best practices.

In the remainder of this chapter (as well as in related chapters diving into technical details) we will expand and enrich these pitfalls and present corresponding BPs.

The Lifecycle of Clinical-Grade and Other Mission-Critical AI/ML Models (with Indicative Real-Life Example References). A Development Process

Recall from chapter “Artificial Intelligence (AI) and Machine Learning (ML) for Healthcare and Health Sciences: the Need for Best Practices Enabling Trust in AI and ML” that learning and other AI methods produce decision support models focused on specific problem solving tasks. In chapter “Principles of Rigorous Development and of Appraisal of ML and AI Methods and Systems” we described a systematic process for designing or evaluating learning methods (algorithms, algorithm families, pipelines, and automodelers) in terms of desirable operating characteristics.

Here we focus our attention on the development process, not of methods but of models with desirable performance characteristics (aka assurances) [3] across the lifecycle of the AI/ML models or systems [4].

Conceptually, and similar to the method development and evaluation process of chapter “Principles of Rigorous Development and of Appraisal of ML and AI Methods and Systems,” the described model lifecycle process combines elements of tried-and-true software and medical device development processes: the modified waterfall development process, where iterative development is allowed (and here, in addition, certain parts occur in parallel); and the stage-gate development process [5,6,7].

We will assume that the models will be based on existing or newly-developed and validated methods and will introduce pitfalls and a best practice process and steps, the purpose of which is to ensure that performance and safety goals are met. We discuss real life examples to ground the concepts in reality and make them clear. The overarching concepts and BPs of the present chapter are complemented by additional implementation details presented in chapters: “Foundations and Properties of AI/ML Systems,” “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare and Health Science,” “Foundations of Causal ML”, “Evaluation”, and “Overfitting, Underfitting and General Model Overconfidence and Under-performance Pitfalls and Best practices in Machine Learning and AI” (Fig. 1).

Fig. 1

Lifecycle of AI/ML models and a Best Practice workflow for their development. Steps 5, 6, and 9 progress in parallel with steps 1–4 and 7–8

For each of the above steps/stages, we discuss where and how relaxing the stringency of requirements is warranted for feasibility, exploratory, or pre-clinical models.

Step 1. Establishing Performance and Safety Requirements

The first step of developing a new AI/ML model or system is deciding what its intended goal is and what other constraints should be considered. This step corresponds to the traditional requirements engineering process familiar in the development of practically any computer, IT or engineering system [8]. Requirements engineering for AI and ML is currently an emerging field and many challenges exist in identifying the categories of goals and the right processes for conducting successful requirements engineering in health AI/ML [9,10,11]. Various desired properties, such as accuracy, explainability, safety etc., are often discussed in this emerging literature. We will incorporate these concepts into the components of step 1 - using a healthcare and health sciences perspective - as well as concepts from the related literature on healthcare and life sciences analytics, clinical prediction modeling, and healthcare risk adjustment [12,13,14].

Step 1.1. Specifying Performance Targets

In order for AI/ML models to be effective there must be a clear minimum set of performance goals that the successful models have to meet or exceed. Typically these can be determined according to the need to meet some desired clinical outcomes goal in either absolute or relative terms. Indicative examples include: severe post-surgical infections less than 1/100 in surgeries of some type; reducing adverse events for some treatment used in patients with condition X by three-fold relative to current incidence; achieving diagnostic accuracy for condition X that is at least 50% better than current human error rates; improving current hypertensive medication compliance of patients by 30%; eliminating 1-month re-admissions in Medicare patients; controlling false positive biomarker identification for some disease X such that no more than 10% of newly identified biomarkers are statistical false positives; improving the cost-effectiveness of existing diagnostics, treatments, or other patient or population-level interventions by some specific margin, etc.

The various metrics often used for these purposes are part of the toolkit of AI/ML model evaluation and are discussed at length in chapter “Evaluation”.

Step 1.2. Establishing and Evaluating Performance Targets in Real-Life and Multi-Objective Context of Use; Stakeholder Engagement

The target requirements represent meaningful goals in real-life discovery and health care settings, as a function of the needs of a plurality of stakeholders, and may be subject to objective or subjective judgments and values. Such judgments may originate from clinical experts or national guidelines, or may be tied to financial and payer expectations. They may also incorporate patients’ and human subjects’ perspectives (e.g., about subjective quality of life in various health states, or expectations about privacy, health risks etc., respectively) and ELSI (Ethical, Legal, Social Implications) as well as JEDI (social Justice, health Equity, Diversity, Inclusion) criteria and desiderata (see chapter “Regulatory Aspects and Ethical Legal Societal Implications (ELSI)” for the importance and nature of ethical considerations and goals, including extensive discussion of the important elements of health equity and elimination of unfair treatment of disadvantaged populations).

The context of these assessments has to be carefully constructed so that it is well-defined and facilitates the subsequent model development. By way of analogous examples, a well-defined context of model use may be similar to patient inclusion criteria specification for a clinical trial protocol, FDA contexts of use for biomarkers, FDA approved uses for drugs, etc.

Stakeholders and sources of guidance for developing a well-specified context of use of the sought models as well as performance requirements include:

  (a) External (independent) or organizational (internal) clinical experts;

  (b) Clinical service directors;

  (c) The scientific literature, including science-driven hypotheses;

  (d) National, state and organizational guidelines;

  (e) Payers and contract terms between providers and payers;

  (f) Regulatory bodies and legal compliance experts;

  (g) Ethics experts;

  (h) Community representatives;

  (i) Individual patients and patient advocacy groups;

  (j) Professional and scientific societies;

  (k) Scientific standards and standards groups;

  (l) It is also possible for goals to be generated by data-driven opportunity identification through prior application of AI/ML technology.

The above serve as the most common and important sources for setting quality and cost improvement (QCI) initiatives in health systems, Learning Health Systems initiatives, and academic and industry R&D initiatives where AI/ML can be an enabling technology. Especially with regard to general data and science standards, models and their underlying data - at the development, validation and deployment stages - ideally must follow:

  (m) FAIR principles for scientific data and software (Findable, Accessible, Interoperable and Reusable data and software).

  (n) TRUST data principles (Transparency, Responsibility, User focus, Sustainability, and Technology).

  (o) Open Science practices, including open access, open data, open source, and open standards for software, data and scientific findings.

  (p) Data security and HIPAA and other compliance requirements for data privacy, by virtue of state-of-the-art IT security, de-identification, and secure management of all sensitive EHR, RCT and other data.

  (q) Use of shareable and standardized terminologies and common data models (e.g., RxNorm, ICD9, ICD10, SNOMED CT, CPT, LOINC, HL7, OMOP, PCORnet CDM, and i2b2 for research with clinical data; and OBI, GO, VariO, PRO, CL, etc. for research with molecular data). See chapter “Data Preparation, Transforms, Quality, and Management” for detailed discussion and references for data requirements and practices.

An appraisal and synthesis of the above requirements may typically be undertaken by institutional AI/ML governance and oversight committees [15, 16].

Step 1.3. From Accuracy to Value Proposition and Health Economics

AI/ML loss functions and the general theory that governs AI/ML are typically constructed around predictivity measures (e.g., AUC, weighted accuracy, MSE etc., see chapter “Evaluation”). A translation step is typically needed to map predictivity measures and other model properties (e.g., cost of inputs to the model per application of the model) into downstream value.

In healthcare, generally there are four key business drivers of value: (a) revenue growth, (b) operating margins, (c) asset efficiency, and (d) organizational effectiveness. These drivers in turn impact three main business goals: (1) clinical performance, (2) operational performance, and (3) financial performance [4].

There are many metrics used to measure these goals, and chapter “Evaluation” discusses both clinical-outcome-oriented metrics as well as health economic ones. We will briefly mention here that metrics exist that combine clinical with economic value, for example the Incremental Cost-Effectiveness Ratio (ICER), a typical and widely-used health economics value assessment tool based on economic cost (e.g., expenditure in dollars) per unit of effectiveness, typically measured in Quality-Adjusted Life Years (QALYs) gained by use of some intervention, technology, medication, or, in our case, AI/ML model. The ICER and other cost-effectiveness metrics have several useful properties:

  (a) They allow placing a variety of wildly heterogeneous possible improvements on current practices, in clinical care or science, in direct comparison to one another.

  (b) They allow optimal allocation of limited resources to maximize the expected benefit across all possible mixtures of interventions.

  (c) They enable specifying the minimum performance requirements that are necessary for making the decision to incur the costs required for developing and/or deploying the envisioned or existing AI/ML model. See chapter “Evaluation” for details.
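To make the ICER concept concrete, the following minimal sketch (with purely illustrative, hypothetical numbers and threshold; none come from the text or the cited literature) computes the incremental cost per QALY gained by a new AI/ML-assisted care pathway relative to standard care and compares it to an assumed willingness-to-pay threshold:

```python
def icer(cost_new, cost_old, qaly_new, qaly_old):
    """Incremental Cost-Effectiveness Ratio: extra cost per QALY gained."""
    return (cost_new - cost_old) / (qaly_new - qaly_old)

# Hypothetical, illustrative figures only
value = icer(cost_new=12_000, cost_old=9_000, qaly_new=6.2, qaly_old=6.0)
willingness_to_pay = 50_000  # $/QALY; jurisdiction-dependent, assumed here
print(f"ICER = ${value:,.0f} per QALY gained; "
      f"cost-effective at threshold: {value <= willingness_to_pay}")
```

In this toy example the ICER is $15,000 per QALY, below the assumed threshold, so the new pathway would be judged cost-effective under those assumptions.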

Step 1.4. System-Level Goals and Interactions vs. “Tunnel-Vision” Model Development

A common shortcoming of AI/ML model development practice is building models whose performance requirements are meaningful in a narrow context but are blind to system-level consequences and interactions.

As an example in clinical settings, consider the need to decide whether patients with COVID-19 infections should be admitted to the ICU versus the hospital clinic, and the related problem of whether milder COVID cases should be hospitalized or sent home. AI/ML models focusing exclusively on eliminating the risk of bad outcomes for these patients would tend to overwhelm the hospital and ICU beds to the detriment of patients with other conditions. If a fixed number of hospital beds can be made available in some time horizon, and patients with different risks for dire outcomes are “competing” for the limited beds, sound development and evaluation of COVID admission models should take into account the patients with different conditions and the system of care holistically. As of the time this book is written, the literature on AI/ML decision support models strongly indicates that such system-level interactions are typically not considered, yet they are critical for the health system holistically.
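The following toy sketch (all numbers and the assumed benefit function are hypothetical, not from the text) contrasts a per-patient threshold rule with a system-level allocation of a fixed number of beds, illustrating why capacity and competing patients must enter the evaluation:

```python
import numpy as np

rng = np.random.default_rng(0)
risk = rng.uniform(0, 1, size=200)   # model-predicted risk of a bad outcome if sent home
benefit = 0.6 * risk                 # assumed expected risk reduction if admitted
beds = 40                            # fixed system capacity over the planning horizon

# "Tunnel-vision" rule: admit every patient above a risk threshold, ignoring capacity
wants_bed = risk > 0.3
print(f"Threshold rule requests {wants_bed.sum()} beds; only {beds} are available")

# System-level rule: give the limited beds to the patients with the largest expected benefit
admit = np.zeros_like(risk, dtype=bool)
admit[np.argsort(-benefit)[:beds]] = True
print(f"Expected adverse events prevented within capacity: {benefit[admit].sum():.1f}")
```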

Step 1.5. Relaxing the Stringency of Requirements for Feasibility, Exploratory, or Pre-clinical Models. Proper Level of Interpretation

The above requirements can be relaxed when dealing with feasibility, exploratory or preclinical models. It is entirely appropriate for such models to have a combination of the following:

  • Unspecified performance targets,

  • Open-ended application contexts,

  • To be driven by scientists’ curiosity,

  • To omit system-level interactions,

  • To forego incorporation of health economics, and

  • To not consider clinical-grade compliance and risk management.

However, even for such early efforts, the closer they are to the ideal specifications of clinical-grade and mission-critical models, the more informative they are and the closer they come to eventually making a significant contribution to health care and the health sciences. Additionally, serious caution is warranted regarding the problem of producing too much “noise” in the literature from models that lack requirements and purpose or have loose ones.

Finally, creators of feasibility, exploratory and pre-clinical models should not over-interpret them or exaggerate their significance. The reader is referred to section “Over- and Under-Interpreting Results” of chapter “Principles of Rigorous Development and of Appraisal of ML and AI Methods and Systems” for guidance on over- and under-interpreting methods and the models they produce; that guidance applies here exactly as stated and will not be repeated.

We can therefore refine and expand Pitfalls 6.1.1 and 6.1.2 by including:

Pitfall 6.2.1.1

Failure to specify and evaluate meaningful performance targets and real-life context of use.

Pitfall 6.2.1.2

Failure to engage all appropriate stakeholders.

Pitfall 6.2.1.3

Failure to establish value targets and translate predictivity and other technical model characteristics into real-world value assessments.

Pitfall 6.2.1.4

Failure to consider broader interactions of envisioned models with the health system or with the system of science and of R&D. “Tunnel vision” evaluation with blind spots to the broader implications and consequences.

Pitfall 6.2.1.5

Failure to consider ELSI and JEDI desiderata and consequences.

Pitfall 6.2.1.6

Interpreting results of models beyond what their properties justify.

Pitfall 6.2.1.7

Interpreting results of models below what their known properties justify.

The corresponding best practices are:

Best Practice 6.2.1.1

When pursuing risk and performance-sensitive model development, specify concrete model performance targets for well-defined care or discovery settings.

Best Practice 6.2.1.2

When pursuing risk and performance-sensitive model development, engage all appropriate stakeholders.

Best Practice 6.2.1.3

When pursuing risk and performance-sensitive model development, translate model accuracy to value, establish value targets and translate predictivity and other technical model characteristics into real-world value assessments.

Best Practice 6.2.1.4

When pursuing risk and performance-sensitive model development, carefully consider and plan for system-level goals and interactions. Avoid too narrow (“tunnel vision”) model development.

Best Practice 6.2.1.5

When pursuing clinical-grade and risk-sensitive model development, carefully consider ELSI and JEDI desiderata and consequences.

Best Practice 6.2.1.6

When pursuing feasibility, exploratory, or pre-clinical models, relax stringency of requirements applicable to clinical-grade models.

Best Practice 6.2.1.7

When pursuing clinical-grade and risk-sensitive model development, interpret models and models’ decisions exactly as their known properties justify.

Step 2. Data Design and Collection

The second stage in the clinical-grade and mission-critical model development process is the careful data design that will facilitate data collection and modeling to enable meeting the model performance requirements. In chapter “Data Design” we cover extensively the most relevant data designs and their relative strengths and weaknesses, their connection to modeling methods, their biases, their effect on performance, and other important characteristics.

One of the fundamental premises of this book, and of chapter “Data Design” in particular (also discussed in chapters “Foundations and Properties of AI/ML Systems,” “Principles of Rigorous Development and of Appraisal of ML and AI Methods and Systems,” and “Lessons Learned from Historical Failures, Limitations and Successes of AI/ML In Healthcare and the Health Sciences. Enduring Problems, and the Role of Best Practices”), is that the data design is, generally speaking, as important as the actual algorithms and AI/ML analytic protocols and stacks, and that many failures occur when the data design is deficient. In particular, a powerful data design may render the choice of algorithms inconsequential; conversely, a poor data design will increase the modeling work and difficulty exponentially, up to rendering the whole model development workflow infeasible.

Pitfall 6.2.2.1

Failure to create a rigorous and powerful data design which facilitates modeling that will meet performance and safety requirements.

For feasibility, exploratory, or pre-clinical modeling, on the other hand, developers can and often do utilize “convenience” datasets without extensive efforts for bespoke and optimized data sampling. In such cases the feasibility modeling has to be tailored to the limitations of the convenience datasets, and the interpretation has to be careful not to overstate what can be developed with imperfect data designs.

Pitfall 6.2.2.2

Failure to consider and judiciously interpret the limitations of convenience data designs on the performance and meaning of feasibility and exploratory models.

Best Practice 6.2.2.1

When pursuing risk and performance-sensitive model development, create a rigorous and powerful data design which facilitates modeling that will meet performance and safety requirements.

Best Practice 6.2.2.2

When pursuing risk and performance-sensitive model development, judiciously interpret the limitations of convenience data designs for the performance and meaning of feasibility and exploratory models.

Step 3. First-Pass Analysis and Modeling

The next stage of clinical-grade and mission-critical AI/ML model development is that of “first-pass” modeling, before final optimized models are attempted. This stage essentially asks: “how much signal seems to exist for the problem at hand, using the immediately-available data?”. If the preliminary signal is high, then transition to high-performance/low-risk models will be easier, faster, and cheaper. If the preliminary signal is small, then major efforts may be needed in data collection, new method development, sophisticated modeling, etc., and still these efforts may not meet target requirements. In R&D settings where alternative projects compete for the same limited R&D funds, this stage may be the point where some of the projects will be “weeded out” in favor of more promising ones.

The first-pass modeling therefore must be rigorous and involves a number of activities that, to a large extent, mirror aspects of rigorous method development previously discussed in chapter “Principles of Rigorous Development and of Appraisal of ML and AI Methods and Systems”:

(a) Literature review of what has been accomplished, and how, and what level of performance was reached by prior efforts;

(b) Theoretical analysis of the problem space and its characteristics;

(c) Available or easy/cheap-to-collect data for initial development and testing;

(d) Approaches and results previously explored both in terms of data design, algorithms and models;

(e) Verifying and reproducing prior literature findings/claims;

(f) Variation of data/populations used to derive prior models and how they match the population for the intended new models; and

(g) Preliminary estimates of predictivity of the first pass models and whether they meet requirements.

These considerations collectively establish that first-pass modeling does not imply a haphazard, “anything goes” approach to modeling. Several criteria apply that sharply differentiate feasibility modeling (which is not tied to clinical-grade or mission-critical modeling) from first-pass modeling (which is part of an R&D chain that intends to achieve, eventually, clinical-grade and mission-critical performance).

Finally, an important consideration is to avoid overfitting, which, as we see in detail in chapter “Overfitting, Underfitting and General Model Overconfidence and Under-Performance Pitfalls and Best Practices in Machine Learning and AI”, will occur when repeated rounds of modeling with additional methods are conducted in the same data without appropriate overfitting-avoidance protocols in place.

The steps in this stage can also serve, at the discretion of the data scientists and the project leaders, as a blueprint for strong exploratory modeling (always remembering that if exploration is the sole goal, then much less stringent data, performance, and safety requirements can be pursued).
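As an indication of what a quick first-pass signal check might look like in practice, the sketch below (using synthetic stand-in data and an assumed AUC target; both are hypothetical) estimates the cross-validated predictivity of a simple baseline model. If this preliminary signal falls far short of the target, major data or methods work is likely needed:

```python
from sklearn.datasets import make_classification  # stand-in for the project's real data
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=30, n_informative=5, random_state=0)

baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
aucs = cross_val_score(baseline, X, y, cv=cv, scoring="roc_auc")

target_auc = 0.85  # hypothetical performance requirement from Step 1
print(f"First-pass AUC: {aucs.mean():.2f} +/- {aucs.std():.2f} (target {target_auc})")
```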

Pitfall 6.2.3.1

Failure to ensure the successful transition from first-pass modeling to optimized clinical-grade models by ignoring the problem space characteristics, the data available for development and testing, or the prior literature on approaches and results previously explored in terms of data design, algorithms and models; by forfeiting verification and reproduction of prior literature findings/claims; and by failing to obtain robust preliminary estimates of the predictivity of the first-pass models and of whether they meet requirements.

Pitfall 6.2.3.2

Succumbing to overfitting when repeatedly analyzing data from the first-pass to the optimized modeling stages.

Best Practice 6.2.3.1

When moving from first-pass modeling to optimized clinical-grade models, take into account: the problem space characteristics; the data available for development and testing; prior literature on approaches and results previously explored in terms of data design, algorithms and models; verification and reproduction of prior literature findings/claims; and robust preliminary estimates of the predictivity of the first-pass models and of whether they meet requirements.

Best Practice 6.2.3.2

Avoid overfitting due to repeatedly analyzing data from first pass to optimized modeling stages.

An Important Protocol for Overfitting-Resistant Model Selection and Unbiased Model Error Estimation

Critical to a successful ML method is the choice of model selection and error estimation protocol. Such a protocol should support the following:

  (a) Unbiased error estimation (i.e., accurately estimating the generalization error of the model in the large-sample population).

  (b) Avoiding overfitting the model to the training data, especially if dimensionality is high, sample size is small, and many methods are applied.

  (c) Avoiding missing strong models for the task at hand (i.e., avoiding under-fitting).

Figure 2 depicts the core concept and simplified pseudo-code for a powerful such protocol that has great applicability across a wide range of biomedical ML modeling problems: the Repeated Nested Balanced N-fold Cross Validation (RNBNFCV) protocol [17]. The bottom part shows pseudocode for a bare-bones RNBNFCV. The top part demonstrates a simple example of operation of a three-fold cross validation optimizing two values (1,2) of one hyper-parameter (C) for one algorithm.

Fig. 2

Repeated Nested Balanced N-fold Cross Validation (RNBNFCV) protocol. Top: simple example of operation of a three-fold cross validation optimizing two values of one hyper-parameter for one algorithm. Balancing labels, repeated application, and construction of final model not shown for simplicity. Bottom: pseudocode for bare-bones RNBNFCV. See text for details

As can be seen, the data is split randomly into 3 equal parts P1, P2, P3. Each part is stratified so that it has an equal number of positive and negative values of the binary response (i.e., the splits are balanced). For each of the three splits, two parts are used for model selection and the remaining (third) part is used for error estimation. For example, in split 1, {P1, P2} will be used for model selection and P3 for error estimation. We say that the error estimation occurs in the outer loop and the model selection in the inner loop. This is evident from both the graphic and the pseudocode.

Within the model selection part of each split, we use a random part for fitting a model and the rest for evaluating its performance. For example, in inner split 1, P1 will be used to fit a model with hyper-parameter C = 1, which is then tested on P2. Then, in inner split 2, P2 will be used to fit a model with C = 1, which is evaluated on P1. The average accuracy over the inner splits informs the merit of C = 1. We do the same for C = 2 and find out that C = 1 is the best value for the hyper-parameter. This terminates this most simplistic model selection. Now we join P1 and P2 and train a model with C = 1. We evaluate it on P3 in the outer loop and record the accuracy (89%).

We repeat this procedure for the other outer cross validation splits (i.e., P1,P3/P2 and P2,P3/P1). The trace of the model selection for these splits is not shown, but we can see that the best C for the second split was 2 and for the third split was 1. Based on these values for C, models are fit on {P1, P3} and {P2, P3} respectively, and accuracies estimated on P2, P1, respectively.

These outer-loop accuracies are then averaged to provide a final, unbiased estimate of the accuracy of the model that this protocol identifies (83%). What is not shown in this simplified figure is:

  1. The final model is a model that we will fit from all data {P1, P2, P3} using C = 1. The estimated accuracy of 83% applies to this final model.

  2. We can use many algorithms and many hyper-parameters with many values and let the protocol find the best model.

  3. We can further reduce the variance of the final estimates and of the model selection by repeating the whole procedure and averaging the error estimates.
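A bare-bones version of this protocol can be expressed with standard tooling. The sketch below (synthetic data, scikit-learn, a single hyper-parameter with two values as in Fig. 2) shows the nested structure; the repetition over different random splits and the final-model fit are indicated in the comments:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)  # model selection
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=2)  # error estimation

# Inner loop: select the hyper-parameter (here C in {1, 2}) by stratified 3-fold CV
selector = GridSearchCV(SVC(), param_grid={"C": [1, 2]}, cv=inner_cv)

# Outer loop: estimate the generalization accuracy of the entire selection procedure
outer_scores = cross_val_score(selector, X, y, cv=outer_cv)
print(f"Estimated accuracy of the final model: {outer_scores.mean():.2f}")

# Repeating the whole procedure with different random splits and averaging reduces variance
# (the "Repeated" part); the final model is then obtained by running the selection on all data:
final_model = selector.fit(X, y).best_estimator_
```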

Step 4. Model Optimization and Validation

Step 4.1. Modify or Enhance Data Design, Algorithms, and Protocols

Once a first-pass analysis has been completed and the data scientists have collected important information about the sufficiency of the data (e.g., whether there is a need to collect more variables or a larger sample size, to clean up and transform variables, or to alter other data design aspects) and about the modeling (e.g., whether the algorithms and protocols applied seem sufficient, or require the inclusion of additional analytic methods and, in some cases, enhancement via new method development efforts), they can proceed with optimized model building and validation. This includes the following key steps:

Step 4.2. Obtain Performance Metrics and Meet Targets

This is accomplished using model selection and error estimation and evaluation designs and procedures for appropriate performance metrics as explained in chapters “Foundations and Properties of AI/ML Systems” (fundamental methods), “Model Selection”, “Evaluation”, and the chapter on overfitting. The performance metrics can be estimated with purely statistical methods and/or with collection of new data and prospective validations. See chapter “Evaluation” for pros and cons of these approaches.

In case further refinements are needed because performance targets are not fully met, this will trigger revisiting and iterating between steps 4.1–4.2, and possibly new method development (chapter “Principles of Rigorous Development and of Appraisal of ML and AI Methods and Systems”). It is also possible that targets cannot be met despite these efforts, and therefore relaxing the original model specifications and re-architecting the whole problem-solving effort toward more feasible goals may be necessary.

Step 4.3. Uncertainty and Error

This step involves a thorough characterization of the optimized model’s decision uncertainty and goes beyond establishing the main performance metrics. Examples of this step may include calculating: the robustness of the model and of its structure and parameters as a function of sampling variation (via resampling); sensitivity to modeling decisions (e.g., similarity/distance metrics, kernel functions, algorithm families considered, etc.); calibration; and predictive intervals (measuring uncertainty of decisions). A related procedure at this stage is to “carve out” subpopulations in the model’s input distribution space or in the model’s output, using thresholding, clustering and other methods, so as to define decision regions and populations that are characterized by high or low confidence. These steps are essential for managing misclassification risk and are discussed further in chapter “Characterizing, Diagnosing and Managing the Risk of Error of ML and AI Models in Clinical and Organizational Application”.
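As one illustration of “carving out” high- and low-confidence regions, the sketch below (synthetic data; the probability cut-offs are arbitrary assumptions) thresholds a classifier’s predicted probabilities and reports the coverage and accuracy of each region separately:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, n_informative=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

confident = (proba <= 0.2) | (proba >= 0.8)  # assumed confidence cut-offs
for name, mask in [("high-confidence region", confident),
                   ("low-confidence region", ~confident)]:
    accuracy = ((proba[mask] >= 0.5) == y_te[mask]).mean()
    print(f"{name}: {mask.mean():.0%} of cases, accuracy {accuracy:.2f}")
```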

Step 4.4. Model Explanation/Interpretation

AI/ML models are much more likely to be adopted and used if human stakeholders can understand how they function and why they make the decisions they do. Explainability is a critical element in making sure that AI/ML is safe, and it is also commonly important for regulatory approval. It is essential that model explanations have high fidelity (i.e., correspond precisely to how the AI/ML model functions, rather than being persuasive but inaccurate justifications of the model [18]).

We have seen in previous chapters that some AI/ML methods are inherently understandable (e.g., logic, rule based systems, Bayes Nets, causal graphs, decision trees, linear regression etc.) while many other methods produce models that are black box and hard to understand.

The fields of explainable AI (XAI) and interpretable ML and their incarnations in health domains study methods to make black box models interpretable [19].

In this section we will delve into essential concepts and techniques for AI/ML interpretation.

Murdoch et al. [20] define interpretation, in the context of ML, as producing insight from ML models. Specifically, they define interpretable machine learning as the extraction of relevant knowledge from a machine-learning model concerning relationships either contained in data or learned by the model. Other definitions include “Interpretability is the degree to which a human can understand the cause of a decision” [21].

Benefits of interpretable machine learning methods include the following [21,22,23]:

  • Model interpretability allows the user of the model to investigate and understand why an unexpected prediction was made by the model.

  • By extension, model interpretability allows us to debug models: the model developer can track down why the model made an erroneous prediction.

  • Model interpretability can shed light on the data-generating mechanisms of the domain that is being modeled.

  • Interpretable models offer a level of safety in that the user can anticipate model behavior.

  • In the absence of strict standards that govern the interoperability of ML models (or of ML models and humans), understanding exactly how the individual ML models operate can increase trust in their effective co-operation.

As we discuss in Chapter “Regulatory Aspects and Ethical Legal Societal Implications (ELSI)”, interpretable models also allow for the detection of biases present in the training data and are required for equitable and fair models.

A Taxonomy of AI/ML Interpretability

Methods for model interpretation can be categorized along the following dimensions [21, 24]:

  • Inherently interpretable vs. post hoc. Models such as linear regression models, k-NN classifiers, and decision trees are inherently interpretable; their language (see chapters “Foundations and Properties of AI/ML Systems”, “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare and Health Science”, and “Foundations of Causal ML”) is particularly suitable for human interpretation unless their size (e.g., number of leaves or number of variables) is excessive. In contrast, models such as kernel-SVM or Deep Learning models are not easily interpretable, and such models are explained post hoc, after they have been constructed.

  • Global vs. local interpretability. Global interpretation of a model attempts to explain the model in the entirety of the input space; local interpretation attempts to explain how the model operates in small part of the input space, possibly for individual instances. “Interpretation” generally refers to explaining how the model operates, while “explanation” generally refers to explaining how a model made a decision (or prediction) for a particular instance.

  • Model-agnostic vs. model-specific interpretation. Model-specific interpretation techniques require that the model to be interpreted is a specific type of model, while model-agnostic interpretation methods aim to explain any type of model.

Properties of Interpretation Methods

Numerous properties for interpretation methods have been defined [24, 25]. Below we list some; the reader is referred to [24, 25] for a more complete list.

  • Accuracy is the extent to which the explanation model accurately predicts unseen instances.

  • Fidelity refers to the extent to which the model is able to accurately imitate the black-box predictor to be explained.

  • Consistency (among explanation methods) shows how similar explanations are from different methods for the same task.

  • Stability is the degree to which similar explanations are generated for similar instances.

Figure 3 depicts the taxonomy of interpretation methods. We have already discussed the inherently interpretable methods in chapters “Foundations and Properties of AI/ML Systems,” “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare and Health Science,” and “Foundations of Causal ML”. We will provide some additional detail on the local and global model-agnostic methods and their properties.

Fig. 3

Taxonomy of interpretation methods (with example methods in blue text underneath)

Partial Dependence Plots (PDPs) [26] are a methodology for visualizing the effect of predictor variables on an outcome in a reduced feature space. For a feature S of interest, and for each unique value s it takes, the partial dependence is the average predicted outcome over the remaining variables (keeping S = s fixed). The partial dependence plot depicts the predicted outcome as a function of S and can be converted into a feature importance by computing the feature’s deviation from the average effect or by taking the difference between the maximal and minimal effects.

Properties. PDP is a global method. The marginalization is over the remaining variables, which prevents the user from detecting “local” effects, that is, relationships between S and the outcome that hold true only in a subspace. PDP is a plot, offering an intuitive interpretation. The method makes the assumption of independence between S and the remaining variables. Under this assumption, the interpretation is that the partial dependence is the marginal effect of S on the outcome.

The Individual Conditional Expectation (ICE) plot [27] visualizes the dependence of the predicted outcome on a feature S for each instance i separately, resulting in one line per instance. The curve for instance i is created by fixing all features except S at their actual values and varying the value of S according to a grid of feature values. The predicted outcomes for different i’s at the minimal value of S can differ, making it difficult to visually inspect a large set of curves at the same time. The centered ICE brings all the curves together to a common starting point.

Properties. ICE is a local method, providing a curve for each individual instance. Relationships that only hold true in parts of the input space can be detected. This is a plot, so the key method of explanation is visualization. Too many curves in a plot can create visual clutter. The partial dependence plot can be viewed as the average of the ICE curves and can be superimposed on the plot.
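Both PDP and ICE are available as model-agnostic tools in common libraries. A minimal sketch with scikit-learn (synthetic data, one feature of interest) is shown below, where kind="both" overlays the average PDP on the individual ICE curves:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay

X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# ICE curves (one per instance) plus their average (the PDP) for feature 0
PartialDependenceDisplay.from_estimator(model, X, features=[0], kind="both")
```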

Local Interpretable Model-Agnostic Explanations (LIME) [28] and Anchors [29]. Rationale: (a) Even if a modeling method produces inherently interpretable models (e.g. linear regression), having a large number of variables erodes interpretability. The explainer model should only have a few variables. (b) LIME assumes that in the local context of a particular instance, only a few variables are particularly impactful. These locally impactful variables can differ from variables that are locally impactful in a different part of the input space. The explainer model should then have high local fidelity. The explainer model is constructed locally on records generated randomly in the neighborhood of the record to be explained, the query record q, with the records weighted according to their proximity to q. The explainer model is an inherently interpretable model, such as linear regression, trained on the weighted records. The explanation for q is obtained by explaining the explainer model.
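A minimal, from-scratch sketch of the LIME idea (not the lime library itself; all parameter values and the helper name are arbitrary assumptions) is shown below: perturb around the query record, weight the perturbed records by proximity, and fit an interpretable weighted linear explainer:

```python
import numpy as np
from sklearn.linear_model import Ridge

def local_linear_explanation(black_box_predict, q, n_samples=5000,
                             perturb_scale=0.5, kernel_width=1.0, seed=0):
    """Return local feature effects around query record q (LIME-style sketch).

    black_box_predict: callable mapping a 2-D array of records to 1-D numeric outputs
    q: 1-D array holding the query record to be explained
    """
    rng = np.random.default_rng(seed)
    Z = q + rng.normal(scale=perturb_scale, size=(n_samples, q.shape[0]))  # neighborhood of q
    preds = black_box_predict(Z)                        # black-box outputs on the neighborhood
    dist = np.linalg.norm(Z - q, axis=1)
    weights = np.exp(-(dist ** 2) / kernel_width ** 2)  # proximity kernel
    explainer = Ridge(alpha=1.0).fit(Z, preds, sample_weight=weights)
    return explainer.coef_                              # local, interpretable coefficients
```

In practice one would also enforce sparsity (few variables) and verify the local fidelity of the explainer before trusting it, as discussed above.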

Anchors is an extension to LIME which, instead of building a local surrogate model, constructs an IF-THEN rule. The rules are called “anchors” because they sufficiently “anchor” the prediction locally – such that changes to the rest of the feature values of the instance do not matter; i.e., similar instances covered by the same anchor have the same prediction outcome. Two advantages of anchors over LIME include (1) the improved interpretability of the rules and (2) scoping: while the scope of a LIME explanation is not always clear, anchors have a clearly delineated scope.

Properties of LIME. It is a local, model-agnostic method. The explainer model can be any interpretable model; the choice of model can be tailored to the audience. The explainer model can utilize features that the original model does not. These features must use the same instances (sampling units) as the original data set.

Permutation Feature Importance. Rationale: if information in features that are useful for prediction is destroyed by randomly shuffling the feature values, the predictive performance of the model should decrease. If the decrease is small, then the information in the original predictor wasn’t very impactful; if the decrease is large, then the information in the original predictor had a large impact on the predictions. A model is constructed on the training set and the importance of feature S is estimated by comparing the predictive performance (or error) of the model on a test set with and without permuting S.

Note. Many variants of the method exist depending on (1) whether they use the training or test set, and (2) whether they re-train the model after permuting the feature values. Originally, Breiman [30] used the out-of-bag samples for variable importance estimation.

Properties. This is a global, model-agnostic method. It is computationally efficient: the model does not need to be retrained. This definition of importance relates to predictive performance rather than to the effect of the variable on the outcome. The feature importance depends on how the underlying learning method attributes the effect of sets of collinear features to the individual features. Whether or not the model is retrained after feature-value permutation will particularly affect the importance of collinear features.
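A hedged sketch of the test-set variant (no re-training), using scikit-learn’s implementation on synthetic data, is shown below:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Each feature is shuffled n_repeats times and the resulting drop in score is recorded;
# the model is never re-trained.
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
print(result.importances_mean)
```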

Conversion of black box models to decision trees or other interpretable global surrogate models via meta-learning. This is a global, model-agnostic method whereby the black box model is sampled for a large number of inputs (which must follow a modeled joint distribution of the inputs, or come from a random sample from the population) and a new dataset, Dsam, is created. The response in this new dataset is the black box model’s predictions (not the original response variable). This dataset is then processed with a Markov Boundary feature selector so that the input variables are reduced to the smallest set of features that carry maximum explanatory signal for the black box model’s behavior. Then decision trees or other interpretable models are built on the projection of Dsam on the Markov Boundary features. The method can be used to explain human decisions as well (if we first construct an accurate model of the human decisions, which we can sample as much as we need). Figure 4 demonstrates the process and the resulting decision trees in a project where the goal was to explain physician diagnostic decisions for differentiating melanomas from benign skin lesions [31]. The method can be combined with equivalence class modeling of all equivalent Markov Boundaries so that an unbiased and complete view of the black box model can be generated. Notice that separate interpretable models have been constructed for each physician. Moreover, we can use the interpretable models to identify guideline compliance for each physician. Caveats and cautions when using this approach are: we may not be able to create an accurate approximation of the original black box model (or of the human subjects we are studying); application of Markov Boundary feature selection may not retain all original signal; decision trees are sensitive to the order of variables, so we need to ensure that when some variables do not appear in a subtree they are indeed ignored locally; and finally, when equivalence classes exist, we need to examine all equivalence classes. These caveats apply in general to all model-agnostic surrogate methods, however.

Fig. 4

Explanation by meta learning, Markov Boundary feature extraction and conversion to interpretable models. Example in explaining physician diagnostic decisions for differentiating melanomas from benign skin lesions. Top: Original data from human decisions are feature-selected, then an accurate (but black box) model is created, then we sample the black model and create a dataset on which meta learning produces interpretable models (decision trees in this instance). Bottom: For a particular patient, we see key characteristics and how the interpretable model for expert physician 1 (left) compares with the model of physician 3 (right). In this case we can see that they agree, but with substantially different reasoning
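A simplified sketch of the meta-learning surrogate idea is shown below (synthetic data; the Markov Boundary feature-selection step is only indicated in a comment, and the surrogate’s depth is an arbitrary choice). The key elements are sampling the black box, fitting an interpretable model to its predictions, and checking the surrogate’s fidelity before using it:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=2000, n_features=10, n_informative=5, random_state=0)
black_box = RandomForestClassifier(random_state=0).fit(X, y)

# Sample inputs (here by resampling the observed inputs) and label them with the black box
rng = np.random.default_rng(0)
X_sam = X[rng.integers(0, len(X), size=10_000)]
y_sam = black_box.predict(X_sam)  # the new "response" is the black box's predictions

# A Markov Boundary feature selector would be applied to X_sam here (omitted in this sketch)
surrogate = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_sam, y_sam)
print(export_text(surrogate))

# Verify the surrogate's fidelity to the black box before using it for explanation
fidelity = (surrogate.predict(X) == black_box.predict(X)).mean()
print(f"Surrogate fidelity to black box: {fidelity:.2f}")
```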

Explaining human decision making. Once we create accurate models of human decision making, in principle all XAI methods discussed in this chapter can be used to explain human decision making (not just AI/ML models). This is a valuable technique that should be kept in mind when “Human-in-the-loop” systems are at play [32, 33]. See also chapter “From ‘Human versus Machine’ to ‘Human with Machine’”.

Pitfalls of Using Shapley Values (and Derivative Methods) for AI/ML Explanation

Shapley Values (SVs) are a framework for allocating rewards among agents who participate in collaborative coalitions to produce economic value that needs to be distributed fairly. This framework was devised by Lloyd Shapley, who received the Nobel prize in economics for his discovery [34]. In economic game theory it has been shown that this framework has a number of desirable and unique properties, and its value there is undisputed. However, it has been adopted rather uncritically, first for explaining regression models [35] and then, more recently, for explaining ML models [36]. As shown in an important recent work by Ma and Tourani [37], Shapley Values exhibit a number of severe shortcomings when it comes to explaining ML models:

  1. SVs do not properly explain even linear models.

  2. To calculate SVs one needs to build models that are unrelated to the model that needs to be explained.

  3. If SVs are calculated by data imputation rather than de novo model rebuilds, the choice of imputation greatly affects the calculated values.

  4. SVs for variables conditionally independent of T given blocking variable sets are (falsely) non-zero.

  5. SVs for variables not in the Markov Boundary of the response are non-zero.

  6. Causal effects of variables given their confounders are not monotonic to SVs.

  7. SVs do not respect the causal Markov Condition and thus cannot be used to explain causal models.

  8. SVs for a variable set L with less information content than that of a set S can be larger than those of S; SVs are thus not monotonic to information content.

  9. SVs are an improper measure of feature relevance (i.e., weakly relevant features can have higher SVs than strongly relevant features).

Pitfalls of using feature-importance methods. In general these methods, just like Shapley Value methods, attempt to summarize the importance of features using a single summary value. This is a very problematic endeavor when dealing with complex, multi-faceted models. For example, such methods do not separate the unique, overlapping, and non-linear interaction information content of the variables in conjunction with other features in the model with which they may have overlapping content, or with which they may interact non-linearly to produce strong signal when jointly observed (see chapter “Foundations and Properties of AI/ML Systems”, section on feature selection, for more insight). They also cannot explain the causal effects of features under different conditioning, manipulation and observational setups (see chapter “Foundations of Causal ML” for insight). Finally, these methods do not model the effects of equivalence classes in the models.

Pitfalls of Explaining Transparent But Very Large Models

While a method may be entirely transparent for small-scale models, the produced models may be very hard to interpret when they grow very large, because of the limits of human memory and cognition. Examples include large causal graphs, large decision trees, linear models with numerous variables each having a very small coefficient, large bagged model ensembles, boosted models, etc. Explaining such large models can be tailored to the specifics of each model family, or addressed with the usual array of surrogate models (i.e., as if the original model were a black box model). Table 1 shows indicative example approaches of the former strategy.

Table 1 Interpretation strategies for interpretable but very large models
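As one indicative simplification strategy (not necessarily among those listed in Table 1), a very large decision tree can be pruned to a far smaller, more interpretable one while monitoring the accuracy cost. A sketch on synthetic data, with an arbitrarily chosen pruning strength:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # grows very large
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_tr, y_tr)  # cost-complexity pruning

print(f"{full_tree.get_n_leaves()} leaves vs. {pruned.get_n_leaves()} after pruning")
print(f"Test accuracy: {full_tree.score(X_te, y_te):.2f} vs. {pruned.score(X_te, y_te):.2f}")
```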

Pitfalls of humans explaining models. Human experts are often engaged (especially in the health sciences) to interpret models. It is important to verify the generalizable fidelity of such human expert surrogate models.

Pitfalls 6.2.4.1 in Explanation of AI/ML

  1. Using black box models is an obstacle to adoption and trust.

  2. Shapley values (and derivative methods) have several severe shortcomings for AI/ML explanation.

  3. Feature importance methods, which attempt to compress complex model behaviors into single values per variable, lack the necessary information capacity and may be inconsistent with feature selection theory and causality.

  4. Inherently transparent but very large models often resist interpretation.

  5. Human experts explaining models must be evaluated for validity just like ML surrogate models.

Best Practices for Explanation of AI/ML:

Best Practice 6.2.4.1. Everything else being equal, prefer interpretable model families when interpretability is desired.

Best Practice 6.2.4.2. Use standardized coefficients (if applicable) when comparing feature contributions in linear models.

Best Practice 6.2.4.3. Very large models, even when produced with intrinsically interpretable methods, may still be hard to interpret because of sheer scale. Isolating critical information from large models, or simplifying them, is recommended.

Best Practice 6.2.4.4. Apply feature selectors that maximally reduce dimensionality without loss of predictivity. Compact models are always easier to explain. Combine with interpretable model families or surrogate models as appropriate.

Best Practice 6.2.4.5. If accuracy is of paramount importance and if the black box models have significant accuracy advantages over the best interpretable models you can build, then use the black box model but apply explanation methods:

  • Global surrogate models aiming to have high fidelity everywhere in the input space over all patterns that will be classified by the model. Verify generalizable fidelity of surrogate model before using.

  • Local surrogate models aiming to have high fidelity in the local input space for every pattern pi that will be classified by the model. Verify generalizable fidelity of surrogate model before using.

  • Human expert surrogate models which must be high fidelity everywhere in the input space and over-fitting resistant. Verify generalizability and fidelity of human expert explanations of models.
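As a brief illustration of the global-surrogate approach referenced above, the following sketch (synthetic data; scikit-learn assumed; the model choices and tree depth are illustrative, not prescriptions from this chapter) fits an interpretable decision tree to mimic a black-box model and measures its fidelity on held-out data before it would be trusted for explanation.

```python
# Minimal global-surrogate sketch: train the surrogate on the black box's predictions
# (not the true labels) and verify held-out fidelity before using it for explanation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=5000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

black_box = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X_tr, black_box.predict(X_tr))   # mimic the black box, not the data

fidelity = accuracy_score(black_box.predict(X_te), surrogate.predict(X_te))
print(f"held-out fidelity to the black box: {fidelity:.2f}")
```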

Best Practice 6.2.4.6. Shapley values, Shapley value approximations and feature importance methods that try to summarize complex model behaviors in one or few values are not advised as general or routinely used methods for explaining ML models.

A Note on Over- and Under-Interpreting Models

The structure and operating characteristics of AI/ML models offer important information about the data generating processes they model. This is especially important for discovery in the health sciences but also instrumental when we try and understand and improve systems and processes of healthcare. The appropriate interpretation of models refers to not deriving more or less (or more or less general or specific) conclusions than what the actual models entail. Because the appropriate model interpretation stems directly from the properties of the model family, knowledge of the model families (chapters “Foundations and Properties of AI/ML Systems,” “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare and Health Science,” and “Foundations of Causal ML”) is essential for proper model interpretation. See chapter “Principles of Rigorous Development and of Appraisal of ML and AI Methods and Systems” also for examples of appropriate and inappropriate model interpretation.

Step 4.5. Modeling Equivalence Classes

An extremely common pitfall in AI/ML modeling practices is that of ignoring equivalence classes of models. In order to understand the nature of this pervasive problem, and how to address it, we will present a simplified example.

Consider Fig. 5, where a system under study has the causal structure depicted. The outcome of interest is depicted in green. This is a terminal variable, which means that no variables relevant to the system are measured (or have importance) after it (an example of such a biological terminal variable is death). Red variables are measured direct causes. Purple variables are measured indirect causes. Grey variables are unmeasured (“hidden,” aka “latent”) variables. Uncolored variables are variables that are confounded with the outcome (via measured or unmeasured confounders). Finally, “prime” variables (A’, C1’, C2’) have the same information about the outcome as (i.e., they are information equivalent with) the corresponding non-prime variables; e.g., A and A’ are information equivalent about the outcome.

Fig. 5

Demonstration of problems in a domain with equivalence classes. Data generating structure. See text for details

Formally, information equivalency of two variables with respect to a third variable is defined as follows:

Definition

Target Information Equivalency (TIE) of variables A, A’ with respect to variable T (simplified definition that can be used to understand the example presented, adapted from [38]):

A TIE_T A’ ↔ [¬(A ⫫ T), ¬(A’ ⫫ T), (A ⫫ T | A’), (A’ ⫫ T | A)]

↔ denoting equivalence

¬(X ⫫ Y) denoting dependence of X and Y

⫫ denoting independence

In words:

Information_equivalent (A, A’) with respect to T iff:

  • Not Independent (A, T),

  • Not Independent (A’, T),

  • Independent (A, T), given A’ and

  • Independent (A’, T) given A.

In other words, A and A’ have exactly the same (and non-zero) information about T.

Note 1: A, A’ do not need to be strongly correlated (i.e., collinear) or strongly associated. They can be TIE with respect to T even with small mutual association or correlation.

Note 2: A, A’ may have different information content about other variables (i.e., variables other than T).
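The following small numerical sketch (synthetic data; a plug-in mutual-information estimator; the construction and variable names are illustrative assumptions, not taken from [38]) reproduces the defining TIE pattern: A and A’ are each dependent on T, each becomes independent of T given the other, and yet they are only partially associated with each other (cf. Note 1).

```python
# Minimal TIE illustration: a hidden signal S is encoded exactly in the "high bit"
# of both A and A', each of which also carries an independent noise bit; T depends
# only on S, so A and A' carry identical information about T.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
S  = rng.integers(0, 2, n)
A  = 2 * S + rng.integers(0, 2, n)            # A  = (S, noise bit 1)
Ap = 2 * S + rng.integers(0, 2, n)            # A' = (S, noise bit 2)
T  = np.where(rng.random(n) < 0.9, S, 1 - S)  # T is a noisy copy of S only

def mi(x, y):
    """Plug-in mutual information I(X;Y) in nats from empirical counts."""
    joint = np.zeros((x.max() + 1, y.max() + 1))
    np.add.at(joint, (x, y), 1)
    joint /= joint.sum()
    px, py = joint.sum(1, keepdims=True), joint.sum(0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())

def cmi(x, y, z):
    """Conditional mutual information I(X;Y|Z) via stratification on Z."""
    return sum(mi(x[z == v], y[z == v]) * np.mean(z == v) for v in np.unique(z))

print(round(mi(A, T), 3), round(mi(Ap, T), 3))           # both clearly > 0: dependence
print(round(cmi(A, T, Ap), 4), round(cmi(Ap, T, A), 4))  # both ~0: conditional independence
print(round(mi(A, Ap), 3))                               # A, A' only partially associated
```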

As can be seen in the example of Fig. 5, the minimum set of maximally predictive features in the data generating function is {A, B, H1, H2} and its equivalent {A’, B, H1, H2}. This is because we can substitute A with A’ and vice versa, since they are equivalent with respect to the outcome. In the measured data (because of hidden variables) there are 6 Markov Boundaries, and this equivalence class can also be precisely discovered (Fig. 6).

Fig. 6

Demonstration of equivalence class - related problems. Equivalence class consequences on predictive modeling. See text for details

Note 3: In [38] more general definitions of information equivalency are presented that admit sets of equivalent variables, equivalences involving more than two equivalent sets, conditional independence with respect to arbitrary sets, and context-independent equivalence (holding given any subset of other, non-equivalent variable sets). The simplified form of information equivalency covered here conveys the basic idea of predictive equivalence and is considerably simpler than these more general cases.

The situation is similarly complicated for causal discovery. As seen in Fig. 7, the direct causes of the outcome are {A, B, H1, H2}. In the measured data, the apparent set of direct causes (barring successful application of latent variable detection algorithms) forms an equivalence class with 3 members.

Fig. 7

Demonstration of equivalence class -related problems. Equivalence class consequences on local causal structure discovery. See text for details

When the outcome is a terminal variable (e.g., death), and no unmeasured confounders or information equivalencies exist, then the (single) Markov Boundary contains precisely the set of direct causes of the outcome.

If information equivalencies exist but there are no unmeasured variables, then one Markov Boundary member of the equivalence class will contain precisely the direct causes. When unmeasured variables exist in addition to equivalences, Markov Boundaries will contain a mixture of direct causes, indirect causes, and non-causal confounded variables. Notice that when a variable appears in all Markov Boundaries, then in the absence of latents it is guaranteed to be a direct cause (e.g., B).
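To make the notion of a Markov Boundary equivalence class concrete, the following brute-force sketch (synthetic data; scikit-learn assumed; this is emphatically not the TIE* algorithm, just an exhaustive search that is feasible only for a handful of variables) recovers two equally predictive, minimal feature sets that differ only in which member of an information-equivalent pair they contain.

```python
# Brute-force sketch of an equivalence class of minimal, maximally predictive
# feature subsets. A_prime is a near-copy of A (approximately information
# equivalent for T); K is irrelevant. Illustrative only; not TIE*.
from itertools import combinations
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n = 4000
A = rng.normal(size=n)
A_prime = A + rng.normal(scale=0.01, size=n)
B = rng.normal(size=n)
K = rng.normal(size=n)
T = (A + B + rng.normal(scale=0.5, size=n) > 0).astype(int)

X = {"A": A, "A_prime": A_prime, "B": B, "K": K}

def auc(subset):
    Z = np.column_stack([X[v] for v in subset])
    return cross_val_score(LogisticRegression(max_iter=1000), Z, T,
                           cv=5, scoring="roc_auc").mean()

all_subsets = [s for r in range(1, 5) for s in combinations(X, r)]
best = max(auc(s) for s in all_subsets)
tol = 0.005   # tolerance for "maximally predictive"
minimal_best = [s for s in all_subsets
                if auc(s) >= best - tol
                and not any(auc(t) >= best - tol
                            for t in combinations(s, len(s) - 1) if t)]
print(minimal_best)   # typically [('A', 'B'), ('A_prime', 'B')]
```

In practice, algorithms such as TIE* recover such equivalence classes without exhaustive enumeration, which becomes infeasible beyond a handful of variables.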

As a reminder of earlier material (Chapters “Foundations and Properties of AI/ML Systems” and “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare and Health Science”), the following predictive modeling errors may occur when not employing optimal feature selection:

  1. 1.

    Predictivity less than the optimal predictivity achievable with the measured data.

  2. 2.

    Predictor model size larger than that of the most parsimonious model among the maximally predictive ones.

The following additional causal modeling errors may occur when not employing causal methods:

  1. 3.

    Including variables that are not causal even though their confounders are fully measured (e.g., including E, L, A, and B in an SVM or PCA model, whereas causal algorithms detect that only A, B (and their equivalents) can be direct causes and that E, L (and their equivalents) may be indirectly causal or confounded).

  2. 4.

    Discarding maximum causal effect variables in favor of ones with smaller causal effects (e.g., preferring E over A and B because E may have a larger effect than either A or B alone; this “information synthesis” occurs, for example, in decision trees and other machine learning methods).

  3. 5.

    Failure to discover confounding by latents although it is discoverable.

We can now see the additional errors introduced when equivalence classes are not modeled, which add to the errors above:

Pitfall 6.2.4.2

Additional predictive modeling errors in analyses where the equivalence class of the optimal feature sets (Markov Boundaries) are not inferred.

  1. 1.

    The predictor model will be a random member of the Markov Boundary equivalence class. This may not be the cheapest, easiest or most convenient model to deploy clinically.

  2. 2.

    In domains with large equivalence classes, intellectual property cannot be defended since a third party can use an equivalent Markov Boundary and easily bypass a patent or other IP protections.

Pitfall 6.2.4.3

Additional causal modeling errors in analyses where equivalence classes of Direct Causes are not inferred.

  1. 1.

    Discarding causal variables in favor of non-causal ones (e.g., discarding A because its correlation with outcome vanishes when we include non-causal but information equivalent A’ in a regression model).

  2. 2.

    Over-interpreting models: e.g., believing that because A’ is a model returned by an algorithm, without equivalence class modeling, and A is not, then A’ is biologically more important than A.

Because the size of the equivalence classes can be immense (i.e., exponential in the number of variables; see chapters “Foundations and Properties of AI/ML Systems” and “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare and Health Science”), finding the true causes when selecting a random member (as all algorithms not equipped for equivalence class modeling do) amounts to playing a lottery with astronomical odds against the analyst. Similarly, finding the feature set that is most suitable for clinical application is astronomically unlikely if the Markov Boundary equivalence class is large and is not modeled. A small numerical illustration follows.
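As a purely illustrative calculation (the group sizes are hypothetical): suppose a Markov Boundary contains 20 variables and each of them belongs to an information-equivalence group of 5 interchangeable variables. Then the number of distinct, equally predictive Markov Boundaries is

$$5^{20} \approx 9.5 \times 10^{13},$$

so an algorithm that returns a single arbitrary member has odds of roughly 1 in 10^14 of returning any one prespecified member (for example, the one that is cheapest to measure, or the one containing the true direct causes).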

Best Practice 6.2.4.7

  1. (a)

    Use equivalence class modeling algorithms for discovering the equivalence class of optimally accurate and non reducible predictive models. E.g. TIE* instantiated with GLL-MB or other sound Markov boundary subroutines [38] (see chapter “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare and Health Science”).

  2. (b)

    Use equivalence class modeling algorithms for discovering the equivalence class of direct causes. E.g. TIE* instantiated with GLL-PC or other sound local causal neighborhood subroutines (chapter “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare and Health Science”).

  3. (c)

    When experiments can be conducted, consider using ML-driven experimentation algorithms that model equivalence classes. Experimentation may be needed to resolve the equivalence classes and unmeasured confounding. Such algorithms minimize the number of experiments needed. E.g., ODLP* algorithm [39].

Pitfall 6.2.4.4

Omit important steps in optimizing model performance and characterizing error and other properties that are essential for safe and effective deployment.

Best Practice 6.2.4.8

When pursuing risk and performance-sensitive model development: optimize model performance verifying that targets are met; otherwise modify or enhance data design, algorithms, and protocols, or relax requirements; once modeling is complete characterize error and other properties that are essential for safe and effective deployment and explain models and check their face validity.

Step 5. IP Considerations

Throughout the model development process, and depending on organizational policies regarding intellectual property (IP), close collaboration with commercialization experts and IP legal counsel may be needed. These experts may decide to file for patents and other forms of IP protection at various steps of the model development effort (and the same applies to method development efforts). The data scientists are an essential part of this process, as they have to provide technical details, text, and data, and to teach the workings of methods and models to the IP experts. This process involves several challenging aspects:

  1. 1.

    First, there is the potential tension (and occasional controversy) between open science and open access principles and inventor or institutional desires for IP protection. On one extreme lies the position that all IP protection must be avoided. On the other extreme lies a very strict sense of proprietary ownership of methods and models and a complete lack of transparency about how methods and models work.

    In between these extremes lie more balanced approaches where, for example, it may be acknowledged that certain inventions can work best (e.g. reach patients most effectively) via productization which requires commercial investment, which in turn often relies on some form of robust IP.

    Simultaneously, this does not preclude, for example, allowing non-profit use of methods and models openly and without restrictions.

    Patents (and patent applications, even when not issued) of algorithms and models typically require “opening up” the new technology to scrutiny and understanding. This is true even if the patent applications are rejected by the USPTO or other (international) patent-issuing bodies. By contrast, if a patent application is not pursued, the “trade secret” nature of the technology (unless, of course, it is disclosed openly) hides critical details from scientific inspection and opens up the possibility for errors and risks that would otherwise be entirely avoidable.

    To complicate matters further, sound science and technology require the ability to stabilize methods and model implementations such that they are not perpetually subject to unavoidable re-implementation or modification errors. Open source is, unfortunately, particularly vulnerable to such errors. “Locking” models and methods and their implementations is necessary if absolute certainty is required that their properties hold exactly as established by their validation studies.

  2. 2.

    If patent protection of AI/ML algorithms is desired, the inventors must understand that in recent years it has become increasingly hard to obtain such protection (i.e., ~5% issuance rate), since SCOTUS has issued rulings that impose significant hurdles, requiring that patentable inventions not be deemed by patent examiners to be abstract ideas or laws of nature. The USPTO has become increasingly resistant to granting patents on highly mathematical inventions such as AI/ML ones, and often these have to be filed, e.g., in the form of tangible systems.

  3. 3.

    While AI/ML models and systems are much easier to patent than algorithms, a major hurdle lies with the (occasionally vast) equivalence classes of models with optimal or near-optimal performance. As explained previously, an ML model M1 for optimal treatment of condition X may have an equivalence class of immense size (just the number of fully information-equivalent optimal and minimal predictor variable sets grows exponentially with the number of variables in the data). In such cases, IP protection of the general methodology for producing the models may be pursued. Alternatively, a much newer approach relies on filing for IP protection of the equivalence class of optimal and near-optimal models. This latter approach requires specialized algorithms that derive the full equivalence class, such as the ones described in [38] (see comparisons of multiple methods there, and the discussion in chapters “Foundations and Properties of AI/ML Systems” and “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare and Health Science”).

Pitfall 6.2.5.1

Establish and exercise IP rights that defy fundamental principles of scientific reproducibility, openness to model and method scrutiny, and validation.

Pitfall 6.2.5.2

Failing to establish IP rights that are critical for successful dissemination and benefit from AI/ML innovation.

Pitfall 6.2.5.3

Failing to protect IP rights from bypassing IP by exploiting model equivalence classes.

Best Practice 6.2.5.1

Do not establish and exercise IP rights in ways that undermine fundamental principles of scientific reproducibility, openness to model and method scrutiny, and validation.

Best Practice 6.2.5.2

Establish IP rights that are critical for successful dissemination and patient and society benefit from AI/ML innovation.

Best Practice 6.2.5.3

Protect IP rights from “bypassing” that exploits model equivalence classes.

Step 6. Regulatory, Bias, ELSI, JEDI Considerations

Throughout all stages of the design, development, validation and deployment of clinical-grade and error-sensitive AI/ML models there are important considerations that cover legal compliance and social and ethical dimensions. These are covered in chapter “Regulatory Aspects and Ethical Legal Societal Implications (ELSI)”. At a high level, in addition to regulatory approval, major legal issues are liability and conformance to data privacy laws. With regard to ethics and health equity, major concerns include access by all patients who might need the technology and eliminating racial and other medically unjustified biases from the data and the models; specific techniques and practices (detailed in chapter “Regulatory Aspects and Ethical Legal Societal Implications (ELSI)”) should be used to address these important dimensions.

As usual, these requirements can be relaxed (but not ignored) for pure feasibility and exploratory models.

Pitfall 6.2.6.1

Failing to address regulatory, legal, bias, ethical, social justice, and health equity requirements from design to clinical-grade and mission-critical model development validation and deployment.

Best Practice 6.2.6.1

When developing clinical-grade and mission-critical models address regulatory, legal, bias, ethical, social justice, and health equity issues.

Step 7. Production Models and Model Delivery. Health Economic and Implementation Science Considerations

Step 7.1. Converting Optimized Models to Production-Level Models

Very often, development of models is conducted with vast numbers of data inputs, and the resulting models may be impractical to deploy, either because of the sheer complexity of the resulting decision support, or because of the cost of measuring the necessary inputs or of extracting them with bespoke interfaces to the EHR or other data sources. An essential step is therefore the application of strong feature selection that maintains the full information content of the data but discards all unnecessary inputs. This leads to more cost-effective and easier-to-deploy CDS (clinical decision support) or RDS (research decision support).

Another aspect of model inputs is whether they are objective or subjective. Examples of objective data inputs are: body measurements (e.g., weight, height) taken with standardized instruments and protocols or obtained from formal sources (e.g., age, death status, marital status); clinical labs; medications prescribed; psychological, cognitive and other questionnaires/instruments; gene and protein measurements; genetic polymorphisms (germline) or tumor mutations; and so on.

Subjective data inputs examples include: surgeon’s assessment of degree by which all tumor tissue was removed; radiology interpretative reports (although many radiological findings can be very objectively measured); psychiatric mood evaluation; determination by dermatologist of color or smoothness of a skin lesion, etc.

In many situations it is possible for AI/ML model developers to choose between objective and subjective data inputs that convey the same information. Such choice then can be driven by how practical is the use of each data element, what is the cost, whether there are concerns that a model user may attempt to (intentionally or due to implicit biases) “game” the model by skewing subjective data inputs, whether the use of models needs to be fully automated and reproducible, or whether it is used in conjunction with human judgment, and so on.

Especially in the research realm, the effort to use as many objective data inputs as possible is crucial to the reproducibility of the model’s encapsulated knowledge and to scalable application for discovery.

Step 7.2. Workflow Integration

Integrating AI/ML models into clinical or research workflows is another important element of successful deployment. The classical example, often mentioned in “informatics 101” courses or discussions, is that of a care provider who has to stop and spend hours inputting numerous data elements about a patient into a decision support AI/ML tool. Such a setup is entirely disruptive and unworkable and will, with very high probability, lead to failure of adoption of that tool. Newer developments in digital health data harmonization and interoperability address this issue through standardized terminologies and protocols that provide seamless access to the EHR, so that AI/ML and other clinical decision support (CDS) can be integrated into practice easily. Still, there are many aspects of workflow integration that need to be addressed, and these require close collaboration of the AI/ML team with the clinical teams and the IT departments of care provider organizations.

Along these lines the sought integration may be “loose integration”, or “tight integration”. Loosely-integrated CDS often exists in the form of web-services that are being contacted by a hospital or other provider IT system using an asynchronous query-response protocol. More tightly-integrated CDS is typically part of the EHR, the computerized provider order entry (CPOE), the various alerting and guideline systems in active use, and may support both “push” and “pull” operating modes.
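As an illustration of loose integration, the following sketch shows a client-side call from a provider IT system to a CDS web service. The endpoint URL, payload schema, and response field are hypothetical placeholders, not a real standard (such as CDS Hooks) or a specific vendor API, and the `requests` third-party library is assumed.

```python
# Minimal "loose integration" sketch: the EHR-side system queries a remote CDS
# service for a risk score. All names and fields below are hypothetical.
import requests

def request_risk_score(patient_features: dict) -> float:
    resp = requests.post(
        "https://cds.example.org/api/v1/readmission-risk",  # hypothetical endpoint
        json={"features": patient_features},
        timeout=5,  # fail fast so the clinical workflow is never blocked
    )
    resp.raise_for_status()
    return resp.json()["risk"]  # hypothetical response field

if __name__ == "__main__":
    print(request_risk_score({"age": 67, "prior_admissions": 2, "egfr": 48}))
```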

Step 7.3. Sandboxing

Sandboxing refers to the safe pre-deployment testing of an AI/ML model’s decision support in real-life settings, but without directly or indirectly affecting critical patient care decisions. Sandboxing tests both integration aspects, especially proper data access, and many other elements of prospective validation and risk management (see chapter “Characterizing, Diagnosing and Managing the Risk of Error of ML and AI Models in Clinical and Organizational Application”). Recent regulatory efforts in the EU have elevated sandboxing into a legal requirement serving as a precursor to safe delivery of health AI/ML (see chapter “Regulatory Aspects and Ethical Legal Societal Implications (ELSI)”).

Step 7.4. Scaling of CDS

The ability to scale the deployment of a CDS is critical when large numbers of patients are to be managed with input by the AI/ML model or when many providers across geographical regions are supported, and when the response time and availability of the decision support are critical. Fundamental aspects of scalability include horizontally and vertically scalable architectures. An overview of these and many other aspects of scalability are described in [40] and will not be elaborated here further.

Step 7.5. Implementation Science Aspects and Checklist

Another lens on successful deployment of AI/ML models is that of Implementation Science (IS) [41], a new field aiming to improve the speed of innovation adoption. Some notable IS key elements that we have covered in this chapter and throughout the book include:

  1. 1.

    Ensuring that the various models have been built to the specifications of stakeholders,

  2. 2.

    User and stakeholder education,

  3. 3.

    Cost-effectiveness and value of AI/ML decision support,

  4. 4.

    Sustainability, and

  5. 5.

    Rapid but rigorous validation in increasing degrees of proximity to the ultimate deployment setting.

The above requirements of step 7 are typically not strictly needed for feasibility, exploratory, or pre-clinical models.

Pitfall 6.2.7.1

Failing to pay attention to critical issues of implementation including: (1) conversion to practical, inexpensive, objective production models; (2) ensuring sustainability via reimbursement, cost reductions, etc.; (3) demonstrating to stakeholders that clinical or research needs are met and value is added; (4) providing user education and support; (5) ensuring community and patient buy-in; (6) sandboxing CDS while it is evaluated in the care environment; (7) ensuring scaling of CDS; (8) integration into clinical, research and R&D workflows as appropriate.

Best Practice 6.2.7.1

When developing clinical-grade and mission-critical models, address critical issues of implementation including: (1) conversion to practical, inexpensive, objective production models; (2) ensuring sustainability via reimbursement, cost reductions, etc.; (3) demonstrating to stakeholders that clinical or research needs are met and value is added; (4) providing user education and support; (5) ensuring community and patient buy-in; (6) sandboxing CDS while it is evaluated in the care environment; (7) ensuring scaling of CDS; (8) integration into clinical, research and R&D workflows as appropriate.

Step 8. Model Monitoring and Safeguards

From the very early years of AI research, and especially expert systems research, the AI “knowledge cliff” was identified as a major problem with AI systems and a major departure from how humans think and solve problems.

“Knowledge cliff” of an AI system: the boundary of the expertise, knowledge, or problem-solving ability of that system. Within this boundary the system will perform well, while outside it performance may drop significantly and abruptly.
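As one simple illustration of a knowledge-boundary safeguard, the following sketch wraps a classifier with an outlier detector and abstains on inputs that look unlike the training data. The data, models, and the 1% contamination setting are illustrative assumptions, not recommendations from this chapter; scikit-learn is assumed.

```python
# Minimal sketch of one "knowledge cliff" safeguard: abstain (and refer to a human)
# when an input is flagged as out-of-distribution relative to the training data.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression

class GuardedModel:
    """Wraps a fitted classifier with an outlier guard trained on the same data."""
    def __init__(self, model, X_train, contamination=0.01):
        self.model = model
        self.guard = IsolationForest(contamination=contamination,
                                     random_state=0).fit(X_train)

    def predict_or_abstain(self, x):
        x = np.asarray(x, dtype=float).reshape(1, -1)
        if self.guard.predict(x)[0] == -1:           # flagged as out-of-distribution
            return None                              # abstain: outside the knowledge boundary
        return float(self.model.predict_proba(x)[0, 1])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 3))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    gm = GuardedModel(LogisticRegression().fit(X, y), X)
    print(gm.predict_or_abstain([0.1, -0.2, 0.0]))    # typical input -> risk estimate
    print(gm.predict_or_abstain([12.0, -9.0, 15.0]))  # far outside training data -> None
```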

Pitfall 6.2.8.1

Creating and deploying AI/ML models and related decision support that lack protections against falling off the model’s knowledge cliff.

In the chapters discussing data quality and management and characterizing and managing the risk of error of ML and AI models, we will describe in detail practical strategies for creating and deploying AI/ML that stays within its knowledge boundaries. We will also cover important concepts such as: outliers, safe and unsafe decision regions, calibration and recalibration, incorporating patient preferences, data shifts and model performance shifts and their monitoring, distribution checking, causes of data shifts and how to address them, population mixture changes, seasonality and trends, epidemic dynamics, and various interventional externalities (e.g., changes in standards of care, new vaccines, new populations, new treatments). We will also address data and model variants and invariants, pristine vs. noisy inputs, model input mapping and harmonization, missing input values, and rebuilding models.

Best Practice 6.2.8.1

When developing clinical-grade and mission-critical models ensure that the AI/ML models will stay within their knowledge boundaries by addressing: outliers, safe and unsafe decision regions, calibration and recalibration, incorporating patient preferences, managing data shifts and model performance shifts, population mixture changes, seasonality and trends, epidemic dynamics, various interventional externalities (e.g., changes in standards of care, new vaccines, new populations, new treatments). Carefully consider how models can successfully generalize from the original data/populations used for model development and validation to other populations and settings and address pristine vs. noisy inputs, model input mapping and harmonization, missing input values and rebuilding models.

Step 9 Ancillary Benefits and Work Products

During the course of architecting modeling efforts to address clinical and research problems, creating data designs, collecting data, and developing, validating and deploying models, several ancillary and secondary objectives and work products can be produced, including: mechanistic studies, reusable data gathering, and drug target and biomarker discovery. Ideally there should be a plan for how to benefit from those, starting from capturing and eventually sharing the underlying data and findings for future or parallel use by the same groups and others.

Pitfall 6.2.9.1

Creating and deploying AI/ML models without consideration of ancillary and secondary objectives, benefits and work products preservation and management.

Best Practice 6.2.9.1

When developing clinical-grade and mission-critical models ensure that ancillary and secondary objectives, benefits and work products are managed and preserved.

Best Practices 6.2.9.2

Documentation. Throughout the model development process complete and thorough documentation must be maintained. Key elements of this documentation include:

  1. 1.

    Model goals.

  2. 2.

    Risk assessments.

  3. 3.

    Key interactions and input from stakeholders.

  4. 4.

    AI/ML governance and oversight committee deliberations.

  5. 5.

    Software documentation.

  6. 6.

    Data design documentation.

  7. 7.

    Data documentation.

  8. 8.

    IP documentation.

  9. 9.

    Legal and compliance documentation.

  10. 10.

    User guides and training documentation.

  11. 11.

    Ancillary work products documentation.

  12. 12.

    Checklists and worksheets (e.g., ones provided in this book to keep track of following relevant best practices).

Ideally the above documentation should include both raw information and distilled summaries, time-indexed and observing data privacy, and other appropriate laws and regulations.

This concludes the description of AI/ML lifecycle stages and related pitfalls and best practices. Chapters “Foundations and Properties of AI/ML Systems” (foundational methods), “Evaluation” (evaluation metrics and designs), “Overfitting, Underfitting and General Model Overconfidence and Under-Performance Pitfalls and Best Practices in Machine Learning and AI” (overfitting) and “Characterizing, Diagnosing and Managing the Risk of Error of ML and AI Models in Clinical and Organizational Application” (model debugging and managing risk) elaborate practically and theoretically on how one can implement them.

Key Concepts Discussed in Chapter “The Development Process and Lifecycle of Clinical Grade and Other Safety and Performance-Sensitive AI/ML Models”

Exploratory AI/ML models.

Feasibility AI/ML models,

Pre-clinical AI/ML models,

Clinical-grade and mission-critical AI/ML models,

Lifecycle of clinical-grade and other mission-critical AI/ML models,

Targeting model accuracy vs. targeting value proposition,

First-pass analysis and modeling,

IP considerations and tradeoff of different IP strategies,

Ancillary work products,

Health economic and implementation science considerations,

Production-level models,

Clinical and other decision support encapsulating AI/ML models,

Sandboxing models,

Workflow integration,

Scaling model-based decision support,

Implementation science aspects and checklist,

Monitoring models,

“Knowledge cliff” of an AI system,

Model interpretability and explanation strategies,

Modeling equivalence classes,

Documentation practices.

Pitfalls Discussed in Chapter “The Development Process and Lifecycle of Clinical Grade and Other Safety and Performance-Sensitive AI/ML Models”

Pitfall 6.1.1: Treating the development of clinical-grade or mission-critical AI/ML models as if they were exploratory, feasibility or pre-clinical ones.

Pitfall 6.1.2.: Failing to establish and apply appropriate sufficient criteria and enforce BPs for model development, validation, and lifecycle support that will ensure safe and effective deployment in high risk settings.

Pitfall 6.2.1.1: Failure to specify and evaluate meaningful performance targets and real-life context of use.

Pitfall 6.2.1.2.: Failure to engage all appropriate stakeholders.

Pitfall 6.2.1.3.: Failure to establish value targets and translate predictivity and other technical model characteristics into real-world value assessments.

Pitfall 6.2.1.4.: Failure to consider broader interactions of envisioned models with the health system or with the system of science and of R&D. “Tunnel vision” evaluation with blind spots to the broader implications and consequences.

Pitfall 6.2.1.5.: Failure to consider ELSI and JEDI desiderata and consequences.

Pitfall 6.2.1.6.: Interpreting results of models beyond what their properties justify.

Pitfall 6.2.1.7: Interpreting results of models below what its known properties justify.

Pitfall 6.2.2.1: Failure to create a rigorous and powerful data design which facilitates modeling that will meet performance and safety requirements.

Pitfall 6.2.2.2: Failure to consider and judiciously interpret the limitations of convenience data designs on the performance and meaning of feasibility and exploratory models.

Pitfall 6.2.3.1.: Failure to ensure the transition from first pass modeling to optimized clinical grade models by ignoring: the problem space characteristics; data available for development and testing; prior literature on approaches and results previously explored both in terms of data design, algorithms and models; forfeiting verification and reproducing prior literature findings/claims; and failure to obtain robust preliminary estimates of predictivity of the first pass models and whether they meet requirements.

Pitfall 6.2.3.2.: Succumb to overfitting when repeatedly analyzing data from first pass to optimized modeling stages.

Pitfalls 6.2.4.1 in Explanation of AI/ML

  1. 1.

    Using black box models is an obstacle to adoption and trust.

  2. 2.

    Shapley values (and derivative methods) have several severe shortcomings for AI/ML explanation.

  3. 3.

    Feature importance methods, that attempt to compress complex model behaviors into single values per variable, lack the necessary information capacity and may be inconsistent with feature selection theory and causality.

  4. 4.

    Inherently transparent but very large models often resist interpretability.

  5. 5.

    Human experts explaining models must be evaluated for validity just like ML surrogate models.

Pitfall 6.2.4.2: Additional predictive modeling errors in analyses where the equivalence class of the optimal feature sets (Markov Boundaries) are not inferred.

  1. 1.

    The predictor model will be a random member of the Markov Boundary equivalence class. This may not be the cheapest, easiest or most convenient model to deploy clinically.

  2. 2.

    In domains with large equivalence classes, intellectual property cannot be defended since a third party can use an equivalent Markov Boundary and easily bypass a patent or other IP protections.

Pitfall 6.2.4.3.: Additional causal modeling errors in analyses where equivalence classes of Direct Causes are not inferred:

  1. 1.

    Discarding causal variables in favor of non-causal ones (e.g., discarding A because its correlation with outcome vanishes when we include non-causal but information equivalent A’ in a regression model).

  2. 2.

    Over-interpreting models: e.g., believing that because A’ is a model returned by an algorithm, without equivalence class modeling, and A is not, then A’ is biologically more important than A.

Pitfall 6.2.4.4.: Omit important steps in optimizing model performance and characterizing error and other properties that are essential for safe and effective deployment.

Pitfall 6.2.5.1.: Establish and exercise IP rights that defy fundamental principles of scientific reproducibility, openness to model and method scrutiny, and validation.

Pitfall 6.2.5.2.: Failing to establish IP rights that are critical for successful dissemination and benefit from AI/ML innovation.

Pitfall 6.2.5.3.: Failing to protect IP rights from bypassing IP by exploiting model equivalence classes.

Pitfall 6.2.6.1.: Failing to address regulatory, legal, bias, ethical, social justice, and health equity requirements from design to clinical-grade and mission-critical model development validation and deployment.

Pitfall 6.2.7.1.: Failing to pay attention to critical issues of implementation including: (1) conversion to practical, inexpensive, objective production models; (2) ensuring sustainability via reimbursement, cost reductions, etc.; (3) demonstrating to stakeholders that clinical or research needs are met and value is added; (4) providing user education and support; (5) ensuring community and patient buy-in; (6) sandboxing CDS while it is evaluated in the care environment; (7) ensuring scaling of CDS; (8) integration into clinical, research and R&D workflows as appropriate.

Pitfall 6.2.8.1: Creating and deploying AI/ML models and related decision support that lack protections against falling off the model’s knowledge cliff.

Pitfall 6.2.9.1: Creating and deploying AI/ML models without consideration of ancillary and secondary objectives, benefits and work products preservation and management.

Best Practices Discussed in Chapter “The Development Process and Lifecycle of Clinical Grade and Other Safety and Performance-Sensitive AI/ML Models”

Best Practice 6.1.1.: Define the goals and process of AI/ML model building as either feasibility/exploratory or as clinical-grade/mission-critical and apply appropriate quality and rigor criteria and best practices.

Best Practice 6.2.1.1: When pursuing risk and performance-sensitive model development specify concrete model performance targets for well defined care or discovery settings.

Best Practice 6.2.1.2: When pursuing risk and performance-sensitive model development engage all appropriate stakeholders.

Best Practice 6.2.1.3: When pursuing risk and performance-sensitive model development translate model accuracy to value, establish value targets and translate predictivity and other technical model characteristics into real-world value assessments.

Best Practice 6.2.1.4: When pursuing risk and performance-sensitive model development carefully consider and plan for system-level goals and interactions. Avoid too narrow (“tunnel vision”) model development.

Best Practice 6.2.1.5: When pursuing clinical-grade and risk-sensitive model development, carefully consider ELSI and JEDI desiderata and consequences.

Best Practice 6.2.1.6: When pursuing feasibility, exploratory, or pre-clinical models relax stringency of requirements applicable to clinical-grade models.

Best Practice 6.2.1.7: When pursuing clinical-grade and risk-sensitive model development interpret models and models’ decisions exactly as their known properties justify.

Best Practice 6.2.2.1: When pursuing risk and performance-sensitive model development create a rigorous and powerful data design which facilitates modeling that will meet performance and safety requirements.

Best Practice 6.2.2.2: When pursuing risk and performance-sensitive model development judiciously interpret the limitations of convenience data designs on the performance and meaning of feasibility and exploratory models.

Best Practice 6.2.3.1.: When moving from first pass modeling to optimized clinical grade models take into account: the problem space characteristics; data available for development and testing; prior literature on approaches and results previously explored both in terms of data design, algorithms and models; verification and reproducing prior literature findings/claims; and obtaining robust preliminary estimates of predictivity of the first pass models and whether they meet requirements.

Best Practice 6.2.3.2.: Avoid overfitting due to repeatedly analyzing data from first pass to optimized modeling stages.

Best Practice 6.2.4.1. Everything else being equal, prefer interpretable model families when interpretability is desired.

Best Practice 6.2.4.2. Use standardized coefficients (if applicable) when comparing feature contributions in linear models.

Best Practice 6.2.4.3. Very large models, even when produced with intrinsically interpretable methods, may still be hard to interpret because of sheer scale. Isolating critical information from large models or simplification are recommended.

Best Practice 6.2.4.4. Apply feature selectors that maximally reduce dimensionality without loss of predictivity. Compact models are always easier to explain. Combine with interpretable model families or surrogate models as appropriate.

Best Practice 6.2.4.5. If accuracy is of paramount importance and if the black box models have significant accuracy advantage over the best interpretable models you can build, then use the black box model but apply explanation methods:

  1. (a)

    Global surrogate models aiming to have high fidelity everywhere in the input space over all patterns that will be classified by the model. Verify generalizable fidelity of surrogate model before using.

  2. (b)

    Local surrogate models aiming to have high fidelity in the local input space for every pattern pi that will be classified by the model. Verify generalizable fidelity of surrogate model before using.

  3. (c)

    Human expert surrogate models which must be high fidelity everywhere in the input space and over-fitting resistant. Verify generalizability and fidelity of human expert explanations of models.

Best Practice 6.2.4.6. Shapley values, Shapley value approximations and feature importance methods that try to summarize complex model behaviors in one or few values, are not advised as general or routinely used methods for explaining ML models.

Best Practice 6.2.4.7.

  1. (a)

    Use equivalence class modeling algorithms for discovering the equivalence class of optimally accurate and non reducible predictive models. E.g. TIE* instantiated with GLL-MB or other sound Markov boundary subroutines [38] (see chapter “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare and Health Science”).

  2. (b)

    Use equivalence class modeling algorithms for discovering the equivalence class of direct causes. E.g. TIE* instantiated with GLL-PC or other sound local causal neighborhood subroutines (chapter “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare and Health Science”).

  3. (c)

    When experiments can be conducted, consider using ML-driven experimentation algorithms that model equivalence classes. Experimentation may be needed to resolve the equivalence classes and unmeasured confounding. Such algorithms minimize the number of experiments needed. E.g., ODLP* algorithm [39].

Best Practice 6.2.4.8. When pursuing risk and performance-sensitive model development: optimize model performance verifying that targets are met; otherwise modify or enhance data design, algorithms, and protocols or relax requirements; once modeling is complete characterize error and other properties that are essential for safe and effective deployment and explain models and check their face validity.

Best Practice 6.2.5.1.: Do not establish and exercise IP rights in ways that undermine fundamental principles of scientific reproducibility, openness to model and method scrutiny, and validation.

Best Practice 6.2.5.2.: Do establish IP rights that are critical for successful dissemination and patient and society benefit from AI/ML innovation.

Best Practice 6.2.5.3.: Protect IP rights from “bypassing” that exploits model equivalence classes.

Best Practice 6.2.6.1.: When developing clinical-grade and mission-critical models address regulatory, legal, bias, ethical, social justice, and health equity issues.

Best Practice 6.2.7.1.: When developing clinical-grade and mission-critical models, address critical issues of implementation including: (1) conversion to practical, inexpensive, objective production models; (2) ensuring sustainability via reimbursement, cost reductions, etc.; (3) demonstrating to stakeholders that clinical or research needs are met and value is added; (4) providing user education and support; (5) ensuring community and patient buy-in; (6) sandboxing CDS while it is evaluated in the care environment; (7) ensuring scaling of CDS; (8) integration into clinical, research and R&D workflows as appropriate.

Best Practice 6.2.8.1.: When developing clinical-grade and mission-critical models ensure that the AI/ML models will stay within their knowledge boundaries by addressing: outliers, safe and unsafe decision regions, calibration and recalibration, incorporating patient preferences, managing data shifts and model performance shifts, population mixture changes, seasonality and trends, epidemic dynamics, various interventional externalities (e.g., changes in standards of care, new vaccines, new populations, new treatments).

Carefully consider how models can successfully generalize from the original data/populations used for model development and validation to other populations and settings and address pristine vs. noisy inputs, model input mapping and harmonization, missing input values and rebuilding models.

Best Practice 6.2.9.1.: When developing clinical-grade and mission-critical models ensure that ancillary and secondary objectives, benefits and work products are managed and preserved.

Best Practices 6.2.9.2. Documentation. Throughout the model development process complete and thorough documentation must be maintained. Key elements of this documentation include:

  1. 1.

    Model goals.

  2. 2.

    Risk assessments.

  3. 3.

    Key interactions and input from stakeholders.

  4. 4.

    AI/ML governance and oversight committee deliberations.

  5. 5.

    Software documentation.

  6. 6.

    Data design documentation.

  7. 7.

    Data documentation.

  8. 8.

    IP documentation.

  9. 9.

    Legal and compliance documentation.

  10. 10.

    User guides and training documentation.

  11. 11.

    Ancillary work products documentation.

  12. 12.

    Checklists and worksheets (e.g., ones provided in this book to keep track of following relevant best practices).

Classroom Assignments and Discussion Topics, Chapter “The Development Process and Lifecycle of Clinical Grade and Other Safety and Performance-Sensitive AI/ML Models”

  1. 1.

    How does training of human clinical experts teach them to understand the limits of their expertise and how to refer problems outside their expertise to other experts? Which of these approaches could be incorporated in clinical grade AI/ML systems?

  2. 2.

    How does training of human scientist experts teach them to understand the limits of their expertise and how to refer problems outside their expertise to other experts? Which of these approaches could be incorporated in mission-critical AI/ML systems supporting health science discovery?

  3. 3.

    Conversely, some human experts (including physicians) are notoriously miscalibrated. How can AI/ML models address human cognitive decision-making errors?

  4. 4.

    You are a study section member reviewing a study proposal that reads like the following vignette:

    “The PI of the present proposal has devoted her career to exploring Boosting-based predictive models in healthcare. Our overarching goal is to create powerful models to predict readmissions. We will use our novel methodology of BoostedBoostBoosting (B3) on readmission data from the ED of Hospital X. Deploying such models in practice has the potential to reduce readmissions for a range of disorders, thus reducing costs and increasing the quality of care.”

    Would you say this proposal is:

    1. (a)

      Biased in its choice of methods?

    2. (b)

      Well thought-out?

    3. (c)

      Compelling in its logic?

    4. (d)

      Likely to succeed in its stated goals?

    5. (e)

      Likely to yield useful results for clinical science?

    6. (f)

      Likely to yield useful results for patients?

    7. (g)

      Likely to yield useful results for AI/ML science (methods)?

    8. (h)

      Has hedged its bets on the new methods in case they are not as successful as hoped for?

    Use what you learnt in chapters “Foundations of Causal ML” and “Principles of Rigorous Development and of Appraisal of ML and AI Methods and Systems” to answer the above.

  5. 5.

    Can you improve the proposal vignette description to address any problems you identified in #4?

  6. 6.

    Consider a publication that describes an AI/ML regression model for risk of developing high blood pressure using a number of risk factors as inputs to the model. The overall explained variance (aka coefficient of determination, R²) of the model is low; however, one of the risk factors, X, has an odds ratio of 10 for high blood pressure. Are this model and risk factor X useful?

  7. 7.

    Consider an AI/ML risk model for developing Rheumatoid Arthritis incorporating GWAS data inputs (polymorphisms). None of the polymorphisms has an odds ratio with absolute value larger than 1.5 (i.e., the univariate effects are very small). Can this model be useful? How?

  8. 8.

    Consider the case of a very rare outcome Ox for patients with condition X. The probability of Ox in patients with X is 0.001. A new AI/ML model M has been developed recently that identifies patients with X who will develop Ox with probability 0.01 (i.e. a ten-fold increase in risk). At the same time the model has a false positive probability of 0.01. What are the challenges of incorporating this model in clinical practice?

  9. 9.

    Your university-affiliated hospital wishes to increase early diagnosis of cognitive decline across the population it serves. You are tasked to help choose between the following AI/ML technologies/tools for early detection of cognitive decline:

    1. (a)

      AI/ML tool A is 99% accurate.

    2. (b)

      AI/ML tool B has sensitivity 99% and specificity 99%.

    3. (c)

      AI/ML tool C has AUC 99%.

    4. (d)

      AI/ML tool D has physician satisfaction 99%.

    5. (e)

      AI/ML tool E has 99% uptime.

    Do you have a preference among these tools? Which one will be most useful clinically? If you need additional information, what might this be?

  10. 10.

    An AI/ML classifier model has acceptable error rate in 2/3 of the patient population. This subpopulation is identifiable by the model and its properties. How would you address decisions in the 1/3 of the population for which the model is insufficiently accurate?

  11. 11.

    As explained in chapter “Foundations and Properties of AI/ML Systems” Bayesian Networks (BNs) have great flexibility in expressing flexible-input queries:

    1. (a)

      How can this be used to deal with incomplete evidence at decision making time?

    2. (b)

      For the same type of outcome, depending on the inputs used, the same model can have a wide range of errors. How would you characterize them as safe or not when there are so many of them?

    3. (c)

      How would you create safeguards for clinical-grade use of a BN model?

  12. 12.

    Some critics of Bayesian classifiers have criticized them on grounds that they may impose unrealistic data assumptions (for example the simple Bayes classifier), and that the error of output is multiplicative to errors of parameter specification (e.g., in BNs). How, in general, does the accuracy of model parameters relate to overall model error? Demonstrate cases where these criticisms are valid and cases where they are not.

  13. 13.

    Claim: it is always desirable to “bag” a variety of AI/ML models (i.e., build several models using repeated sampling or resampling and average them out). Similarly, any new model X should be a bagged classifier over all model families we can fit with our data.

    Are these statements true or false?

  14. 14.

    Claim: it is theoretically optimal to use Bayesian Model Averaging (BMA) over all possible models. Thus our new model X should be the BMA over the top 100 models (by predictivity or posterior probability given the data). True or false?

  15. 15.

    Friedman has proposed the following adage which he coined “the fundamental theorem of informatics”: “A computer decision model + human expert is better than either one of them” (paraphrased here for clarity).

    Strictly speaking, this statement is neither a mathematical theorem nor is it always true in the absence of well-defined premises (see chapter “From ‘Human versus Machine’ to ‘Human and Machine’” for details).

    However, this statement implies useful guidance about:

    1. (a)

      The need to consider the success of models in a human system/context.

    2. (b)

      The need to integrate models in a workflow.

    Can you elaborate?

  16. 16.

    Identify 3 serious negative consequences that new knowledge discovery AI/ML model errors may have on each one of the following areas:

    1. (a)

      Scientific literature.

    2. (b)

      Drug discovery and Pharma.

    3. (c)

      Environmental policy.

    4. (d)

      Health crisis management.

    5. (e)

      Vaccination hesitancy and adoption.

  17. 17.

    Among the various stakeholders (patients, providers, health systems, payers, regulators) in development of AI/ML models, the same model errors may have asymmetrical costs (i.e., a large cost for stakeholder A may be small for B and medium for C and so on). The same is true for benefits.

    1. (a)

      Can you think of examples?

    2. (b)

      What principles and what methods do you think could/should be used to reconcile asymmetrical costs and asymmetrical benefits across multiple stakeholders?

  18. 18.

    AI/ML models with errors often translate to important opportunity costs (i.e., costs incurred by not taking a particular action such as administering useful medical interventions). Can you think of examples of opportunity costs stemming from model errors in the healthcare domain as well as in the research domain?

  19. 19.

    In engineering, machines, buildings, electronics, and other artifacts are expected to have well-defined parameters of safety and function. For example:

    1. (a)

      Bridges have well-specified weight loads, structural integrity in case of winds, earthquakes, etc.

    2. (b)

      Cars have well-defined fuel efficiency, braking ability, acceleration, collision safety etc.

    3. (c)

      Electronics have well-defined power supply inputs, are guaranteed to not cause fires, and so on.

    4. (d)

      Maintenance schedules describe how long engineering artifacts can go without service, when they need replacement, what the service should be at specific intervals, etc.

    5. (e)

      Warranties ensure that, in case of failure, the seller of the technology will incur replacement or repair costs.

    6. (f)

      In case of negligent construction of the products, the manufacturers are liable.

    7. (g)

      Similarly, drugs come with labels that specify the intended usage, dosage, side effects, etc.

    In 2023 commercial and academic AI/ML models as well as AI/ML model-producing methods do not typically come with such well-defined and regulated guarantees. Consider the following questions:

    1. (a)

      Would the users of AI/ML models benefit if these models came with precise performance and safety guarantees?

    2. (b)

      Are there downsides to that?

    3. (c)

      What would it take to transition the current way AI/ML is marketed and used, into a better-regulated, and guarantees-driven technology?

    4. (d)

      If, hypothetically, at this time some of the AI/ML technologies and products cannot be provided with performance and safety guarantees, does this imply that they are pre-scientific? What might be the consequences of such a state of affairs?

  20. 20.

    As mentioned in chapter “Artificial Intelligence (AI) and Machine Learning (ML) for Healthcare and Health Sciences: the Need for Best Practices Enabling Trust in AI and ML”, one of the most important and impressive AI efforts was the INTERNIST-I expert system (by Miller, Myers and Pople Jr.), which performed diagnosis across the full spectrum of internal medicine with accuracy meeting or exceeding that of attending-level diagnosticians in very challenging diagnostic cases (NEJM’s clinico-pathologic challenge cases). Neither this system, nor its successor QMR (Miller et al.), nor offshoots such as DXplain (Barnett et al.), managed to gain wide and long-lasting traction among working internists and health systems, however.

    1. (a)

      If one considers these systems as clinical-grade, then explain this failure in terms of relevant pitfalls discussed in the present chapter.

    2. (b)

      Provide an alternative evaluation of these weaknesses if one considers these efforts as exploratory, feasibility, or pre-clinical AI/ML.

  21. 21.

    A non-profit health system had an operating margin of $10 million in the last financial year and its Board of Directors decides to use it to improve health outcomes in the next financial year. The available options are as follows:

    1. (a)

      Intervention 1 is an AI/ML-based precision medicine test for breast cancer treatment with ICER = $50,000 per QALY gained. The applicable population is forecasted to be 500 individuals in the next FY. Each test (ordered once per patient) costs $5000.

    2. (b)

      Intervention 2 is a new Covid-19 antiviral with ICER = $100,000 per QALY gained. The applicable population is forecasted to be 300 individuals in the next FY (which will be the time frame of decision making for this exercise). Each treatment regimen (administered once per patient) costs $50,000.

    3. (c)

      Intervention 3 is an AI/ML intervention to increase compliance with high blood pressure medication, with ICER = $10,000 per QALY gained. The applicable population is forecasted to be 1000 individuals in the next FY. Each intervention (applied once per patient) costs $1000.

      Answer the following:

      1. 1.

        How many QALYs would be gained by applying these interventions if unlimited funds were available?

      2. 2.

        Describe an optimal allocation of the available (limited) funds so that the QALYs gained are maximized.

      3. 3.

        If a new AI/ML technology were employed that led to a new ICER of $25,000 for intervention 1, how would that affect the optimal allocation policy of funds and the improvement in total QALYs gained?

      4. 4.

        Comment on the interaction of technology improvement in the AI/ML of intervention 1 with the benefits from intervention 3 in the context of the global optimal funds allocation of (3).

      5. 5.

        Not all patients who could benefit from each intervention will receive it because of limited funds. What ethical and social justice principles might be used to decide which patients receive a particular intervention and which do not?

      6. 6.

        The health system’s AI/ML unit proposes an R&D project that will improve the accuracy of the models, yielding a new ICER for the improved intervention 1 of $20,000. The cost of the project is a one-time $1 million to be covered by the $10 million funds pool. The project can be executed immediately if funded.

        • Should this R&D proposal be approved?

        • What if the estimated risk for failure to obtain the targeted improvement is 50%?

        • What is the break-even point of the risk of failure for the proposed R&D?

      7. 7.

        Can you outline a possible process and give a concrete numerical example of how the AI/ML unit may have calculated the expected ICER for intervention 1 based on anticipated model predictivity improvements?