Essential Model Diagnostics and Model Characterization

Recall from chapter “Principles of Rigorous Development and of Appraisal of ML and AI Methods and Systems” that well-engineered AI/ML methods have well-characterized properties (theoretical and empirical) across many relevant dimensions that ensure that produced models have appropriate: Representation power; Transparency and Explainability; Soundness; Completeness; Tractable computational complexity of learning models; Tractable computational complexity of using models; Tractable space complexity of learning models; Tractable space complexity of storing and using models; Realistic sample complexity, learning curves, and power-sample requirements; Probability and decision theoretic consistency; Strong comparative and absolute empirical performance in simulation studies; and Strong comparative and absolute performance in real data with hard and soft gold standard known answers.

These properties (and especially the empirical ones), however, must be further studied at a more granular level once specific models are constructed following the best practices described in chapter “The Development Process and Lifecycle of Clinical Grade and Other Safety and Performance Sensitive AI/ML Models”. For example, whereas we know that SVM methods are particularly well suited theoretically and empirically to constructing omics classifiers, or DL methods to image recognition, and so on (chapter “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare & Health Science”), the specific level of performance and risk of error of a particular model built from a specific dataset for a specific problem-solving context requires additional analysis tailored to the particulars of that application.

Therefore, as we transition from method development and characterization to model fitting and characterization, and as we further consider the stages and components of a particular model’s lifecycle, we move from general properties and lifecycle stages to a very concrete understanding of precisely how well this particular model will perform in the problem-solving context at hand.

We clarify that the present chapter deals almost exclusively with risks due to prediction and other model output errors, giving more operational post-hoc analysis details for the process described in chapter “The Development Process and Lifecycle of Clinical Grade and Other Safety and Performance Sensitive AI/ML Models”. We do not deal with regulatory, ethical, reproducibility, and related risks, which are discussed in chapters “Regulatory Aspects and Ethical Legal Societal Implications (ELSI)” and “Reporting Standards, Certification/Accreditation, and Reproducibility”.

An important first diagnostic for predictive models is testing whether the model is statistically significantly different from the null model (i.e., a model that does not have any predictive signal). This is typically conducted with a label reshuffling test (LRT, see chapter “The Development Process and Lifecycle of Clinical Grade and Other Safety and Performance-Sensitive AI/ML Models”). The LRT will also inform whether the whole modeling protocol has any propensity to produce unduly optimistic performance generalization estimates under the null hypothesis (i.e., the data not having any predictive signal for the response).
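
To make the procedure concrete, the following minimal sketch (in Python, assuming scikit-learn; the synthetic data, estimator, scoring metric, and permutation count are illustrative placeholders) runs a basic label-reshuffling test for a single classifier. A full protocol-level LRT would wrap the entire model selection and error estimation protocol rather than one estimator.

```python
# Minimal label-reshuffling (permutation) test sketch using scikit-learn.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, permutation_test_score

X, y = make_classification(n_samples=500, n_features=30, random_state=0)  # illustrative data
clf = LogisticRegression(max_iter=1000)                                   # placeholder estimator
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Cross-validated score on the true labels, the null distribution of scores
# under label reshuffling, and the resulting permutation p-value.
score, perm_scores, p_value = permutation_test_score(
    clf, X, y, cv=cv, scoring="roc_auc", n_permutations=1000, random_state=0)

print(f"AUC on true labels: {score:.3f}")
print(f"Mean AUC under reshuffled labels: {np.mean(perm_scores):.3f}")  # should be ~0.5
print(f"Permutation p-value: {p_value:.4f}")
# A mean permuted AUC well above 0.5 suggests the modeling protocol itself
# produces optimistic estimates under the null and should be inspected.
```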

If additional validation datasets are available (beyond those used in the primary error estimation procedures), they can indicate whether the model and its associated generalization error/performance estimates indeed generalize well to new datasets sampled from the same population. For protocols that have passed the LRT for propensity to produce biased generalization error estimates, observed “shrinkage” of performance from the original estimates to the new ones may be due to: (a) normal sampling variation; or (b) differences between the new validation data and the discovery data because they originate from a slightly (or radically) different population. The latter possibility warrants further investigation, if observed.

One of the most important diagnostics for new models is their calibration. Chapter “Evaluation” provides a thorough technical description. To summarize the key concepts, calibration refers to how close a model’s predicted probabilities (or other outputs) are to the true values observed in the data. A perfectly calibrated model is not necessarily a highly accurate model, since the model may output predictions with wide uncertainty. If, for example, a perfectly calibrated model outputs that the probability of outcome T taking value 1 for input instance i is 0.8, then in 80% of identical future cases T = 1 and the model will be correct, whereas in 20% of cases it will be wrong. In applications where it is not possible to achieve very accurate predictions, it is still essential that a high degree of calibration is achieved. Recalibration refers to the procedure whereby a miscalibrated model’s outputs are adjusted so that they are better calibrated, without the need to rebuild the model. The binning method is a very simple but very useful method for recalibrating models. The analyst first estimates the model’s calibration in ranges (“bins”) of probability outputs and then maps the original predictions to the true (calibrated) probabilities. The same technique can be used to convert a non-probabilistic output into a calibrated probability output (Fig. 1).

Fig. 1
Two histograms, a table, and the five steps of the binning method: train an SVM classifier on the training set; apply it to the validation set; create a histogram with Q bins; place a new sample from the testing set into its corresponding bin; and compute its probability as the fraction of true positives in that bin.

Conversion of a non-probability model to a calibrated probability one, or of a non-calibrated probability output to a calibrated one, using the binning method. In the example, an SVM model is converted to calibrated probabilities; however, the approach can be used on any classifier or regressor [1]
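
The following minimal sketch (in Python, assuming scikit-learn; the synthetic data, SVM settings, and number of bins Q are illustrative) follows the steps of Fig. 1 to convert raw SVM scores into binned, calibrated probabilities.

```python
# Sketch of the binning (histogram) recalibration method of Fig. 1.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=3000, n_features=20, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# 1. Train the (non-probabilistic) SVM classifier on the training set.
svm = SVC(kernel="rbf").fit(X_train, y_train)

# 2. Apply it to the validation set to obtain raw decision scores.
val_scores = svm.decision_function(X_val)

# 3. Create Q bins over the score range (quantile edges keep bins populated).
Q = 10
edges = np.quantile(val_scores, np.linspace(0, 1, Q + 1))
val_bins = np.clip(np.digitize(val_scores, edges[1:-1]), 0, Q - 1)

# 4-5. The calibrated probability of a bin is the fraction of positives among
# validation samples in that bin; new (testing-set) samples inherit their bin's value.
bin_prob = np.array([y_val[val_bins == b].mean() if np.any(val_bins == b) else 0.5
                     for b in range(Q)])
test_bins = np.clip(np.digitize(svm.decision_function(X_test), edges[1:-1]), 0, Q - 1)
calibrated_test_probs = bin_prob[test_bins]
```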

Probability conversion of non-probabilistic outputs can also be accomplished by other methods, for example by using a mapping function such as a sigmoid filter (Fig. 2).

Fig. 2
A line graph, a table, and the four steps of the sigmoid filter method: train an SVM classifier on the training set; apply it to the validation set; determine the parameters of the sigmoid function by minimizing the negative log-likelihood on the validation set; and compute the posterior probability of a new sample from the testing set.

Conversion of a non-probability model to a calibrated probability one using the sigmoid filter (i.e., mapping function) method. In the example, an SVM model is converted to calibrated probabilities; however, the approach can be used on any classifier or regressor [1]
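
A corresponding sketch of the sigmoid-filter approach of Fig. 2, here approximated by fitting a logistic (sigmoid) function to the validation-set scores (a common stand-in for Platt scaling); it reuses the variables from the binning sketch above.

```python
# Sigmoid-filter (Platt-style) conversion of raw SVM scores to probabilities,
# continuing from the variables defined in the binning sketch above.
from sklearn.linear_model import LogisticRegression

# Fit the sigmoid's parameters by minimizing the negative log-likelihood of the
# validation labels given the raw validation-set SVM scores.
platt = LogisticRegression().fit(val_scores.reshape(-1, 1), y_val)

# Posterior probabilities for new (testing-set) samples.
test_scores = svm.decision_function(X_test)
sigmoid_test_probs = platt.predict_proba(test_scores.reshape(-1, 1))[:, 1]
```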

Analysts have a wealth of calibration metrics at their disposal. These are designed to align with the data design and loss functions used in the project. For example, calibration metrics have been developed that are appropriate for case-control binary classification, n-ary classification, regression, survival and time-to-event models, time series models, etc.; see chapter “Evaluation” for details.
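
As an illustration of two widely used choices for binary classifiers, the sketch below (assuming scikit-learn and the calibrated test-set probabilities produced in the sketches above) computes a reliability curve, the Brier score, and a simple unweighted variant of the expected calibration error (ECE).

```python
# Illustrative calibration metrics, continuing from the calibrated test-set
# probabilities produced in the sketches above.
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

# Reliability curve: observed event frequency vs. mean predicted probability per bin.
frac_pos, mean_pred = calibration_curve(y_test, calibrated_test_probs, n_bins=10)

# Simple (unweighted) expected calibration error summarizing the reliability curve.
ece = np.mean(np.abs(frac_pos - mean_pred))
print(f"Brier score: {brier_score_loss(y_test, calibrated_test_probs):.3f}  ECE: {ece:.3f}")
```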

Models’ reliable and unreliable decision regions. It is also possible to invert the logic of the calibration analysis and seek the regions of the model’s output space where acceptable or unacceptable prediction errors are observed. This approach establishes the output regions where the model’s predictions are trustworthy (i.e., low-error) and the regions where they are not. It is advisable in such an analysis to calculate the FPR, FNR, TPR, TNR, or other loss functions and evaluation metrics of interest so that the model can be safely deployed (see chapter “Evaluation”).
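
A minimal sketch of such an analysis (continuing from the calibration sketches above; the region grid and the 10% error tolerance are illustrative, application-specific choices) flags output regions whose validation error exceeds the tolerance and abstains for new predictions that fall in them.

```python
# Sketch: identifying reliable vs. unreliable output regions on the validation set.
import numpy as np

val_probs = bin_prob[val_bins]                      # calibrated validation probabilities
region_edges = np.linspace(0, 1, 11)                # output-space regions
regions = np.clip(np.digitize(val_probs, region_edges) - 1, 0, 9)

reliable = np.zeros(10, dtype=bool)
for r in range(10):
    in_region = regions == r
    if not in_region.any():
        continue
    preds = (val_probs[in_region] >= 0.5).astype(int)
    error_rate = np.mean(preds != y_val[in_region])  # could use FPR/FNR etc. instead
    reliable[r] = error_rate <= 0.10                 # application-specific tolerance

def is_trustworthy(p):
    """True if a calibrated prediction p falls inside a reliable output region."""
    return reliable[np.clip(np.digitize(p, region_edges) - 1, 0, 9)]
```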

The above model characterizations remove limitations that are analogous to human cognitive limitations, e.g., the famous Dunning-Kruger effect [2], whereby human decision makers believe that their performance relative to others is higher than it actually is, a bias that is stronger in decision makers of low ability. By establishing calibration, confidence and credible intervals, and reliable decision regions, ML models can avoid these biases altogether and be equipped with the functional equivalent of self-awareness of their limitations, which promotes safe model application.

Another useful post-hoc analysis is that of stability, which measures the degree to which the structure of a model or the values of its parameters change as a function of sampling variation. For practical reasons, stability analyses are typically conducted by generating a large number of datasets re-sampled from the original dataset to simulate a sampling distribution. The modeling is then conducted on each dataset and metrics of the stability of model structures and parameters are calculated. In common practice, highly unstable structures or parameter values are treated with caution since they may be the result of variation due to small training sample size. We caution, however, that instability in modeling may also be caused by structural properties of the distribution and/or the learning algorithm’s operating characteristics, and is not necessarily proof that a model is unreliable or not generalizable [3]. The existence of equivalence classes, in particular, may lead to highly unstable features and to unstable models fitted from them by randomized algorithms; however, if the features are unstable because they are members of an equivalence class, they can still generalize well predictively. See chapter “The Development Process and Lifecycle of Clinical Grade and Other Safety and Performance-Sensitive AI/ML Models” for a detailed discussion of equivalence classes and the importance of modeling them.
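
A minimal bootstrap sketch of a feature selection stability analysis (in Python, assuming scikit-learn; the synthetic data, L1-penalized model, and 200 resamples are illustrative) records how often each feature is selected across resampled datasets.

```python
# Bootstrap stability analysis sketch for feature selection.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=50, n_informative=5, random_state=0)
rng = np.random.default_rng(0)
B, n = 200, X.shape[0]
selected = np.zeros((B, X.shape[1]), dtype=bool)

for b in range(B):
    idx = rng.integers(0, n, n)                      # bootstrap resample
    model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
    model.fit(X[idx], y[idx])
    selected[b] = model.coef_.ravel() != 0           # which features were selected

selection_frequency = selected.mean(axis=0)          # per-feature stability
print("Features selected in <50% of resamples:", np.where(selection_frequency < 0.5)[0])
# Low selection frequency flags instability, but (per the text) it may also reflect
# membership in an equivalence class rather than lack of predictive value.
```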

Equivalence classes are also very important for causal modeling since causal discovery algorithms, by necessity (imposed by learning theory), can learn the data generating function only up to its equivalence class. Whereas some algorithms (e.g., PC or GES) can score or learn representations of certain types of equivalence classes (e.g., those due to latent variables or Markov equivalent structures), other algorithms (e.g., MMHC) learn a single member of the class, and the analyst then has to generate the equivalence class of that member. See chapter “Foundations of Causal ML” for more details about how these algorithms operate and what classes they output.

Similarly, equivalence classes for feature selection can be critically important since they can be used to investigate all possible sets of optimal predictive model inputs, both for insights into the process that generates the data and for choosing the model whose inputs are most convenient, accessible, and easiest to deploy. Currently very few algorithms exist for inferring feature set equivalence classes (see chapters “Foundations and Properties of AI/ML Systems”, “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare & Health Science”, “The Development Process and Lifecycle of Clinical Grade and Other Safety and Performance-Sensitive AI/ML Models”), and among them, at this time, only TIE* has a well-developed theory, wide applicability, and reliable performance [4].

Establishing credible intervals (CrIs) for a model’s outputs is also an important model characteristic and diagnostic. The CrI differs from the common statistical confidence interval (CI): the latter measures the range of values (e.g., at the 95% or another level) that a model’s parameters, accuracy, or properties take when models are built from repeated samples from the population. The 95% (or other width) CrI is the range that contains the true value of the model’s parameter or predicted response with probability 95%. In a Bayesian framework it corresponds to a region of the posterior distribution of an estimated predicted response value or parameter value [5, 6].
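
As a minimal numerical illustration (the counts are hypothetical), a Bayesian credible interval for an event probability, such as a model’s error rate in a decision region, can be obtained from a Beta posterior under a uniform prior.

```python
# Minimal Bayesian credible interval sketch: 95% CrI for an error rate,
# using a Beta(1, 1) (uniform) prior and a binomial likelihood.
from scipy import stats

errors, n = 12, 200                                  # hypothetical: errors out of n cases
posterior = stats.beta(1 + errors, 1 + n - errors)   # Beta posterior
cri_low, cri_high = posterior.ppf(0.025), posterior.ppf(0.975)
print(f"95% CrI for the error rate: [{cri_low:.3f}, {cri_high:.3f}]")
```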

In all cases, and especially after a model has been generated in, or converted to, a human-readable form (using the many techniques and best practices discussed in chapter “The Development Process and Lifecycle of Clinical Grade and Other Safety and Performance-Sensitive AI/ML Models”), face validity tests are very useful. Such tests are conducted by domain experts (in some cases augmented by automated literature extraction and synthesis) who use domain knowledge to identify implausible patterns and relationships, or apparently dubious decision logic, as possible errors in the model’s construction.

If, as a trivial example, a model suggests that an inheritable genetic factor is caused by a lifestyle behavior, that obviously denotes a reversal of causal order and is highly suspect. If, in a more sophisticated example, terminal outcomes (e.g., death) appear to have spouse variables (i.e., variables that share a direct effect with the outcome), this violates the data design and measurement constraints and has to be explained by the existence of unmeasured confounders (the most likely explanation in this example), by data errors, or by modeling errors.

Best Practice 13.1

Measure calibration and recalibrate as needed.

Best Practice 13.2

Convert scores to probabilities.

Best Practice 13.3

Test models for difference from null model.

Best Practice 13.4

Identify models’ reliable and unreliable decision regions.

Best Practice 13.5

Measure stability and flag unstable models.

Best Practice 13.6

Extract and report model equivalence classes.

Debugging ML

In conventional programming, several techniques and tools exist for debugging programs to ensure that their behavior is the one intended by the programmer. “Bugs” are coding errors that divert the program from the programmer’s original intention to unwanted or unanticipated behavior. A variety of tools, techniques, and resources have been invented to help with conventional debugging, including interactive debugging, control flow analysis, unit testing, log file analysis, memory dumps, profiling, and other techniques [7,8,9].

Debugging ML model building, however, involves many additional complexities and represents a higher order of difficulty, for the following reasons:

  1. 1.

    ML programs do not implement functions but functionals. An ordinary programming function maps a set of inputs (the domain) to a set of outputs (the codomain). An ML functional is a function that takes a set of inputs (training and validation datasets) and maps them to a set of functions (i.e., decision models, which are themselves functions that take problem domain instances as inputs and output instance-specific decisions); see the sketch after this list.

  2. 2.

    Whereas conventional programming admits a single or a small number of correct solutions (e.g., a ranked list of numbers, all paths from point A to point B in a map, etc.), ML programming admits infinitely many acceptable or even optimal outputs (i.e., in predictive ML, any member of the whole equivalence class of models that exhibit optimal generalization performance), or the even larger set of models that exhibit near-optimal performance.

  3. 3.

    ML algorithms are inherently stochastic with respect to their inputs and may also involve stochastic operations on them. They have to accommodate infinitely many possible inputs and be robust to noisy inputs.

  4. 4.

    ML algorithms’ properties interact with the data design, so the quality of the output is not strictly a function of the input data but also of the alignment of the algorithm with the data generation and measurement design choices made by the analyst or user of the algorithm. ML algorithms, however, seldom have built-in representations of the data design properties and of how these affect their operation, which makes detecting related problems hard.
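
The sketch below illustrates point 1: a toy, deliberately trivial nearest-mean “learner” is a functional that maps a dataset to a decision model, and the returned model is itself a function that maps instances to decisions.

```python
# Toy illustration: a learner as a functional (dataset -> model), where the
# model is itself a function (instances -> decisions).
from typing import Callable, Tuple
import numpy as np

Dataset = Tuple[np.ndarray, np.ndarray]        # (X, y)
Model = Callable[[np.ndarray], np.ndarray]     # instances -> decisions

def nearest_mean_learner(data: Dataset) -> Model:
    X, y = data
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])

    def model(X_new: np.ndarray) -> np.ndarray:
        # Distance of each new instance to each class centroid.
        dists = np.linalg.norm(X_new[:, None, :] - centroids[None, :, :], axis=2)
        return classes[np.argmin(dists, axis=1)]

    return model

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)); y = (X[:, 0] > 0).astype(int)
model = nearest_mean_learner((X, y))           # the functional: dataset -> model
print(model(rng.normal(size=(5, 3))))          # the model: instances -> decisions
```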

The process of debugging ML modeling is currently decidedly more of an art than a science. However, this art is strongly informed by well-established scientific principles and properties from ML. The following are recommended approaches for ML debugging aimed at preventing and detecting model development errors. They should be treated as a starting point within a much larger and variable space of possibilities.

Best Practices 13.7. ML Debugging Strategies

  1. 1.

    Start with conventional implementation debugging of ML algorithms: e.g., trace algorithms step by step in simple but representative small-scale problems; isolate and unit-test subroutines for data intake, model fitting, and output. The same applies to conventional debugging of AI/ML model implementations.

  2. 2.

    Debug real data, e.g.:

    1. (a)

      Does the data conform to the expected format?

    2. (b)

      Does the data follow the distributional assumptions that underlie proper use of the chosen algorithms/models?

    3. (c)

      Does the data reflect the sampling or data generation protocol? For example, if the data are supposed to be i.i.d., are they? For data from randomized experiments, can we predict the exposure from the covariates? (If yes, the experiment was not properly randomized; a minimal check is sketched after this list.)

    4. (d)

      Are there outliers or other data abnormalities that violate the ML algorithm’s data requirements?

  3. 3.

    Debug simulations and resimulations:

    1. (a)

      When artificial or semi-artificial data are used to test algorithm performance and the algorithm implementation does not behave according to theoretical expectations, test whether the simulated data conform to the specification of the simulation.

    2. (b)

      If an algorithm or protocol is randomized, save the random seeds or instantiations of any suspect runs for debugging (because the bug may not reappear in subsequent runs).

  4. 4.

    Know well the behavior of algorithms so that strange behaviors (for better or worse) are immediately apparent in complex analyses. For example:

    1. (a)

      If an algorithm is deterministic but outputs different results on each run with the same data, this indicates a bug; conversely, a randomized algorithm that outputs exactly the same results on every run (without a fixed random seed) also indicates a bug.

    2. (b)

      If the algorithm is expected to have boundary behaviors (for example, to terminate upon meeting certain conditions) but does not, this indicates a bug.

    3. (c)

      If the algorithm is expected to converge monotonically toward a performance metric but converges non-monotonically, this also indicates a bug, and so on.

    4. (d)

      If an algorithm is expected to converge but does not, investigate whether the non-convergence is incidental or systematic.

    5. (e)

      Investigate the root causes of happy accidents and surprises (see next section).

  5. 5.

    Build and use a set of benchmark datasets where the behavior of algorithms is known and new algorithms or new implementations can be readily compared.

  6. 6.

    Compare the implementation, instantiation, and tuning of an ML algorithm or protocol on the same data against results published in reference-level prior literature.

  7. 7.

    Examine the interactions of algorithms with embedding protocols and systems. If the same algorithm implementation behaves differently inside different implementations of the same protocol, this indicates that a protocol implementation or its interface with the ML algorithm is buggy.
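
As an illustration of item 2(c) above, the following minimal sketch (in Python, assuming scikit-learn; the covariates and exposure are synthetic placeholders) checks whether exposure assignment in supposedly randomized data can be predicted from pre-treatment covariates; a cross-validated AUC materially above 0.5 suggests the randomization or the data pipeline is suspect.

```python
# Debugging check for item 2(c): in properly randomized data, pre-treatment
# covariates should not predict the exposure (cross-validated AUC ~ 0.5).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
covariates = rng.normal(size=(500, 10))              # placeholder pre-treatment covariates
treatment = rng.integers(0, 2, size=500)             # placeholder randomized exposure

auc = cross_val_score(RandomForestClassifier(random_state=0),
                      covariates, treatment, cv=5, scoring="roc_auc")
print(f"Cross-validated AUC for predicting exposure: {auc.mean():.3f}")
# AUC well above 0.5 (beyond sampling noise) indicates the exposure is predictable
# from covariates, i.e., the experiment was not properly randomized.
```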

Recognizing and Accommodating “Happy Accidents” & Surprises

In the practice of ML, we often encounter “happy accidents” and pleasant surprises. For example, when developing Markov Boundary algorithms (chapter “Principles of Rigorous Development and of Appraisal of ML and AI Methods and Systems”), the developers were surprised that, even though the GLL algorithms performed up to hundreds of thousands of conditional independence tests per dataset without conventional correction for multiple comparisons, the resulting Markov Boundary sets did not contain significant numbers of false positives. Initially the development team considered this a likely bug that should be corrected with false positive rate control measures; however, close inspection revealed that the variable elimination steps in the algorithms were eliminating the false positives. For example, when fed 1000 random variables with no signal-carrying variables, and with conditional independence tests (CITs) run at a nominal alpha of 5% (which would imply 50 false positives), the algorithm would output just 4 false positives [10, 11]. This gave valuable insights into the self-regularizing behavior of this class of algorithms and confirmed their robustness.

In another example, the same authors showed that so-called epistatic functions, that is, extremely non-linear and discontinuous functions in which only a specific subset of the inputs jointly reveals signal and all lower-order subsets are devoid of signal, can still be detected by linear learners, and classifier performance can grow arbitrarily close to perfect when the inputs are unbalanced (by data design) or correlated (naturally). In addition, when the density of the positive target is non-uniform in the space of inputs, arbitrarily strong signal can exist that is detectable by linear learners. For example, in the textbook XOR function, a prime example of this class, with T = XOR(A, B), neither A nor B has univariate signal for T, and they must be considered together for the signal to become fully discoverable. This is a huge problem in high-dimensional settings where the combinatorics quickly become intractable. This textbook version of the problem, however, is very unlikely to exist in practice because, for signal to disappear in lower-order effects, an unlikely arrangement of the data has to exist [11].
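
The following minimal simulation (assumptions: a binary XOR target, an input A made unbalanced by design, and a plain logistic regression) illustrates how such imbalance makes the epistatic signal visible to a linear learner.

```python
# Simulation: T = XOR(A, B) with an unbalanced A. With balanced, independent
# inputs XOR has no univariate signal, but the imbalance lets a linear learner
# recover strong signal (illustrative sketch using scikit-learn).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 20000
A = rng.binomial(1, 0.9, n)         # unbalanced input (by data design)
B = rng.binomial(1, 0.5, n)         # balanced input
T = A ^ B                           # epistatic (XOR) target

print("P(T=1 | B=0) =", T[B == 0].mean())   # ~0.9: univariate signal appears in B
print("P(T=1 | B=1) =", T[B == 1].mean())   # ~0.1
acc = cross_val_score(LogisticRegression(), np.column_stack([A, B]), T, cv=5).mean()
print(f"Linear learner accuracy: {acc:.3f} (majority-class baseline ~0.5)")
```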

The benchmark pathway reverse-engineering study of [12] showed that basic correlation networks can perform well as long as the loss functions are tailored to their inductive bias. Specifically, even though these techniques have no causal discovery guarantees and can be shown to output massive numbers of false positives in many situations, in very-low-sample situations they may perform better in terms of sensitivity (trading off specificity) than causal algorithms, simply because there is not enough sample to generate reliable results and proper causal algorithms are designed to avoid producing false positives.

In the domain of cancer genomics, random selection of biomarkers tends to yield informative markers and strong signatures [13]. These are truly generalizable and robust, and their existence is due to the wide propagation of the cancer signal throughout the data-generating transcriptomic network.

In a final example, while causal ML algorithms are designed to operate on faithful distributions (which do not contain information equivalences), the designers of the causal feature selection challenge [14] were surprised to discover that their resimulated data, built using such algorithms, exhibited information equivalencies (mostly as a result of statistical indistinguishability due to finite sample size). This increases the veracity of the resimulated data.

These examples illustrate a common phenomenon in ML: empirical results often turn out better than expected because of mitigating factors. Modelers should investigate any unexpected behavior thoroughly to find errors, but should also keep an open mind about the possible validity of the results due to such error-mitigating factors.

A Toolkit for Ensuring Safe Model Application

Best Practices 13.8. for Safe Model Deployment

  1. 1.

    Outlier detection. When encountering an application instance, determine whether it is an outlier with respect to the distribution on which the model was validated and flag it as such [15, 16]. Refrain from making a prediction or decision for outliers (a minimal sketch of outlier gating and shift detection appears after this list).

  2. 2.

    Region of reliable operation. When encountering an application instance, determine whether it falls inside or outside the model’s region of reliable operation (section “Essential Model Diagnostics and Model Characterization”). Refrain from making a prediction or decision for cases outside the reliable region.

  3. 3.

    Detect and address distribution shifts. As application instances accumulate, determine whether their distribution differs from the one used to validate the model (chapters “Data Design” and “Data Preparation, Transforms, Quality, and Management”).

    1. (a)

      If yes, then alert the deployment and development teams about the possible need to rebuild the model because of distribution shifts.

    2. (b)

      When distribution shifts are observed, determine if they affect the model performance. If they do not, continue monitoring the shifts but do not withhold the model’s decisions.

    3. (c)

      Characterize distribution shifts by seasonal trends, individual variables affected, emerging population mixture changes, etc., as appropriate for the application domain.

  4. 4.

    When making a prediction or decision, also output the credible interval for that input region (chapter “Overfitting, Underfitting and General Model Overconfidence and Under-Performance Pitfalls and Best Practices in Machine Learning and AI”) as well as other loss function estimates applicable to the region.

  5. 5.

    When explaining a model or a specific model output, also report the credible interval and the stability of its structure and parameters. Flag unstable and uncertain model characteristics.

  6. 6.

    Apply continuous statistical process quality control metrics as predictions and decisions are prospectively validated [16]. If predictions and actions are statistically significantly different from what is expected, then alert the deployment and development teams about the possible need to rectify or rebuild the model.

  7. 7.

    Make the above functions parametric so that model operators can adapt the model deployment better to local application conditions (e.g., health care provider and patient preferences, organizational policies, evolving regulations etc.).

  8. 8.

    When more than one model is available, apply the model that has the best performance and safety profile for each application case [17].

  9. 9.

    When transferring a model developed on population P1 to population P2, analyze its performance and safety using existing historical data before deployment. If the operating characteristics are not satisfactory, then consider rebuilding the model from P2 data.

  10. 10.

    If some inputs are expected to be missing and decisions with partial input are desired, consider using flexible-input decision models or dynamic imputation schemes at the design, fitting, and validation stages (chapter “Foundations and Properties of AI/ML Systems”). Do not apply models with partial or imputed inputs unless this is part of the models’ design and validation.

  11. 11.

    If the quality of model data inputs changes between the development and validation data and the application phase, apply appropriate detection mechanisms, flag such inputs, and refrain from making predictions or other decisions (chapter “Data Preparation, Transforms, Quality, and Management”).

  12. 12.

    Develop and deploy ancillary alerting DSS (geared to model users and developers) designed to flag deviations from the conditions that guarantee safe and effective model performance, even when these deviations are not directly detectable in the data. For example, for a COVID management model, deploy alerts related to new vaccines, population immunity, new variants, and other factors that may affect the model’s validity but may not be detectable from the patient-level data before they cause serious degradation in model performance.
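
As a minimal illustration of items 1 and 3 above (outlier gating and distribution shift detection), the sketch below uses an isolation forest fitted on the validation-era data and per-feature two-sample Kolmogorov-Smirnov tests; the reference data, thresholds, and alpha level are illustrative placeholders.

```python
# Sketch of deployment-time safeguards: outlier gating relative to the
# validation distribution and per-feature distribution shift monitoring.
import numpy as np
from scipy.stats import ks_2samp
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_reference = rng.normal(size=(1000, 5))             # data the model was validated on
X_incoming = rng.normal(loc=0.3, size=(200, 5))      # accumulating application instances

# Item 1: outlier gating -- refrain from predicting on instances flagged as outliers.
detector = IsolationForest(random_state=0).fit(X_reference)
is_outlier = detector.predict(X_incoming) == -1      # -1 marks outliers
print(f"Instances withheld as outliers: {is_outlier.sum()} / {len(X_incoming)}")

# Item 3: distribution shift -- per-feature two-sample Kolmogorov-Smirnov tests.
alpha = 0.01
shifted = [j for j in range(X_reference.shape[1])
           if ks_2samp(X_reference[:, j], X_incoming[:, j]).pvalue < alpha]
if shifted:
    print("Possible shift in features:", shifted, "- alert deployment/development teams.")
```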

Conclusions

Taken together, the above practices are designed to establish several synergistic safety layers that protect models from falling off their “knowledge cliff”. The listed safeguards comprise the functional equivalent of AI/ML “self-knowledge” of a model’s limitations, which helps it avoid making hazardous decisions.

Key Concepts Discussed in This Chapter

Calibration

Recalibration and conversion of scores to probabilities

Reliable and unreliable model decision regions

Credible Intervals

Debugging ML: how it differs from conventional code debugging; general strategies for ML debugging.

Model failure mitigation factors

Model deployment safeguards toolkit

Pitfalls Related to the Present Chapter

Pitfall 13.1. Models that are uncalibrated or with unknown calibration

Pitfall 13.2. Models with unknown correspondence of output to probabilities.

Pitfall 13.3. Not checking whether the model is statistically significantly better than the null model.

Pitfall 13.4. Unknown model reliable and unreliable decision regions.

Pitfall 13.5. Unknown stability and its consequences for model safety and performance.

Pitfall 13.6. Being oblivious to a model’s equivalence class.

Pitfall 13.7. ML with bugs.

Pitfall 13.8. Falling over a model’s “knowledge cliff” (i.e., succumbing to model deployment safety traps):

  1. 1.

    Outliers.

  2. 2.

    Falling outside the model’s region of reliable operation.

  3. 3.

    Distribution shifts.

  4. 4.

    Model decisions that carry no information about the expected errors specific to that case.

  5. 5.

    Failing to flag, report and explain unstable and uncertain model characteristics.

  6. 6.

    Failing to detect and alert deployment and development teams about the possible need to rectify or rebuild the model.

  7. 7.

    Rigid specifications of safety functions, lack of adaptability to local application conditions.

  8. 8.

    Failing to exploit a plurality of tailored models to address the application cases.

  9. 9.

    Failing to safely transfer models developed from population P1 to population P2.

  10. 10.

    Failing to address missing inputs.

  11. 11.

    Failing to detect and manage drops in the quality of model data inputs.

  12. 12.

    Invisible deviations from the conditions that guarantee safe and effective model performance.

Summary of Best Practices Discussed in This Chapter

Best Practice 13.1. Measure calibration and recalibrate as needed

Best Practice 13.2. Convert scores to probabilities.

Best Practice 13.3. Test models for difference from null model.

Best Practice 13.4. Identify models’ reliable and unreliable decision regions.

Best Practice 13.5. Measure stability and flag unstable models.

Best Practice 13.6. Extract and report model equivalence classes

Best Practice 13.7. Apply strategies for ML debugging:

  1. 1.

    Start from conventional debugging of ML algorithm implementation.

  2. 2.

    Debug with real data.

  3. 3.

    Debug simulations and resimulations.

  4. 4.

    Know well the behavior of algorithms so that strange behaviors (for better or worse) are immediately apparent in complex analyses. Investigate unusual behaviors (with respect to each algorithm’s expected behavior).

  5. 5.

    Build and use a set of benchmark datasets.

  6. 6.

    Compare the implementation or instantiation and tuning of a ML algorithm to the literature.

  7. 7.

    Examine the interactions of algorithms with embedding protocols and systems.

Best Practice 13.8. Use safe model deployment toolkit:

  1. 1.

    Detect and manage outliers.

  2. 2.

    Detect and manage instances falling outside the model’s region of reliable operation.

  3. 3.

    Detect and manage distribution shifts.

  4. 4.

    Report the credible interval and other loss function estimates applicable to the input region for every model decision.

  5. 5.

    Flag, report and explain unstable and uncertain model characteristics.

  6. 6.

    Apply continuous QC metrics as predictions and decisions are prospectively validated. Alert deployment and development teams about the possible need to rectify or rebuild the model.

  7. 7.

    Make the above safety functions parametric so that model operators can adapt the model deployment better to local application conditions.

  8. 8.

    When more than one model is available, choose the model that has the best performance and safety profile for each application case.

  9. 9.

    Safely transfer models developed from population P1 to population P2, or rebuild models from P2 data.

  10. 10.

    Address missing inputs at the design, fitting and validation stages. Do not apply models with partial or imputed inputs unless this is part of the models’ design and validation.

  11. 11.

    Manage drops in the quality of model data inputs from the development and validation data to the application phase.

  12. 12.

    Develop and deploy ancillary alerting DSS (geared to the model users and developers) that are designed to flag deviations from the conditions that guarantee safe and effective model performance.

Classroom Assignments & Discussion Topics in This Chapter

  1. 1.

    Which factors can you list that cause distribution shifts? Which ones may jeopardize decision models?

  2. 2.

    Give: (a) An example where highly unstable biomarker selection does not degrade model predictivity. (b) An example where highly stable biomarkers are not useful.

    HINT for (a): consider that markers BM1 and BM2 have exactly the same information for the response and the biomarker discovery procedure chooses them at random. Now increase the number of markers with such equivalent information.

  3. 3.

    Show how model instability may relate to sampling variation and to increased errors via the BVDE.

  4. 4.

    When is the conversion of scores to probabilities desirable? When can it be superfluous? When can it be detrimental?

  5. 5.

    (a) Describe how we can create models with high performance and accuracy by carving out input space regions of low performance/accuracy.

    (b) What are necessary preconditions for this to be successful?

    (c) What is the fundamental tradeoff involved?

  6. 6.

    If missing inputs are anticipated at model deployment, how would you choose among alternative options based on the tractability of running the models? Compare, for example, BNs and KNN in this context.

  7. 7.

    [Advanced] Discuss the application of transductive learning methods to support successful model transfer. What may be the downsides of this approach?

  8. 8.

    Consider a situation where the data describing a patient population with disease D in region/health system H1 are radically different than those of region/health system H2.

    1. (a)

      What may cause these differences?

    2. (b)

      How would you go about building effective models for H1 and H2?

  9. 9.

    [Advanced] Does every degradation of model input data quality affect the quality of the model’s outputs? How would you systematically incorporate this consideration into the design of robust models?