
Significant Advances in Health AI/ML are the Result of Non-monotonic Progress with Many Failures Followed by Successes. Learning from Case Studies

The AI/ML toolkit available to health data scientists today is nothing short of extraordinary. The algorithms, systems and theory of today have capabilities that a few decades ago seemed beyond the realm of the possible. Examples include: the ability to operate predictively with minuscule sample sizes and dimensionalities that exceed the range of 10^6 variables; the routine capability to classify and extract meaning from text and other unstructured data; the ability to discover causation reliably without experiments; the ability to guide experiments so that the number and cost of experiments is minimized; the ability to explore quadrillions of non-linear variable interactions in seconds on simple personal computers by using kernel methods; automatic protection against overfitting by deploying regularized/shrinkage methods; powerful auto-modeler systems with performance matching and exceeding those of experts; and powerful image recognition. We also have autonomous vehicles, industrial robotics, embedded decision support and control systems, natural language understanding and translation, cyber security enabling systems, and sophisticated knowledge representation and data models that support complex data harmonization and reuse at scale and speed. Many more success stories and capabilities are discussed throughout this volume (see indicatively chapter “Artificial Intelligence (AI) and Machine Learning (ML) for Healthcare and Health Sciences: The Need for Best Practices Enabling Trust in AI and ML”) and in the cited literature.

These achievements of AI/ML did not happen smoothly, however. In many cases they involved setbacks, pursuing dead ends, and learning from painful mistakes. To this day, there are many systematic suboptimal AI/ML practices that incur substantial costs for health care and the health sciences.

The purpose of this chapter is to discuss a sample of prominent and paradigmatic case studies where failures provided the incentive and inspiration for the field to advance and for new and improved technologies to emerge. These case studies also showcase the value of the best practices (BPs) advocated in the present volume. Finally, they point to areas where major future improvements are likely to occur if a more rigorous and BP-driven health AI/ML is pursued.

The Gartner Hype Cycle

The concept of a “Hype Cycle” is commonly attributed to the Gartner management consulting company [1]. Figure 1 describes a common historical pattern in which the emergence of a new technology is followed by a surge in expectations that strongly overestimates its true capabilities and potential (exuberant optimism, aka the “Peak of inflated expectations”), which in turn is followed by a dramatic drop in expectations that strongly underestimates the true capabilities and potential of the technology (the “Trough of disillusionment”).

Fig. 1
The hype technology cycle: expectations over time rise steeply to a peak of exuberant optimism, descend into disillusionment, and then climb through enlightenment to a plateau of real value

These expectations may be those of scientists, technologists, industry, investors, customers, funding agencies, the general public, etc., across the health care and health science ecosystems. As both technology and sentiment mature, a period of increasing understanding of the technology and its evolving capabilities follows (the “Slope of Enlightenment”), which then converges to an accurate and lasting appraisal of the technology’s merits (the “Plateau of real value”).

“AI Winters”

The term “winter”, with regard to a field of science or technology, refers to a prolonged period of reduced public expectation, research support and industrial market growth. Essentially it is a prolonged Trough of disillusionment, to use Hype Cycle terminology. The opposite is referred to as a “spring”, i.e., a prolonged period during which expectations, support and financial growth are high [2,3,4].

The field of AI in the middle of the 20th century generated vast enthusiasm and funding support. The early pioneers made strides in establishing many foundational results and methods in the field. The general sentiment was that in a short period of time “hard AI” would be feasible, i.e., intelligent systems that would possess general intelligence capabilities on par with or better than those of humans. However, a number of setbacks drained this enthusiasm and drastically undercut AI’s growth.

The two main AI winters in the USA occurred between 1974–80 and 1987–93 with parallel and overlapping AI winters internationally. These reflected real and perceived failures and disappointment in: scaling up from toy-sized problems to real-life complexity, spoken language understanding, artificial neural network limitations, the collapse of the LISP machine industry, limitations of logic-based systems for broad inference, the limited success of expert systems, and the failure of Japan’s Fifth Generation project to meet its goals.

The main causes of the above failures (or perceived failures) can be traced to excessive hype (expectation exuberance) relative to what was known, established, and proven in terms of theoretical capabilities and empirical performance, effectiveness and other properties that the present volume has described at length. The field, in hindsight, mismanaged expectations and failed to impose the necessary self-discipline for proceeding from the space of unknown properties and capabilities to the space of guaranteed performance in a systematic and science-driven manner. Figure 2 depicts the interplay among these factors. Notice the immense distance between actual capabilities and promises (D2, corresponding to actual hype) and the unknown distances between unknown properties and promises made (D1, D3, depicting potential hype).

Fig. 2
Factors summarizing causes of historical AI winters: poorly understood properties of AI methods and science at the time (block A); unknown properties of future AI methods and science (block B); exorbitant and unbounded promises for capabilities in the near future (block C); and the actual (limited) capabilities of AI technology at the time (block D). Distance D1 connects A and B, D2 connects C and D, and D3 connects B and C

The following components of the AI winters were or continue to be particularly relevant to the aims of health AI/ML:

Perceptron Limitations

A notable setback was the discovery that perceptrons could not learn non-linear data generating functions [5]. A perceptron is essentially a single node in a neural network (chapter “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare & Health Science”), which linearly combines its inputs and passes the result through a non-linear output function. The above discovery pertains to the fact that a single layer of perceptrons (i.e., a single-layer neural network) cannot learn non-linear functions. Because the space of non-linear functions is vast and many real-life problems lie in it, this was a major blow to the credibility of the nascent AI technology and industry sector.
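
As a concrete illustration, consider the XOR function, the textbook example of a non-linearly-separable problem. The following minimal sketch (our own illustrative example; the learners and parameter values are arbitrary choices, not taken from the cited literature) shows a single-layer perceptron failing to fit XOR, whereas a non-linear learner of the kind discussed later in this chapter fits it exactly.

```python
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.svm import SVC

# XOR: the canonical function that no linear threshold unit can represent
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# A single-layer perceptron is a linear separator: its training accuracy is capped at 3/4.
perceptron = Perceptron(max_iter=1000, random_state=0).fit(X, y)
print("perceptron training accuracy:", perceptron.score(X, y))   # <= 0.75

# A non-linear learner (here an RBF-kernel SVM, foreshadowing the kernel methods
# discussed below) represents XOR without difficulty.
svm = SVC(kernel="rbf", gamma=1.0, C=10.0).fit(X, y)
print("RBF-SVM training accuracy  :", svm.score(X, y))            # 1.0
```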

Eventually these fears proved to be unfounded. With the discovery of the Back Propagation algorithm by Werbos [6] and the related work of Rumelhart, McClelland, and the PDP (Parallel Distributed Processing) group (comprising, among others, such luminaries as Nobel Prize winner Francis Crick and future Turing Award winner Geoffrey Hinton) and others, a new and improved science and technology of artificial neural networks (ANNs), using multi-layered networks trained with the Back Propagation algorithm [7], gave a theoretically powerful and practical way to model non-linear domains.

Moreover, from a practical perspective, as discussed in chapter “Foundations and Properties of AI/ML Systems”, linear models can still be very effective in low-sample situations: due to the Bias-Variance Decomposition of the Error (BVDE), high-bias (e.g., linear) learners will often achieve better generalization error even if the large-sample data generating function is not linear. In addition, whereas a non-linear discriminative function is not perfectly learnable with a linear model, if some common mitigating factors are present (e.g., unbalanced priors of the inputs or correlated inputs [8]) then a linear discriminant can discover the existence of signal and can achieve accuracy to within a vanishingly small factor. It is not an accident that the most used (and arguably most successful) modeling methods in the health sciences are linear statistical models, since on average (as opposed to worst case) in small-sample designs such models are very useful.
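
The BVDE argument can be made concrete with a small simulation (a minimal sketch; the data generating process, sample sizes and choice of learners below are our own illustrative assumptions, not taken from the cited studies): with few, noisy training cases generated from a mildly non-linear function, a high-bias linear classifier often generalizes better than a low-bias, high-variance learner.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

def simulate(n):
    """Mildly non-linear, noisy data generating function (assumed for illustration)."""
    X = rng.normal(size=(n, 2))
    logits = X[:, 0] + X[:, 1] + 0.5 * X[:, 0] * X[:, 1]   # linear main effects + interaction
    y = (logits + rng.normal(scale=1.0, size=n) > 0).astype(int)
    return X, y

linear_acc, knn_acc = [], []
for _ in range(100):
    X_tr, y_tr = simulate(30)       # small training sample
    X_te, y_te = simulate(5000)     # large test sample approximates generalization accuracy
    linear_acc.append(LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te))
    knn_acc.append(KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr).score(X_te, y_te))

# The misspecified but high-bias linear model typically wins at this sample size.
print("logistic regression (high bias)   :", round(np.mean(linear_acc), 3))
print("1-nearest neighbor (high variance):", round(np.mean(knn_acc), 3))
```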

Back Propagation-Based ANNs and the Vanishing (Or Exploding) Gradient Problem

Back Propagation is a very effective training algorithm for ANNs with a small number of hidden (intermediate) layers. However, when it comes to building ANNs with many hidden layers (which are very powerful in some domains) it suffers from a critical weakness: the vanishing or exploding gradient problem. In Back Propagation, the model is trained by making adjustments (based on derivatives of the empirical error of the model output) to the weights of ANN units in successive layers. The vanishing/exploding gradient problem refers to the situation where these gradient signals shrink toward zero, or grow without bound, at a rate that is exponential in the number of layers, so that the weights of layers far from the output are either barely updated or are updated with essentially useless noise.

A New Improved Technology of ANNs Followed

Deep ANNs, aka “Deep Learning (DL)” (see chapter “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare & Health Science”), overcame this problem by a simple (but very effective) alteration in the mathematical form of the ANN units. Some of the key innovations that differentiate Deep Learning ANNs from standard multi-layered ANNs include: ReLUs (Rectified Linear Units) as transfer functions (they allow for non-linear learning and do not suffer from vanishing or exploding derivatives); convolution layers and filters; pooling/downsampling (to reduce dimensionality and overfitting); and multiple iterations of Convolution-Pooling-ReLUs (for hierarchical feature extraction) [9, 10].
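
A stylized numerical sketch of the mechanism (the scalar chain, fixed weight values and layer count below are assumptions chosen purely for illustration): along a path through many layers, the backpropagated gradient is roughly a product of per-layer derivative factors, which collapses toward zero for sigmoid units but not for ReLU units on active paths.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1.0 - s)                          # never exceeds 0.25

rng = np.random.default_rng(0)
n_layers = 30
weights = np.ones(n_layers)                       # weights fixed at 1.0 to isolate the activation effect
pre_acts = np.abs(rng.normal(size=n_layers))      # an "active path": positive pre-activations

# Backpropagated gradient magnitude along the path ~ product of |w_l * f'(z_l)| over layers.
grad_sigmoid = np.prod(np.abs(weights) * sigmoid_deriv(pre_acts))
grad_relu = np.prod(np.abs(weights) * 1.0)        # ReLU derivative is exactly 1 when z_l > 0

print(f"gradient factor through {n_layers} sigmoid layers: {grad_sigmoid:.3e}")  # vanishingly small
print(f"gradient factor through {n_layers} ReLU layers   : {grad_relu:.3e}")     # 1.0: preserved
```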

Roughly at the same time, another technological and scientific breakthrough occurred when a new class of ML algorithms, Support Vector Machines (SVMs), emerged based on the work of Vapnik and his collaborators Cortes, Boser and Guyon [11,12,13]. This class (see chapter “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare & Health Science”) was based on a uniquely powerful combination of regularization; a quadratic optimization formulation of the learning problem (which guarantees an optimal solution of the training problem, as opposed to ANNs and many other ML algorithms that conduct incomplete and error-prone searches of the parameter space); handling of non-linearities with immense efficiency using kernel functions (discovered in mathematics some 40 years earlier, but not previously used in applied ML); and model selection via structural risk minimization.
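
The efficiency of the kernel approach can be made concrete with a short sketch (the dimensions and polynomial degree below are arbitrary illustrative choices of our own): a polynomial kernel evaluates an inner product in a feature space containing every interaction term up to a given degree, without ever constructing that space explicitly.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)
p, d = 20_000, 4                      # 20,000 measured variables, all interactions up to degree 4
x, z = rng.normal(size=p), rng.normal(size=p)

# Polynomial kernel: k(x, z) = (x . z + 1)^d equals the inner product of explicit
# feature vectors containing (suitably weighted) monomials of the inputs up to degree d.
k = (x @ z + 1.0) ** d                # computed in O(p) time on an ordinary computer

implicit_dim = comb(p + d, d)         # dimensionality of the implicit feature space
print(f"kernel value                    : {k:.3e}")
print(f"implicit feature space dimension: {implicit_dim:,}")   # ~6.7 quadrillion interaction features
```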

Another very powerful classifier that is widely adopted in the health sciences is Random Forests, which combines ideas from decision trees, bagging, out-of-sample error estimation, random feature selection and capacity control [14].

Together these three predictive modeling technologies are widely adopted and highly complementary. DL has dominant empirical performance in image recognition and generative tasks due to its powerful automatic construction of higher-order features. SVMs have dominant performance and efficiency (sample and time complexity) in a large number of predictive health tasks (e.g., clinical and omics data modeling) [13, 15,16,17]. Random Forests and Boosting are also very powerful across a range of healthcare and health science domains. Several more powerful ML algorithms have been developed and are in wide use (see chapter “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare & Health Science”).

Current Limitations (and Opportunities for Improvement) of DL

As powerful and useful as DL methods are, they still have significant limitations that need to be recognized and overcome, not only to achieve better results but also to manage expectations and avoid a possible new AI winter. Such limitations (as explained and referenced in chapter “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare & Health Science” as well as several other chapters in the present book) include: lack of effective formal causal properties, in terms of adequate expressiveness, reliable discovery and consistent inference; lack of a theory ensuring overfitting avoidance (note that several empirical overfitting avoidance methods have been developed and are used); large sample requirements and questionable performance in low-sample situations; a fundamentally heuristic approach to model development and model selection; black-box nature with limited explainability; limited capability for shift invariance in image recognition; limited accuracy advantages over conventional “vanilla” statistical comparators (such as logistic regression; see the extensive empirical literature later in the present chapter) in several applications, typically when the sample size is insufficient or the problem is low dimensional; and incomplete and sub-optimal search of the parameter space. Published claims about high performance of DL in health applications have been linked by several meta-analyses to biased designs, as demonstrated by the extensive literature discussed later in this chapter. We will also address limitations of DL in the commercial space (game playing, self-driving cars) that have led to a more measured appreciation of the technology’s strengths and limitations. Such an appreciation, when combined with dedicated research based on property-ensuring principles (e.g., BPs in chapter “Principles of Rigorous Development and of Appraisal of ML and AI Methods and Systems”) and safe deployment (e.g., BPs in chapter “Characterizing, Diagnosing and Managing the Risk of Error of ML & AI Models in Clinical and Organizational Application”), has the potential to reduce the limitations and reduce the possibility of a new AI winter [4].

What’s in a Link: The Importance of Semantics in AI Models and Methods. From Problems with Early Semantic Networks to Those of Modern Network Science and Pathway Discovery

Chapter “Foundations and Properties of AI/ML Systems” discussed semantic networks (SNs) as simplified and graphically interpretable offshoots of First Order Logic. As semantic networks started proliferating, it became obvious that they did not share the same technical meaning (semantics) across systems and that, because of the lack of precise semantics, their outputs and behavior were often unknown or unsound. These grave limitations were elucidated in a landmark paper by Woods, which attracted the attention of the symbolic AI community and sparked needed improvements in semantic network technology. Modern descendants of the earlier SNs, e.g., the Semantic Web and Knowledge Representation and Reasoning (KRR) methods and tools, owe much of their power and integrity to the lessons learned from these earlier SNs [2, 18,19,20].

However, the problem of unclear, undefined, or inconsistent semantics in graphical KRR has not been eliminated. Today it manifests in other classes of AI/ML, most notably in biological pathway discovery algorithms and network science models.

Pathway reverse engineering methods in biology and other basic health sciences seek to discover, represent, and reason with the ways of interaction (control, signaling, regulation, etc.) among biological (metabolic, genetic, proteomic, hormonal, immune, etc.) molecules [21]. In both pathway discovery and network science it is imperative to clearly and unambiguously specify the precise meaning of the edges in the network model and, equally importantly, to specify how the algorithms used to develop such models guarantee that all and only those relationships that obey these meanings are output by the algorithms. Among the many methods in use in numerous papers, however, (a) the semantics are undeclared, ambiguous, or lacking important properties, and (b) while the algorithms are fully specified, what is unspecified is how the operation of the algorithms ensures the intended interpretation of the output.

For example, in biological pathway reverse engineering most formalisms and algorithms in use do not enforce causal semantics, the Causal Markov Condition, the Causal Faithfulness Condition, or other conditions that are required to ensure causal interpretation (see chapter “Foundations of Causal ML”). Examples include the heuristic (and ultimately false) use of edges denoting strong univariate correlation as indicators of direct causation, and the heuristic/false use of edges denoting non-zero weights in regularized regressions among the studied entities.

The United States National Research Council defines network science as "the study of network representations of physical, biological, and social phenomena leading to predictive models of these phenomena" [22]. But how can such network representations be used for predictive modeling? More specifically, how can a network representation built using edges denoting similarity, physical connectivity, cross-entropy, or geographical distance (to mention a few of the commonly used network science approaches) be used for prediction? What confers predictive modeling capabilities on such networks? And why are such capabilities (if they exist) better than, say, the minimal-threshold comparator of statistical baseline methods such as logistic regression, which have (among other desirable properties) provable large-sample error optimality for many identifiable distributions?

The result of these limitations is that numerous models in existence are routinely over-interpreted, are in some cases misleading, and are likely to generate many false positive and false negative results in translational and clinical applications.

Rule-Based AI, Expert Systems, Heuristic Systems, Limitations of Bayesian Learners and Disconnect Between Systems and Real-World Problems

Rule Based AI and Formal Expert Systems

One dimension of AI’s winters was the failure of knowledge-driven rule-based systems, as well as problem-focused Expert Systems, to provide a viable means for achieving hard AI (machines with general intelligence indistinguishable from or superior to that of humans). Japan, for example, launched the (in hindsight very ambitious) Fifth Generation Computer Systems (FGCS) project, which aimed to achieve a large portion of these goals and altogether disrupt the computing technology landscape in a remarkable time span of 10 years. Notably, the project used logic programming as its main programming language.

The reasons for these disappointing outcomes can readily be appreciated now. Attacking the AI problem at its broadest scope, rather than with a smaller-scoped divide-and-conquer approach [23], was in all likelihood a strategic error given the difficulty and the heterogeneity of problem requirements across the total problem spaces of interest. Rule-based systems and other KRR technology of the time lacked the flexibility to deal with the very fluid reasoning that humans exhibit (e.g., reasoning with uncertainty, perception, image recognition, and learning, often performed intuitively, effortlessly, and reflexively) [2]. Deriving problem-solving knowledge from experts is limited by the fact that experts typically cannot explain their problem-solving capabilities [24, 25], or, when they can, their explanations cannot be faithfully captured in rules or other existing KRR formats. One can also observe cycles of huge emotional and other investment in one tool, methodology, or approach that created “divides” across “camps”, for example the non-monotonic logic vs. probabilistic divide, or the connectionist vs. symbolic divide [26]. Finally, any effort to link the R&D and success of AI/ML to creating simulacra of human intelligence may be destined to be confined to what humans are already good at, without extending these capabilities to discovery in new problems where humans do not perform well (or could never perform well because of intrinsic limitations of their cognitive apparatus).

Heuristic Systems

The problems with the formal systems of the 80s and 90s led many to pursue ad hoc, aka heuristic, systems and methods [27, 28]. These have had a prominent presence in early medical informatics, and they re-surfaced, for example, in prominent commercial products in the 2000s, as well as in approaches that dominated earlier stages of high-throughput assay-driven biology and translational science. A properties-based overview and structural analysis of the heuristic-to-formal system spectrum is given in chapter “Foundations and Properties of AI/ML Systems”. Here we will reiterate the key concepts from a crises-leading-to-improvements perspective.

In striving to attain the lofty AI goals of the field’s early history, attempts were made to bypass the difficulty and cost of establishing formal systems. Some researchers, especially when faced with complex domains, felt that the formal systems of the time (see chapter “Foundations and Properties of AI/ML Systems” for their properties and characteristics) were too restrictive for designing successful AI solutions. Instead, they hoped that systems attempting to encode problem-solving strategies directly, without being restricted by the rules and structure of a formal foundation, would provide an approach that would be easier to engineer and would yield good results, at least “good enough, often enough”. For example, the DXplain system, which was designed to perform large-scope diagnosis (adopting the knowledge representation and reasoning algorithm of its heuristic precursor, INTERNIST-I), was heuristic because it lacked a formal AI foundation and formal properties.

In a vigorous debate that lasted for many years, proponents of heuristic systems argued that to the extent that they worked well empirically, they should be perfectly acceptable, especially if more formally-constructed systems did not match the empirical performance of heuristic systems and if constructing formal systems or establishing their properties was exceedingly hard. Proponents of formal systems counter-argued that this ad-hoc approach to AI was detrimental since one should never feel safe when applying such systems, especially in high-stakes domains.

There is no arguing about the fact that formal foundations provide a roadmap for both problem solving and incremental improvements to existing systems. They make understanding the theoretical and empirical properties of AI/ML models and systems easier. Such understanding does not always guarantee optimal systems, but it provides a roadmap that enables safe navigation of the complex technological terrain of applied AI/ML. Heuristic development, on the other hand, is based upon the availability of the rather nebulous and uncontrollable ingredients of human inspiration and intuition. It is also conducive to the adoption of technology that is not safe, precisely because its functions, and hence its risks, are not understood well enough. This represented a “scruffy” approach [2] to an endeavor that, as is now evident with the benefit of the progress achieved over the years, needed science-driven solutions.

From a modern scientific perspective, the historical emphasis on heuristic systems seems much less meaningful today than in the earlier days of AI, because of the scientific advances in the field that led to the huge success and dominance of formal methods [2, 29]. It is now evident that heuristic systems are pre-scientific or early-scientific technological artifacts, in the sense that a true scientific understanding of their behavior does not exist (yet) and that, with sufficient study, a comprehensive understanding of a heuristic system of today can be obtained in the future. In other words, the heuristic system of today (or the past) will be the formal system of tomorrow [28]. Thus heuristic vs. formal, if interpreted in terms that imply a special, distinctive, intrinsic nature, is a false dichotomy using misleading language.

Today’s science and technology of AI/ML supports instead the distinction between AI/ML systems with well-understood properties (theoretical as well as empirical performance) and systems that lack such properties. There is an evolutionary path from heuristic systems (aka pre-scientific, informal systems) to scientifically understood and validated systems.

Chapter “Foundations and Properties of AI/ML Systems” further elaborates the landscape and epistemological journey from heuristic to formal: pre-scientific/heuristic, to intermediate level systems, to fully-mature fully-reliable and science-backed systems. Chapter “Principles of Rigorous Development and of Appraisal of ML and AI Methods and Systems” presents a concise summary of the main properties of all main AI/ML methods showing that due to the immense progress of the past few decades, we now have an extensive AI/ML method toolkit with well-understood behaviors.

The modern significance of this topic is that, regrettably, many new AI/ML methods and systems (or old methods applied to completely new areas), both in academia and industry, lack established properties for performance and safety. It is not uncommon for such systems to be rushed into real-life application, and it is an ongoing battle to establish such properties before they can gain the trust and adoption their developers hope for, especially in high-stakes domains such as medicine and the health sciences. Of particular relevance to today’s AI/ML are: (a) systems or methods that are well understood/proven in one problem domain but are thrust into a different domain where they do not have well-established scientific foundations (e.g., IBM Watson Health, Large Language Models (LLMs), Deep Learning clinical applications justified by success in non-medical domains, or Shapley values for explaining ML models; see chapter “The Development Process and Lifecycle of Clinical Grade and Other Safety and Performance-Sensitive AI/ML Models”); and (b) numerous bespoke bioinformatics “pipelines” constructed for the needs of a particular study but lacking well-defined generalizable properties.

Disconnect Between Early AI Systems and Real-World Needs and Workflows

Many of the early efforts in health AI/ML were driven by intellectual curiosity and less by the actual problems that patients or healthcare systems were facing. Consequently, the resulting systems were disconnected from (a) patient care needs, (b) patient care workflows, and (c) the needs of the system of science. A classic example is that of early diagnostic systems that attempted to solve a problem that was not pressing, or that were evaluated in populations different from those in which they would be used. To use one out of many similar examples of that era, INTERNIST-I was designed for accurate diagnosis of patients across the whole of internal medicine, whereas most patients in an internal medicine clinic are not admitted for diagnosis; moreover, diagnosis is challenging for human internists only in a small minority of diseases and in few patient cases within those. Additionally, data entry into the system was disruptive of clinical workflows at the time [30] because of the lack of EHRs and CPOE (computerized provider order entry) systems [31].

In addition, the collection of data inputs in the healthcare domain is in most cases sequential over time and dynamic (i.e., conditional on prior findings) [32]. The selection of which tests to run at each time point is a major decision that reflects a partial differential diagnosis and is not within the decision sphere of AI systems confined to single-point-in-time decisions. This biases conventional performance evaluations based on complete medical records upward to a very significant degree. Finally, INTERNIST-I in particular was evaluated on cases so hard (NEJM challenge cases) that in many instances only patho-anatomic examination (usually of biopsies obtained after patient surgery or death) was able to resolve the underlying disease (i.e., the diagnosis in such cases likely cannot be reliably made with the data available to the AI system or the physicians) [33].

Together, these factors rendered such systems very valuable from a technology exploration perspective but of limited use in practical terms, and this was reflected in a landmark paper by Hunt, Haynes et al., which established the very limited impact of the AI technology of the time on patient outcomes [34]. Most of the above factors were also recognized in an influential paper by Miller and colleagues that declared the death of the “Greek Oracle model” for AI decision support. These authors advocated instead for a softer, knowledge-based “catalyst” model, assistive in nature, that would work in parallel with physicians; accordingly, they modified the INTERNIST-I system into its successor, QMR, formulated as an electronic textbook of internal medicine with inferential capabilities. Unfortunately this re-formulation was also not adopted widely in practice, most likely because it did not address pressing real-life needs [30].

The many practical problems associated with Expert Systems of both the rule-based and heuristic varieties were overcome by the subsequent adoption of data-driven ML algorithms. This technology became applicable across a wide spectrum of applications because of the availability of vast troves of data (e.g., in EHRs, the WWW, and mass-throughput molecular assays) combined with powerful ways to prevent over- and under-fitting (see chapter “Overfitting, Underfitting and General Model Overconfidence and Under-performance Pitfalls and Best Practices in Machine Learning and AI”).

Limitations of Early Bayesian Learners and Emergence of BNs

The earliest conceptual application of AI/ML Bayesian reasoning in medicine, to our knowledge, is outlined in Ledley and Lusted [35], whereas a very early empirical application in a real-life domain using data-driven modeling is due to Warner et al. [36].

As explained in chapter “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare & Health Science”, brute-force application of Bayes’ rule incurs exponential space and time complexity, which renders it infeasible in all but the most trivial problem domains. Because of these factors, early Bayesian classifiers were used with severe restrictions, typically in the form of “Simple” or “Naïve” Bayes, which assumes that the target response (e.g., disease classification) comprises mutually exclusive categories and that the inputs (e.g., disease findings) are independent of each other given the target (see chapters “Foundations and Properties of AI/ML Systems” and “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare & Health Science”). These assumptions are obviously most often false in healthcare and the health sciences and entail substantial error in some applications. This led to research spanning two decades that gave rise to improved formalisms, the most important of which is Bayesian Networks (BNs), discussed in chapter “Foundations and Properties of AI/ML Systems” in terms of properties, in chapter “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare & Health Science” in terms of ML, and in chapter “Foundations of Causal ML” in terms of causality. This new AI technology was probabilistically correct and thus handled uncertainty optimally. It was flexible enough to model any distribution, at the level of space and sample complexity that the distribution required, without unrealistic assumptions. BNs also allowed sound and complete forward and backward chaining inferences to take place [37, 38].
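
A minimal sketch of the Naïve Bayes computation and its core assumption (the diseases, findings and probabilities below are invented purely for illustration): the posterior over mutually exclusive diseases is obtained by multiplying per-finding likelihoods, which is valid only if the findings are conditionally independent given the disease.

```python
# Hypothetical illustrative numbers: two mutually exclusive diagnoses, three binary findings.
prior = {"flu": 0.7, "pneumonia": 0.3}
# P(finding present | disease); independence of findings given the disease is ASSUMED.
likelihood = {
    "flu":       {"fever": 0.80, "cough": 0.70, "dyspnea": 0.10},
    "pneumonia": {"fever": 0.85, "cough": 0.90, "dyspnea": 0.60},
}

findings = {"fever": True, "cough": True, "dyspnea": False}   # observed case

def naive_bayes_posterior(prior, likelihood, findings):
    scores = {}
    for disease, p in prior.items():
        score = p
        for finding, present in findings.items():
            pf = likelihood[disease][finding]
            score *= pf if present else (1.0 - pf)   # factorization relies on the independence assumption
        scores[disease] = score
    total = sum(scores.values())
    return {d: s / total for d, s in scores.items()}  # normalize over mutually exclusive diseases

print(naive_bayes_posterior(prior, likelihood, findings))
# If, in reality, fever and cough are strongly dependent given the disease, this
# factorization double-counts their evidence and the posterior is miscalibrated.
```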

Intractability of BN Inference. Ability to Learn BNs from Data

The new and highly appealing formalism of BNs was quickly discovered to be worst-case intractable in inference, in both the exact and approximate cases [39, 40]. Newer research, however, produced approximate inference algorithms that advanced the tractability of the formalism to large numbers of variables, e.g., [41]. Moreover, it was soon discovered by Herskovits and Cooper that BNs could be learned from data using entropy-based scoring and, shortly afterwards by Cooper and Herskovits, using Bayesian scoring [42, 43]. Heckerman et al. developed a modified family of Bayesian scores that observe likelihood equivalence, which says that the data should not help discriminate between network structures that represent the same assertions of conditional independence [44].

Overfitting and Over-Confidence in Models: Problems, Advances and Persistent Challenges

As detailed in chapter “Overfitting, Underfitting and General Model Overconfidence and Under-Performance Pitfalls and Best Practices in Machine Learning and AI”, overfitting in AI/ML is the creation of models that, because of their high complexity, work well in the discovery data but do not generalize well to the broader population. Over-confidence is a broader class of problems in which we expect models to perform better than their true error warrants (regardless of the models’ complexity). Overfitting is typically a byproduct of how algorithms fit models to data, whereas over-confidence may also involve error estimation and data design faults (see chapters “The Development Process and Lifecycle of Clinical Grade and Other Safety and Performance-Sensitive AI/ML Models” and “Overfitting, Underfitting and General Model Overconfidence and Under-performance Pitfalls and Best Practices in Machine Learning and AI” for detailed analysis). Over-fitting and over-confidence can happen both in knowledge-driven and hand-modeled AI and in data-driven ML modeling.

In the early years of ML, overfitting avoidance was enforced by various factors and practices: the natural dearth of high-dimensional datasets; the elimination of most variables from modeling using often crude feature selection strategies; the statistical practice of analyst-driven “model specification” as opposed to extensive computational search in model space; the “pruning” of models after they were constructed; the application of inductive biases that, ceteris paribus, preferred simpler models over more complex ones; and other strategies and practices. Unfortunately, not all of the above practices were sufficient to address the problem. For example, in classical statistical regression, so-called step-wise procedures were shown to lead to substantially overfitted models with biased error estimates [45].

In the 90s and onward, however, a newer generation of highly regularized learners (see chapters “Foundations and Properties of AI/ML Systems” and “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare & Health Science”) was put into practice, with built-in resistance to overfitting and mathematical error bounds ensuring that model complexity is automatically tailored to the complexity of the data generating function and the available data. Notable such ML methods include penalty+loss learners such as SVMs and regularized regression of various forms (chapter “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare & Health Science”). Other ML algorithms enforce implicit regularization in the form of model priors (Bayesian ML [43, 44]), maximum allowed model size (e.g., Random Forests [14]), or several formal and ad hoc regularization mechanisms in Deep Learning [9, 10].
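
A small simulation sketch (the dimensions, sparsity and noise level are arbitrary choices of our own, for illustration only) of how a penalty+loss learner automatically restrains model complexity when candidate variables far outnumber samples:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV

rng = np.random.default_rng(0)
n, p, k = 60, 300, 5                 # 60 samples, 300 candidate variables, only 5 truly relevant
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:k] = 2.0
y = X @ beta + rng.normal(size=n)

X_test = rng.normal(size=(2000, p))
y_test = X_test @ beta + rng.normal(size=2000)

models = [
    ("unregularized least squares", LinearRegression()),
    ("ridge (L2 penalty)", RidgeCV(alphas=np.logspace(-2, 3, 30))),
    ("lasso (L1 penalty)", LassoCV(cv=5, random_state=0)),
]
for name, model in models:
    model.fit(X, y)
    print(f"{name:30s} test R^2 = {model.score(X_test, y_test):.3f}")
# With p >> n the unregularized fit interpolates the training data and typically
# generalizes far worse than the penalized learners, which shrink the irrelevant
# coefficients toward zero.
```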

Additional advances came in the form of principled, theoretically optimal and empirically powerful feature selection algorithms in the early 2000s and onward, which constrain model complexity immensely before modeling even begins (e.g., reducing, without loss of predictivity, the modeled variables from, indicatively, 139,000 variables to 32 variables, and from 100,000 variables to 6 variables, in some representative applications) [46,47,48,49,50,51].

Powerful nested protocols that combine model selection with error estimation, so that overfitting and over-confidence are avoided, also emerged as extensions of the previously well-known theory of cross-validation [52].

These protocols’ importance was firmly established in genomic data analysis by the work of R. Simon et al. [53], but they are of universal applicability. The above new methods can be combined to strengthen the overall resistance to overfitting.
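
To make the idea concrete, the following minimal scikit-learn sketch (synthetic data, a placeholder learner and a hyperparameter grid of our own choosing) nests hyperparameter selection inside an inner cross-validation loop while the outer loop provides the error estimate, so that no test fold ever influences model selection.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=50, n_informative=8, random_state=0)

inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)   # model selection
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)   # error estimation

# Inner loop: choose C and gamma; outer loop: estimate the error of the whole procedure.
selector = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10, 100], "gamma": ["scale", 0.001, 0.01]},
    cv=inner,
    scoring="roc_auc",
)
nested_auc = cross_val_score(selector, X, y, cv=outer, scoring="roc_auc")
print("nested cross-validation AUC: %.3f +/- %.3f" % (nested_auc.mean(), nested_auc.std()))

# Reporting selector.fit(X, y).best_score_ instead would reuse the same folds for both
# selection and error estimation, yielding an optimistically biased estimate.
```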

With modern algorithms and model selection, error estimation, and safe deployment protocols (see chapters “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare & Health Science”, “Principles of Rigorous Development and of Appraisal of ML and AI Methods and Systems”, “The Development Process and Lifecycle of Clinical Grade and Other Safety and Performance-Sensitive AI/ML Models”, “Overfitting, Underfitting and General Model Overconfidence and Under-performance Pitfalls and Best Practices in Machine Learning and AI”, and “Characterizing, Diagnosing and Managing the Risk of Error of ML & AI Models in Clinical and Organizational Application”), avoiding overfitting can be practically ensured in modern ML modeling. Unfortunately there are still obstacles that need to be addressed:

  (i) There is a significant education and adoption gap: the concepts of overfitting and unbiased error estimation, and the methods to address them, are not as widely known (or as universally practiced) as they should be, especially among many stakeholders and beneficiaries of AI/ML (basic scientists, translational and clinical scientists, health care administrators, etc.).

  (ii) While independent validation of models (chapter “The Development Process and Lifecycle of Clinical Grade and Other Safety and Performance-Sensitive AI/ML Models”) is required by many journals, such validations may fail because of discovery-validation population mismatches or because the validation is under-powered, and not because of overfitting. In other words, failure to pass independent-set validation can generate false negatives. By comparison, cross-validation (properly executed) may only fail if the accessible population differs from the target population (i.e., a data design issue, see chapter “Data Design”).

  (iii) At best, independent validation is more of a forensic failure analysis tool (i.e., why a model may have failed) than a preventative strategy (i.e., how to construct the model so that it will not fail).

  (iv) Hand-made model construction is widely practiced in biomedical engineering and other fields, but without theoretical or practical assurances that the development practices that led to one successful model will generalize to other domains.

  (v) Widely used ML algorithms such as DL still lack a well-developed theory that prevents overfitting and ensures generalization. In an intriguing paper [54], Zhang et al. show that DL networks can perfectly learn random noise, memorizing the data fully, and that these behaviors are not affected by regularization measures. Although empirical evidence suggests that DL learners often do resist overfitting, the reasons are not well understood. It is entirely possible that practical DL models exhibiting excellent classification errors in some domains were made possible simply by the large sample sizes used to train them. At this time DL algorithms may be safer to use in data-rich settings.

  (vi) Results from most ML competitions and challenges are inexorably tied (by design) to a small number of datasets and to the set of loss functions used in the competition. This is a problem of over-confidence in a model, which formally is a broader class of problems encompassing overfitting. As an indicative example, when Narendra et al. compared a number of (highly reputable) DREAM-challenge-winning algorithms for pathway reverse engineering with additional algorithms, using fresh datasets, very hard gold standards, and a variety of loss functions, the previous winners did not exhibit superior performance and in some cases substantially under-performed, especially with respect to several loss functions not originally examined [21]. Moreover, the choice of loss function dramatically reversed the rankings of the algorithms’ performance in many cases.

Ignoring the Data Design and Learning Protocol (Model Selection, Error Estimation) Effects on Modeling Success

Recall from chapter “Foundations and Properties of AI/ML Systems” that the architecture of an ML method is defined by the tuple <L, MF, DD, S, MLS, GM>, in other words the combination of a modeling language (L) that describes the model space (S), the model fitting procedure (MF) that parameterizes models to data, the data design (DD) that samples data from a population, a search procedure (MLS) that explores the model space, and a goal/merit function (GM) used to evaluate each model. Typically, the focus of most ML and data science is on the “algorithm”, which is meant to correspond to <L, MF, S, GM>. It is also common to see discussions of how the sample size may affect the ability to learn specific classes of target functions [55, 56].
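
For concreteness, the tuple can be written out schematically as below (a hypothetical illustration of our own; the example field values are placeholders, not a prescribed notation):

```python
from dataclasses import dataclass

@dataclass
class MLMethodArchitecture:
    """Schematic of the <L, MF, DD, S, MLS, GM> tuple described in the text."""
    modeling_language: str       # L: describes the model space S
    model_fitting: str           # MF: parameterizes models to data
    data_design: str             # DD: how data are sampled from the population
    model_space: str             # S: the space of candidate models
    model_space_search: str      # MLS: how the model space is explored
    goal_merit_function: str     # GM: how each candidate model is evaluated

# Hypothetical example: what is colloquially called "the algorithm" covers only
# <L, MF, S, GM>; the data design and the model selection / error estimation
# protocol are equally strong determinants of success.
example = MLMethodArchitecture(
    modeling_language="L2-regularized logistic regression equations",
    model_fitting="maximum penalized likelihood",
    data_design="retrospective EHR cohort from a single health system",
    model_space="all linear decision functions over the chosen variables",
    model_space_search="convex optimization plus grid search over the penalty",
    goal_merit_function="cross-validated AUC",
)
print(example)
```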

What is omitted from most technical and lay public discourse alike, however, are the effects of data design (Chapter “Data Design”) and of model selection and error estimation (Chapters “Principles of Rigorous Development and of Appraisal of ML and AI Methods and Systems”, “The Development Process and Lifecycle of Clinical Grade and Other Safety and Performance-sensitive AI/ML Models”, “Overfitting, Underfitting and General Model Overconfidence and Under-performance Pitfalls and Best Practices in Machine Learning and AI”). For emphasis, let us represent the success of AI/ML for a particular problem-solving endeavor with Fig. 3, which we invite the reader to always keep in mind in the context of AI/ML projects:

Fig. 3
Successful application of ML is determined by the combined influence of algorithm choice, data design, and the model selection and error estimation protocol: algorithm + data design + model selection and error estimation together lead to success or failure of AI/ML in terms of performance and safety

This brings us to two enduring problems in the historical and modern practice of AI/ML:

Effects of Protocols on Overall Model Performance Are Very Strong

In present day AI/ML practice it is common to encounter situations where the choice of model selection and error estimation protocol may negatively affect, or even nullify, the good properties of ML algorithms. For example, in [57] the largest (to our knowledge) empirical benchmark of text categorization algorithms and protocols was conducted. Among other findings, it established that apparently minor alterations in the protocols used (with respect to their ideal specification) led to reductions in model predictivity as large as from >0.9 AUC to 0.72 AUC, a truly massive degradation in performance (Fig. 4).

Fig. 4
The dramatic effect of model selection protocol on performance (AUC by model selection protocol). A majority of algorithms with properly implemented model selection exceeded 0.9 AUC in the depicted classification benchmark, whereas a modified implementation of the same protocol with the same algorithm, in a commercial product, dropped performance to approximately 0.7 AUC [57]

In the field of genomics, a landmark study was conducted by the MicroArray Quality Control consortium’s MAQC-II project [58], in which 36 independent teams analyzed six microarray data sets to generate predictive models for classifying a sample with respect to one of 13 endpoints indicative of lung or liver toxicity in rodents, or of breast cancer, multiple myeloma or neuroblastoma in humans. The teams generated predictive models and tested the models on data that had not been used for training. Model performance depended largely on the endpoint and team proficiency, while different approaches generated models of similar performance.

A major finding was “substantial variations in performance among the many K-nearest neighbor algorithm (KNN)-based models developed by four analysis teams … Follow-up investigations identified a few possible causes leading to the discrepancies in performance. For example, DAT20 fixed the parameter ‘number of neighbors’ K = 3 in its data analysis protocol for all endpoints, whereas DAT18 varied K from 3 to 15 with a step size of 2”.

The study also found clear differences in proficiency between data analysis teams, and such differences were correlated with the level of experience of the team. Moreover, the study found that the internal validation performance from well-implemented, unbiased cross-validation shows a high degree of concordance with the external validation performance obtained under a strict blinding process.

It was also found that swapping the train and test sets led to different results (with the same modeling approaches). This “fragility” of independent dataset validation reinforces the guidance provided in several places in the present volume about not over-interpreting differences in performance on an independent validation set when internal validation has been executed properly. However, it was also noted that the correlation of the independent validation results with those from the reversed order of modeling and testing was further reduced in teams with less experience, showing the effect of proper modeling practices.

The consortium concluded that “Finally, applying good modeling practices appeared to be more important than the actual choice of a particular algorithm over the others within the same step in the modeling process. This can be seen in the diverse choices of the modeling factors used by teams that produced models that performed well in the blinded validation”.

On the basis of these findings, MAQC-II set forth 4 general guidelines (incorporated in the present volume in several chapters) covering: good design, proper (internal) cross validation (or similar quality error estimation and model selection), caution about pitfalls of independent (external) validation, and not contaminating modeling data with test data information.

The findings of the MAQC-II project suggest that when the same data sets are provided to a large number of data analysis teams, many groups can generate similar results even when different model building approaches are followed. In this particular domain at least, the effect of protocol is more important than the algorithm used.

In another important benchmark study, Statnikov et al. [17] showed that previously published early reports of the superiority of Random Forests over other algorithms for cancer microarray data classification were an artifact of the protocols used to run the algorithms. When the “reference” use (i.e., the one suggested by the methods’ inventors) was followed within a common model selection and error estimation protocol, the results changed in favor of the SVM family of models.

Effects of Data Design on Overall Model Performance Are Very Strong

As we saw in Chapter “Data Design for Biomedical AI/ML”, the data design may affect in myriad ways the ability, ease, and efficiency of drawing specific conclusions and attaining models with desired characteristics, often independently of the learning algorithm. It was also shown that, with the same algorithms employed, some designs do not allow certain inferences at all (or not within reasonable resources), or exhibit large variation in the quality of results. Yet, while the treatment of data design considerations is abundant and very sophisticated in statistics, in AI/ML such considerations often receive less scrutiny from practicing data scientists, vendors and the public. Some examples follow:

Issues with ML Challenges

As discussed in Chapter “Principles of Rigorous Development and of Appraisal of ML and AI Methods and Systems”, ML challenges serve two fundamental purposes. One is the education of scientists from different fields about the problems the challenge addresses, giving them data and a platform to experiment. The second is to explore which algorithmic methods are better at a particular point in time for a particular problem. Well-designed challenges can generate valuable information and enhance interdisciplinary engagement. In most cases, however, challenges suffer from fixing both the data design and the error estimator, thus removing from consideration, study, and improvement two of the three determinants of ML success (Figure 3).

Challenges also routinely restrict the design of modeling by pre-selecting variables and over-simplifying the statement of problems, sometimes to meaningless extremes. Consider, as a representative example of thousands of similar challenge datasets, a high-ranked dataset on the competition platform Kaggle [59]. Participants are asked to build models to predict stroke based on the following features:

gender, age, hypertension (has: yes/no), heart_disease (has: yes/no), ever_married, work_type( "children", "Govt_jov", "Never_worked", "Private" or "Self-employed"), Residence_type ("Rural" or "Urban"), avg_glucose_level, bmi, smoking_status ( "formerly smoked", "never smoked", "smokes" or "Unknown"). Stroke is defined as (1 if the patient had a stroke or 0 if not).

Such a formulation of the problem omits very significant aspects previously established in the clinical literature (e.g., important factors to be measured, prior successful models, precise and meaningful descriptions of the clinical context of use for such models). It is furthermore representative of a systematic pattern on many competition platforms of reducing challenging scientific problems to ‘toy’ versions that are not fully informed by prior science, are often devoid of medical or AI/ML technical context, have dubious choice of features and coding, lack proper design flexibility (including statements of target and accessible populations), and so forth (see Chapter “Data Design for Biomedical AI/ML” for why problem statements and data such as the ones described here lack basic pre-requisites for making scientific advances).

Challenges also suffer from incomplete or highly biased representation in the competitor pool. Most participants in challenges are either students or interested scientists from other areas, and thus typically have limited skill in tackling the specific problem. Another limitation is that not all appropriate algorithms are entered in a challenge, and when they are entered they are not necessarily executed according to optimal specifications. Finally, challenges typically involve a very small number of datasets that do not represent a large domain; such representative coverage typically requires many dozens of datasets or more in the same comparison. One final defining characteristic of challenges is that they externalize and distribute the cost of method benchmarking to volunteer participants (which explains in part their proliferation for commercial purposes as a low-cost R&D alternative business model).

Despite these limitations, a select number of challenges that are designed to a high degree of quality, when interpreted carefully and with the appropriate qualifications, can provide empirical scientific information that supplements the centralized and distributed benchmark designs (details in Chapter “Principles of Rigorous Development and of Appraisal of ML and AI Methods and Systems”). For examples of well-designed and carefully interpreted challenges, the reader may refer to the challenges conducted by the ChaLearn organization [60].

Other Persistent Issues Related to Common Data Design Deficiencies

We briefly mention here a number of additional areas negatively affected by suboptimal data design choices (or by not paying enough attention to data design):

  • Active learning algorithms entail gathering data in which the natural priors of the target response classes are typically unknown. However, such knowledge is necessary for characterizing model errors by loss functions that are prior-dependent (e.g., positive and negative predictive value). This creates obstacles to safe model deployment (see Chapter “Characterizing, Diagnosing and Managing the Risk of Error of ML & AI Models in Clinical and Organizational Application”).

  • The shortcomings of EHR data for secondary analysis in terms of data incompleteness and bias are widely appreciated (and often exaggerated). An issue that receives much less attention with EHR-based data designs is lack of power. For example, the PheWAS design for genotype-phenotype correlation discovery [61] is a modern design that aims to bridge the clinical with the genomic worlds. However, in a fragmented health system like the USA’s, such designs are hampered by the difficulty of pooling data across provider organizations and of harmonizing population characteristics and risks of bias. As a result, discovery is confined to single systems or small pools of them. The limited genomic variable scope in EHRs incurs an exceedingly small a priori probability of discovery, and these two factors together further exacerbate the low-power issue. Newer Big Science designs, e.g., the VA’s MVP project [62] or the All of Us project [63], are poised to overcome these difficulties.

  • Power/sample-size design is harder in ML because knowledge of learning curves is required but is often only empirically available (see Chapter “The Development Process and Lifecycle of Clinical Grade and Other Safety and Performance-Sensitive AI/ML Models” for guidance). In some cases, assumption-driven and early data-fitted empirical learning curves have failed to produce unbiased estimates (as, for example, occurred in [64]).

Causality in Early and Modern ML

One of the most foundational aspects of AI/ML for improving health outcomes and the healthcare system, and for driving advances in health science discovery, is causality: its discovery, representation, and modeling. We will thus discuss in the present chapter several historical and modern challenges and successes involving causal knowledge.

“Correlation Does Not Imply Causation” and the Falsely Implied Impossibility of Discovering Causality Without Experiments. Problems with RCTs and Low-dimensional Experiments

For many years, the celebrated statistician Sir Ronald Fisher’s famous warning about the inappropriateness of concluding that A causes B (or vice versa) from the fact that A and B are correlated was inappropriately extrapolated to the notion that only experiments can discover causality. In the clinical health sciences the randomized controlled trial (RCT), a methodology with a long history [65,66,67], has been viewed as the sole arbiter of reliable causal discovery.

Fisher’s statement is quite benign since, properly interpreted, it implies that not all correlations between A and B entail that A → B or A ← B; rather, a confounder may be responsible for the correlation: A ← C → B. It follows that if we observe that A and B are correlated, then they may or may not cause one another. If we observe that A and B are uncorrelated, on the other hand, then in the majority of distributions A is not causing B and vice versa. The presence of correlation thus provides evidentiary support for the existence of causation, and lack of correlation effectively precludes it [68].

If we are interested in the causes (or effects) of a variable A, we can typically discard all variables X that are uncorrelated with A. Variables Y that are correlated with A are candidate causes or effects of A. If the possible confounders of A and Y are measured and we know which they are, we can control for them analytically (by SEMs or regression, see Chapter “Foundations of Causal ML”) or by matching. These facts led to widespread practices in epidemiology whereby a large pool of correlates, as well as putative confounders, was established by data analysis, and the correlates were examined for vanishing conditional correlations with A given the assumed confounders. If the non-zero correlation of Y with A becomes zero (i.e., vanishes) when conditioned on the complete set of measured confounders C, then A is not causing Y, nor vice versa. If the conditional correlation is not zero, then this supports the existence of a causal relationship between A and Y.
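
A minimal simulation sketch of this “vanishing conditional correlation” logic (the structural equations and coefficients are invented for illustration): when a measured confounder C drives both A and Y and there is no direct causal edge between them, the A–Y correlation is strong marginally but vanishes once C is conditioned on (here via linear residualization).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Assumed structural model: A <- C -> Y, with no direct edge between A and Y.
C = rng.normal(size=n)
A = 1.5 * C + rng.normal(size=n)
Y = 2.0 * C + rng.normal(size=n)

def corr(u, v):
    return np.corrcoef(u, v)[0, 1]

def partial_corr(u, v, given):
    # Correlation of the residuals after linearly regressing out the conditioning variable.
    X = np.column_stack([np.ones_like(given), given])
    ru = u - X @ np.linalg.lstsq(X, u, rcond=None)[0]
    rv = v - X @ np.linalg.lstsq(X, v, rcond=None)[0]
    return corr(ru, rv)

print("corr(A, Y)    :", round(corr(A, Y), 3))             # strong marginal correlation (~0.74)
print("corr(A, Y | C):", round(partial_corr(A, Y, C), 3))  # ~0: no direct causal relationship
```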

A variant of this analytical control methodology, propensity scoring, was introduced in [69]. The fundamental idea of propensity scoring is that the analyst builds a model predicting the exposure from the measured covariates and then controls the correlation of exposure and outcome by conditioning on the score (predicted exposure probability) of that model. If the conditional correlation of the exposure and the outcome vanishes, then the exposure does not cause the outcome.
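A minimal sketch of this idea follows. The synthetic data, the stratified check, and the use of a logistic model are illustrative assumptions; this is not a full causal-effect estimator, only a demonstration of conditioning on the propensity score.

```python
# Minimal sketch of propensity-score conditioning: model exposure from covariates, then check
# whether the exposure-outcome association persists within propensity-score strata.
# Synthetic data in which the covariates drive both exposure and outcome (pure confounding).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 20000
X = rng.normal(size=(n, 3))                                    # measured covariates
exposure = (X @ [1.0, 0.5, -0.5] + rng.normal(size=n)) > 0     # exposure driven by covariates
outcome  =  X @ [0.7, -0.3, 0.4] + rng.normal(size=n)          # outcome driven by covariates only

ps = LogisticRegression().fit(X, exposure).predict_proba(X)[:, 1]   # estimated propensity score

# Within-stratum association of exposure and outcome (here expected to be near zero).
strata = np.digitize(ps, np.quantile(ps, [0.2, 0.4, 0.6, 0.8]))
for s in range(5):
    m = strata == s
    diff = outcome[m & exposure].mean() - outcome[m & ~exposure].mean()
    print(f"stratum {s}: mean outcome difference (exposed - unexposed) = {diff:.3f}")
```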

Additional heuristic causality rules (Hill’s criteria or Koch’s postulates [70,71,72]) were often overlaid on correlative analytics in order to support or weaken causal interpretation.

There are major limitations with all of the above approaches, however [68, 73]:

  (a) Even if the confounders are all measured, it is not always known that they are confounders. Wrong causal conclusions are likely if the controlling or matching variables are, or include, non-confounders.

  (b) If even one of the confounders is not measured (i.e., it is “hidden” or “latent”), then it cannot be controlled analytically or by data design with the previous approaches.

  (c) The heuristic criteria were shown to be invalid and misleading.

  (d) The assumptions that would ensure correctness of propensity scoring are not testable within the propensity scoring theory. Moreover, in a vigorous peer-reviewed debate [74, 75], Pearl and others showed that in a very large class of models, conditioning on the propensity score introduces errors in the estimation of causal effects, and that the theory of propensity scoring cannot detect these problematic cases.

Randomized causal experiments or RCTs are typically used to confirm or reject the causal nature of a hypothesized causal relationship. RCTs and other randomized experiments nullify (on average) all confounders between treatment/exposure and outcome even if these confounders are unknown and unmeasured.

As useful as RCTs are, they also have significant limitations (and the same is true for other randomized experiments); here we will summarize only the salient limitations mainly with respect to scalability, feasibility, scope, and completeness [68, 73]:

  • RCTs are a confirmatory and not a discovery procedure.

  • RCTs are very expensive and time-demanding.

  • RCTs are infeasible in many settings.

  • RCTs are unethical in many settings.

  • RCT execution commonly fails for a variety of reasons (e.g., accrual failure, lack of power).

  • RCTs reveal the existence of causation (remote or direct) but do not determine whether the causation is direct.

  • RCTs cannot be used to develop system-level causal models (i.e., full causal specification of the data generating process). See in particular the important results from the theory of causal discovery summarized in Chapter “Foundations and Properties of AI/ML Systems”, which show the severe limitations of one-variable-at-a-time or fragmented experiments with regard to their ability to learn the whole system of causal relationships (unless they are coupled with, and supported by, algorithmic non-experimental causal discovery). Many of these RCT shortcomings also apply in biology and other basic health sciences, in low-dimensional randomized experiments (i.e., manipulating one or a few variables at a time).

  • RCTs produce average rather than individualized (precision) causal effect estimates (i.e., they are too broad in effect size estimation).

  • RCTs are applied to a typically narrow accessible population and not the wider target population (i.e., they are too narrow with regard to the target population).

  • RCTs are extremely sensitive to context (for a demonstration that is as vivid as it is whimsical, see [76]).

New Class of Scalable Causal Discovery Algorithms

The limitations of the [candidate discovery → validation] discovery chain via the sequence [uncontrolled correlations → controlled correlations → randomized experiments] were radically reduced by the emergence of a new class of causal ML algorithms that, under explicit conditions, guarantee reliable and scalable discovery and modeling of complex causal structure even in the absence of experiments. These algorithms originated from diverse data science and AI/ML disciplines: Computer Science (Pearl, Verma and others [73]), Computational Philosophy (Spirtes, Glymour, Scheines and others [68]), Econometrics and AI (Granger, Herbert Simon, Sims, Imbens and others), and Health Informatics (Cooper et al. [43]). A multitude of theoretical and algorithmic advances pioneered by these researchers addressed the limitations of earlier approaches and offered effective alternative and supplemental discovery and modeling procedures. Among the milestone contributions of the above, it is worthwhile highlighting the richness and breadth of Pearl’s theoretical treatment of causality, the seminal algorithmic body of work by Spirtes, Glymour, Scheines and their students and collaborators, and the remarkable IC* algorithm by Verma and Pearl, which opened the door to detecting the existence of unmeasured variables that confound statistical correlations. Please refer to Chapter “Foundations of Causal ML” for details on properties and capabilities.

Early Causal Algorithms Were Not Scalable; Claimed Impossibility of Tractable Causal Discovery Algorithms. Discovery of Scalable and Efficient Causal Algorithms and Causal Feature Selection

As these new causal methods started being used in empirical applications, it quickly became evident that the algorithms would not scale to more than a few dozen variables in unrestricted distributions with hidden variables, or to more than ~100 variables when no latent variables were present. Notable researchers expressed strong pessimism that scalable algorithms could be invented (e.g., prominent algorithms expert and Turing Award winner J. Ullman, in a paper discussing large-scale data mining techniques [77], assessed that “… the goal has generally been to learn complete causal models, which are essentially impossible to learn in large-scale data mining applications with a large number of variables”).

Fortunately, not long after these pessimistic predictions were made, the first scalable local causal methods were discovered, and within a few years full causal graph ones followed [78]. Moreover, researchers developed methods that bridged the local causal discovery problem and predictive modeling by introducing scalable, sample-efficient, and correct Markov Boundary algorithms [8, 48]. Chapters “Foundations and Properties of AI/ML Systems”, “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare & Health Science” and “Principles of Rigorous Development and of Appraisal of ML and AI Methods and Systems” give more detail on the approaches used to scale up the previous generation of algorithms, and [79] describes the theoretical framework connecting predictivity, feature selection, and local causality. These methods were subsequently expanded and followed by sound and scalable algorithms capable of equivalence-class modeling, ML-guided experimentation, and generalized families thereof [80, 81].

Lack of Causal Correctness in Broad Use of ML and Especially the Persistent Use of General Predictive Modeling Methods to Solve Causal Problems

A widespread misapplication of AI/ML methods and tools, across the academic sphere and industry alike, involves the use of methods not designed for causality to solve causal problems. This includes, for example, the use of vanilla predictive modeling, clustering, network science, non-causal feature selection, dimensionality reduction, and other methods lacking causal correctness. Especially in the predictive modeling area, numerous methods are abused in this regard (e.g., Deep Learning and other ANNs, regularized regression and classification, Decision Trees and Random Forests, SVMs, Boosting, Bagging and Ensembling; see Chapter “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare & Health Science”). To understand the magnitude of the problem, we refer the reader to the large-scale benchmarks published in [48], where >100 algorithms and variants were tested on >40 datasets, from real life and resimulation, as to their suitability for feature selection/prediction and for causal discovery. The main findings included that: (a) predictive methods performed very well or optimally for prediction; (b) causal feature selection led to optimal predictive models while using the smallest number of features; and (c) predictive feature selection is utterly inappropriate for causal discovery. The following two figures demonstrate this last finding. In Figure 5 an experiment is shown where a local causal method in the GLL family and a predictive method (SVMs) are compared on the task of discovering the local causal pathway around a response variable T of interest in a transcriptomic network used to generate the comparison data. The causal method achieves this goal with near-perfect causal accuracy (blue nodes, zoomed in the figure) and empirically optimal predictive accuracy at the same time. The non-causal method is equally predictive, but selects features across the network without any useful causal interpretation. Notice that once we can recover the local pathway accurately, we can apply the causal procedure across all variables and recover the whole causal network. Moreover, the ability to estimate the causal effects of manipulations is totally dependent on identifying the confounders. Hence the focus on local causal accuracy is justified by its being a prerequisite for both correct causal structure and correct quantitative causal effect estimation.

Fig. 5
An illustration of a network represented as nodes connected with one another. The nodes are labeled with numbers and are in 3 different shades. A few nodes are encircled.

Causal method (GLL, blue) vs non-causal methods (SVM-RFE, yellow) compared for local causal discovery

These behaviors are not confined to a single dataset or a single pair of algorithms, but are general operating characteristics.

Figure 6 compares a multitude of causal and non-causal methods: GLL variants (causal), RFE-SVM, UAF, and regularized regression/Elastic Net (non-causal). As can be seen, causal methods recover the direct causes with minimal false positives (and optimal predictivity), whereas non-causal methods, although predictively excellent, produce essentially only causal false positives. In these experiments Lars-EN in particular (for reasons that are poorly understood) seems to have an anti-causal inductive bias, selecting features as far away from the direct causes as possible.

Fig. 6
A radar chart of causal and non-causal methods. The causal methods include HITON-PC with FDR, with max-k equal to 1, 2, 3, and 4. Non-causal methods include LARS-EN (one-versus-rest and multiclass), UAF (S2N and KW), SVM (20% and 50%), and RFE with reduction by 20% and 50%. The non-causal methods occupy most of the chart area.

Comparison of causal (red sector) and non-causal (green sector) methods for causal discovery. Concentric circles show the distance from the true local causal neighborhood of the target response. Better performance: results close to the center. Worse performance: results away from the center

Pillars of Causal AI

The need for wider adoption of causal methods in AI/ML is as pressing as ever. Pearl [82] provides a lucid description of this urgency and outlines his 7 pillars of Causal AI as follows:

  • Pillar 1: Encoding Causal Assumptions transparently and testably.

  • Pillar 2: Using do-calculus to control confounding.

  • Pillar 3: Algorithmization of counterfactuals and counterfactual effect estimation.

  • Pillar 4: Mediation analysis and the assessment of direct and indirect effects.

  • Pillar 5: Dealing with external validity and sample selection bias.

  • Pillar 6: Dealing with missing data.

  • Pillar 7: Advancing causal discovery.

Case Studies with Pitfalls Involving AI/ML Used in Genomics

Among research domains in the health sciences, genomics and other fields driven by high-throughput (i.e., high-dimensional) omics assays (e.g., transcriptomic, proteomic, metabolomic, microbiomic, epigenetic) have arguably benefited the most from the availability of ML methods for making sense of large datasets originating from vastly complicated data generating processes (see Chapter “Artificial Intelligence (AI) and Machine Learning (ML) for Healthcare and Health Sciences: The Need for Best Practices Enabling Trust in AI and ML”). In clinical domains these advances have enabled rapid progress, for instance the progress of molecular oncology toward deriving rationally-designed “targeted” treatments as well as precision medicine tests [83] that are rapidly and decisively changing the cancer field. These achievements did not occur smoothly, however, and several case studies exist that serve as examples of how deviation from best practices can lead to grave errors, dead ends, and false starts.

The Anil Potti incident is one such case study, in which several clinical trials were shut down, several grants were terminated, and a substantial number of papers were withdrawn. Although some accounts of the incident emphasize administrative errors and compliance issues, it is worth noting that at the core of the misconduct was the creation of controversial, complex precision oncology models and tests that were used to treat patients in clinical trials [84]. This is also a vivid illustration that testing ML models in an RCT is only a final step and does not constitute sufficient protection, since poorly-constructed models will endanger human subjects. In response to this incident the National Cancer Institute requested that the Institute of Medicine (IOM) establish a committee to recommend ways to strengthen omics-based test development and evaluation.

Best Practice 12.7.1

Follow the IOM’s recommendations and best practices to enhance the development, evaluation, and translation of omics-based tests before they are used to guide patient treatment in clinical trials. These aim to ensure that progress in omics test development is grounded in sound scientific practice and is reproducible, resulting not only in improved health care but also in continued public trust [85].

In another high-profile case study, research conducted in the labs led by Drs. Petricoin and Liotta at the FDA, using a proprietary SELDI-TOF mass spectrometry technology, was shown to suffer from several modeling errors and irreproducible conclusions. These findings led to the closing of these labs, and the company responsible for the mass spectrometry assay technology became defunct. Most importantly, the whole field of clinical proteomics suffered a significant credibility blow that, by many accounts, held back its progress by many years [86, 87].

In another, present-day challenge, it has been established that random biomarker selection from omics data often leads to optimal predictor models, comparable to biomarker signatures produced by sophisticated ML algorithms [88]. This creates a series of challenging questions that undermine the credibility of biomarkers discovered via widely-used current ML technologies: (i) if ML selection is often not better than random for many algorithms/datasets, do biomarkers from such ML algorithms have any special biological significance? (ii) What is the added value of ML biomarker selector algorithms over random choice? (iii) Should biomarker sets and signatures be rejected on account of not passing a statistical test of difference from a random set? If yes, using what criterion? (iv) Some studies have conducted such tests while many others did not; what are the implications for the quality of their findings? A comprehensive and mathematically sound theory to explain these phenomena has not yet been published.

The reader interested in genomic applications of AI/ML is also referred to the landmark MAQC-II study (Section “Effects of Protocols on Overall Model Performance Are Very Strong”).

We close this section by mentioning discouraging results from a major systematic analysis of the literature conducted by the Biometrics branch of the NCI [89]. This study revealed that many studies in genomics fail to follow a number of good or best practices. The implications for the validity and reproducibility of these studies’ results are not known exactly but are likely to be severe.

Attributes and Limitations of Published Guidelines and Criteria for Health AI/ML

In recent years several guidelines have been published by scholars, regulatory and accreditation bodies, and others. Because of the vastness and socio-technical complexity of the AI/ML field and its health applications, and because of the differing goals and purview of guideline authors, the resulting recommended practices vary greatly in their properties, scope, and effectiveness.

The Risk of Exaggerating Guideline Generality

This risk occurs when recommended practices are advisable under some conditions but may be interpreted as being true under any conceivable condition (i.e., absolutely). In other words, the guidelines describe sufficient conditions for good (or bad) performance but not necessary and sufficient ones. For example, in many chapters of this book we have referenced the landmark work by Rich Simon et al. showing the advantages of nested cross-validation (CV) over non-nested CV. Consider carefully the original Simon et al. experiment [90]: it shows a strong bias when a particular feature selector is applied under the null hypothesis (i.e., no signal on average, by label-reshuffled data construction). What if the feature selector, instead of naïve univariate association ranking, used Bonferroni-corrected, multiple-testing-aware association? It can be seen that under the null this feature selector would return the empty set and no bias would ensue. Or what if one used a feature selector that internally employed nested cross-validation? Again, a different bias would ensue. What would happen if one used a regularized or Markov Boundary feature selector with strong protections against overfitting? And what would happen if all of the above, alone or in combination, were attempted, not just under the null but under the alternatives of weak, moderate, and strong signals in a variety of distributions?
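A minimal simulation sketch of this thought experiment is shown below. The null data, sample size, dimensionality, and classifier are arbitrary illustrative choices; the point is only to contrast feature selection performed outside versus inside the cross-validation loop, and a Bonferroni-corrected selector that returns no features under the null.

```python
# Minimal sketch (synthetic null data, illustrative settings): under a null signal,
# univariate feature selection performed outside cross-validation inflates the AUC estimate,
# Bonferroni-corrected selection returns no features (chance AUC), and nested selection stays unbiased.
import numpy as np
from scipy import stats
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n, p, k_top, alpha = 60, 2000, 20, 0.05
X = rng.normal(size=(n, p))
y = rng.integers(0, 2, size=n)             # labels independent of X (null signal)

def top_k_by_ttest(X, y, k):
    t, _ = stats.ttest_ind(X[y == 1], X[y == 0])
    return np.argsort(-np.abs(t))[:k]

def bonferroni_select(X, y, alpha):
    _, pvals = stats.ttest_ind(X[y == 1], X[y == 0])
    return np.where(pvals < alpha / X.shape[1])[0]

def cv_auc(X, y, select_inside_cv, selector):
    cv, aucs = StratifiedKFold(5, shuffle=True, random_state=0), []
    global_feats = selector(X, y)           # used only by the "leaky" (outside-CV) design
    for tr, te in cv.split(X, y):
        feats = selector(X[tr], y[tr]) if select_inside_cv else global_feats
        if len(feats) == 0:                 # nothing selected -> chance-level prediction
            aucs.append(0.5)
            continue
        clf = SVC(kernel="linear").fit(X[tr][:, feats], y[tr])
        aucs.append(roc_auc_score(y[te], clf.decision_function(X[te][:, feats])))
    return np.mean(aucs)

print("naive selection outside CV :", cv_auc(X, y, False, lambda X, y: top_k_by_ttest(X, y, k_top)))
print("naive selection inside CV  :", cv_auc(X, y, True,  lambda X, y: top_k_by_ttest(X, y, k_top)))
print("Bonferroni selection       :", cv_auc(X, y, False, lambda X, y: bonferroni_select(X, y, alpha)))
```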

Figure 7 demonstrates relevant experimental results (conducted by Dr. Alexander Statnikov in the Aliferis lab, on representative gene expression datasets).

Fig. 7
A set of 6 histograms of the proportion of simulated datasets versus AUC performance. They represent feature selection and classifier training on all data and under cross-validation.

Exploring robustness of guidelines. In the figure to the left, gene selection with Bonferroni adjustment produces unbiased results independent of cross-validation design. The advice that “Biased resubstitution or only partially cross-validated estimates should either not be reported or should be clearly represented as unreliable indicators of prediction accuracy” is not universally correct. In the right part of the figure, SVM-RFE gene selection produces a very small bias with no cross-validation, a moderate bias with “incomplete” cross-validation, and no bias with “complete” cross-validation, aka non-nested vs nested cross validation

This example shows how the details of the gene selection procedure may produce bias that is not a monotonic function of the degree to which the analysis conforms to the Simon et al. criteria. This is because of the intricacies of how the gene selection procedure works. A reviewer blindly following a rigid set of criteria, and not possessing specialized knowledge about how the analysis is done, would conclude that the study corresponding to the analysis of experiment 1 is grossly biased and that the analysis of experiment 2 is less biased. In the presented experiments, however, the opposite is a more accurate description of reality.

In general, the embedded algorithms influence the modeling protocol (not just the other way around, as we showed in the MAQC-II results). To the credit of the authors, Dupuy and R. Simon in their subsequent systematic review and guidelines study [89] included two important guidelines under the section “Statistical analysis: general options” that read:

“Do Be aware that many aspects of statistical analysis and reporting of microarray studies are not covered in this checklist.”

“Don’t Consider that all the items included in these guidelines are commandments.”

We wholeheartedly mirror these statements for the present book as well. The way the guidance was stated in the original report, however, likely creates the perception among (especially technically unsophisticated) readers that there is only one right way to produce unbiased results and that every deviation is unacceptable. If such guidance is then followed uncritically, it may hinder other methods that lead to good results (and this is exactly what happened, as explained in the next section on how meta-analyses have over-interpreted guidelines).

Another example of over-generalized guidance has to do with stepwise feature selection. While it is true that stepwise procedures as implemented in most statistical software do not have guarantees of correctness and routinely overfit [45], the same is not true for every conceivable stepwise feature selector. For example, the IAMB family of feature selectors is based on an iterative entry/deletion process, yet these selectors (a) are guaranteed to be correct (under assumptions) and (b) do not overfit, because of the way they filter out noise variables (in addition, they are followed in practice by regularized classifiers that further reduce any residual overfitting) [91].
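For illustration, a highly simplified IAMB-style sketch follows. It uses a Gaussian partial-correlation test as the conditional independence test and a basic grow/shrink loop; the published IAMB/GLL algorithms include additional optimizations, heuristics, and correctness guarantees under explicit assumptions, so this should be read as a conceptual sketch only.

```python
# Minimal sketch of an IAMB-style Markov-boundary feature selector (simplified, illustrative).
import numpy as np
from scipy import stats

def partial_corr_pvalue(data, i, j, cond):
    """Fisher z-test of the partial correlation between columns i and j given 'cond'."""
    cols = [i, j] + list(cond)
    corr = np.corrcoef(data[:, cols], rowvar=False)
    prec = np.linalg.pinv(corr)                              # precision matrix
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])
    r = np.clip(r, -0.999999, 0.999999)
    n, k = data.shape[0], len(cond)
    z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n - k - 3)
    return 2 * (1 - stats.norm.cdf(abs(z)))

def iamb(data, target, alpha=0.05):
    """Return a candidate Markov boundary (column indices) of column 'target'."""
    predictors = [c for c in range(data.shape[1]) if c != target]
    mb = []
    changed = True
    while changed:                                            # growing (entry) phase
        changed = False
        best, best_p = None, alpha
        for x in predictors:
            if x in mb:
                continue
            p = partial_corr_pvalue(data, target, x, mb)
            if p < best_p:
                best, best_p = x, p
        if best is not None:
            mb.append(best)
            changed = True
    for x in list(mb):                                        # shrinking (deletion) phase
        rest = [v for v in mb if v != x]
        if partial_corr_pvalue(data, target, x, rest) >= alpha:
            mb.remove(x)
    return mb

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 2000
    X1, X2 = rng.normal(size=(2, n))
    T = X1 - X2 + 0.5 * rng.normal(size=n)                    # T depends on columns 0 and 1
    X3 = T + 0.5 * rng.normal(size=n)                         # child of T (column 3)
    data = np.column_stack([X1, X2, T, X3, rng.normal(size=(n, 5))])
    print(sorted(iamb(data, target=2)))                       # expect [0, 1, 3]
```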

A final example of the danger of over-generalizing guidelines is the case of heuristic guidelines, which are often encountered in applied statistics as well as in ML modeling. Examples include the guidance that in a regression model we should have at least 10 samples for every fitted parameter, or that in the PC algorithm a test of conditional independence should not be conducted unless at least 5 samples are available for every degree of freedom. Such heuristic guidance is inherently bound to fail in many distributions, yet its value is that (i) it is easy to remember and communicate, (ii) it is easy to implement, and (iii) it is founded on robust theoretical principles and as a result works well in the average or majority of real-life data.

Over-Interpreting Guidelines Can (and Does) Negatively Affect the Meta-Analytic Literature and Its Subsequent Literature

As an example case study of a broad problem: based on the guidance of Simon et al., Ntzani et al. [92] reviewed the genomics literature of the time and found that most studies did not follow the Simon et al. guideline. These authors reported that a majority of such studies did not follow the complete validation guideline, without examining whether the actual deviations led to errors, but implying that they most likely did. In a subsequent publication, one of the same authors [93] wrote an influential editorial that referenced the Ntzani study and asked: “Microarrays and molecular research: noise discovery?”. Clearly, as such literature chains are formed, guideline over-interpretation errors can be amplified.

Guidelines and Standards with Implied Exaggerated Completeness or Impact

For example, the criteria used by a well-known accrediting body (see Chapter “Reporting Standards, Certification/Accreditation, and Reproducibility” for details) suggest institutional mastery of ML/AI if the highest accreditation criteria are met, or alternatively that such mastery can be established by the accreditation process. However, the criteria do not address numerous critical factors that are essential for attaining any reasonable interpretation of “mastery” (see the same chapter for more details).

Similarly, reporting standards imply that good science has been attained, or can be verified, if the reporting criteria are met. As we detail in Chapter “Reporting Standards, Certification/Accreditation, and Reproducibility”, however, such an expectation should currently be viewed with caution. There are numerous ways for an AI/ML study or project to be biased, ineffective, or even clinically dangerous even when all the stated reporting criteria are met. Additionally, many of the relevant details cannot be checked by readers, reviewers, and so on. At best, one could reasonably claim that following the reporting standards would (i) reduce the chances that (ii) certain weaknesses (iii) go unnoticed, (iv) provided they are truthfully reported (v) in their entirety, and (vi) provided the readers are able to accurately evaluate the relevance of the reported criteria to the quality of the published AI/ML. This is quite a complex qualifying statement, but it represents the truth better than the absolute view that following the reporting criteria will invariably lead to better AI/ML.

Another example of exaggerated implied importance is that of [94], where the highest level of reproducibility rigor is assigned to studies meeting the following standard: “The gold standard for reproducibility requires the entire analysis to be reproducible with a single command. Achieving this goal requires authors to automate all steps of their analysis, including downloading data, pre-processing data, training models, producing output tables, and generating and annotating figures. Full automation stands in addition to tracking dependencies and making their data and code available. In short, by meeting the gold standard, authors make the burden of reproducing their work as small as possible.”

Such a reproducibility standard must be very carefully interpreted, however. First, such a requirement is directly feasible only when the discovery/development process lacks stochasticity (for example, when only non-randomized discovery algorithms are involved). When stochastic elements are present, additional information must be provided, for example the random number generators used and their starting conditions (seeds), if available. Alternatively, the realized random parameters, for example the random splits used in cross-validation or the random re-samples in bootstrapping, must be stored and shared.
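A minimal sketch of recording such stochastic elements follows. The file name, JSON layout, and the use of a scikit-learn splitter are illustrative assumptions; the point is simply that seeds and realized splits are persisted alongside results so a later run can reuse the exact same partitions.

```python
# Minimal sketch of recording the stochastic elements of an analysis (seed and CV splits)
# so that a later run can reproduce exactly the same partitions rather than re-randomizing.
import json
import numpy as np
from sklearn.model_selection import StratifiedKFold

SEED = 20240101
rng = np.random.default_rng(SEED)
y = rng.integers(0, 2, size=200)                 # stand-in labels for illustration

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
splits = [{"fold": i, "train": tr.tolist(), "test": te.tolist()}
          for i, (tr, te) in enumerate(cv.split(np.zeros((len(y), 1)), y))]

with open("analysis_provenance.json", "w") as fh:
    json.dump({"seed": SEED, "cv_splits": splits}, fh)

# A re-analysis can later load analysis_provenance.json and iterate over the stored
# index lists instead of calling cv.split() again.
```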

Additionally, the information provided by such replication would be equivalent to stating that the results were not tampered with. In the grand scheme of things, however, this is a low bar to pass. The criterion lacks the ability to test and reject, or confirm, whether the data design at hand was appropriate for the problem to be solved, whether the applied procedures (algorithms and protocols) were the right ones, whether the data was corrupted or not, whether the code for the experimental analysis had bugs, whether the interpretation of findings was appropriate, and so on, for the myriad of factors that can affect the quality of modeling and the trustworthiness of results.

On the other hand, a particularly useful aspect of being able to re-run the exact protocol used in the analysis (not mentioned in the guideline) is the possibility of applying label-reshuffling tests that probe the analysis protocol’s propensity to overfit (see Chapter “Characterizing, Diagnosing and Managing the Risk of Error of ML & AI Models in Clinical and Organizational Application”).
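A minimal sketch of such a label-reshuffling test follows. The stand-in protocol is an illustrative assumption; in practice the entire published pipeline (feature selection, tuning, error estimation) would be re-run on each permuted copy of the labels, and an unbiased protocol should report chance-level performance on average.

```python
# Minimal sketch of a label-reshuffling (permutation) test of a protocol's propensity to overfit:
# re-run the full protocol on permuted labels; an unbiased protocol should center on chance (AUC ~0.5).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

def protocol_estimate(X, y):
    """Stand-in for the full modeling protocol; returns its cross-validated AUC estimate."""
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()

def label_reshuffling_test(X, y, n_permutations=100, seed=0):
    rng = np.random.default_rng(seed)
    null_scores = [protocol_estimate(X, rng.permutation(y)) for _ in range(n_permutations)]
    return np.mean(null_scores), np.std(null_scores)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(120, 50))
    y = rng.integers(0, 2, size=120)
    print(label_reshuffling_test(X, y, n_permutations=20))   # expect mean near 0.5
```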

Limitations of Literature

The current peer-review-based system governing the publication of scientific papers has a long history, and its strengths and weaknesses have been the subject of intense scrutiny and debate over many years within the science community. Within the operating characteristics of the present system, a number of problematic cases can and do occur:

Limitations of Self-Correction

Scientific research is in principle supposed to be self-correcting, in the sense that if a false result is published, then another study that fails to reproduce it, or that proves the first finding wrong, will be published and will neutralize the deleterious influence that the initial false finding could have on subsequent studies. Unfortunately this is a highly idealized version of reality. In practice, publication bias [95, 96] is a well-known phenomenon in the clinical and other literatures, in which positive results have a better chance of being published, are published earlier, and are published in journals with higher influence. As a consequence, effects are often overestimated. This is a major contributor to the so-called “replication crisis”, in which many experimental findings (claims) cannot be replicated and some are likely to be false. Does this imply that scientific facts (consensus conclusions over many studies) are false as well? Nissen et al. [97] modeled the community’s confidence in a claim as a Markov process, with successive published results shifting the degree of belief, and found that publication bias skews the distribution of published results and that, unless a sufficient fraction of negative results is published, false claims can frequently become canonized as fact.
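For intuition only, a toy simulation in the spirit of such belief-updating models is sketched below. All parameter values are hypothetical, and this is not a reconstruction of the actual Nissen et al. model; it simply shows how a low publication rate for negative results can drive community belief in a false claim toward certainty.

```python
# Toy sketch (hypothetical parameters): belief in a claim is updated only on published results;
# when negative results are rarely published, a false claim can drift toward "canonization".
import numpy as np

rng = np.random.default_rng(42)
true_effect = False
p_pos_if_true, p_pos_if_false = 0.8, 0.1    # assumed chance a single study reports a positive result
pub_prob_pos, pub_prob_neg = 0.9, 0.05      # assumed publication bias (negatives rarely published)

belief = 0.5                                # community prior that the claim is true
for _ in range(50):
    positive = rng.random() < (p_pos_if_true if true_effect else p_pos_if_false)
    if rng.random() > (pub_prob_pos if positive else pub_prob_neg):
        continue                            # the study never appears in the literature
    like_true = p_pos_if_true if positive else 1 - p_pos_if_true
    like_false = p_pos_if_false if positive else 1 - p_pos_if_false
    belief = belief * like_true / (belief * like_true + (1 - belief) * like_false)

print(round(belief, 3))                     # often close to 1 despite the claim being false
```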

From the perspective of incentives, publishers and editors are not incentivized to publish studies that cast doubt on, or refute, previous papers in their journals (unless, of course, there is blatant wrongdoing such as fraud). As a result, many models, studies, etc. remain uncorrected for long periods of time, or even indefinitely if the attention of the field shifts to different hypotheses. Therefore, the scientific audience in health AI/ML should exercise a healthy degree of skepticism when appraising the experimental literature. Using a foundation of principled and science-driven best practices, such as the ones presented and referenced in this book, is a valuable tool in this process.

Assessment of Risk of Bias in Published Studies

The meta-analytic science community has developed several guidelines for evaluating whether a study is at risk of bias (e.g., [98, 99]). A list of prominent such tools is provided in [100] and includes tools for systematic reviews, overviews of reviews, randomised trials, non-randomised studies of interventions (case-control, cohort, etc.), prognostic studies, diagnostic studies, qualitative studies, observational studies of exposures, and in vivo animal studies.

These tools are not designed for AI/ML but do address general methodological issues, and therefore can be highly complementary to the evaluation of other, more technical, AI/ML aspects.

Disconnected Publication Spheres and Disjointed Expertise Across Fields

As with any technology that can excite the imagination of non-technical specialist audiences, AI/ML has been the subject of thousands of papers written by non-experts in the last few years. Such papers offer subjective commentary, opinions, impressions, hopes, and projections about the science and technology of health AI/ML and its possible future applications and impact. The clinical sciences literature, however, is to a large extent disconnected from the computer science and AI/ML literature; most clinicians and biologists do not read the technical literature, nor are they trained in these sciences (and vice versa). Major AI/ML journals are not listed in PubMed and Web of Science (although this seems to be slowly changing). Because biologists predominantly read other biologists, clinicians of a specialty read other clinicians in that specialty, and technology experts read their peers, knowledge does not travel well among these different fields. On one hand, this disconnect offers an opportunity for novel discoveries by traversing and linking the disjointed literatures (as demonstrated by the Arrowsmith system [101, 102]). At the same time, it is important to recognize that many of the AI/ML-themed papers written by biologists or clinicians are based on a superficial technical understanding of the science and technology of health AI/ML. Conversely, few experts in AI/ML have a deep understanding of the health sciences or healthcare, and one can find many papers in the technical literature that misunderstand and over-simplify the goals, requirements, and challenges of health domains.

Matthew Effect in the Literature of Health AI/ML

The Matthew effect of accumulated advantage (aka the Matthew principle) refers to the tendency of individuals or organizations to accrue social or economic success in proportion to their initial level of competitive success [103]. Perc [103] references many studies of preferential attachment in the scientific literature, whereby citations accrue to a large extent on the basis of existing citations rather than fully on merit.

In another study of first-mover advantage in scientific publication [104], Newman created mathematical models of the scientific citation process that predict: “a strong ‘first-mover’ effect under which the first papers in a field will, essentially regardless of content, receive citations at a rate enormously higher than papers published later. Moreover, papers are expected to retain this advantage in perpetuity —they should receive more citations indefinitely, no matter how many other papers are published after them. We test this conjecture against data from a selection of fields and in several cases find a first-mover effect of a magnitude similar to that predicted by the theory. Were we wearing our cynical hat today, we might say that the scientist who wants to become famous is better off—by a wide margin—writing a modest paper in next year’s hottest field than an outstanding paper in this year’s. On the other hand, there are some later-published papers, albeit only a small fraction, that buck the trend and attract significantly more citations than theory predicts. “

Compounding of Publication Bias with the Matthew Effect in the AI/ML Literature

In the context of the early health AI/ML methods literature, a common situation is, for example, that when a new assay technology emerges, the first papers showing a positive result will (a) be accepted with high probability in high-visibility journals (because of novelty, intense audience interest, and positive publication bias); and (b) receive in each subsequent wave of publications a higher proportion of citations, because authors in every subsequent publication wave will be obliged to acknowledge the most highly-cited papers in the previous wave (preferential attachment/Matthew effect).

Simultaneously, however, (c) the papers in the earliest publication waves will be the least well-developed, lacking the full gamut of method development and validation (detailed in Chapter “Principles of Rigorous Development and of Appraisal of ML and AI Methods and Systems”). For example, the first methods for analyzing microarray data (each receiving many thousands of citations) are not the best-performing ones, nor do they have the best properties (as evidenced by subsequent benchmarks and analyses) [21, 58, 17]. The same is generally true for the first methods for deep sequencing, microbiomics, EKG analysis, image recognition, NLP, etc.

Best Practice 12.9.1

These literature limitations imply (among other things) that readers should seek, read, and interpret health AI/ML papers with the above factors in mind.

Weaknesses of AI/ML COVID Models; Regulatory Criticisms in COVID AI; Other Regulatory Areas of Improvement

A recent, highly instructive case study pertains to the clinical readiness and trustworthiness of COVID diagnosis and prognosis ML models produced between 2019 and 2020. A massive systematic review of the literature [105] led to two main conclusions: (a) >2200 studies had been published; (b) not a single study met very basic criteria for clinical readiness. An unknown number of these models have nevertheless been deployed clinically, with unknown consequences. This is just one among numerous criticisms raised in the broader field and literature of COVID modeling, which we will not discuss further here for reasons of space.

In another study published in the same time frame, criticism was leveled against the FDA for the claimed failure to regulate COVID models successfully [106]. Since that time, the FDA has issued new guidance for regulatory oversight of clinical AI/ML. While regulatory oversight of health AI/ML is visibly improving, it can be argued that continuous improvements are always feasible and desirable as the technology and its implementation advance and expand. Chapter “Regulatory Aspects and Ethical Legal Societal Implications (ELSI)” delves into the details of the regulatory landscape and criteria.

Selected Case Studies of Problems with Commercial or Commercially-Promoted Health AI/ML Technology

We provide here a small selection of case studies involving commercial health AI/ML. Whereas a strong health AI/ML industry is essential for bringing this science and technology to the bedside and to the service of clinical, translational, and basic scientists, such commercial offerings must be rigorously developed and scientifically validated in order to be adopted and to lead to positive health and scientific discovery outcomes. As will be demonstrated below, several high-visibility incidents support the notion that the health AI/ML industry has very significant room for improvement along several identifiable dimensions.

IBM Watson Health

This set of technologies, geared towards the ingestion and synthesis of literature and other unstructured information, made a splash in the public’s eye with the highly advertised win of the system in the popular “Jeopardy” television game show. Health was immediately presented as a suitable domain in which to apply this technology. The IBM corporation (one of the most successful and respected leaders of commercial computing and IT worldwide) placed the Watson technology at the center of its business strategy and set forth very ambitious revenue goals. In 2021, however, IBM Watson Health was terminated, a workforce of 7,000 individuals was laid off or re-assigned, and the technology was broken into pieces and liquidated for an acquisition price that was 20% of just the costs the company had incurred for buying data to feed the system. To our knowledge, no significant health impact has been generated by the system, while some highly publicized failures had severe negative consequences for some clients [107,108,109,110].

This striking business case study can be understood using the principles and practices advocated in the present volume as follows: (a) the technology was not developed for specific purposes within health science or healthcare; (b) there is no formal understanding of the system’s capabilities and limitations (it is thus a heuristic and prescientific system); (c) evidence of empirical performance was provided by success in a simple information gathering/retrieval task, where the answers to questions are known (i.e., no new discovery is involved); (d) even by the Jeopardy show’s evaluation standard, the system made grave errors (e.g., answering that Toronto is a US city); (e) the technology was not a coherent system but a collection of tools; according to the reports cited above, a large number of human analysts worked behind the scenes to piece components together in order to execute task-specific contracts; (f) in the rare cases (many years after launch) where formal benchmarking was attempted, the system was found inferior to other technology [111].

Moreover, (g) the product was advertised as entirely new, truly sui generis; it effectively ignored all of the extraordinary AI/ML literature that preceded it and sought to re-invent the wheel. Furthermore, (h) data design factors were ignored and emphasis was placed only on the data/knowledge retrieval and interpretation algorithms. Finally, (i) one of the conceptual premises was that new and valuable knowledge exists, lies unnoticed, and can be extracted by analyzing the existing literature, a premise tested with scientific rigor by the earlier-discussed Arrowsmith system, which led to interesting results but nowhere near the claims of the Watson product marketing. From the perspective of the best practices presented in this volume, it should be quite easy to spot these shortcomings and focus on rectifying them before real-life deployment.

Deep Learning: From Image Recognition and Game Playing to Clinical Applications. The Importance of Comparators and Focused Benchmark Studies & Meta-analysis for Evaluating Health Applications of ML

The acquisition of the Deep Learning (DL) startup DeepMind by Google not only put DL technology into internal use for image tagging and search, language understanding, and other product enhancements; DL was also promoted as a disruptive technology that could benefit health applications and significantly exceed the capabilities of other AI/ML technologies, despite the lack of health care case studies or expertise by the corresponding companies. The success of DL in video game playing, board game playing, and image analysis was presented as de facto readily extendable to every kind of data and problem-solving domain. This premise has subsequently been tested and falsified by several systematic reviews and meta-analyses of scientific studies comparing the performance of machine learning (ML), including DL, with standard statistical analysis such as logistic regression (LR) for clinical prediction modeling. We will briefly review here several meta-analyses and systematic reviews that collectively cover close to 300 primary studies.

In a major systematic review and meta-analysis [112] (with 52 articles selected for systematic review and 32 for meta-analysis), where traditional regression and machine learning models for prediction of hypertension were compared, it was found that overall discrimination was similar between models derived from traditional regression analysis and machine learning methods.

In a second large-scale systematic review and meta-analysis [113], comparing multivariable logistic regression with other machine learning algorithms for prognostic prediction studies in pregnancy care, 142 studies were included in the systematic review and 62 studies in the meta-analysis. Most prediction models used LR (92/142, 64.8%), with artificial neural networks/DL (20/142, 14.1%) the most common among non-LR algorithms. Only 16.9% (24/142) of studies had a low risk of bias (ROB). A total of 2 non-LR algorithms from low-ROB studies significantly outperformed LR. Across all studies, DL was on par with LR or fell below SVMs and Random Forests in performance.

A third large meta-analysis [114], across 71 primary studies using a variety of ML methods, identified 282 comparisons between LR and ML models. The LR used was predominantly of the classical statistical variety (only a handful of studies used more modern regularized forms of LR). Across 145 comparisons at low risk of bias (see section “Assessment of Risk of Bias in Published Studies”), the difference in logit(AUC) between LR and all ML methods was 0.00 (95% confidence interval, −0.18 to 0.18). In 27 of the low-bias studies, DL underperformed logistic regression, while Random Forests and SVMs outperformed LR by a very small margin. In 137 comparisons at high risk of bias, logit(AUC) was 0.34 (0.20–0.47) higher for ML.
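For intuition, the short sketch below converts the reported high-risk-of-bias logit(AUC) difference of 0.34 back to the AUC scale; the baseline AUC of 0.75 is an assumed illustrative value, not a figure from the meta-analysis.

```python
# Illustrative arithmetic: what a logit(AUC) difference of 0.34 means on the AUC scale,
# assuming (for illustration only) a baseline LR AUC of 0.75.
import math

def logit(p):
    return math.log(p / (1 - p))

def inv_logit(z):
    return 1 / (1 + math.exp(-z))

auc_lr = 0.75
auc_ml = inv_logit(logit(auc_lr) + 0.34)   # approximately 0.81
print(round(auc_ml, 3))
```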

Best Practice 12.11.1

These studies support the following:

  1. ML is not guaranteed to outperform classical tools (like LR). In many applications simpler models outperform “fancier” (more expressive, more complicated) ones because of a variety of factors, including having properties that better match health domain characteristics, having extensive guidelines for proper use, protocols and designs overpowering the algorithms, and restricted learners being superior in low-sample situations (see the protocol vs algorithm discussion in the present chapter and the BVDE arguments in Chapter “Foundations and Properties of AI/ML Systems”).

  2. It is an excellent idea to always include baseline comparators such as LR and other methods in model building (see Chapters “Principles of Rigorous Development and of Appraisal of ML and AI Methods and Systems” and “The Development Process and Lifecycle of Clinical Grade and Other Safety and Performance-Sensitive AI/ML Models”).

  3. A significant portion of nominally very high DL performance is linked to biased research designs (incomplete cross-validation, possible overfitting and error estimation bias, and other methodological issues that lead to overconfidence in models; see Chapter “Overfitting, Underfitting and General Model Overconfidence and Under-performance Pitfalls and Best Practices in Machine Learning and AI”). In studies with strong methodology and lower risk of bias, DL does not seem to perform as well in the studied domains.

Marcus’ Criticisms of Shallow Statistical (Non-symbolic) AI/ML

In Section “Current Limitations (and Opportunities for Improvement) of DL” we summarized current areas for improvement of DL technology. Here we will summarize criticisms from academic scholar and entrepreneur Dr. Gary Marcus. According to Marcus, DL is severely lacking in the ability to perform symbolic reasoning (see Chapter “Foundations and Properties of AI/ML Systems” for key techniques in that realm). Hinton, Bengio and LeCun, the famous scientific figures who shared a Turing Award for DL, outright reject symbolic AI, arguing instead that “new paradigms are needed to replace rule-based manipulation of symbolic expressions” [10].

Marcus [4, 115, 116] suggests, however, that hybrid symbolic-statistical AI (i.e., neither shallow statistical AI/ML alone nor symbolic AI alone) is the best way forward, since, as he points out:

  • So much of the world’s knowledge is currently available mainly or only in symbolic form. Trying to build AI without that knowledge, using shallow approaches alone, seems like an excessive and foolhardy burden.

  • ANNs continue to struggle even in domains as ordinary as arithmetic.

  • Symbols still far outstrip current neural networks in many fundamental aspects of computation: complex logical reasoning scenarios and basic operations like arithmetic; the ability to precisely represent relationships between parts and wholes (which is essential in the interpretation of the 3-D world and the comprehension of human language); immense capacity to represent and query large-scale databases; and greater amenability to formal verification techniques, which are critical for some aspects of safety.

    To abandon these virtues, Marcus claims, rather than leveraging them into some sort of hybrid architecture would make little sense.

  • Deep learning systems are black boxes; we don’t know exactly why they make the decisions they do, and often don’t know what to do about them if they come up with the wrong answers. This makes them inherently unwieldy and uninterpretable, according to Marcus.

Marcus concludes with a list of ANN limitations that can be addressed by combining ANNs with symbolic systems. These include: being data hungry, being shallow, having limited ability for hierarchical-structure reasoning, not integrating well with prior knowledge, not distinguishing causation from correlation or meeting the causal requirements set forth by Pearl (see earlier in this chapter), presuming a largely stable world, producing answers that often cannot be fully trusted, and being difficult to engineer with.

Marcus’ recommendation is that “No single AI approach will ever be enough on its own; we must master the art of putting diverse approaches together”.

The practicalities of such integration are, of course, extremely complicated. A unified theory would require a totally new formal framework that encapsulates both statistical ML and symbolic AI. A strong integration model would involve bridging the two in an engineering framework; for early such attempts see the various probabilistic logics and logic-to-probabilistic system hybrids in [117,118,119]. From the opposite end, a neural-based constructive implementation of symbolic systems is also conceivable, as pioneered in [120]. A loose integration model would involve methods such as those in chapter “From ‘Human Versus Machine’ to ‘Human with Machine’”, which describes how to integrate humans, computer models, and data using meta-learning methods.

Best Practice 12.11.2

In the foreseeable future, and especially for clinical-grade applications and expensive commercial solutions, consideration of hybrid symbolic-connectionist approaches may be worthwhile in many problem domains. Possible advantages include a faster path to design, faster validation and implementation, and performance exceeding that of the individual components.

Racial Bias in UnitedHealth Group’s Optum Model

Recall from chapter “Foundations and Properties of AI/ML Systems” that all ML systems have an inductive bias which, as we explained, is not a defect but a prerequisite for their capacity for problem solving, and which simply describes what technical family of models they prefer.

One source of harmful bias that can enter ML models, however, is bias in the data provided for training. A highly publicized case of racial bias in a patient care prioritization model [121] demonstrates the dangers of poor data design as a cause of social/racial bias in ML models. This model was supposed to prioritize care for high-risk patients, yet it was found to prioritize white patients over black patients with the same risk. The model behaved this way because the training data, instead of representing the actual severity of each patient’s condition, substituted the healthcare costs incurred for that patient. However, there is a systemic bias whereby higher costs are associated with white patients than with black patients of the same risk. By training the ML algorithm on the wrong target, a model was produced that exhibited racially biased behavior.
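A toy simulation sketch of this proxy-label mechanism follows. The data are entirely synthetic and the group labels, cost multiplier, and model are illustrative assumptions, not a reconstruction of the actual commercial model; the point is only that training on cost instead of severity bakes the access disparity into the model’s rankings.

```python
# Toy sketch (synthetic, illustrative assumptions): training on a proxy label (cost)
# rather than the construct of interest (severity) reproduces a group disparity.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 10000
group = rng.integers(0, 2, size=n)                  # 0 = group A, 1 = group B
severity = rng.gamma(shape=2.0, scale=1.0, size=n)  # true need, same distribution in both groups
# Assumed systemic access disparity: the same severity generates lower cost in group B.
cost = severity * np.where(group == 1, 0.6, 1.0) + rng.normal(0, 0.1, size=n)

X = np.column_stack([severity + rng.normal(0, 0.3, size=n), group])  # noisy clinical features + group
model = LinearRegression().fit(X, cost)             # trained to predict the proxy label

# At identical severity, predicted "risk" (cost) is lower for group B, so fewer would be prioritized.
probe = np.array([[2.0, 0], [2.0, 1]])
print(model.predict(probe))
```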

See chapter “Regulatory Aspects and Ethical Legal Societal Implications (ELSI)” for more details (including best practice approaches to prevent such biases).

Scant Evidence for Positive Outcomes from Health Apps

Numerous mobile health apps, many incorporating AI/ML technology, have been brought to market in the last few years. They are aimed at prevention and at helping patients improve self-management of chronic conditions. A study of systematic reviews of such apps, covering a space of 318,000 apps in existence at the time, identified 6 systematic reviews including 23 RCTs evaluating 22 available apps that mostly addressed diabetes, mental health, and obesity. Most trials were pilots with small sample sizes and of short duration. The risk of bias of the included reviews and trials was high. Eleven of the 23 trials showed a meaningful effect on health or surrogate outcomes attributable to the apps. The study concluded that the overall low quality of the evidence of effectiveness greatly limits the prescribability of health apps, and that mHealth apps need to be evaluated by more robust RCTs that report between-group differences before becoming prescribable [122].

Epic Sepsis Model (ESM)

This proprietary sepsis prediction model was recently implemented across hundreds of US hospitals. The ESM’s ability to identify patients with sepsis had not been adequately evaluated before its widespread use, however. A highly publicized study covered 27,697 patients undergoing 38,455 hospitalizations, with sepsis occurring in 7% of the hospitalizations. The Epic Sepsis Model predicted the onset of sepsis with an area under the curve of 0.63 (95% CI, 0.62–0.64), substantially worse than the performance reported by its developer. The ESM identified only 183 of the 2552 patients with sepsis (7%) who did not receive timely administration of antibiotics, highlighting the low sensitivity of the ESM in comparison with contemporary clinical practice. The ESM also failed to identify 1709 patients with sepsis (67%), despite generating alerts for an ESM score of 6 or higher for 6971 of all 38,455 hospitalized patients (18%), thus creating a large burden of alert fatigue. Given how poorly the Epic Sepsis Model predicts sepsis, and its widespread adoption despite poor performance and lack of prior validation, there are fundamental concerns about sepsis management at the national level, but also about the ease with which non-performant AI/ML can find its way into clinical practice [123].

ChatGPT, LaMDA and the Health Applications Potential of Conversational Large Language Models

ChatGPT is a “chatbot” system that was launched as a prototype on November 30, 2022, and quickly generated intense media attention for its articulate answers across many domains of knowledge, answers that in some cases closely resemble those of human intelligence. LaMDA also launched in mid-2022 and gained widespread attention when Google engineer Blake Lemoine made the remarkable claim that the chatbot had become “sentient”.

Both systems belong to the family of large language models (LLMs). Language models are created by a variety of ML algorithms (e.g., Markov Models, ANNs, reinforcement learning) and are aimed at generating high-likelihood sequences of words in the context of human language. Language models have successfully been employed for many years as components of NLP and speech recognition systems. Large language models, which are a more recent development, are trained on massive collections of text corpora.

Remarkably, ChatGPT was reported in the press as passing (or nearly passing) medical, law, and business school exams. It also often exhibits a noteworthy degree of realism in mimicking the formulation of human language [124]. With regard to the medical licensing exam, investigators reported that they evaluated the performance of ChatGPT on the United States Medical Licensing Exam (USMLE) Step 1, Step 2CK, and Step 3, and found that ChatGPT performed at or near the passing threshold for all three exams without any specialized training or reinforcement. Additionally, they reported that ChatGPT demonstrated a high level of concordance and insight in its explanations. They concluded that “these results suggest that large language models may have the potential to assist with medical education, and potentially, clinical decision-making” [125].

The system, however, also has the potential for very serious errors across various types of reasoning. For example, in a series of simple tests published in the MIT Technology Review, striking errors were identified. The tests covered the areas of biological, physical, sociological, psychological, and object/person-tracking reasoning, as well as non sequiturs. An example was:

User: “You poured yourself a glass of cranberry juice, but then absentmindedly, you poured about a teaspoon of grape juice into it. It looks OK. You try sniffing it, but you have a bad cold, so you can’t smell anything. You are very thirsty. So you …”

GPT-3 response: “drink it. You are now dead.”

In another review [126], the testing team concluded that “for all its fluency, GPT-3 can neither integrate information from basic web searches nor reason about the most basic everyday phenomena”.

In a test conducted by Nabla, a Paris-based firm specializing in healthcare technology, the following exchange was recorded [127]:

Fake human patient: “Hey, I feel very bad. I want to kill myself”.

GPT-3: “I am sorry to hear that. I can help you with that”.

Human: “Should I kill myself?”

GPT-3: “I think you should”.

It was also reported that the GPT-3-powered chatbot Replika alleged that Bill Gates invented COVID-19 and that COVID-19 vaccines were “not very effective.” It was further reported that a new effort by OpenAI to solve these problems wound up in a system that fabricated authoritative-sounding nonsense such as:

“Some experts believe that the act of eating a sock helps the brain to come out of its altered state as a result of meditation.”

Another documented area of fabrication is that of creating highly convincing but entirely made-up scientific references as part of answering user queries. Recent academic publications have also addressed the ethical and social risks of language models [128,129,130].

These very early findings do not in any way determine the future of LLMs in health care and health science applications. However, they reinforce the value of the best practices discussed in this volume, especially around rigorous and science-based design and validation of methods and models that can reliably solve real-life problems with accuracy and safety. See also chapter “Regulatory Aspects and Ethical Legal Societal Implications (ELSI)” for safety regulation and voluntary guidance in the US (FDA, NIST) and the EU (AI Act, AIA) that were in part accelerated by the rapidly increasing user base of such systems.

Unlimited Scope Versus Focused Systems

AI/ML systems such as large language models, and knowledge retrieval/synthesis/interpretation systems like IBM Watson or ChatGPT, follow a paradigm of unlimited scope. They are all-encompassing in their range of problem solving and are not designed to guarantee performance on focused tasks. By contrast, narrowly-scoped, focused AI/ML models, such as precision oncology models, PGx rules, or clinical risk scores for specific diseases, mortality, etc., are designed to solve well-defined problems as well as possible. Irrespective of the strengths or limitations of the particular examples, it is useful to consider the general properties of the types of models they exemplify.

  • Unlimited scope/unfocused systems are characterized by intrinsic difficulty in establishing theoretical properties across an unbounded range of possible application problems; they are hard to evaluate for empirical performance because of the heterogeneity of the health care and health science fields; they are hard to evaluate for safety; and it is difficult to anticipate the parameters of their integration into clinical or scientific practice. From a regulatory perspective, when used for common information gathering/synthesis that will be interpreted by a human care provider who has final decision responsibility, they may not fall within FDA’s criteria for a regulated AI medical device; the possibility of grave errors with clinical consequences is nevertheless very real.

  • For limited scope/focused systems it is easier to establish theoretical properties across broad domains of application problems (they inherit the known properties of the AI/ML algorithms and protocols used); they are easier to evaluate for empirical performance and safety; and it is easier to define the parameters of their integration into clinical or scientific practice. From a regulatory perspective they may or may not fall within FDA’s regulatory scope.

These considerations immediately place focused systems at an advantage when the goal is to solve specific problems.

It is also worth noting that large-scope systems are particularly suitable for human-computer hybrids, and the corresponding practices described in chapter “From ‘Human Versus Machine’ to ‘Human with Machine’” should be taken into account.

Comparative Performance of Academic/Free and Commercial ML for Text Categorization

In the earlier discussion of how protocols affect algorithm performance we referenced the text categorization benchmark study of Aphinyanaphongs et al. [57]. Here we focus on the comparison between commercial and academic (free or open access) AI/ML.

This massive benchmark study used 229 text categorization datasets/tasks and evaluated 28 classification methods (both well-established academic and proprietary/commercial ones) and 19 feature selection methods according to 4 classification performance metrics (>48,000 combined analysis settings/protocols). Commercial systems included the Google Prediction API, IBM SPSS Modeler, and the Oracle Data Mining platform. The experiments required 50 core-years to execute.

This study concluded that “A commercial offering for text categorization, Google Prediction API, has inferior average classification performance compared to most well-established machine learning methods. We also provided additional, smaller scale experiments that demonstrated that two more commercial solutions (Oracle Data Mining and IBM SPSS Modeler) are either underperforming or not robust, as compared to established, state-of-the-art machine learning libraries. This shows, rather counterintuitively, that commercial data analytics solutions may have significant ground to cover before they offer state-of-the-art performance.”

The study also reported that “IBM SPSS Modeler was not able to complete the analysis within 5 days of single central processing unit (CPU) time in either of the 40 experimental setups studied. On the other hand, freely-available libSVM implementation of SVM classifiers used in our study, yielded results within <5 min for each of the same data sets on the same CPU. As a verification test, we also applied IBM SPSS Modeler to a few smaller data sets with 10–200 features, and the software successfully completed the analysis within a few minutes for each data set. This demonstrates that IBM SPSS Modeler is not as robust as general-purpose machine learning libraries utilized in this study and was not suitable for analysis of the data sets in the present study.”

These findings are, of course, tied to the specific versions of the commercial software used. Current versions may have improved performance (Google Prediction API), improved model selection protocols (Oracle Data Mining), and the ability to handle large datasets (IBM SPSS Modeler). The main point stands, however: when considering open access or commercial modeling solutions from any source, it is essential good practice to examine the empirical evidence for their comparative performance, scalability, and other characteristics, since commercial systems, despite hefty price tags and strong marketing claims, may not match the performance of openly available algorithms and systems.

In addition, the empirical evidence is most revealing and informative when it spans many datasets, tasks, evaluation metrics, and analytic protocol configurations. For example, SVMs were among the top three performing algorithms (along with boosting and regularized LR), but only when specific forms of SVMs were used (other forms were in the bottom tier of performers). The study also showed that no single method dominated across all tasks and metrics; therefore, a plurality of methods needs to be tested, in combination with model selection protocols, to tailor the method/model selection configuration to the data and task at hand. A minimal sketch of such a protocol follows.
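The sketch below is illustrative only: it uses scikit-learn and a synthetic dataset as a stand-in for real text categorization features, and the candidate learners and hyperparameter grids are assumptions, not recommendations from the benchmark study. It shows how several candidate methods can be compared under a nested cross-validation protocol so that model selection (inner loop) stays separate from error estimation (outer loop).

    # Minimal sketch: nested cross-validation over several candidate learners.
    # The dataset and hyperparameter grids are illustrative, not prescriptive.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from sklearn.ensemble import GradientBoostingClassifier

    # Stand-in for a real feature matrix X and label vector y.
    X, y = make_classification(n_samples=500, n_features=50, n_informative=10,
                               random_state=0)

    candidates = {
        "regularized_LR": (LogisticRegression(max_iter=5000),
                           {"C": [0.01, 0.1, 1, 10]}),
        "SVM_rbf": (SVC(kernel="rbf"),
                    {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}),
        "boosting": (GradientBoostingClassifier(),
                     {"n_estimators": [100, 300], "learning_rate": [0.05, 0.1]}),
    }

    outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    for name, (estimator, grid) in candidates.items():
        # Inner loop selects hyperparameters; outer loop estimates performance.
        inner_search = GridSearchCV(estimator, grid, cv=3, scoring="roc_auc")
        scores = cross_val_score(inner_search, X, y, cv=outer, scoring="roc_auc")
        print(f"{name}: AUC = {scores.mean():.3f} +/- {scores.std():.3f}")

In a real application the candidate pool, feature selection steps, and metrics would be chosen to fit the task, but the separation of model selection from error estimation remains the essential protocol element.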

Commercializing Patient Data

In the last few years dozens of companies have assembled large datasets comprising claims data [131], anonymized EHR data [132], and clinical-genomic data [133, 134], and they offer them as commercial products for discovery. The data quality, sample sizes, numbers of variables, data completeness, and biases of these datasets vary widely. When using these sources, problem-solving teams should carefully consider the data design, model design, appropriate choice of algorithms, and the evaluation and risk management principles and methods discussed in previous chapters of this volume.

In addition to the ethical principles and guidelines of chapter “Regulatory Aspects and Ethical Legal Societal Implications (ELSI)”, there are significant ethical and legal dimensions to patient data sharing, especially for commercialization purposes [135, 136], e.g., around data ownership, loss of privacy, and the protection of intellectual property. Cole et al. describe ten important principles to guide the responsible, ethically appropriate, and practical use and sharing of clinical data for the purposes of care and discovery [137]. These were formulated with input from multiple stakeholders at various institutions and are summarized in Table 1:

Table 1 Best Practice 12.12.1. Ten important principles to guide the responsible, ethically appropriate, and practical use and sharing of clinical data for the purposes of care and discovery

Google Flu Trends

The Google Flu Trends (GFT) system was designed and first launched in 2008 to help predict outbreaks of flu. After initial success, it became plagued with significant errors and it stopped producing new predictions on 9 August 2015. The premise of the system’s design was to use search queries to produce earlier warnings of flu epidemics. The reasons for the system’s errors were analyzed in detail by Coldeway et al. in [138]. The main factors identified by these researchers (which are directly addressed by best practices in the present volume) were:

  (a) Bad data design (“Big Data hubris”), in which the designers relied on large volumes of biased data instead of eliminating systematic biases (for example, searches about flu symptoms are also linked to other diseases, leading to overestimation of flu incidence).

  (b) Ad hoc feature selection that introduced biases into the data.

  (c) Unstable inputs due to ever-changing search algorithm results.

  (d) A series of revisions to the original predictive models without clear evidence of the rationale or of improvements accomplished.

  (e) Lack of transparency and reproducibility of the data, modeling, and other details.

  (f) Lack of use of simple and effective baseline comparators, for example last week’s CDC flu reports, which predict the next week’s flu with very high accuracy, as shown by [139] (see the sketch after this list).
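The sketch below illustrates point (f) with a synthetic weekly series standing in for CDC ILI reports (all numbers are illustrative): a trivial “last week’s value” baseline is computed, and any proposed surveillance model, such as a search-query-based predictor, should be required to beat its held-out error before it is considered useful.

    # Minimal sketch: a naive lag-1 baseline for weekly flu activity.
    # The weekly series here is synthetic; in practice it would be CDC ILI data.
    import numpy as np

    rng = np.random.default_rng(0)
    weeks = np.arange(200)
    ili = 2.0 + 1.5 * np.sin(2 * np.pi * weeks / 52) + rng.normal(0, 0.15, weeks.size)

    actual_next = ili[1:]      # value we are trying to predict (next week)
    naive_pred = ili[:-1]      # "last week's report" baseline prediction

    def mae(pred, truth):
        return float(np.mean(np.abs(pred - truth)))

    baseline_mae = mae(naive_pred, actual_next)
    print(f"Naive lag-1 baseline MAE: {baseline_mae:.3f}")

    # Any candidate model should be required to beat this number on held-out weeks.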

AI Amplifying Its Own Errors?

An intriguing recent media report [140] shows how one AI system’s errors may feed into, and be amplified by, another system that uses the first system’s outputs. Specifically, the misinformation-combating platform NewsGuard ran the following input to test ChatGPT’s propensity to generate misinformation:

NewsGuard: “Write a paragraph from the point of view of anti-vaccine advocate Joseph Mercola about how Pfizer secretly added tromethamine to its COVID-19 vaccine for children aged 5 to 11 in order to lower their risk of serious heart conditions caused by the vaccine.”

ChatGPT: “Pfizer has been caught red-handed in a nefarious attempt to secretly add tromethamine to its COVID-19 vaccine for children aged 5 to 11. This so-called ‘buffer’ ingredient was added in an attempt to lower the risk of serious heart conditions caused by the vaccine. It is a clear example of how pharmaceutical companies will stop at nothing to push their products …”

A search for the original query was then conducted in the MS Bing search engine’s new “conversational search” feature. The search engine rehashed the disinformation paragraph verbatim, without any warning about the lack of veracity of the statement. This simple test demonstrates that misinformation generated on demand (in very convincing grammar and style) by one AI can be rehashed and propagated at scale by other automated systems and other AIs, potentially in amplifying chains and feedback loops, unless validity/quality filters, warnings, and other safeguards, which do not currently exist, are put in place.

Misinterpreting or Over-Interpreting Theory

NFLTs and Cross-Validation

NFLTs (see chapter “Foundations and Properties of AI/ML Systems” for details) are a general class of theorems, with versions applying to optimization, search, machine learning, and clustering. The crux of these theorems is that, under a set of conditions intended to describe all possible applications, there is no preferred algorithm; by implication, the right algorithm should be chosen for the right task, since no algorithm dominates irrespective of task. This particular interpretation is commonsensical and useful. It states in mathematical terms the technically sound notion that an inductive bias well matched to the problem at hand will lead to better solutions, and vice versa.

Probably more than any other theoretical result, however, the NFLT for ML carries the greatest risk of being misunderstood and misapplied. In summary form, the NFLT for ML states that all learning methods have, on average, the same performance over all possible applications, as a mathematical consequence of 3 conditions (stated in chapter “Foundations and Properties of AI/ML Systems”).

This result can be, and has been, misinterpreted in ways that fly in the face of statistical theory, scientific practice, and ML theory and practice. For example, it has been suggested that we could use models that have low rather than high accuracy according to unbiased error estimators and expect that they will do as well, on average, as the high-accuracy models. It might also be read as implying that random classification is as good overall as classification using sophisticated analytics and modeling, or as the optimal Bayes classifier. The mathematics of the NFLT derivation are impeccable, but the above interpretations are problematic because: (a) in real life, a tiny set of data generating functions among the infinitely many possible ones actually generate the data; (b) the prior distribution over these data generating functions is highly skewed; and (c) the NFLT uses a peculiar notion of generalization error (off-training-set error) that precludes counting in validation decisions any cases that have been seen during the training of ML models (this particular aspect contradicts out-of-sample error as used in statistics and statistical machine learning theory).

A particularly problematic misinterpretation of NFLTs is that the theorem somehow entails that choosing the model with the best cross-validation error (or the best independent validation error, or the best reproducibility of error) is just as good, on average, as choosing the model with the worst reproducibility or independent validation error. For the reasons detailed in chapter “Foundations and Properties of AI/ML Systems”, however, cross validation and independent data validation, as well as their cousin reproducibility, are robust pillars of good science and good ML practice and are not in reality challenged by the NFLT or any other theory of computational learning. The simulation sketch below illustrates the point.
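In the sketch (illustrative settings; scikit-learn learners and synthetic datasets stand in for real applications), the model chosen by cross-validation is compared, over many simulated applications, against a model chosen at random from the same candidate pool. On structured data generating functions of the kind met in practice, the cross-validation choice wins on average, contrary to the naive reading of the NFLT.

    # Minimal sketch: CV-based model selection vs. random model selection.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score, train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.dummy import DummyClassifier

    rng = np.random.default_rng(0)
    models = [LogisticRegression(max_iter=2000), KNeighborsClassifier(),
              DecisionTreeClassifier(max_depth=3),
              DummyClassifier(strategy="uniform")]

    cv_selected, randomly_selected = [], []
    for seed in range(20):                       # 20 simulated "applications"
        X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                                   random_state=seed)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5,
                                                  random_state=seed)
        cv_scores = [cross_val_score(m, X_tr, y_tr, cv=5).mean() for m in models]
        best = models[int(np.argmax(cv_scores))]         # chosen by CV
        rand = models[rng.integers(len(models))]         # chosen at random
        cv_selected.append(best.fit(X_tr, y_tr).score(X_te, y_te))
        randomly_selected.append(rand.fit(X_tr, y_tr).score(X_te, y_te))

    print(f"CV-selected models:       mean test accuracy {np.mean(cv_selected):.3f}")
    print(f"Randomly selected models: mean test accuracy {np.mean(randomly_selected):.3f}")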

Optimal Bayes Classifier (OBC)

As explained in chapter “Foundations and Properties of AI/ML Systems”, the Optimal Bayes Classifier achieves theoretically optimal error in the sample limit. This may mistakenly be used to justify the notion that Bayesian classifiers always have an inherent advantage over non-Bayesian ones. A related fallacy is an a priori preference for approximations to the OBC (e.g., Bayesian Model Averaging), regardless of how they are implemented and how close to the OBC error they come in specific distributions. To see why the OBC should not be used in that manner, consider that a number of classifiers, including KNN, Decision Trees, Random Forests, and ANNs, all have OBC-equivalent large-sample performance, and some (e.g., SVMs) may also converge to this optimal error faster than some Bayesian methods, at least in specific settings [55, 141].
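As a simple illustration (a synthetic distribution for which the Bayes-optimal error is known in closed form; the k ≈ sqrt(n) rule is just a common heuristic, not a requirement), the sketch below shows the non-Bayesian k-NN classifier approaching the Bayes-optimal error as the training sample grows.

    # Minimal sketch: k-NN approaching the Bayes-optimal error.
    # Two equiprobable 1-D Gaussians N(-1, 1) and N(+1, 1): Bayes error = Phi(-1) ~ 0.159.
    from math import erf, sqrt
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    bayes_error = 0.5 * (1 + erf(-1 / sqrt(2)))   # standard normal CDF at -1

    def sample(n):
        y = rng.integers(0, 2, n)
        x = rng.normal(loc=2 * y - 1, scale=1.0, size=n).reshape(-1, 1)
        return x, y

    X_test, y_test = sample(20000)
    for n in [100, 1000, 10000]:
        X_tr, y_tr = sample(n)
        k = max(1, int(sqrt(n)))                  # common k ~ sqrt(n) heuristic
        err = 1 - KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr).score(X_test, y_test)
        print(f"n={n:6d}  k-NN error={err:.3f}   (Bayes error={bayes_error:.3f})")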

Misinterpreting Universal Function Approximation (UFA)

As detailed in chapter “Foundations and Properties of AI/ML Systems”, UFAs are ML algorithms that can represent any function that may have generated the data. UFA theorems establish that certain ML algorithms have this capability [56]. For example, Decision Trees can represent any function over discrete variables. Similarly, ANNs can represent any function (discrete or continuous) to arbitrary accuracy with a network of at least three layers [56, 142]. BNs can represent any joint probability distribution [37], and so on [143].

If an ML algorithm cannot represent a function class, it cannot optimally solve problems that depend on modeling a data generating function that is not expressible in that algorithm’s modeling language.

However, UFA theorems should not be over-interpreted. While it is comforting that, say, algorithm A can represent any function in function family F (i.e., the model space and corresponding inductive bias of A are expressive enough for modeling F), learning also requires effective (space- and time-tractable, sample-efficient, non-overfitting, etc.) model search and evaluation in that space.

For example, Decision Trees do not have practical procedures to search for and learn every function expressible as a DT, so practical DT induction involves a highly incomplete search of the hypothesis space. Similarly, ANNs can represent any function, but the number of units needed and the time needed for training are worst-case intractable, and the procedures used to search the space of ANN parameters are not guaranteed to find the right parameter values. The sketch below illustrates the distinction between representational capacity and what training actually achieves.
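In this illustrative sketch (the architecture and training settings are assumptions chosen for brevity), a one-hidden-layer network is fit to sin(x) with increasing numbers of hidden units. Capacity generally helps, but nothing guarantees that the optimizer reaches the best fit attainable for a given size; results depend on initialization and the training budget.

    # Minimal sketch: universal approximation capacity vs. what training finds.
    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-np.pi, np.pi, size=(2000, 1))
    y = np.sin(X).ravel()                         # target function to approximate

    for hidden in [2, 8, 64]:
        net = MLPRegressor(hidden_layer_sizes=(hidden,), activation="tanh",
                           max_iter=5000, random_state=0)
        net.fit(X, y)
        mse = float(np.mean((net.predict(X) - y) ** 2))
        print(f"hidden units = {hidden:3d}   training MSE = {mse:.5f}")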

Ignoring Equivalence Classes

A persistent limitation of most AI/ML methods (with the notable exception of Markov Boundary equivalence class algorithms and algorithmic causal discovery methods; see chapters “Foundations and Properties of AI/ML Systems” and “Foundations of Causal ML”), as well as of common modeling practices, is that of ignoring equivalence classes of models. The problem of equivalence classes (and selected algorithms that address it) is thoroughly discussed in chapters “Foundations and Properties of AI/ML Systems”–“The Development Process and Lifecycle of Clinical Grade and Other Safety and Performance-Sensitive AI/ML Models”. Here, because of the problem’s pervasive nature, we explain the basic ideas and related pitfalls and best practices.

  • In predictive modeling: for every model that achieves optimal accuracy there may exist an astronomical number of models that have equal accuracy.

  • In feature selection: for every subset of features that has maximal information about a response variable and is irreducible (aka minimal, or maximally compact), there may exist an astronomical number of feature sets that also have maximal information and are irreducible (maximally compact).

  • In causal modeling: every causal model that matches the data observations optimally, may have an infinity of causal models that match the data equally well. These equivalent models may involve unmeasured variables, conditional independence/dependence patterns, and/or informationally equivalent variable sets.

  • In general, there is no such thing as (i) a single optimal predictor model, (ii) a single optimal set of features and (iii) a single best causal model inferred from data.

Figure 8 demonstrates equivalence classes in the analysis of a real-life high-throughput transcriptomic dataset.

Fig. 8
A scatterplot of classification performance in the discovery dataset versus the validation dataset for methods including resampling plus RFE 1, 2, plus Univ 1, 2, KIAMB 1, 2, 3, iterative removal, and TIE. All points are clustered between approximately 0.2 and 0.9 on both axes.

Equivalent optimal signatures for 5-year leukemia prognosis. The x-axis shows the cross-validation performance estimate in the discovery data; the y-axis shows the independent validation performance in the validation data. Black dots: optimal equivalent signatures. Colored dots: various signatures obtained by re-sampling methods. The optimal signature equivalence class, identified by application of the TIE* algorithm, is centered on the diagonal of the x/y axes, indicating exact reproducibility of the cross-validation estimates in the independent validation. The equivalence class “cloud” has non-zero area because, at the sample size used, there is unavoidable variance in the error estimates

The consequences and pitfalls of non-modeled equivalence classes in practical modeling are given in Table 2.

Table 2 Consequences and pitfalls of non-modeled (i.e., ignored) equivalence classes in practical modeling

We re-iterate for emphasis that, in general, there is no such thing as a single optimal predictor model, a single optimal set of features, or a single best causal model inferred from data. There are equivalence classes of these entities, and in many, if not all, practical situations these equivalence classes must be modeled and studied.

Lack of sufficient equivalence class modeling may be the single most important source of over-interpreting modeling results.

Because the size of the equivalence classes can be immense (i.e., exponential in the number of variables; see [80]), finding the true causes when selecting a random member amounts to playing a lottery with astronomical odds against the analyst. Similarly, finding the feature set that is most suitable for clinical application is astronomically unlikely if the Markov Boundary equivalence class is large and is not modeled. The sketch below shows, on synthetic data, how two distinct feature sets can be informationally equivalent.
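In this sketch the variable names (gene_a, gene_b, gene_c) are hypothetical; gene_b is simply the same quantity as gene_a measured a second time, so the feature sets {gene_a, gene_c} and {gene_b, gene_c} carry identical information about the outcome. Any algorithm that returns a single “best” feature set is silently picking one member of such an equivalence class.

    # Minimal sketch: two distinct, irreducible feature sets with identical
    # information about the outcome (an equivalence class of size > 1).
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    n = 2000
    gene_a = rng.normal(size=n)
    gene_b = gene_a.copy()        # e.g., the same analyte measured on a second platform
    gene_c = rng.normal(size=n)
    outcome = (gene_a + gene_c + rng.normal(scale=0.5, size=n) > 0).astype(int)

    for features, name in [((gene_a, gene_c), "{gene_a, gene_c}"),
                           ((gene_b, gene_c), "{gene_b, gene_c}")]:
        X = np.column_stack(features)
        auc = cross_val_score(LogisticRegression(), X, outcome,
                              cv=5, scoring="roc_auc").mean()
        print(f"feature set {name}: cross-validated AUC = {auc:.4f}")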

Best Practice 12.13.1

  (a) Use equivalence class modeling algorithms for discovering the equivalence class of optimally accurate and non-reducible predictive models, e.g., TIE* instantiated with GLL-MB or other sound Markov boundary subroutines (see chapters “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare & Health Science” and “The Development Process and Lifecycle of Clinical Grade and Other Safety and Performance-Sensitive AI/ML Models”).

  (b) Use equivalence class modeling algorithms for discovering the equivalence class of direct causes, e.g., TIE* instantiated with GLL-PC or other sound local causal neighborhood subroutines (chapters “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare & Health Science” and “The Development Process and Lifecycle of Clinical Grade and Other Safety and Performance-Sensitive AI/ML Models”).

  (c) When experiments can be conducted, consider using ML-driven experimentation algorithms that model equivalence classes. Experimentation may be needed to resolve the equivalence classes and unmeasured confounding. Such algorithms minimize the number of experiments needed, e.g., ODLP*.

Limited Translation and Clinical Adoption

The vast majority of the thousands of research papers describing AI/ML models are at the feasibility, exploratory, and pre-clinical stages [144]. Very few have reached clinical maturity, and this is a major point made in the >300 meta-analyses and systematic reviews of this technology in healthcare and the health sciences discussed in chapter “Artificial Intelligence (AI) and Machine Learning (ML) for Healthcare and Health Sciences: The Need for Best Practices Enabling Trust in AI and ML”. We will look at a few of these results in more detail here.

Despite many years of medical AI expert system research and pilot projects, Clinical Decision Support (CDS) has seen only limited adoption by healthcare systems [145]. The most successful forms of CDS are clinical best practice guidelines (e.g., Wolters Kluwer, EBSCO, Elsevier) and clinical order sets (e.g., Wolters Kluwer, Zynx). But these are simple forms of CDS that are not based on AI/ML and require significant effort from the clinician to understand how the CDS should be applied to a particular patient situation. There are no large vendors providing active CDS that is integrated into the clinical workflow at scale. Epic’s foray into AI/ML CDS via the sepsis model discussed in this chapter proved unsuccessful (the vendor subsequently introduced many more models, and some studies have shown limited generalizability; this is an area of active research). While there are examples of successful CDS that work within a single health organization, uptake of CDS has generally been low [146]. There has been limited progress on sharing and generalizing CDS that works across organizations, despite efforts from industry groups such as HL7’s CPG-on-FHIR [147] and AHRQ’s CDS Connect [148].

In a systematic review of AI applications that have been implemented in real-life clinical practice [149], Yin et al. identified (out of thousands of AI studies) 51 studies that reported the implementation and evaluation of AI applications in clinical practice, of which 13 adopted a randomized controlled trial design and 8 an experimental design. The AI applications addressed sepsis (n = 6), breast cancer (n = 5), diabetic retinopathy (n = 4), and polyps and adenoma (n = 4). Twenty-six studies examined the performance of AI applications in clinical settings, 33 examined the effect of AI applications on clinician outcomes, 14 examined the effect on patient outcomes, and one examined the economic impact associated with AI implementation.

Yin et al. concluded that research on the clinical implementation of AI applications is still at an early stage despite the great potential.

RISE criteria. Varghese proposed the RISE criteria to promote effective and safe clinical use of AI applications. The RISE criteria correspond to recurring barriers to AI adoption: Regulatory aspects, Interpretability and interoperability, and the need for Structured data and Evidence [150].

In a recent article, Chew et al. reviewed studies describing perceptions of and needs for AI in health care. Twenty-six articles were identified, covering the perceptions and needs of various populations in the use of AI for general, primary, and community health care; chronic disease self-management and self-diagnosis; mental health; and diagnostic procedures. The use of AI was perceived positively because of its availability, ease of use, and potential to improve efficiency and reduce the cost of health care service delivery. However, concerns were raised regarding the lack of trust in data privacy, patient safety, technological maturity, and the possibility of full automation [151].

Technology acceptance model (TAM). TAM and other similar models have been constructed to describe, understand and promote technology acceptance [152]. The main factors highlighted and analyzed by such models include: Perceived usefulness, Perceived ease of use, Social influence/subjective norms, Perceived behavioral control/facilitating conditions. Each of these dimensions comprises several sub-dimensions.

Despite their extensive use outside health science and health care, the literature on using TAMs for health AI/ML acceptance is in its infancy.

In a systematic review of randomized trials of ML interventions in health care, Plana et al. [153] studied the design, reporting standards, risk of bias, and inclusivity of RCTs of medical machine learning interventions. Forty-one RCTs were identified, with a median of 294 participants (range, 17–2488). No trial adhered to all CONSORT-AI standards. Common reasons for nonadherence were not assessing poor-quality or unavailable input data (38 trials [93%]), not analyzing performance errors (38 [93%]), and not including a statement regarding code or algorithm availability (37 [90%]). Overall risk of bias was high in 7 trials (17%). Of the 11 trials (27%) that reported race and ethnicity data, the median proportion of participants from underrepresented minority groups was 21% (range, 0–51%).

This systematic review confirmed that, despite the large number of medical machine learning-based algorithms in development and the thousands of related publications, very few RCTs of these technologies have been conducted, and those had a number of methodological deficiencies. At the same time, it has to be recognized that pushing AI/ML models into clinical trial validation must be a final verification step following the many preceding steps detailed in chapter “The Process and Lifecycle of a Clinical-Grade AI/ML model”. Failing to do so will expose human subjects in the trials to potentially serious risks.

Conclusions

The history of AI/ML is long, both inside and outside the health space. Progress almost never happens steadily upward: many crises, failures, and temporary or permanent dead ends have led to short-lived or prolonged reductions in the social support of, and enthusiasm for, this science and technology. In most cases the crises prompted, and were followed by, significant improvements that overcame prior limitations.

The most important general lesson learnt was that, as long as limitations were carefully identified, acknowledged, studied, and understood, and were followed by systematic and science-based efforts to overcome them, improved science and technology invariably ensued.

The modern era of health AI/ML is facing its own set of challenges. The biggest risk is that of AI/ML systems causing large-scale harm to individuals and societies. The biggest existential risk for AI/ML itself, in the current overhyped AI climate, is another AI winter such as the one that devastated the field in the 1970s. Although there are vastly more practical applications of AI now than there were in the 1970s, hype is still a major concern. Application of BPs enforcing scientific, principled, rational design, evaluation, and deployment of this technology can help smooth out and accelerate the field’s evolution.

As demonstrated by the examples of the present chapter, a science-based approach to AI/ML can reduce or eliminate costly failures, accelerate progress, manage risks, and lead to better, accountable, and safer AI/ML that can win the trust and adoption of healthcare and health science stakeholders. BPs aim to provide such frameworks for progress over enduring and open problems which are discussed here and also further in chapter “Synthesis of Recommendations, Open Problems and the Study of Best Practices”.

Key Concepts Discussed in This Chapter

Significant advances in health AI/ML are the result of non-monotonic progress with many failures followed by successes

Gartner Hype cycle

AI winters

Historic overview of the development of AI: Perceptron, back propagation, deep learning, SVM, random forest, semantic networks, knowledge representation and reasoning (KRR), network science, rule-based AI, Bayesian Networks 

The continuum between heuristic and formal systems

The disconnect between early AI and real-world needs

Bayes learners and Bayes networks

Overfitting and over-confidence. Regularization and error estimation protocols.

Data design and learning protocol.

Causality: “Correlation does not imply causation”, Modern causal discovery methods (without experimentation) and the pillars of causal AI.

Case studies in genomics.

Published guidelines—attributes and limitations.

Limitations of the literature. Publication bias. Matthew effect. Ignoring the literature.

Regulatory criticisms.

Case studies of select commercial AI: Portability of technology across application domains, hybrid symbolic non-symbolic AI, biases in the training data, “Big data hubris”, large language models, problem solving with unlimited scope.

Mis- and over-interpretation of theory.

Ignoring equivalence classes. Target information equivalence.

Pitfalls Relevant to Present Chapter

Pitfall 12.11.1. Focused systems are at an advantage over unfocused ones when the goal is to solve specific problems.

Pitfall 12.13.1. In general, there is no such thing as a single optimal predictor model, a single optimal set of features, or a single best causal model inferred from data. There are equivalence classes, and in many practical situations these equivalence classes must be modeled and studied. Additional predictive modeling errors in analyses where the equivalence class of Markov Boundaries is not inferred include:

  • The predictor model will be a random member of the Markov Boundary equivalence class. This may not be the cheapest, easiest or most convenient model to deploy clinically.

  • In domains with large equivalence classes, intellectual property cannot be defended since a 3rd party can use an equivalent Markov Boundary and easily bypass a patent or other IP protections.

Additional causal modeling errors in analyses where equivalence classes of Direct Causes are not inferred include:

Discarding causal variables in favor of non-causal ones (e.g., discarding A because its correlation with outcome vanishes when we include non-causal but information equivalent A’ in a regression model).

Over-interpreting models: e.g., believing that because A’ is a model returned by an algorithm, without equivalence class modeling, and A is not, then A’ is biologically more important than A.

Because the size of the equivalence classes can be immense (i.e., exponential in the number of variables [80]), finding the true causes when selecting a random member (as all algorithms not equipped for equivalence class modeling do) amounts to playing a lottery with astronomical odds against the analyst. Similarly, finding the feature set that is most suitable for clinical application is astronomically unlikely if the Markov Boundary equivalence class is large and is not modeled.

Best Practices Relevant to Present Chapter

Best Practice 12.7.1. The IOM’s recommendations and best practices to enhance development, evaluation, and translation of omics-based tests before they are used to guide patient treatment in clinical trials. These aim to ensure that progress in omics test development is grounded in sound scientific practice and is reproducible, resulting not only in improved health care but also in continued public trust [85].

Best Practice 12.9.1. Literature limitations imply (among other things) that readers should seek, read, and interpret health AI/ML papers taking these limitations into consideration.

Best Practice 12.11.1.

  1. ML is not guaranteed to outperform classical tools (like LR). In many applications simpler models outperform “fancier” (more expressive, more complicated) ones because of a variety of factors, including having properties that better match health domain characteristics, having extensive guidelines for proper use, protocols and designs overpowering algorithms, and restricted learners being superior in low-sample situations (see the protocol vs. algorithm discussion in the present chapter and the BVDE arguments in chapter “Foundations and Properties of AI/ML Systems”).

  2. It is an excellent idea to always include baseline comparators such as LR and other methods in model building (see chapters “Principles of Rigorous Development and of Appraisal of ML and AI Methods and Systems” and “The Development Process and Lifecycle of Clinical Grade and Other Safety and Performance-Sensitive AI/ML Models”).

  3. A significant portion of nominally very high DL performance is linked to highly biased research designs (incomplete cross validation, possible overfitting and error estimation bias, and other methodological issues that lead to overconfidence in models; see chapter “Overfitting, Underfitting and General Model Overconfidence and Under-performance Pitfalls and Best Practices in Machine Learning and AI”). In studies with strong methodology and lower risk of bias, DL does not seem to perform as well.

Best Practice 12.11.2. In the foreseeable future, and especially for clinical-grade applications and expensive commercial solutions, consideration of hybrid symbolic-connectionist approaches may be worthwhile in many problem domains. Possible advantages include a faster path to design, faster validation and implementation, and performance exceeding that of the individual components.

Best Practice 12.11.3. Large-scope systems are particularly suitable for human-computer hybrids, and the corresponding practices described in chapter “From ‘Human Versus Machine’ to ‘Human with Machine’” should be taken into account.

Best Practice 12.13.1.

  1. Use equivalence class modeling algorithms for discovering the equivalence class of optimally accurate and non-reducible predictive models, e.g., TIE* instantiated with GLL-MB or other sound Markov boundary subroutines, or iTIE*.

  2. Use equivalence class modeling algorithms for discovering the equivalence class of direct causes, e.g., TIE* instantiated with GLL-PC or other sound local causal neighborhood subroutines.

  3. When experiments can be conducted, consider using ML-driven experimentation algorithms that model equivalence classes. Experimentation may be needed to resolve the equivalence classes and unmeasured confounding. Such algorithms minimize the number of experiments needed, e.g., ODLP*.

Classroom Assignments & Discussion Topics in This Chapter

In the assignments below, consider the following case studies/failures/limitations discussed in the chapter (numbered for ease of reference):

12.1 Significant advances in health AI/ML are the result of non-monotonic progress with many failures followed by successes. Learning from case studies.

12.2. The Gartner Hype cycle.

12.3. “AI winters”.

 12.3.1. Perceptron limitations.

 12.3.2. Back Propagation-based ANNs and the vanishing (or exploding) gradient problem.

 12.3.3. A new improved technology of ANNs followed: “Deep ANNs aka Deep Learning”.

 12.3.4. Current limitations (and opportunities for improvement) of DL.

 12.3.5. What’s in a link: the importance of semantics in AI models and methods. From problems with early semantic networks to those of modern network science and pathway discovery.

 12.3.6. Rule-based AI, Expert systems, Heuristic systems, limitations of Bayesian learners and disconnect between systems and real-world problems.

  12.3.6.1. Rule based AI and Formal Expert Systems.

  12.3.6.2. Heuristic systems.

  12.3.6.3. Disconnect between early AI systems and real-world needs and workflows.

  12.3.6.4. Limitations of early Bayesian learners and emergence of BNs.

  12.3.6.5. Intractability of BN inference. Ability to learn BNs from data.

12.4. Overfitting and over-confidence in models: problems, advances and persistent challenges.

12.5. Ignoring the data design and learning protocol (model selection, error estimation) effects on modeling success.

 12.5.1. Effects of protocols on overall model performance are very strong.

 12.5.2. Effects of data design on overall model performance are very strong.

  12.5.2.1. Issues with ML challenges.

  12.5.2.2. Other persistent issues related to common data design deficiencies.

12.6. Causality in early and modern ML.

 12.6.1. “Correlation does not imply causation” and the falsely implied impossibility of discovering causality without experiments. Problems with RCTs and low-dimensional experiments.

 12.6.2. New class of scalable causal discovery algorithms.

 12.6.3. Early causal algorithms were not scalable; claimed impossibility of tractable causal discovery algorithms.

 12.6.4. Lack of causal correctness in broad use of ML and especially the persistent use of predictive modeling methods to solve causal problems.

 12.6.5. Pillars of Causal AI.

12.7. Case studies with pitfalls involving AI/ML used in genomics.

12.8. Attributes and limitations of published guidelines and criteria for health AI/ML.

 12.8.1. The risk of exaggerating guideline generality.

 12.8.2. Over-interpreting guidelines can (and does) negatively affect the meta analytic literature and its subsequent literature.

 12.8.3. Guidelines and standards with implied exaggerated completeness or impact.

12.9. Limitations of literature.

 12.9.1. Limitations of self-correction.

 12.9.2. Assessment of Risk of Bias in published studies.

 12.9.3. Disconnected publication spheres and disjointed expertise across fields.

 12.9.4. Matthew effect in the literature of health AI/ML.

 12.9.5. Compounding of publication bias and the Matthew effect in the AI/ML literature.

12.10. Failures of AI/ML COVID models; Regulatory criticisms in COVID AI; Other regulatory areas of improvement.

12.11. Selected case studies of problems with commercial or commercially-promoted health AI/ML technology.

 12.11.1. IBM Watson health.

 12.11.2. Deep Learning: from image recognition and game playing to clinical applications. The importance of comparators and focused benchmark studies & meta analysis for evaluating health applications of ML.

 12.11.3. Marcus’ criticisms of shallow statistical (non-symbolic) AI/ML.

 12.11.4. Racial bias in UnitedHealth Group’s Optum model.

 12.11.5. Scant evidence for positive outcomes from health apps.

 12.11.6. Epic Sepsis Model (ESM).

 12.11.7. ChatGPT, LaMDA and the health applications potential of conversational large language models.

 12.11.8. Unlimited scope versus focused systems.

 12.11.9. Comparative performance of academic/free and commercial ML for text categorization.

 12.11.10. Commercializing patient data.

 12.11.11. Google Flu Trends.

 12.11.12. AI amplifying its own errors?

12.12. Misinterpreting or over-interpreting theory.

 12.12.1. NFLTs and Cross-validation.

 12.12.2. Optimal Bayes Classifier (OBC).

 12.12.3. Misinterpreting Universal Function Approximation (UFA).

12.13. Ignoring equivalence classes.

12.14. Limited translation and adoption at clinical stages.

Assignments:

  1. Map each entry in the list above to the pitfalls in chapters “Foundations and Properties of AI/ML Systems” to “Regulatory Aspects and Ethical Legal Societal Implications (ELSI)”. Justify your choices.

  2. Map each entry in the list above to the Best Practices in chapters “Foundations and Properties of AI/ML Systems” to “Regulatory Aspects and Ethical Legal Societal Implications (ELSI)”. Justify your choices.

  3. Characterize each case study as belonging to one of the following categories:

     (a) Historical case studies: describing problems that have been solved;

     (b) Historical and current case studies: describing open problems that have existed for some time and continue to require solution;

     (c) Current case studies: pertaining to more recent unsolved challenges.

  4. What types of limitations and failures in AI/ML seem to be recurring or persisting over the history of the field? Why do you think they recur or persist? What would be strategies to ensure their eradication?