Establishing Properties of AI/ML Methods During New Method Development and Evaluating Properties when Choosing the Best Method for the Problem at Hand

In chapters “Foundations and Properties of AI/ML Systems”, “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare and Health Science”, and “Foundations of Causal ML” we reviewed major health AI/ML methods in common use in healthcare and the health sciences. We examined how they work and summarized important properties. We especially focused on the following theoretical properties and corresponding conditions (if known) that guarantee these properties:

  1. Representation power: can the models produced by the method represent all relevant problem instances and their solutions?

  2. Transparency: are the models produced by the method easy to understand (i.e., are they “transparent box”) and can they be easily understood by human inspection (i.e., are they explainable and human interpretable)?

  3. Soundness: when the method outputs a solution to a problem instance, is this solution correct? If there is a degree of error (measured as loss, risk, or on some other scale), how large is the error and its uncertainty?

  4. Completeness: does the method produce correct answers to all problem instances? If only for a fraction of the problem space, how large or important is that fraction?

  5. Computational complexity for learning models: what is the exact or asymptotic computational complexity of running the method to produce models as a function of the input size (e.g., number of variables, or sample size)?

  6. Computational complexity for using models: what is the exact or asymptotic computational complexity of running the models produced by the method to answer problems as a function of the input size (e.g., number of input variables)?

  7. Other cost functions: for example, what is the cost to obtain and store input data and run analyses on a compute environment, either at model discovery or at model deployment time?

  8. Space complexity for learning models: what is the exact or asymptotic storage complexity of running the method to produce models as a function of the input size (e.g., number of variables, or sample size)?

  9. Space complexity for storing and using models: what is the exact or asymptotic storage complexity of running the models to answer problems as a function of the input size (e.g., number of variables, or sample size)?

  10. Sample complexity, learning curves, power-sample requirements: how does the error of the produced models vary as a function of the sample size of the discovery data? How large a sample do we need in order to build models with a specific degree of accuracy and acceptable statistical uncertainty, and how large a sample is needed to establish statistical superiority over random or specific models and performance levels?

  11. Probability and decision theoretic consistency: is the ML/AI method compatible with probability and utility theory?

We also discussed empirical performance properties which we can differentiate for the purposes of the present chapter into:

  1. Comparative and absolute empirical performance in simulation studies: when we give the method discovery data produced by simulations, what is the empirical performance of the method?

  2. Comparative and absolute performance in real data with hard and soft gold standard known answers: when we give the method discovery data from real-world sampling, what is the empirical performance of the method?

Chapters “Foundations and Properties of AI/ML Systems”, “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare and Health Science”, and “Foundations of Causal ML” made it abundantly clear that the majority of widely-used formal methods have known properties along the above dimensions. Ad hoc and heuristic systems and methods do not, however, and are thus viewed in this volume as pre-scientific, early-stage and preliminary, carrying great risk of failure (at the present stage of scientific knowledge about them). It also became obvious that knowing the above properties is essential for determining which among several available methods and corresponding libraries, tools and commercial offerings are best suited for a problem we wish to solve using AI/ML.

In general, theoretical properties are stronger than empirical ones in the sense that they describe very large parts of the problem-solving domain more efficiently and with greater clarity than empirical studies. Figure 1 shows the coverage and interpretation of sufficient conditions, vs. sufficient and necessary conditions, vs. necessary conditions. Sufficient and necessary criteria describe the whole problem space, whereas sufficient conditions and necessary conditions describe only part of it; however, sufficient and necessary conditions are usually more difficult to establish, so theoretical analysis of AI/ML methods is usually provided in the form of sufficient conditions. It is not uncommon for the set of known sufficient conditions to grow over time so that the total problem space becomes better understood.

Fig. 1

Conditions for desirable performance of AI/ML methods. Sufficient vs. sufficient and necessary, vs. necessary conditions

Notice that sufficient conditions point to parts of the problem space where desired performance is achieved. So, if we establish that in some domain a set of sufficient conditions holds, we can expect desirable performance. In other words, sufficient conditions are constructive. Necessary conditions point to parts of the problem space (those where the conditions are violated) in which undesired performance will occur. They provide warnings against pitfalls, but do not give us the complete picture of how to obtain desirable results (other than avoiding the pitfalls, i.e., safeguarding that necessary conditions are not violated).

Even if we do not violate necessary conditions, we are not guaranteed desired performance however (because additional sufficient conditions may not hold). Necessary and sufficient criteria map out precisely the totality of the problem space in which we will obtain desirable performance and the space that we will not. Again, it is much harder in practice to derive necessary and sufficient conditions.

By comparison to the above, empirical studies provide limited coverage of the problem space as shown in Fig. 2. In the absence of theory it would take an astronomical number of empirical studies to cover the space that a single theorem can (assuming that the combinatorial space of factors involved is non-trivial).

Fig. 2

Empirical studies. Empirical studies showing positive (green) and negative (red) performance results can sample the problem space and collect evidence about the applicability of the method(s) of interest (extrapolated, inductively, to similar populations). The vast majority of the space (grey) has unknown performance, however, since it has properties that may affect performance and that are not covered by the empirical studies. Empirical studies are vastly less efficient than theoretical analysis

One caveat of applying theoretical analysis is that we may not know with absolute certainty whether sufficient, necessary, or sufficient and necessary conditions are met in some problem domain. It is therefore very important for theoretical conditions to be testable. For example, we can easily test whether the data is multivariate normal and thus we can know whether in some domain we satisfy the relevant sufficient condition of OLS regression (chapter “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare and Health Science”). By comparison, within classical statistics frameworks we do not know in general, and cannot test practically, whether the strong ignorability sufficient condition of the propensity scoring method holds (chapter “Lessons Learned from Historical Failures, Limitations and Successes of AI/ML In Healthcare and the Health Sciences. Enduring problems, and the role of Best Practices”). The existence of such a condition then just says that it is theoretically possible for the method to work in some real or hypothetical distribution (but we cannot tell with certainty whether it will work in any real-life data of interest).

In some situations it is also possible to fix violations of conditions. For example, we can transform (some) non-normal distributions so that they become normal. Sometimes, violating sufficient conditions for desired performance is mistaken to imply guaranteed undesired performance. This is not always the case, since not meeting one set of sufficient conditions may still be followed by good performance (i.e., if other sufficient conditions exist and are met, aka “mitigation factors” [1]).
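
As a minimal illustration of testing a condition and repairing a violation (a univariate proxy for the multivariate normality example above; the data, variable names, and choice of test are illustrative assumptions, using Python with NumPy and SciPy):

```python
# Hedged sketch: screen one condition (normality) and apply a simple fix.
# A positive-valued, right-skewed measurement is simulated for illustration;
# proper multivariate normality tests (e.g., Mardia, Henze-Zirkler) live in
# specialized packages and would replace this univariate check in practice.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=500)   # deliberately non-normal

stat, p = stats.normaltest(x)            # D'Agostino-Pearson normality test
print(f"raw data: p = {p:.3g}")          # small p -> normality assumption violated

x_log = np.log(x)                        # candidate fix: log transform
stat2, p2 = stats.normaltest(x_log)
print(f"log-transformed: p = {p2:.3g}")  # large p -> no evidence against normality
```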

There is a clear relevance of pitfalls and guidelines for best practices, to the above concepts. These must describe sufficient, necessary or sufficient and necessary conditions for desired or undesired performance, and these conditions must be testable or identifiable.

In Fig. 3 we demonstrate the importance of combining theoretical analysis with empirical testing. Theory may predict that in some identifiable part of the problem space (described in the example of the figure by the subspace where sufficient conditions hold) we should expect desirable behavior. But we may still be unsure about whether these conditions were tested accurately or, if no testing of conditions was conducted, whether the theory applies to the particular problem space, and the data sampled from it, that we are facing. By “sampling” this part of the problem space we can obtain evidence strengthening or refuting the appropriateness of applying the methods in question.

Fig. 3

Combining theoretical analysis’ expectations with empirical studies (green = empirical studies showing desired performance, red = empirical studies with undesired performance). Top: empirical studies verify that our problem space is well aligned with theoretical expectations of AI/ML method performance. Middle: some studies show misalignment with theoretical expectations; possibly the criteria used to test whether assumptions hold are inaccurate, or the empirical studies were flawed, or both. Bottom: many studies show misalignment of our domain with theoretical expectations; either the criteria used to test whether assumptions hold are wrong, or the assumptions were never tested (and the problem domain does not oblige)

To elaborate, if we obtain positive empirical verification of theoretical expectations, we can be confident that the theoretical roadmap is aligned with our practical problem-solving setting. If, however, we see empirical results that violate the theoretical expectations, this is evidence that we did not sufficiently test the preconditions for method success, or that the means at our disposal for testing the preconditions of the theoretical properties are not accurate and we need better ways to test the suitability of our methods for the real-life problem.

Going back to Fig. 2, the reader may wonder: why don’t we dispense with theory and treat the characterization of the problem space, with regard to the performance of any set of methods, as an empirical ML problem itself? For example, dispense with theory and base acceptance of, e.g., clinical AI models on clinical trials alone? With enough methods and enough trials/datasets over enough problem spaces we could circumvent the need to derive complex theoretical analyses. The answer is three-fold:

  1. As indicated earlier, the number of empirical studies needed would be astronomical, since it grows combinatorially with the number of factors involved (for example, just 20 binary factors already define 2^20, i.e., more than a million, distinct configurations to test).

  2. In the absence of theory we do not even know which factors affect performance and thus lack the knowledge necessary to design a set of empirical studies that can cover the space of interest. This is a theory-ladenness problem (i.e., scientific conclusions are affected by the framework determining what to study, how to measure it, etc.) [2, 3].

  3. This practice may jeopardize human subjects or waste valuable resources, and is thus unethical in such cases. Moreover, as the field of ML matured, it has become evident that even thousands of datasets used for a variety of benchmarks across domains cannot cover the full space of possibilities for evaluating a single method. At most we can divide and conquer the full ML problem space into application areas and conduct large benchmark studies there. However, for this endeavor to be efficient and practical, the specific choices of focused areas and datasets must be constrained and guided by a robust theoretical understanding of the factors affecting successful modeling.

In summary, theory and empirical study work synergistically together and provide a concrete roadmap of which methods are suitable for what task, by establishing the properties of AI/ML methods. The same is true for individual models produced by such methods as detailed in the next chapter on developing and validating AI/ML models. Knowing AI/ML properties (theoretical and empirical) provides crucial information that allows one to make informed decisions about the Performance, Safety and Cost-effectiveness requirements and potential of corresponding AI/ML approaches, methods and systems (and the decision models produced by them). Successfully meeting requirements that ensure desirable AI/ML behavior can then lead to building trust in the AI/ML solution (as discussed in chapter “Artificial Intelligence (AI) and Machine Learning (ML) for Healthcare and Health Sciences: the Need for Best Practices Enabling Trust in AI and ML”) which encompasses: Scientific and Technical Trust; Institutional Trust; System-of-science Trust; Beneficiary Trust; Delivery Trust; Regulatory Trust; and Ethical Trust. We can codify the above as 4 best practices:

Best Practice 5.1

Methods developers should strive to characterize the new methods according to the dimensions of theoretical and empirical properties.

Best Practice 5.2

Methods developers should carefully disclose the known and unknown properties of new methods at each stage of their development and provide full evidence for how these properties were established.

Best Practice 5.3

Methods adopters and evaluators (users, funding agencies, editorial boards etc.) should seek to obtain valid information according to the dimensions of theoretical and empirical properties for every method, tool, and system under consideration.

Best Practice 5.4

Methods adopters and evaluators should map the dimensions of theoretical and empirical properties for every method, tool, and system under consideration to the problem at hand and select methods based on best matching of method properties to problem needs.

Major pitfalls can and do ensue when the above best practices are not followed. In chapter “Lessons Learned from Historical Failures, Limitations and Successes of AI/ML In Healthcare and the Health Sciences. Enduring problems, and the role of Best Practices” we will discuss many case studies of real-life dire consequences of failing to follow these best practices.

Pitfall 5.1

Developing methods with theoretical and empirical properties that are:

  (a) Unknown, or

  (b) Poorly characterized in disclosures and technical, scientific or commercial communications and publications, or

  (c) Clearly stated (disclosed) but not proven, or

  (d) Not matching the characteristics of the problem to be solved at the level of performance and trust needed.

Any of the above opens the possibility for major errors, under-performance, poor safety, unacceptable cost-effectiveness, and lack of trust in AI.

Best Practice Workflow to Establish the Properties of any New or Pre-Existing Method, Tool or System

In the remainder of the present chapter we will outline a best practice workflow, shown in Figs. 4 and 5, that can be used to create a new method (Fig. 4) or establish the properties of any pre-existing method, tool or system (Fig. 5), so that a rational, effective and efficient solution to the problem at hand can be identified as per Best Practices 5.1–5.4.

Fig. 4

Steps in developing and validating new AI/ML methods. Details in text. Notice the highly non-linear flow that describes the frequent need to go back and revise earlier steps if they do not lead to desired performance

Fig. 5

Steps in evaluating existing AI/ML methods. Details in text. This process is linear since the evaluators do not concern themselves with fixing problems found in the appraisal

Throughout the description of the process that follows, we will highlight how the steps apply to several well-known methods (which are detailed in chapters “Foundations and Properties of AI/ML Systems”, “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare and Health Science”, and “Foundations of Causal ML”) and interleave a running in-depth example of putting these principles into practice in a real-life method development context (new method development for biomarker, signature and biomedical pathway discovery).

Step 1. Rigorous problem definition (in precise mathematical terms, and with precise correspondence to health care or health science objectives)

It is always worthwhile to invest time in defining the problem addressed by the new methods we plan to develop or evaluate in very precise terms. This enables an accurate and effective mapping of the data design and modeling onto the most appropriate AI/ML methods if they exist, or establishes the need to create new methods (if methods do not exist or their properties are not up to par with the specifications of the problem-solving effort at hand).

A mathematical-level description of the problem we wish to solve enables appropriate theoretical (mathematical, algorithmic, information-theoretic, statistical, or other) analysis and proof that the problem can or cannot be solved within specific performance requirements. This includes whether the problem can be solved in acceptably small time, storage, or sample size and input costs; or, if it cannot, whether the development effort needs to focus on subclasses of the broad problem space that are solvable efficiently.

Unfortunately, it is often the case that AI/ML methods are simply applied to data with the expectation that “useful” patterns will be revealed or “insights” will be generated, without a precise description of what constitutes a desired solution.

Real-life relevance. Characteristic examples of this pitfall in healthcare AI/ML applications are seeking “useful patterns”, “anomalies”, “practice variation”, “actionable insights”, “subpopulations”, or any number of other fuzzy goals that are hard to critically and conclusively evaluate in terms of meaningfully meeting the goals of the project. Another common pitfall is in the risk and predictive modeling domains, where “accurate” algorithms (or models) are sought without reference to the degree of accuracy, expressed on some evaluation scale, that is required to advance the goals of the project at hand.

In the health sciences, a characteristic example of this pitfall is in bioinformatics analyses of high-dimensional omic molecular data, where AI/ML methods are applied to reveal “biological insights”, “structure in data”, “shape of the data”, “clusters”, “pathways”, “signatures”, or “gene lists”, often with very imprecise or inconsistent language about what exactly each of these entities is meant to encompass, or about how discovering them will enable a specific scientific investigation, answer a concrete hypothesis, or generate results with practical clinical, scientific or technological value. Often such fuzzy goals and results are combined with ex post facto narratives by the authors of the study that overlay meaning and significance onto findings that may inherently lack such meaning.

The ability of experts to create explanations in their field is an important component of their professional success [4]. A serious possibility exists, however, that domain expert investigators may be prone to creating superficially convincing narratives around any set of results, even random ones. The term Scientific Apophenia describes the tendency to find evidence of order where none exists in scientific results [5]. The human mind is a powerful recognizer of patterns but is also subject to seeing patterns in random sequences and, conversely, to believing that non-random patterns are random [6].

Moreover, some patterns may emerge out of random structures as a result of mathematical necessity, as studied by Ramsey Theory [7]. For example, in any group of 6 people, by mathematical necessity, either at least three of them are (pairwise) mutual strangers or at least three of them are (pairwise) mutual acquaintances. The reader can see the relevance of this theorem to, e.g., interpreting bioinformatics analyses describing clusters among genes or other molecules. Such clusters, even if randomly generated, will contain structured patterns of known relationships, which in turn will lend credence to the (false) validity of previously unknown relationships.
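
The 6-person claim (an instance of the Ramsey fact R(3,3) = 6) can be checked exhaustively; the short, purely illustrative Python sketch below enumerates every 2-coloring of the 15 edges of a complete graph on 6 nodes and confirms that each coloring contains a monochromatic triangle:

```python
# Brute-force check of the Ramsey fact used in the text: every 2-coloring
# (e.g., "acquaintances"/"strangers") of the edges of K6 contains a
# monochromatic triangle.
from itertools import combinations, product

nodes = range(6)
edges = list(combinations(nodes, 2))          # the 15 edges of K6

def has_mono_triangle(coloring):
    color = dict(zip(edges, coloring))
    return any(color[(a, b)] == color[(a, c)] == color[(b, c)]
               for a, b, c in combinations(nodes, 3))

# 2**15 = 32,768 possible colorings; all of them contain a monochromatic triangle.
print(all(has_mono_triangle(c) for c in product((0, 1), repeat=15)))  # True
```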

The requirement for rigorous problem definitions prevents such problems early in the discovery process, at a stage where the very methods used for discovery are formulated and validated.

Another common pitfall of new method development related to defining the goals of methods is to define them in mathematical terms without establishing how the mathematical goal maps to health science discovery goals or clinical value, or how it addresses known limitations of existing methods. For example, the problem of topological clustering can be defined very precisely (and is a worthwhile and non-trivial mathematical endeavor); however, it is not clear how solving this class of problems overcomes practical limitations of existing methods for concrete health problems.

A related pitfall to imprecise problem definitions is that of “re-inventing the wheel”, that is, creating an unnecessary new solution to a problem for which equally good or better solutions exist. In the history and practice of AI/ML, re-invention of solutions to problems with established solutions is unfortunately rampant. For example, numerous pathway reverse-engineering methods have been proposed in bioinformatics without leveraging or acknowledging the mathematically and algorithmically robust literature on causal discovery, or improving upon its performance. Similarly, the kernel regression method has been re-discovered several times in recent decades. In yet another example, numerous heuristic feature selection algorithms with “non-redundancy” properties have been proposed well after several sound, complete and efficient Markov Boundary algorithms (which solve this class of problems optimally) had been introduced, validated and successfully applied.

Pitfall 5.2

Evaluating the success of methods with poorly defined objectives, by employing ex post facto expert narratives as a proxy of “validity”.

Pitfall 5.3

Defining the goals of methods in mathematical terms but without establishing how the mathematical goals map to the healthcare or health science discovery goals.

Pitfall 5.4

Reinventing the wheel: whereby a “new” method is introduced that has previously been discovered yet is ignored (willfully or not) by the “new” method developers.

Pitfall 5.5

Reinventing a method but making it worse than established methods (…“reinventing the wheel and making it square”!).

  • In-depth real-life example for step 1 (Rigorous problem definition): Development and evaluation of scalable discovery methods of biomarkers and molecular pathways from high dimensional biomedical data.

  • Investigators started by asking: how can we make this problem concrete? Biomarkers are a diverse group of conceptual entities which across the literature include the following: (1) substitutes (proxy outcomes) for outcomes in clinical trials; (2) downstream effects of interventions that are indicative of toxicity and adverse events; (3) complex computable models (aka “signatures”) that can be used to diagnose, prognosticate or precision-treat patients. Pathways, on the other hand, are causal subsystems in bigger healthcare or biological systems with defined functions and modularity. Pathways can be instrumental in revealing drug targets for new therapeutics. Biomarkers are involved in one or more drug target pathways [8,9,10].

  • One can see that the general concepts of biomarkers and related disease or drug pathways include, therefore, both predictive and causal modeling, which indicates that one would need to develop methods that seamlessly support both types of reasoning. In addition, such development is sensitive to the need for discovery and modeling with very high dimensional and often low sample size datasets.

  • One natural way to frame the predictive/prognostic/outcome-surrogate biomarker discovery aspect is as a feature selection problem, and the causal aspect as a causal induction problem. Recall from chapters “Foundations and Properties of AI/ML Systems”, “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare and Health Science”, and “Foundations of Causal ML” that the feature selection problem in its simplest standard form can be stated as: given a distribution P with variable set V including a response variable (“target”) T, and data sampled i.i.d. from P, find the smallest set of variables S in V such that S contains all predictive information about T in P. The observational causal induction problem can be stated as: given a distribution P generated by a causal process C comprising a variable set V including a response variable (“target”) T, and data sampled i.i.d. from P without interventions on V, find the set of ordered relations (Vi, Vj) such that Vi directly causes Vj, where the causal semantics are as follows: Vi temporally precedes Vj, and a randomized experiment determining the values of Vi yields changes in the distribution of Vj (compared to not manipulating Vi). (A compact formal restatement of the feature selection problem is given after this list.)

  • These definitions directly point to three computational and mathematical frameworks for AI/ML modeling addressing the problem definitions: first, theory of relevancy; second, theory of causal graph induction; and third, theory of Bayesian Networks and in particular Markov Boundaries [11,12,13,14].
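
For concreteness, the feature selection problem stated above can be written compactly in standard notation (this restates the prose definition, with “contains all predictive information about T” expressed as conditional independence given S):

```latex
% Feature selection problem (minimal predictively sufficient set):
\text{Given } P \text{ over } V \ni T \text{ and data } D \overset{\text{iid}}{\sim} P,\qquad
\min_{S \subseteq V \setminus \{T\}} |S|
\quad \text{subject to} \quad
T \;\perp\!\!\!\perp\; V \setminus (S \cup \{T\}) \;\big|\; S .
```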

Step 2. Theoretical analysis of problem (complexity, problem space characteristics etc.)

The second step, after the method goals are precisely defined, is to conduct theoretical analysis that reveals the characteristics of the problem across all possible AI/ML methods that could be employed. This may at first glance seem to the non-expert exceedingly hard, or even impossible, since there is an infinity of possible AI/ML algorithms that one can devise. In practice, a precise and thoughtful problem definition in step 1 of the development process often makes it feasible, or even easy, to derive properties by mapping the problem considered (i.e., by establishing its correspondence) to known problems which themselves have established properties.

For example, recall from chapter “Foundations and Properties of AI/ML Systems” that one of the foundational achievements of computer science is the theoretical toolkit for proving that a problem class has a certain computational complexity. Similarly, the whole practice of Operations Research relies on having a catalog of prototypical problems with efficient solution algorithms, such that practitioners can solve an unlimited variety of problems just by mapping them to this smaller set of pre-established archetypal problem solutions [15].

If such mapping is not possible, the methods developers or evaluators can apply other established more granular and general-purpose techniques from the field of design and analysis of algorithms to understand the feasibility and hardness of the problem space [16].

  • Real-life example for step 2 (Theoretical analysis of problem). Development or evaluation of scalable discovery methods of biomarkers and molecular pathways from high dimensional biomedical data, CONTINUED.

  • How would one go about characterizing the theoretical properties of the problem as formally defined in step 1? Recall from chapters “Foundations and Properties of AI/ML Systems” and “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare and Health Science” that existing theory of relevancy includes the Kohavi-John (K-J) framework, which differentiates between strongly relevant, weakly relevant and irrelevant features. Strongly relevant features in K-J theory are features that have unique information and can never be dropped by feature selection without loss of predictive signal. Weakly relevant features are predictive but lack unique signal, and thus can be dropped by feature selection without loss of predictivity. Irrelevant features carry no signal, so they are effectively noise (for the target T) and can be dismissed by feature selection. So there is a theory guiding the discovery of predictive biomarkers. What about the causal aspect?

  • As we saw, the connective tissue between causation, feature selection and causal discovery is given, locally around a target variable T, by the Markov Boundary (MB) of T. Specifically, a Markov Blanket of T is any set of variables that renders all other variables in the data independent of T, once we know the Markov Blanket variables. A Markov Boundary is a minimal Markov Blanket, which means that we cannot discard any member of the Markov Boundary without losing its Markov Blanket property. Moreover, in a vast family of distributions (the majority of all possible distributions, including all distributions modeled by classical statistics), called Faithful distributions, and when no latent variables are present: (a) there is always a single Markov Boundary, and (b) the Markov Boundary comprises the direct causes, plus the direct effects, plus the direct causes of the direct effects of T (the so-called “spouses” of T). Thus, the Markov Boundary contains the local causal pathway around T (minus those spouses that lack causal edges to T). Moreover, because BNs are probability and decision theoretically sound, MBs are also consistent with probability and decision theory. Finally, Markov Boundary feature selection is connected with K-J relevancy since in Faithful distributions the strongly relevant features are exactly the members of the Markov Boundary [11,12,13,14]. (A small sketch of this composition appears after this list.)

  • So far the problem space was well-characterized and looked feasible, and the natural next question was: what algorithms solve it?
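
Under the assumptions above (Faithfulness, no latent variables), the composition of the Markov Boundary can be read directly off a known causal graph; the minimal Python sketch below (using networkx, with a small hypothetical DAG) illustrates the parents + children + spouses construction:

```python
# Hedged sketch: reading the Markov boundary of T off a *known* causal DAG,
# per the composition stated above: MB(T) = parents(T) + children(T) +
# other parents of T's children ("spouses"). The tiny graph is hypothetical.
import networkx as nx

g = nx.DiGraph([("A", "T"), ("B", "T"), ("T", "C"), ("D", "C"), ("E", "A")])

def markov_boundary(g, t):
    parents = set(g.predecessors(t))
    children = set(g.successors(t))
    spouses = {p for c in children for p in g.predecessors(c)} - {t}
    return parents | children | spouses

print(sorted(markov_boundary(g, "T")))   # ['A', 'B', 'C', 'D'] -- E is excluded
```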

Step 3. First-pass algorithms solving problem

The third stage of new AI/ML method development is a first attempt at identifying or (if none exists) creating an algorithmic method that solves the problem as previously defined and analyzed. If the problem has been precisely defined and its properties established in steps 1 and 2, then it is often easy to modify existing methods or put together a first algorithm that solves the problem. Typically this first-pass solution is not meant to be optimally efficient, as such optimization is attempted in subsequent steps. Evaluation of an existing method may also apply here if the existing method is an early-stage one.

  • Real-life example for step 3 (First-pass algorithms solving the problem). Development or evaluation of scalable discovery methods of biomarkers and molecular pathways from high dimensional biomedical data, CONTINUED.

  • In our running example, the theoretical specifications and analysis described above provide solid footing for moving to first-pass algorithm development. Algorithms that solve the example problem do exist (chapters “Foundations and Properties of AI/ML Systems”, “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare and Health Science”) and have solid properties; however, for illustrative and pedagogical purposes we will consider their development at the time when they did not exist and had to be invented.

  • Developing such new algorithms was a necessity in the early 2000s because at the time there were no sound and scalable algorithms for discovery of Markov Boundaries from data. In principle, one could use existing causal graph induction algorithms to discover the whole graph and then extract the Markov Boundaries from it. However, those algorithms were not scalable beyond approximately 100 variables in practice, and it had been shown that learning the causal graph with Bayesian search-and-score algorithms is NP-Hard, whereas the conditional-independence, constraint-based algorithm family is worst-case exponential (chapter “Foundations of Causal ML”). Heuristic algorithms introduced by Koller and Sahami in 1996 and by Cooper et al. in 1997 were informative first attempts but not sound. The Koller-Sahami algorithm was also a poor empirical performer, whereas the Cooper et al. algorithm performed better on small data but was not scalable in compute time or sample size. Another then-recent algorithm by Margaritis and Thrun was sound but neither scalable nor sample efficient. With these necessities in mind, Tsamardinos and Aliferis invented IAMB, a novel sound and scalable algorithm with variants that could be instantiated in a variety of ways (e.g., by combining it with full-graph algorithms for intermediate-result pruning or post-processing) [18,19,20,21].
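
To convey the flavor of such an algorithm, here is a heavily simplified, illustrative Python sketch of an IAMB-style Markov blanket search; the `assoc` and `indep` statistical primitives are placeholders the reader must supply, and this is not the authors’ reference implementation:

```python
# Simplified IAMB-style sketch (illustrative only). `assoc(x, t, z)` returns a
# conditional association strength and `indep(x, t, z)` a conditional
# independence decision; both are placeholders for real statistical tests.
def iamb(variables, target, assoc, indep):
    mb = set()
    # Forward (growing) phase: greedily admit the most associated variable.
    while True:
        candidates = [x for x in variables if x != target and x not in mb]
        if not candidates:
            break
        best = max(candidates, key=lambda x: assoc(x, target, mb))
        if indep(best, target, mb):       # nothing left that carries signal
            break
        mb.add(best)
    # Backward (shrinking) phase: remove false positives admitted early on.
    for x in list(mb):
        if indep(x, target, mb - {x}):
            mb.discard(x)
    return mb
```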

Step 4. Theoretical properties of first pass algorithms: focus on representation power, soundness and completeness.

Once a first algorithm that prima facie solves the problem has been created, its key theoretical properties should ideally be established, typically under (a) sufficient or necessary conditions, or (b) sufficient and necessary conditions. If analysis shows that no such conditions exist, or that they are too narrow and unworkable in real-life conditions, then revisiting step 3 is mandated, and steps 3 and 4 are iterated until a method has been identified that guarantees soundness and completeness (or reasonable approximations thereof) under realistic real-life conditions.

  • Real-life example for step 4 (Theoretical properties of first pass algorithms: focus on representation power, soundness and completeness). Development or evaluation of scalable discovery methods of biomarkers and molecular pathways from high dimensional biomedical data, CONTINUED.

  • In our running example, and from the viewpoint of when the corresponding development steps were conducted, the IAMB algorithm family was theoretically sound and complete. In preliminary empirical analyses its inventors established that it is both sound and scalable. IAMB was, however, a definitional Markov Boundary algorithm, meaning that it applies the definition of the Markov Boundary in each step of its operation. This entailed conducting conditional independence tests with large conditioning sets, which made its sample size requirements grow exponentially with the number of conditioning variables. So refinements and optimizations were needed, which led to a second (more refined) family of algorithms (see below).

Step 5. Algorithm refinements, extensions and derivatives

Step 5 is a multi-faceted and critical step in the sense that its success will determine the practical utility of the new method. This is the step where various optimizations and adaptations to real-life performance requirements take place. It has several sub-steps that may require many years’ worth of incremental improvements. Hardly ever is the full range of optimizations accomplished close to the first time a method appears in the literature. It is not uncommon for such efforts to constitute a career-long research program of the method innovators and their teams, as well as of independent researchers. The refinement sub-steps comprise:

Step 5.a. Performance Optimizations that cover achieving high computational efficiency for learning and for using models; optimizations for other costs (for example cost to obtain and store input data and run analyses on a compute environment, either at model discovery or at model deployment time); efficient space complexity for learning, storing and using models; efficient sample complexity and establishing learning curves and broad power-sample requirements.

  • Real-life example for step 5.a. (Performance Optimizations - Sample efficient algorithms and going beyond local pathways). Development or evaluation of scalable discovery methods of biomarkers and molecular pathways from high dimensional biomedical data, CONTINUED.

  • In our running example, the second-pass algorithms designed to overcome IAMB’s sample size inefficiency described earlier were HITON and MMMB. These were compositional algorithms; that is, instead of applying the definition of the Markov Boundary directly, they composed it edge-by-edge (exploiting the link between graphical and probabilistic Markov Boundaries and causality).

  • At the core of the causal discovery problem from observational data lies a foundational theorem (chapter “Foundations of Causal ML”) which states that to establish a direct edge between variables Vi and Vj one must establish the conditional dependence of these variables given every subset of the remaining variables in the data [13]. This is clearly an exponential-cost operation in both compute time and sample, which becomes super-astronomical in datasets with more than a few dozen variables. Thus smart algorithms are needed to exploit the sparsity of causal processes and quickly identify a single subset that shows independence, so that the vast majority of tests can be omitted from computation.

  • With this asymmetry in mind, the designers of HITON and MMMB sought to apply informed search functions that would quickly identify the single subset needed to establish that a variable Vi is not directly linked to T, for every Vi in the data. Whereas the algorithms are still worst-case exponential (because the problem itself is worst-case exponential regardless of algorithm), their exponentiality is directly linked to the connectivity of the underlying causal structure. For sparse or predominantly sparse causal data-generating processes (as most biological networks are, for example), the majority of the causal process can be identified quickly if the algorithms’ time complexity is adaptive to the connectivity. Both HITON and MMMB are locally adaptive to the hardness of the causal problem at hand. Additional variants return direct causes and effects only (HITON-PC, MMPC) and Markov Boundaries (HITON-MB, MMMB).

  • Because of the compositional nature of this second generation of algorithms, these investigators and their collaborators were able to introduce algorithm variants that discover not just the Markov Boundary of T, but also the local causal pathways (causal neighborhood) around T only (i.e., without the need to find spouses and remove them with post-processing), local causal regions of depth k (depth specified by the user), and the full causal graph by inducing all local causal edges around every variable in the data (algorithm MMHC and its generalized family LGL) [22,23,24].
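
The asymmetry exploited here can be made concrete with a small sketch: declaring an edge requires dependence given every conditioning subset, whereas eliminating a candidate requires finding only one separating subset, so the search can stop at the first independence found. The sketch below is pedagogical; the `indep` test and the cap on conditioning set size are illustrative assumptions, not the published HITON/MMPC code:

```python
# Sketch of the elimination step used by compositional local algorithms:
# a candidate X is dropped from the parents/children set of T as soon as ONE
# subset of the current candidate set renders X independent of T.
from itertools import chain, combinations

def subsets(s, max_size=None):
    max_size = len(s) if max_size is None else max_size
    return chain.from_iterable(combinations(s, k) for k in range(max_size + 1))

def eliminate(candidates, target, indep, max_cond_size=3):
    kept = set(candidates)
    for x in list(kept):
        others = kept - {x}
        # Early exit: the first separating subset found removes X; only when no
        # subset separates X and T does X stay, which is the expensive direction.
        if any(indep(x, target, set(z)) for z in subsets(others, max_cond_size)):
            kept.discard(x)
    return kept
```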

Step 5.b. Parallelization, distributed/federated, sequential, and chunked versions [25,26,27]. These are derivative and enhanced forms of the main algorithmic solution with the following properties:

  • Parallel algorithms: can be run in parallel computing architectures and environments whereby the computational steps are divided among many processors. The parallelization can be coarse, intermediate, or fine grained, depending on how large and complicated the unit computations divided among the parallel processors are.

  • Distributed/federated algorithms: operate across federated and distributed databases which exist in diverse locations without the need to bring all data into a centralized database and computing environment.

  • Sequential algorithms operate in steps corresponding to incremental availability of input data, for example with increasing sample size over time, or with increasing sets of variables over time. At each processing step, a different set of results is obtained that over time increases in quality (i.e., converges to the right solution or approximation thereof).

  • Chunked algorithms address the situation where the data is so large that it overwhelms the memory limits of the computing environment. The data is then divided into chunks, each of which fits in memory, and analysis proceeds across all chunks until all data is analyzed and final results are obtained (a toy sketch of this idea appears after this list).

  • Real-life example for step 5.b. (Performance Optimizations - Parallelization, distributed/federated, sequential, chunked, versions). Development or evaluation of scalable discovery methods of biomarkers and molecular pathways from high dimensional biomedical data, CONTINUED.

  • In our running example, and motivated by the need to process datasets with vast numbers of variables not fitting in the single-computer memories of the time, the investigator team was also compelled to invent a chunked version of IAMB and HITON. Because of the nature of these algorithms, it was immediately obvious that they could also be modified to obtain parallel, distributed, and sequential versions as well. Over the years, the investigator team conducted massive experiments in parallel compute clusters taking advantage of these algorithms. These algorithms also gave important insights into the feasibility and requirements for sound federated Markov Boundary and causal discovery. For example, it was established that sound federated/distributed Markov Boundary discovery requires exactly two passes of local processing plus one global step over the results of the first two passes, and that a subset of variables has to be shared among all nodes depending on intermediate results [28].
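
A toy sketch of the chunked idea follows (hypothetical helper names; a real chunked Markov Boundary algorithm must also handle dependencies that span chunks, which this sketch ignores):

```python
# Toy sketch of variable-chunked processing when the full variable set does not
# fit in memory: filter each chunk locally, pool the survivors, then run the
# full Markov-boundary procedure on the (much smaller) pooled set.
# `local_filter` and `markov_boundary_algo` are placeholders for real methods.
def chunked_discovery(all_vars, target, chunk_size, local_filter, markov_boundary_algo):
    survivors = set()
    for i in range(0, len(all_vars), chunk_size):
        chunk = all_vars[i:i + chunk_size]
        survivors |= set(local_filter(chunk, target))        # per-chunk screening
    return markov_boundary_algo(sorted(survivors), target)   # final pass on pooled candidates
```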

Step 5.c. Relaxing assumptions/requirements. This step corresponds to efforts to relax the assumptions guaranteeing a method’s properties and to broaden the space of conditions under which the new or existing method will have the desired guaranteed properties.

Real-life relevance. For example, extending Bayesian classifiers from the restricted distributions of Naive Bayes to algorithms that can operate on all discrete distributions. Other examples include: extending decision trees from discrete to continuous data, extending single-tree models to ensembles of trees (e.g., Random Forests) that are more robust to sampling variance, extending artificial neural networks from linearly separable functions to non-linearly separable ones, extending SVM binary classification to multi-class classification, extending SVMs from noiseless data to noisy data, extending linear to non-linear SVMs, extending standard Cox Proportional Hazards regression to accommodate time-dependent covariate effects, extending causal discovery algorithms that require no hidden (aka latent) variables to ones that can operate in the presence of latents, etc. (see chapters “Foundations and Properties of AI/ML Systems”, “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare and Health Science”, and “Foundations of Causal ML” for details and more examples).

  • In-depth example for step 5.c. (Performance Optimizations: Relaxing assumptions/requirements allowing equivalence classes, latent variables, guided experimentation). Development or evaluation of scalable discovery methods of biomarkers and molecular pathways from high dimensional biomedical data, CONTINUED.

  • In our running example, the methods outlined so far depended on two fundamental assumptions: one is Faithful distributions and the other is causal sufficiency (i.e., no unmeasured confounders). To address the former, the development team introduced new algorithms addressing distributions with information equivalences, i.e., distributions where several variable groups can have the same information regarding the target response. Such situations are very common in omics data, complex survey data, clinical data, and many other types of biomedical data. The result of information equivalency is the existence of multiple (not just one) Markov Boundaries and ambiguity of the causal pathways.

  • Specifically, in such distributions there is an equivalence class containing multiple statistically indistinguishable MBs and local causal edges and the size of this class can be exponential to the number of variables. Statnikov et al. introduced a family of algorithms called TIE* which extract from data all MBs and local causal neighborhoods. TIE* algorithm family members are sound, complete and adaptable to the distribution at hand by choice of conditional independence tests and component subroutines (used for single Markov Boundary induction).

  • The second relaxation concerning latent variables was addressed by post-processing the main algorithms’ results with algorithms that can detect presence of hidden variables (e.g., IC* and FCI, see chapter “Foundations of Causal ML”). Such algorithms could not be used for end-to-end analysis because they are not scalable and in most cases they are also error prone empirically.

  • Another important algorithmic extension addressing both equivalence classes and latent variables was the introduction of algorithms that resolve these ambiguities by limited algorithm-guided experimentation. Specifically, the ODLP algorithm family combines MB and local neighborhood algorithms with equivalence class algorithms and guides an experimenter to conduct a series of experiments that resolve statistical ambiguity due to latents and equivalence classes due to information equivalency. The ODLP algorithms attempt to minimize the number of experiments, and their worst-case number of experiments is at most the total number of variables in the equivalence class of the local pathway [29,30,31].

Step 5.d. Generalized frameworks (generalized family of algorithms and generalized conditions for performance guarantees).

This step involves extending the new method to a more general family of interrelated similar methods. It also involves establishing testable rules for instantiating the family into specific methods in that class, and testable rules guaranteeing that the properties of the family will be shared by every method in that family (without further need to prove these properties or empirically test them).

It is not always obvious whether such generalizations are possible and, if so, how to accomplish them. Thus this step is not always pursued, especially in the initial stages of developing a new method. But whenever it is possible, it confers a number of important benefits, which we summarize here. Developing a generalization of a fundamental method explains in mathematically precise terms how the core method can be modified so that:

  (a) It can address slightly different problem instance classes and situations;

  (b) It allows for modifications that do not alter its foundational nature;

  (c) It will enable other method developers to create variations without having to undergo the whole development process from scratch, while at the same time inheriting the main performance and other properties of the core method;

  (d) It will help understand other methods and their properties by showing how they relate to the generalized core method;

  (e) It will prevent confusion about apparently similar methods with different properties or apparently different methods with the same properties; and

  (f) It will protect scientific priority claims and commercial intellectual property, by establishing which methods are just variations or derivatives of the original core methods, especially when such variations and derivatives were anticipated by the generalized framework.

Real-life relevance. Examples of generalized families of inter-related algorithms are many and include: the Best-First-Search algorithm family (chapter “Foundations and Properties of AI/ML Systems”), the General Linear Model (GLM) family (chapter “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare and Health Science”), the penalty+loss regularized classifier family (chapter “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare and Health Science”), the Generalized Local Learning (GLL) and Local to Global causal discovery (LGL) algorithm families (chapters “Foundations and Properties of AI/ML Systems”, “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare and Health Science”, and “Foundations of Causal ML”), and the TIE* family for equivalence class modeling (chapters “Foundations and Properties of AI/ML Systems”, “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare and Health Science”).

  • In-depth example for step 5.d. (Performance Optimizations: Generalized frameworks). Development or evaluation of scalable discovery methods of biomarkers and molecular pathways from high dimensional biomedical data, CONTINUED.

  • In our running example of developing principled and scalable biomarker and pathway discovery algorithms, it was realized within the team developing these algorithms that infinitely many variations could be had that would preserve the soundness, completeness and other desired properties of the first algorithms introduced. To facilitate the study, further analysis and development of these families of algorithms in a systematic way while minimizing confusion, Aliferis et al. introduced a generalization of HITON-PC/HITON-MB and MMPC/MMMB, and of MMHC.

  • The former was termed the GLL family and the latter LGL. Around the same time, Statnikov and Aliferis introduced the generalized TIE family termed TIE*, and later generalized versions of ODLP and of the parallel/distributed/sequential and chunked variants. Aliferis et al. introduced the notion of a 2-part Generative Algorithm Framework comprising:

    1. A general template statement of the algorithm family.

    2. Admissibility rules for instantiation of the template components.

  • If the admissibility rules are followed when instantiating components, then the instantiated algorithm is guaranteed to have the family’s properties, without the need for de novo theory or proofs.

  • Figure 6 illustrates one instantiation of the GLL-PC generative template algorithm (shown in 6a) under the admissibility rules of the Generative Algorithm Framework (shown in 6b), which can yield an infinity of algorithms with guaranteed properties, such as the original MMPC (presented in 6c). Contrary to the intricacy of the original MMPC, however, the generative framework describes the whole family of algorithms by specification of a few simple components.

  • Moreover, a Generative Algorithm Framework does not just allow the re-creation and compact representation of pre-existing algorithms: as shown by these investigators, several new instantiations of the original generalized algorithms were demonstrated, and they matched or exceeded the empirical performance of the original set of algorithms in validation data. One of the new algorithms is shown in part (6d) of the figure. The new instantiations exhibited different traces of navigating the solution space toward the correct (same) output, demonstrating that they are not just a rehash of known algorithms [1, 29,30,31,32].
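
In software terms, such a generative template is naturally expressed as a higher-order procedure with pluggable components whose admissibility is what guarantees that every instantiation inherits the family’s properties. The Python sketch below is a loose, hypothetical illustration of this idea (placeholder component names; it is not the published GLL-PC specification):

```python
# Sketch of a 2-part generative framework: a fixed template plus pluggable
# components. Admissibility rules (e.g., the elimination strategy must be a
# valid conditional-independence-based filter) are what would guarantee that
# every instantiation inherits the family's theoretical properties.
def local_discovery_template(variables, target, inclusion_heuristic, elimination_strategy):
    candidates = []
    open_list = [x for x in variables if x != target]
    while open_list:
        x = inclusion_heuristic(open_list, target, candidates)   # component 1
        open_list.remove(x)
        candidates.append(x)
        candidates = elimination_strategy(candidates, target)    # component 2
    return candidates

# Two different admissible component choices instantiate two different members
# of the same family, both inheriting the template's guarantees.
```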

Fig. 6

Example of a Generative Algorithm Framework. GLL-PC is shown here generating two of the infinitely many members of this family of algorithms (the pre-existing MMPC algorithm in part (c) and the new algorithm semi-interleaved HITON-PC with symmetry correction in part (d)), with guaranteed properties, merely by instantiating a simple general template statement (part (a)) following a small set of admissibility rules (part (b))

Step 5.e. Nested algorithms, embedding protocols and stacks, interactions with data design.

Real-life relevance. Per traditional computer science and AI/ML practice, algorithms can be used as subroutines in higher complexity algorithms. For example, decision tree induction algorithms can be used inside Random Forests. Weak learning algorithms can be used as components of boosting algorithms. Algorithms of various kinds can be components of Stacked ML models, and so on.

Generative Algorithm Frameworks can also be used to create more complicated and hierarchically-nested algorithm families, creating nested systems of generalized algorithms tackling increasingly complex problem-solving AI/ML constructs. For example, the GLL-PC generative family is nested in the GLL-MB family, which is nested in the TIE* family, which is nested in the ODLP* family. The nesting does not force use of the most complex level of algorithm; on the contrary, the algorithms with the smallest complexity that solve a problem are sufficient for that problem.

Algorithms and their implementations are the conceptual and scientific backbone and “engines” of real-life AI/ML. Complicated tools and systems designed to solve health science and healthcare problems are almost always organized in complex data science “stacks”.

AI/ML (or data science) Stack: a hierarchically-integrated architecture for AI/ML software system delivery comprising data input management at the lower level, going upward to model selection, to error estimation, to error management, to decision support delivery and end-user interfacing, to embedding and healthcare integration, to model monitoring and full model lifecycle support (see chapter “The Development Process and Lifecycle of Clinical Grade and Other Safety and Performance-Sensitive AI/ML Models” for details).

At the core of the ML stack is the ML protocol (see chapter “Foundations and Properties of AI/ML Systems” and “The development process and lifecycle of clinical grade and other safety and performance-sensitive AI/ML models” for details).

  • ML Protocol. An ML system architecture implements an ML method, which can be understood as a combination of data design, algorithm, and model selection procedure that ideally incorporates an error estimation procedure. The higher-level algorithm that combines the ML algorithms, data processing subroutines, model selection strategies and error estimators used is the ML protocol (a minimal protocol sketch is given after this list).

  • AI/ML “Pipelines”, Automodelers, and Platforms. These are discrete software entities embodying the implementation of the chosen algorithms and protocol, plus all other layers of the full AI/ML stack, and are designed to be used reproducibly, either in fully automatic mode (an “automodeler”), semi-automatically, or as a component of a broader modeling system (a “pipeline”). Platforms refer to even more complex software systems with additional facilities for user experimentation, model sandboxing, training, model development, integration with other enterprise systems, etc.
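
As a generic, concrete instance of an ML protocol, the sketch below wraps a learning algorithm in model selection (inner cross-validation loop) and error estimation (outer loop) using scikit-learn; the choice of learner, hyperparameter grid, and synthetic data are illustrative assumptions and are not tied to any specific system described in this chapter:

```python
# Minimal ML protocol sketch: algorithm + model selection (inner CV) +
# error estimation (outer CV). Nested CV keeps the reported performance
# estimate separate from the data used to tune hyperparameters.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=50, random_state=0)

inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

model_selection = GridSearchCV(                 # inner loop: choose hyperparameters
    SVC(), param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.01]},
    cv=inner, scoring="roc_auc")

scores = cross_val_score(model_selection, X, y, cv=outer, scoring="roc_auc")  # outer loop
print(f"nested-CV AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```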

Best Practice 5.5

The properties of a ML algorithm can be negatively or positively affected by the ML protocol to extreme degrees (see chapter “Lessons Learned from Historical Failures, Limitations and Successes of AI/ML In Healthcare and the Health Sciences. Enduring problems, and the Role of Best Practices” for several important case studies that show the immense and often under-appreciated practical consequences). Similarly the data design can negatively or positively affect the ML protocol and its embedded algorithms to extreme degrees. Therefore, it is imperative to design AI/ML methods taking into account any positive or negative interactions of data design with the protocols and embedded algorithms employed.

  • In-depth example for step 5.e. (Performance Optimizations: Nested algorithms, embedding protocols and stacks, interactions with data design). Development or evaluation of scalable discovery methods of biomarkers and molecular pathways from high dimensional biomedical data, CONTINUED.

  • In our running example, of developing principled and scalable biomarker and pathway discovery algorithms, a core choice was made to design the algorithms and ML protocols with a focus on nested balanced cross validation as a “canonical” preferred model selection and error estimation protocol (see chapter “The Development Process and Lifecycle of Clinical Grade and Other Safety and Performance-Sensitive AI/ML Models” for more details). The anticipated data designs were primarily case-control or natural, cross-sectional or longitudinal, i.i.d. sample designs. As long as distributions were faithful or TIE, the algorithms were guaranteed to exhibit well-defined and desirable soundness, completeness, computational and sample efficiency properties. Deviations from these “canonical” designs were also tolerated but would need careful tailoring of the methods to data designs outside these “canonical” specifications.

  • Many of the produced algorithms were embedded in software with data ingestion/transform, model selection, error estimation, and adaptive selection among the core methods and state-of-the-art comparators. Examples include the GEMS auto-modeler system for microarray analysis, the FAST-Aims auto-modeler system for mass spectrometry analysis, and several psychiatry-oriented, as well as bioinformatics, clinical and translational predictive and causal modeling stacks and pipelines. The construction of the auto-modelers was further guided by extensive benchmarking of these and comparator methods and cross-referencing to expert analyses in the literature on the same datasets. These empirical performance benchmark studies measured a level of performance that matched or exceeded that of faculty-level experts in AI/ML, with the added advantage of almost immediate analysis, at no cost other than an inexpensive laptop or desktop [33,34,35,36].

  • Figure 7 shows components of the GEMS auto modeler.

Fig. 7
A flow diagram, a screenshot, and a line graph. (a) A wizard user interface is followed by the computational engine, cross-validation, performance computation, gene selection, and classification. (b) The GEMS wizard at step 4 of 9, with a project summary. (c) The accuracy of GEMS exceeds that of the primary studies.

Example of an auto modeler system. Components of the GEMS auto-modeler shown. (a) System architecture. (b) “Wizard”-like user interface (the user could enter values for specific parameters, load analysis templates, or run in fully automatic mode). (c) Empirical results of GEMS analyzing datasets from the relevant literature and comparison with performance of original studies conducted by experts. (d) Algorithms benchmarked to inform the construction of the system [37]. (e) Algorithms chosen to be included in the system based on results of the benchmarks

Step 5.f. Explanation, clarity, and transparency

This step addresses the need for transparency and clarity in the method’s semantics, syntax, inference mechanisms, and mode of operation, including its explainability and the ability of humans to inspect and understand the models produced by the new AI/ML method and the operations that led to these models.

We frequently refer to AI/ML methods as being “black box”, or as “transparent”, “explainable”, or “open box”.

For the purposes of the present book we will define:

  • Black box AI/ML methods and/or models: methods or models for which the user (and possibly even the developer) knows only the inputs and outputs but not the internal operation; or, alternatively, for which the internal operation is accessible but not readily and fully interpretable by humans.

  • Transparent (aka explainable, open, white, or clear box) methods or models: those for which the user and the developer know not only the inputs and outputs but also the internal operation, which is readily and fully interpretable by humans.

The transparency of the new method and its models is critical for (a) debugging the method and models (see chapter “Characterizing, Diagnosing and Managing the Risk of Error of ML and AI Models in Clinical and Organizational Application”), and (b) managing risks associated with its use and establishing trust in the AI/ML method and its outputs (see chapters “Artificial Intelligence (AI) and Machine Learning (ML) for Healthcare and Health Sciences: the Need for Best Practices Enabling Trust in AI and ML” and “Characterizing, Diagnosing and Managing the Risk of Error of ML and AI Models in Clinical and Organizational Application”, as well as “Regulatory Aspects and Ethical Legal Societal Implications (ELSI)”).

When AI/ML methods are transparent we can also explain and justify their results, including on a case-by-case, input-output basis. However, there is a somewhat subtle but important distinction, germane to the purposes of best practices in AI/ML, between explanation by translation and other forms of justification.

  • Justification of a method or a model (and their outputs) is any argument that supports the validity of the method, the models produced by it, and the outputs produced by the models.

  • Explanation of a method or model (and their outputs) by functional translation is a justification of the method or model’s logic by a fully equivalent translation into human-understandable language.

  • Where:

  • Human understandable language includes natural language but also other formalisms readily understood by humans such as decision diagrams, decision trees, propositional and first order logic, etc.

A common pitfall in AI/ML is providing peripheral/oblique and thus inadequate justifications of the model and its decisions, which may be persuasive in some settings but do not “open” the black box in the sense of creating a human-understandable and mathematically equivalent model to the black box model. For example, consider a hypothetical similarity-based “explanation” module of a neural network AI/ML model using exemplars. The module attempts to justify the model’s decision on a case C by presenting a small number of cases similar to C with gold standard labels matching the model’s prediction for C. Because the neural network does not make decisions based on similarity to exemplars, this whole justification exercise amounts to a “sleight of hand”. Similarly, it can be argued that local simple (e.g., linear) approximations to the very complex decision functions underlying black box models attempt to justify the model’s decisions by examining a simplified version of individual local decisions, but do not explain the global, complex inductive logic of the model and its generalization; their meaningfulness and trustworthiness are therefore limited.
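
To make the local-approximation point concrete, here is a hedged, minimal sketch of a LIME-style local linear surrogate (the estimators, perturbation scale, and kernel weighting are illustrative assumptions): it fits a weighted linear model to a black box’s outputs around a single case, so whatever it reveals holds only in that neighborhood and does not translate the model’s global inductive logic.

```python
# Minimal local linear surrogate around one case of a black box classifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
black_box = RandomForestClassifier(random_state=0).fit(X, y)

def local_linear_surrogate(case, n_perturb=2000, scale=0.3, seed=0):
    """Fit a distance-weighted linear model to the black box's outputs
    in a small neighborhood around `case`."""
    rng = np.random.default_rng(seed)
    neighborhood = case + rng.normal(0.0, scale, size=(n_perturb, case.shape[0]))
    preds = black_box.predict_proba(neighborhood)[:, 1]
    weights = np.exp(-np.linalg.norm(neighborhood - case, axis=1) ** 2)
    surrogate = Ridge(alpha=1.0).fit(neighborhood, preds, sample_weight=weights)
    return surrogate.coef_  # local "importances"; valid only near this case

print(local_linear_surrogate(X[0]))
```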

We also highlight the distinction between “open source” and “closed source” software implementing AI/ML methods and models.

  • Open source software is software whose source code is, at minimum, open for inspection. Depending on the specific licensing terms of the open source software, it may or may not come with other rights granted, such as the right to modify the source code and release such modifications under the same or different licensing terms.

  • Closed source (or proprietary) software is software with restrictions on code inspection, use, modifications/derivatives creation, sharing derivatives, or sharing the software.

A pitfall in AI/ML is conflating “open source” with “open box” and “closed source” with “black box” software.

In reality, an open source implementation of a method or model does not entail “open box” status, and closed-source software does not entail “black box” status. Artificial neural network models are notoriously “black box”, yet this does not change if we have access to the code implementing them, or the right to modify and re-distribute the implementation code. Conversely, a closed source implementation of, for example, the ID3 decision tree algorithm can be transparent in terms of both the algorithm used (which is well understood and openly accessible in the literature, and may also be accessible to licensed users) and the models it produces (i.e., decision trees, which are intuitive and readily understood by humans). The latter case requires that the disclosures of the algorithms used by the software are (a) complete and (b) accurate.

Pitfall 5.6

Providing persuasive but peripheral/oblique justifications that lack fidelity to the AI/ML method, the models produced by it, and their decisions.

Pitfall 5.7

Confusing “open source” with “transparent” and “closed source” with “black box”.

In this section we covered only introductory concepts about explainability, since at the method development stage it is typically quite clear whether the method and its models are interpretable. In chapter “The Development Process and Lifecycle of Clinical Grade and Other Safety and Performance-Sensitive AI/ML Models” we will delve into the details of explainable AI (XAI) [38], interpretable ML [39], and some of the key techniques and nuances of explaining black box models.

  • In-depth example for step 5.f. (Performance Optimizations: Explanation, clarity, and transparency). Development or evaluation of scalable discovery methods of biomarkers and molecular pathways from high dimensional biomedical data, CONTINUED.

  • In our running example of developing principled and scalable biomarker and pathway discovery algorithms, a core choice was made to design the algorithms and ML protocols within a causal graphical modeling framework. Causal graphs are very intuitive representations of causality, and Markov Boundaries have an intuitive causal and probabilistic meaning and lead to very compact and transparent models as well. Depending on the classifier used (e.g., decision trees, conditional probability tables/heatmaps, rules, regression), results can be readily reviewed by human non-experts. In addition, the team members devised a more sophisticated method for cases where the MB was used to create black box models (because they were optimally predictive), or for any black box model for that matter: data would be sampled from the black box model and a meta-learning step, involving MB induction and learning of decision tree models over the MB, would produce models equivalent to the black box but perfectly transparent to humans (a minimal illustrative sketch follows). This explanation method was used to understand the black box reasoning of human physicians in the diagnosis of melanoma by Sboner et al. [40, 41]. More details are presented in chapter “The Development Process and Lifecycle of Clinical Grade and Other Safety and Performance-Sensitive AI/ML Models” in the context of general explanation of black box models.
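
Below is a minimal, hedged sketch of this meta-learning explanation idea: sample inputs, label them with the black box, select a compact feature subset, and fit a shallow decision tree surrogate whose fidelity to the black box can be measured. The feature selector here is a simple stand-in for the Markov boundary induction used in the original work, and all estimators and parameters are illustrative assumptions.

```python
# Distill a black box model into a transparent decision tree surrogate.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=0)
black_box = GradientBoostingClassifier(random_state=0).fit(X, y)

# 1. Sample inputs (here: resample and perturb the training inputs) and label
#    them with the black box model's own predictions.
rng = np.random.default_rng(0)
X_sample = X[rng.integers(0, len(X), 5000)] + rng.normal(0, 0.1, (5000, X.shape[1]))
y_bb = black_box.predict(X_sample)

# 2. Select a compact feature subset (stand-in for MB induction) and fit a
#    shallow, transparent decision tree that mimics the black box.
selector = SelectKBest(mutual_info_classif, k=5).fit(X_sample, y_bb)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(
    selector.transform(X_sample), y_bb)

# 3. Check fidelity of the surrogate to the black box and print readable rules.
fidelity = tree.score(selector.transform(X_sample), y_bb)
print(f"surrogate-to-black-box fidelity: {fidelity:.3f}")
print(export_text(tree))
```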

Step 6. Empirically test algorithms in controlled conditions

The next stage in the development of new AI/ML methods is testing them in controlled conditions, whereby they are given data that comply with the sufficient, or necessary and sufficient, conditions for their theoretical performance properties to hold. In these empirical tests, developers also vary parameters such as sample size, variable measurement noise, strength of signal for the functions to be learnt, and dimensionality of the data, as well as various parameters relevant to the specific characteristics of the learning methods (as dictated by their properties established in previous steps). The methods are also tested with data where assumptions are violated to varying degrees, and the effect on performance characteristics is studied.

There are four important types of controlled condition data testing:

  1. (a)

    Simulated data: where developers or other evaluators first define a mathematical or computational model representing the data generating function to be learned. Then they sample from this function via simulation, give the data to the AI/ML method and study its performance characteristics.

  2. (b)

    Label-reshuffled data: these are real data where evaluators randomly re-assign the response variable’s values (aka “labels”) across the dataset (Fig. 8). This has the effect of maintaining the real joint distribution of the inputs, as well as the real marginal distribution of the response variable, while decoupling the inputs from the outputs, so that over multiple such label-reshuffled datasets, on average there is no signal to be learned. This type of simulation is essential for testing ML methods’ error estimation procedures. It is also a valuable tool for testing a learnt model against the null hypothesis of no predictive signal in the data [42, 43] (a minimal illustrative sketch appears after Fig. 8). See also chapter “Overfitting, Underfitting and General Model Overconfidence and Under-Performance Pitfalls and Best Practices in Machine Learning and AI” on model overconfidence for more details.

  3. (c)

    Re-simulated data. Resimulation attempts to create controlled data simulations that are ideally high-fidelity approximations of real data distributions of interest (a minimal illustrative sketch follows this list of testing types). Resimulation works as follows [32, 44]:

    • We start with one or more real life datasets Dreal where all variables (including response variables) have known values.

    • Then we use a learning algorithm A to learn one or more models Mi capturing all, or at least some desired, characteristics of the real-life distribution from which the real data was sampled.

    • Then we sample from the distribution encoded in Mi using simulation (“resimulation” in this context), creating resimulated data Dresim.

    • Then we test the properties of Dresim against the real data Dreal. A variety of distribution similarity metrics, custom predictive modeling, and custom tests of properties can be used to establish that the resimulated data is a high-fidelity representation of the real data.

    • If adjustments are needed, we can iterate between the second and fourth steps until sufficiently accurate resimulated data is obtained.

    • Once we have the high-fidelity resimulated data, we can feed it into the new method, while varying parameters, and obtain performance metrics just as in the case of simple simulation.

    A conceptual nuance about resimulation is that if the resimulated data is perfectly indistinguishable from real data in all modeling aspects of interest to us (not just the joint distribution), that implies that algorithm A is an optimal algorithm for discovering the data generating function of the real distribution. Algorithm A would then represent a correct discovery procedure, rendering further method development effort potentially unnecessary (barring efficiency considerations). In common practice, we rarely have a perfect algorithm A at hand when performing resimulations. Instead, we use algorithms that capture simplified and controllable aspects of the real data, and therefore the performance of our new method on the resimulated data is (loosely) an upper bound on its performance with real data. The rationale is that if a new method cannot learn the process that creates the simplified version of the real data Dreal, i.e., Dresim, then the new method will most likely not be able to learn the much more complicated real-life data generating functions. If a perfect algorithm A does exist in terms of quality of output, the new algorithms may still improve on its efficiency.

  4. (d)

    Statistical tests for distributional or other conditions for method correctness. These typically encompass statistical tests that do not test the algorithms but the data per se. For example, if an algorithm is devised to create regression models under multivariate normality of the data, we can test the real data for conformance to this assumption. This type of test is an important supplement to simulation studies and may also serve as a preferred alternative if no credible simulation or resimulation can be designed, and the algorithm’s theory of correctness is well established.
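
The following is a minimal, hedged sketch of the resimulation loop using a deliberately simple generative stand-in (a multivariate Gaussian over the inputs plus a fitted labeling model); real resimulation studies use far richer generative models and fidelity tests, and every estimator and parameter here is an illustrative assumption.

```python
# Sketch of a resimulation loop: learn, sample, and check fidelity.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in for a real dataset D_real.
X_real, y_real = make_classification(n_samples=500, n_features=15, random_state=0)

# Steps 1-2: learn models capturing characteristics of the real distribution.
gen_mean, gen_cov = X_real.mean(axis=0), np.cov(X_real, rowvar=False)
labeler = LogisticRegression(max_iter=5000).fit(X_real, y_real)

# Step 3: sample resimulated data D_resim from the encoded distribution.
rng = np.random.default_rng(0)
X_resim = rng.multivariate_normal(gen_mean, gen_cov, size=2000)
y_resim = (labeler.predict_proba(X_resim)[:, 1] > rng.uniform(size=2000)).astype(int)

# Step 4: test D_resim against D_real, e.g., with a "real vs. resimulated"
# discriminator; AUC near 0.5 suggests the two are hard to tell apart.
X_mix = np.vstack([X_real, X_resim])
src = np.r_[np.zeros(len(X_real)), np.ones(len(X_resim))]
auc = cross_val_score(LogisticRegression(max_iter=5000), X_mix, src,
                      cv=5, scoring="roc_auc").mean()
print(f"real-vs-resimulated discriminator AUC: {auc:.3f} (closer to 0.5 is better)")
```

The “real versus resimulated” discriminator is one convenient fidelity check; a full study would add further distributional and property-specific tests before using the resimulated data to benchmark a new method.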

Fig. 8
3 histograms. (a) Frequency versus area under the ROC curve. The bars follow a fluctuating normal distribution between 0.6 and 0.9, with the uninformative prediction at 0.5 on the x-axis. (b and c) Probability of observing versus AUC. The original model is at around 0.7 and 0.5 for Beer et al. and Bhattacharjee et al., respectively.

Label reshuffling and its uses. In the top panel (a), the distribution of estimated predictivity for a hypothetical new ML method operating on a dataset that has been label-reshuffled (measured as area under the ROC curve—AUC ROC) is shown in blue. As can be seen, the distribution is not centered at the 0.5 point (the uninformative or no-signal point) of the AUC ROC. This indicates that the new ML method significantly overestimates performance [42]. In the bottom left panel (b), from modeling of a real dataset, it can be seen that (1) the modeling protocol does not overestimate model performance (because the distribution is centered on AUC ROC 0.5) and (2) the actual model obtained (red line) has performance that is statistically significantly different from that of the null hypothesis (i.e., no signal, represented again by the 0.5 point of the AUC ROC distribution over label-reshuffled datasets). By contrast, the analysis of the data depicted in the bottom right panel (c) leads to a model devoid of signal. Again, the protocol used is unbiased with respect to error estimation [43]
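
A minimal, hedged sketch of the label-reshuffling idea follows (the estimator, protocol, and number of permutations are illustrative assumptions): the full error-estimation protocol is re-run on label-reshuffled copies of the data to obtain a null AUC distribution, which should center near 0.5 for an unbiased protocol and against which the observed AUC can be compared.

```python
# Label-reshuffling (permutation) check of an error-estimation protocol.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=30, random_state=0)

def protocol(X, y):
    """The full error-estimation protocol under test (here: 5-fold CV AUC)."""
    return cross_val_score(LogisticRegression(max_iter=5000), X, y,
                           cv=5, scoring="roc_auc").mean()

observed_auc = protocol(X, y)

# Null distribution: repeat the whole protocol on label-reshuffled copies of
# the data. An unbiased protocol should center this distribution near AUC 0.5.
rng = np.random.default_rng(0)
null_aucs = np.array([protocol(X, rng.permutation(y)) for _ in range(100)])

p_value = (np.sum(null_aucs >= observed_auc) + 1) / (len(null_aucs) + 1)
print(f"observed AUC={observed_auc:.3f}, null mean={null_aucs.mean():.3f}, "
      f"permutation p≈{p_value:.3f}")
```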

  • In-depth example for step 6 (Empirically test algorithms in controlled conditions). Development or evaluation of scalable discovery methods of biomarkers and molecular pathways from high dimensional biomedical data, CONTINUED.

  • In our running example of developing principled and scalable biomarker and pathway discovery algorithms, all of the algorithms discussed in section “Over- and Under-Interpreting Results” were tested extensively with simulated and resimulated data. These experiments revealed a number of important properties, including: (1) verifying the MB theoretical expectation of maximum compactness and maximum predictivity; (2) verifying causal consistency; (3) showing high stability; (4) showing large added value over comparators and over random selection; (5) showing the role of various hyperparameters; (6) establishing the natural false discovery rate control in the GLL algorithm family that protects against false positives due to conducting massive numbers of conditional independence tests; (7) showing the role of inclusion heuristics for computational efficiency; (8) demonstrating how perilous it is to use non-causal comparator methods to infer causality (something that GLL algorithms accomplish very well); and (9) showing the relative insensitivity of the algorithms to hyperparameter choice [1, 32].

Step 7. Empirically test algorithms in real life data with known answers/solutions

This testing involves real data where known gold standard answers have been established by prior research. This is a very informative testing stage because a myriad of factors may exist in real data that have not been anticipated in theoretical analysis or in simulations. Conversely, it is also possible for real data to exhibit greater simplicity than what was anticipated by the method’s developers, which, if true, typically leads to relaxing some of the related assumptions of the new method.

We re-iterate for emphasis that there are two reasons why we do not jump directly to step 7 (i.e., omitting empirical testing under controlled conditions):

First, real data does not afford full control of all relevant parameters that may influence the new method’s performance. For example, we cannot arbitrarily control sample size, signal strength, dimensionality, the percentage of unmeasured variables, connectivity of the causal data generating process, etc., in real data, since these factors have fixed and in many cases unknown characteristics in the available real datasets.

Second, real data with known high quality answers may be very limited and we do not wish to overfit the new method development to a small number of validation datasets (as will invariably happen, see chapter “Overfitting, Underfitting and General Model Overconfidence and Under-Performance Pitfalls and Best Practices in Machine Learning and AI” on overfitting).

Empirically testing new and existing methods with real data with known answers typically follows three designs:

  1. 1.

    Centralized benchmark design. A small group of expert data scientists organizes and executes a series of tests of a new or existing method using several datasets and several state-of-the-art alternative methods (a minimal sketch of such a benchmark loop follows this list of designs). Ideally, all reasonable alternatives and baseline comparators, including the best known methods for this type of problem, are included and executed according to the best known specifications (e.g., the specifications provided by their inventors). The datasets used are sufficient to cover a wide range of data typically encountered in the application domain. A multitude of factors are varied and their effects studied. Simulated and resimulated data may also be included, a priori or post hoc (e.g., to shed light on behaviors observed in real data). Examples of such studies are [1, 30,31,32, 37, 45,46,47,48, 51, 54,55,56, 59, 60].

  2. 2.

    Distributed benchmark design. This design is a variation of the centralized benchmark and typically occurs in the context of a scientific or industry consortium or coalition. A central team of experts organizes a benchmark similar to design 1; however, analyses are conducted by several groups within the consortium. Each group may employ different methods, protocols, etc., and this natural variation is studied by the organizers, who analyze results across participating teams. For an example of this design see [49].

  3. 3.

    Public challenge design. This design is a variation of the distributed benchmark and typically occurs in the context of a participatory science framework. A central team of experts or a challenge organization organizes the challenge typically with one or few datasets and a fixed data design. Analyses are conducted by volunteers across the globe. Each group may employ different methods, protocols, etc., and this natural variation is studied by the organizers who analyze results across participating teams. For an example of this design see [44].
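
For illustration, a hedged sketch of the core of a centralized benchmark loop follows; the datasets, methods, protocol, and metric are placeholders, and a real benchmark would add many more datasets and comparators, method-specific tuning per their inventors’ specifications, and statistical comparison of the results.

```python
# Skeleton of a centralized benchmark: every method on every dataset,
# under one common cross-validation protocol, with results tabulated.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

datasets = {f"dataset_{i}": make_classification(n_samples=300, n_features=40,
                                                random_state=i)
            for i in range(3)}  # stand-ins for a curated collection of real datasets
methods = {"logistic_regression": LogisticRegression(max_iter=5000),
           "random_forest": RandomForestClassifier(random_state=0)}

results = {}
for data_name, (X, y) in datasets.items():
    for method_name, estimator in methods.items():
        aucs = cross_val_score(estimator, X, y, cv=5, scoring="roc_auc")
        results[(data_name, method_name)] = (aucs.mean(), aucs.std())

for (data_name, method_name), (mean_auc, sd_auc) in sorted(results.items()):
    print(f"{data_name:>10} {method_name:>20}: AUC {mean_auc:.3f} ± {sd_auc:.3f}")
```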

Table 1 summarizes the characteristics of each design and points to strengths and weaknesses.

Table 1 Empirical testing of AI/ML with real data and known answers: three alternative designs (green: positive characteristics, red: weaknesses)

ML challenges serve two fundamental purposes. One is educating scientists from different fields about the problems the challenge addresses and giving them data and a platform to experiment. The second is exploring which algorithmic methods are better at a particular point in time for a particular problem. Well-designed challenges can generate valuable information and enhance interdisciplinary engagement. Poorly-designed challenges (which in our estimation are currently the majority) can be very misleading with respect to evaluating AI/ML methods. We elaborate in the following pitfall:

Pitfall 5.8

Issues and pitfalls of ML challenges. In many if not most cases, challenges suffer from fixing the data design and the error estimation thus removing from consideration two out of the three determinants of ML success (i.e., data design, ML model selection and error estimation protocol, algorithm).

Challenges also routinely restrict the design of modeling by pre-selecting variables, and over-simplifying the statement of problems, sometimes to meaningless extremes.

Challenges also often suffer from incomplete or highly biased representation in the competitor pool. Typically participants in challenges are either students or interested scientists who have competencies in areas unrelated  to AI/ML.

Another limitation is that not all appropriate algorithms are entered in a challenge, and when they are entered, they are not necessarily executed according to optimal specifications.

Finally, challenges typically involve a very small number of datasets that do not represent a large domain. Such representative coverage typically requires dozens of datasets or more in the same comparison.

Despite these limitations, a select number of challenges that are designed to a high degree of quality, when interpreted carefully and with the appropriate qualifications, can provide valuable empirical scientific information. For examples of well-designed, high quality, and carefully interpreted challenges the reader may refer to the challenges conducted by the ChaLearn organization [50].

Best Practice 5.6

The preferred design for validating AI/ML methods with real life data with known answers is the centralized benchmark design. Distributed benchmark designs, whenever feasible, add value by exploring natural variation in how methods are applied by experts. Finally, competitions have several intrinsic limitations and have to be interpreted carefully.

  • In-depth example for step 7 (Empirically test algorithms in real life data with known answers/solutions). Development or evaluation of scalable discovery methods of biomarkers and molecular pathways from high dimensional biomedical data, CONTINUED.

  • In our running example of developing principled and scalable biomarker and pathway discovery algorithms, all of the algorithms discussed in section “Over- and Under-Interpreting Results” were also tested extensively with real data where true answers were known or could be established via predictive verification. Statnikov et al. showed superlative performance for biological local pathway discovery. Aphinyanaphongs et al. showed the performance in text categorization in comparison with all major academic and commercial comparators of the time. Statnikov and Aliferis showed the massive gene expression signature equivalence classes in cancer microarray data. Alekseyenko, Statnikov and Aliferis evaluated prediction signatures and causal loci discovery in GWAS data. Ma, Statnikov and Aliferis showed minimization of experiments in biological experimental data. Other benchmarks addressed additional microbiomic, cancer, and other types of data [1, 18, 22,23,24, 29,30,31,32,33,34, 37, 40, 41, 45,46,47, 52, 53].

Step 8. Empirically test algorithms in real life data without known answers/solutions but where future validation can take place

In the final step of the new method development and validation process, the AI/ML method and the models produced by it are tested in real data where the correct answers can be obtained but only prospectively: If the models are predictive we obtain true values and compare them to the predicted values. In causal modeling we conduct randomized controlled experiments and compare the effects of interventions against the algorithmic estimates for such effects. Other forms of validation designs are also possible depending on the nature of the AI/ML methods’ goals and intended outputs.

We caution the reader that only after successful completion of ALL prior steps of new or existing AI/ML method development or appraisal is it warranted to apply the method to real-life problems that carry any risk.

  • In-depth example for step 8 (Empirically test algorithms in real life data with unknown answers/solutions). Development or evaluation of scalable discovery methods of biomarkers and molecular pathways from high dimensional biomedical data, FINAL.

  • In our running example, the methods were deployed in several real-life projects related to basic and translational science discovery, experimental therapeutics, and healthcare improvements. Application areas included: (1) predicting risk for sepsis in the neonatal ICU; (2) diagnosing stroke from stroke-like syndromes using proteomic markers; (3) modeling the decision making and determining melanoma guideline non-compliance of dermatologists; (4) determining which patients with ovarian cancer will benefit from frontline Tx with bevacizumab; (5) understanding mechanisms and predicting outcomes in children with PTSD; (6) creating models that accurately predict citations of articles over deep horizons; (7) creating models that characterize the nature of citations; (8) creating models to classify articles for methodological rigor and content; (9) models that scan the WWW for dangerous medical advice; (10) models for diagnosis of psoriasis using microbiomic signatures from the skin; (11) multi-omic clinical phenotype predictions; (12) discovery of new targets for osteoarthritis; (13) detection of subclinical viral infection using gene signatures from serum; (14) analysis and modeling of longevity using clinical and molecular markers; and (15) modeling of the mediating pathways between exercise and diet and cardiometabolic outcomes for drug target discovery [35, 36, 40, 51, 54,55,56, 59, 60,61,62,63,64,65].

  • In conclusion, the above interrelated case studies give an “insider’s view” of, and showcase the feasibility and benefits of, a complete and rigorous development and validation approach for new methods, using the example of local causal graph and MB algorithms and their extensions.

  • Similar in-depth rigorous development efforts have characterized the history of Bayesian Networks, SVMs, Boosting, Causal inference algorithms, and other methods (see chapters “Foundations and Properties of AI/ML Systems”, “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare and Health Science”, and “Foundations of Causal ML”). However, we do point out that, unfortunately, in many cases widely-adopted methods and tools in both the academic and commercial realms lack many of the steps outlined here and therefore have to be used with caution, especially in high-stakes (high risk, high cost) application domains. The next section gives a highly condensed overview of properties of major AI/ML methods.

Best Practice 5.7

Develop and validate ML/AI methods using the following stages/steps:

  • Step 1. Rigorous problem definition (in precise mathematical terms and establishing how the mathematical goals map to the healthcare or health science discovery goal).

  • Step 2. Theoretical analysis of problem (complexity, problem space characteristics etc.).

  • Step 3. First-pass algorithms solving the problem.

  • Step 4. Theoretical properties of first pass algorithms: focus on representation power, soundness and completeness, transparency.

  • Step 5. Algorithm refinements and optimizations.

  • Step 6. Empirically test algorithms in controlled conditions.

  • Step 7. Empirically test algorithms in real life data with known answers/solutions.

  • Step 8. Empirically test algorithms in real life data without known answers/solutions but where future validation can take place.

Best Practice 5.8

Avoid evaluating methods by employing expert narratives showing “validity”.

Best Practice 5.9

Do not reinvent the wheel. Verify that a new method does not solve a problem previously solved by a better performing method.

Best Practice 5.10

Create open box methods to the full extent possible. Do not pursue weak justifications that fail to translate the models to accurate human readable representations.

Best Practice 5.11

Do not confuse “open source” with “transparent” and “closed source” with “black box”.

A Concise Overview of Properties of Major AI/ML Methods

Table 2 gives a high-level, very concise view on properties of key families of AI/ML methods. A few observations are in order:

  1. 1.

    Heuristic systems by definition lack well-defined, well-understood or confirmed properties. We include them in the table only to remind the reader that they are seriously handicapped in that regard, and should not be used in high-stakes applications (until a better understanding of their risk and benefits is achieved).

  2. 2.

    Most methods that are widely used today have well understood properties, or at most a few gaps in the understanding of their properties.

  3. 3.

    There is no perfect method across all properties and problem types: every method has weak spots.

  4. 4.

    Some methods are compromised by being commonly used for problems for which they should not be (i.e., “user error”). In these cases there is always a better method for that problem category.

  5. 5.

    Statistical machine learning methods have stronger/better studied properties in general.

  6. 6.

    It is possible for some methods to have sub-optimal theoretical properties but exhibit excellent empirical performance in some problem categories and/or selected datasets.

Explanation of terms:

Representation

What kind of relationships can the method represent? In case of modeling methods, are the model assumptions restrictive?

↑ (strong): Any relationship (e.g. Universal Function Approximator)

o (limited): Constrained functional form

↓ (weak): Restrictive assumptions apply

D (Depends): Inherits the property from a component method, has another kind of dependence (e.g., on an underlying distribution), or the property varies across members of a broader category of methods

Semantic clarity

Are the method’s (or model’s) semantics clear? E.g., in the case of a predictive model, is the meaning of the model components clear?

↑ (strong): Yes

o (moderate): Some parts are not clearly defined

↓ (weak): Not having clear meaning

Transparency

Does the method semantics relate to real-world entities in a clear manner?

Is it easily understood by humans?

↑ (strong): The relationship between real-world entities and method components can be explained. Humans easily understand method/models.

↓ (weak): Method is difficult to interpret, additional algorithms are needed to interpret

o (moderate): Between ↑ and ↓

Soundness

When the model assumptions are met, are the results correct?

(E.g. for a predictive model, is the model error optimal?)

↑ (strong): Yes, guaranteed (e.g. convex problem, Optimal Bayes Classifier, etc);

o (medium): May be trapped in local optima; only approximates the target function to some acceptable degree;

↓ (weak): Output may considerably deviate from correct answer

Completeness

Will the method output correct answers for all problem instances?

↑ (strong): Yes, will produce correct answers for full problem space;

o (medium): considerable but not full coverage of problem space;

↓ (weak): only small portions of problem space correctly answered, or very significant regions are omitted.

Compute

Computational complexity. For predictive models, it includes the complexity of both model construction and prediction.

↑ (strong): Very fast. E.g., for executing predictive models, linear in number of variables, and with a small hyperparameter space

o (medium): May build multiple models.

↓ (weak): Requires extensive computing (e.g. immense hyper-parameter space, exponential in the number of variables, etc.)

D (depends): Typical and worst-case are very different; can take advantage of properties of the problem (e.g. graph connectivity)

Space

Storage complexity required for the computation, or for storing the model (if there is a model)

↑ (strong): Approx. number of variables

o (medium): linear in the number of variables

↓ (weak): super-linear in the number of variables or proportional to the data set size (if data set size exceeds order of number of variables)

Sample size

Sample size required to train the model (to an acceptable performance on problems that constitute the preferred use of this method)

↑ (strong): small sample size is sufficient (e.g. linear in the number of variables, in the number of effective parameters, support vectors, etc.)

o (medium): moderate sample sizes required (e.g., low-order polynomial in the number of effective parameters)

↓ (weak): large sample sizes are required (e.g., super-polynomial or higher-order polynomial in the number of effective parameters)

Probabilistic consistency

(1) Model can return probabilities (or output can be converted to probabilities) and (2) probabilistic output is calibrated or can be calibrated

↑ (strong): designed to be probabilistically consistent

o (medium): output can be converted into probabilities

↓ (weak): not meant to produce probabilities and there is no easy way to convert predictions into consistent probabilities

Empirical accuracy in common use

In its common use, what is the method’s empirical performance in terms of result accuracy (e.g. accuracy or AUC for classification models)?

↑ (strong): One of the strongest among methods that are best-of-class for this problem. (E.g. DL for imaging; Cox PH for time-to-event, etc.)

o (medium): Performs noticeably worse than the best, but still outperforms several methods.

↓ (weak): Performs substantially worse than most applicable methods.

Empirical accuracy in recommended use

In its preferred use, what is the method’s empirical performance in terms of result accuracy (e.g. accuracy or AUC for classification models)?

↑ (strong): one of best-in-class methods for this problem. (E.g. DL for images; Cox PH for time-to-event, etc.)

o (medium): performs noticeably worse than the best-in-class, but still useful in some cases.

↓ (weak): Performs substantially worse than best and has no mitigating uses.

Method misapplication in practice

Common: The method is commonly used for tasks for which it is not best-in-class

Uncommon: The method is seldom used for tasks for which it is not best-in-class

Table 2 Map of properties of major AI/ML methods

We advise the reader to study this table and cross-reference with the description of methods and guidance for their use in chapters “Foundations and Properties of AI/ML Systems”, “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare and Health Science”, and “Foundations of Causal ML”.

A Worksheet for Use when Evaluating or Developing AI/ML Methods

It is highly recommended, as new or existing methods are being evaluated, to use a chart such as the one shown in Table 3. There are five purposes in this endeavor: (a) remind the developer or evaluator about the necessary dimensions of validation/appraisal; (b) maintain a record of progress as the various stages of evaluation advance; (c) enforce due diligence both in an absolute sense and in comparison to applicable alternatives; (d) enforce honesty/reduce developer bias in assessing the added value of the evaluated method over established incumbents; and (e) more objectively assess marketing claims about the strengths of commercial products.

Table 3 Worksheet for evaluating new and existing AI/ML methods

Over- and Under-Interpreting Results

We conclude this chapter with a discussion of avoiding over-interpreting and under-interpreting AI/ML results.

A major principle for the scientifically valid use of AI/ML models is their proper interpretation, both for driving healthcare decisions and improvements and for driving discovery in the health sciences. Two major and antithetical pitfalls are the over- and the under-interpretation of results given the data design (see chapter “Data Design”) and the properties of the algorithms and protocols employed.

Pitfall 5.9

Interpreting results of a method (and resulting models) beyond what its known properties justify.

Pitfall 5.10

Interpreting results of a method (and resulting models) below what its known properties justify.

A few examples of the over-interpretation pitfall include:

  1. (a)

    Interpreting weak predictive methods (and resulting models) as if they have much stronger accuracy (usually combined with failing to state the weak aggregate signal of the method’s models, e.g., in the context of regression analyses of biomarkers).

  2. (b)

    Assigning special biological or mechanistic significance to variables because they are stable in resampling or because they are ranked high according to univariate association with a response variable.

  3. (c)

    Generally interpreting causally the findings of predictive methods (and resulting models).

  4. (d)

    Failing to observe that some feature selection methods in omics data commonly do not outperform, or only marginally outperform, random selection.

  5. (e)

    Ignoring the possibility of hidden (aka unmeasured or latent) variables distorting the observed effects of measured variables.

  6. (f)

    Ignoring effects of small sample size variation on results.

  7. (g)

    Assuming (without proof) that case matching according to hypothesized confounders has controlled all confounding. Assuming in SEM modeling that the domain causal structure is known with certainty.

  8. (h)

    Assuming that propensity scoring perfectly controls confounding.

  9. (i)

    Treating coefficient values in regularized regression methods (Lasso family, SVMs, and other “penalty+loss” algorithms) as if they are equivalent to statistical conditioning (e.g., in classical regression).

  10. (j)

    Assuming (without testing) that the assignment of subjects to treatment arms in trials from existing datasets is perfectly randomized and without bias. Etc.

A few examples of the under-interpretation pitfall include:

  1. (a)

    Focusing on small individual variable effects without noticing that the aggregate signal over many variables is large (e.g., in GWAS studies).

  2. (b)

    Focusing on the small coefficient of determination of a model (i.e., total response variance explained) and failing to notice that some variables have strong individual effects (this is the reverse of the previous under-interpretation problem).

  3. (c)

    Failing to pick up strong putative causal factors even when causal ML algorithms indicate their significance, because “correlation is not causation”.

  4. (d)

    Dismissing methods (and resulting models) because they are not statistically stable under resampling.

We will revisit these problems in subsequent chapters as they require a holistic understanding of data design and proper AI/ML algorithm design and execution. The best practice we will state at this point however is:

Best Practice 5.12

Interpret results of application of a method (and resulting models) at the level justified by its known properties.

Key Messages and Concepts Discussed in Chapter “Principles of Rigorous Development and of Appraisal of ML and AI Methods and Systems”

  1. 1.

    Establishing and knowing the properties of AI/ML methods enables informed assessments about the Performance requirements, the Safety requirements, and the Cost-effectiveness requirements of corresponding AI/ML solutions, and leads to building trust in the AI/ML solution.

  2. 2.

    A best practice workflow was presented, that can be used to establish the properties of any new or pre-existing method, tool or system, so that a rational, effective and efficient solution to the problem at hand can be identified.

  3. 3.

    The importance of rigorous problem definitions (in precise mathematical terms, and with precise correspondence to health care or health science objectives).

  4. 4.

    Re-inventing the wheel and why it is undesirable.

  5. 5.

    First-pass algorithms vs algorithm refinements and optimization.

  6. 6.

    Parallel algorithms, Distributed/Federated, Sequential, and Chunked algorithms.

  7. 7.

    Relaxing algorithmic assumptions/requirements.

  8. 8.

    Generalized algorithm frameworks and generalized conditions for performance guarantees.

  9. 9.

    What are AI/ML (or data science) Stacks.

  10. 10.

    “Pipelines”, “Automodelers”, and “Platforms”.

  11. 11.

    Explanation, interpretability, and transparency: Black box AI/ML methods and/or models; Transparent (or open, or white, or clear) box.

  12. 12.

    Justification of a method or a model (and their outputs) vs high-fidelity explanation of a method or model (and their outputs) e.g., by functional equivalence.

  13. 13.

    Human-understandable models and formalisms.

  14. 14.

    Open source software vs Closed source (or proprietary) software.

  15. 15.

    The importance of testing algorithms in controlled conditions.

  16. 16.

    Simulated data; Label-reshuffled data; Re-simulated data, and their properties and use.

  17. 17.

    Real-life examples of using the new method development process to establish the properties of well known (new or pre-existing) methods, tools or systems.

  18. 18.

    Interpreting results of application of a method at the level justified by its known properties.

Pitfalls Discussed in Chapter “Principles of Rigorous Development and of Appraisal of ML and AI Methods and Systems”

Pitfall 5.1.: Developing methods with theoretical and empirical properties that are:

  1. (a)

    Unknown, or

  2. (b)

    Poorly characterized in disclosures and technical, scientific or commercial communications and publications, or

  3. (c)

    Clearly stated (disclosed) but not proven, or

  4. (d)

    Not matching the characteristics of the problem to be solved at the level of performance and trust needed.

Pitfall 5.2. Evaluating the success of methods with poorly defined objectives, by employing expert narratives showing “validity”.

Pitfall 5.3. Defining the goals of methods in mathematical terms but without establishing how the mathematical goals map to the healthcare or health science discovery goal.

Pitfall 5.4. Reinventing the wheel: whereby a new method is introduced but it has been previously discovered yet ignored (willfully or not) by the “new” method developers.

Pitfall 5.5. Reinventing a method but making it worse than established methods (…“reinventing the wheel and making it square”!).

Pitfall 5.6. Providing peripheral/oblique and thus inadequate justifications of the model and its decisions which do not “open” the black box.

Pitfall 5.7. Confusing “open source” with “transparent” and “closed source” with “black box”.

Pitfall 5.8. Issues and pitfalls of ML challenges. In many, if not most cases, challenges suffer from fixing the data design and error estimation thus removing from consideration, two out of the three determinants of ML success (i.e., data design, ML model selection and error estimation protocol, algorithm).

Challenges also routinely restrict the design of modeling by pre-selecting variables, and over-simplifying the statement of problems, sometimes to meaningless extremes.

Challenges also often suffer from incomplete or highly biased representation in the competitor pool. Typically participants in challenges are either students or interested scientists who have competencies in areas unrelated to AI/ML.

Another limitation is that not all appropriate algorithms are entered in a challenge, and when they are entered, they are not necessarily executed according to optimal specifications.

Finally, challenges typically involve a very small number of datasets that do not represent a large domain. Such representative coverage typically requires dozens of datasets or more in the same comparison.

Pitfall 5.9. Interpreting results of a method (and resulting models) beyond what its known properties justify.

Pitfall 5.10. Interpreting results of a method (and resulting models) below what its known properties justify.

Best Practices Discussed in Chapter “Principles of Rigorous Development and of Appraisal of ML and AI Methods and Systems”

Best Practice 5.1. Methods developers should strive to characterize the new methods according to the dimensions of theoretical and empirical properties.

Best Practice 5.2. Methods developers should carefully disclose the known and unknown properties of new methods at each stage of their development and provide full evidence for how these properties were established.

Best Practice 5.3. Methods adopters and evaluators (users, funding agencies, editorial boards etc.) should seek to obtain valid information according to the dimensions of theoretical and empirical properties for every method, tool, and system under consideration.

Best Practice 5.4. Methods adopters and evaluators should map the dimensions of theoretical and empirical properties for every method, tool, and system under consideration to the problem at hand and select methods based on best matching of method properties to problem needs.

Best Practice 5.5. The properties of a ML algorithm can be negatively or positively affected by the ML protocol to extreme degrees (see chapter “Lessons Learned from Historical Failures, Limitations and Successes of AI/ML In Healthcare and the Health Sciences. Enduring problems, and the role of BPs.” for several important case studies that show the immense and often under-appreciated practical consequences). Similarly the data design can negatively or positively affect the ML protocol and its embedded algorithms to extreme degrees. Therefore, it is imperative to design AI/ML methods taking into account any positive or negative interactions of data design with the protocols and embedded algorithms employed.

Best Practice 5.6. The preferred design for validating AI/ML methods with real life data with known answers is the centralized benchmark design. Distributed benchmark designs, whenever feasible, add value by exploring natural variation in how methods are applied by experts. Finally, competitions have several intrinsic limitations and have to be interpreted carefully.

Best Practice 5.7. Develop and validate ML/AI methods using the following stages/steps:

  • Step 1. Rigorous problem definition (in precise mathematical terms and establishing how the mathematical goals map to the healthcare or health science discovery goal).

  • Step 2. Theoretical analysis of problem (complexity, problem space characteristics etc.).

  • Step 3. First-pass algorithms solving the problem.

  • Step 4. Theoretical properties of first pass algorithms: focus on representation power, soundness and completeness, transparency.

  • Step 5. Algorithm refinements and optimizations.

  • Step 6. Empirically test algorithms in controlled conditions.

  • Step 7. Empirically test algorithms in real life data with known answers/solutions.

  • Step 8. Empirically test algorithms in real life data without known answers/solutions but where future validation can take place.

Best Practice 5.8. Avoid evaluating methods by employing expert narratives showing “validity”.

Best Practice 5.9. Do not reinvent the wheel. Verify that a new method does not solve a problem previously solved by a better performing method.

Best Practice 5.10. Create open box methods to the full extent possible. Do not pursue weak justifications that fail to translate the models to accurate human readable representations.

Best Practice 5.11. Do not confuse “open source” with “transparent” and “closed source” with “black box”.

Best Practice 5.12. Interpret results of application of a method at the level justified by its known properties.

Classroom Assignments and Discussion Topics, Chapter “Principles of Rigorous Development and of Appraisal of ML and AI Methods and Systems”

  1. 1.

    Choose one ML/AI method of your choice and characterize it according to Best Practice 5.7. Use any literature that is adequate for that purpose.

  2. 2.

    Choose a well-cited paper from health sciences or healthcare that uses AI/ML. Characterize the primary methods using best practice 5.7.

  3. 3.

    Can you describe a real-life example of over- or under-interpreting a type of AI/ML analysis or modeling?

  4. 4.

    Are there safeguards in human professional training and certification analogous to the best practices presented in this chapter?

  5. 5.

    Revisit the question of chapter “Artificial Intelligence (AI) and Machine Learning (ML) for Healthcare and Health Sciences: the Need for Best Practices Enabling Trust in AI and ML” stating: “The so-called No Free Lunch Theorem (NFLT) states (in plain language) that all ML and more broadly all AI optimization methods are equally accurate over all problems on average. Discuss the implications for choice of AI/ML methods in practical use cases”, using the tools of the present chapter.

  6. 6.

    Revisit the question of chapter “Artificial Intelligence (AI) and Machine Learning (ML) for Healthcare and Health Sciences: the Need for Best Practices Enabling Trust in AI and ML” stating: “’It is not the tool but the craftsman’. Does this maxim apply to health AI/ML?” using the tools of the present chapter.

  7. 7.

    Revisit the question of chapter “Artificial Intelligence (AI) and Machine Learning (ML) for Healthcare and Health Sciences: the Need for Best Practices Enabling Trust in AI and ML” stating: “Construct a ‘pyramid of evidence’ for health ML/AI similar to the one used in evidence based medicine.” using the tools of the present chapter.

  8. 8.

    Revisit the question of chapter “Artificial Intelligence (AI) and Machine Learning (ML) for Healthcare and Health Sciences: the Need for Best Practices Enabling Trust in AI and ML” stating:

    You are part of an important university/hospital evaluation committee for a vendor offering a patient-clinical trial matching AI product. Your institution strongly needs to improve the patient-trial matching process to improve trial success and efficiency metrics. The sales team makes the statement that “this is a completely innovative AI/ML product; nothing like this exists in the market and there is no similar literature; we cannot at this time provide theoretical or empirical accuracy analysis however you are welcome to try out our product for free for a limited time and decide if it is helpful to you”. The product is fairly expensive (multi $ million license fees over 5 years covering >1,000 trials steady-state).

    What would be your concerns based on these statements? Would you be in position of making an institutional buy/not buy recommendation?”

    Use the guidelines of the present chapter to compose a brief report and recommendations.

  9. 9.

    Revisit the question of chapter “Artificial Intelligence (AI) and Machine Learning (ML) for Healthcare and Health Sciences: the Need for Best Practices Enabling Trust in AI and ML” stating:

    “A company has launched a major national marketing campaign across health provider systems for a new AI/ML healthcare product based on its success on playing backgammon, reading and analyzing backgammon playing books and human games, and extracting novel winning strategies, also answering questions about backgammon, and teaching backgammon to human players.

    How relevant is this impressive AI track record to health care? How would you go about determining relevance to health care AI/ML? How would your reasoning change if the product was not based on success in backgammon but on success in identifying oil and gas deposits? How about success in financial investments?

    Use the guidelines of the present chapter to compose a brief report and recommendations.

  10. 10.

    Revisit the question of chapter “Artificial Intelligence (AI) and Machine Learning (ML) for Healthcare and Health Sciences: the Need for Best Practices Enabling Trust in AI and ML” stating:

    “Your university-affiliated hospital wishes to increase early diagnosis of cognitive decline across the population it serves. You are tasked to choosing between the following five AI/ML technologies/tools:

    • AI/ML tool A guarantees optimal predictivity in the sample limit in distributions that are multivariate normal.

    • AI/ML tool B has no known properties but has been shown to be very accurate in several datasets for microarray cancer-vs-normal classification.

    • AI/ML tool C is a commercial offshoot of a tool that was fairly accurate in early (pre-trauma) diagnosis of PTSD.

    • AI/ML tool D is an application running on a ground-breaking quantum computing platform. Quantum computing is an exciting frontier technology that many believe has the potential to enable AI/ML with hugely improved capabilities.

    • AI/ML tool E runs on a novel massively parallel cloud computing platform capable of Zettascale performance.

    What are your thoughts about these options?”

    Use the guidelines of the present chapter to compose a brief appraisal.

  11. 11.

    Revisit the question of chapter “Artificial Intelligence (AI) and Machine Learning (ML) for Healthcare and Health Sciences: the Need for Best Practices Enabling Trust in AI and ML” stating:

    “The same question as #10 but with the following additional data:

    • AI/ML tool A sales reps are very professional, friendly and open to offering deep discounts.

    • AI/ML tool B is offered by a company co-founded by a Nobel laureate.

    • AI/ML tool C is offered by a vendor with which your organization has a long and successful relationship.

    • AI/ML tool D is part of a university initiative to develop thought leadership in quantum computing.

    • AI/ML tool E will provide patient-specific results in 1 picosecond or less.

    How does this additional information influence your assessment?”

    Use the guidelines of the present chapter to compose a brief appraisal.

  12. 12.

    Comment on the representation power of each of the following methods for its corresponding problem:

     1. Decision Trees ←→ predictive modeling

     2. KNN ←→ outlier detection

     3. Deep Learning ←→ pathway reverse engineering

     4. Simple Bayes ←→ simulating an arbitrary joint probability distribution

     5. SVMs ←→ predictive modeling with a random subset of the inputs missing

     6. Regularized regressors ←→ evaluating similarity of distributions

  13. 13.

    Rank the transparency and interpretability of the methods of question 12 for their preferred context of use.

  14. 14.

    Method X is correct whenever condition Y holds in the data. However, Y is not testable. Is X well characterized for soundness? Can it be?

    Bonus add-on: comment along the same lines on the soundness of the Aracne algorithm’s monotone faithfulness condition and of Propensity Scoring’s “ignorability condition”.

  15. 15.

    How do the notions of “heuristic power” and soundness relate?

  16. 16.

    A classifier model has an acceptable error rate in 2/3 of the patient population. This subpopulation is identifiable by the model and its properties. What are its soundness and completeness? What would the soundness and completeness be if no one could identify the cases with unacceptable error margins?

  17. 17.

    Why is worst-case complexity less useful than “complexity in x% of problem instances”?

  18. 18.

    In some cases it is possible to use properties of the problem class to immediately determine properties of specific algorithms. Developer D introduces method M and claims it can solve a problem with known exponential worst-case complexity in polynomial time. What can you immediately prove?

  19. 19.

    Can clustering methods be used to discover causality soundly and completely? Use your observations on their computational complexity and representational power to disprove this notion.

  20. 20.

    If discovering causal relations is worst-case exponential and regularized regression and SVMs are guaranteed quadratic time, what can you infer about the ability of regularized regression and SVMs to discover causality?

  21. 21.

    A faculty member at a university brings forward a proposal to the administration for installing a large compute cluster that has the compute power of 10,000 desktop computers. The faculty member wishes to use brute-force algorithms to discover non-linear discontinuous functions, in the form of parity functions, whose average-case modeling cost is exponential in the number of variables. If a single desktop can solve such problems for up to 3-way variable interactions, what is the maximum interaction degree that can be discovered with the proposed cluster? (An illustrative back-of-envelope sketch follows below.)
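    A minimal sketch of the kind of back-of-envelope calculation this exercise calls for, assuming (purely for illustration) that the brute-force search must enumerate all C(n, k) variable subsets of size k, so that runtime scales roughly with C(n, k); the function name, the choices of n, and the 10,000× speedup model are hypothetical, not part of the exercise:

```python
# Illustrative sketch only: assumes brute-force cost ~ C(n, k), i.e., the
# number of k-way variable subsets among n candidate variables.
from math import comb

def max_discoverable_degree(n_vars, baseline_degree=3, speedup=10_000):
    """Largest interaction degree k whose subset-enumeration cost stays
    within `speedup` times the cost of the baseline (3-way) search."""
    budget = speedup * comb(n_vars, baseline_degree)
    k = baseline_degree
    while k + 1 <= n_vars and comb(n_vars, k + 1) <= budget:
        k += 1
    return k

for n in (100, 1_000, 10_000):
    print(n, max_discoverable_degree(n))
# Under these assumptions the 10,000x cluster buys only a few extra degrees
# (roughly 3-way -> 6-way for n=100 and 3-way -> 4-way for n=1,000),
# illustrating how exponential search costs swamp hardware gains.
```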

  22. 22.

    How can Bayesian Networks exponentially reduce model storage requirements compared to simple use of Bayes’ theorem? (See the illustrative sketch below.)
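    A minimal sketch of the storage comparison in question, assuming n binary variables and a hypothetical chain-structured network in which each node has at most one parent (both assumptions are illustrative only):

```python
# Illustrative parameter-count comparison (assumptions: n binary variables,
# a hypothetical chain-structured Bayesian network, one parent per node).
n = 30

# Naive tabular joint distribution: one independent probability per
# configuration of the n variables (minus one for normalization).
full_joint_parameters = 2 ** n - 1          # ~1.07e9 numbers

# Bayesian network: one conditional table per node, with 2**(#parents)
# independent parameters for a binary child.
parents_per_node = [0] + [1] * (n - 1)      # chain: X1 -> X2 -> ... -> Xn
bn_parameters = sum(2 ** k for k in parents_per_node)   # 1 + 2*(n-1) = 59

print(full_joint_parameters, bn_parameters)
```

    The sparser the parent sets, the larger the gap between the two counts, which is the exponential saving the exercise asks you to explain.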

  23. 23.

    Suppose we store models that incorporate discrete conditional probability tables over binary variables. For a variable X that has parents P1, …, Pk, how does the storage complexity grow as k grows? How is this complexity self-limiting when the available sample size is relatively small? (See the illustrative sketch below.)
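    A minimal sketch of the growth pattern the exercise points to, assuming (for illustration only) a hypothetical sample of N = 10,000 records spread evenly across the parent configurations:

```python
# Illustrative sketch: CPT size for a binary X with k binary parents, and
# how a fixed sample N spreads across parent configurations.  The uniform
# coverage of N / 2**k samples per configuration is a simplification.
N = 10_000  # hypothetical available sample size

for k in range(0, 21, 4):
    rows = 2 ** k                  # parent configurations to estimate
    samples_per_row = N / rows     # average data available per estimate
    print(f"k={k:2d}  CPT rows={rows:>9,}  avg samples/row={samples_per_row:,.2f}")

# Once 2**k exceeds N, most rows receive little or no data: the practical,
# sample-driven limit on k that the exercise alludes to.
```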

  24. 24.

    In the 1980s a popular AI representation was the Certainty Factor Calculus (CFC, created by medical AI pioneers B.G. Buchanan and E. Shortliffe), a form of stochastic rule-based expert system representation. It was subsequently discovered by D. Heckerman et al. that, unless limitations were placed on the form of the data distribution, the CFC was not compatible with probability theory. Can you articulate when and why this might be a problem?

  25. 25.

    A commercial product promises that it can find “valuable insights” for healthcare improvement. Can you translate this deliverable into mathematically computable terms? What does it mean in terms of guaranteed properties if this type of output cannot be formalized mathematically?

  26. 26.

    Discuss the (humorous) maxim: “2 months in the lab will save you 2 hours in the library”.

  27. 27.

    Woods wrote a classic 1975 AI paper titled “What’s in a Link”. In it, he criticized the vague specification of the technical semantics of semantic networks (a prominent AI knowledge representation of his time derived from formal logic) and the impact that different semantics have on computability and complexity. Can you identify analogous problems in today’s AI/ML? For example, consider the causal semantics (or lack thereof) in the field of network science and in the numerous biological pathway reverse-engineering methods.

  28. 28.

    “Data Hubris” is described by the statement “Having lots of data is more important than choice of algorithm”. What does this maxim mean? Can you comment on the validity of this statement? What changes if you also take data design into consideration?