Theoretical AI/ML Properties and Formal Vs Heuristic Systems

We will first address a few key concepts for studying, understanding, and also designing AI systems by way of their formal properties. By formal properties we mean theoretical properties that are mathematical or computational, and thus technical and objective, in nature.

Computability/Provability and Turing-Church Thesis

The most foundational property for any computer system (not just AI/ML systems) is computability, that is, the fundamental question of whether there can even exist a computer program or system that achieves the computation needed for the inferences that we want the system to perform. Gödel [1] proved a theorem that shook the mathematical and computer science worlds.

  • Gödel’s celebrated “incompleteness theorem” shows that any non-trivial mathematical system for making deductive inferences can be either complete or consistent, but not both. Stated differently, if we wish to maintain the correctness of deductions, statements can be formed in the system that are true but cannot be proven.

  • A complete system is one that can deduce (or prove) from the axioms of the system all statements that are true.

  • A consistent system is one that does not produce contradictory conclusions (which entails false conclusions).

  • Correspondingly, in the realm of computing there are functions that are not computable, that is there is no computer program that can compute them.

These two results (about provability and computability) essentially mirror each other, because there is a close correspondence between a “proof” in a mathematical system and a “program” in an equivalent computing system implementing the mathematical system. Non-computable functions correspond to statements that cannot be proven, and vice versa.

Notice that the existence of non-computable functions/non-provable statements is with reference to a specific computing system. A different system may be able to prove certain statements at the expense of not being able to prove others that the first system can. Also, we note that in systems involving a finite number of domain elements, we do not face restrictions in computability. However, this is of small consolation if we realize that even systems as “basic” as common arithmetic involve many non-computable functions.
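The halting problem is the standard concrete example of a non-computable function. The sketch below (an illustration of the diagonalization argument; `make_paradox` and the candidate deciders are our own hypothetical names, not an established API) shows that any candidate halting decider `halts(f, x)` can be turned into a program on which it must give the wrong answer.

```python
def make_paradox(halts):
    """Given any candidate halting decider halts(f, x) -> bool, build a
    program on which the decider must be wrong (diagonalization)."""
    def paradox():
        if halts(paradox, None):
            while True:      # decider said "halts", so loop forever
                pass
        # decider said "never halts", so halt immediately
    return paradox

# Any concrete candidate decider is refuted. E.g., "nothing ever halts":
claims_none_halt = lambda f, x: False
q = make_paradox(claims_none_halt)
q()  # halts immediately, contradicting the decider's verdict

# "Everything halts" fails symmetrically: the program it is wrong about
# would loop forever, so we do not call it here.
claims_all_halt = lambda f, x: True
p = make_paradox(claims_all_halt)
```

Since every possible decider fails on the program constructed from it, no total function `halts` can exist.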

What is the relationship of computability/provability in the computational/mathematical realm with that in human intelligence and reasoning?

The Turing-Church thesis posits that everything that the human mind can infer can also be inferred by a computer/mathematical system [1]. According to the thesis, there are no special functions of human intelligence that a computer or mathematical system cannot emulate. This thesis is axiomatic, meaning that it is accepted without proof. From what we know so far from neuroscience, cognitive science, etc., there is nothing in the human brain that a computer system cannot model in principle, and the vast majority of AI scientists accept the Turing-Church thesis.

Computational Complexity of a Problem, Algorithm or Program

Computational complexity of a program refers to the efficiency of running a computer program that solves a particular problem according to a specific algorithm (that the program implements). In other words, it describes, for problems that can be solved by computer, how expensive it is to solve the problem. Computational complexity takes the form of a function that typically takes the size of the problem instance as input.

Computational complexity of an algorithm applies the same rationale to algorithms instead of programs. Typically we analyze computational complexity at the level of algorithms assuming that programs will be the most efficient implementation of the algorithm (when exceptions happen in practice, we state upfront that a particular implementation of an algorithm is not as efficient as it can be).

Computational complexity of a problem is then analyzed at the level of the most efficient algorithm known (or that could be devised but not yet known—we will see later how this is accomplished) for solving this problem.

Space complexity refers to how much space the computer program/algorithm/problem class requires to reach a solution. Time complexity refers to how much time the computer program/algorithm/problem class requires to reach a solution. Because different computers differ greatly in the time needed to execute the same basic operation (e.g., one addition or one access of a random access memory location, etc.) we often measure time complexity not in units of time but in numbers of some essential operation (and then we can translate these units to time units for available computer systems). Because the differences between computers are within constant factors, this does not make a difference in an asymptotic sense.

Worst, average, and best-case complexity. Often, not all problem instances require the same amount of resources (space or time) to be solved by the same program/algorithm. Worst-case complexity refers to the cost of the worst (most expensive) instance of the problem when solved by the best possible algorithm (or, alternatively, by a specific algorithm of interest). Best-case complexity refers to the cost of the best (least expensive) instance of the problem. Average-case complexity refers to the cost averaged over all instances of the problem.
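As a simple hypothetical illustration of these three cases, consider linear search with cost counted in comparisons: one comparison in the best case, n in the worst case, and about n/2 on average.

```python
def linear_search(items, target):
    """Return (index, comparisons); comparisons is the cost measure."""
    comparisons = 0
    for i, v in enumerate(items):
        comparisons += 1
        if v == target:
            return i, comparisons
    return -1, comparisons

data = list(range(100))
_, best = linear_search(data, 0)    # best case: target is first
_, worst = linear_search(data, 99)  # worst case: target is last
average = sum(linear_search(data, t)[1] for t in data) / len(data)
print(best, worst, average)  # 1 100 50.5
```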

Exact complexity refers to a precise complexity formula, for example:

$$ Cost(x)={x}^2 $$
(1)

where x is the size of the problem instance. In this example, the cost of solving the problem is exactly the square of the size of the problem instance.

Asymptotic complexity refers to complexity as an asymptotic growth function, i.e., how fast the complexity grows as input size grows. For example,

$$ Cost(x)=O\left(f(x)\right) $$
(2)

The “Big O” notation O(f(.)) denotes that there exists a problem instance size k above which the complexity (cost) of every problem instance of size at least k is bounded from above by a positive constant multiple of the value of the function f(.), or more compactly stated:

$$ \exists k\;\mathrm{s}.\mathrm{t}.,\forall x\ge k: Cost(x)\le cf(x) $$
  • Where ∃ is the existential quantifier (denoting that the quantity in its scope exists)

  • ∀ is the universal quantifier, denoting that the statement that follows is true for all entities in its scope

  • x is the size of the problem instance

  • Cost(x) is the computational cost of running the algorithm for input size x

  • k is an input size threshold above which the complexity statement holds

  • c is a positive constant

  • “s.t.” is the common abbreviation for “such that”.
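The definition can be checked numerically for a concrete case. The sketch below (a hypothetical illustration; the cost function and witnesses are our own) verifies that Cost(x) = 3x² + 10x is O(x²), witnessed by c = 4 and k = 10.

```python
# Numerical sanity check of the Big-O definition: there exist c and k
# such that Cost(x) <= c * f(x) for all x >= k.
# Here Cost(x) = 3x^2 + 10x is O(x^2), witnessed by c = 4 and k = 10.

def cost(x):
    return 3 * x**2 + 10 * x

c, k = 4, 10
assert all(cost(x) <= c * x**2 for x in range(k, 10_000))
print("bound holds for all tested x >=", k)
```

Note that the bound fails below the threshold (e.g., cost(9) = 333 > 4·81 = 324), which is exactly why the definition only requires it to hold for x ≥ k.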

We often use asymptotic cost complexity for two reasons: (a) It eliminates confusion created by differences in the speeds of various computer systems since in practice these are all within a small constant factor of each other. (b) It shifts the attention to the broad classes of rates of cost growth (e.g. linear, quadratic, exponential, etc.) and not the precise cost formulas that can be convoluted. Mathematical analysis can accordingly be greatly simplified.

To understand the implications of asymptotic growth contrast the polynomial asymptotic complexity of formula (1) with the one below:

$$ Cost(x)=O\left({2}^x\right) $$
(3)

The following Table 1 shows how quickly these cost functions grow (assuming, for illustration purposes, c = 1). For input sizes in the low hundreds, the exponential cost in terms of space and time grows to sizes comparable to the size of the universe (measured in atoms) and quickly becomes much larger than the size of the universe. This means that there is not enough physical space or time to solve these problems!

Table 1 Demonstration of the practical significance of asymptotic computational complexity

The fallacy/pitfall that we will “use a big enough cluster” (or other high-performance computing environment) to solve a high-complexity problem is addressed in the parallel column, where it is shown that the number of CPUs needed would quickly exceed the size of the universe. The fallacy/pitfall that Moore’s law (i.e., computing power doubling every few years) will provide enough power is addressed in the Moore’s law column, where it is shown that millions of years would be needed to address problems of any significant size, and that after some point the space and time requirements exceed the size of the known universe.
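The growth contrast underlying Table 1 is easy to reproduce. The sketch below (an illustration, not taken from the table itself) compares x² and 2ˣ against the commonly cited estimate of about 10⁸⁰ atoms in the observable universe.

```python
# Contrast polynomial vs. exponential cost growth (taking c = 1).
# ~10**80 is a commonly cited estimate of atoms in the observable universe.
ATOMS_IN_UNIVERSE = 10**80

for x in (10, 50, 100, 300):
    print(f"x={x:4d}  x^2={x**2:<8d}  2^x={2**x:.2e}")

# By x = 300, the exponential cost alone dwarfs the atom count:
print(2**300 > ATOMS_IN_UNIVERSE)  # True
```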

We will refer to problems, algorithms, programs and systems exhibiting such exorbitant complexities as intractable. The following pitfalls and corresponding best practices need to be taken into consideration:

Pitfall 2.1

From a rigorous science point of view, an AI/ML algorithm, program or system with intractable complexity does not constitute a viable solution to the corresponding problem.

Pitfall 2.2

Parallelization cannot make an intractable problem, algorithm or program practical.

Pitfall 2.3

Moore’s law improvements to computing power cannot make an intractable problem, algorithm or program practical.

Best Practice 2.1

Pursue development of AI/ML algorithms, programs or systems that have tractable complexity.

Best Practice 2.2

Do not rely on parallelization to make intractable problems tractable. Pursue tractable algorithms and factor any parallelization into the tractability analysis.

Best Practice 2.3

Do not rely on Moore’s law improvements to make an intractable problem, algorithm or program practical. Pursue tractable algorithms and factor any gains from Moore’s law into the tractability analysis.

It is very common in modern AI/ML to be able to address problems that have worst-case exponential (or other intractable) complexity and routinely tackle, for example, analyses of datasets with >10⁶ variables for problems with worst-case exponential cost, by using a number of strategies that we will summarize below. First, we round out the introduction to complexity properties with an overview of complexity classes.

Reduction of Problems to Established Complexity Class

Earlier we mentioned that computational complexity of a problem can be analyzed at the level of the most efficient algorithm known, or that could be devised but not yet known. How is this possible? One ingenious way to achieve this was discovered by Cook who proved a remarkable theorem (and received a Turing award for the work) [2]. Karp, based on Cook’s result, showed how to prove that several other problems were in the same complexity class (and also won a Turing award for this work) [3].

The above constitute a generalizable methodology, very widely used in computer science and AI/ML, comprising two steps:

  1. First, establish via mathematical proof that a problem class P1 has an intrinsic minimum complexity regardless of the algorithm or program that has been devised, or could be devised, to solve it (i.e., a complexity intrinsic to the problem and independent of algorithm, in the sense that no Turing machine can exist that could do better). This step does not require knowledge of a conventional algorithm that solves P1.

  2. Second, in order to prove that the problem Pi at hand belongs to the same or a harder complexity class than P1, it suffices to establish that a fast reduction (e.g., one with polynomial-time complexity) exists that maps problems and their solutions in P1 to problems and solutions in Pi, such that when a solution to a Pi problem instance is found, it can be converted quickly into a solution of the corresponding P1 instance.

“Fast” in this context means that the cost of the reduction plus the cost of solving the P1 version of the Pi problem will be no costlier (asymptotically) than solving Pi. For example, if Pi has cost O(2^x), a reduction with cost O(x^2) satisfies the requirement since O(2^x + x^2) = O(2^x).
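A classic Karp-style reduction can be written in a few lines. The sketch below (illustrative; function names are our own) maps CLIQUE to INDEPENDENT-SET: complementing a graph's edges is a polynomial-time mapping, while the brute-force checker stands in for solving the (worst-case intractable) target problem.

```python
from itertools import combinations

def complement(n, edges):
    """Polynomial-time reduction map: G has a clique of size m iff the
    complement of G has an independent set of size m."""
    all_pairs = set(combinations(range(n), 2))
    return all_pairs - {tuple(sorted(e)) for e in edges}

def has_independent_set(n, edges, m):
    """Brute-force solver for the target problem (worst-case exponential;
    for illustration only)."""
    return any(all(tuple(sorted(p)) not in edges for p in combinations(s, 2))
               for s in combinations(range(n), m))

# Triangle 0-1-2 plus isolated vertex 3: the graph has a 3-clique, so the
# complement must have an independent set of size 3.
n, edges = 4, {(0, 1), (0, 2), (1, 2)}
print(has_independent_set(n, complement(n, edges), 3))  # True
```

The reduction itself costs O(n²), negligible next to the exponential cost of the solver, which is exactly the “fast reduction” requirement.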

Step 1 has to be accomplished only once for a prototypical problem class and is of the greatest mathematical difficulty. Step 2, which is typically considerably easier, is done each time a new method is introduced and is conducted once for the new method, with reference to the prototypical problem class.

Cook’s discovery provided exactly step 1 and opened the floodgates, via the reduction methods of Karp (step 2), for assigning whole problem classes to complexity cost classes regardless of the algorithm or program used to solve them and regardless of whether even a single algorithm is currently known for solving the problem.

AI/ML and computer scientists often use prototypical complexity classes to study and categorize problems and the algorithms solving them, the most common ones being:

  • The P complexity class: contains problems that can be solved in polynomial time. These are considered as tractable (assuming, as is typically the case, that the polynomial degree is small).

  • The NP complexity class: contains problems that have the property that a solution can be verified as correct in polynomial time.

  • The NP-Complete complexity class: These are problems that are in NP and moreover if any of the problems in this class can be solved in polynomial time, then all other problems in the class can also be solved in polynomial time.

  • The NP-Hard complexity class: contains problems that are at least as hard as the hardest problems in NP; they need not themselves belong to NP.

Several other classes exist and are subject to study and exploration (as to what problems belong to them or what relationships exist among them).
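The defining property of the NP class, namely that a proposed solution (certificate) can be verified in polynomial time even when finding one may be intractable, can be sketched for the classic subset-sum problem. This is an illustrative example and the names are our own.

```python
# Subset-sum: does some subset of nums sum to target? Finding a subset is
# worst-case intractable, but verifying a claimed one is cheap.

def verify_subset_sum(nums, target, certificate):
    """Polynomial-time check of a claimed solution (a list of indices)."""
    return (len(set(certificate)) == len(certificate)      # no repeats
            and all(0 <= i < len(nums) for i in certificate)
            and sum(nums[i] for i in certificate) == target)

nums, target = [3, 34, 4, 12, 5, 2], 9
print(verify_subset_sum(nums, target, [0, 2, 5]))  # True: 3 + 4 + 2 = 9
print(verify_subset_sum(nums, target, [1, 3]))     # False: 34 + 12 = 46
```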

The practical significance of the complexity classes is as follows:

  • Problems in P are considered as tractable (assuming, as is typically the case, that the polynomial degree is small).

  • Problems in the NP-Complete or NP-Hard classes are considered very hard, and it is extremely unlikely that algorithms that solve such problems tractably in the worst case can be created.

A fundamental property of AI/ML problem solving is that it usually operates in problem spaces belonging to the very high complexity/worst-case intractable classes. Many strategies have been invented to circumvent these theoretical difficulties and guide creation of efficient algorithms and systems, however (discussed later in the present chapter).

A List of Key and Commonly Used Formal Properties of AI/ML (Table 2)

Many additional special-purpose or ancillary formal properties can also be studied and established, such as: whether performance estimators are biased, statistical decision false positive and false negative errors when fitting models, whether scoring rules or distance metrics used are proper or improper, various measures of statistical certainty, etc. We emphasize that the properties listed in Table 2 have an immediate and obvious relationship with, and consequences for, the common objectives of health AI/ML. In the present volume we will refrain from study of properties that do not have strong relevance to the success or failure of AI/ML modeling. For example, the accuracy of a predictive model has immediate consequences for its usefulness. By contrast, the centrality measures of network science models say very little about their predictive (or causal) value. Similarly, the use of the perplexity measure to study the degree to which a Large Language Model has learned (essentially the grammar underlying) a text corpus does not indicate the clinical severity resulting from output errors made by the model, which may be of much higher importance for health applications.

Table 2 Commonly-considered important formal (theoretical) properties that characterize all AI/ML algorithms, programs and systems

Formal (aka theoretical) properties are “hard” technical properties (i.e., mathematical, immutable). There exist “softer” properties (i.e., less technical, more transient, or even harder to establish objectively) such as compliance to regulatory or accreditation guidance, reporting standards, ethical principles, etc.

An additional category with special significance is that of empirical performance properties. These are obtained using methods of empirical evaluation (chapters “Principles of Rigorous Development and of Appraisal of ML and AI Methods and Systems”, “The Development Process and Lifecycle of Clinical Grade and Other Safety and Performance-Sensitive AI/ML Models”, “Evaluation”, and “Characterizing, Diagnosing and Managing the Risk of Error of ML & AI Models in Clinical and Organizational Application”).

Importance of theoretical and empirical properties. Taken together these characterizations of AI/ML systems provide an invaluable framework for:

  (a) Understanding the strengths and limitations of AI/ML methods, models and systems;

  (b) Improving them;

  (c) Understanding, anticipating, and effectively managing the risks and benefits of using AI/ML; and

  (d) Choosing the right method for the problem at hand, among the myriad of available methods and systems.

We will see many examples of these formal, empirical and ancillary properties in the chapters ahead, considered in context. Chapters “Foundations and Properties of AI/ML Systems”, “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare and Health Science”, and “Foundations of Causal ML” describe properties of main AI/ML methods and chapter “Principles of Rigorous Development and of Appraisal of ML and AI Methods and Systems” provides a summary table with the main properties of all main health AI/ML methods.

Principled Strategies to Achieve Practically Efficient Algorithms, Programs and Systems for Worst-Case Intractable Problems

Since most intractability results pertain to worst-case complete and sound problem solving, a number of strategies can be used to achieve tractability by trading off computational cost against reductions in soundness, completeness or worst-case guarantees. Common example strategies are listed below.

  (a) Focus on portions of the problem space that admit tractable solutions and ignore the portions with intractable solutions. In problem domains where the worst-case complexity diverges strongly from the average-case complexity, such an approach is especially appealing. For example, in ML problems, focus on restricted data distributions or target function sub-classes that lead to tractable learning. Or focus on sparse regions of the data-generating processes and ignore dense (and commonly intractable) regions.

  (b) Exploit prior domain knowledge to constrain and thus speed up problem solving. For example, in discovery problems, avoid generating and examining many possible solutions that are incompatible with prior biomedical knowledge about the credible solution space. This may be viewed as a case of “knowledge transfer” from this or similar problem domains. This is also often called pruning, where large branches that are guaranteed not to contain the correct solution are eliminated from vast solution search trees.

  (c) Instead of an intractable complete solution, provide a tractable localized part of the full solution that is still of significant value. For example, when pursuing a causal model of some domain, focus on the partial causal model around some variables of interest (i.e., biological pathway discovery involving a phenotype instead of full network discovery).

  (d) Instead of an intractable complete solution, provide a tractable non-local portion of the full solution that is still of significant value. For example, when pursuing a causal model of some domain, allow discovery of a portion of correct relationships of interest (i.e., biological causal relationship discovery involving factors across the data-generating network instead of full network discovery; or recovering a correct but unoriented causal network instead of the oriented one).

  (e) Instead of producing perfectly accurate but intractable solutions, focus on more tractable but acceptable approximations of the true solutions.

  (f) Do not solve harder problems than your application requires. A classic example demonstrating this principle is to prefer discriminative models over generative ones in predictive modeling. In plain language, this means that we can often solve a hard problem (e.g., what treatment to give to a patient with a kidney stone) by building simple decision functions describing only relevant facts and not a full computational theory of the domain (e.g., a full theory of the function of the kidneys from the nephron up, and of the interaction of the kidneys with the rest of the body, is not needed to conclude that removing the stone or breaking it up with ultrasound will be sufficient for curing the patient).

  (g) Perform operations on compact representations. This strategy involves replacing intractably large data structures with declarative and highly compact representations. For example, “every person who suffers from a mental health disorder” encompasses an estimated 10⁹ people globally but does not enumerate or even identify them. This approach also involves operating on classes of the problem space simultaneously rather than on each member of the class. This is particularly evident in several forms of ML modeling where astronomical numbers of model structures are scored at once and represented compactly.
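The pruning idea in strategy (b) can be made concrete with a minimal hypothetical sketch: in a search for a subset of positive numbers summing to a target, any branch whose partial sum already overshoots the target cannot contain a solution and is cut off without enumerating its exponentially many descendants.

```python
def subset_sum_prune(nums, target, partial=(), partial_sum=0):
    """Depth-first search over subsets of positive nums summing to target,
    pruning every branch whose partial sum already overshoots."""
    if partial_sum == target:
        return list(partial)
    if partial_sum > target or not nums:  # prune: overshoot is hopeless
        return None
    head, rest = nums[0], nums[1:]
    return (subset_sum_prune(rest, target, partial + (head,), partial_sum + head)
            or subset_sum_prune(rest, target, partial, partial_sum))

print(subset_sum_prune([8, 6, 7, 5, 3], 15))  # [8, 7]
```

The worst case remains exponential, which is consistent with the chapter's point: pruning trades average-case speedups for guarantees about worst-case cost, not the other way around.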

Best Practice 2.4

When faced with intractable problems, consider using strategies for mitigating the computational intractability by trading off with less important characteristics of the desired solution.

Heuristic and Ad Hoc Approaches and the Prescientific-to-Scientific Evolutionary AI/ML Continuum

The term “heuristic” AI/ML methods or systems refers to several types of systems or strategies. First, rules of thumb that may give a good solution sometimes but do not guarantee it. Second, functions used inside AI search methods to accelerate finding problem solutions. Third, ad hoc systems, i.e., systems that are not designed based on a formal framework for AI/ML and do not guarantee strong or safe performance in a generalizable sense. Fourth, methods and systems applied outside their scope of guaranteed safe, effective, or efficient use (i.e., hoping that an acceptable solution may be produced).

To clarify these concepts consider the following examples (note: all of the mentioned methods and systems will be thoroughly discussed in this and subsequent chapters):

  • “We need at least 10 samples per variable when fitting an ordinary least squares regression model” is an example of the first type of heuristic. Another example of the first heuristic type is “choosing 100 genes at random from a cancer microarray dataset will yield predictor models with very high accuracy for diagnosis, often near the performance of special gene selection algorithms” (for surprised readers, not familiar with such data, this is because there is large information overlap and redundancy among genes with respect to clinical cancer phenotypes).

  • The “Manhattan distance” as an estimate for the spatial distance between the current location and the goal location in a robot navigation problem is a heuristic that, when used inside the A* search algorithm (see later in the present chapter), allows the algorithm to find a path with minimum cost to the goal. This is an example of the second type of heuristic.

  • The well-known INTERNIST-I system for medical diagnosis in internal medicine was an example of the third type of heuristic AI. It lacked a formal AI foundation both in knowledge representation and inference. It was shown to be highly accurate in certain tests of medical diagnosis problems, however [4].

  • Examples of the fourth type are: (1) using Naïve Bayes (a formal ML method that assumes very special distributions in order to be correct) in distributions where the assumptions are known to not hold, hoping that the error will be small. (2) Using Propensity Score (PS) techniques for estimating causal effects, without testing the distributional assumption that makes PS correct (i.e., “strong ignorability”, which is not testable within the PS framework). (3) Using Shapley values, a Nobel-Prize winning economics tool devised for value distribution in cooperative coalitions to explain “black box” ML models (a completely different task, for which the method was not designed or proven to be correct; as we will see later in the present volume, it can fail in a wide variety of models). (4) Using IBM Watson, a system designed and tested in an information retrieval task (Jeopardy game) for health discovery and care (for which it had no known properties of correctness or safety). (5) Using Large Language Models (LLMs), e.g., ChatGPT and similar systems (designed for NLP and conversational bot applications) for general-purpose AI tasks (not supported by the known properties of LLMs).
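The Manhattan-distance heuristic from the second bullet above can be sketched concretely: a minimal A* search on a 4-connected grid. Because the Manhattan distance never overestimates the true remaining cost, A* returns a minimum-cost path. This is an illustrative implementation with our own names, not code from the systems discussed.

```python
import heapq

def manhattan(a, b):
    """Admissible heuristic on a 4-connected grid: never overestimates
    the true remaining path cost."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def astar(grid, start, goal):
    """grid[r][c] == 1 marks an obstacle. Returns minimal path length."""
    rows, cols = len(grid), len(grid[0])
    frontier = [(manhattan(start, goal), 0, start)]  # (f = g + h, g, node)
    best = {start: 0}
    while frontier:
        _, g, node = heapq.heappop(frontier)
        if node == goal:
            return g
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            r, c = node[0] + dr, node[1] + dc
            if 0 <= r < rows and 0 <= c < cols and not grid[r][c]:
                if g + 1 < best.get((r, c), float("inf")):
                    best[(r, c)] = g + 1
                    heapq.heappush(
                        frontier,
                        (g + 1 + manhattan((r, c), goal), g + 1, (r, c)))
    return None

grid = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]
print(astar(grid, (0, 0), (2, 0)))  # 6: detour around the wall
```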

For the purposes of this book, the third and fourth type are most interesting and we will focus on them in the remainder of this section.

In earlier times in the history of health AI/ML as well as broad AI/ML, proponents of ad hoc (heuristic type 3) systems argued that as long as heuristic systems worked well empirically they should be perfectly acceptable especially if more formally-constructed systems did not match the empirical performance of heuristic systems or if constructing formal systems or establishing their properties was exceedingly hard. Proponents of formal systems counter-argued that this ad-hoc approach to AI was detrimental since one should never feel safe when applying such systems, especially in high-stakes domains. At the same time many proponents of the formal approach engaged in practices of the fourth type of heuristic (not testing assumptions, or using a system designed for task A, in unrelated task B).

From a more modern scientific perspective (with the substantial benefit of hindsight) of performant and safe systems operating in high-risk domains such as health, the above historical debate is more settled today than in the earlier days of exploring AI. Heuristic systems and practices represent pre-scientific approaches in the sense that a true scientific understanding of their behavior does not exist (yet), but with sufficient study a comprehensive understanding of a heuristic system of today can be obtained in the future. In other words, the heuristic system of today will be the well-characterized, scientific, non-heuristic system of tomorrow.

In this book we adopt the sharp distinction:

AI/ML systems with well understood theoretical properties and empirical performance vs. systems that lack these properties (aka Heuristic systems).

A further distinction can be made regarding whether a system is based on formal foundations or being ad hoc. The importance of formal systems is that they make the transition to well-understood systems faster and easier. In the absence of formal foundations, it is hard to derive properties and expected behavior. If formal foundations exist, often many of the properties are immediately inherited from the general properties of the underlying formal framework. In any case, deriving formal properties of methods with strong formal foundations is vastly easier than of ad hoc methods.

In addition, there is a strong practical interplay between theoretical properties and empirical performance. If theory predicts a certain behavior and empirical tests do not confirm it, this means that errors likely occurred in how the models/systems have been implemented, and debugging is warranted. Alternatively, it may suggest that we operate in a domain with characteristics that are different from the theoretical assumptions of our model (and we need to change modeling tools or strategy). If model A empirically outperforms model B on a task for which A is not built but B is theoretically ideal, this suggests that there are implementation errors or evaluation data/methodology errors in model B, and so on. In other words, a strong theoretical understanding bolsters, and is enhanced by, empirical application and validation. What is not working well is lacking one or both of these important components (theoretical base + empirical base). More on the interplay of theoretical properties and empirical performance can be found in chapter “Principles of Rigorous Development and Appraisal of ML and AI Methods and Systems”.

It is also significant to realize that there is an evolutionary path from pre-scientific informally-conceived systems, to partially-understood (theoretically or empirically) systems, and finally to fully-mature and well-understood AI/ML.

In earlier related work Aliferis and Miller [5] discussed the “grey zone” between formal systems with known properties but with untestable or unknown preconditions for correctness in some domain, and ad hoc systems with unknown properties across the board. Their observation was that both classes required a degree of faith (with no guarantees) for future success. This early work can be elaborated taking into account the following parameters: formal or ad hoc foundation; known theoretical properties or not; whether the known properties are testable and have been tested vs not; known empirical performance or not; and whether empirical performance is satisfactory and what alternatives may exist.

The following table (Table 3) distills the above multi-dimensional space to its essential cases and describes this landscape and developmental journey from pre-scientific systems (lacking properties, rows 1, 3) to intermediate-level systems (with partial properties, rows 2, 4, 5, 6), to mature, reliable, science-backed systems (with known properties, rows 7, 8). The table also points to pitfalls and best practices for building and using systems with the listed characteristics.

It is worth emphasizing that systems with known properties are not automatically optimal or even suitable for solving a problem. Knowledge of properties of various methods and approaches can be used to find the best solution for a task, however.

Chapters “Foundations of Causal ML” and “Principles of Rigorous Development and of Appraisal of ML and AI Methods and Systems” further elaborate on how these concepts can be implemented in practice during the practical development of performant and safe AI/ML.

Table 3 Classification of AI/ML systems based on their formal foundation and properties. The development spectrum from pre-scientific to mature science-backed systems

Pitfall 2.4

Believing that heuristic systems can give “something for nothing” and that they have capabilities that surpass those of well-characterized systems. In reality, heuristic systems are pre-scientific or in early development stages.

Best Practice 2.5

As much as possible use models and systems with established properties (theoretical + empirical). Work within the maturation process starting from systems with unknown behaviors and no guarantees, to systems with guaranteed properties.

Foundations of AI: Logics and Other Symbolic AI and Non-symbolic Extensions

Symbolic vs. Non-Symbolic AI

Logic is a staple of science and the cornerstone of all types of so-called symbolic AI.

  • By symbolic AI we refer to AI formalisms that focus on representing the world with symbolic objects and logical relations, and making inferences using logical deductions.

  • Symbolic systems typically contain deep, structured representation of the problem solving domain.

Examples include production systems, rule-based systems, semantic networks, deductive reasoners, causal modeling with detailed causal relations, and other types of systems discussed later in this chapter.

  • By contrast, non-symbolic AI encompasses various formal systems that focus on uncertain and stochastic relationships, using various forms of inference that either rely on probability theory or can be understood in terms of probability.

  • Non-symbolic systems are typically shallow representations of input-output relationships without a detailed model of the structure of the problem solving domain.

  • Terminology Caution: Deep Learning neural networks are designated as such because they have many hidden node layers (as opposed to shallow ANNs, which have few). However, both are shallow AI systems in the sense that they lack a rich representation of the domain and its entities (i.e., they are ontologically shallow).

Examples of non-symbolic AI (in the ontological shallowness sense) include connectionist AI, which approaches AI from an artificial neural network point of view; probabilistic AI, which uses a probability theory perspective; shallow causal models; genetic algorithms, which adopt an evolutionary search perspective; reinforcement learning, predominantly within data-driven ML, which is the currently dominant paradigm of AI; and, most recently, Large Language Models (LLMs).

There exist also formalisms that transcend and attempt to unify symbolic and non-symbolic AI such as causal probabilistic graphs, probabilistic logics or ANN-implemented rule systems.

See chapter “Lessons Learned from Historical Failures, Limitations and Successes of AI/ML in Healthcare and the Health Sciences. Enduring problems, and the role of Best Practices” for discussion of this important class of AI/ML.

Propositional Logic (PL)

Propositional logic [1, 6] is the simplest form of logic allowing the construction of sentences like:

$$ {\displaystyle \begin{array}{l}\left(\begin{array}{l}\left(\left(\mathrm{Symptom}\_\mathrm{positive}\_\mathrm{A}=\mathrm{True}\right)\wedge \left(\mathrm{Test}\_\mathrm{positive}\_\mathrm{B}=\mathrm{False}\right)\right)\\ {}\vee \neg \left(\mathrm{Test}\_\mathrm{positive}\_\mathrm{C}=\mathrm{True}\right)\end{array}\right)\\ {}\to \mathrm{Diagnosis}\_\mathrm{Disease}1=\mathrm{True}\end{array}} $$

Or in words, if the patient has symptom A and test B is negative, or if she does not have a positive test C, then she has Disease1.

As can be seen, PL uses propositions (statements) that can take values in {True, False}, and logical connectives (and, or, implication, negation, equivalence, parentheses). By combining the above based on the straightforward syntax of PL, we create complex sentences that may be valid or not. Other than the (tautological) meaning of the truth values {True, False}, the precise meaning (semantics) of the propositions is embedded in them (i.e., it is not explicit in the PL language).

A PL Knowledge Base (KB) contains a set of sentences that are stated as axioms (true propositions or valid complex sentences) by the user; other sentences can then be constructed and proven valid or not using the truth table of a PL sentence. The correspondence between the validity of propositions and sentences in the KB and the real world is provided by the notion of a model for that KB, which is some part of the world (e.g., a biomedical problem domain) where the KB truth assignments hold. Syntactic operations (e.g., by a computer) on the KB prove the validity of sentences in all models of that KB. Inferring that some manipulation of computer symbols automatically has a valid interpretation in the real world (originating from the validity of the axioms) is commonly referred to by the expression “the computer will take care of the syntax and the semantics will take care of themselves”.

Truth tables can be used to show whether a sentence is valid by examining whether the sentence is true under all truth assignments of the propositions involved (hence valid); otherwise it is not valid. In a truth table, sentences are decomposed into smaller parts so that truth values for the whole sentence can be determined. Commonly used inference steps are encapsulated in inference rules such as Modus Ponens, And-elimination, And-introduction, Resolution, etc. These are used to avoid constructing very large/complex truth tables. The computational complexity of proving that a PL sentence is valid is worst-case intractable but quite manageable in small domains [1].
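To make the truth-table procedure concrete, here is a minimal Python sketch (the function names and sentence encoding are ours, not from the chapter): a sentence is represented as a function of a truth assignment, and validity is checked by exhaustive enumeration, which is exponential in the number of propositions and explains why inference rules are preferred in larger KBs.

```python
from itertools import product

def is_valid(sentence, symbols):
    """A PL sentence is valid iff it is true under every truth assignment."""
    return all(
        sentence(dict(zip(symbols, values)))
        for values in product([True, False], repeat=len(symbols))
    )

# Material implication as a helper: (a -> b) == (not a) or b
implies = lambda a, b: (not a) or b

# "P or not P" is a tautology (valid); "P -> Q" is satisfiable but not valid.
print(is_valid(lambda m: m["P"] or not m["P"], ["P"]))          # True
print(is_valid(lambda m: implies(m["P"], m["Q"]), ["P", "Q"]))  # False
```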

First Order Logic (FOL)

FOL is a vastly more powerful form of logic than PL and can represent:

  • Objects (e.g., patients, genes, proteins, hospital units)

  • Properties of objects (e.g., alive, up-regulated, secondary structure, full)

  • Relations among objects and terms (e.g., patient 1 has a more severe disease than patient 2, gene 1 determines phenotype 1, protein 1 catalyzes chemical reaction 2, hospital unit 1 is less utilized than unit 2)

  • Functions (e.g., BMI of a patient P, length of a gene G, molecular weight of a protein P, bed capacity of a hospital unit U)

The syntax of FOL uses:

  • Objects,

  • Constants,

  • Variables,

  • Predicates used to describe relations among constants or variables,

  • Functions over constants, variables or other functions

  • Connectives: equivalent, implies, and, or

  • Parentheses

  • Negation

  • Quantifiers: there exists, for all (defined over objects)

  • Terms formed from constants, variables and functions over those

  • Atomic sentences defined over predicates applied on terms

  • Complex sentences defined using atomic sentences, connectives, quantifiers, negation and parentheses.

For a technical syntax specification see [1].

Higher order logics allow expression of quantifiers applied over functions and relations (not just objects). These logics are more powerful, but inference is much harder, thus logic-based AI typically deals with FOL or simpler derivative formalisms.

The application of FOL (or derivatives) to build a Knowledge Base (KB) useful for problem solving in some domain is an instance of Knowledge engineering. It involves: (a) ontology engineering, that is identifying and describing in FOL (or other appropriate language chosen during ontology engineering) the ontology (objects, types of relationships) in that problem domain; and (b) knowledge acquisition that is identifying and describing in FOL the relevant axioms (facts) describing key aspects of the domain, and from which inferences can be drawn to solve problems.

In the health sciences and healthcare a number of ontologies have been created and are widely used. A most significant component of those are the common data models used to describe entities and variables. These are of essential value for both symbolic and data driven ML methods and for harmonizing data and knowledge across health care providers, studies, and scientific projects. See chapter “Data Preparation, Transforms, Quality, and Management” for a discussion of the most commonly used common data models and standards.

Knowledge engineering can substitute for ordinary programming; its fundamental advantage is that it is a declarative approach with significant benefits (whenever applicable), such as the ability to represent compactly facts and complex inferences that may be very cumbersome for conventional procedural or functional programming methods. Moreover, once the AI knowledge engineer has constructed the knowledge base, a myriad of inferences can be made using the pre-existing inferential mechanisms of FOL. In other words, no new problem-solving apparatus is needed, because it is provided by FOL. Declarative programming needs only a precise statement of the problem.

Logical Inference

FOL has a number of sound inference procedures that differ in their completeness. Such procedures are Generalized Modus Ponens (that can be used in Forward-Chaining, Backward-Chaining directions), and Resolution Refutation.

Forward chaining is an algorithm that starts from the facts and generates consequences, whereas backward chaining starts from what we wish to prove and works backward to establish the necessary antecedents. The “chaining” refers to the fact that as new sentences are proven correct, they can be used to activate new rules until no more inferences can be made.

  • As a very simple example consider a KB with:

Axioms:

A, B

Rules of the type x → y:

A → C, and C ∧ B → D

  • From this KB,

  • The Forward Chaining algorithm will perform the following sequence of operations:

    1. From A and A → C it will infer C and add it to the KB

    2. From B, C and C ∧ B → D it will infer D and add it to the KB

    3. It will terminate because no new inferences can be made

  • The Backward Chaining algorithm, from the same original KB and a user request to prove D, will:

    1. First see that D is not an axiom

    2. Identify that C ∧ B → D can be used to try to prove D

    3. Seek to prove C and B separately

    4. B is an axiom, so it is true

    5. C is not an axiom, but rule A → C can be used to prove it

    6. Seek to prove A

    7. A is an axiom, thus true

    8. Thus (by backtracking to step 5) C is true

    9. Thus (by backtracking to step 2) D is true

    10. Terminate, reporting success in proving D
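The two chaining strategies on this toy KB can be sketched in Python, representing rules as (premise set, conclusion) pairs. This is a hypothetical Horn-rule encoding of ours; real systems add unification and more elaborate data structures.

```python
def forward_chain(facts, rules):
    """Forward chaining: fire rules on known facts until a fixpoint."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts

def backward_chain(goal, facts, rules):
    """Backward chaining: prove goal by proving the premises of a rule
    that concludes it (assumes acyclic rules)."""
    if goal in facts:
        return True
    return any(
        all(backward_chain(p, facts, rules) for p in premises)
        for premises, conclusion in rules
        if conclusion == goal
    )

# The KB of the example: axioms A, B; rules A -> C and C ^ B -> D
rules = [({"A"}, "C"), ({"C", "B"}, "D")]
print(sorted(forward_chain({"A", "B"}, rules)))  # ['A', 'B', 'C', 'D']
print(backward_chain("D", {"A", "B"}, rules))    # True
```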

Forward and backward chaining strategies are widely used in biomedical symbolic AI expert systems. They are not FOL-complete, however! Recall that Goedel proved that in sufficiently complex reasoning systems (such as FOL) there are true statements that cannot be proven from the axioms of the system. He also proved that for the provable sentences, there exists an algorithm to generate the proof. Robinson [7] discovered such an algorithm (Resolution Refutation), which operates by introducing the negation of the sentence we wish to prove into the knowledge base and deriving a contradiction. Resolution Refutation is complete with respect to what can be proven in FOL. For technical details of the algorithm refer to [1, 7].

  • The Resolution Refutation algorithm, in the KB of the previous example, will:

    1. Add ¬D to the KB

    2. From A and A → C infer C and add it to the KB

    3. From C, B and C ∧ B → D infer D and add it to the KB

    4. From D and ¬D derive a contradiction and terminate, declaring success in proving D

The above examples are hugely simplified by not addressing predicates, variables, functions, quantification, conversion to different canonical forms and their matching, which are all needed for the general-case operation of the algorithms. For technical details see [1, 6, 7]. Nilsson [7] in particular gives a definitive technical treatment of rule-based systems and their properties.
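A propositional version of resolution refutation can be sketched as follows (FOL resolution additionally requires clausal conversion and unification). Clauses are sets of literals; the rules of the example become the clauses {¬A, C} and {¬C, ¬B, D}. This encoding and the function names are ours, for illustration only.

```python
def negate(lit):
    """Negate a literal: 'A' <-> '~A'."""
    return lit[1:] if lit.startswith("~") else "~" + lit

def resolve(c1, c2):
    """All resolvents of two clauses (sets of literals)."""
    return [frozenset((c1 - {lit}) | (c2 - {negate(lit)}))
            for lit in c1 if negate(lit) in c2]

def proves(kb_clauses, goal):
    """Resolution refutation: add the negated goal, saturate with resolution,
    and report whether the empty clause (a contradiction) is derived."""
    clauses = set(kb_clauses) | {frozenset([negate(goal)])}
    while True:
        new = set()
        for a in clauses:
            for b in clauses:
                if a != b:
                    for r in resolve(a, b):
                        if not r:      # empty clause: contradiction found
                            return True
                        new.add(r)
        if new <= clauses:             # no progress: goal not provable
            return False
        clauses |= new

# Axioms A, B plus the clausal forms of A -> C and C ^ B -> D
kb = [frozenset(["A"]), frozenset(["B"]),
      frozenset(["~A", "C"]), frozenset(["~C", "~B", "D"])]
print(proves(kb, "D"))  # True
```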

Logic-Derivative Formalisms and Extensions of FOL

FOL is almost never used in its pure form in biomedical AI applications. Instead, it serves as a foundation for other more specialized and invariably simplified formalisms. Occasionally researchers have extended ordinary FOL to accommodate reasoning with probabilities, or time. The following Table 4 lists important FOL derivatives and extensions.

Table 4 Types of logic-based systems (FOL derivatives)

Although FOL is not used without major modifications and simplifications in the above, it remains the most important theoretical framework for understanding the structure, capabilities and limitations of such methods and systems. Chapter “Lessons Learned from Historical Failures, Limitations and Successes of AI/ML In Healthcare and the Health Sciences. Enduring problems, and the role of Best Practices” discusses present-day concerns in the AI/ML science and technology community that a drastic departure from symbolic AI (e.g., in favor of purely statistical and ontologically shallow input-output representations) does not bode well for the field’s ability to successfully address the full range and complexity of health science and healthcare problems.

Non-Symbolic AI for Reasoning with Uncertainty

Numerous non-symbolic methods exist and the most important ones in current practice are covered in detail in chapters “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare and Health Science” and “Foundations of Causal ML”. Here we will address two methods of great importance in the modern practice of healthcare and health science research: Decision Analysis and Bayesian networks.

Methods that have predominantly historical significance will not be addressed, in order to preserve reader and book bandwidth and focus more on techniques that are part of modern practices.

Decision Analysis (DA) and Maximum Expected Utility (MEU)-Based Reasoning

Decision Analysis using Maximum Expected Utility stems from the fundamental work of Von Neumann and Morgenstern dating back to 1947. This theory provides a model of prescriptive decision making designed to limit the risk of a decision agent facing uncertainty. Whereas the theory may not be universally applicable to all situations involving biomedical decisions under uncertainty, it still describes a powerful model with wide applicability.

The principles of MEU and DA can be readily grasped with a simple example. Consider the hypothetical case of a patient facing the dilemma of whether to undergo a surgery for a condition she has, or to opt for the conservative treatment. Assume that either decision cannot be followed by the other (e.g., a failed surgery precludes improvement by the conservative treatment, whereas the conservative treatment exceeds the time window when the surgery is beneficial).

Furthermore let the probability of success of surgery in such patients be p(surgical success) = 0.9 and probability of success of non-surgical treatment in such patients be p(nonsurgical success) = 0.6. Finally let the quality of life (measured in a utility scale ranging in [0,1]) after successful surgery be 0.8, after failed surgery be 0.2, after successful conservative treatment be 1, and under failed conservative treatment be 0.5.

Utility assessment protocols designed to identify a patient’s preferences and map them to a utility scale exist. Expected utility theory defines four axioms describing a rational decision maker: completeness, transitivity, independence of irrelevant alternatives, and continuity [57]. The principle of MEU decision making, based on these axioms, designates the optimal decision as the one that maximizes the expected utility over all possible decisions:

$$ Optimal\ decision={\mathrm{argmax}}_{\mathrm{i}}\kern0.33em \mathbf{E}\left(U\left( decisio{n}_i\right)\right) $$

where

$$ \mathbf{E}\left(U\left( decisio{n}_i\right)\right)=\sum \limits_j p\left( outcom{e}_{ij}\right)\;U\left( outcom{e}_{ij}\right) $$

and:

  • E(U(decisioni)) is the expected utility of the ith decision,

  • U(outcomeij) is the patient-specified utility of the jth outcome of decision i, and

  • p(outcomeij) is the probability of the jth outcome of decision i.

In our hypothetical example, we can easily see that

$$ \mathbf{E}\left(U\left( decisio{n}_{surgery}\right)\right)=0.9\ast 0.8+0.1\ast 0.2=0.74 $$

whereas

$$ \mathbf{E}\left(U\left( decisio{n}_{conservative}\right)\right)=0.6\ast 1+0.4\ast 0.5=0.80 $$

The decision that maximizes expected utility is thus the non-surgical treatment.
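The MEU computation above is a one-line sum per decision. The sketch below encodes the (probability, utility) pairs from the hypothetical example in the text; the variable and function names are ours.

```python
def expected_utility(outcomes):
    """E[U(decision)] = sum of p(outcome) * U(outcome) over its outcomes."""
    return sum(p * u for p, u in outcomes)

# (probability, utility) pairs for each decision, from the example in the text
decisions = {
    "surgery":      [(0.9, 0.8), (0.1, 0.2)],
    "conservative": [(0.6, 1.0), (0.4, 0.5)],
}

eu = {d: round(expected_utility(o), 2) for d, o in decisions.items()}
best = max(eu, key=eu.get)
print(eu)    # {'surgery': 0.74, 'conservative': 0.8}
print(best)  # conservative
```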

In graphical form the above scenario is captured by the following Decision Analysis Fig. 1.

Fig.1
A decision tree chart. The optimal decision towards surgery has probabilities of 0.9 and 0.1 for positive and negative outcomes with utilities of values 0.8 and 0.2 respectively. Towards conservative has probabilities 0.6 and 0.4 in utilities 1 and 0.5 for positive and negative outcomes respectively.

A decision analysis tree augmented with probabilities and utilities corresponding to the hypothetical example in the text

We note that MEU DA, whenever applicable, is a powerful and principled way to make decisions that maximize benefit. Pitfalls in MEU-DA based reasoning are:

Pitfalls 2.5

Decision Analysis (DA) and Maximum Expected Utility (MEU)-based reasoning

  1. Errors in the estimation of probabilities for the various events.

  2. Errors in eliciting utility estimates in a way that captures patients’ true preferences (including using the care providers’ utilities rather than the patients’).

  3. The structure or complexity of the problem setting defies the analyst’s ability to describe it completely and accurately.

  4. Developing a DA for one population and applying it in another with a different problem structure, different probabilities for action-dependent and action-independent events, or different patient preferences.

The corresponding best practices are addressing these sources of errors that can lead a decision analysis astray.

Best Practice 2.6

Decision Analysis (DA) and Maximum Expected Utility (MEU)-based reasoning

  1. Ensure that the structure of the problem setting is sufficiently and accurately described by the DA tree. Omit only factors that are known or obviously irrelevant.

  2. Elicit utility estimates in a way that captures patients’ true preferences, using established utility-elicitation methods.

  3. Accurately estimate the probabilities of action-dependent and action-independent events.

  4. In most conditions, and whenever applicable, prefer data-driven approaches to subjective probability estimates. Use probability-consistent statistical or ML algorithms to estimate the probabilities.

  5. Ensure that the decision analysis is applied to the correct population.

  6. Conduct sensitivity analyses that reveal how much the estimated optimal decision is influenced by uncertainty in the specification of the model.

  7. Whenever possible, produce credible intervals/posterior probability distributions for the utility expectations of decisions.
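A one-way sensitivity analysis, as recommended above, can be sketched by sweeping one input over a grid and finding where the optimal decision flips. The probabilities and utilities below are the hypothetical values of the earlier example; the grid and epsilon are our illustration choices.

```python
def eu_surgery(p=0.9):
    """Expected utility of surgery given p = p(surgical success)."""
    return p * 0.8 + (1 - p) * 0.2

def eu_conservative(q):
    """Expected utility of conservative treatment given q = p(nonsurgical success)."""
    return q * 1.0 + (1 - q) * 0.5

# Sweep q over a grid; find the smallest q at which conservative treatment
# is at least as good as surgery (a small epsilon absorbs float error).
flip = None
for i in range(101):
    q = i / 100
    if eu_conservative(q) + 1e-9 >= eu_surgery():
        flip = q
        break
print(flip)  # 0.48: below this success rate the optimal decision flips to surgery
```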

These cover only the most salient aspects of the art and science of MEU driven decision analysis. A more detailed introduction is given in [20] and a comprehensive applied treatment in [21].

Reasoning with Uncertainty: Probabilistic Reasoning with Bayesian Networks

Bayesian Networks (BNs) are an AI/ML family of models that can describe the probabilistic (or deterministic, or hybrid) relationships among variables. They have extensive usability, range of application, attractive properties and thus high practical value. They can support several use cases and types of problem solving (which can be combined):

  • Use 1: Overcome the limitations of intractable (brute force Bayes), or unduly restrictive (Naive Bayes), classifiers.

  • Use 2: They are very economical to describe (i.e., they have space complexity that closely mirrors the distribution complexity).

  • Use 3: They can be created both from expert knowledge and from data (or with hybrid sources).

  • Use 4: They can be used for MEU DA (providing probability estimates for DAs or in their “influence diagram” form).

  • Use 5: They can perform flexible classification and other sophisticated probabilistic inferences.

  • Use 6: They can be thought of as probability-enhanced logical rules and combine forward and backward, as well as forward-backward inferences in a way that is probabilistically accurate.

  • Use 7: They can be used (with very mild additional restrictions) to reason causally including: (1) Distinguishing between observing passively a variable’s value vs. applying interventions that cause the variable to take that value, and reason accordingly. (2) Reasoning from causes to outcomes, from outcomes to causes, and simultaneously in both directions over multiple and overlapping causal pathways.

  • Use 8: Their causal variants can be used to discover causality, not just perform inferences with existing causal models.

  • Use 9: They have close relationship to the Markov Boundary theory of optimal feature selection.

Because of these properties we touch upon various forms and derivatives of BNs in several chapters and contexts in this volume: AI reasoning under uncertainty (chapter “Foundations and Properties of AI/ML Systems”), Bayesian classifiers (chapter “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare and Health Science”), Markov Boundary based feature selection (chapter “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare and Health Science”) and causal discovery and modeling (chapter “Foundations of Causal ML”).

We caution that not every graph or every probabilistic graph is a BN and the BN properties derive from very specific requirements. Because there is confusion in parts of the literature (where some authors derive models that are not BNs but present them as such), we will provide here, for clarity, an unambiguous technical description of this family of AI/ML models.

BN Definitions

  • Definition. Bayesian Network. A BN comprises (1) a directed acyclic graph (DAG); (2) a joint probability distribution (JPD) over variable set V such that each variable corresponds to a node in the DAG; and (3) a restriction of how the JPD relates to the DAG, known as the Markov Condition (MC).

  • Definition. Directed Graph. A directed graph is a tuple <V,E>, where V is a set of nodes representing variables 1-to-1, and E is a set of directed edges (arrows), each of which connects an ordered pair of members of V.

  • Definition: Two nodes are adjacent if they are connected by an edge. Two edges are adjacent if they share a node.

  • Definition: A path is any set of adjacent edges.

  • Definition: A directed path is a path where all edges are pointing in the same direction.

  • Definition: A directed acyclic graph (DAG) is a directed graph that has no cycles in it, that is, there is no directed path that contains the same node more than once.

  • Definition: The joint probability distribution (JPD) over V is any proper probability distribution (i.e., every possible joint instantiation of variables has a probability associated with it and the sum of those is 1).

  • Definition: Parents, children, ancestors, and descendants: In a directed graph, if variables X,Y share an edge X→ Y then X is called the parent of Y, and Y is called the child of X. If there is a directed path from X to Y then X is an ancestor of Y and Y is a descendant of X.

  • Definition: Spouses: In a directed graph, the spouses of a variable Vj are all variables that are parents of at least one of the children of Vj.

  • Definition: The Markov Condition (MC) states that every variable is independent of its non-descendants given its parents (direct causes).

  • Definition: If all dependencies and independencies in the data are the ones following from the MC, then the encoded JPD is a Faithful Distribution to the BN and its graph.

  • Definition: Degree of a node is the number of edges connected to it. In a directed graph, this can be further divided into in-degree and out-degree, corresponding to the number of parents (edges oriented towards the node) and children (edges oriented away from the node) that the node has.

  • Definition: A collider is a variable receiving incoming edges from two variables; for example, in X → Y ← Z, Y is the collider. A collider is “shielded” or “unshielded” according to whether the two parents of the collider are, or are not, connected by an edge, respectively. Unshielded colliders form so-called “v-structures”.

  • Definition: A trek is a path that contains no colliders.

  • Definition: The graphical Markov Boundary of a variable Vj is the union of its parents Pa(Vj), its children Ch(Vj) and the parents of its children Sp(Vj).

  • Definition: The probabilistic Markov Boundary of a variable Vj is the minimal set of variables S that renders Vj conditionally independent of all other variables when we condition on S.

Key Properties of Bayesian Networks

Unique and Full Joint Distribution Specification. If the Markov Condition (MC) holds, then the conditional probabilities of every variable given its parents specify a well-defined and unique joint distribution over the variable set V.

Any Joint Probability Distribution Can Be Represented by a BN. If we wish to model JPD J1 by a BN, we can order the variables arbitrarily, connect with edges every variable Vj with all variables preceding it in the ordering, and define the conditional probability of Vj given its parents in the graph to equal the one calculated from J1. Then the implied JPD J2 of the BN will be J2 = J1. Note: the constructive proof outlined here is a large-sample result. Much more sample-efficient procedures exist for small-sample situations.

The Joint Distribution of a BN Can Be Factorized Based on Parents. The joint distribution factorizes as the product of the conditional probability distribution of every variable given its parents (direct causes):

$$ probability\;\left({V}_1,{V}_2,\dots .,{V}_k\right)=\prod \limits_j probability\;\left({V}_j\;| Pa\;\left({V}_j\right)\right) $$
(4)

Where j indexes the variables in V, and Pa (Vj) is the set of parents of variable Vj.

Because of this factorization, we only need to specify up to |V| conditional probability distributions in order to fully specify the BN (where |V| is the number of variables). When all variables have a small number of parents, the total number of probabilities is linear in |V|. By comparison, in a Brute Force Bayes classifier we always need to specify on the order of 2^|V| probabilities (for binary variables).

Similarly, whenever we need to compute the joint probability of Eq. 4 for a particular instantiation of the variables involved, this is a linear-time operation in the number of variables.
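The factorization of Eq. 4 can be sketched for a toy three-variable BN (A → C ← B, binary variables; the CPT numbers are hypothetical, not taken from the chapter):

```python
# Each node maps to its parent list; each CPT maps parent values to P(node=True).
parents = {"A": [], "B": [], "C": ["A", "B"]}
cpt = {
    "A": {(): 0.3},
    "B": {(): 0.6},
    "C": {(True, True): 0.9, (True, False): 0.7,
          (False, True): 0.4, (False, False): 0.1},
}

def p_node(node, assignment):
    """P(node = its assigned value | its parents), read off the CPT."""
    p_true = cpt[node][tuple(assignment[p] for p in parents[node])]
    return p_true if assignment[node] else 1 - p_true

def joint(assignment):
    """Eq. 4: product over nodes of P(node | parents) -- linear in |V|."""
    result = 1.0
    for node in parents:
        result *= p_node(node, assignment)
    return result

# P(A=T, B=F, C=T) = 0.3 * 0.4 * 0.7 = 0.084
print(round(joint({"A": True, "B": False, "C": True}), 3))  # 0.084
```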

Definition: D-separation

  1. Two variables X, Y connected by a path are d-separated (aka the path is “blocked”) given a set of variables S, if and only if on this path there is

    (a) a non-collider variable contained in S, or

    (b) a collider such that neither it nor any of its descendants is contained in S.

  2. Two variables X and Y connected by several paths are d-separated given a set of variables S, if and only if they are d-separated by S on all paths connecting X to Y.

  3. Two disjoint variable sets X and Y are d-separated by variable set S iff every pair <Xi, Yj> is d-separated by S, where Xi and Yj are members of X and Y respectively.
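D-separation can be checked algorithmically. A standard sketch (one of several equivalent algorithms) restricts the DAG to the ancestors of the queried sets, moralizes it, deletes the conditioning set, and tests reachability. The {node: parent list} encoding is our illustration choice.

```python
from collections import deque

def d_separated(dag, xs, ys, s):
    """Check whether node sets xs and ys are d-separated by s in a DAG
    given as {node: [parents]}: they are d-separated iff they are
    disconnected in the moralized ancestral graph after removing s."""
    # 1. Restrict to xs | ys | s and all of their ancestors.
    relevant, stack = set(), list(xs | ys | s)
    while stack:
        n = stack.pop()
        if n not in relevant:
            relevant.add(n)
            stack.extend(dag[n])
    # 2. Moralize: undirected parent-child edges plus parent-parent ("spouse") edges.
    adj = {n: set() for n in relevant}
    for n in relevant:
        ps = [p for p in dag[n] if p in relevant]
        for p in ps:
            adj[n].add(p); adj[p].add(n)
        for i, p in enumerate(ps):
            for q in ps[i + 1:]:
                adj[p].add(q); adj[q].add(p)
    # 3. Remove the conditioning set and test reachability from xs to ys.
    frontier = deque(x for x in xs if x not in s)
    seen = set(frontier)
    while frontier:
        n = frontier.popleft()
        if n in ys:
            return False
        for m in adj[n] - s - seen:
            seen.add(m); frontier.append(m)
    return True

# Collider X -> Y <- Z: X and Z are d-separated marginally, but not given Y.
dag = {"X": [], "Z": [], "Y": ["X", "Z"]}
print(d_separated(dag, {"X"}, {"Z"}, set()))   # True
print(d_separated(dag, {"X"}, {"Z"}, {"Y"}))   # False
```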

Inspection of the Graph of a BN Informs us About all Conditional Independencies in the Data. By inspection (by eye or algorithmically) of the causal graph (and application of d-separation) we can infer all valid conditional independencies in the data, without analyzing the data as follows:

  • If variable sets X and Y are d-separated by variable set S then they will be conditionally independent given S in the JPD encoded by the BN.

Inspection of the Graph of a BN Encoding a Faithful Distribution, Informs us about all Conditional Independencies and Dependencies in the Data. A BN encoding a faithful distribution entails that all dependencies and independencies in the JPD can be inferred by the DAG by application of the d-separation criterion as follows:

  • If variable sets X and Y are d-separated by variable set S in the BN graph, then they will be conditionally independent given S in the JPD encoded by the BN. Otherwise they will be dependent.

  • Equivalently:

  • Variable sets X and Y are conditionally independent given S in the JPD encoded by the BN, iff they are d-separated by variable set S in the BN graph.

Therefore in faithful distributions, the BN graph becomes a map (a so-called i-map) of dependencies and independencies in the data JPD encoded by the BN. Conversely, by inferring dependencies and independencies from the data we can construct the BN’s DAG and parameterize the conditional probabilities of every variable given its parents, effectively recovering the causal process that generates the data (up to orientation ambiguities). This is a fundamental principle of operation of causal ML methods (discussed in more detail in chapter “Foundations of Causal ML”).

A Variable in a BN is Independent of all Other Variables Given its Graphical Markov Boundary and Equivalently a variable in a JPD is independent of all other variables given its Probabilistic Markov Boundary in Faithful Distributions.

Relationship of Markov Boundary and Causality. The Markov Boundary can be used to obtain optimal feature sets for predictive modeling when the BN is known or is inferred from data. Because the graphical and probabilistic Markov Boundaries are identical in faithful distributions, in causal BNs there is a close connection between the local causal network around a variable and its probabilistic Markov Boundary (see chapter “Foundations of Causal ML”).
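The graphical Markov Boundary definition (parents ∪ children ∪ spouses) translates directly into code; the toy DAG below is a hypothetical example of ours, with the DAG encoded as {node: parent list}:

```python
def markov_boundary(dag, target):
    """Graphical Markov Boundary: parents, children, and spouses of target."""
    parents = set(dag[target])
    children = {n for n, ps in dag.items() if target in ps}
    spouses = {p for c in children for p in dag[c]} - {target}
    return parents | children | spouses

# Hypothetical DAG: A -> T <- B, T -> C <- D, C -> E
dag = {"A": [], "B": [], "T": ["A", "B"], "D": [], "C": ["T", "D"], "E": ["C"]}
print(sorted(markov_boundary(dag, "T")))  # ['A', 'B', 'C', 'D'] -- E is excluded
```

Given the Markov Boundary {A, B, C, D}, conditioning on it renders T independent of every other variable (here, E) in faithful distributions.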

BNs Allow Flexible Inference

We will illustrate flexible inference with an example depicted in Fig. 2.

Fig. 2
6 directed network graphs. Networks 1, 2, 3, 5, and 6 consists of nodes from A to M. Network 4 consists of nodes C, B, H, pointing to F. Some of the nodes are shaded.

Flexible predictive modeling and forward/backward reasoning in BNs

In Fig. 2 part (1) we see a BN model for some problem solving domain. In part (2) we query the BN with the question: what is the probability of F (grey node) given that we have observed the values of variables {C, B, H} (green nodes)? The inference algorithms propagate and synthesize information upward (e.g., from C and B to A) and downward from A and B to F. Notice that given B, H is irrelevant to F.

If we wish to set up a conventional classifier (of any type: Logistic Regression, Boosting, Deep Learning, SVM, Random Forest, etc.), it has to obey the fixed input-response structure depicted in (4); in other words, in order to answer this question we need to approximate a function of the probability P(F | C, B, H) and train it from scratch for that query. The BN (3), however, requires no modification.

If we wish to answer next what the joint probability of {F, B, H} is given that we observe {G, A, K, M}, again the same BN (5) can give us the answer. Other predictive modeling methods (6), however, will need to be trained from scratch to estimate P(F, B, H | G, A, K, M).

Because the number of such queries grows exponentially with the number of variables, it is essentially impossible to answer all the queries answerable by a BN by training specialized classifiers. Moreover, any subdivision of the variable set into observed, unknown, or query variables is allowed and need not be known a priori.
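This flexibility can be illustrated with inference by enumeration, the simplest (exponential-time) BN inference algorithm: one fixed model answers structurally different conditional queries with no retraining. We reuse a toy A → C ← B network with hypothetical CPTs of our choosing:

```python
from itertools import product

# Toy binary BN: A -> C <- B (hypothetical CPTs, not from the chapter's figure).
parents = {"A": [], "B": [], "C": ["A", "B"]}
cpt = {"A": {(): 0.3}, "B": {(): 0.6},
       "C": {(True, True): 0.9, (True, False): 0.7,
             (False, True): 0.4, (False, False): 0.1}}

def joint(assign):
    """Product over nodes of P(node | parents) for a full assignment."""
    p = 1.0
    for n, ps in parents.items():
        pt = cpt[n][tuple(assign[q] for q in ps)]
        p *= pt if assign[n] else 1 - pt
    return p

def query(target, value, evidence):
    """P(target = value | evidence) by summing the joint over hidden variables."""
    nodes = list(parents)
    num = den = 0.0
    for values in product([True, False], repeat=len(nodes)):
        assign = dict(zip(nodes, values))
        if all(assign[k] == v for k, v in evidence.items()):
            p = joint(assign)
            den += p
            if assign[target] == value:
                num += p
    return num / den

# The same model answers forward and backward queries -- no retraining:
print(round(query("C", True, {"A": True}), 3))  # P(C | A): cause to effect
print(round(query("A", True, {"C": True}), 3))  # P(A | C): effect to cause
```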

We now examine (Fig. 3) how a causal BN (see chapter “Foundations of Causal ML” for formal definitions) can answer causal questions. Consider the query: what is the probability of F (grey node) given that we have observed the values of variables {B, H} (green nodes) and have manipulated C to take a specific value (an intervention denoted by do(C))? The causal BN (left) knows that when we manipulate a variable, nothing else can affect it. Thus the arc A→C is effectively eliminated by the manipulation in the context of the query. Consequently, information does not travel from C to F via A, as it would had C merely been observed. Predictive models lacking causal semantics (e.g., Logistic Regression, Boosting, Deep Learning, SVM, Random Forest, etc.) will propagate information from C to F, thus arriving at a wrong conclusion. Incidentally, this problem cannot be fixed in conventional predictive modelers by eliminating C from the model, since valid causal/information paths may exist from C to F that need to be considered even when we manipulate C (and indeed the causal BN considers them).

Fig. 3
Two directed network graphs. 1. Has nodes from A to M. 2. Nodes C, B, H, point to F. Node C is marked as, do of C in both.

Causally Consistent Inference with BNs AND Forward/Backward reasoning

Computational complexity both for learning BNs and for conducting inference with them is intractable in the worst case. However, highly successful mitigation strategies have led to super-efficient average-case or restricted-purpose algorithms (see chapters “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare and Health Science”, and “Lessons Learned from Historical Failures, Limitations and Successes of ML in Healthcare and the Health Sciences. Enduring Problems and the Role of Best Practices” for details). Key references for the properties of BNs discussed here are [22,23,24,25].

AI Search

Search as a General Problem Solving Strategy. Conventional Search Versus AI Search

Search is a general problem-solving methodology to which many (if not most) problems can be reduced. Somewhat analogous to the physical search for an object or a location inside a physical space, AI search constructs a state space in which each state represents a possible solution or partial solution to a problem. Search algorithms then traverse this state space trying to discover or construct a solution. For example, in a ML context, the state space could comprise models fit from data, such that each model has a different structure and corresponding estimated generalization predictivity. AI search in the ML context then seeks to find ML models that achieve the highest or sufficiently high predictivity. In an autonomous navigation context, AI search would seek a navigation path that achieves the smallest traversed distance, smallest trip cost, or other objectives. In a scheduling context, AI search may seek to schedule patients and operating room personnel to operating rooms so that cancellations are minimized. The diversity of problems that can be solved with search is infinite and covers the full space of computable functions.

General AI Search Framework: “Best” First Search (BeFS) Family and Variants

Whereas search is also performed outside AI, most notably with linear and nonlinear optimization methods, ad hoc search algorithms, and Operations Research algorithms, AI search has distinctive qualities:

  • AI search can use state spaces that are infinite in size.

  • AI search can attempt to solve problems in the hardest of complexity classes.

  • AI search admits any computable goal, not just a small space of computable functions.

  • AI search can operate with symbolic and non-symbolic representations.

Table 5 outlines a very general framework for AI search.

Table 5 High-level operation of general AI search:

This prima facie very simple procedure has immense power and flexibility and can be instantiated in a variety of ways, leading to different behaviors and properties. For example, the following variants can be obtained (Table 6):

Table 6 Notable instantiations of general BeFS AI search:

The fundamental general search algorithm and its instantiations can be readily extended to cope with infinite size search spaces and computing environments with limited space using depth-limited, iterative deepening and simplified memory-bounded A* versions. For details see [1].
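The high-level loop of Table 5 can be sketched in a few lines; the evaluation function f and the toy problem at the end are illustrative assumptions, not part of the framework itself:

```python
import heapq

def best_first_search(start, successors, is_goal, f):
    """Skeleton of 'best'-first search: always expand the frontier state
    with the best (here: lowest) evaluation f(state).  Different choices
    of f instantiate the Table 6 variants, e.g. f = heuristic distance
    gives greedy best-first search; carrying the accumulated path cost
    inside the state yields uniform-cost search and A* from this same
    skeleton."""
    frontier = [(f(start), start)]   # priority queue ordered by f
    seen = {start}
    while frontier:
        _, state = heapq.heappop(frontier)
        if is_goal(state):
            return state
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (f(nxt), nxt))
    return None                      # search space exhausted, no solution

# Toy use: reach 10 from 0 with moves +1/+3, guided by distance to the goal.
goal = 10
result = best_first_search(0, lambda s: [s + 1, s + 3],
                           lambda s: s == goal, lambda s: abs(goal - s))
```

The `seen` set makes the sketch suitable for finite graphs; for infinite spaces one would bound depth or memory as discussed above.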

Other Notable AI Search Methods

In addition to the above “classical” AI search algorithms, notable search methods include Simulated Annealing, Genetic Algorithms, Ant Colony Optimization, and search procedures applicable to rule-based systems and resolution refutation.

Simulated annealing [27] is inspired by the annealing process in metallurgy. It is a classic hill climbing method modified to incorporate randomized jumps to nearby states, so that the algorithm has a larger chance (but no guarantee) of escaping local optima rather than being trapped in suboptimal solutions.
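A minimal sketch of this idea follows; the cooling schedule, initial temperature, and toy objective function are illustrative choices only:

```python
import math, random

def simulated_annealing(initial, neighbor, energy, t0=1.0, cooling=0.95,
                        steps=500):
    """Hill climbing with a randomized escape move: a worse neighbor is
    still accepted with probability exp(-delta/T), so local optima are
    less likely (but not guaranteed) to trap the search.  All parameters
    here are illustrative, not prescriptions."""
    rng = random.Random(0)
    state, t = initial, t0
    best = state
    for _ in range(steps):
        cand = neighbor(state, rng)
        delta = energy(cand) - energy(state)
        # accept improvements always; accept worsenings with prob exp(-delta/T)
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            state = cand
        if energy(state) < energy(best):
            best = state
        t *= cooling  # gradually lower the "temperature"
    return best

# Toy use: minimize a 1-D function with a global minimum near x = -1.
f = lambda x: (x * x - 1) ** 2 + 0.3 * x
step = lambda x, rng: x + rng.uniform(-0.5, 0.5)
x = simulated_annealing(2.0, step, f)
```

As the temperature decreases, the acceptance of worsening moves becomes rare and the procedure degenerates into ordinary hill climbing.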

Genetic algorithms [28, 29] are inspired by biological evolution, which they mimic. They represent solutions as digital “chromosomes” on which, just as evolution does on actual organisms, they perform random mutations and crossover operations. This is how they generate successor solution states. Multiple states are maintained at each stage of the algorithm, with the best-fit ones having a smaller chance of being discarded. The algorithm has the attractive property that an exponentially increasing portion of better-performing states is considered at each step. Genetic algorithms can be applied in domains where the data scientists know nothing about the domain (i.e., they are “black box optimizers”). On the other hand, they have important limitations: they are not guaranteed to reach an optimal solution (i.e., they can be trapped in local optima); it has been proven that they cannot learn certain classes of functions (e.g., epistatic functions); and they are not efficient when compared to non-randomized algorithms solving the same problems.
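The mutation/crossover cycle can be sketched on bit-string chromosomes; the population size, mutation rate, selection rule, and “one-max” fitness below are illustrative assumptions:

```python
import random

def genetic_search(fitness, n_bits=16, pop_size=30, generations=60,
                   p_mut=0.02):
    """Toy genetic algorithm on bit-string 'chromosomes': fitter
    individuals are more likely to survive as parents; offspring are
    produced by one-point crossover plus random bit-flip mutations.
    All parameters are illustrative, not tuned recommendations."""
    rng = random.Random(0)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        parents = scored[:pop_size // 2]            # truncation selection
        children = []
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_bits)          # one-point crossover
            child = a[:cut] + b[cut:]
            # flip each bit independently with probability p_mut
            child = [bit ^ (rng.random() < p_mut) for bit in child]
            children.append(child)
        pop = children
    return max(pop, key=fitness)

# Toy use: "one-max" -- the fittest chromosome is all ones.
best = genetic_search(fitness=sum)
```

In a ML setting the chromosome would encode model structure or hyperparameters and the fitness would be an estimated merit value rather than a bit count.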

Ant Colony Optimization (ACO) and “swarm intelligence” are inspired by the foraging behavior of some ant species. These ants deposit pheromone on the ground to mark favorable paths that should be followed by other members of the colony. ACO can be used for graph search, scheduling problems, classification, image recognition, and other problems. For several ACOs, convergence has been proven (albeit at an unknown number of iterations). Finally, empirical results on >100 NP-hard problems have shown competitive and occasionally excellent performance compared to the best known algorithms [30].

Specialized search procedures include AO* (suitable for searching AND/OR graphs used in decomposable rule based systems), MINIMAX and ALPHA-BETA search (suitable for game tree search), and Resolution Refutation search strategies (e.g., unit preference, set of support identification, input resolution, linear resolution, subsumption etc.) designed to make the resolution refutation algorithm reach a proof faster [1, 6, 7].

AI/ML Languages

Whereas the statement of algorithms and theoretical analysis is typically conducted in pseudocode, practical development depends on the choice of programming language. A few languages are particularly suitable for AI/ML; their properties are summarized in Table 7:

Table 7 Notable AI/ML Languages

Foundations of Machine Learning Theory

AI and ML as Applied Epistemology

Epistemology is the branch of philosophy concerned with knowledge: its generation and sources, nature, its achievable scope, its justification, the concept of belief as it relates to knowledge, and related issues [31]. From the perspective of this volume it is worthwhile to notice first that AI formalizes knowledge so that it can be used in applied settings. AI can also inform what types of knowledge are computable and what inferential procedures can be applied and with what characteristics. ML in particular, by virtue of being able to generate knowledge from data, on one hand obtains its justification not just by empirical success but by epistemological principles of science. On the other hand, ML puts to test epistemological hypotheses and theories about how knowledge is, can, or should be generated. The following sections provide a concise outline of the key theories that provide the firm scientific ground on which ML is built and in particular for fortifying its performance and generalization properties. They also summarize a few related pitfalls and high level best practices that will be developed further in other chapters.

ML Framed as AI search and the Role of Inductive Bias

We showed earlier in this chapter how AI search can be used to solve hard problems. ML itself can be cast as a search endeavor [29, 32, 33]. In this framework, ML search comprises:

  (a) A model language L in which a family of ML models can be represented: for example, decision trees, logic rules, artificial neural networks, linear discriminant equations, causal graphs, etc. Typically the model language comes with associated procedures that enable models to be built when data D are provided (i.e., a model fitting procedure MF).

  (b) A data-generating or design procedure DD that creates data (typically by sampling from a population or other data-generating process) from which models are fit.

  (c) A hypothesis space S. The language L implicitly defines a space S comprising all models expressible in the language that can be fit with MF applied on D, with each model Mi representing a location or state in S. For example, the space of all decision trees, neural networks, linear discriminant functions, boosted trees, etc. that can be built over variable set V using MF on D.

  (d) A search procedure MLS that navigates the space in order to find a model representing an acceptable solution to the ML problem as defined by a goal criterion. For example, a steepest-ascent hill climbing search over the space of decision trees over V fit by MF given data D.

  (e) A goal (or merit) function GM that examines a search state (i.e., a model Mi) and decides whether it is a solution, or how close to a solution it is (its merit value). For example, whether Mi has acceptable predictivity, uncertainty, generalizability, etc., or what its merit value is (e.g., the difference of its properties from the goal ones).

  • The tuple: < L, MF, DD, S, MLS, GM > defines the architecture of a ML method.

  • Every ML method can be described and understood in these terms (although additional perspectives and analytical frameworks are also valuable, and in some cases necessary).

  • The tuple: < L, MF, S, GM > describes what is commonly referred to as a ML “algorithm”, whereas MLS describes the model selection procedure that ideally will incorporate an error estimator procedure for the final (best) model(s) found. GM and its estimators from data may or may not be identical to the error function and its estimators (see chapter “The Development Process and Lifecycle of Clinical Grade and Other Safety and Performance-Sensitive AI/ML Models”).

  • The tuple: < L, MF, S, MLS, GM > describes the Inductive Bias of that ML method.

  • The inductive bias of a ML method is the preference (“bias”) of that method for a class of models over other models that are either not considered at all or not prioritized by the method.

Notice that in practice there are two search procedures in operation:

MF is a search procedure in the space of model parameter values, once a model family and its family-level parameter values (aka “hyperparameters”) have been visited by the second (top-level, or overarching) search procedure MLS, which searches over possible model families and their hyperparameters.

For example, the search procedure in a decision tree induction algorithm is a greedy steepest-ascent search, while a hyperparameter may be the minimum number of samples required for accepting a new node or leaf. In an SVM model the search procedure is quadratic programming, and hyperparameters may be the cost C and the kernel function and its parameters. The model selection procedure that decides between these two model families and the right hyperparameter values for them may be grid search using a cross-validation error estimator, or another appropriate model selection process (see chapter “Principles of Rigorous Development and of Appraisal of ML and AI Methods and Systems” for details).
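The two-level search can be made concrete in a small sketch that separates the inner fit (MF) from the outer model-selection grid search (MLS) with a cross-validation error estimator. The two model families (a fixed-threshold rule and a toy nearest-neighbor classifier), the dataset, and the hyperparameter grids are all invented for illustration:

```python
import random

def k_fold_score(fit, data, k=5):
    """Average held-out accuracy over k folds -- the error estimator
    that the outer model-selection search (MLS) relies on."""
    folds = [data[i::k] for i in range(k)]
    accs = []
    for i in range(k):
        test = folds[i]
        train = [d for j, fold in enumerate(folds) if j != i for d in fold]
        predict = fit(train)            # MF: fits one model's parameters
        accs.append(sum(predict(x) == y for x, y in test) / len(test))
    return sum(accs) / k

# Two hypothetical model families, each with one hyperparameter.
def fit_threshold(t):                   # family 1: fixed-threshold rule
    return lambda train: (lambda x: int(x > t))

def fit_knn(k):                         # family 2: k-nearest-neighbor vote
    def fit(train):
        def predict(x):
            near = sorted(train, key=lambda d: abs(d[0] - x))[:k]
            return round(sum(y for _, y in near) / k)
        return predict
    return fit

rng = random.Random(0)
data = [(x, int(x > 0.5)) for x in (rng.random() for _ in range(100))]

# MLS: grid search over both model families and their hyperparameters.
grid = [("threshold", fit_threshold(t)) for t in (0.3, 0.5, 0.7)] + \
       [("knn", fit_knn(k)) for k in (1, 3, 5)]
best_name, best_fit = max(grid, key=lambda m: k_fold_score(m[1], data))
```

In practice the final model should be re-fit on all the data and its error estimated on data not used by the selection loop, as discussed in the chapters cited above.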

The ML search framework above readily entails important properties of ML:

  1. The choice of model language affects most major model properties, such as error, tractability, transparency/explainability, sample efficiency, causal compatibility, generalizability, etc.

  2. The data-generating procedure implements the principles and practices of data design, which is a whole topic by itself (see chapter “Data Design for Biomedical AI/ML”). Because the whole operation of ML as search depends so heavily on the data D, the data design/data-generating procedure strongly interacts with the other components and determines the success of the ML model search.

  3. The search space typically has infinite, or finite but astronomically large, size.

  4. The ML search procedures MF typically have to find sparse solutions in this practically infinite search space. Therefore they are often custom-tailored or optimized for the specific ML algorithm.

  5. The MLS search procedures are designed to operate over several ML algorithm families and their hyperparameters. They are typically much less intensive and are often informed by prior analyses of similar data in the problem domain, which give guidance about which hyperparameter ranges will likely contain the optimal values.

  Worth noting in particular with respect to the inductive bias:

  6. The match of the inductive bias of a ML method to the problem one wishes to solve (hence to the data-generating function to be modeled and the data design procedure that samples from it) determines the degree of success of that ML method.

  7. It also follows that if a ML method has no restrictions on its inductive bias, it cannot learn anything useful at all, in the sense that it would accept any model equally as well as any other (i.e., accept good and bad models alike); in the extreme it would amount to random guessing among all conceivable models.

  8. At the same time, a successful ML method must not have a too-restrictive inductive bias, because this may cause it to lack the ability to represent or find good models for the task at hand.

  9. Taken together, (6), (7), and (8) show that a successful ML method must find the right level of restriction or “openness” of the inductive bias.

We note that the inductive bias of ML is a useful bias, and “bias” here should not be read as a negative term (unlike, for example, ethical, social, or statistical estimator biases, which are invariably negative).

Pitfall 2.6

Using the wrong inductive bias for the ML solution to the problem at hand.

Pitfall 2.7

Ignoring the fit of the data generating procedure with the ML solution to the problem at hand.

Best Practice 2.7

Pursue ML solutions with the right inductive bias for the problem at hand.

Best Practice 2.8

Create a data generating or design procedure that matches well the requirements of the problem at hand and works with the inductive bias to achieve strong results.

ML as Geometrical Construction and Function Optimization

As will be elaborated in chapter “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare and Health Science”, ML methods are in some important cases cast as geometric constructive solutions to the problem of discriminating between objects. Figure 4 below shows a highly simplified example of distinguishing cancer patients from healthy subjects on the basis of two gene expression values. The ML method used (SVMs in this example) casts the diagnostic problem as the geometric construction of a line (in 2D space; a hyperplane in higher dimensions) such that the cancer patients are cleanly separated from the healthy subjects, while the gap between the two classes is maximized.

Fig. 4
A scatterplot of gene y versus gene x. The clusters represent normal and cancer patients above and below an increasing line respectively. The distance between the 2 parallel lines on either side is marked as gap, and 2 dots of cancer patients and 1 of normal fall on the lines.

Geometrical constructive formulation of ML. See chapter “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare and Health Science” for mathematical formulation

Such geometrical formulations of ML can be analytically and algebraically described and then operationalized using linear algebra and optimization mathematical tools and codes. See chapter “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare and Health Science” for several examples and details on the mathematical formulations and the ensuing properties.

Computational Learning Theory (COLT): PAC Learning, VC Dimension, Error Bounds

Computational Learning Theory (COLT) formally studies under which conditions learning is feasible and provides several bounds for the generalization error depending on the classifier used, the definition of error to be minimized (e.g., number of misclassifications), and other assumptions. While theoretical results in classical statistics typically make distributional assumptions about the data (i.e., that the probability distribution of the data belongs to a certain class of distributions), COLT results typically make assumptions only about the class of discriminative models considered. Notice, though, that it may be the case that an optimal discriminative model never converges to the data-generating function.

COLT research has defined several mathematical models of learning. These are formalisms for studying the convergence of the errors of a learning method. The most widely used formalisms are the VC (Vapnik-Chervonenkis) and PAC (Probably Approximately Correct) analyses. A VC or PAC analysis provides bounds on the error given a specific classifier, the size of the training set, the error on the training set, and a set of assumptions, e.g., in the case of PAC, that an optimal model is learnable by that classifier. Typical PAC bounds, for example, state that for a specific context (classifier, training error, etc.) the error will be larger than epsilon with probability less than delta, for some given epsilon and delta. Unlike the bias-variance decomposition, COLT bounds are independent of the learning task. From the large field of COLT we suggest [34,35,36] as accessible introductions.

The VC (Vapnik-Chervonenkis) dimension (not to be confused with the VC model of learning above) is, informally, the maximum number of training examples that a learner can classify correctly under every possible assignment of class labels to them. The VC dimension of a classifier frequently appears in estimation bounds such that, all else being equal, higher VC dimension leads to a larger generalization error bound. Intuitively, a low-complexity classifier has low VC dimension and vice versa. An example VC bound follows: if the VC dimension h is smaller than the number of training examples l, then with probability at least 1 − η the generalization error of a learner is bounded by the sum of its empirical error (i.e., on the training data) and a confidence term defined as:

$$ \sqrt{\frac{h\left(\log \frac{2l}{h}+1\right)-\log \left(\eta /4\right)}{l}} $$

where 0 < η ≤ 1. Notice that this error bound is independent of the dimensionality of the problem [37]. The number of parameters of a classifier does not necessarily correspond to its VC dimension: in [38], examples are given of a classifier with a single parameter that has infinite VC dimension, and of classifiers with an unbounded number of parameters but VC dimension of 1.

Thus, a classifier with a large number of parameters (but a low VC dimension) can still have low error estimates and provide guarantees of non-over-fitting. In addition, some of these bounds are non-trivial (i.e., less than 1) even when the number of dimensions is much higher than the number of training cases.
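The confidence term quoted above can be evaluated directly. The sketch below treats the logarithms as natural and writes `eta` for the confidence parameter; the numbers plugged in are illustrative:

```python
import math

def vc_confidence(h, l, eta):
    """Confidence term of the VC bound quoted above: with probability
    at least 1 - eta, the generalization error is bounded by the
    training error plus this quantity (natural logs assumed)."""
    return math.sqrt((h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l)

# The bound is non-trivial (< 1) here even though nothing about the
# input dimensionality enters the computation.
print(vc_confidence(h=10, l=10_000, eta=0.05))
```

Note how the term shrinks as the sample size l grows and inflates as the classifier complexity h grows, matching the qualitative discussion above.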

Such results prove unequivocally that learning is possible (when using the right learning algorithms) in the situation, common in modern health science and healthcare data, where the number of observed variables is much higher than the number of available training samples. Many popular classical statistical predictive modeling methods, in contrast, break down in such situations.

The mentioned COLT results also justify the assertion that over-fitting is not equivalent to a high number of parameters. Unfortunately, many of the estimation bounds provided by COLT are not tight for the sample sizes available in common practical data analysis. In addition, COLT results often drive the design of classifiers with interesting theoretical properties, robust to the curse of dimensionality, and empirically proven successful, such as Support Vector Machines (discussed in detail in chapter “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare and Health Science”).

ML Theory of Feature Selection (FS)

Traditional ML theoretical frameworks (e.g., PAC and VC frameworks of COLT) focus on generalization error as a function of the model family used for learning, sample size and complexity of models. The theory of feature selection is a newer branch of ML that addresses the aspect of selecting the right features for modeling. It aims to guide the design and proper application of principled feature selection (as opposed to heuristic FS with unknown and suboptimal properties).

Table 8 summarizes key areas and examples of results in the theory of feature selection. In the remainder of this section we discuss two formal feature selection frameworks (Kohavi-John and Markov Boundary) and describe certain classes of feature selection problems that are commonly addressed.

Table 8 Summary of major topics and examples in the theory of feature selection

The standard feature selection problem. Consider variable set V and a data distribution J over V, from which we sample data D. Let T be a variable which we wish to predict as accurately as possible by fitting models from D. The standard feature selection problem is typically defined as [40]:

  • Find the smallest set of variables S in V s.t. the predictive accuracy of the best classifier that can be fit for T from D is maximized.

Kohavi-John framework for Standard predictive feature selection problem. Kohavi and John [39] decompose the standard feature selection problem as follows:

  • A feature X is strongly relevant if removal of X alone will result in performance deterioration of the Optimal Bayes Classifier using the full feature set. Formally:

    • X is strongly relevant iff: ¬(X ⊥ T | V − {X, T})

  • A feature X is weakly relevant if it is not strongly relevant and there exists a subset of features S such that the performance of the Optimal Bayes Classifier fit with S is worse than the performance using S ∪ {X}. Formally:

    • X is weakly relevant iff: X is not strongly relevant and ∃ S ⊆ V − {X, T} s.t. ¬(X ⊥ T | S).

  • A feature is irrelevant if it is not strongly or weakly relevant.

The strongly relevant feature set solves the standard feature selection problem.

Intuitively, choosing the strongly relevant features provides the minimal set of features with maximum information content and thus solves the standard feature selection problem, since a powerful classifier in the small sample, or the Optimal Bayes Classifier in the large sample, will achieve maximum predictivity with them. The Kohavi-John framework does not, however, provide efficient algorithms for discovering the strongly relevant feature set.

Markov Boundary framework for Standard predictive feature selection problem. Recall from the section on Bayesian Networks (BNs) that a set S is the Markov Boundary of variable T (denoted S = MB(T)) if S renders T independent of every subset of the remaining variables given S, and S is minimal (i.e., it cannot be reduced without losing this conditional independence property). This is MB(T) in the probabilistic sense. Tsamardinos and Aliferis [24] connected the Kohavi-John relevancy concepts with BNs and Markov Boundaries as follows:

In faithful distributions there is a BN representing the distribution and mapping the dependencies and independencies so that:

  1. The strongly relevant features to T are the members of MB(T).

  2. Weakly relevant features are variables, not in MB(T), that have a path to T.

  3. Irrelevant features are not in MB(T) and do not have a path to T.

Thus in faithful distributions, the Markov boundary MB(T) is the solution to the standard feature selection problem and algorithms that discover the Markov boundary implement the Kohavi-John definition of strong relevancy.
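In graphical terms, MB(T) in a causal BN consists of T’s parents, its children, and its “spouses” (the other parents of its children). A minimal sketch on a hypothetical DAG:

```python
def markov_boundary(dag, target):
    """MB(T) read off a BN graph: parents of T, children of T, and the
    other parents ('spouses') of T's children.  `dag` maps each node to
    the set of its parents; the graph below is purely illustrative."""
    parents = set(dag[target])
    children = {n for n, ps in dag.items() if target in ps}
    spouses = {p for c in children for p in dag[c]} - {target}
    return parents | children | spouses

# Toy DAG: E -> A -> T <- B, and T -> C <- D.
# E has a path to T, so it is weakly relevant, but it is outside MB(T).
dag = {"A": {"E"}, "B": set(), "T": {"A", "B"},
       "C": {"T", "D"}, "D": set(), "E": set()}
print(markov_boundary(dag, "T"))   # {'A', 'B', 'C', 'D'}
```

In faithful distributions this graphical set coincides with the probabilistic MB(T), which is what makes Markov Boundary induction a principled feature selection method.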

Local causally augmented feature selection problem and Causal Markov Boundary. In faithful distributions with causal sufficiency (see chapter “Foundations of Causal ML”) there is a causal BN that is consistent with the data-generating process and can be inferred from data, in which the strongly relevant features are exactly the members of MB(T); these also comprise the solution to the local causally augmented feature selection problem of finding:

  1. The direct causes of T.

  2. The direct effects of T.

  3. The direct causes of direct effects of T.

Thus, in faithful distributions with causal sufficiency, the causal graphical MB(T) coincides with the probabilistic MB(T). Inducing the probabilistic MB(T) then (1) solves the standard predictive feature selection problem, and (2) solves the local causally augmented feature selection problem [24, 41].

Equivalency-augmented feature selection problem and Markov Boundary Equivalence Class. In faithful distributions MB(T) exists and is unique [22]. However, in non-faithful distributions, variables or variable sets may exist that carry the same information about the target variable (i.e., target information equivalences exist; we call these “TIE distributions”), and we may have more than one MB(T) [46]. The number of Markov Boundaries can be exponential in the number of variables [46], and in empirical tests with real-life genomic data, Statnikov and Aliferis extracted tens of thousands of Markov Boundaries before terminating the experiments [48].

In TIE distributions:

  1. The Kohavi-John definitions of relevancy break down, since there are no Kohavi-John strongly relevant features any more, only weakly relevant and irrelevant ones. This is because if S1, S2 are both in the MB equivalence class {MBi(T)}, then S1 ⊥ T | S2 and S2 ⊥ T | S1.

  2. The 1-to-1 relationship between the probabilistic and the graphical MB(T) breaks down. A variable can be a member of some MBi(T) without having a direct-cause or spouse relationship with T.

  3. The standard predictive feature selection problem is solved by the smallest member of the equivalence class {MBi(T)}.

  4. The equivalency-augmented feature selection problem is to find the equivalence class of all probabilistic MBi(T).

Chapter “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare and Health Science” provides further details on the above three fundamental feature selection problem classes, organizes them into a hierarchy of increasing difficulty, shows examples, and presents and contrasts practical algorithms based on their ability to solve these problems.

Theory of Algorithmic Causal Discovery and of Computational Properties of Experimental Science

The theory of causal discovery extends traditional ML theoretical frameworks, which focus on generalization error, by investigating the feasibility, complexity, and other properties of causal discovery algorithms: algorithms operating on passive observational data, experimental interventional approaches (e.g., RCTs, biological experiments, etc.), and hybrid experimental-observational approaches. Pearl provides a comprehensive modern theory of causality [49] and Spirtes et al. a historically influential algorithmic framework for its discovery [50]. Chapter “Foundations of Causal ML” presents an extensive introduction to the function of causal discovery algorithms from non-experimental data and their properties under specific assumptions. Chapter “Foundations of Causal ML” also lists several algorithms used for causal discovery, including more modern and scalable ones. Chapter “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare and Health Science” as well as section “ML Framed as AI search and the Role of Inductive Bias” of the present chapter reference theory and algorithms at the intersection of feature selection and causality. Moreover, we mention here selected additional fundamental results that will round out the reader’s understanding of causal discovery from a theory perspective:

Eberhardt et al. showed, under assumptions, that if any number of variables can be simultaneously and independently randomized in any one experiment, then log2(N) + 1 experiments are sufficient, and in the worst case necessary, to determine the causal relations among N ≥ 2 variables when no latent variables, sample selection bias, or feedback cycles are present [51]. Bounds are also provided for the case where experimenters cannot intervene on more than K variables simultaneously. These results point to fundamental limitations of RCTs and biological experiments conducted with a small number of variables manipulated at a time, and are further discussed in chapter “Lessons Learned from Historical Failures, Limitations and Successes of AI/ML In Healthcare and the Health Sciences. Enduring problems, and the role of Best Practices”.
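A quick computation of this sufficient worst-case experiment count (using ⌈log2 N⌉ for N that are not powers of two) illustrates how slowly it grows:

```python
import math

def experiments_needed(n):
    """Worst-case number of experiments sufficient to determine the
    causal relations among n >= 2 variables when any number of variables
    can be simultaneously and independently randomized (per the result
    of Eberhardt et al. discussed above)."""
    return math.ceil(math.log2(n)) + 1

# Randomizing many variables at once is exponentially more efficient
# than one-variable-at-a-time experimentation, which scales with n.
print([experiments_needed(n) for n in (2, 8, 1024)])  # [2, 4, 11]
```

Even 1024 variables need only 11 multi-variable experiments in the worst case, versus on the order of N single-variable ones, which is the contrast drawn in the next paragraph.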

The same researchers showed that, by combining experimental interventions with algorithms for graphical causal models, under familiar assumptions of causal induction and with perfect data, N − 1 experiments suffice to determine the causal relations among N > 2 variables when each experiment randomizes at most one variable [52]. These results require, however, that all variables be simultaneously measured.

Statnikov et al. [47] showed that in TIE distributions (i.e., with multiple equivalent Markov Boundary sets with respect to the response variable T), an algorithm exists that guides experimentation combined with causal discovery from observations, so that at most k single-variable experiments are needed to learn the local causal neighborhood around T where k is the size of the union of all Markov Boundaries of T.

Mayo-Wilson showed [53] that for any collection of variables V, there exist fundamentally different causal theories over V that cannot be distinguished unless all variables are simultaneously measured. Such underdetermination can result from piecemeal measurement, regardless of the quantity and quality of the data.

The same investigator found in [54] that when the underlying causal truth is sufficiently complex, there is a significant possibility that relevant causal facts are lost when integrating the results of many observational studies in a piecemeal manner. Specifically, he shows that as the graph gets large, if the fraction of variables that can be simultaneously measured stays the same, then the proportion of causal facts (including, e.g., which variables mediate which relationships) that can be learned, even with experiments, approaches 0.

Optimal Bayes Classifier

The optimal Bayes Classifier (or OBC for short) is defined by the following formula:

$$ \underset{i}{\operatorname{argmax}}\;\sum \limits_j P\left(T=i\mid {M}_j\right)\cdot P\left({M}_j\mid D\right) $$

Where i indexes the values of the response variable T and j indexes the models in the hypothesis space in which the classifier operates. In plain language, for every model in the hypothesis space the OBC calculates the posterior probability that the model generated the data (i.e., that it is the data-generating function) and, for each value of the response variable, the probability of that value given the model. The per-value predictions are summed over all models, weighted by the probabilities of the models given the data, and the value with the highest weighted sum is the one that the classifier outputs.

Because the hypothesis space can be infinite or intractably large, the calculations involved are also intractable. Also, if we calculate the conditional probabilities using Bayes’ rule, we must deal with the problem of assigning prior probabilities over the members of the model space; with very biased priors, the calculated posteriors will converge slowly to the correct large-sample values. These issues place the application of the OBC outside the realm of the practical. However, it turns out that the error of the OBC is optimal in the large sample. Hence the OBC is a valuable analysis tool: we can assess the errors of various learning algorithms by comparing them to the OBC error (as we will see in chapter “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare and Health Science”) [29, 32].
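The formula can be instantiated directly on a deliberately tiny, invented hypothesis space; in realistic spaces this enumeration is exactly what makes the OBC impractical:

```python
def optimal_bayes_classify(values, models):
    """models: list of (P(Mj|D), {value: P(T=value|Mj)}) pairs.
    Implements argmax_i sum_j P(T=i | Mj) * P(Mj | D)."""
    def weighted_vote(i):
        return sum(p_model * p_t[i] for p_model, p_t in models)
    return max(values, key=weighted_vote)

# Three hypothetical models with invented posteriors P(Mj|D) = 0.5, 0.3, 0.2
# and invented per-value predictive probabilities P(T=i|Mj).
models = [(0.5, {"+": 0.6, "-": 0.4}),
          (0.3, {"+": 0.2, "-": 0.8}),
          (0.2, {"+": 0.9, "-": 0.1})]
# Weighted sums: "+" = 0.5*0.6 + 0.3*0.2 + 0.2*0.9 = 0.54 ; "-" = 0.46
print(optimal_bayes_classify(["+", "-"], models))  # +
```

Note that the most probable single model here votes “+” only weakly; the OBC output is the model-averaged prediction, which is what gives it its large-sample optimality.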

No Free Lunch Theorems (NFLTs)

NFLTs are a general class of theorems applying to optimization, search, machine learning, and clustering (in the last case referred to as the Ugly Duckling Theorem, or UDT for short) [32, 42].

The crux of these theorems is that, under a set of conditions intended to describe all possible application distributions, there is no universally preferred algorithm; by implication, the right algorithm should be chosen for the right task, since no algorithm dominates irrespective of task. This interpretation is commonsensical and useful. It restates, in different terms, the notion that an inductive bias well matched to the problem at hand will lead to better solutions.

This is especially important for clustering algorithms and the UDT. The UDT entails that in the absence of external information, there is no reason to consider two patterns P1 and P2 more or less similar to each other than P1 and P3 are. Over all possible functions associated with such patterns (and the features that define them), any grouping is as good as any other. This implies that the similarity/distance functions that define the behavior of clustering algorithms must be tailored to specific use contexts of the resulting clusters (which in turn entails a restriction on the class of functions modeled).

The problems with common use of clustering are three-fold: (a) Per the UDT, clustering by algorithm X is as good as random clustering over all possible uses of the clusters. Unless we select or construct a distance/similarity function designed to solve the specific problem at hand, clustering will not provide any useful information. (b) There is no useful unbiased clustering. Researchers who present clustering results as “unbiased” (meaning “hypothesis free” - a practice very common in modern biology research and literature) fail to realize that any practical clustering algorithm has an inductive bias implemented as a distance function and a grouping/search strategy. And as we saw, in the absence of a (well-chosen) inductive bias little can be accomplished. (c) Finally, clustering should not be used for predictive modeling [55]. Clustering algorithms can only learn something about a classification problem, e.g. of response T as T+ or T-, if and only if we design a similarity function that distinguishes between T+ and T- and embed it in the clustering algorithm. But this function is precisely a predictive modeling classifier, rendering the whole clustering-for-prediction endeavor redundant.
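The dependence of clustering output on the chosen similarity function can be illustrated with a minimal sketch. The patients, features, and greedy 2-way grouping below are hypothetical illustrations (not a recommended algorithm); the point is only that changing the distance function changes the clusters:

```python
def cluster_two_way(items, dist):
    """Toy 2-way split: seed with the farthest pair, then assign every
    item to the nearer seed. A stand-in for a real clustering algorithm."""
    pairs = [(dist(a, b), a, b) for i, a in enumerate(items)
             for b in items[i + 1:]]
    _, s1, s2 = max(pairs)
    c1 = frozenset(x for x in items if dist(x, s1) <= dist(x, s2))
    c2 = frozenset(set(items) - c1)
    return {c1, c2}

# Hypothetical patients: (age, biomarker level)
patients = [(8, 90), (12, 10), (70, 95), (75, 12)]

# Same algorithm, two different similarity functions:
by_age = cluster_two_way(patients, lambda a, b: abs(a[0] - b[0]))
by_marker = cluster_two_way(patients, lambda a, b: abs(a[1] - b[1]))
```

The age-based distance groups young vs. old patients, while the biomarker-based distance groups high vs. low biomarker patients: the “right” clustering is entirely determined by the intended use, exactly as the UDT predicts.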

In chapter “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare and Health Science” we give examples of the above as well as recommendations for goal-specific clustering.

Pitfall 2.8

Probably more than any other theoretical result, the NFLT for ML runs the largest risk of being misunderstood and misapplied.

In summary form, the NFLT for ML states that all learning methods have on average the same performance over all possible applications, as a mathematical consequence of three conditions:

  (a) Algorithm performance will be judged over all theoretically possible target functions that can conceivably generate data.

  (b) The prior over these target functions is uniform.

  (c) Off Training Set Error (OTSE) will be used to judge performance [32, 42].

This result has been misinterpreted to suggest that we could use models that have low instead of high accuracy according to unbiased error estimators and do as well as when choosing the high accuracy models. In this (mis)interpretation, random classification is as good overall as classification using sophisticated analytics and modeling. The mathematics of the NFLT derivation are impeccable, but the results are problematic because of the flaws of the three underlying assumptions:

  (a) In real life, only a tiny subset of the infinitely many possible data generating functions actually generates the data. Nature is highly selective in its distributions.

  (b) The prior distribution over these data generating functions is highly skewed.

  Taken together, assumptions (a) and (b) of the NFLT are mathematically equivalent to a label random reshuffle procedure (see chapter “The Development Process and Lifecycle of Clinical Grade and Other Safety and Performance-Sensitive AI/ML Models”). This procedure distorts the relationship between inputs and response variable and creates a distribution of target functions that on average have zero signal. Because on average there is no signal, this target function space is on average random and thus unpredictable. Therefore no learning algorithm exists that can do better than random, and the NFLT, naturally, finds and states as much. If such an algorithm existed then the distribution would be predictable and thus non-random. The NFLT for ML then just says that when there is no signal (on average), every algorithm will fail (i.e., will be as good as the random decision rule on average) and thus all algorithms will be equally useless (on average).

  (c) In addition, because OTSE excludes the input patterns that the algorithm has seen during training, an artificially low (downward-biased) performance estimate is obtained for future applications. By contrast, statistical theory and all branches of statistical ML and of science adopt for purposes of validation the Off Sample Error (OSE), which is simply the error on a random sample from the data generating function.
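The label reshuffle argument can also be seen empirically. In the hypothetical sketch below, a simple 1-nearest-neighbour learner achieves high accuracy when a real signal is present, and only chance-level accuracy on average once the labels are randomly reshuffled (the problem, learner, and sample sizes are illustrative assumptions):

```python
import random

random.seed(0)

# Hypothetical binary problem with a real signal: t = 1 iff x > 0.
xs = [random.uniform(-1, 1) for _ in range(200)]
ts = [1 if x > 0 else 0 for x in xs]

def one_nn_accuracy(train, test):
    """Accuracy of a 1-nearest-neighbour classifier on (x, t) pairs."""
    correct = 0
    for x, t in test:
        nearest = min(train, key=lambda p: abs(p[0] - x))
        correct += (nearest[1] == t)
    return correct / len(test)

pairs = list(zip(xs, ts))
train, test = pairs[:100], pairs[100:]
acc_signal = one_nn_accuracy(train, test)

# Average accuracy over random label reshuffles: the signal is destroyed,
# so the learner is reduced to chance (~0.5) on average.
accs = []
for _ in range(50):
    shuffled = ts[:]
    random.shuffle(shuffled)
    p = list(zip(xs, shuffled))
    accs.append(one_nn_accuracy(p[:100], p[100:]))
acc_shuffled = sum(accs) / len(accs)
```

The same behavior would be observed for any learner: averaging over reshuffled (zero-signal) target functions is precisely the setting in which the NFLT equalizes all algorithms.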

In the present volume we specifically discussed at some length the dangers in over-interpreting the NFLT for ML, because of published claims that the theorem somehow entails that choosing the models with best cross validation error (or best independent validation error, or best reproducibility of error) is just as good as choosing the model with worst reproducibility or independent validation error [42].

Best Practice 2.9

Cross validation and independent data validation, as well as their cousin reproducibility, are robust pillars of good science and good ML practice and are not, in reality, challenged by the NFLT.

Best Practice 2.10

Clustering should not be used for predictive modeling.

Best Practice 2.11

A very useful form of clustering is post-hoc extraction of subtypes from accurate predictor models.

Universal Function Approximators (UFAs) and Analysis of Expressive Power

UFAs are ML algorithms that can represent any function that may have generated the data. UFA theorems establish that certain ML algorithms have this capability [29].

Pitfall 2.9

If a ML algorithm cannot represent a function class, this outright shows the inability or sub-optimality of this algorithm to solve problems that depend on modeling a data generating function that is not expressible in that algorithm’s modeling language.

For example, clustering (i.e., grouping objects or variables into groups according to similarity or distance criteria) does not have the expressive power to represent the causal relationships among a set of variables or entities. Thus the whole family of clustering algorithms is immediately unsuitable for learning causal relationships. Similarly, simple perceptron ANNs cannot represent non-linear decision functions and that places numerous practical modeling goals and applications outside their reach.

By contrast, Decision Trees can represent any function over discrete variables. Similarly, ANNs can represent any function (discrete or continuous) to arbitrary accuracy by a network with at least three layers [29]. BNs can represent any joint probability distribution, as we show in the present chapter. AI search can be set up to operate on model languages that are sufficiently expressive to represent any function as well. Genetic Algorithms, being essentially search procedures, share this property.
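The expressiveness contrast can be made concrete with the classic XOR function: no linear (perceptron-style) decision function represents it, while a depth-2 decision tree represents it exactly. In this illustrative sketch the grid search over linear weights merely exemplifies the known impossibility theorem (a finite search cannot prove it):

```python
from itertools import product

# Target function outside the linear model language: XOR.
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def linear_classifier(w1, w2, b):
    """Perceptron-style decision function: 1 iff w1*x1 + w2*x2 + b > 0."""
    return lambda x: 1 if w1 * x[0] + w2 * x[1] + b > 0 else 0

# No linear decision function over a coarse weight grid matches XOR on all
# four inputs (and, by theorem, none exists at all).
grid = [i / 2 for i in range(-4, 5)]  # -2.0 ... 2.0 in steps of 0.5
linear_can_fit = any(
    all(linear_classifier(w1, w2, b)(x) == t for x, t in XOR.items())
    for w1, w2, b in product(grid, repeat=3)
)

# A depth-2 decision tree (split on x1, then on x2) represents XOR exactly.
def tree(x):
    if x[0] == 0:
        return 0 if x[1] == 0 else 1
    return 1 if x[1] == 0 else 0

tree_fits = all(tree(x) == t for x, t in XOR.items())
```

A modeling goal that depends on XOR-like interactions is thus simply outside the reach of linear decision functions, regardless of how the parameters are fitted.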

Pitfall 2.10

UFA theorems should not be over-interpreted. While it is comforting that e.g., algorithm A can represent any function in the function family F (i.e., the model space and corresponding inductive bias are expressive enough), learning also requires effective (space- and time-tractable, sample efficient, non-overfitting etc.) model search and evaluation in that space.

For example, Decision Trees (DTs) do not have practical procedures to search and learn every function in the model space expressible as a DT, since practical (tractable) DT induction involves highly incomplete search of the hypothesis space. Similarly, ANNs can represent any function; however, the number of units needed and the training time can be intractable, and the procedures used to search the space of ANN parameters are not guaranteed to find the right parameter values.

Generative vs. Discriminative Models

Generative models are typically considered the ones that can model the full joint distribution of the variables in an application domain. Discriminative models, by contrast, only model a decision function that is sufficient for the problem of interest in that domain. Consider as an example the SVM hyperplane model in Fig. 4. This model solves the stated diagnostic problem perfectly without modeling the probability distribution of the variables involved.

The optimal choice of generative vs. discriminative model entirely depends on the application domain. For example, for general predictive modeling as well as other pattern recognition tasks such as text categorization, the use of discriminative models confers practical advantages and, in many datasets, better performing models than generative models. For causal modeling, simulation, natural language understanding, density estimation, or language generation, generative models are necessary or advantageous. We also wish to clarify a terminology confusion (especially in the non-technical literature) between generative modeling at large vs. “Generative AI”. The latter refers to a small number of specific classes of algorithms that generate data (e.g., Generative Adversarial Networks, Large Language Models) with established or unknown properties. Generative modeling, on the other hand, includes all methods that model the data generating distribution, and in typical usage the term refers to algorithms that have guarantees for correct modeling of the data generating distribution (e.g., BNs, Logistic Regression, density estimation algorithms).
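The distinction can be illustrated with a minimal sketch over a hypothetical two-variable binary domain: the generative model carries the full joint distribution and can answer queries (e.g., marginals for simulation) that a purely discriminative decision function cannot. The joint table and variable names below are illustrative assumptions:

```python
# Hypothetical binary domain: symptom x in {0,1}, disease t in {0,1}.
# Generative model: the full joint distribution P(x, t) (sums to 1).
joint = {(0, 0): 0.40, (0, 1): 0.05, (1, 0): 0.15, (1, 1): 0.40}

# A discriminative model only needs the decision function t = g(x);
# here it is derived once from the joint, for comparison.
def discriminative(x):
    return 1 if joint[(x, 1)] > joint[(x, 0)] else 0

# The generative model supports additional queries that a bare decision
# function cannot answer, e.g. the marginal P(x = 1), usable for simulation.
p_x1 = joint[(1, 0)] + joint[(1, 1)]
```

Both models output the same decisions here, but the discriminative one stores strictly less information; that economy is exactly its efficiency advantage, and also its limitation for simulation and density-estimation uses.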

Best Practice 2.12

The choice of generative vs. discriminative modeling affects quality of modeling results and has to be carefully tied to the problem domain characteristics. All else being equal discriminative models confer efficiency (computational and sample) advantages.

Bias-Variance Decomposition of Model Error (BVDE)

The concept of BVDE is one that originates from statistical machine learning but has broad applicability across ML and all of data science. It is pervasively useful, yet not as widely known as it deserves among non-technical audiences, so we will present it here in some detail. A detailed treatment can be found in [56]. The core idea of BVDE is that, apart from noise in measurements (which is intrinsic to the data and independent of modeling decisions) and inherent stochasticity of the data generating function (which is intrinsic to the data generating process and independent of modeling decisions), the remaining modeling error of any ML (or, for that matter, any statistical or quantitative data science) model has two components: one due to the mismatch of the inductive bias with the problem and data at hand, and another due to sample variation in small sample settings.

In the terminology of BVDE, the error due to inductive bias mismatch is referred to as “bias”, with “high bias” indicating a severe mismatch toward simplicity (aka small complexity, or small capacity) of the model language (and related search and fitting procedures) with respect to the data generating function. The error due to sampling variation is referred to as “variance”, with variance increasing as sample size decreases. More precisely, the bias is the error (in the large sample limit, relative to the data generating function) of the best possible model that can be fitted within the class of models considered. The variance is the error of the best model fitted in the small sample (within the class of models considered) relative to the error of the best model that can be fitted in the large sample within that class. The bias, then, is a function of the learner used and the data generating function; the variance, for a fixed data generating function, is a function of the learner and the sample size.

Implications for modeling: When the modeling bias is fixed, one can reduce total model error by increasing the sample size (reducing variance), and when the sample is fixed, one can reduce total error by optimizing the bias. More importantly, when both sources of error are under analyst control, BVDE explains that there is an ideal point of balance between error due to bias and error due to variance. The optimal error will be found when these two sources of error are balanced for a particular modeling setting. Moreover, high-bias models have smaller variance (i.e., are more stable in small samples) but on average over many samples will approximate the target function worse. Low-bias models have higher variance, hence are unstable in small samples, but on average(!) approximate the target function better. We now delve into BVDE with a concrete example.

Figure 5 depicts a two-dimensional data set, based on one input variable x, plotted along the horizontal axis and response y along the vertical. The black points represent the observed data points and the blue line f depicts the true generating function, which in this example is:

$$ f(x)=\begin{cases} x^2 & \text{when } x<-3\\ 81-x^2 & \text{when } -3\le x\le 3\\ x^2 & \text{when } x>3 \end{cases} $$

The black lines represent quadratic models fit to random samples from this data.

Fig. 5

Illustration of the bias-variance tradeoff. The x-axis shows models’ input values and the y-axis is the response. The training data observations are the black points, and the true data generating function is depicted in blue. The two black curves represent two models

Consider the expected generalization error from this procedure when predicting the point at (say) x = 0. The error has three components. First, the observed dependent variable can differ from the true value of the generating function; that is measurement noise. Visually, noise is the difference between the black points and the blue line. The second component, variance, is the variability of the model due to the specific sample, which is visually represented by the spread of the predictions from the different models (black lines) at x = 0. In this example, with only two models, this spread ranges from 35 to 45. Finally, the third component is bias, which is the difference between the expectation of the prediction (expectation taken over the different models built on the different samples) and the true value of the dependent variable (the blue line at x = 0). In this example, the expectation of the prediction from the different models appears to be approximately 40, while the true value is 81.

The generalization error expressed as MSE at any x can be written as:

$$ \mathbf{E}\left[{\left(y-\hat{f}(x)\right)}^2\right]=\mathbf{E}\left[{\left(y-f(x)\right)}^2\right]+{\left(f(x)-\mathbf{E}\left[\hat{f}(x)\right]\right)}^2+\mathbf{E}\left[{\left(\hat{f}(x)-\mathbf{E}\left[\hat{f}(x)\right]\right)}^2\right] $$

The three terms correspond to noise, bias and variance of \( \hat{f} \).
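The decomposition can be checked numerically with a Monte Carlo sketch. The data generating function, noise level, and the deliberately high-bias model (a constant predictor equal to the training mean) are illustrative assumptions:

```python
import random

random.seed(1)
SIGMA = 0.5           # noise standard deviation (assumed)
N, TRIALS = 30, 4000  # training-set size and number of training sets

def f(x):
    """Illustrative true data generating function."""
    return x * x

def draw_training_set():
    xs = [random.uniform(-1, 1) for _ in range(N)]
    return [(x, f(x) + random.gauss(0, SIGMA)) for x in xs]

x0 = 1.0
preds, sq_errors = [], []
for _ in range(TRIALS):
    train = draw_training_set()
    f_hat = sum(y for _, y in train) / N      # high-bias constant model
    preds.append(f_hat)
    y_new = f(x0) + random.gauss(0, SIGMA)    # fresh observation at x0
    sq_errors.append((y_new - f_hat) ** 2)

# Empirical generalization MSE at x0 and its three components.
mse = sum(sq_errors) / TRIALS
mean_pred = sum(preds) / TRIALS
noise = SIGMA ** 2
bias_sq = (f(x0) - mean_pred) ** 2
variance = sum((p - mean_pred) ** 2 for p in preds) / TRIALS
```

Up to Monte Carlo error, `mse` equals `noise + bias_sq + variance`; for this deliberately simple model class the squared bias dominates the variance, mirroring the underfitting end of Fig. 6.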

Figure 6 shows the bias (orange), variance (blue) and mean squared error (MSE) (gray) of models of increasing complexity on a test set. Model complexity is controlled by the degree of x the model is allowed to use and how far the optimizer can optimize the training MSE. Complexity increases from left to right. As the model complexity increases, variance increases while bias decreases. For improved readability, bias, variance and MSE are scaled to the same range in the figure. MSE is thus a weighted sum of bias (squared) and variance. The optimal fit is achieved where MSE is minimal (in the middle of the complexity range). Lower complexity leads to underfitting, which is characterized by lower variance and higher bias, while increased complexity leads to overfitting, which is characterized by higher variance and lower bias (compared to the bias and variance at the optimal complexity).

Fig. 6

The relationship between the bias-variance components and complexity in the example (fixed noise is considered). The horizontal axis is complexity (of quadratic models) and the vertical axis is the various bias/variance components scaled to the same range. Orange corresponds to bias, blue to variance and gray is total MSE

These concepts are critical in helping analysts create models with maximum predictivity (see chapter “Overfitting, Underfitting and General Model Overconfidence and Under-Performance Pitfalls and Best Practices in Machine Learning and AI”).

Essential Concepts of Mathematical Statistics Applicable to ML

Mathematical Statistics is the subfield of Statistics that studies the theoretical foundations of statistics. At the same time, many of the concepts and tools of mathematical statistics are useful across data science broadly and for ML more specifically. Table 9 provides examples of important areas and analytical tools developed within this field that have value for understanding, advancing, and practicing ML [58].

Table 9 Key concepts and areas of mathematical statistics useful for ML

Techniques and results from mathematical statistics (and its applications) are present throughout the chapters of the present volume.

Conclusions

The successful design of problem-solving AI/ML models and systems can be guided by, and evaluated according to, well-specified technical properties. Systems that lack such properties are pre-scientific and used heuristically, whereas systems with well-established properties and guarantees provide more solid ground for reliably solving health science and healthcare problems. The nature of the properties listed matches well the practical applications of AI/ML. Properties disconnected from practical implications are not a subject of study in the present volume.

AI has both symbolic and non-symbolic methods as well as hybrid variants. Foundational methods in the symbolic category are logics and logic-based systems such as rule based systems, semantic networks, planning systems, NLP parsers and certain AI programming languages. In the non-symbolic category, ML, probabilistic, connectionist, decision-theoretic formalisms, systems and languages dominate.

Among health AI methods capable of reasoning with uncertainty, Bayes Nets and Decision Analysis stand out for their ability to address a variety of use cases and problem classes.

A major distinction is between shallow systems that are essentially equivalent to a function relating outputs to inputs (e.g., most of ML predictive modeling), and systems with rich ontologies and elaborate models of the physical world in the application domain. Non-symbolic AI systems tend to fall in the former category, whereas symbolic ones fall in the latter.

The framework of AI search is especially powerful both as a problem-solving technology and as an analytical tool that helps us understand and architect successful methods and systems. AI search cuts across the symbolic vs. non-symbolic, shallow vs. rich, and data-driven (ML) vs. knowledge-driven distinctions.

ML has solid and extensive theoretical foundations that include: Computational Learning Theory, ML as AI search, ML as geometrical construction and function optimization, COLT (PAC learning, VC dimension), Theory of feature selection, Theory of causal discovery, Optimal Bayes Classifier, No Free Lunch Theorems, Universal Function Approximation, Generative vs. Discriminative models; Bias-Variance Decomposition of error, and an extensive set of tools borrowed from the field of mathematical statistics.

Key Concepts and Messages Chapter “Foundations and Properties of AI/ML Systems”

  • The critical importance of knowing or deriving the properties of AI/ML models and systems.

  • The main technical properties of AI/ML systems.

  • Tractable vs. intractable problems and computer solutions to them.

  • The various forms of Logic-based (symbolic) AI.

  • Non-symbolic AI, reasoning under uncertainty and its primary formalisms.

  • AI search.

  • Foundations of Machine Learning Theory

Pitfalls and Best Practices Chapter “Foundations and Properties of AI/ML Systems”

Pitfall 2.1. From a rigorous science point of view an AI/ML algorithm, program or system with intractable complexity does not constitute a viable solution to the corresponding problem.

Pitfall 2.2. Parallelization cannot make an intractable problem, algorithm or program practical.

Pitfall 2.3. Moore’s law improvements to computing power cannot make an intractable problem, algorithm, or hard program practical.

Pitfall 2.4. Believing that heuristic systems can give “something for nothing” and that they have capabilities that surpass those of formal systems. In reality heuristic systems are pre-scientific or in early development stages.

Pitfall 2.5 in Decision Analysis (DA) and Maximum Expected Utility (MEU)-based reasoning

  1. Errors in the estimation of probabilities for various events.

  2. Errors in eliciting utility estimates in a way that captures patients’ true preferences (including using the care providers’ utilities rather than the patients’).

  3. The structure or complexity of the problem setting defies the analyst’s ability to completely/accurately describe it.

  4. Developing a DA for one population and applying it in another with a different problem structure, different probabilities for action-dependent and action-independent events, or different preferences.

Pitfall 2.6. Using the wrong inductive bias for the ML solution to the problem at hand.

Pitfall 2.7. Ignoring the fit of the data generating procedure with the ML solution to the problem at hand.

Pitfall 2.8. Probably more than any other theoretical result, the NFLT for ML runs the largest risk of being misunderstood and misapplied.

Pitfall 2.9. If a ML algorithm cannot represent a function class, this outright shows the inability or sub-optimality of this algorithm to solve problems that depend on modeling a data generating function that is not expressible in that algorithm’s modeling language.

Pitfall 2.10. UFA theorems should not be over-interpreted. While it is comforting that e.g., algorithm A can represent any function in the function family F (i.e., the model space and corresponding inductive bias are expressive enough), learning also requires effective (space- and time-tractable, sample efficient, non-overfitting etc.) model search and evaluation in that space.

Best Practices Discussed in Chapter “Foundations and Properties of AI/ML Systems”

Best Practice 2.1. Pursue development of AI/ML algorithms, programs or systems that have tractable complexity.

Best Practice 2.2. Do not rely on parallelization to make intractable problems tractable. Pursue tractable algorithms and factor in the tractability analysis any parallelization.

Best Practice 2.3. Do not rely on Moore’s law improvements to make an intractable problem, algorithm, or hard program practical. Pursue tractable algorithms and factor into the tractability analysis any gains from Moore’s law.

Best Practice 2.4. When faced with intractable problems, consider using strategies for mitigating the computational intractability by trading off with less important characteristics of the desired solution.

Best Practice 2.5. As much as possible, use models and systems with formal and established properties (theoretical + empirical). Work within the maturation process starting from systems with unknown behaviors and no guarantees, to systems with guaranteed properties.

Best Practice 2.6. Decision Analysis (DA) and Maximum Expected Utility (MEU)-based reasoning

  1. Ensure that the structure of the problem setting is sufficiently/accurately described by the DA tree. Omit known or obviously irrelevant factors.

  2. Elicit utility estimates in a way that captures patients’ true preferences, using established utility-elicitation methods.

  3. Accurately estimate probabilities of action-dependent events and action-independent events.

  4. In most conditions, and whenever applicable, data-driven approaches should be preferred to subjective probability estimates. Use probability-consistent statistical or ML algorithms to estimate the probabilities.

  5. Ensure that the decision analysis is applied to the correct population.

  6. Conduct sensitivity analyses that reveal how much the estimated optimal decision is influenced by uncertainty in the specification of the model.

  7. Whenever possible, produce credible intervals/posterior probability distributions for the utility expectations of decisions.

Best Practice 2.7. Pursue ML solutions with the right inductive bias for the problem at hand.

Best Practice 2.8. Create a data generating or design procedure that matches well the requirements of the problem at hand and works with the inductive bias to achieve strong results.

Best Practice 2.9. Cross validation and independent data validation, as well as their cousin reproducibility, are robust pillars of good science and good ML practice and are not in reality challenged by the NFLT.

Best Practice 2.10. Clustering should not be used for predictive modeling.

Best Practice 2.11. A very useful form of clustering is post-hoc extraction of subtypes from accurate predictor models.

Best Practice 2.12. The choice of generative vs. discriminative modeling affects quality of modeling results and has to be carefully tied to the problem domain characteristics. All else being equal discriminative models confer efficiency (computational and sample) advantages.

Discussion Topics and Assignments, Chapter “Foundations and Properties of AI/ML Systems”

  1. Revisit questions (10–13) of chapter “Artificial Intelligence (AI) and Machine Learning (ML) for Healthcare and Health Sciences: the Need for Best Practices Enabling Trust in AI and ML” from the perspective of which properties of the proposed systems are known.

  2. Use Table 3 to classify the proposed systems in questions (10–13) of chapter “Artificial Intelligence (AI) and Machine Learning (ML) for Healthcare and Health Sciences: the Need for Best Practices Enabling Trust in AI and ML”.

  3. Which of the following are heuristic systems (and in what category of the classification of Table 3 in this chapter)?

     (a) INTERNIST-I

     (b) MYCIN

     (c) QMR-BN

     (d) A classical regression model for which we do not know if the data is normally distributed. Compare to a classical regression model for which we know that the data is not normally distributed.

     (e) A Large Language Model implementing an EHR “ChatBot” tool answering queries about the patients’ medical history.

     (f) IBM Watson Health

  4. Based on your findings in question 3, how would you go about next steps toward putting these systems into practice from a perspective of accuracy and safety?

  5. Discuss: are BNs deep or shallow representations?

  6. Consider a population with age distribution as in the table below:

     Age →  0–10  11–20  21–30  31–40  41–50  51–60  61–70  >70
     % →      20     20     10     10     10     15     10     5

     (a) What would be a good 2-way clustering (grouping) of individuals in this population?

     (b) For a pediatrician: what would be a good 2-way clustering (grouping) of individuals in this population?

     (c) For a gerontologist: what would be a good 2-way clustering (grouping) of individuals in this population?

     (d) For an obstetrician: what would be a good 2-way clustering (grouping) of individuals in this population?

     (e) What can you conclude about the value of a priori clustering without any reference to use of the produced groups?

  7. Occam’s Razor is the epistemological principle that says that given two explanations that fit the data equally well, we should choose the simplest one. Analyze this proposition from a BVDE viewpoint.