Introduction

Previous chapters have discussed methods for using ML to predict outcomes. We will start by illustrating the concepts of causal ML techniques using a hypothetical vignette. Imagine a scenario where we have used predictive methods that estimate that a particular patient, Amy Anonymous, who has been diagnosed with alcohol use disorder (AUD) but is currently abstaining, has a high probability of relapsing. The next step would naturally be to perform interventions with the goal of preventing the relapse. Can ML help identify the best interventions for preventing Amy’s relapse? Causal ML methods can help solve such problems, specifically addressing questions like:

  1. How much would Amy’s chance of relapsing be reduced if Amy receives a specific therapy?

  2. What factors other than therapy, if any, might also help prevent relapse?

  3. What additional complications might ensue if Amy receives a specific therapy?

These are all fundamentally causal questions because they refer to actions that change the usual function of the modeled system, whereas predictive modeling applies only when we model the system’s (the human organism or a healthcare system) behavior in its natural state (without any interventions). Using data to answer them requires causal ML. Using data to answer the first of these questions is a causal inference problem [1], while using data to answer the second and third questions is a causal structure discovery problem [2].

  • Causal inference is the problem of quantifying the effect of specific interventions on specific outcomes.

  • Causal structure discovery is the problem of identifying the causal relationships among a set of variables.

Like regression and classification, these are very broad problems. Numerous solutions within AI/ML and outside these disciplines have been developed for each. Both of these problems can be further complicated by the presence of unmeasured (aka “hidden” or “latent”) variables. Specialized algorithms exist to address such settings [1,2,3,4]. For pedagogical simplicity, we focus in the present chapter mostly on situations where all relevant factors have been measured.

We will next review the core concepts of causal modeling. As we saw in chapter “Foundations and Properties of AI/ML Systems”, any ML method can be conceptualized as search in the space of models appropriate for the problem domain. Causal ML therefore deals with the space of causal models.

Causal Models Versus Predictive Models

Predictive knowledge is associative, capturing co-variation between two or more phenomena. In contrast, causal knowledge is etiological, and captures whether and to what degree the manipulation of one phenomenon results in changes in another. For example, to reduce Amy’s risk for relapse, one needs to manipulate (aka intervene on, or treat) its causes. In contrast, while being in a rehab facility is strongly associated with experiencing the symptoms of a relapse, preventing Amy from entering the rehab facility will not prevent or treat a relapse, since presence in the rehab facility is not a cause of relapse but an effect. These forms of knowledge are distinct, and distinct methods are required to model them.

Causal models must therefore be able to (a) represent cause-effect relationships; (b) answer questions of the type “what will be the effects of manipulating factor X” and “which factors should one manipulate in order to affect X”; and (c) generate data from the model for simulation purposes. In addition, causal models can also answer the usual predictive queries, e.g., “if we observe X, what is the probability of Y?”

The fundamental distinction between predictive and causal ML models is that predictive models inform us about the unperturbed distribution over a set of variables (i.e., patterns that occur “normally”, without any intervention on the factors); whereas causal models inform on what modified patterns and distributions one will obtain when interfering and altering the underlying process that generates the data.

Pitfall 4.1

Popular and successful predictive ML methods are not designed and equipped to satisfy the essential requirements of causal modeling.

While causal ML methods can be used for prediction under no interventions as well as under interventions, predictive ML methods have several practical advantages over causal ML ones when used for prediction under no intervention. Therefore:

Best Practice 4.1

For predictive tasks (i.e., without interventions contemplated) use of Predictive ML should be first priority. For causal tasks (i.e., with interventions contemplated) use of Causal ML should be first priority.

Also as discussed in chapter “Foundations and Properties of AI/ML Systems”, in some cases we need to construct flexible predictive models that can accept queries designating any subset of variables as evidence and other subsets as outcomes of interest, while leaving the remaining variables unspecified. We saw in chapter “Foundations and Properties of AI/ML Systems” how BNs can attain this goal without building a new model for every new query (as the majority of predictive modeling algorithms do; see chapter “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare and Health Science”). The same is true when a mixture of observational and manipulation evidence variables are part of the query (i.e., flexible causal/predictive modeling). Causal Probabilistic Graphs (e.g., Causal BNs) can handle such flexible reasoning.

Properties of Well-Constructed Causal Models

So that the present chapter is self-contained, some of the BN definitions and properties of chapter “Foundations and Properties of AI/ML Systems” are revisited in “Properties of Causal Models” and “Structural Equation Models (SEMs)”. Readers can skip previously encountered material.

First and Foremost Causal Models Must Be Able to Represent the Cause-and-Effect Relationships Among Variables

For causal discovery, this is necessary for answering questions about which variables cause which other variables. For causal inference, this is necessary for removing bias due to confounding. For example, consider a study (Fig. 1) on diet and alcohol abuse that finds that people who eat fast food regularly are more likely to abuse alcohol. This does not mean that preventing a fast food diet will decrease the chance of Amy’s alcohol abuse relapse. Other factors, for example motivation and stress, may lead a person to eat more fast food and may also lead a person to abuse alcohol (thus the relationship between fast food and alcohol abuse is confounded by motivation and stress in the Fig. 1 hypothetical example). Causal models can represent complex systems of relationships involving up to hundreds of thousands of variables with a clear distinction between confounded vs causal associations.

Fig. 1.
[Causal diagram: Motivation → Fast Food (−0.5), Motivation → Adherence (0.7), Stress → Fast Food (0.5), Stress → Days Abstained (−0.7), Group → Adherence (0.8), Adherence → Days Abstained (0.2).]

A highly simplified example of a clinical trial for a hypothetical drug for treating alcohol use disorder (AUD). Group (Treatment) indicates whether participants receive a placebo or the tested treatment, Adherence indicates how often the participant takes the assigned treatment pills, Motivation indicates a measure of the participant’s motivation to have a healthy lifestyle, Fast Food indicates the amount of fast food the participant eats, Stress indicates the stress levels that the participant experiences, and Days Abstained indicates the primary outcome, the number of days the person is able to go without relapsing into problematic alcohol use behaviors. Note: causal ML does not require experimentation, in the form of randomized clinical trials (RCTs) or otherwise, to infer causal relationships, although, as in this example, it can be combined with such data to uncover more fine-grained information about the causal process than the simple average treatment effect that an RCT provides.

Second, Causal Models are Markovian

That is, from the perspective of what are the effects of causal manipulations, the distribution of every variable is entirely determined by its immediate causes [1]. If variable T has direct causes A and B and we manipulate A and B, this will have the maximum possible effect on T. No other simultaneous manipulation of variables in the system will have an effect on T once we manipulate A and B. Stated differently, T is causally “shielded” from all other variables once we manipulate its direct causes.

From an information transfer (and thus predictive) perspective, however, every variable is independent of every non-descendant variable given its immediate causes (a condition known as the Markov Condition [1]). In other words, non-direct causes of T will have no information about what happened to T after we manipulated A and B. Downstream effects of T and spouses of T (i.e., direct causes of the direct effects of T) will have additional information about what happened to T after we manipulated A and B.

It is widely accepted that causation itself is Markovian (in the macroscopic world, whereas exceptions happen in the quantum world [2], luckily with no relevance to health science and care). This is reflected by common practice in medical science, epidemiology, biology etc., where causation is studied in two fundamental ways:

  (a) Causal chains of the type A → B → C

  (b) Common-cause (confounding) structures of the type A ← C → B

For chain (a), common health science and health care intuitions are that: manipulating A will affect B and, through B, will affect C; A, B, and C are all correlated with one another; once we fix B, manipulation of A will stop affecting C; the correlation of A and C will vanish if we observe or fix B; and the correlation of A and B, and of B and C, will not vanish regardless of observing any other variable.

For structure (b), common health science and health care intuitions are that: manipulating A will not affect B and manipulating B will not affect A; A, B, and C are all correlated with one another; once we fix C or observe C, the correlation of A and B will vanish; and the correlation of A with C, and of C with B, will not vanish regardless of observing or fixing any other variable.
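These intuitions can be checked numerically. The following sketch (our illustration, assuming linear relationships with Gaussian noise, which the chapter's argument does not require) simulates a chain A → B → C and a common-cause structure A ← C → B and compares correlations before and after conditioning on the middle variable:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def corr(x, y):
    return np.corrcoef(x, y)[0, 1]

def partial_corr(x, y, z):
    # Correlate the residuals of x and y after regressing each on z (plus intercept).
    Z = np.column_stack([np.ones(len(z)), z])
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return corr(rx, ry)

# (a) Causal chain A -> B -> C
A = rng.normal(size=n)
B = 0.8 * A + rng.normal(size=n)
C = 0.8 * B + rng.normal(size=n)
print("chain: corr(A,C) =", round(corr(A, C), 3),
      "  corr(A,C | B) =", round(partial_corr(A, C, B), 3))    # association vanishes given B

# (b) Common cause A <- C -> B
C2 = rng.normal(size=n)
A2 = 0.8 * C2 + rng.normal(size=n)
B2 = 0.8 * C2 + rng.normal(size=n)
print("fork:  corr(A,B) =", round(corr(A2, B2), 3),
      "  corr(A,B | C) =", round(partial_corr(A2, B2, C2), 3))  # association vanishes given C
```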

Third, Causal Models are Modular [1]

Being Markovian leads to the models being decomposable into smaller parts, where we can study these local regions and their function without reference to remote areas in the causal process.

Fourth, Causal Models Are Manipulatable and Generative

Manipulatable refers to capturing in the model the changes that a physical manipulation makes in the actual part of the world that the model represents. A way to represent a physical world manipulation in the model is to assign the corresponding value to the manipulated variable and eliminate all causal edges to it (i.e., because physical manipulations neutralize all other possible causes).
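As a concrete illustration of this “graph surgery”, the sketch below (our own, using the networkx package and the hypothetical Fig. 1 variable names) represents a manipulation of Adherence by deleting all of its incoming edges while leaving the rest of the graph intact:

```python
import networkx as nx

# DAG of the hypothetical Fig. 1 trial.
g = nx.DiGraph([
    ("Motivation", "FastFood"), ("Motivation", "Adherence"),
    ("Stress", "FastFood"), ("Stress", "DaysAbstained"),
    ("Group", "Adherence"), ("Adherence", "DaysAbstained"),
])

def do(graph, node):
    """Post-manipulation graph: all edges into `node` are removed, because a
    physical manipulation overrides the node's usual causes; the rest of the
    graph is untouched (modularity)."""
    g2 = graph.copy()
    g2.remove_edges_from(list(g2.in_edges(node)))
    return g2

g_do = do(g, "Adherence")
print("Parents of Adherence before do():", sorted(g.predecessors("Adherence")))
print("Parents of Adherence after  do():", sorted(g_do.predecessors("Adherence")))
```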

A generative model is one that encodes the full distribution of the variables at hand so that every probabilistic calculation can be made with the model. This is in sharp distinction with discriminative predictive models, which only seek to encode a small fragment of the full distribution, typically the conditional probability of a response given a fixed set of inputs (and in many cases less than that, for example decision surfaces that encode even less information but manage to predict the response to arbitrary accuracy).

In the next sections, we review the predominant class of causal models.

Causal Probabilistic Graphical Models (CPGMs) Based on Directed Acyclic Graphs (DAGs)

A CPGM comprises (1) a DAG; (2) a joint probability distribution (JPD) over variable set V such that each variable corresponds to a node in the DAG; and (3) a restriction of how the JPD relates to the DAG, known as the Causal Markov Condition (CMC).

  1. A directed graph is a tuple <V,E>, where V is a set of nodes representing variables 1-to-1, and E is a set of directed edges, or arrows, each one of which connects a pair of members of V. A path is a sequence of adjacent edges. A directed path is a path where all edges are pointing in the same direction. A directed graph is a DAG if it has no cycles in it, that is, if there is no directed path that contains the same node more than once. Fig. 1 is an example of a DAG.

  2. The JPD over V is any proper probability distribution (i.e., every possible joint instantiation of variables has a probability associated with it and the sum of those probabilities is 1).

  3. The CMC states that every variable V is independent of all variables that are non-descendants (i.e., not downstream effects) of V given its direct causes.

Causal ML Models Have Well-Defined Formal Causal Semantics

They typically take the form:

  1. A parent set (direct causes) Pa(Vi) of variable Vi, Pa(Vi) = {Pa(Vi)1, Pa(Vi)2, …}, defined for every Vi; and

  2. A probability Pr(Vi = j | Pa(Vi)1 = k, Pa(Vi)2 = l, …), defined for every variable Vi and for all value combinations of Vi and its direct causes.

Together, this representation:

  1. Describes the direct causal relationships among all variables; and

  2. Describes the conditional probability distribution of every variable given its direct causes.

If Vi has parent Pa(Vi)j, this means that in a randomized experiment where one would manipulate Pa(Vi)j, changes would be observed in the distribution of Vi relative to the distribution when Pa(Vi)j is not manipulated, and these changes are entailed by the conditional probability distribution of the variable given its parents.

Unique and Full Joint Distribution Specification in CPGMs

If the fundamental property of the Causal Markov Condition (CMC) holds, then (1) and (2) (the parent sets and the conditional probabilities of each variable given its parents, respectively) together specify a full and unique joint distribution over the variable set V.

Joint Distribution of CPGMs Can Be Factorized Based on Local Causes

This joint distribution is factorized as a product of the conditional probability distribution of every Vi given its direct causes Pa(Vi):

$$ \Pr \left({V}_1,{V}_2,\dots, {V}_n\right)=\prod \limits_i \Pr \left({V}_i \mid Pa{\left({V}_i\right)}_1, Pa{\left({V}_i\right)}_2, \dots, Pa{\left({V}_i\right)}_{m_i}\right), $$
(3)

where Vi has mi parents.
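To illustrate Eq. (3), the following sketch factorizes the joint probability of a small hypothetical binary causal chain (Stress → Fast Food → Relapse) as the product of each variable’s conditional probability given its parents; all probability numbers are made up purely for illustration:

```python
# Hypothetical CPDs for a binary causal chain Stress -> FastFood -> Relapse
# (all numbers are illustrative only).
p_stress = {1: 0.3, 0: 0.7}
p_fastfood = {(1, 1): 0.6, (0, 1): 0.4, (1, 0): 0.2, (0, 0): 0.8}   # key: (fastfood, stress)
p_relapse = {(1, 1): 0.5, (0, 1): 0.5, (1, 0): 0.1, (0, 0): 0.9}    # key: (relapse, fastfood)

def joint(stress, fastfood, relapse):
    """Pr(Stress, FastFood, Relapse) factorized as in Eq. (3): each variable's
    conditional probability given its parents, multiplied together."""
    return (p_stress[stress]
            * p_fastfood[(fastfood, stress)]
            * p_relapse[(relapse, fastfood)])

print(joint(1, 1, 1))   # 0.3 * 0.6 * 0.5 = 0.09
# The factorized probabilities of all joint instantiations sum to 1.
print(sum(joint(s, f, r) for s in (0, 1) for f in (0, 1) for r in (0, 1)))
```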

Inspection of the Causal Graph of a CPGM Informs About Conditional Independencies in the Data

By inspection of the causal graph (and application of an interpretive rule following from the CMC, called d-separation) we can infer a set of conditional independencies in the data, without statistically analyzing the data.

If, furthermore, all dependencies and independencies in the data are exactly the ones following from the CMC, then we speak of faithful distributions encoded by such CPGMs.

A CPGM encoding a faithful distribution entails that all dependencies and independencies in the JPD can be inferred from the DAG. Therefore the DAG becomes a map (a so-called i-map) of dependencies and independencies in the data JPD. Conversely, by inferring dependencies and independencies from the data we can construct the CPGM’s DAG and parameterize the CPDs of every variable given its parents, effectively recovering the causal process that generates the data. This is the fundamental principle of operation of causal ML methods.

Additional Notes

Two nodes are adjacent if they are connected by an edge. Two edges are adjacent if they share a node.

Parents, children, ancestors, and descendants: In a directed graph, if variables X, Y share an edge X → Y then X is called the parent of Y, and Y is called the child of X. In Fig. 1, Motivation is a parent of Adherence, and Adherence is a child of Motivation. If there is a directed path from X to Y then X is an ancestor of Y and Y is a descendant of X.

Degree: The degree of a node is the number of edges connected to it. In a directed graph, this can be further divided into in-degree and out-degree, corresponding to the number of parents (connected edges oriented towards the node) and children (connected edges oriented away from the node) that the node has. For example, in Fig. 1, Group has degree 1, in-degree 0, and out-degree 1. In the same figure, Adherence has degree 3, in-degree 2 and out-degree 1. A node with degree 0 is said to be “disconnected”.

Collider: A collider is a variable receiving incoming edges from two variables. For example, in X → Y ← Z, Y is the collider. A collider is “shielded” if the corresponding parents of the collider are connected by an edge and “unshielded” otherwise. Unshielded colliders form the so-called “v-structure”. In Fig. 1, Fast Food is a collider of Stress and Motivation.

Trek: A trek is a path that contains no colliders. In Fig. 1, the path from Motivation to Days Abstained through Adherence is a trek; the path connecting Fast Food and Days Abstained through Stress is also a trek. However, the path connecting Motivation and Group through Adherence is not a trek, since it contains the collider Adherence.

D-Separation

  1. Two variables X, Y connected by a path are d-separated (aka the path is “blocked”) given a set of variables S, if and only if on this path there is (1) a non-collider variable contained in S, or (2) a collider such that neither it nor any of its descendants are contained in S.

  2. Two variables, X and Y, connected by several paths are d-separated given a set of variables S, if and only if all paths connecting X to Y are d-separated (blocked) by S.

  3. Two disjoint variable sets X and Y are d-separated by variable set S iff every pair <Xi, Yj> is d-separated by S, where Xi and Yj are members of X and Y respectively.
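D-separation queries can be answered algorithmically. The sketch below implements an equivalent formulation of d-separation (the “moralized ancestral graph” criterion) with networkx and checks a few statements in the hypothetical Fig. 1 DAG; this is an illustrative implementation, not an optimized one:

```python
import networkx as nx

def d_separated(dag, xs, ys, zs):
    """True iff every node in xs is d-separated from every node in ys given zs.
    Equivalent criterion: keep only the ancestors of xs, ys, and zs, moralize
    (marry parents of common children, drop directions), delete zs, and check
    that no path connects xs to ys."""
    xs, ys, zs = set(xs), set(ys), set(zs)
    relevant = xs | ys | zs
    for v in list(relevant):
        relevant |= nx.ancestors(dag, v)
    sub = dag.subgraph(relevant)
    moral = nx.Graph(sub.to_undirected())
    for v in sub.nodes:
        parents = list(sub.predecessors(v))
        for i in range(len(parents)):
            for j in range(i + 1, len(parents)):
                moral.add_edge(parents[i], parents[j])
    moral.remove_nodes_from(zs)
    return not any(x in moral and y in moral and nx.has_path(moral, x, y)
                   for x in xs for y in ys)

g = nx.DiGraph([("Motivation", "FastFood"), ("Motivation", "Adherence"),
                ("Stress", "FastFood"), ("Stress", "DaysAbstained"),
                ("Group", "Adherence"), ("Adherence", "DaysAbstained")])

print(d_separated(g, {"Group"}, {"Motivation"}, set()))             # True: collider Adherence blocks the path
print(d_separated(g, {"Group"}, {"Motivation"}, {"Adherence"}))     # False: conditioning on the collider opens it
print(d_separated(g, {"Group"}, {"DaysAbstained"}, {"Adherence"}))  # True: the mediator blocks the causal path
```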

Structural Equation Models (SEMs)

SEMs are causal inference models that can be used after causal relationships have been discovered or when they are known a priori. The left panel of Fig. 2 shows an example causal graph with three variables X, Y, and Z. The three SEM equations show the quantitative functions for each variable given its parents.

Fig. 2
[Left: causal graph with X pointing to Y and to Z; middle: the corresponding joint distribution properties of X, Y, and Z; right: a sample data table of values for X, Y, and Z.]

Correspondence of causal models (left), distribution properties (middle) and data sample (right)

Note that these equations are termed structural equations to emphasize the causal/generative nature of the relationship. SEMs can be continuous, discrete, or mixed, thus extending the definitions of discrete causal models we have seen so far.

The general form of a structural equation for modeling variable x is x = f(Parent(x), ε), where Parent(x) represents the parents of x, and ε is a noise term representing the unexplained variance in x. The variables on the right hand side of the causal structural equations are causes of the variables on the left hand side. This information mirrors the causal graph, where directed edges (→) represent direct causal relationships. The quantitative aspect of the causal relationship (e.g., how much change in Y is expected if X is manipulated from 0 to 1) is represented by the function f. For example, in the structural equation for Y in Fig. 2, Y is a linear function of X with additive noise εY.

In the same example, the expected effect on Y of changing X from 0 to 1 via manipulation of X is 1 (measured in units of Y), which can be computed by E(Y | do(X = 1)) − E(Y | do(X = 0)) = E(1 + εY) − E(0 + εY) = 1. The do(.) notation in the equation represents a manipulation involving the assignment of a value that is enacted regardless of other factors in the model that appear as parents of the manipulated variable. To further clarify the causal vs. associational relationship, consider variables Y and Z. The expected effect on Y of changing Z from 0 to 1 is 0, which can be computed by E(Y | do(Z = 1)) − E(Y | do(Z = 0)) = E(X + εY) − E(X + εY) = 0. Y is not affected by manipulating Z, since Z is not a cause of Y. This is also obvious from the causal graph, since there is no directed path (sequence of variables connected by directed edges pointing in the same direction) leading from Z to Y. However, the values of Y are associated with the values of Z since they are both caused by X.

In summary, when Z is manipulated from value 0 to value 1, Y does not change, even though Y and Z are correlated in observational data without manipulations. The correct causal model explains the observed statistical association between Y and Z as confounded by X and indicates that if we wish to change (e.g., treat) Y, we should manipulate X rather than Z.
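The contrast between the observational association of Y with Z and the null interventional effect can be reproduced in simulation. The sketch below assumes the hypothetical Fig. 2 SEM with unit coefficients and additive Gaussian noise, as in the worked example above:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

def simulate(do_z=None):
    """Draw from the SEM  X = eps_X,  Y = X + eps_Y,  Z = X + eps_Z.
    If do_z is given, Z is set by intervention and its usual cause (X) is ignored."""
    X = rng.normal(size=n)
    Y = X + rng.normal(size=n)
    Z = np.full(n, float(do_z)) if do_z is not None else X + rng.normal(size=n)
    return X, Y, Z

# Observational data: Y and Z are associated because both are caused by X.
X, Y, Z = simulate()
print("observational slope of Y on Z:", round(np.polyfit(Z, Y, 1)[0], 3))   # ~0.5 (confounded)

# Interventional data: manipulating Z does not move Y.
_, Y1, _ = simulate(do_z=1.0)
_, Y0, _ = simulate(do_z=0.0)
print("E[Y|do(Z=1)] - E[Y|do(Z=0)]:", round(Y1.mean() - Y0.mean(), 3))      # ~0
```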

  • The above could be inferred using a CPGM with equivalent results (albeit using probabilities and probabilistic inference rather than structural equations and expectations).

  • The fundamental difference however is that for an SEM to be useful we need to first infer the causal relationships via an external mechanism. By contrast, many algorithms exist that infer a CPGM from data automatically.

Pitfall 4.2

Using SEMs to estimate causal effects with the wrong causal structure.

Going back to Fig. 1, assume that this model represents the causal process correctly and, further, that Motivation and Stress were not measured in this RCT. Also assume that a data analyst is interested in how diet affects relapse rates in patients with AUD and has access to the trial data, which include how much the participants eat fast food. The analyst regresses Days Abstained on Fast Food, and finds that eating more fast food is negatively associated with days abstained. With the goal of ensuring that this association is not related to the RCT design itself, the analyst also regresses Days Abstained against both Fast Food and Group, and again finds that there is a negative association between fast food consumption and days abstained. Excitedly, the analyst goes forth to publicize the findings, and recommends that perhaps limiting the consumption of fast food will reduce relapse rates for AUD patients. Based on these findings, other researchers carry out an RCT to test this theory, but find that (as expected by the true causal model) restricting fast food consumption has no effect on relapse.

The analyst has fallen into a common pitfall: they performed regression analyses without any knowledge of the structure of the underlying causal process, and inappropriately interpreted confounded associations as causal. Even if the analyst only reports that an association was found, and does not make claims of causal effects, such associations carry an implicit promise of potentially valuable causal relationships.

The same problem is encountered when applying any predictive ML method. It is a grave error to apply such non-causal methods when causal results are sought.
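The pitfall can be made concrete with a small simulation. The sketch below generates data from a linear-Gaussian version of the hypothetical Fig. 1 model (using the illustrative coefficients shown in the figure) and reproduces the analyst’s misleading regressions:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Linear-Gaussian data from the hypothetical Fig. 1 structure (illustrative coefficients).
motivation = rng.normal(size=n)
stress = rng.normal(size=n)
group = rng.binomial(1, 0.5, size=n).astype(float)           # randomized treatment assignment
fast_food = -0.5 * motivation + 0.5 * stress + rng.normal(size=n)
adherence = 0.7 * motivation + 0.8 * group + rng.normal(size=n)
days_abstained = 0.2 * adherence - 0.7 * stress + rng.normal(size=n)

def coefs(y, *xs):
    """Least-squares coefficients of y on an intercept plus the given regressors."""
    X = np.column_stack([np.ones(n), *xs])
    return np.linalg.lstsq(X, y, rcond=None)[0][1:]

# The analyst's regressions: Fast Food shows a negative coefficient on Days Abstained...
print(coefs(days_abstained, fast_food))         # spurious negative association
print(coefs(days_abstained, fast_food, group))  # still negative after adding Group
# ...yet the true causal effect of Fast Food on Days Abstained is exactly zero:
# the association is driven entirely by the unmeasured confounders Motivation and Stress.
```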

Consider the following variant of the above scenario where Group (treatment) is not manipulated but observed. Furthermore, consider that the analyst, in an effort to eliminate confounding bias, models Adherence as a confounder (on the grounds that it correlates with both Group and Days Abstained). In such a scenario the estimated causal effect of Group will be falsely zero, because the plausible “confounder” is, in reality, on the causal path between Group and the outcome.

Pitfall 4.3

Using regression to estimate causal effects without knowing the true causal structure (and making assumptions about which are the true measured confounders).

Causal Effect Estimation

In this section we make extensive use of d-separation to infer whether variables are dependent or independent given other variables (also assuming a faithful distribution encoded by the CPGM). Readers not proficient in application of d-separation in faithful DAGs are advised to simply use the provided dependence/independence statements.

Further elaborating on causal inference, often we wish to estimate the quantitative causal influence that a variable C has on variable T for a manipulation that causes a 1 unit change of C. The key to obtaining unbiased causal effect estimates from observational data is to partition the total (bivariate, i.e., marginal) statistical association between pairs of variables into the components of that total association due to causal vs. confounded relationships across all connecting paths. In principle, this is relatively straightforward if we know the true causal structure governing the variables we observe. For example, consider estimating the causal effect of C on T using data generated from the causal system depicted in Fig. 3.

Fig. 3
[Three directed graphs over nodes A to L, with T highlighted: the true causal structure around T, the association (correlation) network, and the causal structure discovery output.]

Illustration of the local causal structure around a variable T

In the true causal graph, there are two paths contributing to the overall observed association between C and T: path 1: C → T, and path 2: C ← A → D → T. The first path is a causal path, and when one changes the value of C, the value of T will change as a result through this path. The second path is a confounding path, since a change in C cannot causally propagate through this path to affect T. In other words, when we compute the total marginal (aka univariate) association between C and T, the resulting value reflects the combined contributions from the causal effect C → T and the confounding C ← A → D → T.

In order to estimate the causal effect of C on T, we need to eliminate the component of association due to the confounding path. This can be easily achieved by estimating the relationship between T and C, using A as a covariate in a regression model, or more generally controlling for the associational effect of A. Assuming for example linear functions with Gaussian noise, the causal effect of C on T can be estimated by fitting the linear regression T = βc ∗ C + βA ∗ A + ε.

The estimated coefficient for C then will be the unbiased causal effect estimate of the causal influence of C on T. The reason that this regression model can result in unbiased estimates of causal effects is because adding A to the regression model (i.e. conditioning on A) blocks the confounding path C ← A → D → T and therefore removes the association component between C and T due to the confounding (path 2).

Conditioning on the wrong variables will likely result in biased effect estimation. For example, conditioning on G by fitting the regression model T = βc ∗ C + βg ∗ G + ε results in biased effect estimation, since conditioning on G, which is a collider on the path C → G ← D → T, opens the path and introduces additional association between C and T due to this non-causal path (plus the confounding through A is not controlled).

In practice there are numerous choices for variables to condition on, especially in high dimensional data and in domains with poor prior knowledge. For example, as an alternative to A, conditioning on D by fitting the regression T = βc ∗ C + βd ∗ D + ε will also result in unbiased effect estimation, since it also blocks the C ← A → D → T path and does not open any additional paths (the C ← A → D ← B → I ← H ← E → T path is still blocked when conditioning on D, due to the presence of I as a collider).
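The effect of these conditioning choices can be verified in simulation. The sketch below uses a linear-Gaussian model consistent with the Fig. 3 fragment discussed here (paths C → T, C ← A → D → T, and the collider C → G ← D; all coefficients are made up) and compares the regression estimate of the effect of C on T under different adjustment sets:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500_000

# Linear-Gaussian model consistent with the Fig. 3 fragment (made-up coefficients).
A = rng.normal(size=n)
C = 0.8 * A + rng.normal(size=n)
D = 0.8 * A + rng.normal(size=n)
T = 1.0 * C + 0.8 * D + rng.normal(size=n)   # true causal effect of C on T is 1.0
G = 0.8 * C + 0.8 * D + rng.normal(size=n)   # G is a collider between C and D

def effect_of_C(*covariates):
    """Coefficient of C when regressing T on C plus the given covariates."""
    X = np.column_stack([np.ones(n), C, *covariates])
    return round(np.linalg.lstsq(X, T, rcond=None)[0][1], 3)

print("no adjustment:        ", effect_of_C())    # biased upward: confounding via A -> D -> T
print("adjust for A:         ", effect_of_C(A))   # ~1.0, unbiased (blocks C <- A -> D -> T)
print("adjust for D:         ", effect_of_C(D))   # ~1.0, unbiased (also blocks the confounding path)
print("adjust for collider G:", effect_of_C(G))   # biased: opens C -> G <- D -> T (and A is uncontrolled)
```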

As illustrated above, the principle for achieving unbiased causal effect estimation from observational data is to ensure that only the true causal paths are open (between the pair of variables under consideration) given the variables controlled by conditioning.

Best Practice 4.2

In order to estimate unbiased causal effects, control variables that are sufficient to block all confounding paths. These variables can be identified by causal structure ML algorithms.

Best Practice 4.3

Often there is a choice of multiple alternative variable sets that block confounding paths. One generally applicable choice is to control/condition on the parent set (direct causes) of the treatment variable in order to block all confounding paths connecting the treatment and T. However, this sufficient confounding-blocking variable set is not the minimal one, and it is recommended to use the minimal blocking variable set in order to maximize statistical power and minimize uncertainty in the estimation of the causal effect.

Pitfall 4.4

Discovering the correct variables to condition on can be hard or even impossible in the presence of hidden variables. Discovering the minimal blocking variable set may be computationally hard or intractable when the causal structure is large and complex.

It is also possible that the causal effect for a specific variable cannot be estimated from observational data for some causal data generating functions. In these cases experiments are needed. For the cases where causal effect estimation from observational data is feasible, Pearl’s Do-Calculus procedure will return the right set of conditioning variables. Do-calculus specifies a systematic procedure to determine if a causal effect is estimable and sequences of operations to compute the causal effect when possible [1].

Also in some distributions, discovery of a causal edge may require conditioning on all variables (which is statistically not feasible).

The Do-Calculus is critically different from conventional methods for causal structure learning and causal inference, e.g., structural equation modeling [5], path analysis [6], matching [7], propensity scoring [8]. The conventional methods of the structural equation family are generally hypothesis-driven and only examine a small fraction of the possible causal structures governing the data, which makes them likely to miss the true causal structure and produce biased estimates of causal effects. Moreover, even without any hidden variables present, the number of possible models is astronomical for even a few dozen variables, making specification of a good model like winning the lottery. Causal structure discovery algorithms together with Do-calculus circumvent these difficulties in many practical settings.

Causal Structure Discovery ML Algorithms

Given that the definition of causes involves manipulation, experimentation is by default one way to discover causal knowledge. However, in domains where experiments are unethical, technically not feasible, or resource prohibitive, or when we want to construct system-level causal models with numerous variables and all their interactions, it is inefficient, impractical and occasionally impossible to derive causation strictly with experimentation. Even when combining experimentation with observational causal algorithms (and simultaneous measurement of all variables), in the worst case N−1 experiments are needed to derive all causal relationships in a non-cyclic causal system with N variables, given that one variable can be randomized per experiment [9].

Instead, investigating the statistical relationships among variables in observational data and using the result to guide experimentation can be more efficient. In fact, in many scientific domains, in order to discover causal factors, investigators often first examine observational data for variables that are associated with their outcome of interest and then conduct experiments on a subset of the associated factors to determine the causes. This common practice reflects the attempt to use observational data to improve the efficiency of experimental causal discovery. However, as we will illustrate, association is a poor heuristic for causation. In some cases, it provides very little guidance to which experiments need to be performed.

Fig. 3 illustrates the local causal structure around a variable T of interest. Based on this causal structure, association will be identified among most variables, indicated by the lines connecting most pairs of variables shown in the middle panel (this is referred to as the correlation network in some literature). Specifically, other than variable J, all measured variables would be univariately associated with T. Therefore, statistical association is not a good strategy for identifying causes of T. Furthermore, the strength of the association is also not a good indicator of causality, since given the true causal structure, there exist distributions where non-causal variables (e.g., G, K, L) can have stronger associations with T than the causes of T. Chapter “Foundations and Properties of AI/ML Systems” provides additional theoretical results about causal ML.

Different from association-based methods, causal structure discovery methods are designed to discover causal relationships from observational data up to statistical equivalency [2]. Domain knowledge and knowledge regarding the data collection process can be readily incorporated to facilitate the discovery process [1, 2]. Different algorithms for causal structure discovery leverage different statistical relationships. For example, constraint-based algorithms infer causal relationships using conditional independence relationships, whereas score-based algorithms search for the causal structure that maximizes likelihood-based scores given the data. Further, algorithms such as IC* [1] and FCI [2] can identify hidden confounding variables, which is very helpful when we are not certain if we have measured all possible entities that participate in the causal process.

When the Causal Markov Condition and Faithfulness Condition hold, statistical associations that are non-causal can be identified by examining statistical properties such as conditional independence. For example, the association between A and T is deemed not directly causal, since A and T are independent given variables C and D (denoted as A ⊥ T ∣ C, D, where ⊥ denotes conditional independence and ∣ denotes conditioning; conditional dependence is denoted with ⊥̸). Also, the direction of a causal relation can often be resolved due to collider relations (i.e., in “Y structures” such as T → C ← A, where C is a variable known as an “unshielded collider”). Importantly, the presence of hidden variables can also be identified by examining conditional independence. It is worth noting that even in systems with a moderate number of variables (e.g., more than a few dozen), it is computationally impossible, and sample inefficient, to examine all conditional independence relationships, since the number of all possible conditional independence tests grows exponentially with the number of variables. Therefore, modern causal discovery ML algorithms implement efficient search strategies to ensure that the discovered causal structure is correct and the procedure is scalable to millions of variables in many real-life settings (see Chapter “Principles of Rigorous Development and of Appraisal of ML and AI Methods and Systems” for a detailed description of the development and validation of such methods). Table 1 below lists several classic causal structure discovery algorithms, their assumptions, the statistical relationships they examine, and the search strategies they employ.

Table 1 Summary of classic causal discovery algorithms
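Constraint-based algorithms rely on a conditional independence test. For continuous, approximately multivariate-Gaussian data a standard choice is Fisher’s z test on the partial correlation; a minimal sketch follows (using numpy only; production toolkits offer more robust implementations):

```python
import numpy as np
from math import sqrt, log, erf

def fisher_z_pvalue(x, y, z_cols):
    """p-value for the null 'x is independent of y given z_cols', using the
    partial correlation and Fisher's z transform (Gaussian assumption)."""
    n = len(x)
    Z = np.column_stack([np.ones(n)] + list(z_cols))
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    r = np.corrcoef(rx, ry)[0, 1]
    z = 0.5 * log((1 + r) / (1 - r)) * sqrt(n - len(z_cols) - 3)
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))   # two-sided normal p-value

rng = np.random.default_rng(4)
n = 5_000
A = rng.normal(size=n)             # A -> C -> T: A and T are dependent,
C = 0.8 * A + rng.normal(size=n)   # but independent given C
T = 0.8 * C + rng.normal(size=n)

print("A vs T, unconditional: p =", fisher_z_pvalue(A, T, []))    # tiny p-value: dependent
print("A vs T given C:        p =", fisher_z_pvalue(A, T, [C]))   # large p-value: independent
```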

It is worth noting that methods that rely solely on conditional independence cannot, in general, fully resolve the orientation of all edges in the causal system, since the same conditional independence relationships can correspond to multiple causal structures (i.e., belonging to the same so-called Markov Equivalence Class). For example, the conditional independence relationships corresponding to the following three causal structures X → Y → Z, X ← Y → Z, and X ← Y ← Z are identical (i.e., X, Y, and Z are pairwise dependent, but X ⊥ Z ∣ Y, X ⊥̸ Y ∣ Z, Y ⊥̸ Z ∣ X), i.e., they are Markov equivalent. Markov equivalent causal structures may still be distinguishable by statistical properties other than conditional independence. Methods that aim to tackle this problem are generally referred to as pairwise edge orientation methods, since they explore the statistical asymmetry between pairs of variables to determine causal direction. These methods, originally pioneered by D. Janzing et al. [42], typically require non-linear generating functions and/or non-Gaussian noise terms to break the symmetry in causal direction [18].

Trace of PC on Alcohol Abstinence Example

The PC algorithm is an early algorithm in the field that is no longer used in practice (because of low efficiency and high error) but is useful pedagogically to explain how causal discovery may take place. PC begins by forming a completely connected graph of undirected edges, representing no conditional independence anywhere. The algorithm proceeds by iteratively testing for conditional independence between pairs of (currently) adjacent variables conditioned on sets of increasing size. To begin with, unconditional independence is tested, followed by conditional independence with conditioning sets of cardinality one, then of cardinality two, then of cardinality three, and so forth. The members of these conditioning sets are pulled from the variables adjacent to either member of the current pair being tested. After completing all such conditional independence tests (which thins out the graph), the PC algorithm orients the undirected edges by referencing which conditioning sets were used to separate the independent pairs. For each unshielded triple, if the mediating node was not in the stored conditioning set (aka “sepset”), then it orients the triple as a collider. Lastly, a final set of orientation rules is applied that takes advantage of the acyclicity constraint [2].

Let us revisit the AUD example and step through a run of the PC algorithm. First, a completely connected graph of undirected edges is formed in Fig. 4. This represents that we have not yet seen any conditional independencies. Next, unconditional independence relations are tested (i.e., conditioning sets of size zero). In this case, five independencies are found, which results in the removal of five edges as shown in Fig. 5. Next, conditional independence relations with conditioning sets of cardinality one are tested. In this case, three conditional independencies are found, which results in the removal of three more edges as shown in Fig. 6. Next, conditional independence relations with conditioning sets of cardinality two are tested. In this case, one more conditional independence is found, which results in the removal of one more edge in Fig. 7. The PC algorithm will go on testing conditional independencies (up to the point that the conditioning sets lead to under-powered tests), but it will not find any more. Accordingly, the conditional independence phase of the algorithm will result in the undirected graph in Fig. 8.

Fig. 4
[Completely connected undirected graph over Motivation, Fast Food, Stress, Group, Adherence, and Days Abstained.]

PC trace on the AUD example: forming a completely connected graph

Fig. 5
[The Fig. 4 graph with the edges found unconditionally independent shown removed, including the edges between Stress and Adherence, Group, and Motivation, and between Group and Fast Food.]

PC on the AUD example after testing unconditional independence

Fig. 6
[The graph after cardinality-one tests: the edges from Motivation and Group to Days Abstained are removed, and there is no edge between Group and Motivation.]

PC on the AUD example after testing conditional independence with conditioning sets of cardinality one

Fig. 7
[The graph after cardinality-two tests: the edge between Fast Food and Days Abstained is removed; there is no edge between Group and Motivation.]

PC on the AUD example after testing conditional independence with conditioning sets of cardinality two

Fig. 8
[The resulting undirected skeleton over Motivation, Fast Food, Stress, Group, Adherence, and Days Abstained.]

PC on the AUD example after completing the conditional independence phase
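A deliberately simplified sketch of this conditional independence (skeleton) phase is shown below. It simulates linear-Gaussian data from the hypothetical Fig. 1 model, tests independence with Fisher’s z test, and removes edges using conditioning sets of cardinality 0, 1, and 2; on such data it recovers the undirected skeleton of Fig. 8 (the real PC algorithm differs in several details, e.g., how conditioning sets are drawn from the current neighborhoods):

```python
import numpy as np
from itertools import combinations
from math import sqrt, log, erf

rng = np.random.default_rng(5)
n = 20_000

# Simulated linear-Gaussian data from the hypothetical Fig. 1 model (illustrative coefficients;
# Group is treated as continuous so a single Fisher z test can be used throughout).
data = {"Motivation": rng.normal(size=n), "Stress": rng.normal(size=n), "Group": rng.normal(size=n)}
data["FastFood"] = -0.5 * data["Motivation"] + 0.5 * data["Stress"] + rng.normal(size=n)
data["Adherence"] = 0.7 * data["Motivation"] + 0.8 * data["Group"] + rng.normal(size=n)
data["DaysAbstained"] = 0.2 * data["Adherence"] - 0.7 * data["Stress"] + rng.normal(size=n)

def independent(x, y, zs, alpha=0.01):
    """Fisher z test of x independent of y given zs (True if independence is not rejected)."""
    Z = np.column_stack([np.ones(n)] + [data[z] for z in zs])
    rx = data[x] - Z @ np.linalg.lstsq(Z, data[x], rcond=None)[0]
    ry = data[y] - Z @ np.linalg.lstsq(Z, data[y], rcond=None)[0]
    r = np.corrcoef(rx, ry)[0, 1]
    z = 0.5 * log((1 + r) / (1 - r)) * sqrt(n - len(zs) - 3)
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2)))) > alpha

# Skeleton phase: start fully connected; remove an edge as soon as its endpoints are
# found (conditionally) independent given some subset of the currently adjacent variables.
nodes = list(data)
edges = {frozenset(p) for p in combinations(nodes, 2)}
for size in range(3):                                   # conditioning sets of cardinality 0, 1, 2
    for x, y in combinations(nodes, 2):
        if frozenset((x, y)) not in edges:
            continue
        neighbours = {v for v in nodes if v not in (x, y)
                      and (frozenset((x, v)) in edges or frozenset((y, v)) in edges)}
        if any(independent(x, y, zs) for zs in combinations(sorted(neighbours), size)):
            edges.discard(frozenset((x, y)))
print(sorted(tuple(sorted(e)) for e in edges))          # the undirected skeleton of Fig. 8
```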

The edges of the undirected graph in Fig. 8 are oriented by referencing which conditioning sets were used to separate the independent pairs. For each unshielded triple, if the mediating node was not in the conditioning set, then PC orients the edges to make the middle node of the triple a collider. In the AUD example, the endpoints of the triples <Motivation, Fast Food, Stress>, <Motivation, Adherence, Group>, and <Stress, Days Abstained, Adherence> were not separated by Fast Food, Adherence, and Days Abstained, respectively. Accordingly, PC orients these triples as involving colliders in Fig. 9. At this point, the final set of orientation rules would normally be applied; however, the graph has already been fully oriented. Note that in general, there are many cases where the final model produced by PC will have some undirected edges remaining, corresponding to a set of possible graphs that are not fully resolved.

Fig. 9
[The oriented graph: Stress and Motivation point to Fast Food; Motivation and Group point to Adherence; Adherence and Stress point to Days Abstained.]

PC on the AUD example after orienting unshielded colliders

Latent Variables

As depicted in Fig. 10 (left), suppose Motivation and Stress are latent (i.e., unmeasured, aka “hidden”). In this case, if we run PC, the resulting graph would be the one shown in Fig. 10 (right). This graph correctly identifies the causal path from Group to Days Abstained and can be used to correctly estimate the effect of Group on Adherence. However, the graph is misleading in that it suggests that the effect of Adherence on Days Abstained can be estimated by conditioning on Fast Food, when in reality conditioning on Fast Food opens a confounding path and leads to inaccurate causal effect estimates. Such examples showcase the need to use algorithms that reveal latent variables and their confounding effects on measured ones.

Fig. 10
[Left: the Fig. 1 DAG with Motivation and Stress latent; the edges among Group, Adherence, and Days Abstained are highlighted. Right: the PC output over the measured variables: Group connects to Adherence, which connects to Days Abstained, and Fast Food connects to both Adherence and Days Abstained.]

A DAG with latents and the corresponding PC output

We can think about latent variables as variables that have been marginalized out of a larger, complete, but not fully observed, set of variables. In this paradigm, we assume that the causal model over the complete set of variables is a DAG. Thus, under this assumption, the “margin” of a DAG is a natural choice to represent the model’s structure over the observed set of variables.

Intuitively, the “margin” of a DAG should be a graph whose restrictions on the model manifest as conditional independencies in the marginal probability distribution. More precisely, the conditional independence statements implied by the marginalized graph should be the subset of conditional independence statements implied by the DAG over the remaining variables after marginalization.

Unfortunately, DAGs are not closed under marginalization in this sense. That is, there are DAGs with margins that are not consistent with any DAG. For example, in Fig. 10 (left), Group is independent of Fast Food and independent of Days Abstained conditioned on Adherence; however, no DAG represents these exact relations. Accordingly, a richer family of graphs is required to represent margins of DAGs.

Acyclic Directed Mixed Graphs

Acyclic directed mixed graphs (ADMGs) characterize margins of DAGs (Richardson and Spirtes 2002, Richardson 2003). These graphs additionally include bi-directed edges; see Fig. 11 (right) for an illustration. Intuitively, an ADMG can be constructed from a DAG with latent variables by a simple latent projection.

  1. For each unshielded non-collider with a latent mediating variable:

    (a) If the triple is directed, add a similarly directed edge between the endpoints;

    (b) Otherwise, add a bi-directed edge.

  2. Remove all latent variables.

Importantly, ancestral relationships are invariant under latent projection. For example, in Fig. 11, Group and Adherence are ancestors of Days Abstained while Fast Food is not. These (non-)ancestral relationships are preserved during the marginalization process. Accordingly, it is sufficient to learn an ADMG over the measured subset of variables to infer the presence (or absence) of causal relationships between the measured variables. To this end, the FCI and GFCI algorithms learn Markov equivalence classes of ADMGs [2, 11, 13].

Fig. 11
[Left: the Fig. 1 DAG with Motivation and Stress latent; the edges among Group, Adherence, and Days Abstained are highlighted. Right: the corresponding ADMG: Group points to Adherence, which points to Days Abstained, and Fast Food shares bi-directed edges with Adherence and Days Abstained.]

A DAG with latents and the corresponding ADMG

General Practical Approach to Causal ML

In this section, we describe a protocol for conducting causal analysis that involves the following 6 steps:

Best Practice 4.4

A Protocol for health science causal ML

  1. Define the goal of the analysis.

  2. Preprocess the data.

  3. Conduct causal structure discovery.

  4. Conduct causal effect estimation.

  5. Assess the quality and reliability of the results.

  6. Implementation and enhancement of results.

We will next walk the reader through the six-step process, also pointing out common pitfalls and how to overcome them.

Evaluate whether the Goals of Modeling Require Causal Analysis

Problem solving that is causal in nature is best addressed by causal modeling. In health-related domains, causal questions generally involve the mechanisms and especially the treatment of diseases, interventions on the healthcare system, or discovery of biological causality. Some example causal questions in our simple vignette are: (1) What are the causes of alcohol abuse? (2) How much improvement in abstinence days can be expected if some motivation enhancer improves motivation by one standard deviation? (3) What is the best treatment for a 35 year-old male with a high stress job who suffers from alcohol abuse?

Questions regarding risk prediction (e.g., what is the probability of failing 30 days abstinence for a 35 year-old male with a high stress job?) can also be answered with causal discovery analysis. After all, if we have an accurate causal model, we can instantiate the causal model with the relevant observational information to deduce a risk prediction. However, the current generation of predictive modeling methods have advantages in answering risk prediction questions not involving manipulations compared to the current generation of causal discovery methods. The main reasons for this are: the current generation of predictive models are discriminative rather than generative, can represent more complex mathematical relationships, have built-in regularization to avoid overfitting, and most importantly are fitted by analytical protocols such as nested cross validation that maximize their predictive performance via model selection and unbiased performance estimation (whereas causal models seek first and foremost causal validity and have no access to causal error estimators analogous to those available to predictive models). Therefore, we recommend using predictive modeling methods (see chapter “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare and Health Science”) for prediction-related tasks.

We also note that Markov Boundary feature selectors naturally bridge the predictive and causal domains by revealing local causal edges and retaining the maximum predictive information. They are not, however, full-fledged causal discovery procedures and need to be combined with other causal algorithms for complete causal discovery (chapter “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare and Health Science”).

It is also worth noting that some questions appear to require predictive modeling on the surface, but actually require causal knowledge to answer correctly. Consider, for example, the question “what is the risk of relapse within 30 days for a 35 year-old who suffers from alcohol abuse, if he were treated with naltrexone?” This question can be addressed correctly with predictive modeling if we have observed similar patients, some taking the medication and others not, in a randomized assignment. If, however, the observed treatments were not randomized, then we would need causal modeling to answer it correctly.

Check If the Data Is Suitable for Causal Discovery Analysis and Preprocess Data Appropriately

Causal discovery analysis can be applied to a wide range of data. To ensure that the discovered causal relationships correspond to the goal of the project and have biological and clinical relevance, special attention needs to be paid to the data design, the data collection process, and the data elements being collected. We point out several common situations where data preprocessing might be needed.

Deterministic relationships: Existing causal discovery algorithms may produce erroneous results when deterministic relationships are present among variables. Examples of deterministic relationships are: item scores for a depression inventory vs. the total score for the same depression inventory; height and weight vs. BMI. How to incorporate this kind of information into causal discovery analysis is an area of active research, but at present our recommendation is to eliminate deterministic relationships by using a subset of the variables that are involved in the deterministic relationships.

Specificity of Measurements: Some measurements/variables can contain information from multiple related variables. This is common in mental health data. For example, the depression inventory also often measures anxiety symptoms, and vice versa. As a result, when causal discovery analysis is conducted on these types of variables, a high amount of interconnectivity is found. High connectivity in a causal graph is not in general problematic. But in this case, it is an artifact of the lack of specificity in the measurements and the findings are hard to interpret causally. One way to improve the measurement specificity is through feature engineering, i.e. instead of directly using the original variables with low specificity, we can construct new variables (e.g. separate out depression specific items from the depression inventory and anxiety inventory, and construct a variable that represents depression more specifically). This can be done by either using prior knowledge, or using data driven methods such as factor analysis [19].

Using EHR Data: Different from observational research and clinical trial data, EHR data are collected at irregular time intervals as part of the patients’ clinical care. Using EHR data for any modeling generally requires the researcher to come up with a study design, construct a specific patient cohort, extract the relevant EHR data from it, and preprocess the data according to the goal of the study. These steps should follow the same principles used for designing observational studies while considering the specific properties of EHR data [20,21,22,23]. For data preprocessing, one needs to consider the nature of the various EHR data elements and the nature of the diseases. For example, a diagnosis code might appear in a patient’s record multiple times. This may be due to multiple episodes of an acute disease, or may be due to a chronic disease, and differentiating between them may be important. Similarly, the timestamps in the EHR reflect the documentation time of an event and often do not coincide with the onset time of the measured event. Further, missing measurements in EHR data are almost certainly not missing at random, since the care providers decide if a measurement will be taken based on the patients’ symptoms. Therefore, using imputation algorithms that assume missingness at random would cause errors in the analysis. Some of these challenges in EHR data can be handled with preprocessing, but others might require adaptations to generic causal discovery algorithms [24, 25].

Causal Structure Discovery

Prior knowledge can be readily incorporated into many causal structure discovery algorithms and can greatly facilitate structure discovery, especially edge orientation. Prior knowledge can come from multiple sources. One source is knowledge of the data collection process. For example, in datasets with a longitudinal design, information is collected at multiple time points. This timing information can assist edge orientation, since events that happen later cannot be the cause of events that happened earlier. It is worth noting that one needs to distinguish between the time that an event happens vs. the time that the event was measured or documented. For example, we might have measured a patient’s single-nucleotide polymorphism (SNP) and their depression score at the same clinical encounter, but to study the causes of depression, we would assign the SNP to an earlier time point than the depression score. Further, one needs to consider whether a variable contains information over a time span. For example, HbA1C contains information regarding glucose over the past 2 or 3 months, so it can be assigned an earlier time point compared to variables that reflect instantaneous information, such as the blood oxygen level, if these variables were measured at the same time.

Prior knowledge can also come from experimental design. For example, in randomized trials, due to the randomization, the participant’s pretreatment measurements should not cause the treatment assignment. This prior knowledge can be encoded by prohibiting edges that emerge from the pretreatment variables to the treatment variable.

A third source of prior knowledge is existing domain knowledge. One can enforce the presence or absence of edges according to existing domain knowledge, but this needs to be done with caution, since the conditions under which the domain knowledge was obtained can be different from the current dataset and its data generating process.

Since incorporating prior knowledge can have a significant effect on the resulting causal structure, we recommend only incorporating the most reliable prior knowledge as input to the causal discovery algorithm. We also encourage performing the causal structure discovery analysis with and without all or a subset of the prior knowledge as sensitivity analysis to investigate the added value of incorporating prior knowledge.

Choice of Algorithms

At a high level, the performance of causal structure discovery largely depends on: (1) whether the algorithm used is theoretically guaranteed to produce the correct results under its assumptions, (2) how far the data deviate from the assumptions of the algorithm, and (3) statistical power. One should always choose algorithms that have solid theoretical properties and clearly defined assumptions (see for example the list of algorithms and assumptions in Table 1). The choice of algorithms can also be informed by benchmark studies applied to data with similar characteristics [15, 26, 27].

One important assumption is causal sufficiency, i.e., the absence of latent common causes. Latent common causes, also referred to as latent variables, are likely present in health data; therefore, it is recommended to apply algorithms that can detect latent variables, such as FCI or GFCI, directly to the data for causal discovery, or to use FCI and GFCI [2] as a second step for latent variable discovery following skeleton discovery by another, more scalable algorithm.

Another common violation of the assumptions of most causal discovery algorithms is target information equivalence [28]. In health data, overlapping or redundant information often exists, such as co-occurring symptoms in multiple organs and systems, concurrent abnormal lab values from different lab tests, and simultaneously disturbed molecular pathways. This information redundancy can result in target information equivalence, where multiple variable sets contain statistically equivalent information regarding a target variable of interest. Due to the statistical equivalency, the causal role of these variables cannot be determined from observational data alone. Applying algorithms that are not designed and equipped to handle target information equivalence will likely result in missing important causal information. We recommend always investigating the presence and consequences of target information equivalence when appropriate algorithms are available (e.g., TIE*) [28,29,30,31].

Statistical power also influences the choice of algorithm. For smaller sample sizes with large numbers of variables, local causal discovery algorithms have advantages over global causal discovery algorithms due to their sample efficiency [15, 16]. It is also worth noting that the algorithms’ parameter settings impact statistical power and therefore the algorithms’ performance. The choice of the underlying statistical test (for constraint-based algorithms) or score (for score-based algorithms) needs to correspond to the distribution of the data to maximize statistical power. For example, for constraint-based algorithms, Fisher’s Z test is recommended for multivariate Gaussian data, and the G2 test is recommended for discrete data. Special distributions and data designs (e.g., time-to-event longitudinal designs) might require specially designed statistical tests. Further, only a subset of algorithms can scale to a very high number of variables without compromising correctness, thus being applicable to high-dimensional data such as omics data [15].
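
To illustrate matching the statistical test to the data distribution, below is a minimal sketch of Fisher’s Z conditional-independence test for (approximately) multivariate Gaussian data, computing the partial correlation from the precision matrix of the relevant columns; the data layout and column indices are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def fisher_z_test(data, x, y, cond_set=()):
    """Fisher's Z conditional-independence test of X _||_ Y | cond_set for
    (approximately) multivariate Gaussian data. Returns a p-value; small
    values are evidence against conditional independence."""
    cols = [x, y, *cond_set]
    precision = np.linalg.pinv(np.cov(data[:, cols], rowvar=False))
    # Partial correlation of x and y given cond_set, from the precision matrix.
    r = -precision[0, 1] / np.sqrt(precision[0, 0] * precision[1, 1])
    r = np.clip(r, -0.9999999, 0.9999999)
    n = data.shape[0]
    z = 0.5 * np.log((1 + r) / (1 - r))              # Fisher's z-transform
    stat = np.sqrt(n - len(cond_set) - 3) * abs(z)   # ~ N(0, 1) under independence
    return 2 * (1 - norm.cdf(stat))

# Hypothetical usage: test columns 0 and 1 given column 2 of a samples-by-variables matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
print(fisher_z_test(X, 0, 1, cond_set=(2,)))
```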

Causal Effect Estimation

Causal effect estimation should follow causal structure discovery and be based on the discovered causal structure. Causal effects can be estimated using Do-calculus principles with SEMs, as described in section “Structural Equation Models (SEMs)” above. Note that the SEM used for effect estimation is not restricted to linear regressions; other mathematical models can be adopted to accommodate non-Gaussian distributions and complex relationships as needed.
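
As a minimal illustration (under a hypothetical linear-Gaussian SEM with simulated variables), the sketch below estimates a total causal effect by back-door adjustment: regress the outcome on the treatment plus a variable set that blocks all confounding paths, and read the effect off the treatment coefficient.

```python
import numpy as np

def linear_causal_effect(data, treatment, outcome, adjustment_set):
    """Estimate the total causal effect of `treatment` on `outcome` in a
    linear-Gaussian SEM via back-door adjustment: regress the outcome on the
    treatment plus the adjustment set and return the treatment coefficient."""
    n = len(data[treatment])
    X = np.column_stack([data[treatment],
                         *(data[z] for z in adjustment_set),
                         np.ones(n)])
    coef, *_ = np.linalg.lstsq(X, data[outcome], rcond=None)
    return coef[0]  # coefficient of the treatment column

# Hypothetical simulated graph: C -> T, C -> Y, T -> Y (C confounds T and Y).
rng = np.random.default_rng(1)
n = 5000
C = rng.normal(size=n)
T = 0.8 * C + rng.normal(size=n)
Y = 1.5 * T + 2.0 * C + rng.normal(size=n)
data = {"C": C, "T": T, "Y": Y}

print(linear_causal_effect(data, "T", "Y", adjustment_set=["C"]))  # ~1.5 (adjusted)
print(linear_causal_effect(data, "T", "Y", adjustment_set=[]))     # biased (confounded)
```

Adjusting for the confounder recovers the simulated effect of 1.5, while the unadjusted estimate is biased by the open confounding path.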

It is worth noting that traditional causal effect estimation methods, such as propensity score-based methods and matching, generally either rely on hypothesized causal graphs or on untestable assumptions for correctness (i.e., “strong ignorability”), or lack an explicitly defined causal graph associated with the effect estimation. Given the large space of possible graphs over the set of observed variables, it is highly unlikely that a hypothesized graph corresponds to the true causal structure; therefore these methods can, and often do, lead to biased effect estimates. On the other hand, the lack of an explicitly defined causal graph makes it difficult to interpret the results practically and to state the properties of the effect estimates [1].

Quality Check and Interpretation of Results

There are several ways to evaluate the discovered causal graphs and estimated causal effects. We recommend several analyses that are suitable for most causal analyses, but problem-specific evaluations should be designed and conducted when appropriate.

Stability of causal discovery. The stability of the causal discovery procedure with respect to sampling variability can be assessed by bootstrap analysis [32]. In bootstrapping, the same causal discovery procedure (causal structure discovery and causal effect estimation) that was performed on the entire sample is repeatedly applied to different bootstrap samples. The fraction of bootstrap samples in which an edge is discovered represents the stability of the causal structure discovery for that edge. Edges with fractions close to 0 or 1 indicate higher stability, meaning they are consistently absent or present across bootstrap samples (i.e., they are more robust to sampling variation). The empirical distribution, over all bootstrap samples, of the estimated causal effect of manipulating one variable on another represents the stability of both the causal structure discovery and the effect estimation; an empirical distribution with smaller variance indicates better stability. Poor stability can indicate issues related to the distribution of the data. For example, it is possible that the identification of a particular edge from the entire dataset was driven by outliers, in which case the edge would have low stability in bootstrap samples. When poor stability is observed, we recommend careful inspection of the empirical distribution of the data, as well as target information equivalence analysis (which may be driving the instability).
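
A minimal sketch of the bootstrap stability check is shown below; `discover_structure` is a hypothetical placeholder for whatever structure discovery procedure is being assessed, assumed to take a samples-by-variables matrix and return a set of directed edges.

```python
import numpy as np
from collections import Counter

def edge_stability(data, discover_structure, n_boot=200, seed=0):
    """Re-run a structure discovery procedure on bootstrap resamples of the rows
    and return, for each edge ever discovered, the fraction of resamples in which
    it appears. `discover_structure`: (n x p array) -> set of (cause, effect) edges."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    counts = Counter()
    for _ in range(n_boot):
        resample = data[rng.integers(0, n, size=n)]  # sample rows with replacement
        counts.update(discover_structure(resample))
    return {edge: c / n_boot for edge, c in counts.items()}

# Hypothetical usage: frequencies near 1 (or edges that never appear at all) indicate
# stability; intermediate frequencies flag edges sensitive to sampling variability.
# stability = edge_stability(data_matrix, discover_structure=my_discovery_wrapper)
```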

Fit of the causal model to the data. After causal structure discovery and parameterization of the causal model (i.e., estimating the parameters of the functions X_i = f_i(Parents(X_i))), one can assess the fit of the causal model to the data. In general, the fit can be assessed with scores like the BIC. If the model parameters are estimated with SEM software (e.g., OpenMx, Mplus, Lavaan), one can also examine common goodness-of-fit metrics from the SEM literature [33,34,35].
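
For a linear-Gaussian parameterization, the BIC can be computed directly from the per-variable regressions on the discovered parents, as in the sketch below (the graph and variable names are hypothetical, reusing the simulated C, T, Y example above).

```python
import numpy as np

def linear_gaussian_bic(data, parents):
    """BIC of a linear-Gaussian causal model. `data` maps variable name -> 1-D array;
    `parents` maps each variable to the list of its parents in the discovered graph.
    Each variable is regressed on its parents; the Gaussian log-likelihood of the
    residuals is accumulated and penalized by (number of parameters) * log(n)."""
    n = len(next(iter(data.values())))
    loglik, n_params = 0.0, 0
    for var, pa in parents.items():
        X = np.column_stack([*(data[p] for p in pa), np.ones(n)])
        y = data[var]
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        sigma2 = np.mean((y - X @ coef) ** 2)        # MLE of the residual variance
        loglik += -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
        n_params += len(pa) + 2                      # coefficients + intercept + variance
    return n_params * np.log(n) - 2 * loglik         # lower BIC = better fit

# Hypothetical usage with the simulated C, T, Y example above:
# print(linear_gaussian_bic(data, {"C": [], "T": ["C"], "Y": ["T", "C"]}))
```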

Generalizability. To test the generalizability of causal discovery results, one can identify a separate dataset that contains the same or similar measurements as the primary dataset, conduct the causal analysis, and compare the results. For example, comparing causal discovery results on EHR data collected from different hospital systems assesses the generalizability of causal mechanisms across different patient populations [24]; comparing causal discovery results in a veteran population with PTSD vs. a civilian population with traumatic brain injury tests the generalizability of causal mechanisms across different disease populations [36].

It is worth noting that the goal of testing generalizability is not to require that the causal mechanisms underlying two datasets be the same, but rather to assess whether the causal mechanisms are indeed the same in different populations. Discrepancies among causal mechanisms discovered from multiple datasets do not indicate that the discovered causal mechanisms are incorrect; they merely indicate that the discovered causal mechanisms differ (because of population or external-factor differences). The differences can be due to a variety of technical factors, such as sample sizes, sampling bias in one or more of the datasets, differences in data collection protocols, and differences in measurements. They can also be due to genuine differences in the causal mechanisms of the two populations. Nevertheless, assessing causal mechanisms in multiple datasets that bear similarity is beneficial, since it helps identify causal relationships that are stable across datasets and highlights differing causal pathways.
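
A simple way to operationalize such a comparison is to contrast the edge sets discovered from the two datasets, as in the sketch below (the dataset names are hypothetical); more refined comparisons could also contrast edge orientations or estimated effect sizes.

```python
def compare_edge_sets(edges_a, edges_b):
    """Compare causal structures discovered from two datasets with the same (or
    overlapping) measurements: shared edges, edges unique to each dataset, and the
    Jaccard similarity of the two edge sets."""
    shared, union = edges_a & edges_b, edges_a | edges_b
    return {
        "shared": shared,
        "only_in_a": edges_a - edges_b,
        "only_in_b": edges_b - edges_a,
        "jaccard": len(shared) / len(union) if union else 1.0,
    }

# Hypothetical usage with edge sets discovered from two hospital systems:
# report = compare_edge_sets(edges_hospital_1, edges_hospital_2)
```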

Quality of the local causal neighborhood. The predictive performance for a variable of interest can be used to assess the quality of the local causal structure around that variable. This is related to the concept of the Markov boundary (chapter “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare and Health Science”). Recall that the Markov boundary of a variable of interest is the smallest variable set that contains the maximum amount of (predictive) information about the variable [1]. Under the faithfulness condition, and with no latents, the Markov boundary consists of the direct causes, direct effects, and direct causes of the direct effects of the variable of interest [37]. Therefore, one way to assess whether the discovered causal structure captures the Markov boundary of the response variable is to compare the predictive performance of the discovered Markov boundary to that of the best predictive model we can build for this variable from this dataset (see chapters “The Development Process and Lifecycle of Clinical Grade and Other Safety and Performance-Sensitive AI/ML Models” and “Overfitting, Underfitting and General Model Overconfidence and Under-Performance Pitfalls and Best Practices in Machine Learning and AI” for more details). If the predictive performance of the discovered Markov boundary is statistically indistinguishable from that of the best model, we can be assured that the Markov boundary variables contain the true local causes and effects (subject to any confounding due to latents). It is worth noting that when target information equivalence is present for the response variable of interest (which constitutes a faithfulness violation), there are many variable sets (multiple Markov boundaries) that are predictively equivalent, contain the maximum amount of information regarding the variable of interest, and are minimal. In this case, observing statistically indistinguishable predictive performance of one Markov boundary vs. the best model does not guarantee the causal role of that Markov boundary. However, one of the multiple Markov boundaries contains exactly the direct causes, direct effects, and direct causes of direct effects of the variable of interest (always subject to latent confounding). Finally, a variable that appears in all members of the Markov boundary equivalence class is guaranteed to be causal, subject to confounding. An example of applying this method is [31].
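
A minimal sketch of this check is given below. It uses scikit-learn (an assumption; any well-tuned learner would do) and treats a classifier trained on all variables as a stand-in for the best achievable predictive model; `mb_columns` is a hypothetical list of column indices for the discovered Markov boundary, and a rigorous analysis would compare the two performance estimates with a proper statistical test rather than by inspecting fold means.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def markov_boundary_check(X, y, mb_columns, cv=10, seed=0):
    """Compare cross-validated AUC of a classifier restricted to the discovered
    Markov boundary against the same classifier using all variables (a stand-in
    for the best achievable predictive model)."""
    model = RandomForestClassifier(n_estimators=500, random_state=seed)
    auc_mb = cross_val_score(model, X[:, mb_columns], y, cv=cv, scoring="roc_auc")
    auc_all = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    return {"markov_boundary": (auc_mb.mean(), auc_mb.std()),
            "all_variables": (auc_all.mean(), auc_all.std())}

# Hypothetical usage: mb_columns holds the column indices of the discovered Markov
# boundary of the response. Comparable AUCs are consistent with (but do not prove)
# the discovered local neighborhood being information-complete.
# results = markov_boundary_check(X, y, mb_columns=[3, 7, 12])
```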

Experimental Validation. Another way to partially assess the validity of the discovered causal relationships and effect sizes is experimental validation. For example, one can select a variable to manipulate, observe its effects on another variable, and compare the experimental result with the effect estimated by the causal model. This form of validation is generally costly and sometimes infeasible. Experimental results can also come from prior studies, such as RCTs. An example of this is [38].

Implementation and Enhancement of Results

Causal Discovery Guided Experimentation. In many domains of medicine, it is common practice to observe correlational relationships, hypothesize that they are causal, and then test this hypothesis with experimentation. This procedure is inefficient, since numerous correlational relationships are due to confounding and are not causal. A more efficient approach is to use causal structure discovery algorithms to eliminate the majority of correlational relationships that are non-causal and to resolve any remaining false positives by experimentation. Hybrid causal ML/experimental algorithms exist that are designed to minimize the number of experiments needed for discovery [39].

Considerations for Clinical Translation. One of the main goals of causal discovery applied to biomedical data is to discover novel treatments that can benefit patients. Molecular and other targets that causally affect patient outcomes are potential treatments. Key considerations for selecting targets for treatment are their effectiveness, robustness, and safety, and causal modeling can help evaluate each of these aspects. For example, a causal factor with a large effect size and small variability indicates that a corresponding drug treatment would work well and have consistent performance over the patient population; in such cases, the potential treatment could be prescribed to patients regardless of their characteristics. If a putative treatment’s effect has large variability, however, this indicates that the response to the treatment could benefit from precision medicine administration [40, 41]. Further, with the help of causal analysis, one can evaluate not only the effect of the treatment on outcomes but also its side effects, and select therapeutic targets that maximize patient benefit and minimize side effects.

In summary, the goals of causal ML in health are to discover knowledge that is (1) biologically and clinically relevant, (2) correct and generalizable, and (3) translatable into clinical applications and incorporable into the clinical workflow to benefit patients. A multidisciplinary team consisting of clinical and biological domain experts, health data scientists specialized in causal discovery, clinical informaticists, and implementation scientists working closely together is well suited to achieve these goals.

Key Concepts Discussed in Chapter “Foundations of Causal ML”

Causal Inference, Causal Structure Discovery.

Graph, Directed Acyclic Graph (DAG).

Causal Model, Causal Markov Condition, Causal Probabilistic Graphical Model (CPGM), Properties of CPGMs.

Distinction between predictive and causal ML models.

d-separation and Faithfulness.

Structural Equation Models (SEMs).

Causal Effect Estimation and Do-Calculus.

Causal Structure Discovery algorithms.

Acyclic Directed Mixed Graphs (ADMG).

Protocol for health science causal ML.

Pitfalls Discussed in Chapter “Foundations of Causal ML”

Pitfall 4.1. Popular and successful predictive ML methods are not designed and equipped to satisfy the essential requirements of causal modeling.

Pitfall 4.2. Using SEMs to estimate causal effects with the wrong causal structure.

Pitfall 4.3. Using regression to estimate causal effects without knowing the true causal structure (and making assumptions about which are the true measured confounders).

Pitfall 4.4. Discovering the correct variables to condition on can be hard or even impossible in the presence of hidden variables. Discovering the minimal blocking variable set may be computationally hard or even intractable when the causal structure is large and complex.

Best Practices Discussed in Chapter “Foundations of Causal ML”

Best Practice 4.1. For predictive tasks (i.e., without interventions contemplated) use of Predictive ML should be first priority. For causal tasks (i.e., with interventions contemplated) use of Causal ML should be first priority.

Best Practice 4.2. In order to estimate unbiased causal effects, control for variables that are sufficient to block all confounding paths. These variables can be identified by causal structure ML algorithms.

Best Practice 4.3. Often there is a choice among multiple alternative variable sets that block confounding paths. One applicable choice is to control/condition on the set Pa(A) in order to block all confounding paths connecting A and T. However, this sufficient confounding-blocking variable set is not necessarily the minimal one, and it is recommended to use the minimal blocking variable set in order to maximize statistical power and minimize uncertainty in the estimation of the causal effect.

Best Practice 4.4. A Protocol for health science causal ML.

  1. Define the goal of the analysis.

  2. Preprocess the data.

  3. Conduct causal structure discovery.

  4. Conduct causal effect estimation.

  5. Assess the quality and reliability of the results.

  6. Implementation and enhancement of results.

Classroom Assignments and Discussion Topics for Chapter “Foundations of Causal ML”

  1. Give:

     (a) 2 examples of causal discovery problems and 2 examples of predictive problems in healthcare management.

     (b) 2 examples of causal discovery problems and 2 examples of predictive problems in clinical care.

     (c) 2 examples of causal discovery problems and 2 examples of predictive problems in health sciences research.

     (d) Discuss how predictive modeling that does not take into account causality may lead to flawed decisions in each of the above causal example applications.

  2. Someone presented to you a model for predicting the probability of cancer relapse with high predictive performance, derived from observational data using a convolutional neural network. Can you deduce potential relapse prevention strategies from this model?

  3. Write a computer program in your favorite programming language to generate data based on the causal graph specified in Fig. 1.

     (a) Based on the data you generated and the causal structure in Fig. 1, estimate the causal effect for each cause-effect pair in Fig. 1 (hint: you should obtain coefficients similar to what is specified in Fig. 1).

     (b) Apply the PC algorithm or the FGES algorithm to the data you generated to discover the causal structure and compare it to the true causal structure specified in the figure.