Methods of Causal Analysis for Health Risk Assessment with Observational Data

Abstract

Perhaps no other topic in risk analysis is more difficult, more controversial, or more important to risk management policy analysts and decision-makers than how to draw valid, correctly qualified conclusions from observational data. Chapters 1, 2, 7, and 8 have warned against the common practice of using statistical regression models in place of causal analysis (Pearl 2014), and have suggested some alternatives, including causal analysis and Bayesian network (BN) modeling. This chapter examines these methods in greater detail. It is a more technically dense chapter than others in this book, but the required effort pays substantial dividends in methods for more clearly assessing evidence of causality in exposure concentration-response (C-R) functions, as illustrated in the chapters that follow. Readers who are content to accept that “correlation is not causation” and who are satisfied with the high-level descriptions of Bayesian networks and causal analysis given in Chaps. 1 and 2 may prefer to skim this chapter and move directly to the applications in later chapters.


Appendices

Appendix 1: Directed Acyclic Graph (DAG) Concepts and Methods

Directed acyclic graph (DAG) models are widely used in current causal analytics. This appendix summarizes some key technical methods for causal graphs and DAG models; for details, see Pearl and Mackenzie (2018). A DAG model with conditional probability tables (CPTs) for its nodes but with arrows that do not necessarily represent causality is called a Bayesian network (BN). In a BN, observing or assuming the values of some of the variables lets conditional probability distributions for the other variables be calculated based on this information; this is done by BN inference algorithms. If the BN is also a causal graph model, then it also predicts how changing some variables (e.g., smoking or income) will change others (e.g., risk of heart disease). Whether a given BN is also a causal graph model is determined by whether its arrows and CPTs describe how changes in the values of some variables change the conditional probability distributions of their children.

The arrows in a BN reveal conditional independence relations between variables, insofar as a variable is conditionally independent of its more remote ancestors and other non-descendants, given the values of its parents; in machine learning for BNs, this is commonly referred to as the Markov condition. It is often assumed to hold, but can be violated if unobserved latent variables induce statistical dependencies between a variable and its non-descendants, even after conditioning on its parents. The conditional independence relations implied by a DAG model can be systematically enumerated using graph-theoretic algorithms available in free software packages such as dagitty (Textor et al. 2016). They can be tested for consistency with available data by statistical independence testing algorithms included in statistical packages such as the R package bnlearn (Scutari and Ness 2018). For example, the three DAG models X → Y → Z, X ← Y → Z, and X ← Y ← Z all imply that X and Z are conditionally independent of each other given the value of Y; in the technical literature they are said to be in the same Markov equivalence class, meaning that they imply the same conditional independence relations. By contrast, the DAG X → Y ← Z implies that X and Z are unconditionally independent but dependent after conditioning on Y. Thus, in the model Age → Heart_disease ← Sex, data analysis should show that Age and Sex are independent (so that their joint frequency distribution is not significantly different from the product of their marginal distributions), but that they become dependent after conditioning on a value of Heart_disease. (If, as in Fig. 9.2, heart disease risk increases with both age and sex (coded so that 0 = female, 1 = male), then among patients with heart disease, age and sex should be negatively associated, since a low value of either one implies that the other must have a high value to make heart disease a likely outcome. Thus conditioning on heart disease makes sex and age informative about each other even if they are statistically independent in the absence of such conditioning.) Since DAG models imply such testable conditional independence and dependence relations, the observed dependence and independence relations in a data set can be used to constrain the DAG models describing the joint probability distribution of its variables.
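
The following minimal sketch in R illustrates how such implied independence and dependence relations can be enumerated and tested; the data frame df (with factor columns Age, Sex, and Heart_disease) is hypothetical, and the calls shown are standard dagitty and bnlearn functions rather than code from this chapter.

```r
library(dagitty)
library(bnlearn)

# The collider structure discussed in the text: Age -> Heart_disease <- Sex.
g <- dagitty("dag { Age -> Heart_disease <- Sex }")

# Enumerate the testable conditional independencies implied by this DAG
# (here: Age _||_ Sex marginally, but not after conditioning on Heart_disease).
impliedConditionalIndependencies(g)

# Test them against data, using mutual-information tests for discrete variables.
ci.test("Age", "Sex", data = df, test = "mi")                   # expect independence
ci.test("Age", "Sex", "Heart_disease", data = df, test = "mi")  # expect dependence
```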

Adjustment Sets, Partial Dependence Plots, and Estimation of Predictive Causal Effects

Unbiased estimates of both the direct causal effect and the total causal effect of one variable on another in a DAG can be obtained by conditioning on an appropriate adjustment set of other variables (Greenland et al. 1999; Glymour and Greenland 2008; Knüppel and Stang 2010; Textor et al. 2016). The direct effect on Y of a unit change in X is defined as the change in the conditional mean (or, more generally, the conditional probability distribution) of Y in response to a unit change in X when only X and Y are allowed to change and all other variables are held fixed. By contrast, the total effect on Y of a unit change in X is the change in the conditional probability distribution (or its mean) for Y when X changes and other descendants of X are allowed to change in response. For example, in Fig. 9.1, the total effect of an increase in age on risk of heart disease includes the indirect effects of age-related changes in income (including any income-related changes in smoking behavior), as well as the direct effect of age itself on heart disease risk. An adjustment set for estimating the direct or total causal effect of X on Y in a causal DAG model is a set of observed variables to condition on to obtain an unbiased estimate of the effect (e.g., by including them on the right side of a regression model, random forest model, or other predictive analytics model, with the dependent variable Y to be predicted on its left side and explanatory variables (predictors) consisting of X and the members of the adjustment set on its right side). (The main technical idea for estimating total causal effects is that an adjustment set blocks all noncausal paths between X and Y by conditioning on appropriate variables (such as confounders), without blocking any causal path from X to Y by conditioning on variables in directed paths from X to Y, and without creating selection, stratification, or “collider” biases by conditioning on common descendants of X and Y (Elwert 2013).) A minimal sufficient adjustment set is one with no smaller subset that is also an adjustment set. Graph-theoretic algorithms for determining exactly which direct causal effects and total causal effects in a DAG model can be uniquely identified from data and for computing minimal sufficient adjustment sets for estimating them are now readily available; the dagitty package (Textor et al. 2016) carries out these calculations, among others. (It also determines which path coefficients can be estimated in a path analysis model; what instrumental variables can be used to estimate a given effect via regression modeling; and what testable conditional independence relations among variables are implied by a given DAG model.)

Given a cause such as income and an effect such as heart disease risk in a DAG model such as Fig. 9.1, adjustment sets can be automatically computed for estimating both the “natural direct effect” of the cause on the effect (i.e., how does the effect change as the cause changes, holding all other variables fixed at their current values?) and the “total effect” (i.e., how does the effect change as the cause changes, allowing the effects on other variables to propagate via all directed paths from the cause to the effect?). In Fig. 9.1, {Age, Sex, Smoking} is an adjustment set for estimating the natural direct effect of income on heart disease risk, and {Age, Sex, Education} is an adjustment set for estimating the total effect of income on heart disease risk, part of which is mediated by the effect of income on smoking.
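
As a minimal sketch of these calculations, the R code below asks dagitty for minimal sufficient adjustment sets. Fig. 9.1 is not reproduced here, so the DAG is an assumed structure chosen to be consistent with the adjustment sets quoted above; the code illustrates the calls rather than the book's exact figure.

```r
library(dagitty)

# Assumed structure loosely mirroring Fig. 9.1 (an assumption, not the original figure).
g <- dagitty("dag {
  Age -> Income
  Age -> Smoking
  Age -> Heart_disease
  Sex -> Income
  Sex -> Smoking
  Sex -> Heart_disease
  Education -> Income
  Education -> Smoking
  Income -> Smoking
  Income -> Heart_disease
  Smoking -> Heart_disease
}")

# Minimal sufficient adjustment sets for the total and direct effects of Income
# on Heart_disease; for this assumed DAG they are {Age, Education, Sex} and
# {Age, Sex, Smoking}, matching the sets quoted in the text.
adjustmentSets(g, exposure = "Income", outcome = "Heart_disease", effect = "total")
adjustmentSets(g, exposure = "Income", outcome = "Heart_disease", effect = "direct")
```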

Given a causal DAG model, it is often useful to distinguish among several different measures of the “effect” of changes in one variable on changes in another. The pure or natural direct effect of one variable on another (e.g., of exposure on risk) is defined for a data set by holding all other variables fixed at the values they have in the data set, letting only the values of the two specified variables change; this is what a partial dependence plot shows. In Fig. 9.1, the natural direct effect of changes in income on changes in heart disease risk might be the effect of interest, in which case it can be estimated as in Fig. 9.3. Alternatively, the effect of interest could be the total effect of changes in income on changes in heart disease risk, calculated (via BN inference algorithms) by allowing changes in income to cause changes in the conditional probability distribution of smoking (as described by the CPT for smoking) to capture indirect, smoking-mediated effects of changes in income on changes in heart disease risk. In general, the effect of interest can be specified by selecting a target variable and one of its causes (parents or ancestors in the causal DAG model) and then specifying what should be assumed about the values of other variables in computing the desired effect (e.g., direct or total) of the latter on the former. This is relatively straightforward in linear causal models, where direct effects coincide with regression coefficients when appropriate adjustment sets are used (Elwert 2013). In general, however—and especially if the CPT for the response variable has strong nonlinear effects or interactions among its parents affecting its conditional probability distribution—it is necessary to specify the following additional details: (1) What are the initial and final values of the selected cause(s) of interest? The PDP in Fig. 9.3 shows the full range of incomes (codes 1–8) on the horizontal axis and corresponding conditional mean values for heart disease risk on the vertical axis. This allows the change in risk as income is changed from any initial level to any final level (leaving the values of other variables unchanged) to be ascertained. (2) When other variables are held fixed, at what levels are they held fixed? In quantifying the direct effect on heart disease risk of a change in income from one level to another, should the conditional probability of smoking be fixed at the level it had before the change, or at the level its CPT shows it will have after the change? The answer affects the precise interpretation of the estimated effect of the change in income on heart disease risk.

Such distinctions have been drawn in detail in epidemiology, yielding concepts of controlled direct effects (with other variables set to pre-defined levels), pure or natural direct effects (with other variables having their actual values in a data set and effects being averaged over members of the population), indirect and controlled mediated effects, total effects, and so forth (e.g., Tchetgen Tchetgen and Phiri 2014; VanderWeele 2011; VanderWeele and Vansteelandt 2009; McCandless and Somers 2017). The natural direct effect PDP in Fig. 9.3 shows the predicted conditional mean value of the selected dependent variable (here, heart disease risk) for each value of the selected explanatory variable (income), as estimated via a random forest algorithm holding other variable values fixed at their observed levels. Figure 9.3 was created (via the free Causal Analytics Toolkit (CAT) software, http://cloudcat.cox-associates.com:8899/) by applying the randomForest package in R to generate a partial dependence plot (PDP) (Greenwell 2017) for heart disease risk vs. income, conditioning on the adjustment set {Age, Sex, Smoking} (Cox Jr. 2018). DAG algorithms also allow other effects measures to be computed if they are identifiable from the data (VanderWeele 2011). In the presence of latent variables, they permit tight bounds and sensitivity analyses for various effects measures to be calculated (Tchetgen Tchetgen and Phiri 2014; McCandless and Somers 2017).
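
A partial dependence plot of the kind shown in Fig. 9.3 can be generated with a few lines of R. The sketch below assumes a hypothetical data frame df with a numeric 0/1 Heart_disease column and Income, Age, Sex, and Smoking predictors; it uses the randomForest and pdp packages (Greenwell 2017) rather than the CAT software itself.

```r
library(randomForest)
library(pdp)   # partial dependence plots (Greenwell 2017)

set.seed(1)
# Condition on the adjustment set {Age, Sex, Smoking} for the natural direct
# effect of Income on heart disease risk by including its members as predictors.
rf_fit <- randomForest(Heart_disease ~ Income + Age + Sex + Smoking,
                       data = df, ntree = 500)

# Partial dependence of predicted risk on Income, averaging over the observed
# values of the other predictors (cf. the PDP in Fig. 9.3).
pd <- partial(rf_fit, pred.var = "Income", train = df)
plotPartial(pd)
```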

Appendix 2: Information Theory and Causal Graph Learning

Two random variables are informative about each other, or have positive mutual information (as measured in units such as bits, nats, or Hartleys), if and only if they are not statistically independent of each other (Cover and Thomas 2006). If two variables are mutually informative, then observing the value of one helps to predict the value of the other, in that conditioning on the value of one reduces the expected conditional entropy (roughly speaking, the “uncertainty”) of the other. Two random variables can be mutually informative even if they have zero correlation, as in the KE = ½MV² example in the text; or they can be correlated and yet have zero mutual information, as in the case of the values of statistically independent random walks.
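
A minimal sketch of the zero-correlation, positive-mutual-information case: simulated velocities are symmetric about zero, so kinetic energy is uncorrelated with velocity yet perfectly predictable from it. The infotheo package used to estimate mutual information is an assumption; any estimator would serve.

```r
library(infotheo)

set.seed(1)
v  <- rnorm(10000)        # velocities, symmetric about zero
ke <- 0.5 * 1 * v^2       # kinetic energy, KE = (1/2)MV^2 with unit mass

cor(v, ke)                                       # approximately zero correlation
mutinformation(discretize(v), discretize(ke))    # clearly positive mutual information
```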

A key principle of many causal discovery algorithms is that direct causes are informative about (i.e., help to predict) their effects, even after conditioning on other information; thus, they have positive mutual information. Conversely, effects are not conditionally independent of their direct causes, even after conditioning on other information such as the values of more remote ancestors (except in trivial and statistically unlikely cases, such as that several variables are deterministically related, so that conditioning on one is equivalent to conditioning on all). Predictive analytics algorithms that select predictors for a specified dependent variable (such as random forest (https://cran.r-project.org/web/packages/randomForest/index.html)) automatically include its parents among the selected predictors, at least to the extent that the predictive analytics algorithm is successful in identifying useful predictors. If the Markov condition holds (Appendix 1), so that each variable is conditionally independent of its more remote ancestors after conditioning on its parents, then these algorithms will also exclude more remote ancestors that do not improve prediction once its parents’ values are known. In this way, predictive analytics algorithms help identify potential causal DAG structures from data by identifying conditional independence constraints (represented by absence of arrows between variables in a DAG) and information dependency constraints (represented by arrows in a DAG).

Statistical software packages such as bnlearn (Scutari and Ness 2018), CompareCausalNetworks (Heinze-Deml and Meinshausen 2018), and pcalg (Kalisch et al. 2012) carry out conditional independence tests, assess mutual information between random variables, and identify DAG structures that are consistent with the constraints imposed by observed conditional independence relations in available data. The “structure-learning” algorithms in these packages complement such constraint-based methods with score-based algorithms that search for DAG models to maximize a scoring function reflecting the likelihood of the observed data if a model is correct (Scutari and Ness 2018). Many scoring functions also penalize for model complexity, so that the search process seeks relatively simple DAG models that explain the data relatively well, essentially searching for the simple explanations (DAG models) that cover the observed facts (data). These packages provide a mature data science technology for identifying DAG models from data, but they have several limitations, discussed next. If they were perfect, then Hill’s question “Is X strongly associated with Y?” could be replaced with “Is X linked to Y in DAG models discovered from the data?” and the answer, which would be yes when X and Y are mutually informative and not otherwise, could be obtained by running these DAG-learning packages. In practice, current DAG-learning packages approximate this ideal in sufficiently large and diverse data sets (with all variables having adequate variability), but the approximation is imperfect, as discussed next.
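
The sketch below contrasts the two families of structure-learning algorithms mentioned above, using bnlearn on a hypothetical data frame df of discrete (factor) variables; the function names are standard bnlearn calls, not code from this chapter.

```r
library(bnlearn)

dag_pc <- pc.stable(df)           # constraint-based: conditional independence tests
dag_hc <- hc(df, score = "bic")   # score-based: greedy search maximizing a
                                  # complexity-penalized likelihood score

# Compare the two learned structures arc by arc (cpdag() puts the hill-climbing
# result in its Markov equivalence class for a fair comparison), then score it.
compare(target = cpdag(dag_hc), current = dag_pc)
score(dag_hc, data = df, type = "bic")
```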

Some Limitations of Graph-Learning Algorithms: Mutual Information Does Not Necessarily Imply Causality

DAG structures clarify that a strong statistical association—or, more generally, a strong statistical dependency, or high mutual information—can arise between two variables X and Z not only if they are linked by a causal chain, as in the direct causal relation X → Z or the indirect causal relation in the chain X → Y → Z, but also because of confounding by another variable that is their common ancestor, such as Y in X ← Y → Z; or from reverse causation, as in X ← Y ← Z; or from selection bias created by stratification, selection, or conditioning on a common descendant, such as Y in X → Y ← Z. To Hill’s idea that a strong association suggests causality, DAG modeling adds the important refinement that strong mutual information suggests causality if and only if it arises from one or more directed paths between the cause and effect variables in a causal DAG model. An association or other statistical dependence, no matter how strong, does not suggest causality if it arises from other graph-theoretic relations such as confounding, selection bias, or reverse causation. Information theory implies also that direct causes are at least as informative about their effects as more remote indirect causes. (In the chain X → Y → Z, the mutual information between Y and Z is at least as large as the mutual information between X and Z (Cover and Thomas 2006).) Only variables that are linked to a node by an arrow are candidates to be its direct causes, assuming that direct causes are informative about their effects and that the DAG model faithfully displays these mutual information relations.

Non-parametric graph-learning algorithms in free software such as the bnlearn, pcalg, and CompareCausalNetworks packages in R can play substantially the same role that Hill envisioned for strength of association as a guide to possible causation, but they do not depend on parametric modeling choices and assumptions; in this sense, their conclusions are more reliable, or less model-dependent, than measures of association. DAGs are also more informative than associational methods such as regression in revealing whether an observed positive exposure-response association is explained by confounding, selection bias, reverse causation, one or more directed paths between the variables, or a combination of these association-inducing conditions.

Despite their considerable accomplishments, DAG-learning and more general graph-learning algorithms have the following limitations, which practitioners must understand to correctly interpret their results.

  • First, graph-learning algorithms, like other statistical techniques, cannot detect effects (i.e., dependencies between variables) that are too small for the available sample size and the noise or sampling variability in the data. However, they can be used constructively to estimate upper bounds on the sizes of hypothesized effects that might exist without being detected. For example, suppose it were hypothesized that PM2.5 increases heart disease risk by an amount b*PM2.5, where b is a potency factor, but that the bnlearn algorithms used to construct Fig. 9.1 failed to detect this effect. Then applying these same algorithms to a sequence of simulated data sets in which heart disease probabilities are first predicted from data using standard predictive analytics algorithms (e.g., BN inference algorithms or random forest or logistic regression models) and then increased by b*PM2.5, for an increasing sequence of b values, will reveal the largest value of b that does not always result in an arrow between PM2.5 and heart disease. This serves as a plausible upper bound on the size of the hypothesized undetected effect; a simplified sketch of this bounding procedure follows this list.

  • Graph-learning algorithms are also vulnerable to false-positive findings if unmodeled latent variables create apparent statistical dependencies between variables. However, various algorithms included in the CompareCausalNetworks package allow detection and modeling of latent variable effects using graphs with undirected or bidirected arcs as well as arrows, based on observed statistical dependencies between estimated noise terms for the observed variables (Heinze-Deml et al. 2018).

  • Graph-learning algorithms, like other techniques, cannot resolve which (if any) among highly multicollinear variables is the true cause of an effect with which all are equally correlated (but see Jung and Park 2018 for a ridge regression approach to estimating path coefficients despite multicollinearity). More generally, graph-learning algorithms may find that several different causal graphs are equally consistent with the data.

  • If passive observations have insufficient variability in some variables, their effects on other variables may again be impossible to estimate uniquely from data. (For this reason, the experimental treatment assignment (ETA) condition, that each individual has positive probability of receiving any level of exposure regardless of the levels of other variables, is often assumed for convenience, although it is often violated in practice (Petersen et al. 2012).) If interventions are possible, however, then experiments can be designed to learn about dependencies among both observed and latent variables as efficiently as possible (Kocaoglu et al. 2017).
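
The following R sketch illustrates the bounding procedure described in the first bullet above. The baseline risk model, variable names, and parameter values are all hypothetical stand-ins (in practice the baseline probabilities would be fitted to the real data set), and each value of b should be replicated over many simulated data sets.

```r
library(bnlearn)

set.seed(1)
n       <- 5000
pm      <- runif(n, 5, 25)       # hypothetical PM2.5 concentrations
age     <- runif(n, 30, 80)
smoking <- rbinom(n, 1, 0.25)

detect_arrow <- function(b) {
  # Baseline risk (here a made-up logistic model) plus the hypothesized effect b*PM2.5.
  p  <- plogis(-6 + 0.05 * age + 1.0 * smoking + b * pm)
  hd <- rbinom(n, 1, p)
  df <- data.frame(PM2.5         = cut(pm, 4),     # discretize for a discrete BN
                   Age           = cut(age, 4),
                   Smoking       = factor(smoking),
                   Heart_disease = factor(hd))
  dag <- hc(df)                  # score-based structure learning
  "Heart_disease" %in% children(dag, "PM2.5") ||
    "PM2.5" %in% children(dag, "Heart_disease")
}

# Increase b until an arrow between PM2.5 and Heart_disease is (almost) always found;
# the largest b that is not reliably detected bounds a plausible undetected effect.
sapply(c(0, 0.01, 0.02, 0.05, 0.10), detect_arrow)
```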

Arguably, these limitations, due to limited statistical power, unobserved (latent) variables, and the need to perturb systems to ensure adequate variability and unique identifiability, apply to any approach to learning about causality and other statistical dependencies among variables from data, from formal algorithms to holistic expert judgments. Using algorithms and packages such as bnlearn, pcalg, and CompareCausalNetworks to determine whether there are detected statistical dependencies (i.e., positive mutual information) between variables that persist even after conditioning on other variables simply provides explicit, operational procedures for drawing reproducible conclusions from data despite such limitations.

Unobserved (latent) variables can play a key role in explaining dependencies among observed variables, but have only recently started to be included in widely used causal analytics software packages. A simple light bulb example illuminates the data analysis implications of omitting direct causes of a variable from a model. Suppose that a light bulb’s illumination as a function of light switch position is well described in a limited data set by a deterministic CPT with P(light on | switch up) = P(light off | switch down) = 1. In a larger data set that includes data from periods with occasional tripped circuit breakers, burned-out fuses, or power outages, this CPT might have to be replaced by a new one with P(light on | switch up) = 0.99, P(light off | switch down) = 1. Scrutiny of the data over time would show strong autocorrelations, with this aggregate CPT being resolved into a mixture of two different shorter-term CPTs: one with P(light on | switch up) = 1 (when current is available to flow, which is most of the time) and the other with P(light on | switch up) = 0 (otherwise), with the former regime holding 99% of the time. Such heterogeneity of CPTs over time or across study settings points to the possibility of omitted causes; these can be modeled using statistical techniques for unobserved causes of changes and heterogeneity in observed input-output behaviors, such as Hidden Markov Models (HMMs) and regime-switching models for time series data, or finite mixture distribution models for cross-sectional data. In population studies, the condition that the conditional expected value (or, more generally, the conditional distribution) of a variable for an individual unit is determined entirely by the values of its parents, so that any two units with the same values for their direct causes also have the same conditional probability distribution for their values, is referred to as unit homogeneity (Holland 1986; King et al. 1994; Waldner 2015). When it is violated, as revealed by tests for homogeneity, the causal graph model used to interpret and explain the data can often be improved by adding previously omitted variables (or latent variables, if they are not measured) or previously missing links (Oates et al. 2017).
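
As a minimal sketch of the light-bulb example, the R code below builds the three-node BN explicitly, with the usually unobserved Power node as a direct cause of Light; the node names and the 1% outage rate follow the text, but the code itself is illustrative rather than taken from the book.

```r
library(bnlearn)

net <- model2network("[Switch][Power][Light|Switch:Power]")

cpt_switch <- matrix(c(0.5, 0.5), ncol = 2, dimnames = list(NULL, c("down", "up")))
cpt_power  <- matrix(c(0.01, 0.99), ncol = 2, dimnames = list(NULL, c("out", "on")))
cpt_light  <- array(c(1, 0,    # Switch = down, Power = out -> Light off
                      1, 0,    # Switch = up,   Power = out -> Light off
                      1, 0,    # Switch = down, Power = on  -> Light off
                      0, 1),   # Switch = up,   Power = on  -> Light on
                    dim = c(2, 2, 2),
                    dimnames = list(Light  = c("off", "on"),
                                    Switch = c("down", "up"),
                                    Power  = c("out", "on")))

fit <- custom.fit(net, dist = list(Switch = cpt_switch,
                                   Power  = cpt_power,
                                   Light  = cpt_light))

# Marginalizing over the latent Power node recovers the "noisy" aggregate CPT:
# P(Light = on | Switch = up) is approximately 0.99 (Monte Carlo estimate).
cpquery(fit, event = (Light == "on"), evidence = (Switch == "up"), n = 10^6)
```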

From this perspective, inconsistency across study results in the form of unexplained heterogeneity in CPTs can reveal a need to expand the causal model to include more direct causes as inputs to the CPT to restore consistency with observations and homogeneity of CPTs. As discussed in the text, a more convincing demonstration of the explanatory power of a model and its consistency with data than mere agreement with previous findings (which is too often easily accomplished via p-hacking) is to find associations and effects estimates that differ across studies, and to show that these differences are successfully predicted and explained by applying invariant, homogeneous causal conditional probabilities to the relevant covariates (e.g., sex, age, income, health care, etc.) in each population. Modern transport formulas for applying causal CPTs learned in one setting to new settings allow such detailed prediction and explanation of empirically observed differences in effects in different populations (Heinze-Deml et al. 2017; Bareinboim and Pearl 2013; Lee and Honavar 2013). Results of multiple disparate studies can be combined, generalized, and applied to new settings using the invariance of causal CPTs, despite the variations of marginal and joint distributions of their inputs in different settings (Triantafillou and Tsamardinos 2015; Schwartz et al. 2011). Exciting as these possibilities appear to be for future approaches to learning, synthesizing, and refining causal models from multiple studies and diverse data sets, however, they are only now starting to enter the mainstream of causal analysis of data and to become available in well supported software packages (e.g., https://cran.r-project.org/web/packages/causaleffect/causaleffect.pdf). We view the full exploitation of consistency, invariance, and homogeneity principles in learning, validating, and improving causal graph models from multiple data sets as a very promising area of ongoing research, but not yet reduced to reliable, well-vetted, and widely available software packages that automate the process.

Appendix 3: Concepts of Causation

In public and occupational health risk analysis, the causal claim “Each extra unit of exposure to substance X increases rates of an adverse health effect (e.g., lung cancer, heart attack deaths, asthma attacks, etc.) among exposed people by R additional expected cases per person-year” has been interpreted in at least the following ways:

  1.

    Probabilistic causation (Suppes 1970): The conditional probability of the health response or effect occurring in a given interval of time is greater among individuals with more exposure than in similar-seeming individuals with less exposure. In this sense, probability of response (or age-specific hazard rate for occurrence of response) increases with exposure. On average, there are R extra cases per person-year per unit of exposure. The main intuition is that causes (exposures) make their effects (responses) more likely to occur within a given time interval, or increase their occurrence rates. CPTs can represent probabilistic causation while allowing arbitrary interactions among the direct causes of an effect. This probabilistic formulation is more flexible than deterministic ideas of necessary cause, sufficient cause, or but-for cause (see concept 9) (Rothman and Greenland 2005; Pearl and Mackenzie 2018). All of the other concepts that we consider are developed within this probabilistic causation framework. However, most of them add conditions to overcome two major limitations of probabilistic causation: it does not imply that changing the cause would change the effect, and it lacks direction (i.e., P(X | Y) > P(X) implies P(Y | X) > P(Y), since P(X | Y)P(Y) = P(Y | X)P(X) implies that P(Y | X) = P(X | Y)P(Y)/P(X), which exceeds P(Y) whenever P(X | Y) > P(X)).

  2.

    Associational causation (Hill 1965; IARC 2006): Higher levels of exposure have been observed in conjunction with higher risks, and this association is judged to be strong, consistent across multiple studies and locations, biologically plausible, and perhaps to meet other conditions such as those in the left column of Table 9.1. The slope of a regression line between these historical observations in the exposed population of interest is R extra cases per person-year per unit of exposure. The main intuition is that causes are associated with their effects.

  3.

    Attributive causation (Murray and Lopez 2013): Authorities attribute R extra cases per person-year per unit of exposure to X; equivalently, they blame exposure to X for R extra cases per person-year per unit of exposure. In practice, such attributions are usually made based on measures of association such as the ratio or difference of estimated risks between populations with higher and lower levels of exposure, just as for associational causation, together with subjective decisions or judgments about which risk factor(s) will be assigned blame for increased risks that are associated with one or more of them. Differences in risks between the populations are typically attributed to their differences in the selected exposures or risk factors. The main idea is that if people with higher levels of the selected variable(s), such as exposure, have higher risk for any reason, then the greater risk can be attributed to the difference in the selected variable(s). (If many risk factors differ between low-risk and high-risk groups, then the difference in risks can be attributed to each of them separately; there is no consistency constraint preventing multiples of the total difference in risks from being attributed to the various factors.) For example, if poverty is associated with higher stress, poorer nutrition, lower quality of health care, increased alcohol and drug abuse, greater prevalence of cigarette smoking, increased heart attack risk, residence in higher-crime neighborhoods, residence in neighborhoods with higher air pollution levels, higher unemployment, lower wages, fewer years of education, and more occupation in blue-collar jobs, then attributive causation could be applied to data on the prevalence of these ills in different populations to calculate a “population attributable fraction” (PAF) or “probability of causation” (PC) for the fraction of any one of them to be attributed to any of the others (Rothman and Greenland 2005). In the simplest case where all are treated as binary (0–1) variables and all are 1 for individuals in a low-income population and 0 for individuals in a high-income comparison group, the PAF and PC for heart attack risk attributed to residence in a higher air pollution neighborhood (or, symmetrically, to any of the other factors, such as blue-collar employment) are 100%. This attribution can be made even if changing the factors to which risk is attributed would not affect risk. Relative risk (RR) ratios—the ratios of responses per person per year in exposed compared to unexposed populations—and quantities derived from RR, such as burden-of-disease metrics, population attributable fractions, probability of causation formulas, and closely related metrics, are widely used in epidemiology and public health to quantify both associational and attributive causation.

  4.

    Counterfactual and potential outcomes causation (Höfler 2005; Glass et al. 2013; Lok 2017; Li et al. 2017; Galles and Pearl 1998): In a hypothetical world with 1 unit less of exposure to X, expected cases per person-year in the exposed population would also be less by R. Such counterfactual numbers are usually derived from modeling assumptions, and how or why the counterfactual reduction in exposure might occur is not usually explained in detail, even though such details might affect resulting changes in risk. The main intuition is that differences in causes make their effects different from what they otherwise would have been. To use this concept, what otherwise would have been (since it cannot be observed) must be guessed at or assumed, e.g., based on what happens in an unexposed comparison group believed to be relevantly similar to, or exchangeable with, the exposed population. Galles and Pearl (1998) discuss reformulation of counterfactual and potential outcomes models as causal graph models based on observable variables. The counterfactual framework is especially useful for addressing questions about causality that do not involve variables that can be manipulated, such as “How would my salary for this job differ if I had been born with a different race or sex or if I were 10 years older?” or “How likely is it that this hurricane would have occurred had oil and gas not been used as energy sources in the twentieth century?” (Pearl and Mackenzie 2018).

  5.

    Predictive causation (Wiener 1956; Granger 1969; Kleinberg and Hripcsak 2011; Papana et al. 2017): In the absence of interventions, time series data show that the observation that exposure has increased or decreased is predictably followed, perhaps after some delay, by the observation that average cases per person-year have also increased or decreased, respectively, by an average of R cases per unit of change in exposure. The main intuition is that causes help to predict their effects, and changes in causes help to predict changes in their effects. (The meaning of “help to predict” is explained later in the section on mutual information, but an informal summary is that effects can be predicted with less average uncertainty or error when past values of causes are included as predictors than when they are not.) Predictive causation still deals with observed associations, but now the associations involve changes over time.

  6.

    Structural causation (Simon 1953; Simon and Iwasaki 1988; Hoover 2012): The average number of cases per person-year is derived at least in part from the value of exposure. Thus, the value of exposure must be determined before the value of expected cases per person-year can be determined; in econometric modeling jargon, exposure is exogenous to expected cases per person-year. Quantitatively, the derived value of cases per person-year decreases by R for each exogenously specified unit decrease in exposure. The main intuition is that effects are derived from the values of their causes. Druzdzel and Simon (1993) discuss the relations between Simon-Iwasaki causal ordering and manipulative and mechanistic causation.

  7.

    Manipulative causation (Voortman et al. 2010; Spirtes 2010; Hoover 2012; Simon and Iwasaki 1988): Reducing exposure by one unit reduces expected cases per person-year by R. The main intuition is that changing causes changes their effects. How this change is brought about or produced need not be explained as part of manipulative causation, although it is crucial for scientific understanding, as discussed next. Manipulative causation is highly useful for decision analysis, which usually assumes that decision variables have values that can be set (i.e., manipulated) by the decision-maker. Bayesian networks with decision nodes and value nodes, called influence diagrams, have been extensively developed to support calculated manipulation of decision variables to maximize the expected utility of consequences (Howard and Matheson 2005, 2006; Howard and Abbas 2016).

  8.

    Explanatory/mechanistic causation (Menzies 2012; Simon and Iwasaki 1988): Increasing exposure by one unit causes changes to propagate through a biological network of causal mechanisms. Propagation through networks of mechanisms is discussed further in the section on coherent causal explanations and biological plausibility. The main idea is simply that, following an exogenous change in an input variable such as exposure, each variable’s conditional probability distribution is updated to reflect (i.e., condition on) the values of its parents, as specified by its CPT, and the value drawn from its conditional probability distribution is then used to update the conditional probability distributions of its children. When all changes have finished propagating, the new expected value for expected cases per person-year in the exposed population will be R more than before exposure was increased. The main intuition is that changes in causes propagate through a network of law-like causal mechanisms to produce changes in their effects. Causal mechanisms are usually represented mathematically by conditional probability tables (CPTs) or other conditional probability models (such as structural equation models, regression models, or non-parametric alternatives such as classification trees). These specify the conditional probability or probability density for a variable, given the values of its parents in a causal graph or network, and they are invariant across settings (Pearl 2014).

  9.

    But-for causation: If the cause had not occurred (i.e., “but for” the cause), the effect would not have occurred. More generally, if the value of a causal variable had been different, the probability distributions for its effects would have been different. This concept of causation has long been a staple of tort litigation, where plaintiffs typically argue that, but for the defendant’s (possibly tortious) action, harm to the plaintiff would not have occurred.

It is important to understand that these different concepts are not variations, extensions, or applications of a single underlying coherent concept of causation. A headline such as “Pollution kills 9 million people each year” (e.g., Washington Post 2017) cannot be interpreted as a statement of mechanistic or manipulative causation, since the number of people dying in any year is identically equal to the number born one lifetime ago, and pollution presumably does not retroactively increase that number. (It might affect life lengths and redistribute deaths among years, but it cannot increase the total number of deaths in each year.) But it is easily accommodated in the framework of attributive causality, where it simply refers to the number of “premature deaths” per year that some authorities (authors of a Lancet article, in this case, www.thelancet.com/commissions/pollution-and-health) attribute to air pollution using a “burden of disease” formula that is unconstrained by physical conservation laws. These are not variations on a single concept, but distinct concepts: the kind of causation implied by the headline (attributive) has no implications for the number of deaths that could be prevented by reducing pollution (manipulative).

Galles and Pearl (1998) discuss essential distinctions between structural and counterfactual/potential outcomes models of causation and special conditions under which they are equivalent. The non-equivalence of predictive and manipulative causation can be illustrated through standard examples such as nicotine-stained fingers being a predictive but not a manipulative cause of lung cancer (unless the only way to keep fingers unstained is not to smoke and the only way to stain them is to smoke; then nicotine-stained fingers would be a manipulative but not a mechanistic cause of lung cancer). In short, the various concepts of causation are distinct, although the distinctions among them are seldom drawn in headlines and studies that announce causal “links” between exposures and adverse health responses.

Appendix 4: Software for Dynamic Bayesian Networks and Directed Information

For practitioners, software packages with algorithms for learning dynamic Bayesian networks (DBNs) from time series data, automatically identifying their DAG structures (including dependencies among variables in different time slices, corresponding to lagged values of variables) and estimating their CPTs, are now readily available. The free R package bnstruct (Sambo and Franzin 2016) provides nonparametric algorithms for learning DBNs from multiple time series even if some of the data values are missing. (Detailed implementations must also address instantaneous causation, i.e., causal relations between variables in the same time slice, but this does not require any new principles, as it is just the usual case of a non-dynamic BN.) The pgmpy package for probabilistic graphical models in Python has a module for DBNs, and commercial packages such as GeNIe (free for academic research), Netica, BayesiaLab, and Bayes Server all support DBN learning and inference. Links to these and other BN and DBN software packages can be found at www.kdnuggets.com/software/bayesian.html.

Alternatively, within the context of traditional parametric time series modeling, Granger causality testing (performed using the grangertest function in R) provides quantitative statistical tests of the null hypothesis that the future of the effect is conditionally independent of the past of the cause, given its own past. Non-parametric generalizations developed and applied in physics, neuroscience, ecology, finance, and other fields include several closely related measures of transfer entropy and directed information flow between time series (Schreiber 2000; Weber et al. 2017). An elegant alternative to DBNs is directed information graphs (DIGs) (Quinn et al. 2015). Directed information generalizes mutual information by distinguishing between earlier and later values of variables (Costanzo and Dunstan 2014). It allows for realistic complexities such as feedback loops and provides a non-parametric generalization of earlier parametric statistical tests for predictive causality between pairs of time series (Wiener 1956; Granger 1969). Directed information graphs generalize these ideas from pairs to entire networks of time series variables (Amblard and Michel 2011). In such a graph, each node represents a time series variable (stochastic process). An arrow from one node to another indicates that directed information flows from the former to the latter over time. DIG-learning algorithms (Quinn et al. 2015) use nonparametric measures of directed information flows between variables over time to identify causal graph structures, assuming that information flows from causes to their effects. However, we are unaware of currently available mainstream statistical packages that automate estimation of DIGs from data (but see https://pypi.org/project/causalinfo/ for a start). Thus, we recommend using DBNs and packages such as bnstruct to identify the directions of arrows representing the temporal flows of information between variables, i.e., from earlier values of some variables to later values of others.
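
A minimal sketch of pairwise Granger testing with the grangertest function (from the lmtest package), using simulated series in which a hypothetical exposure drives later case counts; the variable names and data-generating process are invented for illustration.

```r
library(lmtest)

set.seed(1)
n        <- 240
exposure <- as.numeric(arima.sim(list(ar = 0.7), n = n))   # autocorrelated exposure series
cases    <- c(0, 0.3 * exposure[-n]) + rnorm(n)            # cases respond to lagged exposure

# H0: past exposure does not help predict cases beyond cases' own past (and vice versa).
grangertest(cases ~ exposure, order = 2)    # expect a small p-value (predictive causality)
grangertest(exposure ~ cases, order = 2)    # expect a large p-value (no reverse prediction)
```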

Appendix 5: Non-temporal Methods for Inferring Direction of Information Flow

This appendix describes several principles and methods used to infer directions of information flow (and hence possible causality) between variables when longitudinal data are not available. The last part of the appendix discusses quasi-experimental (QE) study designs and analysis methods.

Homoscedastic Errors and LiNGAM

If causes and effects are related by some unknown regression model of the form

$$ effect = f(cause) + error $$

where f is a (possibly unknown) regression function and error is a random error term, possibly with an unknown distribution, then it may be possible to discern which variables are causes of specified effects by studying the distribution of the error term. In the special case where f is a linear function and error has a Gaussian (normal) distribution with zero mean, the linear regression model

$$ effect = b_0 + b_1 \cdot cause + error $$

can be rearranged as

$$ cause = -(b_0/b_1) + (1/b_1) \cdot effect - (1/b_1) \cdot error. $$

Because the normal distribution is symmetric around the origin and multiplying a normally distributed random variable by a constant gives another normally distributed random variable, this is again a linear function with a normally distributed error term. Thus, the two properties of (a) having a linear regression function relating one variable to another and (b) having a normally distributed error do not reveal which variable is the cause and which is the effect, since these properties are preserved when either variable is solved for in terms of the other. But if the regression function is nonlinear, or if the error term is not normally distributed (i.e., non-Gaussian), then the following very simple test for the direction of information flow emerges: if y = f(x) + error, where error is a zero-mean random variable (not necessarily Gaussian), then a scatter plot of y values against x values has the same distribution of y values around their mean for all values of x (since this vertical scatter is produced by an error term whose distribution is independent of x). Typically, except in the case of linear regression, the scatter plot of x values vs. y values will then look quite different: x = f⁻¹(y - error) will have a scatter around its mean values that depends on y. Visually, plotting effect vs. cause yields a scatter plot that is homoscedastic (has the same variance everywhere), while plotting cause vs. effect typically does not. This asymmetry reveals which variable is the cause and which is the effect under the assumption that effect = f(cause) + error. It is used in recent algorithms (e.g., the LiNGAM (linear non-Gaussian acyclic model) (Shimizu et al. 2006; Tashiro et al. 2014) and CAM (causal additive model) algorithms for causal discovery (Bühlmann et al. 2014), both of which are included in the CompareCausalNetworks package (Heinze-Deml et al. 2018)). Applications in epidemiology have shown encouraging performance (Rosenström et al. 2012).
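
A minimal sketch of this error-based orientation using the lingam function in the pcalg package (the interface assumed here; LiNGAM-type estimation is also available via CompareCausalNetworks): simulated data in which cause drives effect with non-Gaussian noise, so the estimated coefficient matrix orients the edge correctly.

```r
library(pcalg)

set.seed(1)
n      <- 2000
cause  <- runif(n, -1, 1)                  # non-Gaussian exogenous variable
effect <- 0.8 * cause + runif(n, -1, 1)    # linear effect with non-Gaussian error

fit <- lingam(cbind(cause, effect))

# Bpruned[i, j] != 0 is read as "variable j is estimated to be a direct cause of
# variable i"; here the nonzero entry should place cause (column 1) -> effect (row 2).
fit$Bpruned
```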

Knowledge-Based and Exogeneity Constraints

A third way to orient the arrows in a causal graph is to use knowledge-based constraints, such as that cold weather might be a cause of increased elderly mortality rates, but increased elderly mortality rates are not a potential cause of cold weather. Similarly, sex and age are typically potential causes but not potential effects of other variables. Death is typically a potential effect but not a potential cause of covariates. Such relatively uncontroversial assumptions or constraints can be very useful for directing the arrows in causal graphs, for both longitudinal and cross-sectional data. They can be incorporated into causal discovery algorithms used in bnlearn and other packages using “white lists” and “black lists” (and, to save time, via “source” and “sink” designations in CAT) to specify required, allowed, and forbidden arrow directions. More generally, exogeneity constraints specify that some variables have values derived from (i.e., endogenously determined by) the values of others; the values of variables that are not derived from others are determined from outside (i.e., are exogenous to) the system being examined (Galles and Pearl 1998). Such exogeneity constraints have long been used in econometrics, social sciences, electrical engineering, and artificial intelligence to help determine possible causal orderings of variables in systems of equations based on directions of information flow from exogenous to endogenously determined variables (Simon 1953; Simon and Iwasaki 1988). These orderings have proved useful in modeling manipulative causation as well as structural and predictive causation (Druzdzel and Simon 1993; Voortman et al. 2010).
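
A minimal sketch of knowledge-based constraints in bnlearn, assuming a hypothetical data frame df with columns Age, Sex, Exposure, and Death; the white list and black list below encode the kinds of assumptions described above (nothing causes age or sex, and death is not a cause of the covariates).

```r
library(bnlearn)

vars <- names(df)   # assumed: c("Age", "Sex", "Exposure", "Death")

# Black list: forbid arrows into Age and Sex, and arrows out of Death.
bl <- unique(rbind(
  data.frame(from = setdiff(vars, "Age"), to = "Age"),
  data.frame(from = setdiff(vars, "Sex"), to = "Sex"),
  data.frame(from = "Death", to = setdiff(vars, "Death"))
))

# White list: require an arrow that prior knowledge says must be present
# (illustrative; e.g., mortality risk is known to depend on age).
wl <- data.frame(from = "Age", to = "Death")

dag <- hc(df, whitelist = wl, blacklist = bl)
```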

Quasi-Experiments (QEs) and Assumption-Based Constraints on Arrow Directions

A very useful type of exogeneity for inferring causality from observations arises from actions or interventions that deliberately set the values of some controllable variables. These exogenously controlled variables are often called decision variables in decision analysis, factors or design factors in design of experiments, preventable risk factors in epidemiology, or controllable inputs in systems engineering. As a paradigmatic example, suppose that flipping a switch at random times is immediately followed by a light turning on whenever the switch is flipped up and off whenever it is flipped down. After some experimentation, an investigator might conclude that there is substantial empirical evidence that flipping the switch up or down causes the light to be turned on or off, respectively, even without further understanding of how the change in switch position propagates through a causal network to change the flow of current through the light bulb. Hill’s “experiment” consideration allows for such empirical evidence of causality when manipulation of risk factors is possible (Fedak et al. 2015). Several caveats on such inferences of causality from data are noted later; the challenging philosophical problem of justifying inductive inferences of causation from observations is by no means completely solved even by random experimentation unless it can be assumed that the (perhaps unknown) mechanisms tying interventions to consequence probabilities remain unchanged over time or successive trials, and will continue into the future.

When deliberate manipulation is impossible, or is sharply restricted (e.g., to avoid exposing individuals to potentially harmful conditions), as is often the case in practice, observational data can sometimes be used to achieve many of the benefits of experimentation if they reveal the responses of dependent variables to exogenous changes (“interventions”) or differences in explanatory variables. This is the basic principle of quasi-experiments (QEs) (Campbell and Stanley 1963). These differ from true designed experiments in that they lack random assignment of individuals to treatment or control groups. To interpret their results causally, therefore, it is typically necessary to make strong assumptions, such as that the populations with different levels of the explanatory variables are otherwise exchangeable, with no latent confounders (Hernán and Robins 2006). A variety of methods have been devised for drawing causal inferences about the directions and magnitudes of effects from QE data with the help of such assumptions. For single, large interventions at known points in time, a long tradition in the social sciences and epidemiology of intervention analysis, also called interrupted time series analysis (ITSA), compares time series of response variables before and after the intervention to determine whether they have changed significantly (Box and Tiao 1975). If so, and if repeated observations in multiple settings make it unlikely that the changes occur by coincidence, then they can be used to estimate the effects of the observed interventions (or possibly of other unobserved changes that caused them). More recently, “difference in differences” (DID) analyses have applied similar pre-post comparisons to estimate effects of interventions with the help of simplifying assumptions such as that effects are additive and trends are linear both before and after the intervention (Bertrand et al. 2004).

Again, the “interventions” studied by QE methods need not be deliberate or man-made: what matters is that they cause observable changes in responses that cannot plausibly be explained by other factors. Similarly, in natural experiments, exogenous changes in exposures or other conditions arise from nature or other sources that the investigator does not control (DiNardo 2008). Comparing the distributions of observed responses in exposed and unexposed populations, or more generally in populations with different levels of an exogenous variable (or the same population before and after the natural change in explanatory variables), reveals the effects of these differences on responses—at least if the populations of individuals being compared are otherwise exchangeable (or can be made so in the analysis by stratification and matching on observed variables). The famous analysis by John Snow of data from the 1854 Broad Street cholera outbreak is an outstanding practical example of a natural experiment used to identify a cause.

The BACKSHIFT algorithm in the CompareCausalNetworks package (Rothenhausler et al. 2015) exploits the fact that many variables in the real world are constantly bombarded by small random shocks and perturbations. Even in the absence of any deliberate interventions, these disturbances act as random interventions of possibly unknown sizes and locations in the causal network (sometimes called “fat hand interventions”) that cause variables to fluctuate around their equilibrium values over time. The BACKSHIFT algorithm uses correlations in these observed fluctuations to learn linear causal network models (i.e., networks in which changes in causes transmit proportional changes in their direct effects), even if cycles and latent variables are present.

In econometrics and epidemiology, a QE technique called regression discontinuity design (RDD) has found growing popularity. RDD studies compare response distributions in populations on either side of a threshold that triggers an exogenous change in exposures or other conditions. Such thresholds include legal ages for smoking, drinking, or driving, and geopolitical boundaries for implementation of different policies or programs (Thistlethwaite and Campbell 1960; Imbens and Lemieux 2008). The thresholds have the effect of creating interventions in the explanatory variables even without deliberate manipulation. Another QE technique, instrumental variables (IV) regression, is also now widely used in econometrics, social statistics, and epidemiology to estimate causal effects despite potential biases from reverse causation, omitted variables (e.g., unobserved confounders), and measurement errors in explanatory variables. IV studies collect observations on “instruments,” exogenously changing variables that are assumed to affect the explanatory variables in a model but not to directly affect the dependent variable. Changes in the instrument change the explanatory variable(s) of interest, thereby mimicking the effects of interventions that set their values. This generates observations from which the effects of the changes in explanatory variables on the dependent variable(s) of interest can be estimated, provided that the instrument is valid, i.e., that changing it causes changes in the explanatory variable(s) but not in the dependent variable(s) of interest except via its effects on the explanatory variables (Imbens and Angrist 1994).
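
A minimal sketch of IV estimation by two-stage least squares with the ivreg function in the AER package, using simulated data with a hypothetical instrument z, a confounded exposure x, and an outcome y; it is not an analysis of any data set discussed in this chapter.

```r
library(AER)

set.seed(1)
n <- 5000
u <- rnorm(n)                  # unobserved confounder of exposure and outcome
z <- rnorm(n)                  # instrument: affects x, affects y only through x
x <- 0.8 * z + u + rnorm(n)    # exposure
y <- 0.5 * x + u + rnorm(n)    # outcome; true causal effect of x on y is 0.5

coef(lm(y ~ x))         # ordinary regression: biased upward by the confounder u
coef(ivreg(y ~ x | z))  # IV/2SLS: approximately recovers the causal effect 0.5
```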

In general, QE techniques rely on an assumption that observed changes in exogenous causes—whether random, deliberate, designed, or coincidental—propagate to changes in the distributions of the observed endogenous dependent variables. Whether a light switch is deliberately flipped on and off in a designed experiment, or is flipped at random by passersby or by acts of nature or by chance, does not greatly affect the task of causal inference, provided that the relation between observed changes in the switch’s position and ensuing observed changes in the on-off status of the light bulb are not coincidental, are not caused by latent variables that affect both, and are not explained by other “threats to internal validity” for quasi-experiments (Campbell and Stanley 1963). If the effects of different settings of an explanatory variable on the mean level of a response variable in a population are estimated by the differences in its observed conditional mean values for different observed settings of the explanatory variable, then it is necessary to assume that the populations experiencing these different settings are comparable, or exchangeable—a crucial assumption, although often difficult or impossible to test, in interpreting QE results. Such conditions are necessary to justify any of these techniques for inferring that causal information flows from changes in explanatory variables to changes in dependent variables.

Copyright information

© 2021 Springer Nature Switzerland AG

Cite this chapter

Cox Jr., L.A. (2021). Methods of Causal Analysis for Health Risk Assessment with Observational Data. In: Quantitative Risk Analysis of Air Pollution Health Effects. International Series in Operations Research & Management Science, vol 299. Springer, Cham. https://doi.org/10.1007/978-3-030-57358-4_9
