We now explain the principles underlying statistical inference about causal effects using an IV. We will not focus on the details here as these depend on the specific setting, e.g. continuous or binary outcome, and are discussed elsewhere. We begin by asking whether there is a causal effect of X on Y at all. We then consider whether lower and upper bounds can be derived for this causal effect. Finally, we address the issue of getting a point estimate of the causal effect. Answering these questions in turn requires increasingly restrictive assumptions.
Testing for a causal effect by testing for a G–Y association
When our interest is purely in confirming whether there is an average causal effect of X on Y in the first place, data on a valid IV G (satisfying the core conditions) and the outcome Y, together with the structural assumption (1) and faithfulness, are sufficient, i.e. we do not require data on X. More specifically, as reasoned below, we just need to test for a (marginal) association between G and Y. This has some analogies with the intention-to-treat (ITT) analysis under partial compliance discussed earlier, although the two should not be confused.
We will define the ‘causal null hypothesis’ of interest to mean the absence of an \(X \rightarrow Y\) edge as depicted by the DAG in Fig. 4. We note that this is ignoring the possibility of causal effects in subgroups defined by U which cancel each other out. Under the core conditions, any marginal association between the IV, G, and the outcome, Y, can only occur if X has a causal effect on Y because there is no other pathway between G and Y that creates an association (Fig. 1a). As shown in Fig. 4, when there is no causal effect of X on Y, G and Y are marginally independent, i.e. there is no unblocked path between them since the path \(G \rightarrow X \leftarrow U \rightarrow Y\) is blocked in the absence of conditioning on the collider X. The reverse reasoning is trickier. Lack of evidence for a G–Y association can mean several things: there is no causal effect, or the power is too low, or the IV is too weak. In rare cases, it could also happen that, due to interactions with unobserved factors, positive and negative subgroup effects ‘cancel’ each other out so that no overall effect is detectable. These issues are discussed in more technical detail in Didelez and Sheehan (2007b).
Any suitable statistical test for association can be used in this context, e.g. a Chi-squared test if G has three levels and Y is binary. Typically, a regression of Y on G (which can also accommodate possible observed covariates) is used and a statistically significant association is evidence for the presence of a causal effect of X on Y. Importantly for MR applications and in contrast with the ITT estimate in a partial compliance type setting, we note that the estimate of the G–Y association is generally not interpretable in terms of a causal effect. Nor does it permit inference about the magnitude or sign of a causal effect (Didelez and Sheehan 2007b; Burgess and Small 2016; Swanson et al. 2018). It is purely a test for a causal effect and further assumptions must be made to obtain a point estimate of such an effect should it seem likely to be present. Hence, it is good practice to start an IV analysis by establishing what conclusions can be drawn from the G–Y association alone without additional assumptions. All methods that yield an estimate, standard error and confidence interval for the causal effect of X on Y make further parametric (or semi-parametric) assumptions over and above the core conditions. For instance, it may happen that the G–Y association is not statistically significant, but subsequent estimation of the causal effect of X on Y yields a significant result. We should then bear in mind that the apparent extra information ‘gained’ is mainly due to the additional modelling assumptions that were made. As these implicitly or explicitly involve the unobservable factors subsumed in U, they are empirically untestable. Furthermore, it is common to assume that X is measured without error. All these additional assumptions are themselves new sources of bias if violated, and the resulting estimates and standard errors should be regarded as less reliable than the test based on the G–Y association alone.
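As an illustration of the Chi-squared test mentioned above, the following minimal sketch computes Pearson's statistic for a 3 × 2 table of counts (three genotype levels by binary outcome). The counts and the function name are hypothetical and chosen purely for illustration. The p-value uses the fact that, for 2 degrees of freedom, the Chi-squared survival function is exp(−x/2).

```python
import math

def chi2_test_3x2(table):
    """Pearson chi-squared test of independence for a 3x2 table of counts.

    Rows are the three levels of G, columns the two levels of Y.
    Returns (statistic, p_value) with 2 degrees of freedom.
    """
    assert len(table) == 3 and all(len(row) == 2 for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            stat += (observed - expected) ** 2 / expected
    # df = (3 - 1) * (2 - 1) = 2, for which P(chi2 > x) = exp(-x / 2)
    p_value = math.exp(-stat / 2.0)
    return stat, p_value

# Hypothetical genotype-by-outcome counts (not taken from any study)
counts = [[400, 100], [300, 100], [70, 30]]
stat, p_value = chi2_test_3x2(counts)
print(f"chi2 = {stat:.3f}, p = {p_value:.4f}")
```

A small p-value here would be read only as evidence for the presence of a causal effect, given a valid IV; the size of the statistic says nothing about the magnitude or sign of that effect.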
Finally, we point out that the test for a G–Y association to assess the presence of a causal effect is also valid in case–control studies without any further adjustment or additional assumptions other than a valid IV (Didelez and Sheehan 2007b). So, even if sampling is conditional on the outcome Y, we still expect a G–Y association only if X has a causal effect on Y. This is important because IV estimation in a case–control study is not straightforward, whereas a test for the G–Y association is very simple (Didelez et al. 2010; Bowden and Vansteelandt 2011). For example, in an investigation into the possible causal effect of homocysteine level on stroke risk, the odds ratio for the genotype–stroke (G–Y) association, using a dichotomisation of the MTHFR C677T polymorphism into TT and CC carriers as a genetic IV, was found to be significant at 1.26 with 95% CI (1.14, 1.40) (Casas et al. 2005). If the MTHFR gene is a valid instrument for the effect of homocysteine on stroke, this result provides evidence for the presence of such a causal effect.
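To see roughly how strong the evidence in this example is, a back-of-envelope calculation can recover an approximate standard error and z-statistic from the reported odds ratio and confidence interval. This sketch assumes the interval is a Wald interval that is symmetric on the log scale, which is only approximately true for the reported figures; the function name is ours.

```python
import math

def z_from_odds_ratio_ci(or_hat, lower, upper, z_crit=1.96):
    """Approximate SE of log(OR), z-statistic and two-sided p-value
    from an odds ratio and its 95% Wald confidence interval."""
    se_log_or = (math.log(upper) - math.log(lower)) / (2.0 * z_crit)
    z = math.log(or_hat) / se_log_or
    # Two-sided normal p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return se_log_or, z, p_value

# Figures as reported in the text for the genotype-stroke association
se, z, p = z_from_odds_ratio_ci(1.26, 1.14, 1.40)
print(f"se(log OR) = {se:.4f}, z = {z:.2f}, p = {p:.2g}")
```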
Bounds on causal effects
In some cases, it is possible to obtain some quantitative information about the size of the causal effect in the form of lower and upper bounds using only the core IV conditions without further parametric assumptions. This is possible when G, X, and Y are discrete with few levels and data on all three variables are available from a single sample. In an MR study, for example, we might have a genetic IV with three levels, a binary exposure and a binary outcome. It is important to note that the bounds are not confidence intervals for the causal effect. The interpretation of the bounds is that the data are compatible with values of a causal effect anywhere between the lower and upper bound. We do not go into technical details here as these are provided elsewhere (Manski 1990; Balke and Pearl 1994; Palmer et al. 2011a).
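To give a flavour of how such bounds arise, the sketch below computes the simpler 'natural' bounds on the ACE for a binary exposure and binary outcome with a discrete IV. This is a swapped-in simplification: these bounds are valid under the core conditions but may be wider than the tight Balke and Pearl bounds, which require a linear programming argument. The input layout and function name are our own assumptions. The idea is that, for each level g of the IV, P(Y=1 under do(X=x)) must lie between P(X=x, Y=1 | G=g) and 1 − P(X=x, Y=0 | G=g), and these intervals are intersected across levels of G.

```python
def natural_ace_bounds(p):
    """Natural bounds on ACE = P(Y=1 | do(X=1)) - P(Y=1 | do(X=0)).

    p is a list with one dict per level g of the IV, where
    p[g][(x, y)] = P(X=x, Y=y | G=g) for binary x and y.
    Raises ValueError if the intervals are empty, which means the data
    are incompatible with the IV conditions.
    """
    lo1 = max(pg[(1, 1)] for pg in p)        # lower bound on P(Y=1 | do(X=1))
    up1 = min(1.0 - pg[(1, 0)] for pg in p)  # upper bound on P(Y=1 | do(X=1))
    lo0 = max(pg[(0, 1)] for pg in p)        # lower bound on P(Y=1 | do(X=0))
    up0 = min(1.0 - pg[(0, 0)] for pg in p)  # upper bound on P(Y=1 | do(X=0))
    if lo1 > up1 + 1e-12 or lo0 > up0 + 1e-12:
        raise ValueError("observed distribution violates the IV restrictions")
    return lo1 - up0, up1 - lo0

# Hypothetical example: G is unrelated to X, so the bounds have width 1
flat = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.2, (1, 1): 0.3}
lower, upper = natural_ace_bounds([flat, flat, flat])
print(lower, upper)
```

With an IV that does not predict the exposure at all, as in this made-up example, the bounds are as wide as they would be without any instrument, which mirrors the dependence of the width on instrument strength.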
Returning to the example above (“Testing for a causal effect by testing for a G–Y association”), we consider bounding the causal effect of dichotomised homocysteine level (low/high) on presence or absence of cardiovascular disease (CVD) using the MTHFR genotype (now with all three levels) as an IV (Palmer et al. 2011a). Since the data come from a case–control study, the analysis is performed by converting back to the required population frequencies under plausible assumptions about the prevalence of CVD (Didelez and Sheehan 2007b). With a prevalence of 6.5%, the ACE (causal risk difference) lies between \(-\,0.0895\) and 0.7344 while assuming a prevalence of 2%, the bounds are slightly wider and the ACE lies between \(-\,0.065\) and 0.7644. Alternatively, the bounds can be given for the CRR (causal relative risk) and are 0.1272 and 41.5740, respectively, in the latter case. While we previously reported the IV–outcome association for this example as supporting the presence of a causal effect of homocysteine on stroke risk, the bounds computed here are all too wide to confirm this as they all include the null hypothesis of ‘no effect’. This may be partly due to the fact that the test in Casas et al. (2005) was based on a meta-analysis while the bounds above were calculated using only a subset of the data which was less informative.
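The conversion from case–control frequencies to population frequencies can be sketched as follows, under the assumptions that cases and controls are representative of diseased and non-diseased individuals respectively and that the prevalence is known. The function name and the dictionary layout are illustrative choices, and the numbers are made up. The joint distribution is rebuilt as P(G, X, Y=1) = P(G, X | Y=1) × prevalence and P(G, X, Y=0) = P(G, X | Y=0) × (1 − prevalence), and then conditioned on G.

```python
def population_conditionals(p_cases, p_controls, prevalence):
    """Rebuild P(X=x, Y=y | G=g) from case-control frequencies.

    p_cases[(g, x)]    = P(G=g, X=x | Y=1)
    p_controls[(g, x)] = P(G=g, X=x | Y=0)
    prevalence         = assumed P(Y=1) in the population
    Returns a dict cond[g][(x, y)] for binary x and y.
    """
    joint = {}
    for (g, x), pr in p_cases.items():
        joint[(g, x, 1)] = pr * prevalence
    for (g, x), pr in p_controls.items():
        joint[(g, x, 0)] = pr * (1.0 - prevalence)
    cond = {}
    for g in sorted({key[0] for key in joint}):
        p_g = sum(v for key, v in joint.items() if key[0] == g)
        cond[g] = {(x, y): joint.get((g, x, y), 0.0) / p_g
                   for x in (0, 1) for y in (0, 1)}
    return cond

# Hypothetical frequencies for a binary IV, evaluated at a 6.5% prevalence
cases = {(0, 0): 0.30, (0, 1): 0.25, (1, 0): 0.15, (1, 1): 0.30}
controls = {(0, 0): 0.40, (0, 1): 0.20, (1, 0): 0.25, (1, 1): 0.15}
cond = population_conditionals(cases, controls, 0.065)
print(cond[0], cond[1])
```

Repeating the calculation over a range of plausible prevalences shows how sensitive any subsequent bounds are to that assumption.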
The fact that we can bound the causal effect is interesting in two regards. Firstly, it illustrates that even though the core IV assumptions do not imply any (conditional) independencies among the observable variables (G, X, Y), they still impose some restrictions leading to such bounds, and these restrictions can be exploited to test the validity of the core IV conditions to a certain extent. Secondly, the bounds are ‘tight’, meaning that nothing more precise can be said about the causal effect without further assumptions. This underlines the necessity of the latter if an effect estimate is desired, especially if the calculated bounds are too wide to be informative (Balke and Pearl 1994). Thus, for the above example, any derived estimate of the effect of homocysteine level on stroke risk will depend on the additional assumptions that are made for point estimation.
Note that a major limitation is that if X is continuous, no bounds or other restrictions can be derived from the core IV assumptions, i.e. there are no testable implications of the IV assumptions and a parametric approach is thus required for causal inference (Bonet 2001). Conclusions drawn for a continuous exposure are hence completely reliant on the relevant parametric modelling assumptions. This is especially limiting in MR analyses where the exposure is typically continuous and is arguably the reason why few examples of computing the bounds can be found in the literature. In fact, when the exposure of interest is continuous, it may be unwise to dichotomise the exposure as the chosen IV might not be valid for the dichotomised variable as illustrated in Fig. 5 (Didelez and Sheehan 2007b; Glymour et al. 2012; VanderWeele et al. 2014; Swanson 2017). When the bounds can be computed, they tend in practice to be wide and often include the null, as in the above example, and are therefore deemed ‘uninformative’. We note that this is not a poor property of the method, but is rather a property of the data: MR data are often ‘uninformative’ in this sense due to the weak IVs that are typically used and possibly only small true causal effects, if any (see “IV estimation in linear and additive structural models” below). The width of the bounds depends on the strength of the IV and the amount of confounding. However, in contrast with Burgess et al. (2017b), we recommend that they be computed in the case where variables are genuinely discrete, if only to assess how much can be inferred without further assumptions (Richardson et al. 2014). In particular, bounds that do not include the null causal effect could lend considerable weight to reported causal findings. They can be calculated easily in Stata (Palmer et al. 2011a) and using either of two R packages, bpbounds and ivtools, that have recently been made available on CRAN.
When several IVs are available, bounds can be calculated for each IV separately and the intersection of all such bounds considered. Likewise, bounds can be calculated for a combined IV (Swanson 2017).
Estimation with instrumental variables
As already noted, estimation of the causal effect requires additional modelling assumptions over and above the core IV conditions. Models differ depending on the exact setting (e.g. continuous or binary outcome, case-control or cohort study) and target of inference (e.g. local versus population parameter, causal average effect versus causal risk ratio). Furthermore, different parameters can be targeted by the same estimator under different assumptions so attention should be paid to the modelling details (Hernán and Robins 2006a; Brookhart and Schneeweiss 2007; Angrist and Pischke 2009; Didelez et al. 2010; Clarke and Windmeijer 2012). It is hence important to be clear about what parameter is being targeted and what assumptions are being made for any particular analysis.
IV estimation in linear and additive structural models
Here, we give a brief overview of the simplest and most popular case, the linear additive structural model. Other models are discussed briefly in “Other IV models and estimators”. We call this type of model ‘structural’ because it is assumed to be valid not only under observation of but also under intervention in X as explained earlier. It assumes that
$$\begin{aligned} E(Y \mid X = x,U = u) & = E(Y\mid \mathrm{do}(X = x),U = u) \\ & = \beta x + h(u), \end{aligned} \tag{2}$$
where the first equality is due to the structural assumption. This model posits that the causal effect within levels of the confounders U is linear in the exposure X without effect modification by U on the chosen scale, i.e. individuals in confounder subgroups such as men/women, drinkers/non-drinkers or older/younger, all react similarly to exposure. The unobserved confounders can predict or affect the outcome Y in an arbitrary way h(u). The model implies that the average causal effect for a one unit increase in X is identified as \(\mathrm{ACE}(x,x+1)=\beta\).
The parameter \(\beta\) cannot immediately be estimated from the above as we have no data on U. Moreover, we cannot obtain an unbiased estimate of \(\beta\) from a regression of Y on X due to correlation between U and X. Here, the IV comes into play. Exploiting the core IV conditions, it follows from the above structural linear and additive model that
$$\begin{aligned} \beta = \frac{\mathrm{Cov}(Y,G)}{\mathrm{Cov}(X,G)}, \end{aligned}$$
suggesting a simple estimator because the ratio of the covariances is in fact equal to the ratio of the regression coefficients from regressions of Y on G and of X on G:
$$\begin{aligned} {\hat{\beta }} = \frac{{{\hat{\beta }}}_{Y\mid G}}{{{\hat{\beta }}}_{X\mid G}} . \end{aligned} \tag{3}$$
This result has been known for a long time (Wright 1928; Wald 1940; Wooldridge 2002), but see Didelez et al. (2010) for a proof using the same notation as above.
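A brief sketch of why the covariance ratio identifies \(\beta\) may be helpful. It assumes the core conditions take their usual form, namely that G is independent of U, that G is associated with X (core condition 2), and that Y is independent of G given X and U; it is not a substitute for the proof cited above. Averaging the structural model over the distribution of X and U given G yields

$$\begin{aligned} E(Y \mid G) & = E\{E(Y \mid X, U, G) \mid G\} = E\{\beta X + h(U) \mid G\} \\ & = \beta E(X \mid G) + E\{h(U)\}, \end{aligned}$$

where the second equality combines the model with the independence of Y and G given (X, U), and the third uses the independence of G and U. Since \(E\{h(U)\}\) is a constant, it follows that

$$\begin{aligned} \mathrm{Cov}(Y,G) = \mathrm{Cov}\{E(Y \mid G),G\} = \beta \, \mathrm{Cov}\{E(X \mid G),G\} = \beta \, \mathrm{Cov}(X,G), \end{aligned}$$

and dividing by \(\mathrm{Cov}(X,G)\), which is non-zero by core condition 2, gives the ratio.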
The so-called ratio estimator (3) is simple to compute and has desirable statistical properties in that it is consistent. However, we now see why we need core IV condition 2: if the denominator is close to zero (relative to the measurement scale) the whole expression becomes very unstable and the variance of \({\hat{\beta }}\) then tends to infinity. The denominator \(({{\hat{\beta }}}_{X\mid G})\) will be close to zero if the instrument G does not strongly predict X; this is known as a weak instrument. It is plausible and can be shown formally, that the strength of the instrument (as measured by the proportion of variation in X that it explains) and the amount of confounding are inversely related: if U explains a lot of the variation in X, then there is not much variation left for G to explain (Martens et al. 2006). Moreover, use of a weak IV leads to loss of power for detecting a causal effect, if present, and also tends to bias the IV estimate of causal effect towards the naïve or ordinary least squares estimate which is precisely the bias that an IV analysis is trying to circumvent (Bound et al. 1995). For a single IV, the above ratio estimator is equivalent to the two-stage-least-squares (2SLS) estimator: predict X from a linear regression of X on G, and then carry out a linear regression of Y on the predicted values \({\hat{X}}\). The latter has the advantage of being generalisable in a straightforward way to multiple instruments, but, unlike the ratio estimator, requires joint data on G, X and Y.
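The following simulation sketch illustrates three of the points above: the ratio estimator recovers \(\beta\) under the linear additive model, it coincides with the two-stage procedure for a single IV, and the naïve regression of Y on X is biased by confounding. All numerical settings (sample size, effect sizes, allele frequency, seed) are arbitrary assumptions made for illustration, and the helper functions are our own.

```python
import random

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

def slope(y, x):
    """Least squares slope from a simple linear regression of y on x."""
    return cov(y, x) / cov(x, x)

def simulate(n=20000, beta=0.3, seed=1):
    """Data from Y = beta * X + h(U) + noise with h(u) = 2u, G a 0/1/2 genotype."""
    rng = random.Random(seed)
    G, X, Y = [], [], []
    for _ in range(n):
        g = (rng.random() < 0.3) + (rng.random() < 0.3)  # two alleles, frequency 0.3
        u = rng.gauss(0, 1)                              # unobserved confounder
        x = 1.0 * g + u + rng.gauss(0, 1)
        y = beta * x + 2.0 * u + rng.gauss(0, 1)
        G.append(g); X.append(x); Y.append(y)
    return G, X, Y

true_beta = 0.3
G, X, Y = simulate(beta=true_beta)

ratio_est = slope(Y, G) / slope(X, G)       # ratio estimator (3)

b1 = slope(X, G)                            # first stage: regress X on G
a1 = mean(X) - b1 * mean(G)
X_hat = [a1 + b1 * g for g in G]
tsls_est = slope(Y, X_hat)                  # second stage: regress Y on fitted X

ols_est = slope(Y, X)                       # naive, confounded estimate

print(f"ratio = {ratio_est:.3f}, 2SLS = {tsls_est:.3f}, OLS = {ols_est:.3f}")
```

Weakening the instrument, for example by shrinking the coefficient of g in the exposure model towards zero, makes the ratio estimate visibly unstable across seeds.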
Instrument strength is related to the (adjusted) \(R^2\) from the regression of X on G and the corresponding F-statistic for the null hypothesis that the IV does not predict X at all. Strength is relative to sample size and hence the much-cited rule-of-thumb of \(F \ge 10\) for an acceptably strong IV is valid for a single IV if the focus is on the actual level of an IV-based test. It does not provide a significance test of the null hypothesis at the same level for multiple IVs (Staiger and Stock 1997). The two values, \(R^2\) and F, should always be reported in any MR analysis but it is important to note that neither constitutes a definition of a strong/weak IV. Also, any data-driven approach to modelling the regression of X on G based on optimising \(R^2\) and F will bias the analysis (Sheehan and Didelez 2011).
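For a single IV, the quantities to be reported can be obtained directly from the first-stage regression of X on G. The sketch below uses the identity F = R²(n − 2)/(1 − R²) for a simple linear regression; the tiny data set is made up so that the arithmetic can be checked by hand, and the function name is an assumption.

```python
def first_stage_strength(x, g):
    """R^2, adjusted R^2 and F-statistic from a simple regression of x on g."""
    n = len(x)
    mx, mg = sum(x) / n, sum(g) / n
    sxg = sum((xi - mx) * (gi - mg) for xi, gi in zip(x, g))
    sxx = sum((xi - mx) ** 2 for xi in x)
    sgg = sum((gi - mg) ** 2 for gi in g)
    r2 = sxg ** 2 / (sxx * sgg)
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - 2)
    f_stat = r2 * (n - 2) / (1.0 - r2)   # 1 and n - 2 degrees of freedom
    return r2, adj_r2, f_stat

# Made-up data: six individuals with genotypes 0/1/2
g = [0, 0, 1, 1, 2, 2]
x = [1.0, 2.0, 2.0, 3.0, 3.0, 4.0]
r2, adj_r2, f_stat = first_stage_strength(x, g)
print(f"R2 = {r2:.3f}, adjusted R2 = {adj_r2:.3f}, F = {f_stat:.2f}")
```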
Multiple instruments
In many applications of MR, it is possible that several variables, \(G_1,\ldots , G_K\), are plausible candidates as instruments for the effect of X on Y. It is especially tempting to use databases of published GWAS results to identify numerous potential instruments for the same exposure-outcome relation.
Multiple instruments offer some potential benefits, for example with regard to the plausibility of assumptions. In particular, if each \(G_k\) separately satisfies the core IV conditions, then they should all estimate the same causal effect and so separate estimates of the causal effect parameter should be roughly similar. Note that this reasoning presumes a homogeneous causal effect as implied by the linear additive structural model (2). Under this model, large differences in the resulting estimated values possibly indicate that some of the core IV conditions may be violated for some of the instruments or, if they are all believed to be valid, that the model is incorrect. When the multiple IVs are independent, this is the basic idea underlying an over-identification test (Sargan 1958; Hansen 1982). Under model (2), similar estimates of the causal effect parameter thus provide evidence against bias due to pleiotropy or linkage disequilibrium but not necessarily due to population stratification. Of course, this procedure will still fail to detect problems if the separate IV estimates are all biased in exactly the same way (Palmer et al. 2012; Glymour et al. 2012).
When it is implausible that the causal relationship between X and Y can be summarised in a single parameter, such as when it is not linear or when there is effect modification by observed covariates so that model (2) does not hold, we can exploit multiple instruments to estimate more parameters. Hence, multiple instruments can be used to estimate more complex causal models. However, in such a case all instruments have to be sufficiently strong as well as sufficiently unrelated to provide the required increase in information.
Multiple IVs and 2SLS
With a single IV, the 2SLS estimator is asymptotically unbiased for the average causal effect but it is subject to finite sample bias which is exacerbated when the instrument is weak (Bound et al. 1995). Under model (2) with \({\mathbf {G}}\) the vector of IVs, 2SLS estimation easily accommodates multiple IVs by fitting a regression of X on all \(G_1,\ldots , G_K\) jointly in the first stage. The additional instruments can serve to reduce weak-IV-bias provided they also increase the amount of variation explained in the exposure X. However, adding very weak, or virtually ‘redundant’, IVs could actually increase the bias as this is likely to lead to over-fitting the first-stage regression and renders the occurrence of an accidental correlation between an instrument and unobserved confounding U more likely. Other estimators, such as the limited information maximum likelihood (LIML) and continuous updating estimators, have been shown to be more robust to weak IV bias (Sheehan and Didelez 2011; Davies et al. 2015).
Multiple IVs and allele scores
In MR applications it has become popular to use genetic risk or allele scores composed of several SNPs rather than a single genetic variant. Such a score S is given as a weighted sum of the multiple IVs/SNPs: \(S=\sum _k w_k G_k\). The IV estimate of \(\beta\) is then obtained by regressing X on S at the first stage and then proceeding as usual. For this procedure to result in a consistent estimator of the causal effect, the score S needs to satisfy the IV core conditions; in particular it must be sufficiently informative for X (core condition 2) as measured by the S–X association. A violation of the other core conditions will typically occur if one or more of the \(G_k\)’s are themselves not valid IVs, so that we can say that all \(G_1,\ldots , G_K\) need to be valid for the score to be valid (Swanson 2017).
To see the advantage of using an allele score, first note that 2SLS is equivalent to determining the weight \(w_k\) of each IV \(G_k\) as the regression coefficient from a multiple regression of X on \(G_1,\ldots , G_K\) jointly on the same data used for the whole analysis. As mentioned above, this easily gives rise to weak IV bias due to overfitting. Typically, however, the weights for an allele score are determined in a different way and several suggestions for how to do this have been proposed. If joint data are not available, as is often the case, one could obtain the weights for each SNP \(G_k\) from a simple regression of X on \(G_k\) alone. This is equivalent to 2SLS if the instruments are independent, but will not suffer from weak-IV-bias if a different data source is used for these K individual regressions than for the second stage. In principle, IVs do not have to be independent to be combined into a valid allele score in a one-sample setting (Fig. 6a). However, the weights for correlated IVs should ideally be obtained from a regression of X on \(G_1,\ldots ,G_K\) jointly and based on external data (Burgess et al. 2016). More generally, one could make use of other external information, e.g. other data sources or subject matter knowledge, to determine the weights. The number of parameters could be reduced by restricting the weights to be a constant \(w_k\equiv w\) for all \(G_k\), as in an unweighted score, or by partitioning \(G_1,\ldots , G_K\) into two groups, one with (the same) high weight and the other with low weight. Most allele scores implicitly assume an additive genetic model whereby each SNP has an approximately additive per allele effect on X: an unweighted score assumes similar per allele effects across all SNPs. 
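The construction of a score and its use as a single IV can be sketched as follows. The simulation settings are arbitrary, and the 'external' weights are simply set to the true per-allele effects as a stand-in for estimates that would, in practice, come from an independent data source; both are assumptions made for illustration only.

```python
import random

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

def allele_score(genotypes, weights):
    """S = sum_k w_k * G_k for each individual (one row of genotypes each)."""
    return [sum(w * g for w, g in zip(weights, row)) for row in genotypes]

def ratio_estimate(y, x, s):
    """Ratio estimator using the score s as a single instrument."""
    return cov(y, s) / cov(x, s)

rng = random.Random(7)
beta = 0.3
alphas = [0.6, 0.5, 0.5, 0.4, 0.4]   # hypothetical per-allele effects on X
genotypes, X, Y = [], [], []
for _ in range(20000):
    row = [(rng.random() < 0.3) + (rng.random() < 0.3) for _ in alphas]
    u = rng.gauss(0, 1)              # unobserved confounder
    x = sum(a * g for a, g in zip(alphas, row)) + u + rng.gauss(0, 1)
    y = beta * x + 2.0 * u + rng.gauss(0, 1)
    genotypes.append(row); X.append(x); Y.append(y)

external_weights = alphas            # stand-in for externally estimated weights
weighted_est = ratio_estimate(Y, X, allele_score(genotypes, external_weights))
unweighted_est = ratio_estimate(Y, X, allele_score(genotypes, [1.0] * len(alphas)))
print(f"weighted = {weighted_est:.3f}, unweighted = {unweighted_est:.3f}")
```

Both scores are valid instruments in this set-up because every SNP is valid by construction; the unweighted score merely uses the genetic information less efficiently.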
Biological knowledge can be incorporated to distinguish between SNPs that can be regarded as ‘major genes’ and thus fitted separately in the first-stage regression and those that are polygenic and can be combined into an allele score (Pierce et al. 2010; Palmer et al. 2012). Advantages of using allele scores mainly stem from either using external data or restricting the weights and hence reducing the number of parameters, as this alleviates weak IV bias provided all SNPs in the score are themselves valid IVs (Pierce et al. 2010; Palmer et al. 2012; Burgess and Thompson 2013).
Moreover, MR analyses based on allele scores seem to be less sensitive than 2SLS analyses to misspecification of the first-stage regression, i.e. using the ‘wrong’ score, but they are very sensitive to the choice of variants for inclusion in the score and to the derivation of the weights (Burgess and Thompson 2013). A perfect score would have the property that it fully summarises the information in \(G_1,\ldots ,G_K\) for predicting X, implying \(X\bot \!\!\!\bot (G_1,\ldots ,G_K)|S\). This is unlikely to hold if restricted weights or an unweighted score are used, but the resulting loss of information is often outweighed by the danger of introducing bias due to overfitting an overly complex first stage model or score. It is important to note that S is still a valid IV even if \(X\bot \!\!\!\bot (G_1,\ldots ,G_K)|S\) does not hold (see Fig. 6b) as long as the \(G_1,\ldots ,G_K\) are valid IVs. Such a situation would, however, be a problem for methods requiring a causal and unconfounded IV.
Multiple IVs and two samples
Up to now, we have mostly assumed a ‘one-sample’ scenario where individual-level data are available on all observable quantities, G, X and Y. The ratio estimator (3) can also be used in a ‘two-sample’ setting where summary data on the G–X and G–Y associations are taken from different studies under the assumption that the two underlying study populations are broadly similar (Hartwig et al. 2016). This lends itself to exploitation of potentially very large numbers of publicly available genome-wide association studies (GWAS) providing summary information on associations between candidate IVs \(G_k\) and exposure X and outcome Y of interest. For example, in a recent MR study of the effect of age at puberty on asthma risk (Minelli et al. 2018), potential instruments were initially selected from a large published genome-wide meta-analysis and supplemented through a literature search for additional relevant genetic studies using curated collections such as the NHGRI GWAS Catalog (Welter et al. 2014) and HuGE Navigator (Yu et al. 2008). The MR-Base platform (http://www.mrbase.org) has been specifically developed for MR analyses and has over 11 billion SNP–trait associations from almost 2000 GWAS to choose from (Hemani et al. 2018). In this situation, MR with multiple independent IVs can be viewed as a meta-analysis where the individual ratio estimates corresponding to each \(G_k\) can be combined into a pooled inverse variance weighted (IVW) estimate (Burgess et al. 2017a; Thompson et al. 2016, 2017). The one-sample over-identification test can be replaced by a standard assessment of heterogeneity used in meta-analysis, such as the \(\chi ^2\) test based on Cochran’s Q-statistic or the \(I^2\) statistic (Del Greco M et al. 2015; Bowden et al. 2016b, 2017). Summary data methods can also be extended to include correlated SNPs and to construct allele scores (Burgess et al. 2016; Zhu et al. 2018).
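A minimal sketch of the fixed-effect IVW estimate and the accompanying heterogeneity statistics from summary data is given below. It assumes independent SNPs and uses first-order weights that ignore the uncertainty in the G–X associations, a common simplification; the summary statistics are invented and the function name is our own.

```python
def ivw_summary(bx, by, se_by):
    """Fixed-effect IVW estimate from per-SNP summary statistics.

    bx[k]    : estimated G_k-X association
    by[k]    : estimated G_k-Y association
    se_by[k] : standard error of by[k]
    Returns (estimate, standard error, Cochran's Q, I^2).
    """
    ratios = [y / x for x, y in zip(bx, by)]
    # First-order inverse variance of each ratio estimate: bx^2 / se_by^2
    weights = [(x / s) ** 2 for x, s in zip(bx, se_by)]
    total = sum(weights)
    est = sum(w * r for w, r in zip(weights, ratios)) / total
    se = (1.0 / total) ** 0.5
    q = sum(w * (r - est) ** 2 for w, r in zip(weights, ratios))
    df = len(ratios) - 1
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return est, se, q, i2

# Invented summary data for three SNPs; the third ratio estimate is discordant
bx = [0.1, 0.2, 0.4]
by = [0.03, 0.06, 0.20]
se_by = [0.01, 0.01, 0.02]
est, se, q, i2 = ivw_summary(bx, by, se_by)
print(f"IVW = {est:.3f} (se {se:.3f}), Q = {q:.2f} on {len(bx) - 1} df, I2 = {i2:.2f}")
```

Under homogeneity, Q is compared with a \(\chi ^2\) distribution on K − 1 degrees of freedom; a large value, as in this contrived example, suggests that not all SNPs estimate the same parameter.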
Allowing for invalid IVs
The more SNPs that are considered as IVs in an MR analysis, the more likely it is that they will not all satisfy the core IV conditions. In the one-sample setting, the method of Kang et al. (2016) (and further developed in Windmeijer et al. (2018)) permits identification of the causal effect as long as fewer than 50% of the IVs are ‘invalid’ without the need to identify the offending IVs. The approach essentially penalises SNPs with suspected pleiotropic effects and downweights them in the analysis. Analogous robust approaches for the two-sample setting include: MR-Egger regression (Bowden et al. 2015) which can potentially cope with 100% invalid IVs under a strong assumption about the suspected pleiotropic effects; a weighted median approach (Bowden et al. 2016a) again assuming less than 50% invalid IVs; and mode-based estimation (Hartwig et al. 2017) which is consistent when the largest number of ‘similar’ individual SNP-based ratio estimates derive from valid IVs. All these robust approaches yield estimates that are less precise than 2SLS or IVW estimates, but should be carried out as part of a sensitivity analysis to support or question causal conclusions (Burgess et al. 2017a). They all make different and strong assumptions so we would go one step further and suggest that more weight should perhaps be given to analyses that do not rely so heavily on parametric assumptions (Clarke and Windmeijer 2012).
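Two of the simpler robust summary-data estimators can be sketched as follows: MR-Egger, written here as a weighted regression of the G–Y associations on the G–X associations with an intercept, and an unweighted median of the ratio estimates, which is a simplified stand-in for the weighted median approach cited above. The sketch assumes SNPs have already been oriented so that the G–X associations are positive; the data and function names are invented for illustration.

```python
import statistics

def mr_egger(bx, by, se_by):
    """Weighted least squares fit of by = intercept + slope * bx,
    with weights 1 / se_by^2. Returns (intercept, slope)."""
    w = [1.0 / s ** 2 for s in se_by]
    sw = sum(w)
    mx = sum(wi * x for wi, x in zip(w, bx)) / sw
    my = sum(wi * y for wi, y in zip(w, by)) / sw
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, bx, by))
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, bx))
    slope = sxy / sxx
    return my - slope * mx, slope

def simple_median(bx, by):
    """Unweighted median of the per-SNP ratio estimates."""
    return statistics.median([y / x for x, y in zip(bx, by)])

# Invented data: every SNP has the same direct (pleiotropic) effect of 0.05 on Y
bx = [0.1, 0.2, 0.3, 0.4]
by = [0.05 + 0.3 * x for x in bx]
intercept, slope = mr_egger(bx, by, [0.01] * 4)
print(f"Egger intercept = {intercept:.3f}, slope = {slope:.3f}")

# Invented data: one invalid SNP out of five leaves the median unaffected
median_est = simple_median([0.1, 0.2, 0.3, 0.4, 0.5], [0.03, 0.06, 0.09, 0.12, 0.40])
print(f"simple median = {median_est:.3f}")
```

In the first example a non-zero intercept absorbs the common pleiotropic effect, which is the kind of strong structural assumption that MR-Egger relies on.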
Because of the increasing availability of multiple candidate genetic IVs, development of methods for incorporating multiple IVs, particularly in the two-sample setting, has been mainly restricted to the MR literature. Attention is now turning to applying such approaches to the one-sample setting as intensive phenotyping of genetic association study populations is taking place and individual level data on instrument(s), exposure and outcome can reasonably be expected in many situations. It should be noted that establishing the validity of a set of IVs requires additional care and commonly used terms such as ‘all valid’ and ‘some invalid’ are neither used consistently nor explicitly defined.
Other IV models and estimators
The linear and additive structural model of “IV estimation in linear and additive structural models” may often be plausible, at least as an approximation, for a limited range of X values. It can be shown that 2SLS has very good robustness properties under this model even when certain aspects, such as the first stage model or the way in which measured covariates enter the model, are misspecified (Vansteelandt and Didelez 2018).
These desirable properties of 2SLS do not typically carry over to non-linear models which are used, for instance, when the outcome Y is binary. For binary outcomes, a linear approach would still estimate the ACE or causal risk difference, but we may then prefer to report the CRR or COR, requiring non-linear models. Under certain parametric assumptions about the exposure distribution and using a log-linear model for the second stage regression, the CRR can be targeted by a two-stage regression or ‘ratio-type’ estimator (Didelez et al. 2010). The main problem for the non-linear case is that the relationship between the two regressions (Y on G, and X on G) and the relevant causal parameter, CRR or COR, is no longer straightforward and estimators derived from these two regressions are typically biased (Vansteelandt and Goetghebeur 2003; Martens et al. 2006; Palmer et al. 2011b; Vansteelandt et al. 2011; Harbord et al. 2013). This is also true when the focus is on a local, or ‘complier’ odds ratio (Cai et al. 2011). There are other IV methods dealing with binary outcomes, or more generally non-linear structural models, but they are less intuitive than the ratio estimator, and less simple to construct. The CRR, for example, can also be estimated under the weaker assumptions of a structural mean model or using a generalised method of moments estimator but identification problems can arise as the estimating equations sometimes have multiple solutions (Hernán and Robins 2006a; Clarke and Windmeijer 2010, 2012; Burgess et al. 2014).
Targeting the COR poses additional problems due to the non-collapsibility of odds ratios and the situation becomes even more complicated if data on (X, Y, G) are obtained from a case–control study where bias can be induced through conditioning on the outcome Y. In a case–control setting, the distribution of confounders in the control group is typically different from that in the general population due to over-recruitment of cases and this can induce an undesired association between the IV G and the unmeasured confounders U (Didelez and Sheehan 2007b). Here, ORs have to be used despite the problems induced by selecting on case status since other measures of association are even more sensitive to retrospective sampling (Burgess et al. 2017b). When good estimates of disease prevalence or population allele frequencies are available, an MR analysis can be re-weighted to yield reliable estimates of the COR (Bowden and Vansteelandt 2011). Recent advances have been made using IVs for survival outcomes. Non-collapsibility of the hazard ratio in the popular Cox model is problematic and requires an approximate approach (Martinussen et al. 2019) whereas additive hazard models behave more like 2SLS (Tchetgen Tchetgen et al. 2015; Martinussen et al. 2017). They all require individual-level (one sample) data and are restricted to a single IV.
Bayesian approaches to MR analyses have also been proposed (Burgess et al. 2010; Burgess and Thompson 2012; Jones et al. 2012) and recent work addresses the issue of dependent IVs (Shapland et al. 2019) and invalid IVs with pleiotropic effects (Berzuini et al. 2019). These methods have yet to gain popularity in applied studies, possibly due to the unavailability of user-friendly software but also, perhaps, because these approaches are fully parametric requiring a complete specification of the likelihood (which implicitly or explicitly includes the unobserved confounding) together with prior distributions on all parameters in the model. Inferences are hence very sensitive to the modelling assumptions and prior information.