Because of these issues, several authors have recommended that Mendelian randomization investigations should only test the causal null hypothesis, rather than attempt to calculate a causal estimate [13, 23]. With a single genetic variant, this can be achieved by testing for an association between the variant and the outcome. With multiple genetic variants, the most efficient test of the causal null hypothesis is based on the IV estimate: under the homogeneity assumptions, the two-stage least squares estimate, or equivalently the inverse-variance weighted estimate, is the optimally efficient combination of the instruments for testing for a causal effect [26]. Hence we may want to perform IV estimation to test the causal null even if the IV estimate is regarded only as a test statistic and does not have a clear interpretation as a causal effect.
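As an illustration of using the IV estimate as a test statistic, the following is a minimal sketch of a fixed-effect inverse-variance weighted estimate computed from summarized data. It assumes uncorrelated variants and ignores uncertainty in the variant-exposure associations (first-order weights). The function name `ivw_estimate` and the input values are hypothetical and chosen only for illustration.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse-variance weighted (IVW) estimate from summarized data.

    Assumptions of this sketch: variants are uncorrelated, and the
    variant-exposure associations are treated as known without error.
    """
    numerator = sum(
        bx * by / se**2
        for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome)
    )
    denominator = sum(
        bx**2 / se**2 for bx, se in zip(beta_exposure, se_outcome)
    )
    estimate = numerator / denominator
    standard_error = math.sqrt(1.0 / denominator)
    z = estimate / standard_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return estimate, standard_error, z, p_value


# Hypothetical per-allele associations for two variants (not real data).
beta_exposure = [0.1, 0.2]
beta_outcome = [0.05, 0.10]
se_outcome = [0.01, 0.01]

estimate, standard_error, z, p_value = ivw_estimate(
    beta_exposure, beta_outcome, se_outcome
)
print(estimate, standard_error, z, p_value)
```

Here the z-statistic and p-value provide the test of the causal null hypothesis; the point estimate need not be given a causal interpretation.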
Suppose that we want to calculate a causal effect with a binary exposure, under the assumption that the exposure has a stepwise effect on the outcome (that is, it truly is a dichotomous exposure). This may be because we truly believe in the homogeneity assumptions, or because we truly believe in the monotonicity assumption and regard the genetic compliers as a worthwhile subgroup of the population in which to estimate an average causal effect. More likely, it is because a causal effect estimate is required for pragmatic reasons, such as to perform a power calculation or to inform policymakers of the expected impact of intervention on the exposure. Other reasons for estimating a causal parameter include efficient testing of the causal null hypothesis with multiple candidate instrumental variables, and using a robust method with multiple genetic variants (such as the MR-Egger method [6] or the weighted median method [7]; these methods make weaker assumptions and do not require all genetic variants to satisfy the instrumental variable assumptions). If the binary exposure is a dichotomization of a continuous risk factor, then power calculations are likely to be conservative, as the effect of the genetic variant on the outcome will not be fully captured by the binary exposure.
Two options for causal estimation are: (1) estimating the effect on the outcome per (say) 1% absolute increase in the probability of the exposure; (2) estimating the effect on the outcome per (say) doubling of the probability (or odds) of the exposure. We concentrate on estimation methods based on regression (usually linear or logistic) for several reasons. First, researchers often perform their analyses using summarized association estimates (beta-coefficients from regression analyses of the exposure and outcome on a genetic variant) and do not have access to individual-level data. These beta-coefficients represent the average change in the trait (exposure or outcome) per additional copy of the effect allele. Second, these approaches result in causal estimates that have a simple and relevant interpretation and that can be compared to estimates in the literature from other analytical approaches. Third, there are often technical restrictions on the data analysis: for example, it may be necessary to fit a mixed model to account for relatedness between individuals, to adjust for several principal components of ancestry, or to provide a coordinated approach to analysis across different datasets. These restrictions are easiest to accommodate in a regression framework. Both estimation procedures require strict linearity and homogeneity assumptions; full details are available elsewhere [13, 15]. The parametric assumptions for the two options are mutually incompatible. Additionally, regression coefficients will generally be variation dependent on the baseline risk, a nuisance parameter [20]. If individual-level data are available, then alternative approaches to estimation can be taken [4, 25].
If the genetic associations with the exposure are estimated using linear regression, then they represent absolute changes in the prevalence of the exposure. This enables estimation of the causal effect of an intervention on the prevalence of the exposure on an absolute scale. It is sensible to scale the causal effect to consider a modest absolute increase in the prevalence of the exposure (say a 1% or a 10% increase), as a unit increase would represent the average causal effect of a population intervention from 0% prevalence of the exposure to 100% prevalence, which is an unrealistic intervention in practice. However, absolute associations with a binary variable do not make sense in case-control settings (where cases are those with the exposure), as they depend on the ratio of cases to controls chosen by the investigator.
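The rescaling described above can be sketched as follows for a single genetic variant, using the ratio of the variant-outcome association to the variant-exposure association as the IV estimate. This is an illustrative sketch only: the function names and the association values are hypothetical, and the variant-exposure association is assumed to be a risk difference per allele from linear regression.

```python
def ratio_estimate(beta_outcome, beta_exposure):
    """Ratio (Wald) IV estimate: change in outcome per unit change in exposure.

    With beta_exposure on the absolute (risk difference) scale, a unit change
    corresponds to moving the exposure prevalence from 0% to 100%.
    """
    return beta_outcome / beta_exposure


def per_absolute_increase(estimate_per_unit, increase):
    """Rescale the estimate to a given absolute increase in prevalence.

    `increase` is a proportion: 0.01 for a 1% absolute increase.
    """
    return estimate_per_unit * increase


# Hypothetical per-allele associations (not real data).
beta_exposure = 0.02  # prevalence is 2 percentage points higher per allele
beta_outcome = 0.10   # outcome is 0.10 units higher per allele

per_unit = ratio_estimate(beta_outcome, beta_exposure)
per_1_percent = per_absolute_increase(per_unit, 0.01)
per_10_percent = per_absolute_increase(per_unit, 0.10)
print(per_unit, per_1_percent, per_10_percent)
```

The per-unit value is rarely of direct interest; the rescaled values describe the more realistic interventions discussed in the text.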
If the genetic associations with the exposure are estimated using logistic regression, then they represent log odds ratios. The causal estimate would then represent the change in the outcome per unit change in the exposure on the log odds scale. A unit increase in the log odds of a variable corresponds to a 2.72 (\(= \exp 1\))-fold multiplicative increase in the odds of the variable. If the exposure is rare, then the odds of the exposure are approximately equal to the probability of the exposure. In this case, the causal estimate approximately represents the average change in the outcome per 2.72-fold increase in the prevalence of the exposure (for example, an increase in the exposure prevalence from 1% to 2.72%). It may be more interpretable to think instead about the average change in the outcome per doubling (2-fold increase) in the prevalence of the exposure. This can be obtained by multiplying the causal estimate by 0.693 (\(= \log _e 2\)).
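A short sketch of this rescaling, together with a check of the rare-exposure approximation, is given below. The function names and the example values (a causal estimate of 1.0 per unit log odds and a baseline prevalence of 1%) are hypothetical and serve only to illustrate the arithmetic.

```python
import math


def rescale_per_doubling(estimate_per_log_odds):
    """Convert an estimate per unit increase in log odds of the exposure
    into an estimate per doubling of the odds (multiply by log_e 2)."""
    return estimate_per_log_odds * math.log(2.0)


def prevalence_after_odds_multiplier(prevalence, factor):
    """Prevalence obtained after multiplying the odds of exposure by `factor`."""
    odds = prevalence / (1.0 - prevalence)
    new_odds = odds * factor
    return new_odds / (1.0 + new_odds)


# Hypothetical causal estimate per unit increase in log odds of the exposure.
estimate_per_log_odds = 1.0
estimate_per_doubling = rescale_per_doubling(estimate_per_log_odds)

# Rare-exposure check: multiply the odds by exp(1) at 1% baseline prevalence
# and compare with a naive exp(1)-fold increase in the prevalence itself.
baseline = 0.01
exact_prevalence = prevalence_after_odds_multiplier(baseline, math.exp(1.0))
approx_prevalence = baseline * math.exp(1.0)
print(estimate_per_doubling, exact_prevalence, approx_prevalence)
```

For a rare exposure the exact and approximate prevalences are close; the gap widens as the baseline prevalence increases, so the "per doubling of prevalence" interpretation should be treated as approximate.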