Bias assessment and correction for Levin’s population attributable fraction in the presence of confounding

In 1953, Morton Levin introduced a simple approach to estimating population attributable fractions (PAF) depending only on risk factor prevalence and relative risk. This formula and its extensions are still in widespread use today, particularly to estimate PAF in populations where individual data is unavailable. Unfortunately, Levin’s approach is known to be asymptotically biased for the PAF when the risk factor-disease relationship is confounded even if relative risks that are correctly adjusted for confounding are used in the estimator. Here we describe a simple re-expression of Miettinen’s estimand that depends on the causal relative risk, the unadjusted relative risk and the population risk factor prevalence. While this re-expression is not new, it has been underappreciated in the literature, and the associated estimator may be useful in estimating PAF in populations when individual data is unavailable provided estimated adjusted and unadjusted relative risks can be transported to the population of interest. Using the re-expressed estimand, we develop novel analytic formulae for the relative and absolute asymptotic bias in Levin’s formula, solidifying earlier work by Darrow and Steenland that used simulations to investigate this bias. We extend all results to settings with non-binary valued risk factors and continuous exposures and discuss the utility of these results in estimating PAF in practice. Supplementary Information The online version contains supplementary material available at 10.1007/s10654-023-01063-8.


Introduction
The Population Attributable Fraction (PAF), sometimes also referred to as the Population Attributable Risk and the Excess Fraction, is a commonly used metric in epidemiology.It is useful both to ascertain the importance of a risk factor in causing disease as well as to identify the best risk factors to target in a health intervention.
Applications of attributable fractions abound in the literature starting with Richard Doll's work in estimating the burden of lung cancer due to smoking [1], with recent examples relating to modifiable risk factors of gout in the US [2], risk factors for depression in Brazil [3] and modifiable risk factors for cancer in Denmark [4], with these studies representing only a very small sampling of recent literature.
Intuitively, PAF represents the fraction of prevalent disease cases that might have been avoided in some population if a risk factor were not present. 1If we express disease prevalence in the current population as P(Y = 1) , Y being a binary indicator for disease, and the prevalence of disease that would have been observed in the target population had the risk factor been removed as P(Y 0 = 1) , PAF can be expressed as: Here, we are restricting our usage of the term risk factor to apply to environmental, physiological or behavioural determinants of a particular disease; that is we do not view variables that have non-causal associations with disease as risk factors.Such factors might be binary, multi-level or continuously distributed.As an example, depending on the resolution of data capture, smoking might be coded as a binary indicator for current smoking, separate indicators for whether an individual is currently smoking, has given up smoking or has never smoked, or perhaps as total nicotine intake via smoking.Definition (1) and its interpretation, in terms of a comparison between the current population and the population had the risk factor been removed, when this is possible at least hypothetically, is valid for any type of exposure distribution.However, the details of the estimation process will differ between these three situations.We first will discuss the simplest case where the risk factor is binary before extending to more general situations.
In the binary risk factor setting, a vast literature has evolved on methods to estimate (1) with individual-level data [5].However, it is sometimes necessary to derive estimates of PAF in scenarios where the collection of individual-level data is impossible.As an example, the Global Burden of Disease project [6] estimate population attributable fractions at a country level for differing risk factor/disease combinations.For certain countries, no individual-level data linking risk factors of interest to disease may be available.The most common approach in this situation is to substitute the estimated risk factor prevalence in the population, π , and estimated causal relative risk of disease (adjusted for known confounders), RR C ,2 into Eq.( 2) below.Equation ( 2) was introduced by Morton Levin [7], also in the context of estimating the PAF for smoking as a cause of lung cancer: (2) is prominent throughout the literature on PAF estimation.However, as recognised by multiple authors [8][9][10][11], (2) will usually differ from (1) as a quantity.In particular (2) will equal the true PAF only under the unlikely condition that the association between risk factor and outcome is unconfounded, or more technically exhibits marginal exchangeability [12].Marginal exchangeability implies that the causal relative risk, RR C = P(Y 1 =1) P(Y 0 =1) , defined as the ratio of disease prevalence comparing scenarios where the entire population was exposed and the entire population was unexposed to the risk factor in question equals the unadjusted relative risk, RR U = P(Y=1|X=1) P(Y=1|X=0) .See Supplementary Material Section 1B for more discussion. (2) .

Miettinen's definition of PAF
While (2) does not equal PAF in the absence of this 'no confounding' assumption, an alternative expression introduced by Miettinen [13], which involves the prevalence of the exposure within cases, c , as opposed to the overall population prevalence does equal (1) irrespective of the degree of confounding.Miettinen's expression is given by where c is now the proportion of disease cases that have the risk factor.The above expression is correct (that is PAF M = PAF ) provided the causal relative risk, RR C , is constant in differ- ing joint strata of the confounder variables.Otherwise, a similar formula holds, replacing the causal relative risk in the overall population with the causal relative risk in individuals exposed to the risk factor (See the worked example later in this section for an example, and Supplementary Section S2 for a proof).The usual advice is to use (3) when possible [9] and not (2) as a basis for estimating PAF.However, estimates for are more likely to be found in the published literature than estimates for c and as a result, researchers are more likely to use (2) in practice.For example, the Global Burden of Disease project utilise (2) rather than (3) to estimate PAF [6].

A re-expression of Miettinen's PAF
A fact that has been underappreciated in the literature is that Eq. ( 3) can be re-expressed to be a function of three quantities: the prevalence of the risk factor in the population, , the causal relative risk, RR C , and the unadjusted relative risk, RR U in the population: The equality of (3) and ( 4) can be shown by applying Bayes' rule (supplementary section S3).This formula was known to Miettinen (see Eq. 8 in his 1974 paper), but has received little attention in the literature, although an equivalent formula was recently described by Susuki and Yamamoto [14].While a somewhat simple re-expression of (3), (4) may be practically useful in estimating PAF with summary data as unadjusted relative risks are often reported together with adjusted relative risks.It is useful to notice that under no confounding, which implies RR U = RR C , the expression (4) simplifies to Levin's formula.This observation implies that in addition to Levin's formula being correct under marginal exchangeability as mentioned previously, it is also correct under a slightly weaker assumption of noconfounding in risk ratio ( RR U = RR C ), together with an assumption of no effect modification across confounder strata.However, formulae (2) and (4) will otherwise differ.

A worked example demonstrating the bias in Levin's formula
For illustration regarding both the biases at play from Levin's approach, we consider a hypothetical example with a single 2-level confounder, physical activity, that causally the risk factor affects a binary risk factor, hypertension, and disease.The physical activity/hypertension strata in the population, the population proportions in each stratum and the resulting observed probabilities of disease in the population are given in Table 1.
First, we will calculate the PAF from definition (1).The observed probability of disease in the population is just a weighted average of the observed probabilities of disease in the differing physical activity/hypertension subgroups, with weighting according to the corresponding proportions in the population: P(Y = 1) = 0.5(0.1)+ 0.3(0.15)+ 0.05(0.2) + 0.15(0.3)= 0.15.To calculate the counterfactual probability of disease under elimination of hypertension, we need to work out the counterfactual probabilities of disease in each subgroup, under the intervention that hypertension is eliminated.In the subgroups of the population that had no hypertension (subgroups 1 and 3), this is the same as the observed probability of disease.In the subgroups of the population with hypertension and a particular level of physical activity, this is the probability of disease in the subgroup having the same level of physical activity and no hypertension (This calculation is making use of the conditional exchangeability or no unmeasured confounding assumption).The counterfactual probability of disease in the population, under elimination of hypertension pressure is then a similar weighted average of these subgroup counterfactual probabilities.That is: P(Y 0 = 1) = 0.5(0.1)+ 0.3(0.1)+ 0.05(0.2) + 0.15(0.2) = 0.12 + 0.15(0.2) = 0.12 .The PAF can then be calculated from Eq. (1) as Since, physical activity is the only confounder of the risk factor/disease relationship, causal relative risks can be calculated by the taking the ratio of the probabilities of disease for individuals with high and low blood pressure, within the same stratum of physical activity.These probabilities are (0.15 vs 0.1) in the high-physical activity stratum, and (0.3 vs 0.2) in the low-physical activity stratum, both leading to the same stratum-specific causal relative risk for hypertension, RR C = 1.5.
The unadjusted relative risk in the population can be worked out by comparing the observed probabilities of disease in individuals with and without hypertension.Individuals with hypertension constitute 45% of the overall population, 2/3 of which (30% of the overall population) have high physical activity and 1/3 of which have low physical activity.This indicates that the observed probability of disease within individuals with hypertension is a (2/3, 1/3) weighted average: (2/3) 0.15 + (1/3) 0.3 = 0.2, of the observed probabilities of disease in the subgroup with hypertension and high physical activity and the observed probability of disease in the subgroup with hypertension and low physical activity.In a similar, way we can deduce that the observed probability of disease within individuals with no hypertension is (10∕11)0.1 + (1∕11)0.2∼ 0.11 .The unadjusted relative risk for hypertension is simply the ratio, 0.20/0.11= 1.83, of these two probabilities.Finally the prevalence of hypertension is 30% + 15% = 45% (or 0.45 as a proportion).
Numerically, Levin's formula works out to be:

Why does Levin's approach fail?
One might ask, why has Levin's approach failed here?As demonstrated in [15], Levin's formula can be interpreted as a ratio of approximate 'excess disease risk', given by RR C − 1 B among the proportion, , of individuals , and as a result Levin's for- mula, in this example, underestimates the PAF.In contrast, Miettinen's formula is correct.Miettinen's formula can be interpretated as saying the proportion of disease cases that would be prevented by elimination of the risk factor (that is the PAF) is the proportion of disease cases that are exposed to the risk factor, which is 0.6 , times the relative decrease in disease probability in those risk-factor exposed individuals had the risk factor been eliminated, which is: . This alternative construction that deals with relative (and not baseline) disease probabilities, applied solely to the subgroup exposed to the risk to the risk factor, obviates problems with confounding.

What if there is effect modification across strata?
In the above example, the casual relative risk is 1.5 in both confounder strata.If, in contrast, there is effect modification on the relative risk scale so that the causal risk factor varies among confounder strata, substituting the overall causal relative risk, RR C into (4), will no longer be correct.For instance, suppose the column 'observed probability of disease' in Table 1 was changed to have the entries (0.1, 0.15, 0.2, 0.6), with the population subgroup proportions unchanged.In this case, the per-stratum causal relative risks of hypertension among active and inactive individuals would be 1.5 and 3.The overall causal relative risk, comparing counterfactual scenarios where the entire population is exposed and is unexposed to the risk factor, is now RR C = 2 .Substituting RR C = 2 into Eq.( 4) leads to PAF M = 0.346 , an underestimate of the correct PAF, which is 0.385 (Levin's approach now generates PAF M = 0.31 ).Here the cor- rect relative risk to use in (4) is the causal relative risk of hypertension, restricted to the sub-population exposed to hypertension, which can be worked out to be 2.25.Substituting this modified relative risk into Eq.( 4) leads to the correct PAF.In practice, estimates for the causal relative risk in risk factor exposed individuals are usually unavailable from published summary data, and the best one can do is to estimate PAF by substituting estimated values for risk factor prevalence, , unadjusted relative risk, RR U , and the overall causal relative risk, RR C , into (4).This will only be an asymptotically correct approach if there is no effect modification, but should behave better than Levin's approach even if there is effect modification.In practice, RR C is usually estimated via a statistical modelling approach adjusting for known confounders.

Analysis of bias in Levin's estimate, assuming no effect modification on the relative risk scale
Darrow and Steenland [10] investigated the bias of Levin's formula using simulation, under an assumption that there was no effect modification on the relative risk scale.They showed that as the degree of confounding, specified by max{C, 1∕C} where C = RR C RR U gets larger, with no confounding in risk ratio being represented by C = 1 , the magnitude of relative bias in Levin's formula will increase.Their simulations also indicated that this relative bias will be larger for smaller risk factor prevalences, as again observed in the previous example.However, having an explicit formula for relative bias allows a more rigorous analysis of limiting behaviour for extreme values of C than was possible in [10].
Given the formula (4), which correctly identifies the PAF when causal relative risks do not vary over confounder strata, an analytic expression for the relative asymptotic bias can be trivially derived by simply taking the ratio of (3) and (4).After some simplification (supplementary section S4) this results in: 3In supplementary section S8, we analyse Eq. ( 5) mathematically, and show this relative bias function gets worse as the degree of confounding increases in either direction.

PAF L PAF
is 1, when there is no confounding, that is when C = 1.Fixing the prevalence and unadjusted relative risk, increases to a maximum of 1 + 1− RR U > 1 as the confounding ratio C increases, and decreases to 0 when C approaches 0. The absolute bias, PAF L − PAF , is again 0 when there is no confounding, that is when C = 1 , or when there is no causal effect of the risk factor, RR C = 1 , and increases to 1 −

RR U
1+ (RR U −1) for large C. One can also prove that the relative bias, (5), will get worse as the risk factor prevalence, , gets smaller, although the relationship between and absolute bias is more complicated (absolute bias approaches 0 when is close to 0 and when is close to 1, fixing RR U and C).These results (and other properties of the relative and absolute bias functions) are proven in Supplementary Section S8.A graphical representation of the associated biases (similar to the plots in [10], which also includes a representation of limiting biases and behaviour in the 'unadjusted' version of Levin's formula, (2), using RR U in place of RR C is given by Fig. 1.

PAF for exposures with more general distributions
When the distribution of exposure is multi-category or continuous, PAF can be defined as the fraction of prevalent (or incident) disease cases that would have been avoided in a population where the value of the exposure was fixed at a particular value or follows a distribution that minimises the probability of disease (this value is known as the minimum risk exposure value or MREV).While with individual-level data it is possible to estimate the MREV [16], when deriving estimates of PAF from summary estimates one usually assumes a pre-determined MREV [6].For example, in the simple case of an exposure such as air pollution which can be eliminated, 4 the MREV would be presumed to be 0, that is elimination, with non-zero values being appropriate for exposures like blood pressure or sodium consumption.In the following, we will assume for simplicity that the MREV = 0, so that PAF is still defined by (1).The causal relative risk is now a function, RR C (x) = P(Y x =1) P(Y 0 =1) , comparing the increased prevalence of disease if the population were all exposed to exposure level x relative to disease prevalence if the same were all exposed to the MREV.Miettinen's formula (4) can be extended as follows: where E X|Y=1 [f (X)] denotes the average of a function, f (X) , of the exposure, X within the population of individuals with disease.Under the assumption that for each possible exposure value, x, RR C (x) is constant within confounder strata, Eq. ( 6) will be correct for the PAF as given by ( 1), see supplementary section S5.Note that this formula is valid for both continuous exposures (such as air pollution) and multicategory exposures (such as a three-level coding of smoking in terms of current smokers/former smokers and neversmokers), and simplifies to (3) in the setting of a binary exposure.
As shown in supplementary section S7, ( 6) can be reexpressed using Bayes' Rule as follows: with R U (x) = P(Y=1|X=x) P(Y=1|X=0) representing the unadjusted relative risk function that compares the prevalence of disease in the strata of the population exposed to the value x of the risk factor with the prevalence in the strata exposed to the MREV, and E X is the population expectation operator.The right-hand side of (7) indicates that one can calculate PAF M by numerical integration of the quantities RR U (x) and RR U (x) (x), where we are assuming that the exposure distribution is absolutely continuous with density (x) .In the special case that the exposure is discrete, but with more than two levels: X ∈ {0, 1, ..., L} , with the population preva- lences of each level: (0), ..., (L) , and MREV = 0, Eq. ( 7) can be re-expressed as:

Bias from Levin's approach for more general exposure distributions
The generalisation of Levin's formula to continuous exposure distributions is given by: The formula for multi-category exposures is very almost identical, apart from replacing the integral in (9) with a summation over the values, x , that the exposure can take.( 6) )

RR U
for PAF L as C → ∞ (for plots C and D).Note that the absolute bias of PAF L is 0 at both C = 1∕RR U and C = 1 , while the relative bias for PAF L converges to as C approaches 1∕RR U (as tagged on the y-axis of plot C).The absolute and relative biases of the unadjusted version of Levin's formula are more erratic.In particular, the relative bias function for PAF L,U is negative when RR C < 1 or equivalently when C < 1 RR U indicating that PAF L,U shows the incorrect direction of association in this scenario; in contrast the relative bias for PAF L is always larger than 0 for all C > 0 and converges to 0 as C → 0 See Table 2 for a comparison of Levin's and Miettinen's approach for the binary, multi-category and continuous exposures.Comparing Eqs. ( 7) and ( 9), it follows immediately that under no confounding, that is RR U (x) = RR C (x) for every exposure value x , Levin's formula is again unbiased.However, analysis of bias in Levin's formula is more complicated when the distribution of the exposure is non-binary, although under certain special settings, one can analyse bias in the same way as in the binary exposure case.See supplementary section S9 for discussion.

Discussion
We use this final section mostly to discuss caveats and limitations to our suggested approaches.First, we have suggested that where possible Eq.( 4) should be used instead of Levin's formula, (2), to estimate PAF when individual-level data is unavailable.Equation (4) requires estimates of the unadjusted and causal relative risks, RR U and RR C .In practice, relative risks are estimated in one population and used to estimate PAF in a differing population.This requires an implicit assumption that the relative risks are transportable across populations, that is they are equal in the population in which the relative risks are estimated and the population for which PAF is being estimated.While transportability of the causal relative risk is an issue both for Eq. ( 4) and Levin's approach, (2), transportability of the unadjusted relative risk is an additional assumption and there is no good reason to expect it to hold.
Here we advocate that authors use Eq. ( 4) when estimates of unadjusted and adjusted relative risks are available.In contrast, in the literature we've noticed several examples of Levin's formula being applied with unadjusted relative risks in observational settings [18,19].While this will generate correct results under no confounding, we would not advise this practice.First, if there is no confounding, either an adjusted or unadjusted relative risk would be appropriate to use in Levin's approach, that is both should generate statistically consistent estimates of the true causal relative risk, provided no effect modifiers are included in the set of variables that are adjusted for in the adjusted relative risk.Second, in observational settings, there usually will be some degree of confounding in which case substituting unadjusted relative risks into Levin's formula will often result in more egregious bias than if appropriately adjusted relative risks were instead used in the same formula.For example, using Levin's formula with an unadjusted relative will flip the direction of the putative PAF when RR U < 1 and RR C > 1 , so that while PAF > 0 (the risk factor causes disease), PAF L,U = (RR U −1) 1+(RR U −1) < 0 , that is the risk factor would seem to be protective when using an unadjusted relative risk with Levin's approach.In contrast, this problem does not happen with using the true causal relative risk in Levin's formula since PAF L = (RR C −1) 1+(RR C −1) > 0 , when RR C > 1 .Plots B and D in Fig. 1 show that, as one might expect, Levin's formula also flips the direction of the causal effect if RR U > 1 and RR C < 1 .In addition, for very small causal effects with RR C ≈ 1 , PAF L ≈ 0 irrespective of the value of the con- founding ratio C , however in the same scenario PAF L,U can be quite biased if C is very large or very small.
As noted by other authors, the bias using Levin's formula, with the causal relative risk, is generally quite small [10] and likely insubstantial compared to other biases involved in estimating a PAF.For instance, in an example regarding the relationship between physical inactivity and depression, there may be questions about the consistency of ascertainment and measurement of both the exposure (inactivity) and outcome (depression).The prevalence of defined inactivity (and the associated relative risk) will change depending on what the investigator considers to constitute inactivity, and the definition may differ across different studies.Similar issues arise in the consistency of ascertainment and Table 2 Below, we show the identification formulae for PAF and PAF L , for binary, multicategory and continuous exposure distributions, under the assumption of no effect modification π represents the population prevalence of a binary risk/ factor and π c the prevalence within disease cases, with similar notation representing probability mass functions and density functions when the exposure is multi-category or continuously distributed.An implicit assumption for the multicategory and continuous formulae is that the MREV (minimum risk exposure value) is equal to 0. In all settings, PAF is expressed using the traditional Miettinen formula (in terms of π c and RR C ) and the 3 variable formula advocated in this manuscript which can be used if estimates of both the unadjusted and adjusted relative risks are available Continuous risk factor measurement of depression.These issues reduce the realworld meaning and actionability of any estimated PAF.On a related point, physical activity should really be measured on a continuum; a binary definition of inactivity will likely underestimate associated disease burden.For instance, a better approach to calculating PAF might determine a sufficient level of physical activity (that perhaps differs dependent on age and other characteristics of a person) and estimate disease prevalence in a hypothetical population where everybody had at least this level of activity; however, implementation of such an estimator may be practically challenging.Finally, the assumption that the causal relative risk is constant across differing confounding strata will usually be dubious (individuals exposed or not exposed to differing patterns of confounding variables have differing baseline probabilities of disease making the same relative effect of a risk factor unlikely).If the causal relative risk varies over differing strata of confounders, formulae for PAF for binary and general exposure distributions are as follows: where RR C (c) = P(Y 1 =1|C=c) ) is regarded as the causal relative risk within confounder stratum C = c in (10) and RR C (x, c) = P(Y x =1|C=c) P(Y MREV =1|C=c) as the causal relative risk comparing exposure levels x and the MREV in confounder stratum C = c in (11).Note again that (10) can be re-expressed as: PAF = P(X = 1|Y = 1) , where RR e = P(Y 1 =1|X=1) P(Y 0 =1|X=1) is the causal relative risk restricted to the population exposed to the risk factor, as Miettinen originally showed in 1972.These formulae are derived in supplementary sections S2 and S6 and are equivalent to, and extend of in the case of (11), the formulae suggested in [17], and without a no-effect modification assumption require individual-level data (incorporating effect modification between risk factors and confounders) to estimate.When relative risks vary over confounder strata, the marginal causal relative risk, RR C (defined as the ratio of disease probabilities if everyone was exposed and everyone was exposed to the risk factor), and the causal relative risk in exposed individuals are both weighted average of confounder-strata specific causal relative risks, RR C (c) (see Supplementary Appendix S10).As a result, one would not expect large differences between (4) and (10) under moderate levels of effect modification.
While equivalent results to Eq. ( 4) exist for multi-category and continuous exposures (Eqs.( 7)-( 8)) these may be not as useful in practice as Eq. ( 4) as they require specification of unadjusted relative risks comparing many levels of exposure to baseline.As a result, Levin's formula (despite its bias) may forever be the method of choice for estimating PAFs and impact fractions for continuous exposures with summary data.However, the biases in Levin's formulae are often dismissed in this setting.If the formula is to be used it is important to recognise its bias and have some awareness of the likely extent of the bias.In this regard, if there is a range of values of the exposure which approximately minimises disease risk and it's plausible that the confounding parameter: C(X) = = 1) − P(Y 0 = 1) P(Y = 1).

Fig. 1
Fig. 1 Plots for absolute and relative error in Levin's formula when RR U = 1.5 and = 0.5 over a range for the confounding ratio C , assuming no effect-size modification on the relative risk scale.A: absolute bias, RR C ≥ 1 , B: absolute bias, no restriction on RR C , C: relative bias, RR C ≥ 1 , D: relative bias, no restriction on RR C .In the attached plots, the bias for the 'adjusted' version, PAF L = (RR C −1) 1+ (RR C −1) of Levin's formula is given in blue, while biases for the unadjusted version,PAF L,U = (RR U −1) 1+ (RR U −1) are given in red.The upper blue horizontal lines are the limiting absolute bias: 1 − RR U 1+ (RR U −1) (for plots A and of c and RR C ) PAF M (in terms of , RR U and RR C ) Binary risk factor

Table 1
Hypothetical population where a single binary confounder, physical activity, affects the both a risk factor, hypertension, and diseaseThe table shows the proportions of individuals in each of the 4 physical activity/hypertension subgroups, the observed probabilities of disease in each of these subgroups, and the counterfactual probabilities of disease in each subgroup if hypertension were absent from the population exposed to the risk factor, compared to an assumed counterfactual baseline disease probability of B under the hypothetical scenario that nobody was exposed.Note that the total risk in the population (the denominator of Levin's formula) can be approximated as baseline risk + approximate excess risk, which is B + RR C − 1 B. The baseline risks, B , then cancel out on the numerator and denominator of Levin's formula.However, this calculation only holds true if there is no confounding.In contrast, in the above example individuals who have hypertension are more likely to have low physical activity.Because low physical activity increases disease risk, these individuals will have a larger counterfactual risk, B′ > B under elimination of hypertension com- pared to the counterfactual risk in the general population, that is B , if hypertension were eliminated.The true excess risk in individuals with hypertension is actually