Background

The population attributable fraction (PAF) is the fraction of disease cases in a sample that can be attributed to the exposure. The PAF is an important measure of the public health impact of an exposure on disease burden, and thus it is useful for prioritizing public health interventions [1, 2]. The maximum likelihood method is commonly used to estimate the PAF. However, this estimation approach requires a correctly specified model for the probability of disease given the exposure and other covariates, subject to the ignorable treatment assignment assumption [3] (reviewed in the next section). Logistic regression is typically used to model the probability of disease given the exposure and the other covariates [1, 4, 5].

The PAF of an exposure must fall between 0 and 1 by definition. A PAF of zero indicates that the disease risk does not depend on the exposure level. In contrast, a higher PAF indicates a stronger association between disease risk and the exposure level. In the extreme, a PAF equal to 1 implies no disease risk in the absence of exposure. If the disease risk is higher in the absence of the exposure than in its presence, the PAF has no proper meaning. Thus, the definition of the PAF itself suggests a monotone relationship between the disease risk and the exposure level (a justification is presented at the end of the next section). Incorporating the monotonicity assumption into the estimation of the PAF provides performance gains when there are no other covariates [6].

In many research fields, the exposure is thought to have a monotone effect on the probability of having the disease, i.e., the probability is a monotone function of the exposure. For instance, a dose-response curve is often thought to be non-decreasing [7, 8]. One such example is the probability of suicidal ideation, which is thought to be an increasing function of both hopelessness and depression [9–12]. Hence, it is desirable to model the probability of suicidal ideation under a monotonicity constraint in both hopelessness and depression.

This situation also presents an analytic challenge. When there are other covariates, they can interact with the exposure, as in the interaction of two drugs [13]. In our example, hopelessness is a system of negative expectations concerning one’s future life and can cause depression. On the other hand, current depression can influence one’s hopelessness towards the future [14, 15]. Thus, hopelessness and depression may interact and have a joint effect on suicidal ideation.

These examples highlight how, in many studies, the effect of the exposure can be complicated and is not necessarily linear, as in the common analysis of drug interactions [16]. We bring together several past innovations to propose a novel analytic solution. [17] study the PAF when there are joint effects or interactions. [18] develop an approach for estimating logistic regression models with interactions and monotonicity constraints. The authors of [18] apply the approach to estimating the PAF and, in some settings, achieve substantially more accurate estimates than the usual approach, which uses logistic regression without monotonicity constraints.

Semiparametric approaches have been applied to estimate relative risk functions [19], to calculate odds ratios [20], and to estimate effect measures in the presence of interactions [21]. Herein, we use B-splines [22] (see also [23, 24] and references therein) to develop a semiparametric approach to estimate the PAF in the presence of interactions with confounding. The model fitting procedure is formulated as a well-studied quadratic programming problem and can thus be solved easily using standard optimization packages. We implement the approach using the R function solve.QP in the package quadprog [25, 26]. After a simulation study, we illustrate our new method by examining hopelessness and depression as risk factors for suicidal ideation among elderly depressed patients from the PROSPECT (Prevention of Suicide in Primary Care Elderly: Collaborative Trial) study [27]. Specifically, we model the interaction between hopelessness and depression under the monotonicity constraint.

Methods

We develop the semiparametric approach to estimating the PAF, which accounts for interactions under the monotonicity constraint using B-splines, in the last two subsections of this section. We compare the performance of the following three approaches: the approach we developed (monB), the conventional B-splines (conB) approach without the monotonicity constraint but with the same knots, and the logistic regression approach (logit). Comparisons are made through simulation studies and a case study of estimating the PAF for suicidal ideation attributable to hopelessness or depression. To save computation time, we use a small number of basis functions: quadratic B-splines with knots placed at quantiles of the distribution of the unique predictor values [28, p. 24]. We use the quantiles {0, 1/4, 1/2, 3/4, 1} as the knot locations.

Simulations

Let Y denote presence (1) or absence (0) of the outcome, Z the exposure, X a confounder, and V an additional covariate. The two interacting covariates Z and X are simulated independently from the Uniform [0,1] distribution, with a value of 0 always included among the simulated values of z. The additional covariate V is simulated from a Bernoulli distribution with p=1/2. The outcome is simulated from a Bernoulli distribution with one of the following probability functions, A: 0.1+0.4z+0.3x−0.2xz+0.2v; B: 2(0.1+2 log(z/2+1))x/3+0.3v; and C: \(0.8\sqrt {z+0.1}x^{2}+0.1v\). Shapes of the models at a fixed v are provided in Figure S1 of the supplementary material.
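The data-generating step described above can be sketched in a few lines. This is an illustrative Python sketch, not the authors' R code: the function names (`prob_A`, `prob_B`, `prob_C`, `simulate`) are hypothetical, and forcing the first simulated z to equal 0 is only one possible reading of the statement that a 0 value is always part of the simulated z.

```python
import math
import random


def prob_A(z, x, v):
    """Model A: P(Y=1 | z, x, v)."""
    return 0.1 + 0.4 * z + 0.3 * x - 0.2 * x * z + 0.2 * v


def prob_B(z, x, v):
    """Model B: P(Y=1 | z, x, v)."""
    return 2.0 * (0.1 + 2.0 * math.log(z / 2.0 + 1.0)) * x / 3.0 + 0.3 * v


def prob_C(z, x, v):
    """Model C: P(Y=1 | z, x, v)."""
    return 0.8 * math.sqrt(z + 0.1) * x ** 2 + 0.1 * v


def simulate(prob, n, seed=0):
    """Draw n tuples (y, z, x, v) from the given probability function."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        # Assumption: the first observation carries z = 0 so that an
        # unexposed value is always present in the sample.
        z = 0.0 if i == 0 else rng.random()
        x = rng.random()
        v = 1 if rng.random() < 0.5 else 0
        y = 1 if rng.random() < prob(z, x, v) else 0
        rows.append((y, z, x, v))
    return rows


sample = simulate(prob_A, n=200, seed=1)
```

All three probability functions stay inside [0,1] on the unit square for v in {0,1}, so each can be used directly as a Bernoulli success probability.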

We ran 1000 simulations for each of the sample sizes 100 and 200. We compared the absolute value of the bias, the variance, and the mean squared error (MSE) of estimating the PAF attributable to the exposure Z. The true PAF values for models A, B, and C are 0.3, 0.4404, and 0.4616, respectively. They are calculated from (4), where the integrals are computed using the R functions integral and integral2 [29].
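As a check on the true PAF values quoted above, the following sketch evaluates expression (4) for the three models, averaging over X ~ Uniform[0,1], Z ~ Uniform[0,1], and V ~ Bernoulli(1/2). It substitutes a simple midpoint rule in plain Python for the R functions integral and integral2 used by the authors; `true_paf` and the grid size `m` are hypothetical names and choices.

```python
import math


def prob_A(z, x, v):
    return 0.1 + 0.4 * z + 0.3 * x - 0.2 * x * z + 0.2 * v


def prob_B(z, x, v):
    return 2.0 * (0.1 + 2.0 * math.log(z / 2.0 + 1.0)) * x / 3.0 + 0.3 * v


def prob_C(z, x, v):
    return 0.8 * math.sqrt(z + 0.1) * x ** 2 + 0.1 * v


def true_paf(prob, m=200):
    """PAF = 1 - E[P(Y=1 | Z=0, X, V)] / P(Y=1), by the midpoint rule."""
    h = 1.0 / m
    mids = [(k + 0.5) * h for k in range(m)]

    def avg_over_v(z, x):
        # V is Bernoulli(1/2), so average the two values of v.
        return 0.5 * (prob(z, x, 0) + prob(z, x, 1))

    # Numerator: exposure set to 0, integrate over x (and v).
    numerator = h * sum(avg_over_v(0.0, x) for x in mids)
    # Denominator: marginal P(Y=1), integrate over z, x (and v).
    denominator = h * h * sum(avg_over_v(z, x) for z in mids for x in mids)
    return 1.0 - numerator / denominator


paf_A = true_paf(prob_A)
paf_B = true_paf(prob_B)
paf_C = true_paf(prob_C)
```

For model A the calculation can also be done by hand: P(Y=1) = 0.1+0.2+0.15−0.05+0.1 = 0.5 and the average risk at z=0 is 0.1+0.15+0.1 = 0.35, giving 1−0.35/0.5 = 0.3.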

Illustrative case study

In the PROSPECT study, we focus on suicidal ideation four months after the beginning of the study. We consider the 592 patients in the study with no missing data. An event was observed if the score for suicidal ideation was greater than zero [27]. Following [30], we use the Beck Hopelessness Scale [31, BHS] to measure hopelessness, and the Beck Depression Inventory score [12, BDI] to measure depression. The BHS and the BDI range from 0 to 19 and from 0 to 17, respectively, where a higher value means more hopelessness or depression.

Figure 1 shows the sample average of suicidal ideation by BDI and BHS scores, while Fig. 2 shows the corresponding patient frequency. Many patients with low BDI and BHS scores experienced no suicidal ideation. Overall, the risk of suicidal ideation increases with BDI and BHS scores, though the patient frequency decreases. The figures also show that high BHS scores are associated with high BDI scores. The PAF for hopelessness is the proportion of suicidal ideation that would be prevented if all patients’ hopelessness were reduced to 0 on the BHS scale while keeping BDI fixed. Similarly, the PAF for depression is the proportion of suicidal ideation that would be prevented if all patients’ depression were reduced to 0 on the BDI scale while keeping BHS fixed.

Fig. 1 Sample average of suicidal ideation by BDI and BHS scores

Fig. 2 Patient frequency by BDI and BHS scores

Semiparametric estimation of the probability

Suppose Y follows an exponential family distribution with mean μ [32]. The logit link function connects μ with the exposure and the confounder through a smooth function f as log[μ/(1−μ)]=f(z,x). [18] models f(z,x) parametrically, assuming linearity, as f(z,x)=β0+β1z+β2x+β3xz and estimates the coefficients under the constraint that β1z+β3xz ≤ β1z′+β3xz′ when z ≤ z′ at every x. We use a flexible semiparametric approach that models a possibly non-linear f(z,x) by linear combinations of B-spline basis functions [22] under the constraint f(z,x) ≤ f(z′,x) when z ≤ z′ at every x. Let \(\boldsymbol{\psi}(z)=\left(\psi_{1}(z),\ldots,\psi_{P}(z)\right)'\) be a set of B-spline basis functions. Let \(\boldsymbol{\phi}(x)=\left(\phi_{1}(x),\ldots,\phi_{Q}(x)\right)'\) be another set of B-spline basis functions. We model f(z,x) as

$$ {\begin{aligned} f(z,x)=\sum_{p=1}^{P}\sum_{q=1}^{Q} \psi_{p}(z)\,b_{p,q}\,\phi_{q}(x) \equiv \boldsymbol{\psi}(z)'\textbf{B}\boldsymbol{\phi}(x), \quad \textbf{B}[p,q]=b_{p,q}, \end{aligned}} $$
(1)

where B is the unknown coefficient matrix. Using Kronecker product ⊗ and the vectorization operator vec described in Section S1 of the supplementary material, we further obtain

$$ f(z,x)=\left(\boldsymbol{\phi}(x)'\otimes \boldsymbol{\psi}(z)'\right)\text{vec}\,\textbf{B}\equiv\boldsymbol{\varphi}(z,x)'\boldsymbol{\beta}, $$
$$ \text{where } \boldsymbol{\varphi}(z,x)'=\boldsymbol{\phi}(x)'\otimes \boldsymbol{\psi}(z)' \text{ and } \boldsymbol{\beta}=\text{vec}\,\textbf{B}. $$
(2)
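The identity behind (2), that the bilinear form in (1) equals a single inner product with vec B, can be checked numerically. The sketch below is illustrative only: it uses plain Python lists, made-up basis values, and a made-up coefficient matrix, and it assumes vec stacks the columns of B, which is the usual convention.

```python
def kron_row(phi, psi):
    """Row vector phi' (x) psi': entry q*P + p equals phi[q] * psi[p]."""
    return [phi_q * psi_p for phi_q in phi for psi_p in psi]


def vec(B):
    """Stack the columns of a P x Q matrix into a vector of length P*Q."""
    P, Q = len(B), len(B[0])
    return [B[p][q] for q in range(Q) for p in range(P)]


# Hypothetical basis values at one point (z, x): P = 3, Q = 2.
psi = [0.2, 0.5, 0.3]
phi = [0.6, 0.4]
# Hypothetical coefficient matrix B with entries b[p][q].
B = [[1.0, 2.0],
     [0.5, -1.0],
     [3.0, 0.0]]

# Left-hand side: the double sum in (1).
bilinear = sum(psi[p] * B[p][q] * phi[q] for p in range(3) for q in range(2))
# Right-hand side: the inner product in (2).
linear = sum(a * b for a, b in zip(kron_row(phi, psi), vec(B)))
```

Writing f as a linear function of the single vector β is what allows the fit to be posed with a standard design matrix whose rows are the vectors φ(x)'⊗ψ(z)' evaluated at each observation.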

The maximum likelihood estimate of β without the constraint can be obtained as the solution of a modified iteratively weighted least squares problem [33, 34]. The derivation in Section S2 of the supplementary material shows that the constraint f(z,x) ≤ f(z′,x) when z ≤ z′ at every x can be expressed as Aβ ≥ 0, where the matrix A has a special pattern. The derivation in Section S3 of the supplementary material shows that the estimation procedure can be expressed as a quadratic programming problem. The approach can also find the estimate under the additional constraint that f(z,x) is monotone in x.
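One way a linear inequality constraint of this kind can arise is sketched below. It is an assumption, not a reproduction of Section S2: because B-spline basis functions are non-negative, a commonly used sufficient condition for f to be non-decreasing in z is that the coefficients satisfy b(p+1,q) − b(p,q) ≥ 0 for every q. With β = vec B (columns stacked), these first differences form a block-diagonal matrix. The function names are hypothetical.

```python
def monotone_constraint_matrix(P, Q):
    """Rows encode b[p+1][q] - b[p][q] >= 0 for beta = vec(B), B of size P x Q."""
    rows = []
    for q in range(Q):
        for p in range(P - 1):
            row = [0.0] * (P * Q)
            row[q * P + p] = -1.0      # coefficient of b[p][q]
            row[q * P + p + 1] = 1.0   # coefficient of b[p+1][q]
            rows.append(row)
    return rows


def satisfies(A, beta, tol=1e-12):
    """Check A beta >= 0 row by row."""
    return all(sum(a * b for a, b in zip(row, beta)) >= -tol for row in A)


A = monotone_constraint_matrix(P=3, Q=2)
beta_ok = [0.0, 1.0, 2.0, 1.0, 1.0, 5.0]    # each column of B is non-decreasing
beta_bad = [0.0, 1.0, 0.5, 1.0, 1.0, 5.0]   # first column decreases at the end
```

A matrix of this shape has (P−1)Q rows and PQ columns, and it is in the "inequality greater than or equal to zero" form that quadratic programming routines such as solve.QP accept.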

Estimation of the PAF

To be general, let X be a vector of measured covariates. Let Y0 denote the disease status (presence or absence) that would be observed if the exposure were eliminated. Let P be the probability measure. In particular, let P(Y0=1) be the “hypothetical probability of disease in the same population but with all exposure eliminated” [35, 36]. The PAF for the exposure Z is the proportion of disease that would be eliminated if Z were set to 0,

$$ PAF=1-\frac{P(Y_0 = 1)}{P(Y=1)}=1-\frac{\int P(Y_0=1|\textbf{X}=\textbf{x}) d P}{P(Y=1)}. $$
(3)

To identify the PAF from observed data, we assume that all confounders of the disease-exposure relationship are measured and contained in X. Consequently, we assume that the ignorable treatment assignment assumption [3] holds: P(Y0=1|X=x)=P(Y0=1|X=x,Z=0)=P(Y=1|X=x,Z=0). Under this assumption, the PAF can be written as

$$ PAF=1-\frac{\int P(Y=1|Z=0,\textbf{X}=\textbf{x}) d P}{P(Y=1)}, $$
(4)

which is equal to expression (2.2) for the PAF in [36]. The PAF can also be written as

$$ PAF=\int \left[1-\frac{P\left(Y=1|Z=0,\textbf{X}=\textbf{x}\right)}{P\left(Y=1|Z=z, \textbf{X}=\textbf{x}\right)}\right] d P(\cdot |Y=1), $$
(5)

where P(·|Y=1) is the conditional probability given Y=1 in the subpopulation of people with the disease [1, 5, 37]. We provide a proof of the equivalence between (4) and (5) in Section S4 of the supplementary material.
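For orientation, one short route from (5) to (4) is sketched here. It is not necessarily the argument of Section S4, and it assumes that P(Y=1|Z=z,X=x)>0 wherever the integrand is evaluated. By Bayes' rule, dP(z,x|Y=1)=P(Y=1|Z=z,X=x) dP(z,x)/P(Y=1), so that

$$ {\begin{aligned} \int \left[1-\frac{P(Y=1|Z=0,\textbf{X}=\textbf{x})}{P(Y=1|Z=z,\textbf{X}=\textbf{x})}\right] dP(z,\textbf{x}|Y=1) &= 1-\int \frac{P(Y=1|Z=0,\textbf{X}=\textbf{x})}{P(Y=1|Z=z,\textbf{X}=\textbf{x})}\cdot\frac{P(Y=1|Z=z,\textbf{X}=\textbf{x})}{P(Y=1)}\, dP(z,\textbf{x}) \\ &= 1-\frac{\int P(Y=1|Z=0,\textbf{X}=\textbf{x})\, dP(\textbf{x})}{P(Y=1)}, \end{aligned}} $$

where the last step integrates z out of the joint distribution of (Z, X), leaving the marginal distribution of X as in (4).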

Given a random sample of size n from the population, indexed by i=1,…,n, an estimate of the PAF follows as

$$ {\begin{aligned} \hat{PAF}=\frac{1}{\sum_{i=1}^{n} I_{\{Y_{i}=1\}}}\sum_{\left\{ i: Y_{i}=1, 1\leq i\leq n\right\}} \left[1 -\frac{\hat{P}\left(Y_{i}=1|Z_{i}=0,\textbf{X}_{i}=\textbf{x}_{i}\right)} {\hat{P}\left(Y_{i}=1|Z_{i}=z_{i},\textbf{X}_{i}=\textbf{x}_{i}\right)}\right], \end{aligned}} $$
(6)

where \(I_{\{Y_{i}=1\}}\) is an indicator function [18]. Justification of the convergence of the estimate (6) to the PAF in (5) is provided by [6] for the scenario with no other covariates. A similar proof adjusting for X can also be derived. Thus, an accurate estimate of the probability of the disease would provide a reasonable estimate of the PAF. The PAF defined in (3) has no practical meaning as a proportion if P(Y0=1)>P(Y=1). Similarly, in (6), occurrences of \(\hat {P}\left (Y_{i}=1|Z_{i}=0,\textbf {X}_{i}=\textbf {x}_{i}\right)>\hat {P}\left (Y_{i}=1|Z_{i}=z_{i},\textbf {X}_{i}=\textbf {x}_{i}\right)\) can result in an estimated PAF outside the [0,1] range. Hence, estimating the probability under the monotonicity constraint ensures a reasonable estimate of the PAF.
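The estimator in (6) is an average, over the observed cases only, of one minus a ratio of fitted probabilities. The sketch below illustrates this with a single covariate x; `paf_estimate`, `toy_prob`, and the toy data are hypothetical, and any fitted model that returns an estimate of P(Y=1|Z=z,X=x) could be plugged in.

```python
import math


def paf_estimate(data, fitted_prob):
    """Estimator (6). data: iterable of (y, z, x); fitted_prob(z, x) -> probability."""
    cases = [(z, x) for y, z, x in data if y == 1]
    if not cases:
        raise ValueError("the sample contains no cases")
    total = sum(1.0 - fitted_prob(0.0, x) / fitted_prob(z, x) for z, x in cases)
    return total / len(cases)


def toy_prob(z, x):
    """Hypothetical fitted model: logistic and non-decreasing in z."""
    return 1.0 / (1.0 + math.exp(-(-2.0 + 1.5 * z + 1.0 * x)))


def decreasing_prob(z, x):
    """Hypothetical fitted model that violates monotonicity in z."""
    return 0.5 - 0.3 * z


data = [(1, 0.8, 0.3), (0, 0.1, 0.5), (1, 0.0, 0.9), (1, 0.5, 0.5), (0, 0.4, 0.2)]
paf_hat = paf_estimate(data, toy_prob)
paf_bad = paf_estimate([(1, 1.0, 0.0)], decreasing_prob)
```

With a fitted probability that is non-decreasing in z, every term in the sum lies in [0,1], so the average does as well. The second call shows how a fitted probability that decreases in z produces a negative estimate.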

Results

Simulation

Table 1 summarizes the proportion of times that the estimated PAF was within the [0,1] range for the three approaches. As discussed, the monotonicity constraint ensures that the estimated PAF is within [0,1]. Without the constraint, the logistic regression and the conventional B-splines approaches sometimes produce PAF estimates outside this range.

Table 1 Proportion of the times out of the 1000 simulations that the estimated PAF is between 0 and 1

Table 2 shows the performance of the three approaches in estimating the PAF. In the comparison, the original results from the conventional B-splines approach are not shown because of their large scale. Instead, we show the results from this approach after censoring the original estimate at 0 if it is negative, or at 1 if it is greater than 1. Similarly, we obtain the censored estimate from the logistic regression approach. In general, the censored estimate improves on the original estimate. Overall, the developed approach outperformed the other approaches in the settings that we examined.

Table 2 Comparison of the absolute value of the bias (|Bias|), the variance and the MSE of estimating the PAF among the logistic regression approach (logit), the conventional B-splines approach (conB), and the developed approach (monB)

Case study

The estimated probability of suicidal ideation is shown in Fig. 3 separately for the three approaches. Both the logistic regression approach and the developed approach show a monotone relationship. Fitted probabilities that are numerically 0 or 1 occurred with the conventional B-splines approach. Table 3 summarizes the PAF estimated by the three approaches. The logistic regression approach produces a lower PAF attributable to BHS and a higher PAF attributable to BDI. Because of numerical instability, the PAF estimated by the conventional B-splines approach falls outside the [0,1] range. We examine the numerical instability in the “Discussion” section.

Fig. 3 Estimated probability of suicidal ideation by the logistic regression approach (top), the conventional B-splines approach (middle), and the developed approach (bottom)

Table 3 Estimated PAF attributable to BHS and to BDI by the logistic regression approach (logit), the conventional B-splines approach (conB), and the developed approach (monB). The 95% confidence intervals are obtained from 2.5% and 97.5% quantiles of 1000 bootstrap estimates

We use the bootstrap method to obtain the 95% confidence intervals of the estimates. The lower bound is the 2.5% quantile of the 1000 bootstrap estimates and the upper bound is the 97.5% quantile. The lower bound of the logistic regression estimate of the PAF attributable to BDI is negative. In fact, 28 of the 1000 bootstrap estimates are negative, with a minimum of −32.45%. Two of the 1000 bootstrap estimates of the PAF attributable to BHS are negative, with a minimum of −4.27%.
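The percentile bootstrap interval described above can be sketched as follows. This is an illustration in plain Python, not the authors' code: `crude_paf` is a hypothetical stand-in estimator for a binary exposure (one minus the risk among the unexposed divided by the overall risk), the toy data are made up, and in the actual analysis the full model would be refitted on every resample.

```python
import random


def quantile(sorted_values, q):
    """Linear-interpolation quantile of an already sorted list."""
    pos = q * (len(sorted_values) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] * (1.0 - frac) + sorted_values[upper] * frac


def crude_paf(data):
    """Stand-in estimator. data: list of (y, exposed) pairs with binary entries."""
    overall_risk = sum(y for y, _ in data) / len(data)
    unexposed = [y for y, exposed in data if exposed == 0]
    return 1.0 - (sum(unexposed) / len(unexposed)) / overall_risk


def bootstrap_percentile_ci(data, estimator, n_boot=1000, seed=0):
    """95% interval from the 2.5% and 97.5% quantiles of bootstrap estimates."""
    rng = random.Random(seed)
    n = len(data)
    estimates = sorted(
        estimator([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    return quantile(estimates, 0.025), quantile(estimates, 0.975)


# Toy data: 100 unexposed with 20 cases, 100 exposed with 50 cases.
toy = [(1, 0)] * 20 + [(0, 0)] * 80 + [(1, 1)] * 50 + [(0, 1)] * 50
point = crude_paf(toy)
lower, upper = bootstrap_percentile_ci(toy, crude_paf)
```

Because the interval endpoints are quantiles of the bootstrap estimates themselves, an unconstrained estimator that occasionally returns negative values can yield a negative lower bound, as reported for the logistic regression approach.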

Discussion

We used B-splines to develop a semiparametric estimate of the probability of an event under the monotonicity constraint and accounting for interactions. The estimation is formulated as a quadratic programming problem and can be easily implemented in the statistical software R. Using a boosting technique, [38, 39] implement a similar approach to estimate β under the same constraint Aβ ≥ 0 to solve the problem defined in equation (3) of the supplementary material. In the settings that have been tried, a comparison of methods for fitting a univariate generalized linear model under the monotonicity constraint showed that the boosting algorithm is computationally intensive [23]. A summary of other approaches in the literature for estimation under monotonicity constraints and interactions is provided in Section S8 of the supplementary material.

Throughout our study, we placed the knots at the quartiles of the distribution of the unique predictor values. We observed similar performance patterns for the conventional B-splines approach and the developed approach when the knots are placed at the tertiles {0, 1/3, 2/3, 1}. In the simulation study, Tables S1 and S2 of the supplementary material show that overall the developed approach has better performance. In the case study, using the developed approach, the estimated PAF attributable to BHS was 67.66% (43.41%, 95.13%), and the estimated PAF attributable to BDI was 23.51% (10.92%, 55.62%) (Table S3). Similar numerical instability was observed using the conventional B-splines approach (Figure S2 and Table S3). The PAF estimated by the conventional approach is also outside the defined range between 0 and 1 when knots are placed at the quantiles {0, 1/2, 1} (Table S3). The performance of the developed approach is robust to knot placement.

Conclusions

We developed a semiparametric estimate of the probability of an event using B-splines. Our approach can model a monotone relationship between the response and covariates, and can account for interactions. We applied the approach to estimate the PAF, and compared the performance of the estimator with the logistic regression approach and the conventional approach without the monotonicity constraint. Simulation studies showed that the developed estimator outperforms the other two approaches.