Introduction

Using observational data to compare outcomes associated with different treatments may result in biased estimates because of non-random treatment assignment. To correct for variables that may confound an association, the traditional approach is to apply multivariable-adjusted modeling, but in recent years, use of propensity scores has become increasingly popular [1]. The concept of a multivariate confounder score was first introduced by Miettinen in 1976 [2], but the formal concept of propensity scores to estimate causal effects in observational studies was first described by Rosenbaum and Rubin [3]. A propensity score is a conditional probability of assignment to a particular treatment given a vector of baseline covariates. Except for unmeasured potential confounding factors, two patients having the same propensity score but assigned to different treatments are considered to be equivalent to a random assignment of treatment. Thus, adjustment for the propensity score in the outcome model can balance the observed and included covariates and remove bias that may arise due to these confounders. This adjustment can be accomplished by either 1) selecting matched pairs of patients each on a different treatment arm, but with similar propensity scores, 2) stratifying the sample on the propensity score, calculating the treatment effect within strata and then pooling the strata-specific treatment effect estimates, or 3) including the propensity score itself as a covariate in the outcome model.

Several advantages of propensity score-stratified versus traditional multivariable-adjusted modeling have been suggested. The propensity model does not need to be parsimonious and easy to understand because it is not the focus of the study [4]. Furthermore, the propensity score enables a direct estimation of comparability of the treatment groups by assessing the covariate balance between groups. Inability to balance confounders alerts investigators that the treatment groups are not sufficiently overlapping with respect to these confounders and that selection bias may not be resolvable [4]. Traditional multivariable regression modeling will not detect this directly.

Patients with end stage renal disease (ESRD) require renal replacement therapy (RRT). Of all therapeutic options, renal transplantation is generally associated with the highest survival and quality of life. However, due to the shortage of organs, the majority of ESRD patients are treated with renal dialysis. Two main forms of renal dialysis can be distinguished: hemodialysis (HD) and peritoneal dialysis (PD). Many factors influence dialysis treatment assignment: not only the clinical characteristics of a patient, but also patient and physician preference, cultural factors and reimbursement policy decisions may play a role. Therefore, comparison of patient survival on HD and PD is complicated. Because the one randomized controlled trial that has been undertaken to assess survival differences had to be stopped prematurely because of low inclusion rates [5], observational studies have to be relied upon to compare survival on HD versus PD. Our aim was to compare a propensity score-stratified model with a traditional multivariable-adjusted model, specifically in estimating survival of hemodialysis (HD) versus peritoneal dialysis (PD) patients to assess the possible advantages of using a propensity score.

Analysis

Methods

Patients

We included all incident patients who started RRT between January 1st 1987 (start of prospective registration) and December 31st 2002 from the Dutch End Stage Renal Disease Registry (RENINE). We excluded patients younger than 18 years, patients who underwent RRT for less than 30 days, patients who had more than one episode of recovery of renal function, or who died directly following a period of renal recovery, patients who received a pre-emptive transplantation, patients who died during the first 90 days of renal replacement therapy and patients from centers treating fewer than 20 dialysis patients or fewer than 5 PD patients. The outcome of interest was all-cause mortality, as registered by RENINE. The registry collects information on date and cause of death and verifies its information yearly with all centers [6, 7]. From registry data we also determined age and gender of patients, baseline dialysis modality, year of first dialysis, and the center at which dialysis was started. Modality switches among HD, PD, and kidney transplantation over time were also recorded. In the database, primary renal diagnosis was coded according to the classification of the European Renal Association-European Dialysis and Transplantation Association (ERA-EDTA). After examining previously published disease categories and hazard rates in the Dutch registry, we aggregated these into five categories: glomerulonephritis (PRD-GN), hypertension (PRD-HT), renovascular disease (PRD-RVD), diabetes mellitus (PRD-DM) and a category for all other renal diagnoses (PRD-OTH).

Statistical modeling

We adopted an intention-to-treat perspective and, as is customary in previously published analyses, considered the dialysis modality on day 91 to be the definite modality. We left-censored survival time for the first 90 days and right-censored at first transplantation or December 31st 2002, whichever occurred first. To estimate the independent comparison between PD and HD mortality by controlling for observed potential confounders, we explored two analytical options: a propensity score-adjustment approach and the traditional multivariable-adjustment approach.

The propensity-score approach involved a two-step approach. First, we predicted PD versus HD treatment assignment by constructing a logistic regression model that estimated treatment assignment using all available variables, as well as age-squared, age-cubed, and all possible 2-way interactions between the database-variables. As explained earlier, this model did not need to be parsimonious nor easy to understand, because it was not the focus of the study. The model calculated the expected probability or propensity score of each patient being assigned to PD, accounting for that individual's baseline characteristics. The propensity score was then evaluated for the following criteria: 1) a reasonable Nagelkerke's r2-statistic as a measure of fit and a c-statistic between 0.65 and 0.85 as a measure of discriminative power, 2) good calibration as measured by the PS-predicted and observed proportion of PD patients within quintiles of the propensity score, and 3) balanced covariates within quintiles of the propensity score [8]. This third criterion is most important for assessing the appropriateness of the PS-model [9]. In the second-step, estimating the effect of treatment assignment on outcome adjusted for the propensity score, we stratified a Cox model containing dialysis modality as the only independent variable on intervals of 0.01 of the propensity score. Alternative techniques to adjust for the propensity score include matching or regression. However, regression is affected by measurement errors in the propensity score [10]. Furthermore, it assumes a linear relationship between the propensity score and the natural logarithm of the hazard. Matching or stratification techniques do not assume such a relationship, but matching entails exclusion of patients because of the unavailability of a match. Stratification on intervals of 0.01 closely resembles matching, but because the number of patients in either exposure group within a stratum may vary, only few patients will need to be excluded. In our analyses, ten strata not containing either HD or PD patients were excluded.

In the alternative multivariable-adjustment approach, the calculation of the relative mortality of PD patients compared with HD patients was conducted by entering observed characteristics as covariates into the survival regression model and thereby adjusting for potential confounders. The first step in this approach was to estimate univariable Cox proportional hazards models for all available variables. Age and year of start of dialysis were entered into the model as continuous variables and all other variables as categorical variables. All statistically significant variables (P < 0.05) from the univariable analyses were introduced into a multivariable main-effects Cox proportional hazards model. From this multivariable model, we explored the significance of a quadratic term (age) and several two-way and three-way interaction terms. We tested for center effects by entering center as a categorical variable into the multivariable model. Finally, we compared the hazard ratios (HR) for mortality with PD versus HD from the propensity score-stratified and the multivariable-adjusted models.

Results

Patients

The RENINE Registry prospectively collected data of 20,687 patients who started RRT between January 1st 1987 and December 31st 2002. We discarded 4,044 patients that did not meet inclusion criteria. As a result, our final sample included 16,643 patients from 47 centers. Mean age was 59 years (standard deviation, SD: 15.3) and 58.8% were male. Additional descriptive characteristics are shown in Table 1.

Table 1 Baseline characteristics of study cohort

Propensity score analysis

The propensity score model containing age, age2, age3, all other variables and all possible two-way interactions had a Nagelkerke's r2 of 0.240 and a c-statistic of 0.752. Leaving all non-significant variables out of the model did not alter these quality indicators substantially.

The propensity score in quintiles showed good calibration. The mean propensity scores (the probability of receiving PD) were 9.1%, 21.4% 32.9%, 46.3% and 64.6% for each quintile, respectively and were very similar to the actual proportions of patients on peritoneal dialysis in all quintiles (see Figure 1). Furthermore, the propensity score balanced the covariates between the HD and PD groups except for a slight (1.4 year) difference in age within the fifth quintile and in the starting year within the second and the fifth quintile (Table 2). Stratifying the univariable Cox model on 0.01 intervals of the propensity score yielded no difference in mortality risk between PD and HD patients (HR = 0.97; 95%CI 0.92-1.02) (Table 3).

Table 2 Baseline characteristics of study cohort, by quintile of propensity score
Table 3 Multivariable-adjusted and propensity score-stratified models without interaction variables
Figure 1
figure 1

Mean predicted and observed probability of peritoneal dialysis assignment per quintile of the propensity score. Predicted (%): probability of assignment to peritoneal dialysis as predicted by the propensity score; Observed (%): actual prevalence of peritoneal dialysis assignment.

Multivariable regression analysis

In the unadjusted Cox model, patients receiving PD had a 30% lower mortality compared with HD patients (HR = 0.70; 95%CI: 0.67-0.74; p < 0.001). The coefficients of all other univariable models were also statistically significant (p-values ranging from < 0.001 to 0.02), both in the overall population and in the HD and PD groups separately. The coefficient for the year of starting RRT was not significant in the total population, because with increasing year of start of RRT, the relative risk of dying increased for HD patients and decreased for PD patients.

In contrast to the univariable model, the multivariable Cox model, adjusted for main effects of age, gender, primary renal disease, year of first RRT and treatment center but without interaction terms revealed that mortality of PD patients and HD patients did not differ significantly (Table 3). The HR of PD compared with HD patients was 0.99 (95%CI: 0.94-1.05), which did not differ from the relative risk estimated by the propensity score-stratified model (HR 0.97; 95%CI 0.92-1.02). Note however that the propensity score model involved only one covariate as opposed to nine in the multivariable Cox model.

Exploration of effect modification

The constructed Cox models did not consider the possibility of effect modifiers on outcome. When tested in the multivariable model, four interaction variables were statistically significant: two with modality (age by modality (HD or PD) and diabetes as the primary cause of renal disease (PRD-DM) by modality), and two other interaction variables (age by PRD-DM and gender by PRD-DM). After entering these interaction variables into both the propensity score-stratified and the multivariable-adjusted model (Table 4), the hazard ratios of dialysis modality and all interaction variables with dialysis modality were statistically significant in both models. As before, the results from the propensity score-stratified and multivariable-adjusted models were essentially identical. The propensity score-stratified model, however, included almost the same number of variables as the multivariable-adjusted model. The only two additional variables in this multivariable-adjusted model were year of first RRT and dialysis center. Hazard ratios for the combinations of the interaction variables as estimated by the multivariable-adjusted model are presented in Table 5. They show a relative survival benefit of PD compared with HD that diminishes with increasing age and in the presence of diabetes. Since the proportionality assumption was not satisfied, time-stratified analyses are presented. The clinical issues associated with these findings are discussed more in depth elsewhere [11].

Table 4 Multivariable-adjusted and propensity score-stratified models with interaction variables
Table 5 Associations between dialysis modality and mortality for various combinations of the effect modifiers

Conclusion and Discussion

In this study of nearly all patients who initiated chronic dialysis treatment between 1987 and 2002 in The Netherlands, we developed a propensity score that fulfilled accepted quality criteria: it showed a reasonable fit, had a good c-statistic, calibrated well and balanced the covariates. The Cox model that solely stratified on propensity score yielded essentially identical effect estimates of PD versus HD mortality compared with the multivariable-adjusted model while having the advantage of containing only one as opposed to nine covariates, thus being more parsimonious. When excluding interaction terms, dialysis modality was not an independent predictor of mortality in either model, but both models were misspecified, because effect modification was present. After introducing both interaction terms and all corresponding main effect variables to account for effect modification, the propensity score-stratified model contained almost the same number of covariates as the multivariable-adjusted model. When the models included interaction variables, all covariates remained independent predictors, but now dialysis modality besides its interaction variables became statistically significant. Supporting our findings, the identified effect modifiers, age and diabetes as primary renal disease, correspond to those found in previous studies [1215].

Our study informs the discussion of the utility of propensity score in outcomes research. In theory, the use of propensity score-stratified modeling may allow for a more straightforward estimation of the relative mortality risk in comparison with multivariable-adjusted modeling. However, our study shows that neglecting effect modification in propensity score-stratified models may lead to erroneous conclusions. Incorporating effect modification however removes the direct interpretability of the main treatment effect one wishes to estimate, thereby limiting one benefit of using a propensity score. Still, Sturmer and colleagues [16] suggest that the propensity score-adjustment approach may yet have an advantage over traditional methods when effect modification is present, because it allows for a summary effect size across all strata of the effect modifiers. This can be relevant in pharmacoepidemiology, to estimate how a total population might benefit from a particular drug. However in the setting of end-stage renal disease, the assignment of a patient to a specific dialysis modality should be tailored to a patient's specific pretreatment characteristics and should not be determined by the summary effect size across all strata of the effect modifiers.

Other studies that compared propensity score-stratified versus multivariable-adjusted modeling have been reviewed by Shah and colleagues [4] and Sturmer and colleagues [1]. Similar to our findings, both studies concluded that propensity score-stratified modeling rarely led to a different result compared with multivariable-adjusted modeling. Choosing which method may depend on the quantity of data. Cepeda and colleagues report from their simulation study that with eight or more outcomes per confounder, the multivariable-adjusted logistic regression model showed better precision [17]. However, with fewer than eight events per confounder, propensity score-stratified modeling performed better. Furthermore, the choice also depends on the research question, as suggested by Kurth and colleagues [18]. They showed that when there is a non-uniform treatment effect, different adjustment methods can result in divergent results, which may all be correct but depend on the research question implied by the adjustment method. Propensity score, as opposed to traditional multivariate modeling, enables a direct estimation of comparability of the treatment groups by assessing the covariate balance between groups [4].

For the propensity score-adjustment approach, there are no accepted rules for construction and evaluation of the propensity score model. Evaluation of a propensity score model often consists of assessing discrimination with the c-statistic and calibration with goodness-of-fit tests. However, Weitzen and colleagues in their simulation study showed that neither the c-statistic nor the Hosmer-Lemeshow goodness-of-fit test was sensitive to omission of an important confounder from the propensity score model [19].

Failure to include important confounders can lead to biased estimates of the treatment effect [20]. Austin and colleagues reported that propensity scores estimated on administrative data might not balance all clinical characteristics [21]. This could be particularly relevant to our study, because the RENINE database is administrative and does not contain clinical data, in particular co-morbidity data. Information on primary renal diagnosis however is available with the most important co-morbid condition, diabetes, likely well-represented because of the strong correlation between diabetes and diabetes as primary renal disease (PRD-DM). Further, Weitzen and colleagues [19] report that omitting a confounder in a propensity model has little effect on the treatment effect estimate. This could imply that the propensity score is fairly robust to unobserved confounders if, as also reported by Rosenbaum [3], at least some of the key variables that explain treatment assignment are included in the score. Moreover, omitting a confounder also leads to biased estimates when using a multivariable-adjusted model.

To summarize: if propensity score models are constructed well and no important confounders are missing, a treatment effect with a reliable significance level can usually be estimated, with the advantage of a more parsimonious outcome model and the advantage of assessing covariate balance between treatment groups explicitly. When the outcome is rare, propensity score-adjustment yields effect size estimates with a higher precision. Reviews of studies applying propensity score-adjustment methods have shown however, that propensity score-stratified modeling was often not implemented or reported appropriately [1, 4, 10]. Researchers should carefully assess whether propensity score-adjustment methods are appropriate for their specific situation.

From our study, we conclude that although the propensity score performed well, it did not alter the treatment effect in the outcome model and lost its advantage of parsimony because effect modification was present. Thus, using a model of mortality of patients on renal replacement therapy as a special case study, our study contributes to the growing literature supporting the comparability of traditional multivariable regression and propensity score methods unless sample size is small and outcome is rare.