Background

The two most important agendas to improve United States (US) healthcare are to enhance access to care and increase the quality of care delivered in an equitable manner [1,2,3,4]. Although the United States healthcare expenditures far exceeds other developed countries, the US ranks 30th in terms of morbidity and mortality [5]. Despite efforts to expand delivery of health care through the Affordable Care Act [6], 12% of the population remained uninsured in 2016 with difficult access to primary care [7]. Furthermore, middle and lower socioeconomic classes are less likely to have a regular source of care, less likely to receive preventive services, and more likely to experience delays in their care [8,9,10,11]. Studies have also found evidence for widening racial gap in health [12]. Social inequalities in access to healthcare persist in the US healthcare delivery system even with the expansion of Medicaid [8, 13, 14].

Medical care alone cannot adequately improve health or address health disparities by socioeconomic status (SES) in the US [15], and thus it is imperative to clearly delineate the causal relationship between a patient’s SES and the quality of care received, in light of the recent government proposal for cutbacks on Medicaid [16, 17]. Many studies have investigated the influences of various social determinants of health, but few have recognized the longitudinal, compounding effect of SES on health [15, 18, 19]. Furthermore, the confounding bias between variables of SES, race, and health outcomes were not investigated in the analysis [8, 9, 12, 20,21,22,23,24,25,26]. In fact, SES and race are deeply intertwined in their development and longitudinal continuum through multiple familial generations in the US [12, 15]. Even the direction of causality between SES, race, and health has been in debate [12, 19]. In addition, SES is correlated with the type of insurance coverage, which further leads to disparities in access to care and health outcomes [10]. Experts state that the confounding effect among race, SES, and health in the US cannot be sufficiently eliminated by statistical means [12].

In contrast to our racially diverse nation with a multi-payer insurance system, Taiwan’s population sample is homogeneous for race and insurance type, owing to the establishment of a universal, single-payer health insurance system in 1995 [27]. Covering over 99% of the residents, the National Health Insurance (NHI) provides easily accessible, equitable care [27, 28]. Utilizing the NHI Database, we designed a population-based study with marginal structural modeling to more accurately investigate the longitudinal causal influence of SES on the quality of care in the community. Regarding income as a measure of SES, we hypothesized that without the confounding bias from race and insurance type, income is not a causal factor in determining the quality of care received. Results from this study will direct health policy towards improving quality of care by providing appropriate support for patients in lower SES.

Methods

Data source

As one of the largest and most comprehensive national-level population databases in the world [29], the NHI Database contains healthcare records of 30 million residents of Taiwan, including inpatient, outpatient, and pharmacy services [30]. The NHI Database is ideal for longitudinal epidemiological investigation because each beneficiary of the NHI has a unique identification number consistent across all datasets, and can be followed through multiple clinical encounters [31]. To maximally retain the population heterogeneity to reflect the real-world impact of SES, we selected a random 25% sample from the eligible adult population in 2000 from the NHI Database by random sampling method based on Floyd’s ordered hash table algorithm to ensure an equal probability of the eligible population being selected. We also excluded patients with missing data before 2006 for appropriate censoring and to ensure two-waves worth of longitudinal data at minimum (Fig. 1). This study was approved by the Institutional Review Board at the Chang Gung Memorial Hospital.

Fig. 1
figure 1

Selection flow for 25% random sample from Taiwan population

Overview of marginal structural modeling

Since patients’ socioeconomic status is a time-varying variable, we adopted marginal structural model (MSM) analysis to investigate the causal relationship between patients’ socioeconomic status as defined by income levels and the quality of care received measured by the rate of preventable hospitalization and the Elixhauser Comorbidity Index (ECI). MSM eliminates the confounding bias in estimating this causal relationship, where the other confounding effects vary over time from wave to wave, and these confounders may act as intermediate variables resulting from previous socioeconomic statuses at the same time [32, 33]. The observational data lack counterfactual outcomes that would have occurred under the opposite exposure level or treatment decision [34]. Thus, cross-sectional observational studies use stratification or multivariable regression analyses to address potential confounding and make causal inferences. However, with longitudinal data for which the time-varying confounders (variables that are associated with both treatment and the subsequent outcome) and treatment by indication (variables affected by prior exposure and affect future exposure levels) co-exist, these traditional analyses cannot account for the dynamic effects of the time-varying covariates appropriately to make a causal inference [35]. MSM provides an approach to balance out the potential confounding effects in a longitudinal study by creating a pseudo-population to mimic the data from a sequentially-randomized trial, and use that to estimate the average effect of a treatment or an exposure on potential outcomes [36]. We chose to use MSM to answer our research question because of these advantages that permit unbiased assessment of the causal effects of SES on the quality of care received in a longitudinal study. We conducted two separate models, each with a different outcome variable: one with the rate of preventable hospitalization and the other with the ECI.

Variables of interest

We considered preventable hospitalization and comorbidity as representative outcome measures of the quality of care in the community [9]. A patient was considered to have had a preventable hospitalization if the primary diagnosis code associated with the inpatient stay was included in the set of diagnosis and procedure codes defined by the Agency for Healthcare Research and Quality for 11 of ambulatory care sensitive conditions [37] for which hospitalizations can be avoided with good outpatient care [20, 21, 38, 39]. We calculated each patient’s Elixhauser comorbidity index (ECI) [40] by defining it as having ≥2 ambulatory visits or one hospital admission with a corresponding diagnosis code to ensure the validity of indices greater than zero. The first model with preventable hospitalization as outcome included ECI as a covariate, and was calculated as a categorical variable with 3 levels (0, 1–3, and ≥ 4). In the second model with ECI as outcome, ECI was defined as a continuous variable and calculated at years 2004, 2007, 2010, 2013, and 2016; thus ECI was excluded as a covariate from this model. Demographic and clinical covariates included in the analysis were patients’ sex, age, income level, occupation type, urbanization of area of residence, number of outpatient visits, number of hospital admissions, and the physician density of the area of patients’ residence. Occupation categories defined by the NHI program enrollment protocol [30] and income levels in Taiwan dollars were collected directly from the NHI Database and were converted to US dollars by the mean exchange rate during each corresponding calendar year [41]. We compared demographic characteristics by income groups for each time wave by analysis of variance test and chi-squared test. In our MSM analysis, values at year 2000 for each independent variable was defined as baseline values. Then, the baseline values were used to calculate the time-varying variables at each wave by taking the average value during each time interval for continuous variables and the mode for categorical variables at each wave and two years before each wave. For example, for the number of outpatient visits variable, the average value between 2001 and 2003 was considered as the number of outpatient visits in 2003. To create income groups at each wave, the average income during the time-period up to each wave-year was used. For example, the average income from 2001 to 2008 was used to define the income level at 2009.

Inverse probability of treatment weights

MSM can control the confounding effect of time-dependent confounders without over-adjusting by applying inverse probability of treatment weights (IPTW) [42]. At each time-point of follow-up, the probability of each patient receiving the treatment/exposure (or not receiving the treatment/exposure, whichever that actually took place) is estimated based on the baseline and time-varying covariates up to that time-point [42]. Then, patients are weighted by the inverse of their predicted probabilities of receiving the observed treatment/exposure to create a pseudo-population without the covariate imbalances. Under-represented subjects, given their previous covariate values and treatment history, receive proportionally higher weights, and vice-versa for over-represented subjects. In this pseudo-population, the potential confounders are distributed evenly and thus we can estimate causal effects [35]. To reduce the variability and improve the precision of estimation, we applied the stabilized version of IPTW weights [42, 43] as follows:

$$ \mathrm{SW}(t)=\prod \limits_{k=0}^t\frac{\mathit{\Pr}\left\{A(k)|\overline{A}\left(k-1\right),L(0)\right\}}{\mathit{\Pr}\left\{A(k)|\overline{A}\left(k-1\right),\overline{L}(k)\right\}} $$

Pr{*} denotes the probability function, A(k) represents the time-varying exposure at time k, \( \overline{A} \) (k-1) represents the exposure history prior to time (k − 1), \( \overline{L} \) (k) are the time-dependent covariates through time k that are possible mediators as well as confounders, and L(0) represents the vector of baseline covariates. Here, the numerator contains all covariates measured at baseline, and the denominator contains both baseline and time-varying confounders [42]. In our model, weights larger than 50 were considered extreme, and the weights were truncated at 50 in our analysis. To ensure that the confounders were balanced after applying IPTW, we compared absolute standardized mean differences across different exposure groups calculated before and after the weight application.

Censoring/attrition

To minimize selection bias from inconsistent study cohort at multiple time points, we used censoring weights to account for any loss to follow-up in the data by calculating for the probability of remaining uncensored up to each point of follow-up [44, 45]. We fit censoring models to predict the probability that a patient remained in the study for each time-interval that the patient actually remained in the study [33]. Each subject was weighted by the IPTW multiplied by the inverse probability of censoring weight [42].

MSM analysis

We defined 5 time-waves with 3-year interval between 2000 (baseline) and 2016. The baseline values and time-varying covariates collected up to each wave-year were used to assess their relationships with the outcome variable at each wave-year (Fig. 2).

Fig. 2
figure 2

Marginal structural modeling conceptual framework. *Baseline covariates (L(0)) were used to predict the low-income group exposure at each wave. The time-varying covariates (L(t)) was used to predict the low-income exposure at each wave, Pinc(t), as the outcome variable in the first stage. Then, during the second stage, Pinc(t) was used as the independent variable to examine the causal relationship between income and preventable hospitalization and comorbidity. All models in the first stage included baseline and prior low-income status to predict Pinc(t)

Before applying MSM, we first explored the associations of income quartiles with the outcome and found that outcomes from the higher three income quartiles were very similar. Therefore, we decided to dichotomize the income level into low-income group (<25th percentile) and high-income group (≥25th percentile).

We used a two-stage approach to estimate treatment effects from MSM (Fig. 2). During the first stage, we estimated each patient’s probability of being in his or her income group at each time point, and used the inverse of this probability as weights to balance the potential confounding owing to the observed and non-randomized income levels. To acquire the treatment weights (here low-income exposure), we fitted logistic regression models with both baseline variables and time-dependent variables up to each wave (2003, 2006, 2009, 2012, and 2015). Censoring weights were also applied by estimating the survival probability, because the most common reason for discontinuing the NHI coverage was death. Censoring was present at time = t if the patient transferred out prior to or at the next time point t + 1. The final weight for each patient was calculated by multiplying both the treatment weights and censoring weight at each time point. The numerator for treatment weighting was derived by adjusting the baseline values and previous income binary groups in the model and the denominator was derived by adjusting the baseline values and time-dependent variables. The numerator and denominator for censoring weights were obtained from the cox proportional hazard models, with the response variable in cox model as the patients’ binary censoring status.

Then, in the second stage, the causal parameter in the pseudo-population created with each individualized IPTW was recovered by fitting a weighted generalized linear model on the health outcomes (here preventable hospitalization and ECI). We applied IPTW to avoid biased estimation that happens when time-dependent confounders are inappropriately adjusted by stratification or traditional regression approaches. Furthermore, this methodology separates the time-dependent covariates confounder adjustment from the mediation adjustment in assessing the causal impact on the outcome [36, 45, 46].

Results

Sample characteristics - baseline

In the baseline year 2000, our study cohort included 2,844,334 patients with 32.3% in the low income group. By comparing demographic characteristics, our 25% random sample was not significantly different from the overall population (Additional file 1). Among patients in the low-income group, 59.1% lived in urban, and 34.5% in suburban areas. The distribution by place of residence for urban and suburban areas were similar between low-income and high-income groups, but the proportion of high-income group living in rural areas (10.2%) was greater than that of the low-income group (6.4%). In fact, only 16.7% (44,262/220,148) of rural dwellers were categorized as low-income. Distribution by occupation type was similar between low and high-income groups for public employees (category 1, 21.4% vs. 19.0%) and private employees (category 2, 37.6% vs. 39.3%). However, for self-employed (category 3) and those related to the military (category 4), over 99% were included in the high-income group. Approximately 93.7% of veterans and those without permanent jobs (category 6) were categorized as low-income. The composition of patient mix by ECI was similar between the two income groups. The mean number of outpatient visits (calculated per 1000 patients) was 11.3 for low-income group and 11.5 for high-income group. The mean number of hospital admissions, also calculated for every 1000 patients, was 0.11 for low-income group and 0.10 for high-income group. In other words, patients with higher income utilized ambulatory care services more frequently whereas lower-income patients required inpatient services at a higher rate (Table 1).

Table 1 Sample characteristics, baseline and first waves (no. (%), mean (95% CI))

Sample characteristics first wave to fifth wave

The total number of patients in the study decreased from 2,844,334 in the first and second waves to 2,753,224 in the third, 2,644,668 in the fourth, and 2,538,246 in the fifth wave. Thus, the overall rate of attrition during the 16-year study period was 10.8%. The high-end limit in income ranges increased for both low and high-income groups from first wave to fourth wave, reflecting the growth of Taiwan economy over the years. The distribution of residence by urbanization stayed relatively constant throughout the study period.

The proportion of patients with ECI of 0 decreased over the years for both low and high-income groups, whereas those with index between 1 and 3 increased for both groups. There was an increasing trend in rate of health care utilization by the low-income group, as the number of outpatient visits increased from 2003 to 2015 whereas it stayed relatively constant for the high-income group. The number of inpatient admissions stayed the same for both income groups. The physician density in area of residence increased over the years for both income groups as well (Tables 1, 2, and 3).

Table 2 Sample characteristics, second and third waves (no. (%), mean (95% CI))
Table 3 Sample characteristics, fourth and fifth waves (no. (%), mean (95% CI))

Analysis of causal inference by MSM

The covariates between the low-income and high-income groups were not balanced across the waves (Tables 1, 2, and 3). We compared the absolute standardized mean differences before and after applying the IPTW and confirmed that the weighted data used in MSM analysis had the covariates balanced (Additional file 2). Through MSM analysis, we found that patients in low-income group had 1.28 times greater odds of incurring a preventable hospitalization in comparison to patients in the high-income group (p < 0.001). In other words, patients with lower income were 28% more likely to require inpatient-level treatment of their ambulatory care sensitive condition(s). We also found that patients in low-income group had 1.04 times greater odds of having a comorbidity (p < 0.001) (Table 4).

Table 4 Analysis of causal impact of low income on health with marginal structural modeling

Discussion

In our study, patients in lower income group were 28% more likely to experience a preventable hospitalization, which indicates that they have a much higher risk of acquiring diseases severe enough to require inpatient-level treatment that would have been prevented with good ambulatory care. Our finding of lower income patients having 4% higher odds for comorbidity echoes this interpretation. Our analysis demonstrated that income is a causal factor in a patient’s health status even under a universal health insurance system that grants equitable access to care. Furthermore, as the Taiwanese population is relatively homogenous in race, this finding is in a setting where racial disparities in healthcare that exist in the US is eliminated. In addition, our use of MSM shows that the relationship between a patient’s SES and quality of care received is actually causal, rather than correlational.

Past studies that have investigated the influence of various demographic and health factors on preventable hospitalization are based on observational data [8, 10, 11, 20,21,22, 25] and there is a concern with effect estimates that they may be biased from unobserved confounding among the variables used [46]. Moreover, these studies used cross-sectional designs that assume the magnitude of influence of each study variable is constant and takes immediate effect [46]. Rather, many of these patient and environment factors influence health at each life stage, with accumulating social advantages and disadvantages that translate to health advantages and disadvantages over time through complex causal pathways [15, 21, 46]. In fact, a systematic review comparing estimates from conventional analysis and MSM found that there is a statistically significant difference in results between the two analytic designs [35]. Thus, our result by MSM analysis can be considered to be a more accurate estimate of the true effect of a patient’s socioeconomic status on quality of care received.

Our study has established the causal impact of income on health, and that it is a social factor at the very root of health inequalities stemming from issues both inside and outside of the healthcare division [9, 23, 47]. Upstream social determinants of health have been correlated to the onset and progression of various diseases as well as to overall mortality [48], and experts stress that many of these factors influence each other as well [21]. It is well accepted that strategies to solve health disparities should address the very generators of the inequalities [23], yet in our current healthcare system, patients’ health-related social needs are seldom evaluated or addressed [48, 49]. Information on these factors along with patients’ clinical data would permit providers to tailor services and improve effectiveness of the care delivered [48]. Global health community considers universal health insurance coverage as a tool to improve access to care and further the population health, but our results suggest that providing universal health insurance is not enough to overcome the persistent inequalities in health by SES [50]. Policies that recognize income as an upstream causal determinant of health would be more effective in addressing the health disparities by SES and improve the quality of care for patients in lower socioeconomic group [51, 52].

Based on the nature of claims data, our study is subject to error from the inherent uncertainty in coding practice. Also, the validity and consistency of IPTW is based on the assumption that there are no other unmeasured confounders in MSM analysis [32], which is not guaranteed in real-life situations. Furthermore, MSM holds under the condition that the measured covariates at each time point are sufficient to adjust for confounding, which cannot be tested with observed data [44]. However, our study cohort is comprised from a national database with demonstrated validity of recorded diagnoses [53,54,55] that represents more than 99% of the population, and thus coding errors and the selection bias stemming from the study design are minimized. We excluded patients with missing data before 2006 for appropriate censoring and to ensure two-waves worth of data, and this may add to the selection bias. However, the excluded patients are small in proportion (10% of the registrants of the NHI in 2000, Fig. 1) and thus should not lay a significant impact on our results. Lastly, our study timeframe of 16 years captures a small portion of a patient’s lifetime [46]. Nevertheless, our results still successfully showcase the causal effect of income on a patient’s health and quality of care received with methodology to address time-varying confounding bias within longitudinal population-based data.

Conclusion

Socioeconomic disparities in the quality of care delivered in the US persist despite the expansion of insurance coverage. Our study using data from a universal health insurance system found that income is a casual determinant of health even with equitable access to care. Patients in lower socioeconomic group would benefit greatly from policy interventions that recognize and address the upstream social determinants of health and the inequalities that emerge from them.