Introduction

The COVID-19 pandemic has highlighted longstanding injustices in the distribution of societal resources and their consequences for health disparities [1,2,3]. Low-income communities and racially minoritized groups have been disproportionately affected by the COVID-19 pandemic. Mechanisms underlying these inequities include a higher risk of exposure among minoritized groups (due to disproportionate rates of incarceration [4] and disproportionate participation in the essential workforce and in jobs with unsafe work conditions [5, 6]), increased susceptibility to severe and more debilitating outcomes, and reduced access to protective measures and health care [2, 7,8,9]. Understanding modifiable factors that contribute to these inequities is critical to inform policies toward mitigating morbidity and mortality in future disease outbreaks.

The neighborhood environment is a social determinant of health; in particular, neighborhood socioeconomic disadvantage is a key driver of health inequities. Research about neighborhood effects on outcomes such as all-cause mortality, cardiovascular disease risk, and pregnancy and birth outcomes has demonstrated that neighborhood characteristics such as socioeconomic status and built environment are strongly associated with health and health disparities [10,11,12,13,14], and the COVID-19 pandemic was no exception. For example, research using neighborhood-level cumulative case and death counts found that disadvantaged neighborhoods in US cities had higher COVID-19 incidence and mortality and lower vaccination rates and access to testing than more socially advantaged neighborhoods [1,2,3, 15,16,17,18,19]. These studies included a variety of datasets and neighborhood disadvantage metrics, but many used exclusively neighborhood-level data and aggregate counts of cases, hospitalizations, deaths, vaccines, or testing sites without accounting for individual-level characteristics.

There are also a number of studies examining individual-level clinical risk factors for COVID-19 morbidity and mortality [20]. These studies take advantage of electronic health records (EHR) containing a large number of clinical, laboratory, and imaging variables. Although these studies are important to guide individual clinical care, they have limited use within a public health framework, which focuses on populations and communities. A small number of studies have combined EHR with other sources of data on indicators of the social determinants of health (SDOH) in search of a perspective that is more relevant to guide public health interventions. For example, data from the Veterans Health Administration (VHA) were linked to county-level SDOH indicators, showing that the risk of COVID-19 increased with increases in adverse county-level indicators, such as percentage of residents without a college degree and percentage of residents living in crowded housing, after adjusting for individual-level covariates [21]. Other VHA-based studies have examined disparities in COVID-19 risk, but VHA data are not likely generalizable to nonveterans [22, 23].

This study is unique in that it takes advantage of EHR containing clinical data on comorbidity diagnosis and sociodemographic data linked to neighborhood-level data and uses a robust study design (case–control) to test a hypothesis based on an explicit conceptual framework. We tested the hypothesis that area-level social vulnerability (measured by a composite index) is associated with the occurrence of severe COVID-19 in the Southeastern Pennsylvania region (SEPA). This region has approximately 4 million residents, including those in the city of Philadelphia and the surrounding counties of Montgomery, Bucks, Chester, and Delaware. COVID-19 cumulative incidence in the SEPA region ranged from five cases per 100 residents in Bucks County to seven cases per 100 residents in Philadelphia in early 2021 [19]. Residents of this region have also been historically affected by environmental exposures from sources such as manufacturing facilities, oil refineries, and major highways [24, 25]. Some of the areas most affected by environmental exposures tend to be more densely populated and have high rates of poverty [26, 27], suggesting an accumulation of factors harmful to health.

Conceptual Framework

We used a neighborhood-centered approach to examine the association between area-level social vulnerability and COVID-19. A neighborhood-centered approach presents an alternative to biomedical and lifestyle models, which emphasize individual-level risk factors. In a neighborhood approach, individual-level factors are proximal to the health outcome(s) of interest and oftentimes on the pathway between neighborhood-level exposure and the outcome. In this context, a conceptual model demonstrating the expected pathways of association between variables at the individual and neighborhood levels is critical to understand how adjustments for individual-level factors may change the associations between neighborhood-level factors and health outcomes. Figure 1 represents the conceptual model for the associations being tested in this study. In this figure, individual-level variables are mediators in the pathway between neighborhood social stratification and COVID-19 outcomes because neighborhood disadvantage is associated with chronic conditions (e.g., cardiovascular disease and diabetes) [10, 13, 28], and the presence of comorbidities such as cardiovascular disease and diabetes is associated with COVID-19 outcomes and increases the risk of mortality [7, 29, 30]. At the same time, these comorbidities can also be confounders due to backdoor associations (Fig. 1) between neighborhood disadvantage and COVID-19 outcomes. For example, racially minoritized groups are more likely to (1) have higher prevalence of comorbidities, (2) live in disadvantaged neighborhoods due to historical policies that created segregated neighborhoods [31, 32], and (3) have higher rates of COVID-19 due to other exposures such as occupation.

Fig. 1 Conceptual framework. The dashed box outline represents historical policies that generated social stratification by race/ethnicity and for which the effects on health persist today. Green arrows represent the mediation path from neighborhood environment to comorbidities to COVID-19. Red arrows represent backdoor paths (i.e., confounding)

Methods

Data Source and Study Setting

We used electronic health records (EHR) from HealthShare Exchange (HSX), a health information hub covering most hospitals in the Southeastern Pennsylvania region (Philadelphia, Bucks, Delaware, Chester, and Montgomery counties) in addition to ambulatory care settings, long-term care, and community health settings, and patients covered by various insurance providers, Medicaid, and Medicare. We used data from the HSX clinical data repository, which includes sociodemographic data (e.g., gender, age, race, ZIP code, marital status) and clinical data on hospital inpatient visits, emergency department encounters, diagnoses, and procedures. We processed approximately 3 million health care encounters.

Study Design

This is a case–control retrospective study. Specifically, this study can be classified as a case cohort, where all cases of severe COVID-19 were identified, and a random sample from all members of the base population was selected to construct the control group [33]. To construct the sample, we first identified all individuals with a diagnosis of SARS-CoV-2 between March 1 and December 31 of 2020. Among those, we identified 15,464 unique cases of severe COVID-19, defined as those requiring inpatient care. Second, we selected 78,600 controls, a ratio of approximately five controls to one case. Controls were selected randomly from the source population after excluding all cases. To select a random sample of controls we assigned each patient in the clinical data repository a unique number using a random number generator, then ordered the numbers and selected the first 78,600 patients. Controls lived in one of the 220 neighborhoods (ZIP code areas) in five counties (i.e., Philadelphia, Bucks, Delaware, Chester, and Montgomery) with high coverage in the HSX clinical data repository. Controls were required to have at least one health care encounter in 2020 and one encounter in 2018–2019. These criteria increase the likelihood that the individual was residing in the area during the first year of the COVID-19 pandemic and had retrospective data (2018–2019) used to construct measures of exposure and covariates. Supplementary Fig. 1 shows the map of the region and compares the distribution of the population according to Census data and the distribution of controls in the analytical sample.
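The control-selection procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the patient identifiers, the function name `select_controls`, and the fixed seed are hypothetical, and Python's standard `random` module stands in for whichever random number generator was actually used.

```python
import random

def select_controls(patient_ids, case_ids, n_controls, seed=2020):
    """Draw a simple random sample of controls from non-case patients.

    Follows the described steps: assign every eligible patient a random
    number, order by that number, and keep the first n_controls patients.
    """
    rng = random.Random(seed)  # hypothetical fixed seed, for reproducibility only
    cases = set(case_ids)
    # Exclude all cases from the pool of potential controls.
    eligible = [pid for pid in patient_ids if pid not in cases]
    # Assign each eligible patient a random number.
    keyed = [(rng.random(), pid) for pid in eligible]
    # Order by the random number and take the first n_controls.
    keyed.sort()
    return [pid for _, pid in keyed[:n_controls]]

# Toy data: 1,000 hypothetical patients, 100 of them cases, 5 controls per case.
patients = [f"P{i:04d}" for i in range(1000)]
cases = patients[:100]
controls = select_controls(patients, cases, n_controls=5 * len(cases))
print(len(controls), len(set(controls) & set(cases)))
```

Sorting on a random key is equivalent to drawing a simple random sample without replacement, so every eligible patient has the same probability of selection.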

Outcome

The outcome of the study was severe COVID-19, or inpatient cases, defined as those with at least one inpatient admission and a SARS-CoV-2 diagnosis (ICD-10 codes: B34.2, B97.29, U07.1) between March 1, 2020 and December 31, 2020. We included all cases that met this outcome definition.

Exposure

Area-level social vulnerability was measured using the CDC’s social vulnerability index (SVI), defined in terms of community characteristics that affect a community's capacity to anticipate or recover from a disaster. This composite measure uses 15 variables related to four components: socioeconomic status, household composition and disability, minority status and language, and housing type and transportation [34]. Measures such as the SVI have the advantage of capturing multiple dimensions of social and economic disadvantage [35]. The SVI has been used to characterize variations in COVID-19 outcomes and to allocate resources such as COVID-19 vaccines to communities with high need [36, 37]. We used 2015–2019 American Community Survey 5-year estimates to calculate the SVI at the level of ZIP code tabulation areas (ZCTAs). To calculate the SVI, the ZCTAs were ranked according to each of the variables in descending order, except for per capita income, which was ranked in ascending order, following Flanagan et al. [34]. After ranking the ZCTAs, we (1) calculated the percentile ranking for each variable, (2) calculated the percentile ranking for the specific domains as the sum of the percentile ranks for each variable in the domain, and (3) calculated the overall SVI as the sum of the percentile ranks of the four domains.
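The three ranking steps above can be sketched as follows. This is an illustrative sketch, not the code used in the study. Several elements are assumptions: the percentile rank is computed as (rank − 1)/(N − 1) with ties sharing the lowest rank; summed scores are converted back to percentile ranks so that the result stays between 0 and 1; and the variable names, domain groupings, and values are a hypothetical toy subset rather than the 15 variables of the actual index.

```python
def percentile_rank(values, higher_is_more_vulnerable=True):
    """Percentile rank in [0, 1] as (rank - 1) / (N - 1); ties share the lowest rank."""
    signed = [v if higher_is_more_vulnerable else -v for v in values]
    ordered = sorted(signed)
    n = len(values)
    # ordered.index(v) counts the strictly smaller values, i.e. rank - 1.
    return [ordered.index(v) / (n - 1) for v in signed]

def compute_svi(zcta_data, domains, reverse_vars=("per_capita_income",)):
    """zcta_data: {variable: [one value per ZCTA]}; domains: {domain: [variables]}."""
    n = len(next(iter(zcta_data.values())))
    # Step 1: percentile rank of every variable (income is ranked in reverse).
    var_pr = {
        var: percentile_rank(vals, higher_is_more_vulnerable=var not in reverse_vars)
        for var, vals in zcta_data.items()
    }
    # Step 2: domain score from the sum of the variable percentile ranks.
    domain_pr = {
        d: percentile_rank([sum(var_pr[v][i] for v in vs) for i in range(n)])
        for d, vs in domains.items()
    }
    # Step 3: overall score from the sum of the domain percentile ranks.
    overall_sum = [sum(domain_pr[d][i] for d in domains) for i in range(n)]
    # Assumption: re-rank the sum so that the overall index lies in [0, 1].
    return domain_pr, percentile_rank(overall_sum)

# Hypothetical values for four ZCTAs (A, B, C, D); not study data.
toy = {
    "pct_poverty": [5.0, 12.0, 20.0, 31.0],
    "per_capita_income": [60000, 42000, 30000, 18000],
    "pct_crowded_housing": [2.5, 1.0, 6.0, 2.0],
}
domains = {
    "socioeconomic": ["pct_poverty", "per_capita_income"],
    "housing": ["pct_crowded_housing"],
}
domain_scores, svi = compute_svi(toy, domains)
print([round(x, 2) for x in svi])
```

Ranking per capita income in the opposite direction ensures that a higher percentile rank always indicates greater vulnerability, so the ranks can be summed across variables.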

Individual-level data from EHR were linked to area-level SVI using the ZIP code recorded in the most recent health care encounter prior to 2020 and mapping it to its corresponding ZCTA. For the analysis, the SVI, which originally ranges from 0 to 1, was multiplied by 10 to facilitate the interpretation of the results; thus, a change in one unit of the new measure can be interpreted as a 10% change in SVI.
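The linkage and rescaling described above can be sketched as follows. This is a minimal sketch under stated assumptions: the record layout, the function name `svi_exposure`, the ZIP-to-ZCTA crosswalk, and all SVI values are hypothetical placeholders, not study data.

```python
from datetime import date

def svi_exposure(encounters, zip_to_zcta, zcta_svi, cutoff=date(2020, 1, 1)):
    """Return the rescaled SVI (0 to 10) for one patient, or None if unlinkable.

    Uses the ZIP code of the most recent encounter before the cutoff date,
    maps it to a ZCTA, and multiplies the 0-1 SVI by 10.
    """
    prior = [e for e in encounters if e["date"] < cutoff and e.get("zip")]
    if not prior:
        return None  # no pre-2020 ZIP code on record
    latest = max(prior, key=lambda e: e["date"])
    zcta = zip_to_zcta.get(latest["zip"])
    if zcta is None or zcta not in zcta_svi:
        return None
    return zcta_svi[zcta] * 10

# Hypothetical patient with three encounters; only the first two precede 2020.
encounters = [
    {"date": date(2018, 6, 2), "zip": "ZIP_A"},
    {"date": date(2019, 11, 15), "zip": "ZIP_B"},
    {"date": date(2020, 4, 20), "zip": "ZIP_C"},
]
zip_to_zcta = {"ZIP_A": "ZCTA_1", "ZIP_B": "ZCTA_2", "ZIP_C": "ZCTA_3"}
zcta_svi = {"ZCTA_1": 0.25, "ZCTA_2": 0.75, "ZCTA_3": 0.5}  # made-up values

exposure = svi_exposure(encounters, zip_to_zcta, zcta_svi)
print(exposure)
```

With this rescaling, a one-unit difference in the exposure corresponds to a difference of 0.1 on the original 0 to 1 scale.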

Covariates

We extracted data on diagnosis codes for known COVID-19 comorbidities (hypertension, diabetes, heart disease, renal disease, liver disease, cancer, and immunocompromised state) retrospectively from multiple healthcare encounters recorded between 2018 and 2019 (see Supplemental Box 1 for ICD-10 codes). We also extracted demographic variables, including age, sex, and race/ethnicity categorized into Hispanic, non-Hispanic American Indian and Alaska Native, non-Hispanic Asian and Pacific Islander, non-Hispanic Black, non-Hispanic White, and other race as recorded in the patient health record. Among all individuals included in the study, 12.8% had missing data for one or more comorbidities. In addition, 3.0% and 5.3% had missing data for race/ethnicity and marital status, respectively.

Analytical Strategy

Cases and controls were characterized by demographic variables and according to the presence of comorbidities. We also created density plots showing the SVI distribution among cases and controls by race and ethnicity groups. We then constructed three primary models. The first model included the main exposure (SVI) and adjustments for sex and age. This model measures the association between social vulnerability and inpatient cases without adjusting for potential mediators or confounders, other than age and sex. The second model included comorbidities because they can act as confounders in the association between SVI and inpatient cases (Fig. 1). The third model added the variable race/ethnicity. Adding race/ethnicity in the third model does not imply a biological difference among racial and ethnic groups. Rather, it reflects the fact that racially minoritized groups have been historically disadvantaged by discriminatory policies that impacted their residential distribution in the urban space and the resources made available in neighborhoods with a large share of minoritized groups (Fig. 1).

We constructed models using the overall SVI and each of its four components: socioeconomic status, household composition and disability, minority status and language, and housing type and transportation. We used mixed effects Poisson models with individuals nested within neighborhoods, based on the current ZIP code of residence (random intercept). We used Poisson regression rather than logistic regression and reported the incidence rate ratio (IRR) rather than the odds ratio to prevent overestimation of the association on the risk scale.
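The preference for a ratio of rates or risks over an odds ratio reflects a general arithmetic property: when the outcome is common in the analytic sample, the odds ratio lies farther from 1 than the risk ratio. The toy calculation below illustrates this with made-up counts that are not from this study; it compares crude ratios from a 2×2 table and does not fit a mixed effects model.

```python
def risk_ratio(events_exposed, n_exposed, events_unexposed, n_unexposed):
    """Ratio of the outcome proportion in the exposed vs. the unexposed group."""
    return (events_exposed / n_exposed) / (events_unexposed / n_unexposed)

def odds_ratio(events_exposed, n_exposed, events_unexposed, n_unexposed):
    """Ratio of the outcome odds in the exposed vs. the unexposed group."""
    odds_exposed = events_exposed / (n_exposed - events_exposed)
    odds_unexposed = events_unexposed / (n_unexposed - events_unexposed)
    return odds_exposed / odds_unexposed

# Hypothetical common outcome: 30% vs. 15% of 1,000 people in each group.
rr_common = risk_ratio(300, 1000, 150, 1000)
or_common = odds_ratio(300, 1000, 150, 1000)

# Hypothetical rare outcome: 3% vs. 1.5% of 1,000 people in each group.
rr_rare = risk_ratio(30, 1000, 15, 1000)
or_rare = odds_ratio(30, 1000, 15, 1000)

print(round(rr_common, 2), round(or_common, 2))
print(round(rr_rare, 2), round(or_rare, 2))
```

In both scenarios the risk ratio is the same, but the odds ratio departs from it noticeably only when the outcome is common, which is the overestimation the modeling choice is meant to avoid.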

Among cases, 12.6% (1950 cases) had missing ZIP code values in the pre-pandemic period (2018–2019). For these cases, we used the ZIP code from their 2020 inpatient encounter. We conducted a sensitivity analysis excluding these cases to minimize potential bias from unmeasured residential mobility during the pandemic. We also constructed models stratified by different phases of the COVID-19 pandemic in 2020, namely early pandemic (March 1 to May 31), summer (June 1 to July 31), fall (August 1 to October 31), and winter (November 1 to December 31), to assess changes related to COVID-19 surges and their impact on health system capacity. Finally, we used multiple imputation based on Markov Chain Monte Carlo augmentation to impute missing values for demographic characteristics. Analyses were conducted using Stata 17.
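The period stratification can be sketched as a simple date lookup. This is an illustrative sketch: the date ranges are those stated above, but the function name `pandemic_period` and the use of the inpatient admission date as the classifying date are assumptions, since the text does not specify which date was used.

```python
from datetime import date

# Period boundaries as stated in the text (all in 2020, inclusive).
PERIODS_2020 = [
    ("early pandemic", date(2020, 3, 1), date(2020, 5, 31)),
    ("summer", date(2020, 6, 1), date(2020, 7, 31)),
    ("fall", date(2020, 8, 1), date(2020, 10, 31)),
    ("winter", date(2020, 11, 1), date(2020, 12, 31)),
]

def pandemic_period(admission_date):
    """Return the period label for a date, or None if outside the study window."""
    for label, start, end in PERIODS_2020:
        if start <= admission_date <= end:
            return label
    return None

# Hypothetical admission dates, used only to show the mapping.
for d in (date(2020, 4, 15), date(2020, 7, 31), date(2020, 11, 1), date(2021, 1, 5)):
    print(d.isoformat(), pandemic_period(d))
```

Each stratum can then be analyzed with the same fully adjusted model, restricted to the cases admitted in that period.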

Research Ethics Approval

This project was determined to be not human subjects research and exempt from IRB review by the Drexel University Institutional Review Board, protocol #2110008842.

Results

Table 1 shows the sample characteristics. Among controls, mean age was 53 years (SD = 19), 63% were female, and 48% were married or had a partner, while 43% were not married/partnered. Non-Hispanic white individuals were the majority among controls (57%); non-Hispanic Black, Hispanic, non-Hispanic Asian Pacific Islander, and American Indian and Alaska Native individuals were 23%, 6%, 4%, and 5% of the sample of controls, respectively. Among controls, prevalence of chronic conditions varied from 24% for hypertension to 2% for liver disease. Among COVID-19 inpatient cases, mean age was higher than that for controls, 65 years (SD = 18), 52% were female, 36% were married or had a partner, and 57% were not married/partnered. The distribution of inpatient cases by race/ethnicity was also different; non-Hispanic Black individuals were the largest group (42%), followed by non-Hispanic white and Hispanic individuals, representing 37% and 11% of the cases, respectively. The prevalence of chronic conditions was considerably higher among cases than controls, varying from 67% for hypertension to 6% for liver disease. Prevalence of hypertension, diabetes, and heart disease was about three times higher among cases than controls, and immunocompromised state was about seven times higher among cases than controls.

Table 1 Sample characteristics

Figure 2 shows the distribution of cases and controls over the SVI by race/ethnicity groups. Racially minoritized groups are concentrated in neighborhoods with high SVI. Non-Hispanic white individuals are distributed more evenly across the SVI variable. In general, cases were more concentrated in high-SVI neighborhoods compared to controls.

Fig. 2 Distribution of cases and controls by neighborhood social vulnerability index stratified by race/ethnicity groups

Table 2 shows the main results from models that included the overall SVI and models that included only one of the SVI components. For the overall SVI, models adjusted for different sets of covariates showed incidence rate ratios (IRR) ranging from 1.15 (95% CI, 1.13–1.17) in the model adjusted for individual-level age, sex, and marital status to 1.09 (95% CI, 1.08–1.11) in the fully adjusted model, which included individual-level comorbidities and race/ethnicity. Thus, the fully adjusted model indicates that a 10% higher area-level SVI was associated with a 9% higher risk of severe COVID-19. Secondary analyses excluding individuals who did not have ZIP code data prior to 2020 (Model 3.1) and a model with imputed missing values (Model 3.2) showed similar results. Overall, models including only one of the SVI components (socioeconomic status, household composition and disability, minority status and language, and housing type and transportation) showed weaker associations, but coefficients from models using the socioeconomic status component (IRR = 1.13–1.08, across models 1 to 3) were generally similar to coefficients from models using the overall SVI.

Table 2 Incidence rate ratio (IRR) and 95% CI for the association between retrospective measures of area-level social vulnerability and severe COVID-19, March to December 2020

Finally, Fig. 3 shows the IRR and 95% confidence intervals for the fully adjusted model (Model 3) stratified by different periods during the pandemic in 2020. Stratified models showed slight variation in the magnitude of the association between SVI and COVID-19 risk, but overall, the association was robust throughout the year.

Fig. 3 Incidence rate ratio (IRR) and 95% CI for the fully adjusted model stratified by periods in 2020. Early pandemic: from March 1 to May 31; summer: from June 1 to July 31; fall: from August 1 to October 31; and winter: from November 1 to December 31

Discussion

In this case–control study using EHR from multiple healthcare systems and insurers in the SEPA region, we found a strong and persistent association between area-level social vulnerability and risk of severe COVID-19, defined as a case that required an inpatient visit. The association persisted even after adjusting for a number of comorbidities ascertained by retrospective diagnosis from EHR. A 10% higher SVI was associated with a 9% to 15% higher risk of severe COVID-19, depending on the set of covariates included in the adjustment.

The patterns identified are consistent with the hypothesis that areas with high vulnerability do not just concentrate populations with high burden of disease, but that the conditions in which people live in these neighborhoods are associated with higher burden of disease. In this study, inpatient cases of COVID-19 were more likely to be from high-vulnerability neighborhoods, even after accounting for several comorbidities associated with COVID-19 and a number of demographic factors. In addition, given the fact that we used social vulnerability measures from retrospective data and SARS-CoV-2 was a new virus in 2020, reverse causation is not possible in this case.

The multiple sets of adjustments guided by the conceptual framework cover different possible association pathways between area-level vulnerability and COVID-19. The model with fewer adjustments shows a stronger association, likely because these high-vulnerability neighborhoods also tend to concentrate sicker/more susceptible individuals, which is a source of confounding. The data also show that high-vulnerability neighborhoods tend to concentrate individuals from racially minoritized groups who are likely exposed to other structural determinants of health [38, 39]. However, even after adjusting for several factors, the association persisted, indicating that social vulnerability is also directly associated with COVID-19 regardless of the presence of comorbidity and sociodemographic factors measured at the individual level. At the same time, because comorbidities can also function as mediators in the pathway between social vulnerability and COVID-19, adjusted models may be underestimating the main association. In this context, it is reasonable to believe that the true association likely lies between that shown in model 1 (minimal adjustment excluding potential mediators) and that shown in model 3 (fully adjusted).

We found generally consistent results when examining different components of the SVI. Exploring different SVI components is particularly helpful to rule out issues related to the inclusion of variables such as race/ethnicity, which is not itself a marker of socioeconomic deprivation, but is linked to deprivation via racist policies [35]. Models including only the socioeconomic component of the SVI (which includes the percent of residents living below poverty, percent unemployment, percent without a high school diploma, and per capita income, but does not include race or age composition) point to the same overall findings. Exploring specific SVI components also helps to rule out potential issues related to the inclusion of race and age both as individual-level variables and as neighborhood-level metrics in two of the SVI components [35].

Lastly, models stratified by periods of the pandemic showed consistent results. In the early pandemic period, the magnitude of the association between social vulnerability and severe COVID-19 risk was smaller compared to the subsequent periods. This could be due to overall low rates of hospitalization among people from socially vulnerable neighborhoods, who were more likely to face barriers to care and potentially to die at home [40] due to limited hospital capacity. This issue may have led to an underestimation of inpatient cases from socially vulnerable neighborhoods, leading to an underestimation of the risk ratio during this period.

Our findings are consistent with the literature examining the association between area-level socioeconomic measures and COVID-19 risk. For example, despite the differences between veteran and nonveteran populations [41], studies using EHR from Veterans Affairs (VA) linked to county-level social determinants of health [21], and to neighborhood-level social vulnerability using the SVI and other composite metrics [23], showed an increased risk of COVID-19 infection and an increased risk of hospitalization in areas with greater socioeconomic adversity.

Our paper documents the association between neighborhood-level social disadvantage and the risk of severe COVID-19 in a population served by several healthcare systems in a large urban area. Our findings indicate that addressing area-level disadvantage is critical; by reducing social vulnerability, we may not just reduce the burden of COVID-19 and other potentially epidemic respiratory diseases directly, but we may also improve population health among those who need it the most, which is critical to address health inequities. Information on these area-level characteristics can also guide equity-based public health interventions such as vaccine distribution or tailored social services to prevent hospitalizations and deaths in communities that concentrate disadvantage. In fact, this strategy was used by jurisdictions across the country to guide allocation of COVID-19 vaccines [42], with some showing promising results [43,44,45].

Limitations

This study has multiple strengths, including a well-designed sample construction and analytical approach, a large sample size distributed across over 200 neighborhoods (ZCTAs), and use of retrospective data to measure exposure and confounders. However, it also has limitations. One limitation is the possibility of selection bias; the control group included individuals who had at least one visit in the previous year, so those who are less likely to seek health care, either because they are healthier or because of barriers to access, are less likely to be included in the control group. For example, differences in health-seeking behaviors between females and males are well documented, and the female/male ratio in our control sample confirms the pattern that females are more likely to use health care. In addition, limiting individuals in the control group to those who had at least one encounter in 2020 may also have led to including sicker patients, particularly as health care utilization for non-urgent care declined in 2020 due to the pandemic [46, 47]. Considering these issues, a control group composed of sicker individuals (compared to the overall population) may lead to an underestimation of the true associations we would have found at the population level. On the other hand, comorbidities may be underreported in the EHR. Even though we used retrospective data to capture records of diagnosis in the 2018–2019 period, the prevalence of some conditions such as hypertension is slightly lower in the sample than in population survey data [48], which can be due to underreporting of these conditions. This is a misclassification problem, i.e., individuals with comorbidities such as hypertension are misclassified as not having hypertension. This misclassification results in incomplete control for confounding, which may bias the results in the direction of the confounding [49]; in this case, the bias would be away from the null, based on the comparison of models 1 and 2. Future work using this data set should “look back” at additional years when creating retrospective measures of prevalence to examine whether the lower prevalence of certain chronic conditions is a result of a relatively healthier user population or inconsistent reporting of chronic conditions in the EHR.

Conclusions

In this study, we found that individuals in neighborhoods with high social vulnerability were more likely to have severe COVID-19 after accounting for individual comorbidities and demographic characteristics. Our findings support initiatives that incorporate area-level social determinants of health when planning interventions and allocating resources to mitigate epidemic respiratory diseases, including other coronaviruses and influenza, which has historically shown similar disparities to COVID-19 [50]. By tailoring preventive measures to vulnerable communities, the number of cases requiring hospitalization could be reduced significantly, thus reducing the burden on this population and on the healthcare system in general. These results would likely extend to other diseases, making area-level interventions even more important to combat health inequities.