Background

As U.S. states begin to reduce coronavirus social restrictions, the risk of contracting COVID-19 is likely to increase. While statistical models have been built to predict severity of illness and mortality related to COVID-19 infection [1], less has been done to predict the risk of initial infection in community settings. Studies to date have contained limited demographic information, have focused on hospitalized patients, and have not been representative of U.S. populations [2,3,4,5,6,7].

Most studies are limited to known clinical risk factors for severe illness and mortality, such as older age [3, 4] and chronic health conditions such as hypertension [3], cardiovascular disease [4], and diabetes [7]. More recent research by the U.S. Centers for Disease Control and Prevention (CDC) has identified specific groups at higher risk for severe illness, such as older adults living in long-term care facilities, those with a body mass index (BMI) of 40 or higher, and immunosuppressed individuals, including people with HIV/AIDS [8]. However, most risk models have not incorporated clinical, sociodemographic, and environmental variables, which may be predictive of community spread within the U.S.

As with other infectious diseases, predictors of COVID-19 infection may include employment status, education level, income, and housing conditions [9], which could influence the ability to seek care, adhere to treatment, and practice physical distancing measures. Thus, effective strategies for predicting risk factors for community transmission should include both clinical and social factors [10]. The latter factors in particular remain understudied, especially among communities of lower socioeconomic status [10].

Emerging data already show that communities of color and/or low socioeconomic status are experiencing disproportionate rates of serious illness if infected, due to pre-existing economic and health inequities [11, 12].

By performing large-scale analyses, healthcare systems can play a role in investigating patient and population differences in disease susceptibility, distinct from mortality risk. The purpose of this study was to use collated data from an entire health system to identify the apparent sociodemographic, environmental, and clinical predictors of the risk of COVID-19 infection and their relevance to persistent health disparities across race, ethnicity, socioeconomic status, language, and age [13].

Methods

Study design and setting

This study was conducted at Providence Health System, the third largest not-for-profit health system in the U.S., serving more than five million people across seven states in the Western and Southwestern U.S.

Data source

Data were collected from the Providence enterprise data warehouse. The data elements that were collected were informed by a comprehensive review of prior scientific studies that documented mortality risk factors and the CDC list of groups at higher risk for severe illness [8]. Variables included patient demographic, social, and behavioral history information; chronic conditions documented in clinical history; current conditions; prescribed medications; laboratory testing results; and acute and ambulatory healthcare utilization.

To study sociodemographic and environmental variables, electronic medical record (EMR) data were used to link patients’ locations to the U.S. Census Bureau’s 2018 American Community Survey and the CDC air quality data. To join these datasets to EMR data, patient addresses were geocoded and matched at the census block group or tract level.
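One way such a linkage could work is sketched below, assuming each geocoded address has already been resolved to a 12-character census block group GEOID (state, county, tract, and block group codes concatenated), whose first 11 characters identify the tract. The function name, the fallback from block group to tract, and all identifiers and values are hypothetical illustrations, not the study's actual pipeline.

```python
def link_area_value(geoid, block_group_table, tract_table):
    """Look up an area-level value for a block group GEOID.

    Tries the block group first, then falls back to the enclosing tract.
    Returns None when the GEOID is missing, malformed, or unmatched.
    """
    if geoid is None or len(geoid) != 12:
        return None
    if geoid in block_group_table:
        return block_group_table[geoid]
    # The first 11 characters of a block group GEOID are the tract GEOID.
    return tract_table.get(geoid[:11])


# Made-up identifiers and values, for illustration only.
block_groups = {"530330001001": 0.12}
tracts = {"53033000100": 0.15, "53033000200": 0.30}

matched_block_group = link_area_value("530330001001", block_groups, tracts)
matched_tract = link_area_value("530330002003", block_groups, tracts)
unmatched = link_area_value("530330009001", block_groups, tracts)
```

Returning None for unmatched addresses keeps the gap visible so that it can later be recoded as an explicit "unknown" category rather than silently dropped.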

Glottolog, a repository for the world’s languages, was used to assign language groups. Geographic regions and clinical symptoms were also included as variables. Census data on educational attainment and financial insecurity were used to assess socioeconomic status.

Participants and procedures

Patients residing in Alaska, Washington, Oregon, Montana, and California (Los Angeles and parts of Orange County) who were tested for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection between February 28, 2020 and April 27, 2020 were included in the data set. Testing mechanisms included swabs from respiratory specimens appropriate for viral RNA testing from eight testing platforms.

Outcomes and predictors

The principal dependent variable for our model was COVID-19 infection, as indicated by a positive lab test.

Distributions of all continuous variables, including age, BMI, number of medications, and neighborhood financial insecurity, were examined for normality and transformed into categorical attributes. Comorbidities were determined by problem list documentation or clinical encounter diagnoses using standard International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM) nomenclature and further summarized into a measure of disease severity using the total number of chronic conditions. Substance, tobacco, and alcohol use were captured from social history assessments and clinician documentation.
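As an illustration of converting a continuous variable into categories, the sketch below bins BMI with fixed cut points. The cut points and labels are assumptions chosen for the example (only the threshold of 40 echoes the CDC risk group mentioned in the Background); the bins actually used in the study are defined in its supplement.

```python
from bisect import bisect_right

# Hypothetical cut points and labels, for illustration only.
BMI_EDGES = [18.5, 25.0, 30.0, 35.0, 40.0]
BMI_LABELS = [
    "Underweight",
    "Normal",
    "Overweight",
    "Obese",
    "Severely obese",
    "Very severely obese",
]


def categorize(value, edges, labels):
    """Map a continuous value to a category label.

    A value equal to a cut point falls into the higher category.
    Missing values (None) are kept as an explicit 'Unknown' level.
    """
    if len(labels) != len(edges) + 1:
        raise ValueError("need exactly one more label than cut points")
    if value is None:
        return "Unknown"
    return labels[bisect_right(edges, value)]


example = [categorize(v, BMI_EDGES, BMI_LABELS) for v in (22.0, 40.0, None)]
```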

The following variables were used as indicators of physical proximity to other people (i.e., structural barriers to social distancing): transportation insecurity, relationship status, employment, housing insecurity, and age-stratified communal living.

Statistical methods and modeling

Descriptive statistics were used to summarize study participants. Continuous variables were described by means and standard deviations, while categorical variables were described using frequencies and percentages. We conducted bivariate analyses to assess the association of each factor with the outcome. All covariates with p < 0.25 in the bivariate analysis were considered for model inclusion, since the more traditional level of 0.05 often fails to identify variables whose association with the outcome could become stronger in the presence of other variables [14]. In addition, all variables of known clinical importance found in previous studies that could make an important contribution were included to improve upon previous models [1]. Beginning with all variables of interest, stepwise selection with backward elimination was used to create a multivariable logistic regression model for predicting risk of infection.
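The screening step can be illustrated for a single binary factor with a Pearson chi-square test on a 2x2 table. This is a minimal sketch under stated assumptions: the source does not say which bivariate test was used, the function names are hypothetical, and the counts are made up. It shows a factor that would be dropped at 0.05 but retained at 0.25.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (1 df, no continuity correction) for a 2x2 table.

    a, b: positive and negative tests among patients with the factor.
    c, d: positive and negative tests among patients without it.
    Returns (statistic, p_value).
    """
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if min(row1, row2, col1, col2) == 0:
        raise ValueError("every row and column needs at least one observation")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, the upper-tail probability is erfc(sqrt(x / 2)).
    return stat, math.erfc(math.sqrt(stat / 2))


def screen(tables, threshold=0.25):
    """Return the names of factors whose bivariate p-value is below threshold."""
    return [name for name, cells in tables.items()
            if chi_square_2x2(*cells)[1] < threshold]


# Made-up counts, not study data.
toy_tables = {"factor_a": (30, 70, 20, 80), "factor_b": (25, 75, 25, 75)}
stat_a, p_a = chi_square_2x2(*toy_tables["factor_a"])
kept_at_025 = screen(toy_tables, threshold=0.25)
kept_at_005 = screen(toy_tables, threshold=0.05)
```

Multi-level categorical covariates would need a test with more degrees of freedom, and the backward elimination step is not shown here.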

Data were randomly partitioned into two independent subsets: 80% for training and building the model and 20% for testing. Initial parameters for the model were identified in the training set and then tested in the testing set. Missing data were recoded as unknown and included in the analysis. Detailed covariate definitions and data sources are shown in the supplement.
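A random 80/20 partition and the recoding of missing values can be sketched as follows. The helper names, the fixed seed, and the example record are assumptions made for illustration; the study itself performed these steps in SAS.

```python
import random


def recode_missing(record, fields):
    """Return a copy of record in which absent or None fields become 'Unknown'."""
    return {f: ("Unknown" if record.get(f) is None else record[f]) for f in fields}


def train_test_split(records, test_fraction=0.2, seed=0):
    """Randomly partition records into (train, test) lists without overlap."""
    shuffled = list(records)
    # A seeded generator makes the partition reproducible.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical record with one missing and one absent field.
cleaned = recode_missing({"age_group": "50-59", "bmi_group": None},
                         ["age_group", "bmi_group", "language"])
train, test = train_test_split(range(100))
```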

The model’s ability to discriminate COVID-19 infection in the validation data set was evaluated using the area under the receiver operating characteristic curve, and its calibration was assessed with the Hosmer-Lemeshow goodness-of-fit statistic. The observed and expected frequencies within each decile of risk were compared [14]. All data manipulation and modeling were completed in SAS EG (SAS Institute, Cary, NC).
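Both measures can be computed directly from observed outcomes and predicted probabilities. The sketch below is an illustration, not the SAS procedure used in the study: it computes the area under the curve as the share of positive-negative pairs ranked correctly, and the Hosmer-Lemeshow statistic over equal-sized risk-ordered groups. It assumes outcomes coded 0/1 and predicted probabilities strictly between 0 and 1.

```python
def roc_auc(y_true, y_prob):
    """Area under the ROC curve.

    Computed as the probability that a randomly chosen positive case receives
    a higher predicted risk than a randomly chosen negative case (ties count 0.5).
    This pairwise form is quadratic in sample size, so it suits small examples.
    """
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case of each outcome")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def hosmer_lemeshow(y_true, y_prob, groups=10):
    """Hosmer-Lemeshow chi-square statistic; returns (statistic, degrees_of_freedom)."""
    pairs = sorted(zip(y_prob, y_true), key=lambda t: t[0])
    n = len(pairs)
    if n < groups:
        raise ValueError("need at least one observation per group")
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        size = len(chunk)
        expected = sum(p for p, _ in chunk)
        observed = sum(y for _, y in chunk)
        # Denominator is size * mean_risk * (1 - mean_risk).
        stat += (observed - expected) ** 2 / (expected * (1 - expected / size))
    return stat, groups - 2


# Made-up values, not study data.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
hl_stat, hl_df = hosmer_lemeshow([0, 1] * 10, [0.5] * 20, groups=4)
```

A small statistic relative to its degrees of freedom indicates that observed and expected counts agree within each risk group.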

For all independent predictor subgroups, the risk of COVID-19 infection was quantified with odds ratios (OR) and 95% confidence intervals. These risks were calculated using the entire data set.
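For reference, an odds ratio and its Wald 95% confidence interval follow from a logistic regression coefficient and its standard error by exponentiation. The sketch below shows that conversion, plus the unadjusted analogue from a 2x2 table. The function names and counts are hypothetical, and the adjusted values in Table 2 come from the multivariable model rather than from 2x2 tables.

```python
import math

Z_95 = 1.959963984540054  # standard normal quantile for a two-sided 95% interval


def odds_ratio_from_coef(beta, se):
    """Return (OR, lower, upper) from a log-odds coefficient and its standard error."""
    return (math.exp(beta),
            math.exp(beta - Z_95 * se),
            math.exp(beta + Z_95 * se))


def odds_ratio_2x2(a, b, c, d):
    """Unadjusted OR and Wald 95% CI from a 2x2 table; all cells must be positive.

    a, b: positive and negative tests among patients with the factor.
    c, d: positive and negative tests among patients without it.
    """
    beta = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return odds_ratio_from_coef(beta, se)


# Made-up counts, not study data.
or_est, ci_low, ci_high = odds_ratio_2x2(30, 70, 20, 80)
```

An interval that excludes 1 corresponds to a two-sided Wald p-value below 0.05.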

Results

Study population

A total of 34,503 COVID-19 tested patients were included in the study (Table 1). The average age was 50 years (SD 20), 59.6% (21,209) were female, 12% (4183) were identified as non-white race, and 66% (22,610) had at least one comorbidity. Within the study population, 7.5% (2578) of patients tested positive and 92.5% (31,925) tested negative for COVID-19. Of patients testing positive, 36% (924) were hospitalized and 9% (240) died during the study period.

Table 1 Study Participant Demographics and Characteristics

Risk factors

Table 2 shows the twenty-nine sociodemographic, clinical, and environmental covariates associated with odds of infection.

Table 2 Final Multivariable Model Results

Sociodemographic risk factors

Individuals between 50 and 59 years of age (OR 1.69; 95% CI 1.41–2.02, p < 0.0001) and those of male gender (OR 1.32; 95% CI 1.21–1.44, p < 0.0001) were more likely to contract COVID-19. Being employed (OR 1.85; 95% CI 1.39–2.46, p = 0.02) or retired (OR 2.06; 95% CI 1.54–2.76, p < 0.0001) was associated with higher odds of infection. Individuals of Asian race (OR 1.43; 95% CI 1.18–1.72, p = 0.0002), Black/African American race (OR 1.51; 95% CI 1.25–1.83, p < 0.0001), or Latino ethnicity (OR 2.07; 95% CI 1.77–2.41, p < 0.0001) were more likely than whites to contract COVID-19. Individuals who identified as being married or having a significant other were at higher infection risk (OR 1.12; 95% CI 1.01–1.25, p = 0.04), as were those whose primary language was not English (OR 2.09; 95% CI 1.7–2.57, p < 0.0001) and those who self-reported their religious affiliation as a Christian denomination (OR 1.28; 95% CI 1.15–1.43, p < 0.0001).

Clinical risk factors

Clinical factors associated with higher infection risk included being very severely obese (OR 1.58; 95% CI 1.31–1.91, p < 0.0001) and having been diagnosed with diabetes (OR 1.40; 95% CI 1.22–1.61, p < 0.0001), chronic kidney disease (OR 1.03; 95% CI 1.01–2.3, p = 0.04), dementia (OR 2.01; 95% CI 1.61–2.51, p < 0.0001), or HIV/AIDS (OR 1.43; 95% CI 1.03–2.63, p = 0.03). Having an external primary care provider (OR 1.23; 95% CI 1.1–1.37, p = 0.0004) or an unknown primary care provider (OR 1.27; 95% CI 1.11–1.46, p = 0.0005) was associated with higher infection risk compared with having a primary care provider within the Providence Health System. Receiving electronic communication through the EMR was associated with a lower infection risk (OR 0.72; 95% CI 0.66–0.8, p < 0.0001).

Environmental risk factors

Patients living in areas with low air quality (OR 1.01; 95% CI 1.0–1.04, p = 0.05), financial insecurity (OR 1.10; 95% CI 1.01–1.25, p = 0.04), transportation insecurity (OR 1.11; 95% CI 1.02–1.23, p = 0.03), or housing insecurity (OR 1.32; 95% CI 1.16–1.5, p < 0.0001) were at higher risk of infection. Living in senior living facilities was associated with greater infection risk (OR 1.69; 95% CI 1.23–2.32, p = 0.001).

Prediction of infection risk

The model performed consistently across training and testing data sets, with a receiver operating characteristic area under the curve of 0.78 and a Hosmer-Lemeshow chi-square of 4.4 (p = 0.81). Partitioning the predicted probabilities of infection into “deciles of risk” (i.e., ten equal-sized groups ordered from smallest to largest) did not highlight any “underperforming” areas.
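The reported p-value can be checked against the chi-square statistic. The sketch below assumes the conventional 8 degrees of freedom for a Hosmer-Lemeshow test on ten groups (the source does not state the degrees of freedom) and uses the closed-form upper tail of the chi-square distribution that holds for even degrees of freedom. The function name is hypothetical.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable, for even df only.

    For df = 2k the tail equals exp(-x/2) * sum_{i<k} (x/2)**i / i!.
    """
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    half = x / 2
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))


p_value = chi2_sf_even_df(4.4, 8)  # about 0.82
```

Under these assumptions the result is close to the reported p = 0.81; a gap of this size is what rounding the statistic to one decimal place could plausibly produce.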

Discussion

Clinical risk factors

This retrospective study of the risk of COVID-19 infection identified several clinical risk factors also associated with serious illness in prior studies, including older age [3], male gender [15], diabetes [7], chronic kidney disease [16], high BMI [17], and immunosuppression [18]. However, some factors previously found to increase mortality risk, such as hypertension [3], and cardiovascular disease, liver disease, lung disease, or asthma [8], were not significant factors associated with initial COVID-19 infection.

Surprisingly, being prescribed more than ten medications or having a greater number of chronic conditions was associated with less infection risk, suggesting possible risk reduction behavior based on perceived risk. Further research is needed to understand the differences between factors associated with initial infection risk and those associated with serious illness and mortality once the infection occurs.

Healthcare access through a relationship with an internal primary care provider was associated with a lower infection risk; however, this may be a result of higher rates of testing for COVID-19 compared to individuals with no primary care provider. Patients without a primary care provider may have only been tested for COVID-19 after respiratory and other possible COVID-19 symptoms became conspicuous, thus increasing the probability of a positive test.

Receiving secure electronic communication through the EMR was associated with lower risk of infection, suggesting that access to health advice and education may reduce risk.

Serious mental illness and drug and tobacco use were associated with lower risk; however, further study is necessary to understand the mechanisms behind such associations.

Sociodemographic risk factors

Race and ethnicity appeared to be important predictors of risk. Higher risk of infection among Black, indigenous, and/or people of color may be associated with other sociodemographic and environmental characteristics found to also be significant in this study. African Americans and Latinos are more likely to live in communities with poor air quality [19], work in jobs that cannot be done remotely [20], and lack access to healthcare [21], which may increase the risk of infection and contribute to racial disparities in mortality. Additionally, chronic conditions such as obesity, stroke, and diabetes, as well as premature death, affect African Americans and Latinos disproportionately compared to whites [13]. Communities of color are also more likely to experience lower socioeconomic status [22] and be employed as essential workers [10]. For these and other vulnerable groups, lack of personal transportation is a barrier both to healthcare access [23] and to social distancing, further exacerbating infection risk. For these reasons, communities of color experience more structural barriers to social distancing measures and are more vulnerable to severe illness.

Having limited English proficiency (LEP) can be a barrier to accessing health services and understanding health information, especially when written translations and/or trained translators are not available [24]. Over the course of the pandemic, health information has changed rapidly (e.g., mandates for masking), which can create barriers to accessing information and could leave indigenous and immigrant communities uninformed. During the Ebola epidemic in West Africa, language barriers were an obstacle to slowing the spread of the disease [25]. People with LEP are also more likely to have low health literacy compared to English speakers and are at a higher risk of poor health [26]. Culturally and linguistically appropriate interventions are essential, including communication materials in different formats and at different reading levels developed through the collaboration of native language speakers and English speakers, as well as the use of community health workers who can engage with underserved groups [27].

Environmental risk factors

Older age may be considered both a clinical and an environmental risk factor, as it moderates both comorbidities (e.g., dementia) requiring caregiving and housing situations (e.g., living in senior communities). Our results showed that some sociodemographic patient characteristics that influence environmental exposure to social contact were also associated with increased rates of COVID-19 infection, such as being married or having a significant other, being employed, lacking access to a personal vehicle, and living in overcrowded housing, each of which significantly increased infection risk. Religious affiliation was also associated with increased risk, which may be attributed to attendance of large religious services or other behaviors associated with religious identity.

People experiencing housing insecurity may experience challenges with physical distancing, especially when housing is crowded. These individuals may also lack hand washing facilities and/or running water [28]. Both factors could facilitate community spread of infectious diseases.

Regional differences in infection risk were evident, with Southern California and Western Washington having the highest infection rates (15.7% and 11.3% of tested patients, respectively), while Oregon and Alaska had the lowest rates (4.3% and 4.7%, respectively). These regional differences may reflect some combination of population density, proximity to the initial points of COVID-19 entry into the U.S., and state-specific COVID-19 precautions.

Study limitations

This study was limited to patient data from the Providence Health System and publicly available data sets. Although the organization serves a diverse patient population across seven Western U.S. states, the generalizability of this study to the entire U.S. is unclear. With limited testing available and evolving screening guidelines, clinical discernment and personal bias may have impacted which individuals received testing and thus may have influenced the rates of testing in certain populations. Additionally, it was not possible to correlate patient data with measures of individual patient behaviors, such as mask use or adherence to social distancing recommendations. Finally, this study focused on factors associated with initial infection risk; however, other factors may further influence outcomes such as disease severity, time in hospital, and mortality.

Conclusions

Our construction of a multi-faceted prediction model of COVID-19 infection risk in our large, multi-state population has important implications for healthcare systems, public health departments, and city and state governments to further reduce the risk of infection and prevent the spread of COVID-19 in communities that may be disproportionately impacted. Knowledge of the complex mixture of clinical, ethnic, linguistic, and environmental factors that contribute to infection risk should enable more targeted public health approaches to decrease COVID-19 infection.

Linguistically and culturally appropriate prevention education, healthcare access including routine care and COVID-19 testing, and efforts to address substandard housing and hazardous working conditions are essential to reducing risk among vulnerable groups, especially communities of lower socioeconomic status, which experience a greater incidence of infectious diseases [29]. Now, and as communities seek to “re-open,” addressing the disparities in infection that contribute to rates of serious illness and mortality is needed to alleviate the disproportionate burden of the pandemic and persisting health disparities.