Background

In December 2019, cases of pneumonia of unknown cause occurred in Wuhan, Hubei Province, China. On January 7, 2020, a new beta-coronavirus called severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) was identified [1,2,3]. Like severe acute respiratory syndrome coronavirus and Middle East respiratory syndrome coronavirus, SARS-CoV-2 causes a lower respiratory infection called coronavirus disease 2019 (COVID-19). The World Health Organization (WHO) declared the disease a public health emergency of international concern on January 30, 2020. On February 3, 2020, Brazil announced the disease as a public health emergency of national concern [4], and on February 25, 2020, the country confirmed its first case [5].

On June 14, 2021, the world had 175,686,814 confirmed cases and 3,803,592 deaths, and Brazil had 17,452,612 confirmed cases and 488,228 deaths [6]. COVID-19 can occur asymptomatically, as influenza-like illness, or as more severe forms, which characterize severe acute respiratory syndrome (SARS). The most common clinical manifestations include fever, dry cough, diarrhea, vomiting, and muscle pain or fatigue [7, 8]. The average incubation period is 5.2 days [9]. Transmission can occur by droplets and aerosols or by contact with contaminated objects [10].

Mortality is higher in individuals over 80 years of age and in people with comorbidities, who therefore constitute the risk group for severe forms of the disease [11]. The gold standard for laboratory confirmation of SARS-CoV-2 infection is reverse transcription-polymerase chain reaction (RT-PCR). Mass RT-PCR testing, contact tracing, and isolation of positive individuals are essential to control the spread of the virus [12,13,14,15].

This study aims to analyze the factors associated with death in confirmed cases of COVID-19 in the state of Rio de Janeiro. Defining these factors, which has been the objective of several studies in the literature, is important for better case management [16,17,18,19,20,21].

Methods

Design

The design was a cross-sectional study evaluating the association between individual demographic, clinical, and epidemiological predictor variables and the outcome (death) among confirmed cases of COVID-19 in the state of Rio de Janeiro using data gathered from the Unified Health System (SUS) information systems.

Data collection

We gathered the confirmed influenza-like illness (ILI) cases caused by COVID-19 from the e-SUS NOTIFICA system of the Ministry of Health. We obtained the confirmed SARS cases caused by COVID-19 from the Flu Epidemiological Surveillance Information System (SIVEP-Gripe) of the Ministry of Health. We linked these two databases, and a consistency analysis removed any duplicates from the resulting database. Next, we linked this database with the Mortality Information System database of the Ministry of Health. The study covered the period from March 5 (the first confirmed case in the state of Rio de Janeiro) to September 15, 2020. During this period, the state of Rio de Janeiro confirmed 243,509 cases of COVID-19, of which 178,231 (73.2%) were ILI and 65,278 (26.8%) were SARS. All cases included in this study met the ILI and SARS criteria defined by the Ministry of Health [22].
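The linkage and de-duplication steps can be illustrated with a minimal sketch. It assumes exact matching on a hypothetical identifier (`name` plus `birth_date`), and the toy records are invented for illustration. The text does not state which linkage keys or matching method were used, so this should not be read as the authors' procedure.

```python
def deduplicate(records, key_fields):
    """Keep the first record seen for each combination of key fields."""
    seen = {}
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        seen.setdefault(key, rec)
    return list(seen.values())


def link_deaths(cases, death_records, key_fields):
    """Flag each case as a death if its key appears in the mortality records."""
    death_keys = {tuple(d[f] for f in key_fields) for d in death_records}
    return [
        {**case, "death": tuple(case[f] for f in key_fields) in death_keys}
        for case in cases
    ]


# Hypothetical toy records; the field names are illustrative only.
KEY = ("name", "birth_date")
ili = [{"name": "A", "birth_date": "1980-01-01", "form": "ILI"},
       {"name": "B", "birth_date": "1950-05-05", "form": "ILI"}]
sars = [{"name": "B", "birth_date": "1950-05-05", "form": "SARS"},
        {"name": "C", "birth_date": "1940-03-03", "form": "SARS"}]
deaths = [{"name": "C", "birth_date": "1940-03-03"}]

# Assumption: SARS records go first so that a person present in both
# systems keeps the more severe classification.
cases = deduplicate(sars + ili, KEY)
linked = link_deaths(cases, deaths, KEY)
```

In practice, surveillance databases rarely share a clean common key, so a real linkage would likely need fuzzy or probabilistic matching rather than the exact match used here.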

The individual predictor variables selected covered demographic, clinical, and epidemiological characteristics (Table 1).

Table 1 Individual variables selected

Immunodeficiency/immunosuppression, considered a risk factor/comorbidity in both the notification and the investigation of ILI and SARS cases, is defined as any suppression or deficiency of the immune system associated with medications (corticosteroids, chemotherapy, TNF-alpha inhibitors) or pathologies (neoplasms, HIV/AIDS, among others).

Data analysis

We used Microsoft Excel 365 to handle the database and performed the statistical analyses with R x64 4.0.0 through RStudio Desktop and with Python 3.8.5.

In the descriptive analysis of the data, we categorized the quantitative predictor variable age according to age groups (in years): 0 to 9, 10 to 19, 20 to 29, 30 to 39, 40 to 49, 50 to 59, 60 to 69, 70 to 79, and 80 years or older. For the modeling, we did not categorize patient age.
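As an illustration of the age categorization used in the descriptive analysis, the sketch below maps an age in years to the ten-year groups listed above. The function name and the handling of absent ages (returned as "missing") are assumptions made for the example.

```python
def age_group(age):
    """Map an age in years to the ten-year groups of the descriptive analysis."""
    if age is None or age < 0:
        return "missing"  # assumption: absent or invalid ages get their own label
    if age >= 80:
        return "80 or older"
    lower = int(age // 10) * 10
    return f"{lower} to {lower + 9}"


groups = [age_group(a) for a in (0, 9, 41, 79, 80, 95, None)]
```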

The bivariate descriptive analysis used bar graphs, pie charts, or box plots for the numerical predictor variable (age) and contingency tables for categorical predictor variables. We used Pearson’s chi-squared test to analyze categorical predictor variables and the Wilcoxon test for the numerical predictor variables, both with a level of significance of 5%.
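For the categorical variables, Pearson's chi-squared statistic compares observed counts with the counts expected under independence. The sketch below computes the statistic for any table of counts and the p-value only for the one-degree-of-freedom (2 x 2) case, without continuity correction. The counts are hypothetical, and the Wilcoxon test is not covered.

```python
import math


def chi_squared(table):
    """Pearson's chi-squared statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


def p_value_1df(stat):
    """Upper-tail p-value of a chi-squared statistic with one degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical 2 x 2 table: rows = symptom present/absent, columns = death/non-death.
stat, dof = chi_squared([[30, 70], [10, 90]])
p = p_value_1df(stat)
```

With a significance level of 5%, independence would be rejected for this toy table because `p` falls below 0.05.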

We checked the database for missing data. We kept variables considered epidemiologically relevant but with a high percentage of missing data in the analysis, creating a category named “missing” for each predictor variable. They were “kidney diseases”, “immunodeficiency/immunosuppression”, “diabetes”, “heart disease”, “dyspnea”, “cough”, “fever”, “headache”, “sneezing”, “odynophagia”, “muscle pain”, “diarrhea”, “loss of smell”, “loss of taste”, and “runny nose”.
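The recoding of absent values into an explicit category can be sketched as follows. The record layout and the treatment of `None` and empty strings as missing are assumptions for the example.

```python
def recode_missing(records, variables, label="missing"):
    """Return copies of the records with empty values recoded as a category."""
    recoded = []
    for rec in records:
        new = dict(rec)
        for var in variables:
            if new.get(var) in (None, ""):
                new[var] = label
        recoded.append(new)
    return recoded


# Hypothetical records with gaps in two of the variables named above.
rows = [{"diabetes": "yes", "fever": None},
        {"diabetes": "", "fever": "no"}]
clean = recode_missing(rows, ["diabetes", "fever"])
```

Keeping "missing" as its own level lets a tree-based model use records that would otherwise be dropped, at the cost of treating missingness itself as a predictor.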

Due to a large number of categories of the predictor variable “municipality of residence” (91 municipalities), we grouped the residences into regions defined by the state government of Rio de Janeiro [23]. The exception was the city of Rio de Janeiro, which has more than one-third of the state’s population and more than half of the deaths, so we analyzed it as an individual category, called “Metropolitan Region I - capital”.

Then, after evaluating several machine learning algorithms, we applied Extreme Gradient Boosting (XGBoost), as it showed the best performance for the dataset. Because machine learning models tend to perform better on balanced data, we balanced the predicted variable (outcome) using the synthetic minority oversampling technique (SMOTE) [24] and edited nearest neighbor (ENN) [25] techniques. The XGBoost model uses decision trees weighted by the estimation difficulty [26]. Compared to machine learning models in general, the advantages of this model are the better generalization of the results, the possibility of visualizing all decision trees, and the reduction of the bias and variance of a single tree.
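The balancing step can be illustrated with a simplified, pure-Python sketch of the two ideas. SMOTE creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbors, and ENN then removes samples whose label disagrees with the majority of their neighbors. The toy data, the value of `k`, the random seed, and the majority-vote rule are all assumptions. This is not the implementation used in the study, and the classifier itself is not shown.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(idx, points, k):
    """Indices of the k points closest to points[idx], excluding idx itself."""
    others = [j for j in range(len(points)) if j != idx]
    others.sort(key=lambda j: sq_dist(points[idx], points[j]))
    return others[:k]


def smote(X, y, minority, k=3, seed=0):
    """Oversample the minority class until both classes have the same size."""
    rng = random.Random(seed)
    minority_pts = [x for x, label in zip(X, y) if label == minority]
    n_needed = (len(X) - len(minority_pts)) - len(minority_pts)
    new_X, new_y = list(X), list(y)
    for _ in range(max(n_needed, 0)):
        i = rng.randrange(len(minority_pts))
        j = rng.choice(nearest(i, minority_pts, k))
        gap = rng.random()  # position along the segment between the two points
        new_X.append([a + gap * (b - a)
                      for a, b in zip(minority_pts[i], minority_pts[j])])
        new_y.append(minority)
    return new_X, new_y


def edited_nearest_neighbours(X, y, k=3):
    """Drop samples whose label disagrees with most of their k neighbours."""
    keep = []
    for i in range(len(X)):
        labels = [y[j] for j in nearest(i, X, k)]
        agree = sum(1 for label in labels if label == y[i])
        if agree * 2 > len(labels):
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


# Hypothetical two-feature data: 8 non-deaths (0) and 3 deaths (1).
X = [[0.0, 0.0], [0.5, 0.2], [0.1, 0.9], [1.0, 1.0], [0.8, 0.3],
     [0.3, 0.7], [0.9, 0.6], [0.2, 0.4],
     [5.0, 5.0], [5.5, 5.2], [5.2, 5.8]]
y = [0] * 8 + [1] * 3

X_bal, y_bal = smote(X, y, minority=1)
X_clean, y_clean = edited_nearest_neighbours(X_bal, y_bal)
```

Resampling of this kind is normally applied to the training data only, so that evaluation reflects the original class proportions.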

To evaluate the relevance of each predictor variable, we used the SHapley Additive exPlanations (SHAP) metric [27]. SHAP values can be negative or positive, and the closer a value is to 0, the less the predictor variable influences the predicted variable. The results show the mean of the absolute values of the importance of each predictor variable.
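The summary reported for each predictor, the mean of the absolute SHAP values across individuals, can be sketched as below. The SHAP matrix is a hypothetical input (one row per individual, one column per predictor); computing the SHAP values themselves is outside the scope of this example.

```python
def mean_abs_shap(shap_values, feature_names):
    """Rank features by the mean absolute SHAP value across individuals."""
    n = len(shap_values)
    importance = {
        name: sum(abs(row[j]) for row in shap_values) / n
        for j, name in enumerate(feature_names)
    }
    return sorted(importance.items(), key=lambda item: item[1], reverse=True)


# Hypothetical SHAP values for three individuals and two predictors.
ranking = mean_abs_shap([[0.5, -0.1], [-0.3, 0.2], [0.4, 0.0]], ["age", "cough"])
```

Taking absolute values before averaging prevents positive and negative contributions from cancelling, so the ranking reflects the size of the influence rather than its direction.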

Also, we used sensitivity, specificity, accuracy, positive predictive value, negative predictive value, Youden’s index, receiver operating characteristic (ROC) curves, and Matthews’ correlation coefficient (MCC) to evaluate the goodness of fit [28].
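Most of these measures follow directly from the four cells of the confusion matrix, as the sketch below shows (the ROC curve is not covered). The counts are hypothetical and are not the study's results.

```python
import math


def evaluation_measures(tp, fp, tn, fn):
    """Classification measures from true/false positive and negative counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "youden": sensitivity + specificity - 1,
        "mcc": (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0,
    }


# Hypothetical counts with deaths as the positive class.
measures = evaluation_measures(tp=80, fp=90, tn=810, fn=20)
```

When the positive class is rare, the positive predictive value can be low even with high sensitivity and specificity, which is one reason to report MCC alongside accuracy.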

From the probabilities generated by the XGBoost model, we transformed the data to the logarithm of odds, where odds is the ratio between the probability of death and non-death attributed to the categories of each predictor variable for each individual. We used the logarithm of odds to estimate the odds ratio, according to the formula below:

$${Odds\ ratio}_i\left({OR}_i\right)=\frac{e^{mean\left(\log \left({odds}_{int}\right)\right)}}{e^{mean\left(\log \left({odds}_{ref}\right)\right)}}$$

Where:

$$int=\mathrm{Category}\ \mathrm{of}\ \mathrm{interest}\ \mathrm{of}\ \mathrm{the}\ \mathrm{predictor}\ \mathrm{variable}$$
$$ref=\mathrm{Category}\ \mathrm{of}\ \mathrm{reference}\ \mathrm{of}\ \mathrm{the}\ \mathrm{predictor}\ \mathrm{variable}$$

For the numerical variable age, we used the categories of interest described earlier in this section to obtain the summarized results.
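The odds ratio formula can be read as the ratio of the exponentiated mean log-odds in the two categories. A minimal sketch follows, assuming natural logarithms and hypothetical predicted probabilities of death for the individuals in each category; the function names are illustrative.

```python
import math


def log_odds(p):
    """Natural logarithm of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))


def odds_ratio(probs_int, probs_ref):
    """exp(mean log-odds, category of interest) / exp(mean log-odds, reference)."""
    mean_int = sum(log_odds(p) for p in probs_int) / len(probs_int)
    mean_ref = sum(log_odds(p) for p in probs_ref) / len(probs_ref)
    return math.exp(mean_int) / math.exp(mean_ref)


# Hypothetical model probabilities of death for two individuals per category.
or_example = odds_ratio(probs_int=[0.5, 0.5], probs_ref=[0.2, 0.2])
```

Here the odds are 1 in the category of interest and 0.25 in the reference category, giving an odds ratio of 4.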

We calculated the confidence interval according to Wayne W. LaMorte [29] and estimated the total number of individuals by category and outcome using the following formula:

$$95\% confidence\ interval={e}^{\left(\log \left({OR}_i\right)\pm \left[1.96\times SE\left(\log \left({OR}_i\right)\right)\right]\right)}$$

Where:

$$\mathrm{SE}\left(\log \left({\mathrm{OR}}_{\mathrm{i}}\right)\right)=\sqrt{\frac{1}{{\mathrm{p}}_{\mathrm{i}\mathrm{nt}}\times {\mathrm{n}}_{\mathrm{i}\mathrm{nt}}}+\frac{1}{\left(1-{\mathrm{p}}_{\mathrm{i}\mathrm{nt}}\right)\times {\mathrm{n}}_{\mathrm{i}\mathrm{nt}}}+\frac{1}{{\mathrm{p}}_{\mathrm{ref}}\times {\mathrm{n}}_{\mathrm{ref}}}+\frac{1}{\left(1-{\mathrm{p}}_{\mathrm{ref}}\right)\times {\mathrm{n}}_{\mathrm{ref}}}}$$
$$p= estimated\ probability\ of\ death\ for\ the\ category$$
$$n= number\ of\ individuals\ in\ the\ category$$
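The confidence interval and standard error formulas translate directly into code. The sketch below uses hypothetical values for the odds ratio, the estimated probabilities, and the category sizes; the function name is illustrative.

```python
import math


def or_confidence_interval(odds_ratio, p_int, n_int, p_ref, n_ref, z=1.96):
    """95% confidence interval of an odds ratio on the log scale."""
    se = math.sqrt(
        1 / (p_int * n_int) + 1 / ((1 - p_int) * n_int)
        + 1 / (p_ref * n_ref) + 1 / ((1 - p_ref) * n_ref)
    )
    log_or = math.log(odds_ratio)
    return math.exp(log_or - z * se), math.exp(log_or + z * se)


# Hypothetical inputs: OR = 4, 100 individuals per category.
lower, upper = or_confidence_interval(4.0, p_int=0.5, n_int=100, p_ref=0.2, n_ref=100)
```

Each product p × n or (1 − p) × n plays the role of an estimated cell count of the 2 x 2 table, so the interval narrows as the categories grow.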

We used the odds ratio confidence interval to calculate its p-value, according to the formula presented in Altman [30] for proportions. The significance level considered was 5%.
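One way to recover a p-value from a confidence interval is to back-calculate the standard error from the interval width on the log scale and then apply a two-sided normal test. The sketch below follows that general idea. The exact formula of reference [30] is not reproduced in the text, so the normal tail used here is an assumption and may differ slightly from the published approximation.

```python
import math


def p_value_from_ci(estimate, lower, upper, z_crit=1.96):
    """Two-sided p-value for a ratio estimate, given its 95% confidence interval."""
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = abs(math.log(estimate)) / se
    return math.erfc(z / math.sqrt(2))


# Hypothetical odds ratio and interval.
p_sig = p_value_from_ci(4.0, 2.1357, 7.4917)
p_null = p_value_from_ci(1.0, 0.5, 2.0)
```

An odds ratio of exactly 1 yields a p-value of 1, while an interval that excludes 1 yields a p-value below 0.05.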

Results

Between March 10 and September 15, 2020, the state of Rio de Janeiro reported 243,509 confirmed cases of COVID-19, of which 37.3% were residents of the municipality of Rio de Janeiro. Of the total number of cases, 178,231 (73.2%) were classified as ILI, and 65,278 (26.8%) were classified as SARS.

Table 2 shows the incidence, mortality, and lethality of confirmed cases of COVID-19 by age group. The disease incidence was higher in the age group 80 years and older, followed by 40 to 49 years and 30 to 39 years. Mortality and lethality increased with increasing age, except in children younger than 10 years.

Table 2 Distribution of incidence, mortality, and lethality by age group

Table 3 shows the predictor variables’ distribution as a function of disease severity (ILI and SARS). There was a higher frequency of cases among the age group of 40–49 years (21.3%), among females (52.4%), and among brown individuals (32.9%), followed by white individuals (32.6%). The most frequent signs and symptoms were cough, fever, dyspnea, and odynophagia. The most prevalent comorbidities were cardiovascular disease and diabetes mellitus. Regarding disease evolution, the lethality was 7.4%.

Table 3 Distribution of confirmed cases of ILI and SARS, according to the individual variables selected

ILI was more frequent among younger individuals, the most common age group being 30–39 years (24.4%); the median age at ILI diagnosis was 41 years, and it was more frequent among females (52.6%) and brown individuals (34.1%), followed by white individuals (33.1%). The most frequent signs and symptoms were cough, fever, and odynophagia. The frequency of comorbidities was 5.8% for cardiovascular disease and 3.0% for diabetes mellitus. Only 1.2% of ILI cases progressed to death.

SARS was more frequent in individuals older than 60 years (38.3%); the median age was 53. Women had the highest frequency (52.0%), but concerning race/color, white individuals had the highest frequency (31.4%), followed by brown individuals (29.7%). Dyspnea (80.2%), cough (65.7%), and fever (60.3%) were more frequent than in cases of ILI. The rate of associated comorbidities was also higher than that in ILI, with 23.8% cardiovascular disease and 15.6% diabetes mellitus. The lethality observed in cases of SARS was 24.5%.

The statistics of Pearson’s chi-squared tests and the respective p-values are listed in Tables 3 and 4. P-values under 0.05 indicate rejection of the hypothesis of independence between the observed variables. All predictor variables were significantly different between ILI and SARS cases.

Table 4 Distribution of COVID-19 outcomes according to the individual variables selected

Table 4 shows the relationship between the predictor variables and the outcome (death). There was a higher proportion of elderly age (over 60 years old), male sex, and black race/skin color among deaths than among non-deaths. The signs and symptoms that stood out were dyspnea, fever, and cough. Dyspnea was present in 56.5% of cases of death and in 18.7% of non-deaths. Fever was present in 48.3% of the cases of death and in 37.4% of the cases of non-death. Cough was present in 49.3% of cases of death and 44.1% of cases of non-death. The most common comorbidities in both groups were cardiovascular disease, at 34.1% in cases of death and 8.7% in cases of non-death, and diabetes mellitus, at 24.0% in cases of death and 5.0% in cases of non-death.

The numbers of missing values for the predictor variables were as follows: 1085 (0.4%) for age, 3506 (1.4%) for signs and symptoms, 396 (0.2%) for chronic respiratory disease, 55,684 (22.9%) for skin color/race, 19,006 (7.8%) for cardiovascular disease, 21,701 (8.9%) for diabetes mellitus, 25,870 (10.6%) for chronic kidney disease, 26,181 (10.8%) for immunodeficiency/immunosuppression, and 273 (0.1%) for municipality of residence.

There was a statistically significant association between the outcome and the predictor variables listed in Table 4.

The historical case series per epidemiological week showed a proportional increase in cases of ILI relative to those of SARS starting from epidemiological week 19 (Fig. 1a). Figure 1b shows a proportional increase in deaths compared to non-deaths between the 13th and 20th epidemiological weeks, with a maximum of 2465 deaths (14.7%) in week 18. After that, the death percentage dropped.

Fig. 1 The proportion of cases per epidemiological week according to clinical evolution (ILI/SARS) (a) and outcome (b). Note: confirmed cases of COVID-19, state of Rio de Janeiro, March–September 2020

Concomitantly, there was a progressive increase in cases in younger age groups throughout the study period (Fig. 2).

Fig. 2 The proportion of cases per epidemiological week according to age group. Note: confirmed cases of COVID-19, state of Rio de Janeiro, March–September 2020

Figure 3 illustrates the comparison of the age distribution according to clinical evolution - ILI vs. SARS (a) and death vs. non-death (b). ILI was more frequent in younger individuals and SARS in older individuals. The comparison between the age groups of individuals according to the outcome showed a higher concentration of elderly individuals in the group that progressed to death. In both cases, there was a statistically significant difference.

Fig. 3 Age distribution according to clinical evolution (ILI/SARS) (a) and outcome (b). Note: confirmed cases of COVID-19, state of Rio de Janeiro, March–September 2020

The geographical distribution of confirmed cases and deaths per municipality of residence is shown in Table 5. The incidence for the entire state was equal to 1522.9 cases per 100,000 inhabitants, the mortality rate was 112.9 deaths per 100,000 inhabitants, and the lethality rate was 7.4%. We observed the highest incidences in Varre-Sai, Macaé, and Natividade, with rates of 6321.9, 3799.7, and 3626.8 cases per 100,000 inhabitants, respectively. The municipalities with the highest lethality rates were Nilópolis, São João do Meriti, and Rio de Janeiro, with 14.3, 13.3, and 11.9%, respectively. Rio de Janeiro, Iguaba Grande, and Tanguá recorded the highest mortality rates, with 170.7, 148.8, and 136.7 deaths per 100,000 inhabitants, respectively.

Table 5 Geographic distribution of cases by regions and municipalities of the state of Rio de Janeiro

Regarding the regions of the state of Rio de Janeiro, Metropolitan Region I had the highest lethality (10.5%) and mortality (139.0 deaths per 100,000 inhabitants), while the highest incidence was in the Costa Verde Region (2840.2 cases per 100,000 inhabitants).
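The three indicators reported by age group, municipality, and region are related by simple ratios. The sketch below assumes the usual definitions: incidence and mortality per 100,000 inhabitants, and lethality as the percentage of confirmed cases that died. The counts and population are hypothetical.

```python
def rates(cases, deaths, population):
    """Incidence and mortality per 100,000 inhabitants, and lethality in percent."""
    incidence = cases / population * 100_000
    mortality = deaths / population * 100_000
    lethality = deaths / cases * 100
    return incidence, mortality, lethality


# Hypothetical municipality: 2,000 cases and 150 deaths among 100,000 inhabitants.
incidence, mortality, lethality = rates(cases=2000, deaths=150, population=100_000)
```

Because lethality equals mortality divided by incidence, an area can combine a high incidence with a low lethality, or the reverse.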

The values in Fig. 4 illustrate the mean values of the SHAP metric of the most relevant predictor variables in defining the odds of death of the XGBoost model. Immunodeficiency/immunosuppression, age, region of residence, and presence of dyspnea were the characteristics with the most significant impact on the outcome.

Fig. 4 Means of the SHAP metric of the predictor variables of the XGBoost model. Note: confirmed cases of COVID-19, state of Rio de Janeiro, March–September 2020

Table 6 shows the evaluation measures of the XGBoost model. Its high sensitivity (89%) and specificity (89%) indicate the model’s ability to identify cases of death and non-death, respectively. The accuracy revealed 89% of hits, while the negative predictive value showed that 99% of the cases predicted as non-deaths were correct. The positive predictive value of 30% may reflect the data imbalance since more than 90% of the cases had a non-death outcome.

Table 6 Evaluation measures of the XGBoost model

Figure 5 shows the ROC curve, which appears very close to the upper left corner, with an area under the curve of 94%, revealing that the model performed excellently in predicting the outcome.

Fig. 5 Receiver operating characteristic (ROC) curve of the XGBoost model. Note: confirmed cases of COVID-19, state of Rio de Janeiro, March–September 2020

Table 7 shows the odds ratio of death with the respective confidence intervals of the predictor variables. Note that the variables with the highest odds of death were metropolitan region I - capital, metropolitan region I - without capital, and metropolitan region II (compared to the northwest fluminense region); the presence of dyspnea, presence of fever, presence of diabetes, presence of heart disease (compared to absence); black race/color (compared to white color/race); male sex (compared to female); and age over 30 years (compared to age range 0 to 9 years). The variables associated with lower odds of death were coastal lowlands region (compared to the northwest fluminense region); the presence of headache, presence of odynophagia, presence of muscle pain, presence of diarrhea, presence of loss of smell, and presence of runny nose (compared to absence); and Asian race/color (compared to white color/race).

Table 7 Odds ratios and respective 95% confidence intervals obtained from the results of the XGBoost model

Discussion

In the present study, the mortality and lethality of COVID-19 increased with age, showing that the disease tends to be more severe in the older population. The Chinese Center for Disease Control and Prevention [31] analyzed 44,672 confirmed cases of the disease and reported similar findings. However, the Chinese study found much lower lethality in all age groups than this study did. In the age group of 70 to 79 years, the lethality in the present study was 3.6 times higher than in the Chinese study (28.97% versus 8.0%), while in the age group of 80 years and over, it was 2.9 times higher (42.75% versus 14.8%).

Older people tend to have lower immunity and, therefore, greater susceptibility to severe forms of infectious diseases and a higher frequency of comorbidities such as cardiovascular diseases, diabetes, and chronic lung diseases. These factors may explain the higher frequency of severe form (SARS) cases in addition to higher lethality among people over 60 years. Advanced age as a risk factor for death has been described in Noor et al. [18] and Albitar et al. [19].

The higher incidence of COVID-19 in the age group between 30 and 49 years was probably attributable to their greater exposure to SARS-CoV-2 in public transport or at work, since the need to generate income prevented many people from staying in lockdown. Conversely, the evolution of the frequency of cases of SARS and ILI over time showed a proportional increase in ILI relative to SARS, starting from epidemiological week 19. Over the same time, we observed an increase in the frequency of cases in younger age groups, which may be related to lower adherence to social distancing in this population group. There was a higher frequency of female patients in this study. Williamson et al. [16] and Cobre et al. [17] found similar results. We found an association of black race/skin color with higher odds of death compared with white race/skin color. Williamson et al. [16] found a higher risk of death in black individuals than in white individuals, even after adjusting for other factors (hazard ratio 1.48 (95% CI 1.29–1.69)).

Dyspnea, cough, and fever were the most commonly found symptoms, being more frequent in SARS cases and cases of death. Fever is a clinical sign that is part of the set of organic inflammatory responses to the presence of an infectious agent and is mediated mainly by cytokines. These inflammatory mediators can cause tissue damage and organ dysfunction by stimulating free radicals and other toxic components, as cited in the study by Chang et al. [32]. The latter retrospectively followed 211 patients and investigated risk factors for the progression of COVID-19. Therefore, fever may be a clinical sign associated with greater disease severity. Dyspnea, in turn, is the main symptom of SARS, a severe form of the disease where more deaths are expected. Fever and dyspnea were associated with a higher chance of death. At the same time, other reported symptoms, such as headache, odynophagia, muscle pain, diarrhea, loss of smell, and runny nose, had an association with a lower chance of death. Maciel et al. reported dyspnea as a factor associated with COVID-19 death, while cough, fever, and other symptoms were protective factors [33].

The city of Rio de Janeiro had the highest mortality within the state. After standardization, the Standardized Mortality Ratio (SMR) of the city of Rio de Janeiro remained the highest (supplementary Table 1). Likewise, lethality among cases with the main risk factors identified in this study was consistently higher in the city of Rio de Janeiro (supplementary Table 2). However, the prevalence of risk factors among cases of COVID-19 does not change significantly between regions (Supplementary Figure A). In addition, the city of Rio de Janeiro has significant social inequality, with about 22% of the population residing in socially vulnerable places called favelas (supplementary Table 3). This combination of factors may have contributed to the high mortality and lethality in the city of Rio de Janeiro. It is also noteworthy that the city of Rio de Janeiro has a better laboratory diagnostic capacity for COVID-19, which would lead to a more accurate classification of the underlying cause of death than in other regions of the state of Rio de Janeiro. The population also has better access to health services in the city, facilitating timely confirmation of the diagnosis. The difficulty of laboratory diagnosis of COVID-19 and of access to health services in most cities of the state may have contributed to inaccuracy in the diagnosis of COVID-19 [34]. A less robust health system, especially when associated with a lower socioeconomic level, seems to be a risk factor for higher lethality due to COVID-19 [20].

In this study, people with heart disease had a higher chance of death. Zheng et al. [34] described the association of some comorbidities with higher lethality from COVID-19 in a meta-analysis. Cardiac tissue has a higher number of angiotensin-converting enzyme 2 (ACE2) receptors, which are involved in the endocytosis of SARS-CoV-2, than other tissues, which could allow the virus to damage cardiac tissue directly. Mikami et al. analyzed 6493 hospitalized patients with a confirmed diagnosis of COVID-19 and found a higher risk of death in patients with elevated serum troponin levels [35].

This study found a higher chance of death in people with diabetes. Hyperglycemia secondary to diabetes mellitus can lead to immune dysfunction through the impairment of humoral and cellular functions and the antioxidant system. Besides, diabetic patients are more vulnerable to nosocomial infections [36]. These factors may be associated with a higher chance of death in diabetic patients with COVID-19.

Prognostic factors identified in the literature include laboratory and radiological findings [37, 38], which were not analyzed in this study due to the lack of information in the databases that were used. It is important to highlight that bad prognostic indicators do not guarantee an unfavorable evolution of the disease [39].

Some limitations of our study should be noted. The RedCap platform received ILI notifications until March 27, 2020, and e-SUS Notifica received them after that date. SARS notifications are made by sending the scanned investigation form, which feeds the Influenza Epidemiological Surveillance Information System (SIVEP-Gripe). In this context, the reported cases refer to those who sought care in health units in the state of Rio de Janeiro and were entered into these information systems. The population was instructed to seek medical care in case of signs of severity, which might have generated a selection bias.

Another factor that should be taken into account is the missing data for the variables skin color/race, chronic kidney disease, and immunodeficiency/immunosuppression, which may have generated information bias.

It is also important to mention that asymptomatic cases are not included in this study, which can lead to an overestimation of the lethality rate.

At the time of this study, there was no consensus on the management of COVID-19 in Brazil. Thus, several health services adopted their own protocols, and there was no single protocol for all municipalities in the state of Rio de Janeiro. This was a very important feature of the Brazilian management of the pandemic and negatively affected the observed results.

The authors recognize the importance of risk factors such as smoking, chronic hepatic disease, and cancer in the analysis of COVID-19, but they were not present in the available databases.

Conclusions

COVID-19 is a disease that can evolve into severe forms leading to death, especially in certain population groups. This study showed that older individuals of black race/skin color with heart disease or diabetes who had dyspnea or fever were more likely to die. The present study aimed to contribute to the early identification of COVID-19 patients who may progress to a more severe form of the disease, improve the clinical management of patients with COVID-19, and reduce the disease’s lethality.