Key Summary Points

Why carry out this study?

Traditional risk prediction models are limited in their ability to identify, at triage, asymptomatic patients with COVID-19 whose condition deteriorates from mild to moderate or extremely severe risk

Existing disease risk assessment models were developed with small data sets, few input variables, and non-standardized independent features, and without specific machine learning algorithms

What was learned from the study?

This prediction model, trained with patient-generated health data (PGHD) from nationwide COVID-19 screening centers, can be globally utilized to monitor hospitalized or quarantined patients with confirmed SARS-CoV-2 infection daily

This risk assessment model, developed with multivariable factors such as demographic, geographic, and clinical characteristics, showed superior performance and can be deployed to triage patients with COVID-19

Introduction

Countries with high vaccination rates, such as the UK, Singapore, Germany, Portugal, and Israel, have created strategies for the new normal after COVID-19 [1,2,3] as many people resume their pre-COVID-19 lives. However, as coronavirus mutations cause breakthrough infections, the current vaccines have little effect on reducing the transmission of the virus. The number of confirmed cases in the UK and Singapore has been increasing since October 2021 [4]. The variants put a great burden on the healthcare systems of those countries [5]. Thus, it is ever more imperative to ensure medical readiness at a national level by preparing accurate and reasonable patient severity classification criteria and procedures [6].

Over the past year and 10 months, South Korea has experienced four COVID-19 outbreaks, and the occurrence of confirmed cases has been suppressed through the 3T strategy (testing and confirmation, investigation and tracing, and treatment) and adjustment of social distancing, without border closures or regional lockdowns [7]. According to the Organization for Economic Cooperation and Development, South Korea has achieved these quarantine results without any containment measures, minimizing economic damage while most effectively blocking the spread of the virus [8]. Although South Korea has performed relatively well in controlling COVID-19, it has had difficulty managing patients whose clinical condition deteriorated from mild to moderate risk level. In fact, there have been cases where patients died at home or in a community treatment center, a facility for isolating asymptomatic and mildly symptomatic patients with COVID-19, as a result of delayed response [9, 10].

Thus, a risk prediction model that accurately identifies patients whose condition deteriorates from mild to moderate or severe risk is required. Furthermore, it is crucial to triage patients with COVID-19 on the basis of the severity of their infection to protect the entire medical system of a nation. For patients with COVID-19 in self-quarantine, accurate severity assessment tools are necessary to appraise health status every day [11, 12]. Several models have been developed to predict the prognosis of patients with confirmed COVID-19 or the likelihood of a COVID-19 diagnosis before confirmation. However, these models had several problems: (1) the research data sets were too small, (2) the number of input variables was limited, (3) the non-standard variables were difficult for other institutions to use, or (4) the specific method of using the model was not presented. Moreover, although several studies have built models for predicting COVID-19 confirmation from nationwide data sets with COVID-19-related features, to the best of our knowledge no study has examined the mortality rate of SARS-CoV-2 according to symptoms at a national level. Preventing the spread of COVID-19 is difficult because it requires national action in addition to the efforts of medical staff. In contrast, mortality can be lowered effectively by medical staff through an appropriate triage protocol.

Thus, this study aims to review previous research on prediction models for COVID-19 and to develop a model predicting the mortality rate of SARS-CoV-2 using nationwide multicenter data, thereby allowing patients to easily predict the severity of COVID-19 by entering their patient-generated health data (PGHD) during out-of-hospital quarantine.

Methods

Review of Previous Research

The review of previous research was based on a search of three databases: Google Scholar, PubMed, and medRxiv. The following keywords were searched in combination: severity, machine learning, deep learning, COVID-19, triage protocol, mortality, and SARS-CoV-2.

In this paper, we propose a machine learning model that predicts the mortality of SARS-CoV-2 based on questionnaires completed by patients. This research was approved by the Institutional Review Board of Seoul National University Bundang Hospital (X-2110-717-902). Informed consent was not obtained owing to the retrospective nature of the study. The study was performed in accordance with the Helsinki Declaration of 1964 and its later amendments.

Data Source and Study Cohort

The data set was collected from February 2020 to July 2021 by the Korea Disease Control and Prevention Agency (KDCA), a government-affiliated organization, for all Koreans who tested positive for SARS-CoV-2 by polymerase chain reaction (PCR). The data set consists of 149,471 patients who tested positive, of whom 2000 died. The data set is labeled according to whether the patient is dead or alive, and it is highly imbalanced (98.7% of patients survived). The data set is mainly composed of three types of patient data: (i) basic personal information, (ii) types of first symptoms, and (iii) underlying diseases. A detailed description of these features is given in Tables 1 and 2. As mentioned in the “Introduction”, the area of residence is included as a feature because it affects the degree of virus activity and the scale of available medical resources.

Table 1 Baseline characteristics of input features
Table 2 Underlying diseases of study participants

The data were collected from 1382 designated COVID-19 screening centers in South Korea. These centers consist of national safe hospitals (263), dedicated respiratory clinics (518), screening clinics in public health centers (627), temporary screening offices (200), and car mobile screening clinics (15). The process of initial screening, transfer, and admission to a hospital or community treatment center (CTC) is presented in Fig. 1.

Fig. 1 Management strategy of COVID-19 confirmed cases in South Korea

The triage of patients with confirmed COVID-19 was based on the severity of their symptoms: asymptomatic to mild, moderate, severe, and critical. Symptoms were assessed by telephone interview or face-to-face at the first-visit facility, and patients were quarantined at designated facilities according to their severity. Asymptomatic and mildly symptomatic patients were admitted to CTCs, whereas patients with more severe symptoms were hospitalized at tertiary hospitals. The referral system at each level of medical care aims to allow patients to be transferred efficiently to a higher level of care before their clinical status worsens [13].

The overall process of hospitalization and transfer is presented in Fig. 1.

Data Collection and Measurement

Previous studies revealed that outbreaks of COVID-19 were associated with latitude, temperature, and humidity measurements, which reflect seasonal variation in the incidence of respiratory viruses [14, 15]. Thus, geographic information in the form of latitude and longitude has been integrated into our model.

Easy-to-measure features are defined as variables such as body temperature, pulse rate, respiratory rate, blood pressure, any symptoms, and past medical history that can be directly collected from patients without much delay.

Outcome Definition

The outcome was defined as death due to COVID-19 in a hospital, in a CTC, or at home. The mortality cases were collected by the KDCA from national statistics.

Feature Generation

Previous work has shown that the structural stability of individual SARS-CoV-2 virus-like particles can be affected by the temperature and humidity of the atmosphere [16]. In addition, hospitalization rates may vary depending on access to medical resources and the severity of previous diseases [17]. For these reasons, we utilized additional features such as the date of the onset of symptoms (in months), the area of residence (in longitude and latitude coordinates), and underlying patient symptoms.

The features of the data set provided by KDCA are as follows: sex, age, body temperature, clinical symptoms (cough, sputum, sore throat, dyspnea, musculoskeletal pain, headache, chill, ageusia, anosmia), and self-reported underlying diseases. For body temperature (T), we categorized patients into four subgroups: (1) no fever with T ≤ 36.5 °C, (2) mild elevation of body temperature with 36.5 °C < T < 37.5 °C, (3) mild fever with 37.5 °C ≤ T < 38.3 °C, and (4) overt fever with T ≥ 38.3 °C. All clinical symptoms have binary values: true or false. Since underlying diseases are self-reported in a free format, we manually classified the reported diseases into seven subgroups: liver disease, cancer, diabetes mellitus, cardio-cerebrovascular disease, renal disease, degenerative disease, and lung disease. Each subgroup feature holds the number of reported diseases in that subgroup; thus, if a patient had lung cancer and liver cancer, the feature named “cancer” was assigned a value of two for this patient. This was done to reduce the sparsity of our data set: since there are so many different diseases, the data set would become very sparse if we treated each disease as a separate feature. If a model is naively trained on such a sparse data set, its performance would degrade; worse still, the chances of the model wrongly predicting the mortality probability for a patient with a rare disease could increase. Moreover, requiring many features would lower user convenience.
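The encoding described above can be illustrated with a minimal sketch. The temperature cut-offs are those quoted in the text; the keyword map is purely hypothetical, because the free-text disease reports were classified manually and the actual classification rules are not given.

```python
# Hypothetical keyword map: the study classified free-text reports manually,
# so these keywords are illustrative assumptions, not the authors' rules.
DISEASE_KEYWORDS = {
    "liver disease": ["hepatitis", "cirrhosis"],
    "cancer": ["cancer", "carcinoma"],
    "diabetes mellitus": ["diabetes"],
    "cardio-cerebrovascular disease": ["stroke", "angina", "myocardial infarction"],
    "renal disease": ["kidney", "renal"],
    "degenerative disease": ["dementia", "parkinson"],
    "lung disease": ["asthma", "copd"],
}


def temperature_group(t_celsius):
    """Return subgroup 1-4 using the cut-offs quoted in the text."""
    if t_celsius <= 36.5:
        return 1  # no fever
    if t_celsius < 37.5:
        return 2  # mild elevation of body temperature
    if t_celsius < 38.3:
        return 3  # mild fever
    return 4  # overt fever


def count_diseases(reports):
    """Count reported diseases per subgroup (one count per matching report)."""
    counts = {group: 0 for group in DISEASE_KEYWORDS}
    for report in reports:
        text = report.lower()
        for group, keywords in DISEASE_KEYWORDS.items():
            if any(keyword in text for keyword in keywords):
                counts[group] += 1
    return counts


# Two cancers reported by one patient give a count of two for "cancer".
example = count_diseases(["lung cancer", "liver cancer"])
```

Counting per subgroup keeps the input to seven disease features regardless of how many distinct diagnoses appear in the free text.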

Training and Evaluation

We split the data set into a training set and a test set with an 80:20 ratio, and the model was evaluated on the test set. We used a tree-based gradient boosting machine learning model with a binary logistic objective, XGBoost (XGB) [18]. This model is a decision-tree-based ensemble machine learning model known for its powerful performance in classification problems in various fields [19, 20]. Since this is a tree-based model, it has the advantage of being able to process data with missing values [21]. Another benefit of using gradient boosting algorithms is that they enable straightforward measurement of feature importance scores in prediction by calculating how useful each feature is in the construction of the weak learners within the model. However, this method does not tell us how positively or negatively the features affected the prediction and does not consider the associations among features in making predictions.
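An 80:20 split of a highly imbalanced data set can be sketched as follows. The text does not state whether the split was stratified by outcome; stratification, the function name `stratified_split`, and the seed are assumptions made here for illustration only.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split indices into train/test while keeping the class ratio in both parts.

    Stratification is an assumption; the source only reports an 80:20 split.
    """
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Synthetic labels with 2% positives, for illustration only.
labels = [1] * 20 + [0] * 980
train_idx, test_idx = stratified_split(labels)
```

With rare outcomes, stratifying avoids a test set that happens to contain very few positive cases.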

Meanwhile, the SHapley Additive exPlanations (SHAP) algorithm [22], which originates from game theory, is used to compute Shapley values [23] for each feature, where each Shapley value represents the impact of the associated feature on the prediction. When used for tree-based models, SHAP has the great advantage of being able to calculate Shapley values relatively quickly. Therefore, we have utilized it to identify the principal features in model prediction.
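To make the notion of a Shapley value concrete, the sketch below computes exact Shapley values for a toy risk score by averaging each feature's marginal contribution over all feature orderings. This brute-force enumeration is not the tree-specific algorithm used by SHAP, and the toy score and feature names are hypothetical.

```python
from itertools import permutations
from math import factorial


def shapley_values(value_fn, features):
    """Exact Shapley values: average marginal contribution over all orderings."""
    phi = {f: 0.0 for f in features}
    for order in permutations(features):
        present = set()
        for f in order:
            before = value_fn(frozenset(present))
            present.add(f)
            phi[f] += value_fn(frozenset(present)) - before
    n_orders = factorial(len(features))
    return {f: total / n_orders for f, total in phi.items()}


def toy_risk(present):
    """Hypothetical risk score with an interaction between two features."""
    score = 0.01
    if "age_over_80" in present:
        score += 0.20
    if "dyspnea" in present:
        score += 0.05
    if "age_over_80" in present and "dyspnea" in present:
        score += 0.04  # interaction term, shared equally by the two features
    return score


phi = shapley_values(toy_risk, ["age_over_80", "dyspnea", "cough"])
```

The values sum to the difference between the score with all features present and the score with none, and a feature that never changes the score (here "cough") receives zero.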

The model was evaluated on the test set using various metrics, including area under the receiver operating characteristic curve (AUROC), area under the precision–recall curve (AUPRC), F1 score, precision, sensitivity, and specificity. Moreover, we performed a decision curve analysis on the model. ROC analysis provides information about diagnostic test performance; a ROC curve plots the true positive (TP) rate against the false positive (FP) rate and demonstrates the discriminatory ability of a binary classifier as the discrimination threshold is varied. In other words, the closer the curve comes to the upper left corner (high TP rate and low FP rate), the greater the discriminatory ability of the test. In addition, the baseline for AUROC is always 0.5, the value expected from a random classifier.
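AUROC can also be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. A minimal sketch using this pairwise definition follows; the data are synthetic and the function name is an illustrative choice, not the evaluation code used in the study.

```python
def auroc(y_true, scores):
    """AUROC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Synthetic example: three of the four positive/negative pairs are ordered correctly.
y = [0, 0, 1, 1]
s = [0.1, 0.4, 0.35, 0.8]
value = auroc(y, s)

# A classifier that gives every case the same score sits at the 0.5 baseline.
constant = auroc(y, [0.3, 0.3, 0.3, 0.3])
```

The pairwise form is quadratic in the number of cases, so it is suitable only for small illustrations.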

On the other hand, PR curves plot the precision against the recall, and AUPRC is especially useful for imbalanced data in a setting where we focus more on detecting the positive examples. Unlike AUROC, the baseline for AUPRC is equal to the fraction of positives. This means that obtaining an AUPRC of 0.4 on a class with 10% positives is good but obtaining an AUPRC of 0.6 on a class with 80% positives is undesirable [24].
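The dependence of the AUPRC baseline on prevalence can be illustrated with average precision, one common estimator of AUPRC. Whether the study used this estimator is not stated, so it is an assumption here; the counts 398 and 29,895 are the test-set figures reported in the Results.

```python
def average_precision(y_true, scores):
    """Average precision: mean of the precision values at each positive case.

    Cases are ranked by descending score; tied scores are not treated specially.
    """
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    n_pos = sum(y_true)
    tp = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            tp += 1
            total += tp / rank  # precision at this recall step
    return total / n_pos


# Synthetic example with 50% positives, so the baseline is 0.5.
y = [0, 0, 1, 1]
s = [0.1, 0.4, 0.35, 0.8]
ap = average_precision(y, s)

# Baseline for the reported test set: fraction of positives.
baseline = 398 / 29895
```

A random classifier is expected to score close to `baseline`, which is why AUPRC values must be read relative to the fraction of positives.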

Results

Literature Review

Previous research was classified according to five classification criteria: (1) type of learning data, (2) type of prediction models, (3) outcome variables, (4) data type, and (5) whether or not easy-to-measure input features were utilized. In terms of modeling and utilizing the prediction models, they have four major components: (1) gathering patients’ information such as symptoms, signs, and previous medical history; (2) results of imaging studies and laboratory tests; (3) confirmation of COVID-19 through reverse transcriptase polymerase chain reaction (RT-PCR) test; and (4) triage of confirmed cases. The schematic flow of management for patients with COVID-19 is presented in Fig. 2.

Fig. 2 Classification of the previous prediction models according to the type of learning data and type of prediction models

In terms of outcome variables, previous studies were classified into four major classes.

Outcome class 1: diagnosis.

  • A + B ⇒ 1 (Zoabi, Menni) [25, 26].

  • B ⇒ 1 (Yanamala) [27].

  • D ⇒ 1 (Gozes, Song, Jin, Punn) [28,29,30,31].

  • A + B + C + E + 2a ⇒ 1 (Feng) [32].

Outcome class 2: mortality.

  • F + 1 + 2a + 2b + 2c ⇒ 3a vs 3b (Cifuentes) [33].

  • A + B + C + E + 1 + 2a ⇒ 3a vs 3b (Her) [34].

  • C + 1 + 2a + 2b + 2c ⇒ 3a vs 3b (Cho) [35].

  • C + E + 1 + 2a ⇒ 3a vs 3b (Ikemura) [36].

Outcome class 3: mortality and complication.

  • B + D + E + 1 + 2a ⇒ (3a + 3c) vs 3b (Shamout) [37].

  • C + E + 1 + 2a ⇒ (3a + 3c) vs 3b (Subudhi) [38].

  • A + B + C + E + 1 + 2a ⇒ (3a + 3c) vs 3b (Marcos) [39].

  • A + B + C + 1 + 2a ⇒ (3a + 3c) vs 3b (Kim) [40].

  • C + E + 1 + 2a ⇒ (3a + 3c) vs 3b (Su) [41].

Outcome class 4: complication.

  • A + B + C + 1 + 2a + 2b + 2c ⇒ 3b vs 3c (Rinderknecht) [42].

  • A + B + C + D + E + 1 + 2a ⇒ 3b vs 3c (Wang) [43].

We reviewed 19 existing studies and classified them by the five classification criteria into the four major outcome classes. The result is presented in Table 3.

Table 3 Previous research regarding COVID-19 prediction models

The baseline characteristics of the input features used in the research are presented in Tables 1 and 2. Owing to the large scale of the data, the area of residence for each confirmed patient was converted to floating-point latitude and longitude values using the Python client for the Google Maps API.

The distribution of longitude and latitude of the study participants is presented in Fig. 3. The x-axis represents the latitude or longitude coordinate, while the y-axis shows the number of patients. The μ and σ in the title denote the mean and the standard deviation, respectively. Even though discrepancies between the actual area of residence and the latitude–longitude pair exist, they were ignored because such cases were rare.

Fig. 3 Histogram of patients' distribution by latitude (top) and longitude (bottom)

The seasonality of the cumulative number of confirmed cases per month is presented in Fig. 4. The height of each bar represents the number of patients in that month. We marked the number of patients and their percentage (%) at the top of the bar.

Fig. 4 Cumulative number of confirmed cases per month

Model Performance

The proposed model achieved an AUROC score of 0.950 (95% tolerance interval [TI] 0.940–0.958; 95% confidence interval [CI] 0.949–0.950), Youden’s index of 0.739, F1 score of 0.861, recall of 0.807, precision of 0.923, and specificity of 0.933. Since the size of the test set was 29,895, and there were 398 positives in the test set, the fraction of positives is 0.013, which is the baseline for the AUPRC score. The model achieved an AUPRC score of 0.268 (95% TI 0.225–0.310; 95% CI 0.266–0.269), greatly outperforming the baseline score of 0.013. The general ROC curve and PR curve are presented in Fig. 5.
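Youden's index is defined as sensitivity plus specificity minus one. As a quick consistency check, the short sketch below applies this definition to the rounded recall and specificity reported above; the function name is an illustrative choice.

```python
def youden_index(sensitivity, specificity):
    """Youden's J statistic: sensitivity + specificity - 1."""
    return sensitivity + specificity - 1.0


# Rounded values reported in the text (recall is the same quantity as sensitivity).
j = youden_index(0.807, 0.933)
```

The rounded inputs give approximately 0.740, which agrees with the reported 0.739 to within the rounding of the inputs.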

Fig. 5 a ROC curve and b precision–recall curve. The gray bands around the curves are pointwise 95% TI and 95% CI, which are derived by bootstrapping with 1000 repetitions
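Bootstrapping with 1000 repetitions can be sketched as follows. The exact procedure used to derive the TI and CI is not described, so this example shows only a generic percentile interval; the data are synthetic, and the function names and seed are assumptions.

```python
import random


def sensitivity(y_true, scores, threshold=0.5):
    """Fraction of positive cases with a score at or above the threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    return tp / sum(y_true)


def bootstrap_interval(y_true, scores, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile interval of a metric over resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # skip resamples that lack one of the two classes
        stats.append(metric(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[round((alpha / 2) * n_boot)]
    upper = stats[round((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Synthetic data for illustration: 30 of 40 positives score above 0.5.
y = [1] * 40 + [0] * 160
s = [0.9] * 30 + [0.2] * 10 + [0.1] * 160
point = sensitivity(y, s)
low, high = bootstrap_interval(y, s, sensitivity)
```

Applying the same resampling at each point of a curve yields pointwise bands of the kind drawn around the curves in the figure.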

We compared the performance of four different models (Table 4). The XGB model achieved the highest scores with an AUROC of 0.950 and AUPRC of 0.268.

Table 4 Performance of four different models

Explainability

Feature importance was measured by SHAP, as presented in Fig. 6. Features in the plot are sorted in descending order by their maximum absolute values. A single dot on each row represents the explanation for each patient, and the original feature values are represented by their colors. The SHAP analysis identified age as the most important risk factor for mortality. Body temperature was also an important risk factor, as were diseases present before COVID-19 infection, such as renal disease, degenerative disease, cancer, liver disease, cardiovascular disease, and lung disease. Among the initial symptoms of patients, dyspnea was shown to be an important risk factor. Geographic information is also closely related to the mortality of patients with COVID-19: higher longitude and latitude are related to higher mortality. In South Korea, the northeast region is more mountainous than the western or southern regions, while almost all large cities are located in the southern and western parts of the country. In terms of accessibility to acute care facilities, geographic location significantly affects the mortality of patients with acute respiratory diseases [44, 45]. Differences in weather according to location may also affect the severity of disease or mortality of the patients [46].

Fig. 6 Feature importance plot

Fig. 7 Decision curve analysis and the histogram of predicted probabilities of the XGB model

Cost–Benefit Analysis

Decision curve analysis (DCA), as depicted in Fig. 7, shows the range of threshold probabilities over which a prediction model is of value and the magnitude of its benefit [47]. In the context of this research, the threshold can be used to decide whether a self-quarantined patient should be hospitalized or not. The threshold should be set depending on the medical and economic environment of the country in which the model is implemented. The DCA identified the optimal threshold range as the range in which net benefit does not fall below zero. In our model, the optimal threshold for the DCA ranged from 0 to 0.05.
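The net benefit plotted in a decision curve is commonly computed as TP/n − (FP/n) × pt/(1 − pt), where pt is the threshold probability and n is the number of patients. The sketch below applies this standard formula to synthetic data; it is not the analysis code of the study, and the function names are illustrative.

```python
def net_benefit(y_true, probs, threshold):
    """Net benefit of intervening on patients with probability >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, probs) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, probs) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy: intervene on every patient."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Synthetic outcomes and predicted probabilities, for illustration only.
y = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
p = [0.60, 0.04, 0.02, 0.10, 0.30, 0.01, 0.03, 0.02, 0.01, 0.07]

nb_model = net_benefit(y, p, threshold=0.05)
nb_all = net_benefit_treat_all(y, threshold=0.05)
```

A model is useful at a given threshold when its net benefit exceeds both the treat-all reference and zero, the net benefit of intervening on no one.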

We also investigated the types of medical institutions visited by patients according to their predicted mortality probabilities, as shown in Fig. 8. First, we divided the test set into three groups: patients with predicted mortality probabilities less than 0.05, those between 0.05 and 0.5, and those greater than 0.5. Then, we analyzed the types of medical institutions that the patients visited first for each group. Since public health centers are generally the first places where patients receive the PCR test, public health centers account for a large proportion of the first-visit institutions. However, the proportion of hospitals in the pie chart increases as the predicted mortality probability increases, which means that more severely infected patients were more likely than less severe cases to visit a hospital first.

Fig. 8 First-visit facility of patients with COVID-19 according to the patients’ mortality probabilities
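The grouping by predicted probability can be sketched as below. The handling of values exactly equal to 0.05 or 0.5 is not specified in the text and is an assumption here, as are the group labels, the facility names, and the records themselves.

```python
from collections import Counter


def risk_group(p, low=0.05, high=0.5):
    """Assign a predicted probability to one of three groups.

    Boundary handling (0.05 and 0.5 fall in the middle group) is an assumption.
    """
    if p < low:
        return "low"
    if p <= high:
        return "intermediate"
    return "high"


def first_visit_share(records):
    """Share of each first-visit facility type within each risk group.

    records: iterable of (predicted_probability, facility_type) pairs.
    """
    counts = {}
    for p, facility in records:
        counts.setdefault(risk_group(p), Counter())[facility] += 1
    return {
        group: {f: c / sum(counter.values()) for f, c in counter.items()}
        for group, counter in counts.items()
    }


# Hypothetical records, for illustration only.
records = [
    (0.01, "public health center"),
    (0.02, "public health center"),
    (0.03, "hospital"),
    (0.20, "public health center"),
    (0.30, "hospital"),
    (0.70, "hospital"),
]
shares = first_visit_share(records)
```

Each inner dictionary corresponds to one pie chart of first-visit facilities for a risk group.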

Discussion

In this research, we propose a machine learning model that predicts the prognosis of SARS-CoV-2-infected patients by obtaining 20 basic pieces of PGHD. The model was developed using the data of 149,471 patients from 1382 designated COVID-19 screening centers. Thus, our model can be utilized globally for triaging patients with confirmed SARS-CoV-2 infection at the initial stage and monitoring hospitalized or quarantined patients daily.

The characteristics of SARS-CoV-2 and the related spectrum of signs and symptoms are the subjects of much ongoing research. Initial triage of the patients is crucial to prevent the shutdown of the entire medical system of a country. Thus, there have been many studies on developing patient triage algorithms using easily obtainable signs and symptoms. The model in this study provides a novel method integrating easily obtainable signs and symptoms, along with geographic and seasonal data that reflect characteristics of respiratory viruses, all from the nationwide multicenter database, including hospitalization and mortality data.

Accurate patient triage may lower the burden currently faced by health systems through facilitating optimized management of healthcare resources during future waves of the SARS-CoV-2 pandemic [48]. This is especially important in developing countries with limited resources to maintain essential health services [49].

While reviewing the existing research, we found that most of the previous studies utilized limited data. Furthermore, almost all of them utilized various input features that are not easy to measure. In contrast to the previous studies, we adopted two types of demographic information, one geographic location, one sign, nine symptoms, and seven underlying diseases, all of which are easy to measure. Only body temperature and the nine symptoms can change during quarantine and hospitalization. Thus, patients can check the severity of the disease every day by updating these variable input features. The data for the research were collected from 1382 designated COVID-19 screening centers in South Korea, which means the developed model covered patients with varied clinical characteristics from all over the country. In addition, we adopted longitude and latitude in our model to reflect the influence of weather on the clinical characteristics of the acute respiratory virus and the accessibility of acute care facilities in each region.

Using the result of the DCA, users can set a threshold for interventions such as transfer to a higher level of care or medical facility, or a thorough examination by doctors. For example, if a higher false-positive rate is acceptable and the aim is to capture as many patients needing intervention as possible, the threshold can be set near 0. If hospital beds have to be saved for severe patients when medical resources are being depleted, the threshold can be set closer to 0.05.

The SHAP analysis indicated that patients with previous renal, degenerative, or cardiovascular diseases or cancer should be monitored thoroughly. In addition, body temperature and dyspnea should be considered the most important factors for assessing daily whether a patient's health is deteriorating.

One of the main limitations of the study is that our model has not yet been extensively applied to the field. Therefore, we could not quantify how efficiently our model could lower the burden on the healthcare system. However, since our model has high performance and is easily accessible, we expect to have positive results and leave this analysis for future work.

Conclusion

We developed a model for predicting COVID-19 mortality from 20 basic pieces of PGHD, based on nationwide multicenter data reported by KDCA. With the help of COVID-19 vaccination and medicine to be released soon, it will be more important to manage patients under quarantine at home or in a facility. Our framework can be implemented and utilized conveniently to triage patients with positive RT-PCR test results and to enable them to monitor themselves at home or in a quarantine facility.