Introduction

Emergency department (ED) triage is a critical process for emergency patients who need appropriate treatment and for hospitals that need optimal resource allocation1,2. During a pandemic, ED triage is especially important for distinguishing patients with high acuity, as the number of high-acuity cases presenting to the ED increased after the COVID-19 outbreak3.

Several early warning scoring systems, such as the National Early Warning Score (NEWS) and the Modified Early Warning Score (MEWS), have been established to identify the risk of catastrophic deterioration and inpatient death4. The Canadian Emergency Department Triage and Acuity Scale (CTAS) is a well-recognized and validated triage system that prioritizes patient care by severity of illness5.

Based on the CTAS, the Korean Triage and Acuity Scale (KTAS) was developed to assess patient severity in Korea6. Despite its potential, it has some problems, such as dependence on subjective assessment by medical staff during ED triage1,7,8.

Several machine learning-based digital triage systems have been proposed for the ED7,9,10. However, the black-box nature of machine learning makes these models hard to interpret and implement in real-world settings, and few studies have focused on interpretability to address this problem11,12,13.

Interpretable AI includes reasoning processes that can make AI predictions understandable for triage in the ED14. Xie et al. developed the Score for Emergency Risk Prediction (SERP) based on the Singapore population12, using the AutoScore framework to generate an interpretable score13. However, that was a single-center study, and external validation is critical for generalization. This study aims to validate the SERP score, derived from the Singapore population, on a Korean population and to compare its predictions with those of conventional scores from various perspectives.

Results

As shown in Fig. 1, 373,172 patients visited the ED of SMC during the study period from 2016 to 2020. Among them, 87,649 patients were excluded, and 285,523 patients were included in the final analysis. The mortality rate of the whole cohort was 1.60% for in-hospital death and 3.80% for death at 30 days.

Figure 1

Flow chart of the study population.

The distribution of ED patients’ demographics is shown in Table 1. The pre-pandemic cohort included 232,982 ED visits (mean [SD] patient age, 59.9 [17.1] years; 119,681 [51.6%] female), whereas the pandemic cohort included 53,541 ED visits (mean [SD] patient age, 56.1 [17.4] years; 27,114 [50.6%] female).

Table 1 Baseline characteristics of the validation population.

There were differences between the pre-pandemic and pandemic periods, especially in vital signs and mortality. Mean (SD) systolic and diastolic blood pressures during the pandemic (130.3 [24.9] and 77.5 [15.1] mm Hg) were lower than those during the pre-pandemic period (134.1 [24.6] and 81.5 [15.3] mm Hg). The 30-day mortality was 4.0% during the pre-pandemic period and 2.5% during the pandemic. Regarding comorbidities, cancer, diabetes, and stroke were the most common diseases. Moreover, patient severity at triage was quite different: the proportion of high-acuity patients was higher during the pre-pandemic period (KTAS level 1: 1637 [0.8%] vs. 103 [0.2%]; KTAS level 2: 15,715 [7.2%] vs. 2762 [5.9%], pre-pandemic vs. pandemic).

The SERP-30d achieved better performance than KTAS for in-hospital and 30-day mortality prediction, with AUCs of 0.813 (95% CI 0.809–0.817) and 0.795 (95% CI 0.789–0.801), respectively (Table 2). In contrast, KTAS achieved AUCs of 0.717 (0.712–0.722) and 0.741 (0.733–0.749); for in-hospital mortality, this corresponds to an improvement of more than 40% in discrimination above chance (AUC − 0.5) for SERP.

Table 2 Comparison of AUROC by different scores and outcomes.

The SERP-30d score showed good calibration (Kolmogorov–Smirnov test on the calibration data: P = 0.405). The SERP-30d calibration plot on the validation data set is illustrated in Supplementary Fig. 1. As shown in Supplementary Table 2, the results before and during the pandemic (split at 2020) were very different: the performance of all SERP scores during the COVID-19 period was superior to that before it.

In terms of score accuracy, we compared performance at matched sensitivity and specificity levels from 0.7 to 0.9. As shown in Table 3, the SERP score achieved higher specificity than KTAS at the same sensitivity level. For example, at a fixed sensitivity of 0.7, the specificity of SERP was 0.790, whereas that of KTAS was 0.568. This result shows that, at a comparable false-positive rate, SERP can detect more patients with a high mortality risk than KTAS.

Table 3 Comparison of prediction model accuracy at the same specificity point.
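The matched-operating-point comparison above can be sketched as follows. This is an illustrative Python example on synthetic data (the study’s analyses were performed in R), not the actual SMC cohort; the prevalence and score distributions are invented.

```python
# Sketch: read specificity off an ROC curve at a fixed target sensitivity,
# the comparison style used in Table 3. Synthetic data for illustration only.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
n = 10_000
y = rng.binomial(1, 0.04, size=n)                      # ~4% mortality, invented
score = y * rng.normal(2, 1, n) + rng.normal(0, 1, n)  # higher score -> higher risk

fpr, tpr, _ = roc_curve(y, score)

def specificity_at_sensitivity(fpr, tpr, target_sens):
    """Specificity (1 - FPR) at the first ROC point reaching the target sensitivity."""
    idx = np.searchsorted(tpr, target_sens)  # tpr is nondecreasing along the curve
    return 1.0 - fpr[idx]

for s in (0.7, 0.8, 0.9):
    print(f"sensitivity {s:.1f} -> specificity {specificity_at_sensitivity(fpr, tpr, s):.3f}")
```

The same helper, with the roles of `fpr` and `tpr` swapped, yields sensitivity at a fixed specificity.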

Regarding the alarm fatigue problem, we compared the scores at the same number of detected mortality events. As shown in Fig. 2, KTAS raised more alarms than SERP for the same events. For example, for 9937 and 7925 events, KTAS raised 263,172 and 143,382 alarms, respectively, whereas the SERP score raised only 211,848 and 85,134, decreases of 19% and 40%, respectively.

Figure 2

Comparison of the number of alarms needed at the same sensitivity point for predicting mortality between KTAS (Korean Triage and Acuity Scale) and SERP (Score for Emergency Risk Prediction).
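The reported alarm reductions follow directly from the alarm counts quoted in the text; a minimal check, using only those published numbers:

```python
def alarm_reduction(alarms_ktas: int, alarms_serp: int) -> float:
    """Fractional reduction in alarms when SERP replaces KTAS at equal detected events."""
    return 1 - alarms_serp / alarms_ktas

# Counts quoted in the text (Fig. 2): 9937 events for the first pair, 7925 for the second.
r1 = alarm_reduction(263_172, 211_848)
r2 = alarm_reduction(143_382, 85_134)
print(f"reduction: {r1:.1%} and {r2:.1%}")  # → reduction: 19.5% and 40.6%
```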

Discussion

We validated the SERP score for predicting mortality in the ED using SMC data. SERP outperformed conventional ED triage scores in both of our main aspects of interest: predictive performance and alarm burden. In particular, SERP produced fewer false alarms for the same number of detected events. Excessive false alarms can reduce productivity and result in alarm fatigue, putting critical patients at risk15.

Previous studies on machine learning have usually focused on accuracy7,9,10, and only a few have demonstrated the interpretability needed for easy use of a model. A critical consideration is real-world application: in the complex and busy ED environment, a model must be lightweight and interpretable. Another strength of SERP is that it requires only a few features, all of which are routinely collected during triage, so implementing the SERP score in the ED is not a big challenge.

There is growing consensus among researchers on efforts toward the real-world application of AI in healthcare and on practical issues regarding the integration of AI into existing clinical workflows16,17. Brajer et al. suggested a machine-learning model fact sheet for reporting to end-users18. Visualization-based efforts, such as population-, patient-, and temporal-level feature importance, or nomograms, could be adopted19,20,21,22. Like the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) and the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) guidelines for reporting results23,24, there should be a guideline standardizing user interfaces (UIs) and a format for clinical decision support for end-users, including clinicians and patients25,26. Regarding data sharing, privacy, and interoperability across multiple platforms, hospital policies and national laws are also important. Lack of standardization, black-box opacity, inadequate evaluation, and patient-safety problems are the other major issues for AI implementation16,17.

The characteristics of patient populations can differ considerably across hospitals and countries. Although the SERP validation performance was good for long-term outcomes, conventional indexes such as NEWS, MEWS, and KTAS were equivalent for short-term outcomes, so there may be a role for customizing a new SERP score for Korea. We also observed that as the mortality timeframe increased from 2 to 30 days, performance worsened for the conventional indexes but improved for the ML-based score.

The subgroup analysis showed a difference in performance between the pandemic and pre-pandemic periods. This could be due to the different patient mix during the pandemic27,28. We also identified differences in feature importance between the two periods: during the COVID-19 period, the top three important features were all vital signs, whereas age was the second most important variable during the pre-pandemic period. Finally, the rates of admission and transfer were higher during the pandemic, even though patients were less severely ill based on KTAS.

There are some limitations to this study. First, it is retrospective and needs further prospective evaluation, although a strength of this validation is its multi-center, multi-national nature. Second, we only considered Korean data from SMC, which may not represent all Koreans. In the future, we intend to conduct the same validation with more hospitals in Korea or with the National Emergency Department Information System (NEDIS), a nationwide registry of ED data29. As the variables used in the SERP score are not complicated, international validation can also be considered, applying the score to other nationwide ED registries through the Common Data Model or the Pan-Asia Trauma Outcomes Study.

In this study, we validated the SERP score with Korean data. Its performance was better than the conventional indexes in terms of accuracy and false alarms.

Methods

Study setting

This was a retrospective validation study of the SERP score using data from the Samsung Medical Center (SMC) in Korea. SMC is a tertiary hospital located in a metropolitan city in Korea. The hospital has approximately 2000 inpatient beds. More than 80,000 patients visit the ED annually.

The Electronic Health Records (EHR) were obtained from the Clinical Data Warehouse at SMC. This study was approved by the Samsung Medical Center Institutional Review Board (2022-05-083-001), and a waiver of consent was granted for EHR data collection and analysis because of the retrospective and de-identified nature of the data.

All methods were performed in accordance with the relevant guidelines and regulations24.

Population

The validation cohort comprised ED visits from January 2016 to December 2020; all patients who visited the ED during this period were initially included. We excluded patients who were under 20 years of age, did not come for emergency treatment, left without being seen by a clinician, had missing triage data, or were dead on arrival (DOA) (see Fig. 1)15. To assess the impact of the COVID-19 pandemic, we defined two non-overlapping cohorts, a pre-pandemic and a pandemic period, split at the start of 2020.

SERP score

Three SERP scores were validated against the primary outcomes of in-hospital and 30-day mortality after the ED visit. Each score was developed with the AutoScore framework, an automatic and interpretable score generator for risk prediction that combines machine learning and logistic regression12,13.
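At deployment, an AutoScore-derived score reduces to looking up integer points per variable bin and summing them. The sketch below illustrates only that mechanic; the bins and point values are invented for illustration and are NOT the actual SERP weights, which are given in the original publication12.

```python
# Hypothetical point-based score in the AutoScore style: each variable is
# binned by cutoffs, each bin carries integer points, and the risk score is
# the sum. Cutoffs/points here are invented, NOT the real SERP.
def points(value, cutoffs, pts):
    """Points for the bin containing value; len(pts) == len(cutoffs) + 1."""
    for c, p in zip(cutoffs, pts):
        if value < c:
            return p
    return pts[-1]

def toy_score(age, sbp, heart_rate):
    """Sum of per-variable points, as in a tabular point-based model."""
    return (points(age, [50, 80], [0, 4, 9])         # older -> more points
            + points(sbp, [100, 160], [8, 0, 3])     # hypotension penalized most
            + points(heart_rate, [60, 110], [2, 0, 6]))

print(toy_score(age=72, sbp=92, heart_rate=118))  # → 18
```

Because the final model is just such a lookup table, clinicians can compute it by hand at triage, which is what makes the score interpretable.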

For outliers, we assumed that extreme vital-sign values were input errors and designated them as missing based on clinical knowledge. For example, any vital-sign value below 0, heart rate above 300/min, respiratory rate above 50/min, systolic blood pressure above 300 mm Hg, diastolic blood pressure above 180 mm Hg, or oxygen saturation (by pulse oximetry) above 100% was treated as missing and imputed with the median value from the training cohort. The missing rate of each variable is presented in Supplementary Table S1.
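The out-of-range-to-missing rule above can be sketched as follows. This is an illustrative Python example (the study used R); the column names and toy data are assumptions, while the plausibility bounds follow the text.

```python
# Sketch: set clinically implausible vitals to missing, then impute with
# training-cohort medians, per the rule described in the text.
import numpy as np
import pandas as pd

PLAUSIBLE = {                      # (lower, upper) inclusive bounds from the text
    "heart_rate": (0, 300),        # beats/min
    "respiration_rate": (0, 50),   # breaths/min
    "sbp": (0, 300),               # systolic blood pressure, mm Hg
    "dbp": (0, 180),               # diastolic blood pressure, mm Hg
    "spo2": (0, 100),              # pulse-oximetry saturation, %
}

def clean_vitals(df: pd.DataFrame, train_medians: pd.Series) -> pd.DataFrame:
    """Replace out-of-range vitals with NaN, then impute with training medians."""
    out = df.copy()
    for col, (lo, hi) in PLAUSIBLE.items():
        ok = (out[col] >= lo) & (out[col] <= hi)
        out[col] = out[col].where(ok)              # out-of-range -> NaN
        out[col] = out[col].fillna(train_medians[col])
    return out

# Toy data: row 0 is plausible, row 1 is entirely out of range.
train = pd.DataFrame({"heart_rate": [80, 90, 100], "respiration_rate": [16, 18, 20],
                      "sbp": [120, 130, 140], "dbp": [70, 80, 90], "spo2": [97, 98, 99]})
valid = pd.DataFrame({"heart_rate": [72, 999], "respiration_rate": [14, 60],
                      "sbp": [118, 350], "dbp": [75, 200], "spo2": [96, 120]})
cleaned = clean_vitals(valid, train.median())
print(cleaned)
```

Note that the medians come from the training cohort only, so the validation set never informs its own imputation.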

Statistical analysis

The data were analyzed using R software, version 3.5.3 (R Foundation for Statistical Computing).

For the descriptive summaries of baseline characteristics of the study population, frequency (percentages) for categorical variables and mean (SD) for continuous variables were reported.

Performance evaluation

We compared the validation performance of SERP with that of conventional indexes such as NEWS, MEWS, and KTAS in two main aspects4. First, how accurately can the SERP score predict the outcome compared with a conventional index? Predictive power was measured using the area under the receiver operating characteristic (ROC) curve (AUC). Other metrics, such as sensitivity, specificity, and positive predictive value, were calculated at fixed operating points from 0.7 to 0.9 for comparison. We also examined the calibration plot for agreement between predictions and observed outcomes30. Second, can SERP reduce the false alarm rate relative to the conventional indexes? The alarm rate is important for the validation of SERP because false alarms can cause alarm fatigue31: alarm fatigue tires medical staff and causes critical alerts to be missed, ultimately affecting patient safety and quality of care. Therefore, an ideal score should have high sensitivity and a low false alarm rate. We compared the frequency of alarm events with that of the KTAS.
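The AUC comparison above can be sketched as follows. The study reports 95% confidence intervals without specifying the method here, so the percentile bootstrap below is an assumption, shown on synthetic data in Python (the study used R).

```python
# Sketch: AUROC with a percentile-bootstrap 95% CI, the style of interval
# reported in Table 2. Synthetic data; the CI method is an assumption.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
n = 5_000
y = rng.binomial(1, 0.04, size=n)                        # ~4% mortality, invented
score = y * rng.normal(1.5, 1, n) + rng.normal(0, 1, n)  # synthetic risk score

auc = roc_auc_score(y, score)
boots = []
for _ in range(200):
    idx = rng.integers(0, n, n)                 # resample visits with replacement
    if y[idx].min() == y[idx].max():            # skip resamples with a single class
        continue
    boots.append(roc_auc_score(y[idx], score[idx]))
lo, hi = np.percentile(boots, [2.5, 97.5])
print(f"AUC {auc:.3f} (95% CI {lo:.3f}-{hi:.3f})")
```

With the low event rate, the effective bootstrap sample of positives is small, which is why the interval is noticeably wider than it would be for a balanced outcome.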