FormalPara Highlights
  • We are the first to evaluate an existing triage algorithm for clinical pathways applied in German hospitals based on a unique German multicenter dataset.

  • We propose analytics and AI-based extensions which improve performance metrics compared to those of the existing algorithm.

  • We explicitly include the explainable AI discussion in literature and integrate explainable and easy-to-apply new approaches, as well.

  • We study the influence of varying AI and non-AI data preparation strategies.

1 Introduction

The Covid-19 pandemic has pushed many hospitals to their capacity limits. Therefore, triage of patients has been discussed controversially primarily through an ethical perspective (see, e.g., [28] or [30]. Even though the term triage seems to have taken on a weighty meaning with the pandemic, it is still not new and triage algorithms have been used for a long time like in the emergency department or for mass casualty incidents. Triage within mass casualty incidents is about saving as many patients as possible with limited resources outside the hospital [21, 37, 38]. For emergency departments, the task is on classifying arriving patients due to urgency of treatment whereby scarce resources play a subordinate role [17].

In Germany, physicians have not been forced to decide in view of scarce resources during the Covid-19 pandemic so far. But in general, Covid-19 triage with limited personnel and ventilation resources in hospitals integrates both approaches, i.e., emergency department and mass casualty incidents triage, and contains many aspects, such as urgency of treatment (e.g., [46], testing (e.g., [10]), severity of the disease (e.g., [11], access to critical care (e.g., [44] or the classification of patients regarding clinical pathways [39]. The classification of patients regarding clinical pathways determines ward, Intensive Care Unit (ICU), and outpatients starting from the emergency department. Although this classification is highly important for patient care and capacity planning in hospitals, it is hardly discussed in literature (see, e.g., reviews by [31, 47] or [1]. Regarding Covid-19 diagnosis (e.g., [48]), prognosis (e.g., [2] or [8] or [7], scores (e.g., [24] or [34], severity (e.g., [29] or mortality (e.g., [40], plenty of research has been proposed with a strong focus on Artificial Intelligence (AI) approaches. Symptoms, vital signs, medical imaging techniques, risk factors, blood counts or a combination of the categories are among the most integrated input parameters for the predictions (e.g., [5, 14, 35, 46, 47, 49]. The focus here is often on data-driven training and evaluation of standard models, without considering the actual application and transparency. In addition, it is noticeable that a broad data basis and the validation are usually lacking [47].

In this work, we evaluate the performance of triage algorithms for the classification of patients regarding clinical pathways based on a multicenter dataset with more than 4,000 Covid-19 patients of the Lean European Open Survey on COVID-19 Patients (LEOSS) registry. Compared with previous work, the size of our dataset significantly exceeds current literature [1]. The decision tree proposed by Pin et al. [39] is suggested by the German Society for Interdisciplinary Emergency and Acute Medicine (DGINA) to be considered as a guideline and applied in emergency departments in Germany, e.g., in the University Hospital of Augsburg. The results on the decision tree by Pin et al. [39] serve as a benchmark for our data-driven, AI and interactive human-AI extensions. Besides a base classifier regarding outpatient, ward, and ICU care, we research a hypothetical extension with outpatient, ward, ICU, and palliative care (i.e., death), to juxtapose data-induced and ethical considerations. As data issues arise in such settings, we study the influence of varying AI and non-AI data preparation strategies as well. We thus aim to close the validation and the application gap on a broad data basis for predicting the clinical course of incoming patients, which has not been in the focus of Covid-19 triage researchers so far. In addition, we take up the broad, ethical, and explainable AI (XAI) discussion in literature (see, e.g., [3] and present the performance of a human-AI interaction on the classification problem. We find significant potential of Covid-19 triage in the emergency department regarding accuracy, sensitivity, and other performance metrics. Comparing AI methods with the human-AI interaction, the human-AI approach shows similar performance in general and is significantly better at classifying ICU patients. An additional label palliative care does not improve the outcome, which is an important finding for the ethical discussion on Covid-19 triage.

Our work is structured as follows. In Section 2, we discuss the definitions and the literature which lay the basis for our methodology. Section 3 describes the data preparation process, the base triage algorithm, its data-driven extension, our AI systems, and performance metrics. Section 4 provides the results for both, a basic pathway classifier involving three labels (outpatient, ward, ICU) and an extended version with four labels (outpatient, ward, ICU, palliative care). The results are critically discussed in Section 5. Section 6 presents concluding remarks.

2 Related definitions and literature

The healthcare sector faces substantial challenges such as staff shortages and increasing treatments, for which advances in digitalization are generally known as a possible solution. The Covid-19 pandemic has aggravated the problem of staff shortages. Artificial Intelligence and Machine Learning are an important base for digitalization in healthcare. Often and in this work as well the terms are used synonymously, but in fact Machine Learning is defined as a part of Artificial Intelligence [19]. While Artificial Intelligence focuses on autonomous algorithmic decisions in general, Machine Learning denotes a machine autonomously learning from data. There exist supervised and unsupervised Machine Learning methods. Unsupervised methods aim at clustering of unlabeled data and supervised Machine Learning methods focus on classification and regression problems for labeled data.

Machine Learning methods such as decision tree, Multilayer Perceptron (MLP), Extreme Gradient Boosting (XGB) or Random Forest (RF) are attributed to the category of predictive analytics. Predictive analytics summarize different approaches for event prediction. Descriptive analytics summarize different statistical approaches for the descriptive and retrospective analysis of data. Prescriptive analytics aim at prospective decision support based on statistical and mathematical programming techniques [32].

We use analytics as a general term for mathematical and statistical methods with the aim to learn from data and focus on a classification problem in a Covid-19 setting. The (meta-) pathway of incoming Covid-19 patients starting from the emergency department is to be determined. Patients are assigned to the ward unit, the intensive care unit (ICU), the palliative care unit (PCU) or are discharged, i.e., outpatient, from the hospital. Our four different (meta-) pathways are defined as follows: (1) ED \(\to\) Discharge, (2) ED \(\to\) Ward, (3) ED \(\to\) ICU, and (4) ED \(\to\) PCU. By the determination of the pathway, an incoming Covid-19 patient is triaged with respect to the subsequent place of treatment. In times of digitalization in healthcare, the question is not only how analytics influence the decision-making process, but also the question remains as to who is making the actual triage decision of patients. The triage can be done autonomously by a physician experienced in Covid-19 care, i.e., human approach, autonomously by a supervised Machine Learning algorithm trained with relevant data, i.e., AI approach, or any interactive version of the options, i.e., interactive human-AI approach (see detailed definition on human-AI interaction below).

As the concept of triage itself raises ethical concerns because patients are grouped with potential consequences for their well-being, so does AI-based decision support. Various requirements for an ethical AI have been elaborated in literature (see, e.g., [36, 45] or [6]. The requirements include, among others, the autonomy of physicians and patients or a certain transparency of the methods. The definition of this transparency of AI methods is controversial in the literature stream on Explainable Artificial Intelligence (XAI). Arrieta et al. [3] define transparency, i.e., interpretability, as an intrinsic characteristic of a model. For example, decision trees are referred to transparent methods, because the structure and decision-making process immediately become visually clear to the user. According to Arrieta et al. [3], explainability is an extrinsic characteristic of a model. Multilayer Perceptron or Random Forest, for instance, are non-transparent models with a certain potential for explainability, because simplification techniques or feature importance analyses might contribute to explainability for the user. Understandability is to be distinguished from transparency and explainability according to Arrieta et al. [3]. An algorithm is defined to be understandable, if and only if the algorithmic decision is understandable. The major goal of XAI is trustworthiness in the AI-based models, which is a basic prerequisite, among technological concerns, for the actual use of the techniques in healthcare institutions. Fuhrman et al. [18] apply a similar distinction of explainability and transparency, i.e., interpretability, in their review on AI-assisted medical imaging in Covid-19 settings, while Tjoa and Guan [41] use the terms explainability and transparency synonymously. Tjoa and Guan [41] do not distinguish between intrinsic and extrinsic characteristics but concentrate on the efforts to make the algorithmic decision transparent to the user. In this work, we differentiate between explainability and transparency, i.e., interpretability, as Arrieta et al. [3] or Fuhrman et al. [18] do. The term XAI is used as a general term defining the research stream of ensuring trustworthy AI-based decisions.

Human-AI interaction might contribute to XAI in healthcare [22]. Van Berkel et al. [42] generally “define human-AI interaction as the completion of a user’s task with the help of AI support […]” and describe the wide variety of different human-AI interactions with respect to the initiator of the interaction, the timepoints of the interaction in the decision-making process, and the user’s reaction. For example, an AI-based clinical decision support system might suggest a certain classification of a patient. The interaction might be the system’s output which is the basis for the classification of the patient by a physician or consecutive decision-making depending on the predicted outcome or any other interactive decision-making process.

Not only transparency and interaction influence the actual application of decision support in hospitals, but also implementation issues and usability. As there are different advances in many countries regarding digitalization in hospitals, e.g., the Hospital Future Act in Germany,Footnote 1 the broad implementation of digital decision support tools will be made possible in the near future. Reviews on the usability of mobile apps and mobile health apps can be found in Harrison et al. [20] and Azad-Khaneghah et al. [4]. Usability is mainly determined by “Effectiveness”, “Efficiency” and “Satisfaction” [23] of the application and is strongly associated with the performance, transparency, and implementation of the algorithm, consequently.

We assess the influence of the decision maker, the analytics-based definition, and the transparency of the decision-making process on the accuracy of Covid-19 triage in the emergency department. To evaluate the influence, four different approaches are examined: the base triage algorithm (TA) proposed by Pin et al. [39], an analytics-based extended version of the base algorithm (TAE), AI-based algorithms, and an integrated triage algorithm (ITA). The four approaches vary in the decision maker, the definition, and the transparency of the decision-making process (see Table 1). For the AI-based algorithms, we apply Multilayer Perceptron (MLP), Extreme Gradient Boosting (XGB), and Random Forest (RF). We take a data driven retrospective perspective which lays the basis for a prospective evaluation, and the ethical discussion about AI-based decision support for Covid-19 triage in the emergency department. In addition, we aim at a contribution to the discussion on the ethics of triage by evaluating the flexible inclusion of a palliative care label in some algorithms.

Table 1 Comparison of the different approaches for Covid-19 triage in the emergency department

3 Methods

3.1 Data processing

Our study is based on a LEOSS data export with 4,310 Covid-19 patients and 190 columns (i.e., features) from January 2021. Thus, our study captures the first and second pandemic wave in Europe (March 18, 2020, with January 7, 2021). The Lean European Open Survey on SARS-CoV‑2 infected Patients project is a prospective European multi-center cohort study that enables retrospective analyses on a broad basis [26]. We consider LEOSS baseline data due to our interest in parameters collected at an early stage of infection. In the LEOSS protocol, diagnosis is confirmed via PCR or rapid tests as an acceptable alternative. To ensure anonymity in all steps of the analysis process, an individual LEOSS Scientific Use File was created, which is based on the LEOSS Public Use File principles described in Jakob et al. [25]. The study was conducted in accordance with the Declaration of Helsinki Ethical Principles and Good Clinical Practices and was reported to the local Ethics Committee.

The raw data contains demographical features, blood counts, vital signs, Covid-19 related symptoms, comorbidities, medical imaging outcomes and the clinical (meta-) pathway of the patients. First, the raw data was cleaned up regarding incorrect entries. Data preparation for the remaining data set is based on feature importance (e.g., vital signs and laboratory parameters) or commonly known methodologies (percentage of blank rows). Since statistical guidelines recommend using data with more than 40 percent missing entries solely as hypothesis generation, these columns are removed beforehand (e.g., [16, 27]. Furthermore, vital signs and laboratory parameters have a high impact on the course of Covid-19 disease, which is why at least two values of them must be filled in. In general, missing values are a common problem in healthcare. In order not to ignore any highly relevant features, the removed features were discussed with experts. In addition, the remaining missing values need to be filled since not all machine learning algorithms and oversampling techniques (see below) are able to handle missing values. The methods used for filling in empty values include a simple imputer and two iterative machine learning imputers (Random Forest and Multi-Layer-Perceptron algorithms). Following the creation of the three different datasets by filling in the empty values, the comorbidities are summarized. This is a common procedure in the Covid-19 literature to reduce complexity while retaining important information. There are two different variants for the summary of comorbidities, namely the sum of the comorbidities and the Charlson Comorbidity Index [13].

The different data preparations were divided into feature and label matrices. Our label definition leads to two different formats, as we distinguish between three (3) and four (4) (meta-) pathways in the following: The base case with ED \(\to\) Discharge (i.e., outpatient), ED \(\to\) Ward (i.e., ward), and ED \(\to\) ICU (i.e., ICU), may be extended by a palliative care label (ED \(\to\) PCU) which has been incorporated in the base case labels before. All patients who were in the ICU (or Intermediate Care, IMC) during their hospital stay were assigned to the ED \(\to\) ICU pathway (i.e., ICU), all other inpatients to the ED \(\to\) Ward (i.e., ward), pathway, and the rest to the ED \(\to\) Discharge (i.e., outpatient) pathway. In the four-label classification, all deceased patients were assigned to the ED \(\to\) PCU (i.e., palliative care) pathway. Please find a detailed description of our data preparation and label definition in Supplementary Fig. 1.

Depending on the imputer, the summary of comorbidities, and the label definition, we define twelve different input data sets with 3,543 patients and 58 features each: six for each of the two different classifications with three or four labels, whereby three different imputers (Simple Imputer, RF, MLP) and two different summaries of comorbidities (Sum, CCI) are applied. Table 2 provides an overview of the twelve different input data sets. Table 3 lists the 58 different features per input data set.

Table 2 Description of the 12 different input data sets (RF: Random Forest, MLP: Multiple Layer Perceptron, CCI: Charlson Comorbidity Index)
Table 3 Description of the 58 features in the input data sets (CT: Computer tomography, CCI: Charlson Comorbidity Index)

To avoid overfitting throughout our study, we used ten-fold cross validation. Each input data set is randomly split into ten different folds, while every subset is subsequently defined as test data set with a training and testing ratio of 90% and 10%. Performance is then measured based on the metrics for the different test data sets.

3.2 Base triage algorithm and data-based extension

Figure 1 illustrates the base triage algorithm (TA) for clinical pathways of Covid-19 subjects [39] which is considered as a guideline for emergency departments in Germany as suggested by DGINA. TA is constructed as an easy-to-understand and simply applicable decision tree. After examining classical Covid-19 symptoms (such as dry cough and vital signs), the overall clinical presentation are evaluated. Step three involves blood counts and medical imaging. Finally, the results of all steps of the algorithm are reviewed in their entirety and the patient is classified as outpatient (i.e., discharge), ward, or ICU. Other than for ED \(\to\) Ward and ED \(\to\) ICU, a physician can also classify the pathway ED \(\to\) Discharge based on steps one to three. Note, TA only involves three classification labels with outpatient (i.e., discharge), ward, or ICU.

Fig. 1
figure 1

Base triage algorithm (TA) according to Pin et al. [39]

Our extended version of the base algorithm (TAE) flexibly considers either three labels or an optional fourth classification label (i.e., palliative care), while TAE always builds upon alternative analytics-based first and final steps (see Fig. 2). The new first step in TAE avoids the discharge of patients (i.e., ED \(\to\) Discharge) in the first step of the algorithm. Due to a significant number of asymptomatic patients in the data, the finding of a symptom-free status may not be sufficient to classify the patient as an outpatient. Therefore, in contrast to TA, patients arriving in the emergency department always have their vital signs and clinic checked after symptoms were reviewed. The new final step, namely the calculation of the TAE score, is based on findings of abnormalities and risk factors for severe Covid-19 progression in literature (e.g., [15, 33, 43]). Compared to the TA, the TAE score includes the severity of an anomaly. For example, a distinction is made among the laboratory values as to whether a patient's temperature is only elevated or high. Together, these form a TAE score to classify patients with high accuracy (see Table 4). Both implemented changes compared to the TA are highlighted with yellow boxes in Fig. 2.

Fig. 2
figure 2

Extended triage algorithm (TAE). Yellow boxes highlight the differences compared to TA (see Fig. 1). TAE Scores for laboratory values, vital signs, demographic values, and comorbidities are shown in Table 4

Table 4 TAE Scores for lab, vital, demographics, comorbidities

3.3 AI and human-AI systems

We focus a classification modeling problem and thus apply a Multi-Layer Perceptron (MLP), a Random Forest (RF) and an Extreme Gradient Boosting (XGB) classifier to the data. The MLP is characterized by a multi-layer neural network structure consisting of an input layer, several hidden layers, and an output layer. The RF consolidates the predictions of different decision trees based on a majority decision. The XGB algorithm is also constructed from decision trees by an ensemble or boosting idea. Note that these AI approaches, while generating an autonomous classifier, are of a black-box style and do not, other than a (simple) decision tree, meet the transparency requirements by Arrieta et al. [3].

In addition to the existing decision tree by Pin et al. ([39], TA) and the machine learning methods (MLP, RF, XGB), we investigate the potential of integrating both approaches in a two-step process: integrated triage algorithm (ITA). An AI-based autonomous pre-triage is made before the physician starts the actual triage of patients by means of a data-guided decision tree based on the ITA scores given in Table 5. The literature-based TAE scores (see Table 4) are incorporated into the recalculated ITA scores (see Table 5). In contrast to the TAE scores, scores for the different clinical pathways are formed in the ITA score. The calculation of the scores is based on feature importance, detailed data analytics, and discussions with experts. In the ITA algorithm, first, sequential MLP and XGB algorithms filter ICU patients and outpatients (i.e., discharge) based on the accurate prediction. Second, based on a white-box decision tree and the ITA scores, the remaining patients are classified as ICU, ward, or outpatient. By combining both approaches, we aim at the evaluation of a human-AI interactive algorithm, with autonomous black-box and white-box components. The autonomous pre-triage component (i.e., the black box model) saves working time of medical staff that has become scarce during the pandemic, while the second component (i.e., the white-box model) contributes to transparency. The two-step process, in addition, is of a human-AI interactive type because the pre-triage’s output is the basis for the classification of the patient by a physician. Fig. 3 presents our human-AI ITA algorithm.

Table 5 ITA Scores for ICU, ward, outpatient (TAE Scores are shown in Table 4)
Fig. 3
figure 3

Integrated triage algorithm (ITA). ITA Scores for ICU, ward, and outpatient are shown in Table 5

3.4 Performance measurement

We measure and compare the algorithms’ ability to correctly predict the patient (meta-) pathway in terms of outpatient (i.e., discharge), ward, ICU, and palliative care, by accuracy, sensitivity (i.e., recall), specificity, F1-score, precision, and the area under the receiver operating characteristic (ROC AUC).

While accuracy provides information on the correctly classified patients, precision focuses the true positive results divided by the positively classified. F1-score and ROC AUC incorporate either precision and sensitivity or specificity and sensitivity. The reported metrics are based on a ten-fold cross validation, hyperparameter tuning and the Synthetic Minority Oversampling Technique (SMOTE) to meet the problem of imbalanced data. SMOTE fills in the underrepresented classes in the data set by a resampling mode. Particularly in the case of multiclass classification, SMOTE achieves good results with respect to imbalanced data [9]. Hyperparameter tuning is a preprocessing optimization technique to the actual optimization of, for example, weights in a multi-layer neural network and defines hyperparameters such as the learning rate. Please note there exist different forms of AUC depending on the characteristics of the data set. Since our data set is balanced by SMOTE, we consider ROC AUC. However, in the case of imbalanced data it may be more appropriate to use a form of partial AUC as suggested by Carrington et al. [12]. A simple dummy classifier (DC) randomly classifying subjects as outpatient, ward, ICU, and palliative care patients with equal probability serves as a benchmark for the different classifiers.

4 Results

In total, 3,543 Covid-19 patients are included in our study. Table 6 provides an overview on data availability, important demographic, and clinical characteristics of the patients. Most patients in the data set are over the age of 56 years old, male, and suffer from one cardiovascular disease at least. Fever is the most frequent classical Covid-19 symptom, followed by dry cough and dyspnea. Gamma-glutamyl transferase (GGT) and Lactat-dehydrogenase (LDH) are frequently increased in the patients. The distribution of labels is characterized by the fact that most patients in the data set remain in ward (see Table 7). Few patients are discharged from the hospital upon presentation in an emergency department.

Table 6 Demographic and clinical values at admission of COVID-19 patients
Table 7 Label distribution

4.1 Outpatient, ward, and ICU classifier: three labels

In this section, we compare the base triage algorithm (TA), it’s extension (TAE), the dummy classifier (DC), the three machine learning techniques (MLP, RF, and XGB) and the integrated triage algorithm (ITA) for the base classifier task with three labels, outpatient, ward, and ICU. The overall accuracy ranges between 27% for TA, approx. 51% for TAE, approx. 73% for ITA, and up to 78% for the machine learning techniques (see Fig. 4). By comparison, the DC achieves only 4% total accuracy and a 50% ROC AUC. The machine learning algorithms obtained a significantly higher ROC AUC (between 76 and 88%, see Fig. 4). Differences are more in the labels than in the AI methods. The TA demonstrates high sensitivity for the outpatient class (up to 84%) but shows poor performance in classifying ward patients (approx. 15%). The TAE demonstrates a better performance regarding ward patients (up to 54%), while sensitivity in terms of the ward class is highest for the machine learning techniques (up to 94%). Regarding the ICU class, sensitivities vary from 43% (TAE) to 72% (ITA, see Fig. 5). While precision of TA varies significantly for the three labels (4% vs. 82%), the performance of MLP, RF and XGB is rather balanced here. Observing the specificity, it is noticeable that especially the ITA evokes rather balanced values between 63 and 98% compared to the ML algorithms (33% vs. 100%). The AI and human-AI methods consistently obtain higher F1-scores than the TA and TAE techniques. A detailed summary of the performance metrics provides Supplementary Table 1. A radar chart for a visual comparison of performance metrics for the three labels is provided in Fig. 4. The radar chart underlines the results of a poor performance of TA compared with the AI-based algorithms in all metrics. In addition, the significant improvement of the sensitivity for the ICU label and the ITA is illustrated.

Fig. 4
figure 4

Comparison of the algorithms based on the accuracy (upper), ROC AUC (middle) and radar charts (lower) for data sets with 3 labels (left hand side) and four labels (right hand side). The respective boxplot represents the distribution of accuracy for the different data preparations. Both radar charts compare sensitivities, precision, and accuracies of the different algorithms. On the left-hand side, the XGB is used for all machine learning models, because of the similar performance

Fig. 5
figure 5

Comparison of the algorithms based on the recall of ICU. The respective boxplot represents the distribution of recalls for the different data preparations with 3 labels

In the synopsis of the results, AI and human-AI methods in most metrics outperform TA and TAE. Comparing the three machine learning classifiers (i.e., MLP, RF and XGB), XGB, a MLP imputer and the Charlson-Comorbidity Index for grouping comorbidities should be preferred. However, data processing has a minor influence on the performance metrics, overall. The confusion matrix and ROC AUC for the preferred XGB algorithm with three labels, a MLP imputer, and the Charlson-Comorbidity Index (i.e., 3MC data set) are presented in Supplementary Fig. 2.

4.2 Outpatient, ward, ICU, and palliative care classifier: four labels

In case of four labels (i.e., outpatient, ward, ICU, palliative care), we compare the TAE, the DC, and the three machine learning techniques (MLP, RF and XGB). The overall accuracy decreases for TAE, MLP, RF, and XGB (see Fig. 4). Nonetheless, the basic statement remains that a significant improvement is achieved here by machine learning techniques. The ROC AUCs of the machine learning algorithms (i.e., MLP, RF, XGB) consistently show much better performance than the DC and vary between 70 and 90% (see Fig. 4). The introduction of the new class palliative care leads to a crucial decrease of sensitivity for the ICU class (varying between 7 and 31%), while sensitivities for the outpatient (i.e., discharge) and ward classes remain almost unchanged. Specificity for the ward class deteriorates for almost all algorithms, but remains constant for the outpatient (i.e., discharge) class and increases for the ICU class. In addition, ROC AUC providing an integrated view on sensitivity and specificity remains at a high level. The new class palliative care obtains a sensitivity score from 34 to 53%. Regarding precision, F1-scores and an algorithm preferred, the interpretations of Section 4.1 remain unchanged.

A detailed summary of the performance metrics is provided in Supplementary Table 2. A radar chart for a visual comparison of the different performance metrics discussed before is provided in Fig. 4. The confusion matrix and ROC AUC for the preferred XGB algorithm with four labels, an iterative Random Forest imputer, and the Charlson-Comorbidity Index (i.e., 4RC data processing) are presented in Supplementary Fig. 3.

5 Discussion

Taking the different metrics into consideration, the performance of the base triage algorithm (TA) which is suggested as a guideline in Germany is significantly improved by an analytics-based adaptation: the extended triage algorithm (TAE). The AI-based algorithms and the integrated human-AI algorithm (ITA) perform similar, but significantly superior compared to the base triage algorithm (TA) and the extended triage algorithm (TAE). A major advantage of the integrated human-AI algorithm (ITA) is the high sensitivity with respect to the ICU category. The sensitivity for the ICU class is particularly important because especially ICU capacities have become scarce during the Covid-19 pandemic and the correct classification of critical care patients directly influences their well-being.

We find the human-AI interactive algorithm and the AI-based algorithms for superior performance. As the algorithms directly influence patients and medical staff in the emergency department not only a data-driven, but also ethical, usability, and implementation perspective are considered. Ethical considerations are mainly driven by the autonomy of the decision maker and the transparency of the algorithm, a basic characteristic in the XAI definition (see, e.g., [36, 45] or [6]). Human-AI interaction also contributes to XAI in healthcare [22]. The AI-based algorithms are non-transparent black-box models whereas the base triage algorithm and the extended triage algorithm (both being decision trees) are classified as transparent white-box models. The human-AI interactive model integrates both ideas and is partially transparent. Other than for the AI-based models, the physician, i.e., human approach, is the decision maker for the base and the extended triage algorithm. In case of human-AI algorithm, the decision is made interactively by the machine and the human being in a two-step approach.

Regarding usability and implementation, the decision trees, i.e., base, and extended triage algorithms, are preferable because decision support can already be provided through an easy-to-understand figure. For the AI-based and human-AI algorithms, elaborate implementation, and an interface to the hospital information system are essential. As there are different advances in many countries regarding digitalization in hospitals, e.g., the Hospital Future Act in Germany, the broad implementation of digital decision support tools will be made possible in near future.

The integrated human-AI algorithm performs similarly to the AI-based methods, but elucidates a higher sensitivity regarding the ICU category, it is partially transparent, and integrates the machine and the human being as decision makers. As implementation issues will be solved soon, the human-AI interactive algorithm is preferable. This result is not influenced by the distinct data preparation proceeding. The consideration of the pathway palliative care, which is controversial in Covid-19 triage (see, e.g., [28], is to be avoided from our data-driven perspective, and not only from ethical considerations. This conclusion is of particular importance in times of high ICU capacity utilization.

Our study builds upon an existing triage algorithm, a data set with more than 4,000 Covid-19 patients, and AI techniques. Due to the nature of the LEOSS registry, inpatients are significantly overrepresented, so the algorithm should not be applied to ambulatory care settings outside an emergency department. Limitations include the data quality due to missing values. By filling the data using the most frequent value, i.e., the simple imputer, a rather inaccurate approximation is assumed. Imputation using machine learning methods (RF, MLP) is more accurate in terms of the optimal solution, but the stopping criterion is not reached in certain cases. This can be attributed to the number of missing values. In addition, the LEOSS data builds upon predefined ranges regarding the categorization of demographic data and other parameters such as the blood counts. Thus, the scores, e.g., the CCI, are applied via an approximation, because the LEOSS ranges do not exactly match those of the respective scores.

In addition, the LEOSS dataset represents a European sample of infected individuals with a strong focus on German health care institutions. Varying prevalence rates, possible mutations or hygiene conditions in other countries could influence the result. Consequently, the results are assumed to be a representation of emergency departments in other European and developed countries in a comparable state of the pandemic, but further data is necessary to validate the algorithms for varying courses of the pandemic and emergency departments in non-developed countries. The algorithms concentrate on a specific emergency department setting, i.e., the classification of Covid-19 patients, but there is a certain ability to apply the algorithms to other emergency department settings, such as the classification of patients with viral infections in general, e.g., flu. The base triage algorithm was suggested during the first pandemic wave as a guideline in Germany, and we use the LEOSS data output at a rather early stage of the pandemic. Consequently, there might exist interdependencies, i.e., the outcomes in part of the LEOSS data could be influenced by the triaged outcomes using the base triage algorithm. On the other hand, based on LEOSS, we use the realized highest care unit of treatment, e.g., ICU, of each patient as ground truth label which is not necessarily defined based on the base triage algorithm.

6 Conclusion

In this work, we evaluate the performance of Covid-19 triage algorithms in the emergency department and discuss the potential of integrating analytics, AI, XAI and human-AI interaction in detail. The results are based on a dataset with more than 4,000 PCR confirmed SARS-CoV-2 infected patients. Compared with existing papers, the size of our dataset significantly exceeds current literature.

We find that data-driven manipulation of the existing human-made base triage algorithm can improve the classification, but AI adaptations promise a superior performance. Comparing the AI methods with an integrated human-AI method, the algorithms are comparable in many performance metrics. But based on ethical AI considerations in terms of transparency, we suggest the use of the integrated human-AI algorithm. The data preparation process plays a subordinate role for the performance of the algorithms. The hypothetical consideration of the (meta-) pathway palliative care might be excluded from our data perspective for times when enough ICU beds are available. This finding is important for the ethical dimension on the broad triage discussion.

Our data-driven retrospective perspective lays the basis for a prospective evaluation of the human-AI algorithm and behavioral analyses in future research. Aspects such as information asymmetry in between humans and machines can be studied on using experiments in the emergency department.