Introduction

The coronavirus disease 2019 (COVID-19) prompted an ongoing pandemic and has raised a global public health concern. More than 220 million cases have been confirmed worldwide up to September 2021 [1]. COVID-19 radiological imaging data continues to grow at a disproportionate rate compared to the number of available trained radiologists, contributing to a dramatic increase in radiologists’ workload. Therefore, an effective method is urgently needed for accurate and rapid diagnosis of COVID-19.

The World Health Organization (WHO) recommends the use of a reverse transcription-polymerase chain reaction test (RT-PCR) as the gold standard to confirm the diagnosis of a suspected COVID-19 infection [2]. Some studies have shown that chest computed tomography (CT) examination is a widely available, rapid, and under certain conditions, more sensitive than antibody or RT-PCR tests at various stages of the infection [3, 4]. Therefore, the WHO recommends the further use of chest CT to diagnose COVID-19. However, a meta-analysis suggests a low specificity in differentiating pneumonia-related lung changes due to the overlapping imaging characteristics between COVID-19 and non-COVID-19 pneumonia in chest CT [5]. Artificial intelligence (AI) is an emerging technology that can automatically extract quantitative imaging features to support radiological decision-making, increase workflow efficiency and improve clinical management [6]. Previous studies have integrated AI for COVID-19 detection, diagnosis, and prognosis using multicenter CT scans [7,8,9,10,11,12,13,14,15,16,17,18]. These existing AI studies presented a good performance in detecting COVID-19 but revealed several limitations: first, the data were mostly exclusive from China or the United States, without including larger amounts of European data [7,8,9,10,11,12, 14, 18]. Previous studies mostly include the initial chest CT of COVID-19 cases but without any specific consideration of the CT-based early, progressive, peak, and absorption stage[8, 10,11,12, 15]. This might lead to a situation where predominantly identical, morphologically typical, and thus simply recognizable COVID-19 stages were preferentially or exclusively included in the analysis and the AI algorithm development. Finally, most studies focus on the AI system’s performance and compare this with the radiologists’ performance [12, 16, 17]. Although studies included the AI result as assistance for radiologists, the benefit was only quantified in terms of increased accuracy [8, 10, 11]. Different levels of confidence in the diagnostic decisions and the individual time for diagnosis, as associated criteria determining the clinical benefit using an AI algorithm in an international multicenter dataset, including all COVID-19 stages, were not evaluated to our knowledge.

Thus, the aims of this study were (1) to develop an AI algorithm to distinguish between COVID-19 and CAP using a multicenter, multi-vendor CT imaging dataset, (2) to compare the diagnostic accuracy of the AI algorithm with the performance of human radiologists, (3) to identify potential improvements in the diagnostic performance of radiologists in terms of reporting time, confidence and accuracy by using the developed AI algorithm as additional assistance compared with the performance of the radiologists without AI assistance. In conclusion, the potential advantages of additional use of AI assistance for COVID-19 detection in everyday radiology shall be discussed.

Materials and methods

Patient population

This retrospective study received ethical approval, and informed consent was waived at all participating hospitals (Jilin: 2020-595, Wuhan: [2020]17, Ningbo: PJ-NBEY-KY-2020-194-01, Cologne: 20-1676, Frankfurt: 20-719, and Heidelberg: S-293/2020).

Inclusion criteria were as follows: (1) chest CTs with inflammatory infiltrations. (2) inclusion to the COVID-19 group occurred in case of a positive RT-PCR test within 48 h before the CT examination. (3) The inclusion criteria to the CAP group were a CT scan collected before the first outbreak of COVID-19 (from February 23, 2016, to December 31, 2019) or after the outbreak of COVID-19 (since January 7, 2020), an additional microbiological diagnosis of bacterial or virus-related pneumonia and negative COVID-19 RT-PCR test was required.

Chest CTs were characterized by inflammatory infiltrations, exclusion from lung tumors, tuberculosis, trauma, and postoperative scarred lesions. Further lung changes, which have no potential overlap with acute inflammation, were classified and recorded according to Fleischner Society guidelines (see Table S2 in the supplementary material [19, 20].

Referring to the COVID-19 guideline [21], based on the CT characteristics and the gap between CTs and symptoms onset, the COVID-19 group was divided into early (0–3 days), progressive (4–7 days), peak (8–14 days), absorption (≥15 days) stages [22]. To reflect the clinical distribution of COVID-19 (50%) and CAP (50%), a homogenous test dataset of different stages and severities per stage was used in the test dataset. Briefly, earlier COVID-19 stages (early and progressive) 30.5%, later stages (peak and absorption) 19.5%, and CAP with mild, moderate, and extensive severities including 25% viral and 25% bacterial pneumonia (Table 1).

Table 1 Imaging stages of COVID-19 and etiological classification of CAP

Data workflow for AI training and imaging analysis without and with AI support

For AI training and validation, and for non-AI-assisted and AI-assisted CT analysis, a total of 1591 chest CT scans were divided into three independent datasets (Fig. 1): (1) AI training and validation dataset: 989 patients with 991 CT scans (462 COVID-19 and 529 CAP) from three centers in China (2) independent Chinese test dataset of 300 CTs from 290 patients with 100 CTs of every Chinese center with 50 COVID-19 and 50 CAP respectively (without any duplications from the training dataset), and (3) independent German test dataset of 300 CTs from 300 patients with 100 CTs of every German center with 50 COVID-19 and 50 CAP respectively. From Chinese-German test datasets, a balanced subtest dataset of 180 CTs (90/90 China and Germany) was selected for subsequent international radiologists’ intrareader evaluation without and with AI support.

Fig. 1
figure 1

Distribution of the Chinese-German multicenter, multivendor dataset for AI classification model development and composition for un-assisted and AI-assisted imaging analysis

Development of 3D AI algorithm

An AI algorithm that can differentiate COVID-19 from CAP infection on chest CT was developed using the IntelliSpace Discovery platform (ISD, v3.0.6, Philips Healthcare, Best, The Netherlands). The classification algorithm is based on two deep learning models, a segmentation model and a classification model (Fig. 2). The lung segmentation model (LSM) is based on an encoder-decoder conventional neural network (CNN) architecture called SegNet [23]. The network is fed with individual 2D slices and the corresponding manual ground-truth lung segmentations during training. The input CT scans to the LSM were resampled to an in-plane resolution of 1 x 1 mm, resized to 512 x 512, and intensity clipped to a range of −1000 to 500 Hounsfield unit (HU). During inference, the LSM accepts the entire CT volume for each subject and generates a 3D lung segmentation (Figure S1 in supplementary material shows a schematic of the CNN during inference). The LSM was trained on 35,000 CT slices and validated on 9,500 CT slices, reaching a validation Dice-index of 0.94 ± 0.03. The 3D COVID-19 classification model (CCM) is based on the ResNet50 architecture, called COVNet [7, 24]. During training, the network is supplied with the masked lung region and the corresponding class labels (i.e., COVID-19 or CAP). The input CT scans to the CCM were resized to an in-plane size of 224 x 224, down-sampled by a factor 5 in the Z-direction, intensity clipped to a range of −1250 to 250 HU, and intensity normalized between 0 and 1. Data augmentation such as random rotation, flipping, and adding Gaussian noise to the input data during training was applied. All features calculated from the input CT slices were pooled using a max-pool operation and passed to a fully connected layer with the SoftMax activation function to produce class probabilities. During inference, on the independent Chinese-German test dataset, the CCM accepts the entire CT volume for each subject and outputs the class label (Figure S2 in supplementary material shows the schematic of the CCM). The CCM was trained and validated using the 5-fold cross-validation. The five models from the 5-fold cross-validation are simultaneously applied to the CT volume, and the class probability values are ensembled to obtain the final class label. CT scans of the training and validation dataset were evaluated using a 5-fold cross-validation strategy. Firstly, the training and validation dataset was randomly divided into five folds without overlapping samples. Secondly, five different models (models 1–5) were trained and evaluated. Lastly, the results from the five models were combined using a model ensemble strategy to obtain the final predictions on the independent Chinese-German test dataset.

Fig. 2
figure 2

Overview of the proposed CT classification workflow. A CT scan is used as input. A deep learning-based lung segmentation model segments the lungs, which are used to mask out the lung regions. A COVID-19 classification model takes as input the masked lungs and provides a classification either as COVID-19 or as CAP

Radiologist performance evaluation without and with AI assistance

Radiological image interpretation was performed in three steps: (I) radiological image evaluation of the 300 test dataset from China and Germany independent of one or the other. (II) A subtest dataset of 90 CTs consisting of 45 COVID-19 and 45 CAP cases from each country, resulting in 180 cases, was evaluated. The 180 CT scans were evaluated simultaneously by four Chinese and four German radiologists at Jilin and Cologne online. For a detailed analysis of the image evaluation performance, accuracy of diagnosis, time for evaluation, and confidence of diagnosis were tracked within the online evaluation session (see below). The entire evaluation process of the 180 CTs was performed twice, once without (IIa) and followed with AI assistance (IIb). In the subsequent step (IIb), CTs were displayed in changed order, and the interval between the evaluation sessions was set to over three months to avoid image recognition (May 17th, 2021, without AI assistance; September 15th, 2021, with AI assistance).

Therefore, in a quiet environment, the eight radiologists (D.L. and L.W. from Jilin, China and P.F. and J.K. from Cologne, Germany, as experienced radiologists, S.Y. and N.M. from Jilin, China and M.W. and M.R. from Cologne, Germany, as unexperienced radiologists, inexperienced radiologists with < 5 years of radiological training and read less than 50 COVID-19 cases, as well as experienced radiologists with ≥ 5 years of radiological training and read more than 100 COVID-19 cases) evaluated the CT images using the ISD platform through remote online network simultaneously. All radiologists reviewed the images by setting the lung window of 1600 window width and −600 window level and giving the information on the patients’ sex and age. The radiologists made their diagnostic decisions independently and not by consensus. As much time as desired could be taken for evaluation and diagnosis. The radiologists were allowed to zoom in or rotate the images. The subtest dataset was anonymized and stored in their ISD platform. Two project executors (M.F. from Jilin, China, and R.S. from Cologne, Germany) recalled and played the images and did not participate in the evaluation.

Radiologists classified the images as COVID-19 or CAP, and their self-experienced confidence and time to make a diagnostic decision. The individual confidence in the diagnosis was scored using the 5-point Likert scale (1 = very certain diagnosis, 2 = certain diagnosis, 3 = likely diagnosis, 4 = uncertain diagnosis, 5 = very uncertain “ambiguous” diagnosis), the time from the start evaluating to the final decision was measured with a digital stopwatch. The overall accuracy, sensitivity, and specificity of the diagnosis were assessed.

Statistical analysis

Demographic characteristics were compared between COVID-19 and CAP groups by using the χ 2 test for categorical variables and the t-test for continuous variables. Statistical data analysis was performed using R version 3.6.2 on Rstudio version 1.2.5033.35 (https://cran.r-project.org/). Figures were plotted using the ggplot2 package [25]. Continuous variables were reported as mean and standard deviation (SD). Statistical hypothesis testing of the non-parametric dichotomous performance data was calculated from 2 × 2 contingency tables using McNemar's test, Wilcoxon signed-rank test for the ordinal scaled data, or Student's t-test for parametric data. Intra- and interreader reliability was tested using the Fleiss' kappa (< .4 fair, < .6 moderate, < .8 substantial, > .8 excellent) for the non-parametric performance data using the irr package. Statistical significance was defined as p < .05.

Results

Patient characteristics

In total, 1591 CT scans, divided into 762 CT scans of 753 patients with present COVID-19 pneumonia and 829 CT scans of 826 patients with present CAP acquired from six centers in China and Germany between February 2016 and May 2021, were included. The 1279 patients from China included CT data from the University Hospital of Jilin (n = 358), Wuhan (n = 736), and Ningbo (n = 185). The 300 patients from Germany contained data from the University Hospitals Cologne (n = 100), Heidelberg (n = 100), and Frankfurt (n = 100). For a detailed presentation of the individual groups, see Fig. 1.

The mean age of patients in the COVID-19 group was 54.2 ± 14.3 years, with 369 male and 384 female patients. The mean age of patients in the CAP group was 60.3 ± 15.9 years, with 478 male and 348 female patients (see Table 2). The mean age of patients in the 180 cases (90 cases from each country) subtest dataset was 54.9 ± 14.4 years, with 94 male and 86 female patients. Patients in the CAP group were older than those in the COVID-19 group within the Chinese portion of the 300 cases. Furthermore, there were no significant differences in age or gender distribution between the groups. Detailed patient characteristics are shown in Table 2. Further included lung changes, which have no potential overlap with acute inflammation, were classified according to Fleischner Society guidelines [19, 20](see Table S2) and demonstrated no significant difference in frequency of occurrence between the COVID-19 group and CAP group.

Table 2 Demographic values of the patients in the COVID-19 and CAP groups

AI-based diagnosis

The diagnostic performance of the AI model in the Chinese-German test datasets (Table 3) was 76.5%, 67.9%, and 85,7% (accuracy, sensitivity, specificity). The diagnostic performance within the various datasets of the AI model did not differ significantly. The diagnostic performance was marginally below the diagnostic performance of the radiologists (without AI assistance) but, on overall average, did not significantly different from it (p > .05).

Table 3 Performance of the AI algorithm at the international datasets

Within the identical subtest dataset, the AI algorithm achieved a performance of 80.0%, 72.1%, and 87.2% (accuracy, sensitivity, specificity). Within the reference subgroup of cases with earlier COVID-19 stages (early and progressive), the performance of the AI algorithm was 78.6%, 68.3%, and 87.2% (accuracy, sensitivity, specificity). In contrast, the performance was 82.0%, 76.9%, and 87.1% (accuracy, sensitivity, specificity) in the advanced stages (peak and absorption) [22]. The AI algorithm significantly better differentiated the subgroups of later than early COVID-19 stages (p < .05).

Radiologists’ diagnosis in the test datasets

In the combined test dataset, a performance in the arbitrary classification of the different types of pneumonia (COVID-19 vs. CAP), the German-Chinese average performance was 78.9%, 72.7%, and 84.1% (accuracy, sensitivity, specificity). In the German 300 test dataset, the German radiologists achieved performance in classifying the different types of pneumonia of 86.0%, 84.4%, and 85.1% (accuracy, sensitivity, specificity). Solely in this part of the dataset, the radiologists from Germany achieved a significantly better performance than the AI (p < .05). In the Chinese 300 test dataset, a performance in classifying the different types of pneumonia of 71.9%, 60.9%, and 83.0% (accuracy, sensitivity, specificity) was achieved by the Chinese radiologists.

Radiologists’ diagnosis without AI assistance in the subtest dataset

Without AI assistance, the diagnostic performance of the eight radiologists in the Chinese-German subtest dataset with 180 cases regarding the diagnosis of either COVID-19 or CAP pneumonia resulted in 1440 decisions with a result of 79.1%, 79.0 %, and 79.2% (accuracy, sensitivity, specificity). The diagnostic performance of the radiologists from Germany was slightly better (86.3%, 88.9%, 83.6%) than the performance of their colleagues from China (71.9%, 69.2%, 74.7%). The mean evaluation time without AI assistance was 36.2 ± 22.1 s. The mean confidence score was 2.53 (see Table 4). In the present experiment, no significant overall difference in diagnostic performance between experienced and inexperienced radiologists occurred (p > .05; see Fig. 3b).

Table 4 Results of the evaluation regarding the decision CAP or COVID-19 pneumonia
Fig. 3
figure 3

a Comparison of correct diagnoses with and without AI assistance in the runs. b Different performances of each radiologist split by country and level of experience, and the AI result in the corresponding data set (2x 90 cases)

Without AI assistance, within the reference subgroup of cases with earlier COVID-19 stages (early and progressive), the performance of the eight radiologists was 79.5%, 81.9%, and 77.5% (accuracy, sensitivity, specificity). In contrast, the performance was 75.2%, 74.9%, and 75.5% (accuracy, sensitivity, specificity) in the advanced stages (peak and absorption). The eight radiologists’ evaluations without AI assistance, early stages were significantly better differentiated than late stages within the two subgroups (p < .05).

Radiologists’ diagnosis with AI assistance in the subtest dataset

With the presentation of the AI results prior to the individual diagnosis (AI assistance), better results were achieved on average with 81.5%, 71.4%, and 91.5% (accuracy, sensitivity, specificity). Also, looking at the individual radiologists from Germany (87.2%, 83.1%, 91.4%) and China (75.7%, 59.7%, 91.7%), both achieved significantly higher accuracy rates (p < .05, see Fig. 3). The average evaluation time to decision with AI assistance was 26.9 ± 10.8 s, resulting in a significantly faster time interval to diagnosis (p < .05; see Table 4, Fig. 4b). The interreader reliability between the eight radiologists within the round without AI assistance showed a relatively low kappa of 0.34 (poor), which represents a relatively heterogeneous assessment of the cases. In the round with AI assistance, the kappa increased markedly to 0.58 (moderate), indicating more homogeneous diagnostic decisions.

Fig. 4
figure 4

a Different levels of radiologists' confidence in diagnoses using the 5-point Likert scale with and without AI. b Mean time to diagnosis (seconds)

The mean confidence score was also significantly reduced to 2.20 and showed a significantly (p > .05) a higher proportion of “clear decisions” in the AI-assisted (n = 349 vs. n = 229) diagnostic round. Similarly, the proportion of "uncertain" diagnoses and “very uncertain” diagnoses was significantly reduced with AI assistance (n = 83 vs. n = 217, n = 13 vs. n = 37; see Table 4, Fig. 4a).

With AI assistance, within the reference subgroup of cases with earlier COVID-19 stages (early and progressive), the performance of the eight radiologists was 81.9%, 70.5%, and 91.5% (accuracy, sensitivity, specificity). In contrast, the performance was 81.8%, 72.0%, and 91.7% (accuracy, sensitivity, specificity) in the advanced stages (peak and absorption). In the early COVID-19 subgroup and in the later COVID-19 subgroup, the use of AI assistance significantly improved the radiologists’ performance (p < .05).

Based on the decision of the eight radiologists in the first evaluation round without AI assistance, a maximum number of eight correct readers’ decisions between COVID-19 vs. CAP was achievable per case. The cases were classified into a maximum of eight correct decisions per case in “clear cases” with six or more correct ratings from the readers. In contrast, “difficult cases” were classified as those that received fewer than six, and “ambiguous cases” were characterized as those CT scans that received less than 50% (fewer than four correct ratings). This analysis, according to the case difficulty, showed that there was a minor significant difference (p < .05) in the cases classified as “clear,” where more errors were committed with AI assistance. In the cases classified as “difficult,” there was a significant (p < .05) reduction in the number of incorrect diagnoses in the subsequent diagnosis with AI assistance. In the cases already categorized as “ambiguous,” there were numerically fewer errors. However, this finding did not prove to be significant (p > .05). Figure 5 showed examples of the cases classified correctly by AI and the result in terms of accuracy, confidence, and diagnostic time of the eight radiologists using the AI assistance.

Fig. 5
figure 5

Examples of the included cases, each with the correct AI result and the respective performance data of the radiologists' readings. a Correct AI classification. Using this result, all eight radiologists decided on the correct diagnosis, whereas without AI assistance, two were incorrect. Overall, this resulted in a faster reporting time and a higher degree of confidence. The right centrally located ground-glass opacities and beginning consolidations could have been the causative factor for the radiologists’ partially wrong result without AI assistance. b Only four of eight radiologists in the initial round without AI correctly classified images of viral non-Covid pneumonia, which could be due to the discrete, predominantly peripheral infiltrates. The AI correctly classified the image, resulting in a mainly correct diagnosis with a higher degree of confidence by the radiologists in the AI-assisted round (seven of eight now made a correct decision)

Discussion

Since the onset of the global COVID-19 pandemic in March 2020, it has been widely recognized that, in addition to conventional laboratory testing methods, radiological identification was also a robust option to diagnose and follow up COVID-19 pneumonia. CT scans of COVID-19 patients present characteristic morphological features, which several research groups exploited to develop AI-based networks for automated differentiation between COVID-19 pneumonia and pneumonia caused by other pathogens. To the best of our knowledge, this is the first study analyzing the potential improvement of AI assistance in radiologist decision-making on an international multicenter, multivendor CT dataset from China and Germany. This study provides deep insights into the advantages of AI-supported diagnosis in terms of accuracy, confidence in diagnosis, and time to diagnosis as a benefit for clinical implementation in daily routine.

Our classifier achieved excellent accuracy comparable to both similar published algorithms and the performance of human experts. Other international multicenter studies, which use publicly available data, present AI algorithms with accuracy scores of up to 92% [8, 11,12,13, 18]. It must be considered that some of these studies achieve slightly better performance rates in terms of accuracy, specificity, and sensitivity than our AI model, which may be partly caused by the heterogeneity of the included pathogen groups.

COVID-19 pneumonia exhibits different radiological morphologies during the infection stages and thus may have some overlap in appearance with other types of pneumonia, mainly non-COVID viral infections [26, 27]. Whereas previous studies did not consider the approved CT morphology defining time course, stage, and severity of the disease, which imposes a selection bias by including only the early stages of COVID-19 pneumonia with typical findings for COVID-19 [12]. In our study, COVID-19 CT scans were acquired before, during, and after hospitalization, including all stages of COVID-19, with difficult and overlapping findings, especially at the later stages of pulmonary infection. The training and validation dataset included a heterogeneous set of early-to-absorption stages images with mild to extensive severities, which might have favored misclassification in the test dataset due to a substantial morphological overlap with other viral pneumonia [26, 27]. This finding is in line with Harmon et al [13], who report examinations of different disease stages resulting in overall sensitivity rates of 75.1%, comparable to our study. Moreover, the Zhang et al [11] and Li et al [7] studies included non-pneumonia CT scans in the control cohort, where it can be assumed that these imaging features are significantly different from that of COVID-19, thus improving the accuracy score. The performance dependence of the AI algorithm on the COVID-19 stage was demonstrated in the present study by looking at the subtest datasets with earlier (early and progression) and later (peak and absorption), where the algorithm was superior to distinguish later COVID-19 stages from other types of pneumonia.

In addition to the development of an AI-based algorithm, our study incorporated eight radiologists’ evaluations of the Chinese-German 180 subtest dataset with and without the assistance of an AI-based algorithm. In these trials, it was shown that the radiologists' final diagnosis improved significantly in accuracy, time, and self-experienced confidence when the AI-based assistance was consulted. These findings are consistent with the results of Bai et al [8] in a larger international cohort. However, our data demonstrate the benefit of time and confidence in using AI assistance in diagnostic decisions. A further sub-analysis showed that the use of AI assistance is useful in the early and late COVID-19 stages. However, it seems that its use seems particularly useful in the later stage group, especially due to the poorer differentiation performance of radiologists within the later stages and the superior performance of the AI algorithm. Moreover, in additional analysis, we demonstrated that AI assistance was most valuable in difficult-to-categorize cases and improved the radiologists’ diagnosis.

In routine clinical work, it is difficult for radiologists to make a dichotomic diagnosis of COVID-19 or CAP based on CT features alone, often with different confidence levels [28, 29]. Herein, AI-assisted diagnosis improved both speed and confidence in the diagnostic process. Therefore, the presented AI model can enhance clinical efficiency, which is crucial during the overload of the healthcare system in the current pandemic. Routine AI-assisted reporting could be particularly useful in sudden COVID-19 disease peaks in the winter months. At the same time, an AI assistance system may be an important tool to assist a generation of younger radiologists in further diagnosis who do not have the experience of the first major pandemic episodes. In this sense, AI assistance is also a way to provide the image-based knowledge gathered in earlier stages of the global pandemic for potentially further COVID-19 pandemic episodes.

This study found no significant difference between inexperienced and experienced readers regarding the measured performance data, comparable to Giannakis et al [30]. This study suggests that experienced and inexperienced readers can benefit from using AI assistance in significantly faster reporting times, higher diagnostic accuracy, and higher confidence.

To summarize the strengths of our study, we developed a deep learning model using three Chinese centers, including nearly 1000 patients. We tested the algorithm using an independent Chinese-German test dataset of 600 patients. To the best of our knowledge, the usage of the Chinese-German test dataset is unique compared to previous studies. In addition, in this study, a large balanced, and clinically relevant CT dataset of varying pneumonia was used, including all stages of COVID-19 disease. Finally, this is the first large international multicenter study to demonstrate the benefits of using AI-based assistance for daily radiology usage during the global COVID-19 pandemic.

Limitations of the present study include the retrospective design and the sole inclusion of patients with “CT positive” pneumonia, which leads to a selection bias that only patients with visible acute or subacute pulmonary inflammation are included. In contrast, patients with negative CT screening but positive RT-PCR were not included. However, this also supports this work because only in CT with pneumonic infiltrations the radiologist will demand AI support. In addition, we divided COVID-19 into different stages according to the time based on the onset of symptoms and CT image manifestations. However, in detail, we did not categorize them into typical, atypical, and intermediate types. In addition, it is to be mentioned that our AI algorithm and the subsequent experiment regarding the radiologists’ decisions only perform dichotomous decisions between COVID-19 and CAP. Other possible differential diagnoses, e.g., malignant diseases, are not considered. Hence, our proposed method is limited to the clinical context of infectious lung disease. The clinical setting and laboratory examination might be included in additional AI algorithms to improve the classifier’s performance. Furthermore, follow-up studies should address the cases identified herein in which AI assistance is particularly useful. Thus, a follow-up study should clarify to what extent the benefit of AI assistance depends on the causative pathogen, the extent of the affected lung volume in combination with the detailed COVID-19 stage, or the individual confidence level of the radiologist. In addition, a corresponding AI should be integrated into the clinical routine, and the benefit in everyday practice should be recorded within the framework of a prospective clinical evaluation. In this context, it would be useful to conduct a large-scale study to determine user satisfaction using AI assistance in daily radiology practice.

In conclusion, we developed an AI-based diagnostic tool to classify CT datasets as cases of COVID-19 or CAP. Our classifier yielded good accuracy. We demonstrated that using this AI as an assistance system leads to higher diagnostic accuracy while decisions were made faster and more confidently than in a solely radiologist-based evaluation. The present study shows for the first time that the use of AI algorithms as an assistance system can significantly improve radiologists’ performance in the diagnostic evaluation of COVID-19 CTs in times of an ongoing global pandemic.