Introduction

On December 31, 2019, a novel coronavirus that probably originated in Wuhan, Hubei Province, China, was reported to the World Health Organization (WHO) [1,2,3]. Subsequently, WHO declared the novel coronavirus outbreak a Public Health Emergency of International Concern (PHEIC) on January 30, 2020. According to the situation report of March 4, 2020, this largest and most widespread outbreak of the novel coronavirus had resulted in a total of 90,870 confirmed cases of coronavirus disease 2019 (COVID-19) and 3112 deaths. Spread outside China accounted for 10,566 of these confirmed cases across 72 countries, with 166 deaths [4, 5]. To date, COVID-19 has led to more deaths than SARS and MERS combined, despite its relatively lower mortality rate [6]. The rapid spread of COVID-19 resulted from human-to-human transmission [7, 8]. During any outbreak, prompt recognition and patient quarantine play a vital role in containing the threat. Because this is a newly discovered virus, the range of effective diagnostic tools remains narrow. Real-time reverse transcription polymerase chain reaction (RT-PCR) testing is partially restricted by insufficient testing kits, a delayed testing cycle, and questionable extraction technique. CT is therefore expected to be applied in the initial screening of suspected patients to accelerate definitive diagnosis, especially with the emergence of artificial intelligence (AI) techniques [9,10,11,12].

In the last few years, AI in health care has been widely suggested as an important tool to guide disease detection and clinical decisions [13, 14]. Notably, AI has been reported to work effectively in the current epidemic for outbreak prediction: a Canadian company (BlueDot) successfully reported the location of this outbreak in late December 2019. In addition, AI is used to aid image analysis in order to distinguish COVID-19 pneumonia from other benign respiratory illnesses [15]. Some ongoing AI-based work is also attempting to find new ways to control the spread of COVID-19 and to eliminate or reduce the threats from the epidemic. Despite the current success in outbreak prediction and COVID-19 recognition, no exploration of AI for the accurate assessment of COVID-19 pneumonia has been formally reported. Thus, the purpose of our study was to automatically detect and quantitatively analyze pneumonia lesions in chest CT images from patients diagnosed with COVID-19.

Methods

Study population

We used a commercially available deep learning algorithm (Deepwise & League of PhD Technology Co. Ltd.) [16] that had previously been trained and validated on 19,291 CT scans from 14,435 patients collected from seven hospitals in China (mean age 40.9 ± 0.9; 51% male, 49% female). The inclusion criteria were (1) CT images with slice thickness ≤ 2 mm and (2) patients diagnosed with pneumonia or healthy participants. The exclusion criteria were (1) a history of pulmonary surgery; (2) CT images diagnosed as infection other than pneumonia, such as pulmonary tuberculosis; and (3) CT images of poor quality, e.g., with heavy breathing artifacts or metal artifacts. Among the 14,435 patients, 2154 were diagnosed with COVID-19 by pathogenic testing, while 5847 were diagnosed with other types of pneumonia (bacterial pneumonia, fungal pneumonia, and other viral pneumonia).

The algorithm was tested on a non-overlapping set of 96 consecutive patients from 3 hospitals, enrolled from January 20, 2020, to February 10, 2020, who were diagnosed with COVID-19 by RT-PCR testing of respiratory secretions obtained from nasopharyngeal or oropharyngeal swabs (84 patients from Taihe Hospital, Shiyan, Hubei; 11 from Wuhan First Hospital, Wuhan, Hubei; 1 from Jinling Hospital, Nanjing, Jiangsu). All patients underwent thin-slice chest CT. Patients who had (1) incomplete CT imaging data, (2) chest radiographs only, or (3) other CT examinations were excluded. The study was approved by the institutional review boards of all 3 hospitals, and the requirement for written informed consent was waived. The mean age of the enrolled patients was 44 years. Forty-six were male (age 45 ± 17 years) and 50 were female (age 43 ± 13 years).

CT protocols

All included patients underwent non-contrast CT using one of the following multidetector CT scanners: Somatom Definition AS or Somatom Definition Flash (Siemens Healthcare); Optima CT540 or Optima 680 (GE Healthcare). Each scan was acquired at end-inspiration with the patient in the supine position, covering the lung apex to the diaphragm. The CT parameters were as follows: (1) tube voltage 120 kVp, (2) reference tube current 110–250 mAs, (3) detector collimation 16–320 × 0.5–0.625 mm, (4) slice thickness 1.0–1.25 mm, (5) slice interval 0.9–1.25 mm, and (6) pitch 1–1.375.

Image readings and definition of reference standard

Clinical readings were performed independently by three cardiothoracic radiology residents (L.Q., L.W., and X.Y.Z., with 6, 5, and 2 years of experience in chest imaging interpretation, respectively) who were blinded to clinical data and previous imaging results. All 3 readers were first required to record the presence or absence of COVID-19 and, if present, the number and location (lobe and segment) of the lesions.

Then, abnormal features on chest CT images were recorded, including (1) ground-glass opacity (GGO), defined as an area of increased attenuation without obscuration of bronchial and vascular structures [9]; (2) pulmonary consolidation; (3) crazy paving pattern; (4) diffuse, central, or peripheral distribution of lesions, defined according to a previous publication [17]; (5) thoracic lymphadenopathy, with a lymph node short-axis diameter ≥ 10 mm; and (6) other pulmonary illness such as emphysema or fibrosis. The number of abnormal lobes was also recorded. In addition, a CT severity score was calculated from the chest CT images. Each lung lobe was scored as 0 (normal) or 1 (abnormal, i.e., any lesion detected regardless of its opacity and extent). Accordingly, the maximum cumulative score was 5, with all 5 lobes involved. The CT severity score in this study is expressed as n/5 × 100%, where n is the number of involved lung lobes. CT severity was categorized into the following classes: (a) mild (≤ 20%), (b) moderate (> 20% and ≤ 50%), and (c) severe (> 50%).
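The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the software used in the study: the function names `ct_severity_score` and `severity_class` are hypothetical, and the treatment of the class boundaries (exactly 20% counted as mild, exactly 50% as moderate) is an assumption read from the thresholds in the text.

```python
LOBES = 5  # right upper, right middle, right lower, left upper, left lower


def ct_severity_score(lobe_abnormal):
    """Return the CT severity score in percent from five 0/1 lobe flags."""
    if len(lobe_abnormal) != LOBES:
        raise ValueError("expected one flag per lung lobe (5 in total)")
    n = sum(1 for flag in lobe_abnormal if flag)
    # Multiply before dividing so the result is exact for every n in 0..5.
    return n * 100 / LOBES


def severity_class(score_percent):
    """Map a severity score to a class (boundary handling is an assumption)."""
    if score_percent <= 20:
        return "mild"
    if score_percent <= 50:
        return "moderate"
    return "severe"


score = ct_severity_score([1, 0, 1, 0, 0])  # two involved lobes
print(score, severity_class(score))
```

Because only five lobes are scored, the score can take just six values (0, 20, 40, 60, 80, 100%), so under this reading "moderate" corresponds to exactly two involved lobes.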

The reference standard for the presence of COVID-19 and imaging features on chest CT was defined by two well-experienced senior radiologists (G.M.L. and Z.Y.S. with 37 and 18 years of experiences in chest radiology) who made the final decision in consensus combining the patients’ clinical, laboratory, and chest CT imaging data.

Deep learning algorithm development

An automatic AI pneumonia detection and evaluation system was used to extract CT features and quantitatively estimate the pulmonary involvement of abnormalities. The system is built on deep neural networks and comprises three major steps designed to ensure accurate detection of patients with COVID-19 pneumonia: (1) abnormality detection, (2) voxel segmentation, and (3) pulmonary lobe segmentation. All processes were performed automatically by the AI system without any human interaction.

Abnormality detection and segmentation

In this study, lung lesions related to COVID-19 pneumonia included consolidation, GGO, nodules, and others such as fibrosis. A convolutional MVP-Net [18] was used to detect the lesions automatically. Domain knowledge from clinical practice was incorporated into the model design. Because radiologists tend to inspect multiple windows to reach an accurate diagnosis, we implemented this idea with a multi-view feature pyramid network, in which multi-view features were extracted from images rendered with varied window widths and window levels. To combine this multi-view information effectively, a channel-wise attention module was employed to capture complementary information across the different views. The overall architecture of the network is shown in our previously published work [18]. A three-pathway architecture extracts the most prominent features from each representative view and is followed by a classifier and a regressor that classify and localize the potentially abnormal regions in CT images. Afterwards, a 3D U-Net [19] was used to classify the voxels representing the abnormality within the detected regions, yielding voxel-wise regions of abnormality. From the output of these methods, a number of metrics, such as the volume and CT value of the lesions, could be calculated and reported.
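The idea of rendering the same image under several window widths and levels can be illustrated with plain intensity windowing. The sketch below is not the MVP-Net implementation: the function names are hypothetical, and the three (level, width) presets are common display conventions chosen here purely for illustration, since the window settings used by the system are not stated in this section.

```python
# Illustrative (level, width) presets in Hounsfield units (HU).
# These are assumed values, not the settings used in the study.
WINDOWS = {
    "lung": (-600, 1500),
    "mediastinum": (40, 400),
    "wide": (0, 2000),
}


def apply_window(hu, level, width):
    """Clip an HU value to a display window and rescale it to [0, 1]."""
    low = level - width / 2
    high = level + width / 2
    clipped = min(max(hu, low), high)
    return (clipped - low) / (high - low)


def multi_window_views(hu_values):
    """Render the same HU samples under every preset window.

    Returns a dict mapping window name to a list of rescaled intensities,
    i.e. one "view" per window that a multi-view network could take as input.
    """
    return {
        name: [apply_window(v, level, width) for v in hu_values]
        for name, (level, width) in WINDOWS.items()
    }


views = multi_window_views([-1000, -600, 40, 300])
for name, values in views.items():
    print(name, [round(v, 2) for v in values])
```

Each window emphasizes a different intensity range, which is why the views carry complementary information: a narrow soft-tissue window saturates most aerated lung, whereas a lung window preserves contrast between normal parenchyma and ground-glass opacity.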

Pulmonary lobe segmentation

Pulmonary lobe segmentation was necessary to localize the lesions within the lung. To this end, a 3D U-Net was adopted as the basic segmentation network. In addition, a smooth margin loss was used to mine the most informative samples for training. To help select the best model during training, two metrics that leverage anatomical priors were used [20].

None of the CT data in this study had been used before, and there was no overlap of patient identities among the datasets. After the CT images were analyzed with this system, the presence or absence of COVID-19 pneumonia was recorded on a per-patient and per-lobe basis.

Deep learning algorithm training, validation, and testing

A total of 19,291 pulmonary CT scans from 14,435 individuals were used for training and validating the deep learning algorithm: 3854 scans from 2154 COVID-19 patients, 6871 scans from 5847 patients diagnosed with other pneumonia (bacterial pneumonia, fungal pneumonia, and other viral pneumonia), and the remaining 8566 scans from 6434 healthy people. All 96 CT scans of the study population were assigned to the validation set, with no overlap between the training set and the validation set (Fig. 1).

Fig. 1

Flow diagram shows the overview of deep learning algorithm and participant selection

Comparison of deep learning algorithm and radiologists

The dataset of 96 COVID-19 patients with chest CT images was used to compare the diagnostic performance of three independent radiologists (resident 1, 6 years; resident 2, 5 years; resident 3, 2 years of experience in chest imaging interpretation) and the deep learning algorithm. Diagnostic performance was evaluated for pneumonia lesion detection on a per-patient and per-lobe basis.

We also investigated the impact of the deep learning algorithm on guiding the diagnosis of the three residents. To avoid potential memorization bias, the residents were asked to make a diagnosis with the assistance of the AI system 2 weeks after the initial reading. Abnormality detection, voxel segmentation, and pulmonary lobe segmentation were processed automatically by the AI system. During the second round of reading the same CT images, the AI system presented the labeled lesions on the CT slices and provided its lesion detection result for each lobe. The residents made their final diagnosis with the assistance of the AI system, and the resulting diagnostic performance was compared with that of their initial reports.

The reference standard was defined by two well-experienced senior radiologists (G.M.L. and Z.Y.S. with 37 and 18 years of experiences, respectively, in chest radiology) who made the final decision in consensus combining the patients’ clinical, laboratory, and chest CT imaging data.

Statistical analysis

We performed statistical analysis using commercially available statistical software, SPSS (V23.0, IBM SPSS Inc.). Categorical variables were presented as numbers and percentages. Continuous data were presented as mean ± standard deviation or median (interquartile range), as appropriate. On a per-patient and per-lobe basis, the accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and 95% confidence intervals of the three resident radiologists' evaluations and of the deep learning algorithm were assessed. The sensitivities and specificities of the deep learning algorithm were compared with those of the three residents by chi-square test, with the reading reports of the two experienced radiologists as the reference standard. A p value cutoff of 0.017 was used for these comparisons, based on Bonferroni correction for three comparisons. Reading times per assessed case were compared between the deep learning model and the radiology residents with the unpaired t test. The F1 score was calculated as the harmonic mean of recall and precision. F1 scores and confusion matrices were calculated using scikit-learn 0.19 (scikit-learn.org). For all other analyses, a p value < 0.05, not corrected for multiple comparisons, was regarded as significant.
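For readers who want to reproduce the per-patient and per-lobe metrics named above, the following Python sketch shows how they follow from the four confusion-matrix counts, and how the Bonferroni cutoff of 0.017 arises from 0.05 divided by three comparisons. The function names and the example counts are hypothetical and are not taken from the study data. The sketch also assumes that every denominator is non-zero.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Compute standard diagnostic metrics from confusion-matrix counts.

    Assumes every denominator is non-zero (no empty class or prediction).
    """
    sensitivity = tp / (tp + fn)  # also called recall
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)          # also called precision
    npv = tn / (tn + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # F1 is the harmonic mean of precision (PPV) and recall (sensitivity).
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
    }


def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison significance cutoff under Bonferroni correction."""
    return alpha / n_comparisons


# Hypothetical counts, for illustration only.
metrics = diagnostic_metrics(tp=40, fp=5, fn=10, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")

print(f"Bonferroni cutoff: {bonferroni_alpha(0.05, 3):.3f}")
```

The same definitions underlie `sklearn.metrics.f1_score` and `sklearn.metrics.confusion_matrix`, which the authors report using for the F1 scores and confusion matrices.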

Results

CT image findings

As shown in Table 1, pneumonia was detected on chest CT images in 88 (91.7%) of the 96 patients. Most patients (74, 77.1%) had multiple lesions on the initial CT images. Sixty-six patients (68.8%) had two or more lobes involved. Among the lesions identified by the experienced radiologists, 75 patients had abnormalities in the right lung and 73 patients in the left lung. Seventy-five patients (78.1%) presented with bilateral lung involvement. All 88 patients had GGOs (10, 10.4%), consolidation (3, 3.1%), or a combination of GGOs and consolidation (75, 78.1%). A crazy paving pattern was observed in 32 patients (33.3%), and interstitial abnormalities were found in 50 patients (52.1%). Eighty-two patients (85.4%) had subpleurally distributed disease. Typical CT features are listed in Table 1. As defined above, a CT severity score ≤ 20% (mild) was seen in 30 patients, 20–50% (moderate) in 12 patients, and > 50% (severe) in 54 patients.

Table 1 Overview of CT imaging features in 96 patients

Comparison of performance between deep learning model and radiological residents

The performance of the deep learning model and the radiology residents in detecting abnormalities on chest CT images is listed in Table 2 for the per-patient and per-lobe analyses. For lesion detection at the per-patient level, the algorithm had a sensitivity of 1.00 (95% CI 0.96, 1.00) in identifying patients with abnormal CT images. The reading reports from the three residents showed sensitivities of 0.94 (95% CI 0.87, 0.98), 0.93 (95% CI 0.86, 0.97), and 0.89 (95% CI 0.80, 0.94), respectively. The specificity of the algorithm was 0.25 (95% CI 0.03, 0.65), while the specificities of the residents were 1.00 (95% CI 0.63, 1.00), 0.75 (95% CI 0.35, 0.97), and 1.00 (95% CI 0.63, 1.00). The F1 score of the algorithm was 0.97, which was higher than those of resident 2 and resident 3 (0.95 and 0.94, respectively) and marginally lower than that of resident 1 (0.97 after rounding). Accordingly, the sensitivity of the algorithm was superior to that of the residents in detecting abnormal CT images. Given the trade-off between sensitivity and specificity, the specificity of the algorithm was lower, while its F1 score was comparable with those of the 3 residents. For lesion detection at the per-lobe level, the accuracy, sensitivity, and specificity of the algorithm were 0.82 (95% CI 0.79, 0.86), 0.96 (95% CI 0.94, 0.98), and 0.63 (95% CI 0.55, 0.69), respectively. The accuracy and sensitivity of the algorithm were superior or similar to those of the residents, and its specificity was slightly inferior. The F1 score of the algorithm was 0.86, which was slightly lower than those of the residents (0.89, 0.89, and 0.89, respectively). Overall, the sensitivity of the algorithm was significantly higher than that of the residents, but its specificity was inferior (all p values < 0.017). Figure 2 shows representative CT images of confirmed COVID-19 patients and the corresponding outputs of the deep learning algorithm.
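The wide confidence interval around the per-patient specificity reflects the small number of patients without pneumonia: 88 of 96 patients had pneumonia, leaving only 8 negatives, and a specificity of 0.25 corresponds to 2 of those 8. The sketch below computes an exact (Clopper-Pearson) binomial interval in pure Python to show how such intervals behave with small denominators. The choice of the exact method is an assumption, since the interval method is not stated in the text, and the function names are hypothetical.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _bisect(func, target, increasing):
    """Solve func(p) = target for p in [0, 1], with func monotone in p."""
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if (func(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_ci(k, n, alpha=0.05):
    """Exact (Clopper-Pearson) two-sided confidence interval for k/n."""
    if k == 0:
        lower = 0.0
    else:
        # P(X >= k | p) increases with p; find where it equals alpha / 2.
        lower = _bisect(lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2, True)
    if k == n:
        upper = 1.0
    else:
        # P(X <= k | p) decreases with p; find where it equals alpha / 2.
        upper = _bisect(lambda p: binom_cdf(k, n, p), alpha / 2, False)
    return lower, upper


# Specificity 2/8 and sensitivity 88/88, using counts inferred from the text.
print("specificity CI:", tuple(round(x, 2) for x in exact_ci(2, 8)))
print("sensitivity CI:", tuple(round(x, 2) for x in exact_ci(88, 88)))
```

Under these assumptions the intervals come out at roughly (0.03, 0.65) and (0.96, 1.00), matching the values reported above for the algorithm.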

Table 2 Performance of deep learning model versus radiology residents
Fig. 2

Representative cases. Panels a and b. Chest CT images of a 53-year-old female diagnosed with COVID-19 pneumonia. a Axial unenhanced chest CT image and (b) corresponding output with deep learning algorithm show the multifocal subpleurally distributed GGOs with consolidation in the upper lobe of left lung (arrow). Panels c and d. Chest CT images of a 56-year-old female diagnosed with COVID-19 pneumonia. c Axial unenhanced chest CT image and (d) corresponding output with deep learning algorithm show the multifocal subpleurally distributed GGOs with crazy paving sign in both lungs (arrows)

The comparisons of diagnostic performance between the algorithm and the residents by lung lobe are shown in Table 3. In terms of F1 score, the algorithm was slightly inferior to the residents in detecting lesions located in the right upper lobe (0.87 vs. 0.92, 0.94, and 0.93) and the right lower lobe (0.84 vs. 0.88, 0.88, and 0.87). For the right middle, left upper, and left lower lobes, the algorithm was similar to the residents, with F1 scores of 0.84, 0.87, and 0.90, respectively. The algorithm achieved high sensitivities of 0.96 (95% CI 0.87, 1.00), 0.94 (95% CI 0.83, 0.99), 0.98 (95% CI 0.91, 1.00), 0.96 (95% CI 0.87, 1.00), and 0.97 (95% CI 0.89, 1.00) in the five lung lobes, respectively. The sensitivity of the algorithm was slightly higher than that of the residents, demonstrating the advantage of the algorithm in detecting abnormalities on CT images of patients with confirmed COVID-19.

Table 3 Performance of deep learning model versus radiology residents based on anatomical structure

Performance of deep learning model in CT severity scoring

With the severity classes defined in the Methods, the accuracy of CT severity grading was 0.66 (95% CI 0.55, 0.75) for the algorithm, and 0.79 (95% CI 0.70, 0.87), 0.77 (95% CI 0.67, 0.85), and 0.84 (95% CI 0.76, 0.91) for the residents. The confusion matrix in Fig. 3 shows the grading discrepancies of the algorithm and the residents. The algorithm was superior in grading severe CT images and similar to the residents in grading moderate CT images. For mild CT images, the algorithm was inferior to the residents.

Fig. 3

Confusion matrix comparing CT severity grading performance between deep learning model and radiological residents

Volume information extracted by deep learning model

The algorithm in this study extracted the volume and density of each abnormality, as well as the distance of each lesion from the pleura, from the chest CT images. The median total volume of all detected lesions on a per-patient basis was 40.10 cm³ (interquartile range 7.67, 116.16). The median volume of a single lesion was 0.64 cm³ (interquartile range 0.11, 3.06). The median CT value of the lesions was −555 HU (interquartile range −698, −401). The median distance of a lesion from the pleura was 2.90 mm (interquartile range 0.93, 10.83).
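Quantities of this kind follow directly from a voxel-wise lesion segmentation. As a minimal sketch, assuming the segmentation yields the HU values of the voxels inside a lesion and that the voxel spacing is known, the lesion volume and a median with interquartile range could be computed as follows. The function names, the voxel spacing, and the sample values are hypothetical, and quartile conventions differ between software packages.

```python
from statistics import median, quantiles


def lesion_volume_cm3(n_voxels, spacing_mm):
    """Volume of a segmented lesion from its voxel count and voxel spacing."""
    dx, dy, dz = spacing_mm
    return n_voxels * dx * dy * dz / 1000.0  # 1 cm^3 = 1000 mm^3


def median_iqr(values):
    """Median and (Q1, Q3) of a sample; needs at least two values."""
    q1, _, q3 = quantiles(values, n=4)  # default 'exclusive' method
    return median(values), (q1, q3)


# Hypothetical lesion: 1000 voxels at 0.7 x 0.7 x 1.0 mm spacing.
print(round(lesion_volume_cm3(1000, (0.7, 0.7, 1.0)), 3), "cm^3")

# Hypothetical HU values sampled inside a ground-glass lesion.
hu_inside_lesion = [-700, -650, -600, -550, -500, -450, -400]
med, (q1, q3) = median_iqr(hu_inside_lesion)
print("median HU:", med, "IQR:", (q1, q3))
```

The same summary can be applied across lesions or across patients, which is how per-lesion and per-patient medians such as those reported above would be obtained.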

As shown in Table 4, the algorithm was much faster, with a mean reading time of 20.3 s ± 5.8 per case, while the residents required 101.1 s ± 53.3, 68.3 s ± 18.5, and 112.4 s ± 44.7 per case, respectively (all p values < 0.017).
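The size of this difference can be checked from the summary statistics alone. The sketch below computes a pooled-variance (Student) unpaired t statistic from means and standard deviations, assuming one timing per case and therefore 96 observations per group. Both the group size and the pooled-variance form are assumptions, since the text only states that an unpaired t test was used, and the function name is hypothetical.

```python
from math import sqrt


def unpaired_t_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Pooled-variance unpaired t statistic and its degrees of freedom."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    standard_error = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / standard_error, n1 + n2 - 2


# Resident 1 (101.1 s +/- 53.3) versus the algorithm (20.3 s +/- 5.8).
# n = 96 per group is an assumption (one reading time per case).
t_stat, dof = unpaired_t_from_summary(101.1, 53.3, 96, 20.3, 5.8, 96)
print(f"t = {t_stat:.2f}, degrees of freedom = {dof}")
```

With equal group sizes the pooled-variance and Welch statistics coincide, although their degrees of freedom differ; given how unequal the two standard deviations are, Welch's version may be the more cautious choice.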

Table 4 Running time comparison (unit in second)

Assistance of deep learning algorithm generated results

Figure 4 displays the performance of the residents when aided by the AI system. The algorithm had AUCs of 0.86 (0.74, 0.98) and 0.87 (0.75, 0.98) for the identification of lesions on a per-patient and per-lobe basis, respectively; its performance was slightly inferior to that of the residents (triangle markers). However, the assistance of the AI system improved the diagnostic performance of the three residents (circle markers). As shown in Table 5, sensitivity on a per-patient basis improved slightly with the assistance of the AI system (0.94 vs. 0.98, 0.93 vs. 0.97, and 0.89 vs. 0.97) without sacrificing specificity. On a per-lobe basis, the diagnostic performance of the three residents combined with the AI system was also superior to their initial performance. Notably, the AI system helped the radiologists reach a diagnosis much faster (101.1 vs. 44.9 s, 68.3 vs. 39.2 s, and 112.4 vs. 48.8 s; all p values < 0.0001) (Table S1).

Fig. 4

Receiver operating characteristic (ROC) diagram for AI system versus radiologists. The blue curve was created by taking different thresholds over the predicted probability, showing the macro-average AUC of AI system. The asterisk showed the performance of model in a balanced setting. The filled markers showed residents’ performance. Dashed line connected performance of radiologists with and without the assistance of AI system

Table 5 Performance of residents with assistance of deep learning model

Discussion

In our study, we applied and validated a deep learning approach for chest CT image feature identification and quantitative assessment in 96 consecutive patients diagnosed with COVID-19. In analyzing the chest CT images, the algorithm quantified the volume of abnormalities and the distance between each lesion and the pleura. The algorithm also read CT images much faster than the residents. In detecting patients with COVID-19 pneumonia, the algorithm showed robust performance, with a sensitivity of 1.00 (0.96, 1.00), which was significantly higher than that of the residents. At both the per-patient and per-lobe level, the performance of the algorithm was comparable with that of the radiologists, with F1 scores of 0.97 vs. 0.97, 0.95, and 0.94 (per-patient) and 0.86 vs. 0.89, 0.89, and 0.89 (per-lobe). This study highlights the potential usefulness of this deep learning model in clinical practice.

Chest CT scanning of suspected patients at admission has been recommended by Chinese health professionals for prompt diagnosis [21]. AI technology powers many aspects of medical research, especially image processing [22, 23]. In past years, several deep learning-based automatic algorithms have been developed for detecting abnormalities on chest radiographs and CT images, including lung cancer screening, malignant pulmonary nodule detection, and pulmonary tuberculosis classification [24,25,26,27]. These studies demonstrated the value of deep learning models in facilitating the screening and evaluation of pulmonary diseases. In this study, we applied a deep learning model that is comparable with radiologists in detecting abnormalities on CT images from patients with confirmed COVID-19. Automatic detection and analysis make the diagnosis of COVID-19 pneumonia much faster than the traditional reading process and reduce the burden on clinicians of repeated exposure to the new coronavirus. To some extent, the application of deep learning algorithms in medical imaging accelerates diagnosis and may reduce human-to-human transmission in hospitals.

Notably, radiologists across the world have provided new insights by using lung CT as an additional diagnostic or screening tool for COVID-19 pneumonia. Bilateral GGOs, consolidative pulmonary opacities, and a prominent subpleural distribution are regarded as classical features on chest CT images of patients diagnosed with COVID-19 pneumonia, and they are similar to those reported for SARS-CoV and MERS-CoV [9,10,11,12,13,14,15,16,17,18,19]. In line with these findings, our study also demonstrated a high incidence of GGOs and consolidative opacities on CT images from COVID-19 pneumonia patients. Specifically, as shown by Bernheim in a relatively larger retrospective study, the lung abnormalities of COVID-19 pneumonia detected by CT were related to the time course of infection, and the lesion features mostly progressed from GGO to crazy paving pattern [28]. The “Diagnosis and Treatment Program of 2019 New Coronavirus Pneumonia” (trial sixth version) released by the Chinese Health Commission stated that a change in lesion volume of more than 50% within 24 to 48 h should be managed as severe disease [21]. The deep learning model we used can automatically calculate lesion volume and precisely locate lesions, which may be of great importance for monitoring, evaluating disease severity, and guiding treatment by collecting and analyzing data from baseline and follow-up CT images.

Another advantage of our study is that we evaluated the performance of a deep learning model in detecting abnormalities on chest CT images of COVID-19 pneumonia patients. The algorithm we used was comparable to the radiology residents in lesion detection and identification as measured by F1 score. A study by Xu retrospectively analyzed the performance of an inception migration-learning model in distinguishing COVID-19 from other pathogen infections [15]. In the external test, their model showed a total accuracy of 73%, with a sensitivity of 74% and a specificity of 67%. In contrast, our algorithm was specifically developed for detailed structural information extraction and precise lesion detection. For the 96 patients with chest CT images, this algorithm exhibited high sensitivity in pneumonia diagnosis on both a per-patient and a per-lobe basis. High sensitivity would be especially important in the prompt screening of COVID-19-infected patients. When compared with the radiology residents' reports, the specificity of the algorithm was inferior, which is attributed to marked metallic or respiratory artifacts (n = 3) and fibrosis (n = 3) that are easily recognized by human experts. The deep learning model we used thus improved sensitivity at the cost of specificity in lesion detection. Despite this trade-off, considering the global outbreak and fast spread, prompt diagnosis and quarantine should be the most imperative actions; sensitivity, rather than specificity, should play the more important role in identifying patients infected with the new coronavirus. We believe the application of deep learning models will accelerate patient screening and help to limit human-to-human transmission.

Owing to advances in computer science, AI techniques have been widely applied in biological and medical research in recent years. So far, there have been some successful AI-based applications that have made great contributions to epidemic alerts and the screening of infected patients [13]. Li et al recently reported a COVID-19 detection neural network (COVNet) that successfully distinguished COVID-19 pneumonia from community-acquired pneumonia [29]. To the best of our knowledge, our study is the first to apply a deep learning model to comprehensively analyze lesion features on chest CT images of COVID-19 patients. Notably, the involvement of AI markedly accelerated the reading process without sacrificing sensitivity, and the assistance of the AI system improved the diagnostic performance of the radiologists. We believe the application of AI systems will accelerate the diagnosis of pneumonia and provide the precise location of pneumonia lesions. COVID-19 will not be the last epidemic to challenge public health experts. The growth of AI-driven techniques to identify epidemiologic risks early will be key to improving the prediction, prevention, and detection of future global health risks.

There are several limitations of this study. First, since this is a retrospective study, the performance of the deep learning model in an actual clinical situation has not been validated. Real-time application of this model in clinical practice is needed. Second, we used experienced radiologists' reading reports as the reference standard. Although this is routine practice, there might still be some variability. Third, we included a total of 96 patients from three hospitals across China, of whom 87.5% were from a single institution, so the reproducibility of the performance of our algorithm remains unclear. Fourth, because of the small sample size and the outbreak of the epidemic, our study suffered from an imbalanced dataset. Some commonly used probabilistic or ranking metrics were not applied because they are not applicable to this deep learning algorithm. In addition, test results from a small dataset might not generalize well to unseen cases; larger multi-center datasets from across the world are needed to test our deep learning model in COVID-19 pneumonia detection. Finally, this deep learning model showed worse specificity than the radiologists in lesion detection, which will lead to more false-positive cases. However, these false positives are easily recognized by human experts.

In conclusion, we applied a deep learning model for specific feature extraction and quantitative lesion detection on chest CT images of patients diagnosed with COVID-19 pneumonia. Precise lesion characterization, such as volume, may provide valuable information for clinical classification and treatment selection. Moreover, the algorithm showed higher sensitivity than, and F1 scores comparable with, those of radiologists in detecting abnormalities on a per-patient and per-lobe basis, and it read images rapidly, suggesting that a deep learning algorithm could become part of standard care in real-time application.