Introduction

The coronavirus disease (COVID-19) pandemic has caused more than 10.1 million infections and 503,000 deaths worldwide as of June 30, 2020 [1]. The real-time reverse transcriptase polymerase chain reaction (RT-PCR) test for viral nucleic acid is the currently recommended method for COVID-19 diagnosis [2, 3]. However, with the rapid increase in the number of infections, RT-PCR tests may be fallible depending on viral load or sampling technique, and their availability may vary across global regions. Multiple studies have shown the utility of chest CT for diagnosis of COVID-19 [4,5,6], including reports of diagnostic accuracy > 80% for chest CT using deep learning (DL) approaches [7, 8].

While these studies often report classification performance, i.e., positive or negative for COVID-19 pneumonia [8, 9], specific high-dimensional features that distinguish COVID-19 pneumonia from other similar-appearing lung diseases remain relatively unexplored. Furthermore, there is a paucity of research that directly compares the predictive performance of feature engineering (e.g., radiomics), deep learning, and other machine learning approaches [10]. While it is generally accepted that deep learning performs robustly with large datasets for image discrimination, its performance compared to radiomics has not been closely examined, a relevant topic in medicine, where sample sizes are much smaller than in non-medical image datasets such as ImageNet. The types of machine learning approaches that best augment clinician performance also remain unknown.

A recently described neural network using a large-scale bidirectional generative adversarial network (BigBiGAN) has shown potential for end-to-end COVID-19 pneumonia diagnosis [11]. In this network, chest CT images are taken as the input, and high-dimensional semantic features representing specific characteristics of each image are produced by modules for image encoding, image generation, and image discrimination. A recent study has shown the potential utility of this method for distinguishing COVID-19 pneumonia from other viral pneumonias [12].

In this study, we aim to uncover image features of COVID-19 lung disease and further compare radiomics versus deep learning model performance using chest CT. Specifically, we target COVID-19 and non-COVID-19 viral pneumonia of patients who presented with similar clinical symptoms and CT chest findings and compare the performance of deep learning features, radiomics features, and combined approaches for COVID-19 pneumonia diagnosis.

Material and methods

Study cohort

This retrospective study included patients who (1) had symptoms suspicious for COVID-19 and were diagnosed with COVID-19 or non-COVID-19 viral pneumonia during the COVID-19 outbreak; (2) underwent chest CT with or without contrast at the time of diagnosis; and (3) underwent RT-PCR testing (based on samples of bronchoalveolar lavages, endotracheal aspirates, nasopharyngeal swabs, or oropharyngeal swabs) to determine COVID-19 status. Only those patients who tested positive or negative on at least two RT-PCR tests were included. Patients with PCR-confirmed COVID-19 pneumonia and underlying lung diseases (e.g., lung cancer) were included. Lesion segmentation was performed only on lung lesions suspicious for pneumonia, excluding known sites of lung cancer or other chronic lung lesions. For non-COVID-19 pneumonia patients, we included patients clinically suspected to have a viral source of infection. Patients with tuberculosis, fungal, or bacterial pneumonia were excluded in order to examine specific image features of non-COVID-19 viral pneumonia. This retrospective study was approved by the institutional review boards of University of Science and Technology of China (IRB no.2020-P-038) and Stanford University (IRB no.51059), with waiver of informed consent or assent.

Chest CT technique and image annotations

Chest CT was performed with a slice thickness of 1.25–5 mm (NeuViz 64 or 128, Neusoft, Shenyang, China) with or without contrast. CT parameters of the external dataset were 1–3 mm slice thickness with or without contrast (LightSpeed VCT and Revolution, GE Healthcare, Milwaukee, WI; Aquilion, Toshiba Medical Systems, Otawara, Japan; SOMATOM, Siemens, Erlangen, Germany).

Four blinded attending radiologists in China (> 5 years’ experience) independently segmented the boundary of all lung lesions slice-by-slice using ITK-Snap software (v.3.6.0). The detailed segmentation procedure is presented in Supplementary Figure S1. All segmentations underwent quality control for proper annotation by an expert chest radiologist (> 10 years’ experience). Images were not segmented if the radiologists did not detect lung lesions.

Feature extraction and model development

The open-access Google Colab platform [13] and PyRadiomics [14] were used for DL and radiomics feature extraction, respectively. CT data from two hospitals in China were randomly divided into training, validation, and test datasets (80:10:10), and the data were processed on servers in China. The dataset from a US hospital served as an external evaluation set and was processed on servers in the USA.
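The random 80:10:10 division can be illustrated with a minimal Python sketch. This is not the authors' code: the function name `split_indices`, the fixed seed, and the toy size of 1000 items are assumptions made purely for illustration.

```python
import random


def split_indices(n_items, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle item indices and cut them into train/validation/test lists.

    Hypothetical helper: the seed and rounding rule are illustrative choices.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    n_train = round(fractions[0] * n_items)
    n_val = round(fractions[1] * n_items)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Toy example with 1000 items (not the study's slice count).
train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

Each index ends up in exactly one subset, so no item is shared between training and evaluation in this sketch.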

High-dimensional deep learning features

We used a BigBiGAN-based architecture to train and extract high-dimensional deep learning features of COVID-19 versus non-COVID-19 pneumonia lesions. Two different data inputs were used: (1) original CT with segmentation masks, where the pixels of the pneumonia lesions were retained at the original CT intensity and the pixels outside were set to zero (Supplementary Figure S2); (2) CT images of the whole lung without segmentation mask. The batch size and number of epochs for BigBiGAN training were set to 20 and 200, respectively. The 120-dimensional deep learning features were extracted by the encoder module of BigBiGAN when the loss was minimum in the last training epoch, and these features then served as input for the subsequent classifier models.
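The first input type (lesion pixels kept, all other pixels set to zero) can be sketched as follows. This is a minimal pure-Python illustration on nested lists; the function name `mask_lesion` and the 2 x 2 toy slice are hypothetical, and the actual preprocessing pipeline is not described in more detail in the text.

```python
def mask_lesion(ct_slice, lesion_mask):
    """Keep CT intensities inside the lesion mask; set all other pixels to zero.

    ct_slice and lesion_mask are equally sized 2D lists (rows of pixels).
    Hypothetical helper for illustration only.
    """
    if len(ct_slice) != len(lesion_mask):
        raise ValueError("slice and mask must have the same number of rows")
    masked = []
    for row, mask_row in zip(ct_slice, lesion_mask):
        if len(row) != len(mask_row):
            raise ValueError("slice and mask rows must have the same length")
        masked.append([value if inside else 0
                       for value, inside in zip(row, mask_row)])
    return masked


# Toy 2 x 2 slice with made-up intensity values; 1 marks lesion pixels.
toy_slice = [[-900, -650], [-700, -1000]]
toy_mask = [[0, 1], [1, 0]]
masked_slice = mask_lesion(toy_slice, toy_mask)
print(masked_slice)
```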

Radiomics features

We used PyRadiomics [14], an open source package recommended for standardized radiomics analysis workflow [15], to extract 120-dimensional radiomics features of the segmented lung lesions. We extracted the following radiomics features: 19 first-order statistics features, 16 shape-based 3D features, 10 shape-based 2D features, 24 gray-level co-occurrence features, 16 gray-level run length features, 16 gray-level size zone features, five neighboring gray tone difference features, and 14 gray-level dependence features. Details of the feature extraction are presented on the webpage of PyRadiomics [16].
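As a quick consistency check, the per-class feature counts listed above add up to the stated 120 dimensions. The short Python sketch below only sums the counts given in the text; the dictionary keys are descriptive labels, not PyRadiomics identifiers.

```python
# Feature counts per class, copied from the text above.
feature_counts = {
    "first-order statistics": 19,
    "shape-based 3D": 16,
    "shape-based 2D": 10,
    "gray-level co-occurrence": 24,
    "gray-level run length": 16,
    "gray-level size zone": 16,
    "neighboring gray tone difference": 5,
    "gray-level dependence": 14,
}

total_features = sum(feature_counts.values())
print(total_features)  # 120
```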

Classifier models

To determine the performance of DL and radiomics-extracted features, we used two widely used classifiers: a linear classifier typically used in supervised learning, and least absolute shrinkage and selection operator (Lasso) often used in radiomics [17,18,19]. In addition, we combined both the DL and radiomics features as a single input to determine performance for each of the two models. Model performance was evaluated on the hold-out test set, as well as the external validation set. The overall study design is illustrated in Fig. 1.
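Combining the two feature sets into a single classifier input can be pictured as a simple concatenation of the 120 DL features and the 120 radiomics features into one 240-dimensional vector. The sketch below is an assumption about how such a combination might look; the function name `combine_features`, the ordering (DL first), and the placeholder values are all hypothetical.

```python
def combine_features(dl_features, radiomics_features, dl_dim=120, rad_dim=120):
    """Concatenate DL and radiomics feature vectors for one image.

    Hypothetical helper: the DL-first ordering is an illustrative assumption.
    """
    if len(dl_features) != dl_dim:
        raise ValueError(f"expected {dl_dim} DL features, got {len(dl_features)}")
    if len(radiomics_features) != rad_dim:
        raise ValueError(
            f"expected {rad_dim} radiomics features, got {len(radiomics_features)}"
        )
    return list(dl_features) + list(radiomics_features)


# Placeholder vectors standing in for real extracted features.
dl_vector = [0.1] * 120
radiomics_vector = [1.0] * 120
combined = combine_features(dl_vector, radiomics_vector)
print(len(combined))  # 240
```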

Fig. 1 Radiomics and artificial intelligence neural network workflow in this study

AI augmentation for clinical diagnosis

Three blinded radiologists from an independent hospital in China reviewed the CT images from the test dataset and external validation dataset and performed the first round of diagnosis for COVID-19 versus non-COVID-19 pneumonia. The radiologists were blinded to clinical diagnosis and RT-PCR test results. After a 2-week washout period, the reviewers were provided with model predictions based on combined deep learning and radiomics features for a second round of review. Clinical performance with and without model predictions was calculated.

Statistical analysis

The receiver operating characteristic (ROC) curve, area under the curve (AUC), sensitivity, and specificity were used to evaluate the diagnostic accuracy for COVID-19 pneumonia. All statistical computing was performed using R language (version 3.4.3, Vienna, Austria). Based on the image features extracted by PyRadiomics and BigBiGAN, the linear classifier and Lasso were implemented with “lm()” and “glmnet(),” respectively, for selection of significant features. Chi-square and ANOVA tests were used to evaluate the differences in demographics. To determine interobserver variability of radiomics features with respect to manual segmentation, an additional radiologist performed manual segmentation of lung lesions in 10 random patients. The Mann-Whitney U test was performed on the radiomics features extracted from the two sets of manual segmentations. P < 0.05 was considered a significant difference.
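For readers less familiar with these metrics, the following Python sketch shows how sensitivity, specificity, and AUC can be computed from labels and scores. It is not the R code used in the study; the function names, the toy labels and scores, and the 0.5 decision threshold are assumptions for illustration. The AUC is computed through its rank interpretation: the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, with ties counted as one half.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (made up): three positive and three negative cases.
labels = [1, 1, 1, 0, 0, 0]
model_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
predictions = [1 if s >= 0.5 else 0 for s in model_scores]  # assumed threshold

sens, spec = sensitivity_specificity(labels, predictions)
area = auc(labels, model_scores)
print(round(sens, 3), round(spec, 3), round(area, 3))
```

Sensitivity and specificity depend on the chosen threshold, whereas the AUC summarizes ranking quality across all thresholds.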

Results

Study cohort

A total of 266 patients were initially collected in this study. Ninety-three consecutive COVID-19-positive and 91 COVID-19-negative pneumonia patients from The First Affiliated Hospital of University of Science and Technology of China and The Lu’an Affiliated Hospital of Anhui Medical University in China (January 18–February 29, 2020) met the inclusion criteria. Seventeen patients with COVID-19 pneumonia and 15 patients with COVID-19-negative viral pneumonia from Stanford University Hospital (February 1–May 30, 2020) served as external validation. The flowchart of patient enrollment is shown in Fig. 2.

Fig. 2 Patient enrollment in our study. Asterisk denotes the exposure history defined in our study (for patients from China): history of travel to Wuhan in the last 14 days, history of contact with confirmed COVID-19 patient(s), and history of being in a dense crowd. The relevant exposure history was selected as an inclusion criterion since these patients were at high risk of COVID-19 infection during this period

The mean patient age was 45 years (standard deviation 15.6) with no significant difference between male (n = 128) and female (n = 88) (P > 0.05). The median time interval from symptom onset to CT was 8 days for both COVID-19-positive and -negative patients. The chief complaints of the patients were cough and fever, comprising 95.4%. Detailed demographics are shown in Table 1.

Table 1 Demographics of the patients enrolled from the three hospitals in this study

Chest CT dataset

CT scan details of the study population are shown in Supplementary Material Appendix A. Within the discovery cohort, ten COVID-19-positive and 12 COVID-19-negative patients did not have visible lung lesions and were excluded from segmentation. Two patients whose lesion segmentations were disputed among the radiologists were also excluded. This resulted in a total of 12,001 CT slices (7173 COVID-19 positive and 4828 COVID-19 negative) and 24,216 pneumonia lesion segmentations. All slices were randomly divided into training (9573), validation (1209), and test (1219) sets. The external validation set comprised 3799 images (2349 COVID-19 positive and 1450 COVID-19 negative) containing 7883 lesion segmentations from 17 COVID-19-positive and 15 COVID-19-negative pneumonia patients. No significant difference was found in the radiomics features extracted from the two sets of manually segmented pneumonia images of the 10 random patients by the Mann-Whitney U test (P > 0.05).
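The reported slice counts can be cross-checked with a few lines of arithmetic. The Python sketch below uses only numbers stated in this section and shows that the three subsets sum to 12,001 slices and correspond to roughly 79.8%, 10.1%, and 10.2% of the data, close to the nominal 80:10:10 ratio.

```python
# Slice counts copied from the text above.
split_sizes = {"training": 9573, "validation": 1209, "test": 1219}
class_sizes = {"COVID-19 positive": 7173, "COVID-19 negative": 4828}

total_slices = sum(split_sizes.values())
fractions = {name: size / total_slices for name, size in split_sizes.items()}

print(total_slices, sum(class_sizes.values()))
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.1%}")
```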

Image features and model performance

The AUCs (sensitivity and specificity) of linear and Lasso classifiers using 120-dimensional DL features, 120-dimensional radiomics features, and combined 240-dimensional DL and radiomics features are shown in Fig. 3. Individual DL performances using the whole lung images as inputs are also shown in Supplementary Figure S3. Although the AUCs (sensitivity and specificity) using the whole lung were higher than those using the pneumonia lesion on the training dataset (linear classifier: 0.98 [91.8%, 93.4%] vs. 0.91 [80.0%, 87.2%]; Lasso classifier: 0.97 [93.0%, 92.1%] vs. 0.91 [80.8%, 86.3%]), performance with the whole lung on the external validation dataset was slightly inferior (linear classifier: 0.84 [75.7%, 76.8%] vs. 0.86 [76.5%, 80.9%]; Lasso classifier: 0.83 [71.2%, 81.0%] vs. 0.87 [73.5%, 81.8%]).

Fig. 3 Receiver operating characteristic (ROC) curves and area under the curve (AUC) of the linear classifier and Lasso classifier for the differentiation of COVID-19 from other forms of viral pneumonia with clinical symptoms and CT signs similar to those of COVID-19. The pneumonia lesions on the CT image were used as the input of the BigBiGAN and PyRadiomics. The four ROC curves in each chart represent the training (red), validation (green), test (blue), and external validation datasets (yellow), respectively

Based on the two classifiers, 32 high-dimensional deep learning features were identified as significant (P < 0.0001) for classifying COVID-19 pneumonia, and four radiomics features (P < 0.0001) were associated with COVID-19 differentiation: the mean intensity value of the lesion and texture features (RMS: “original_firstorder_RootMeanSquared”; Uniformity: “original_firstorder_Uniformity”; NGTDMB: “original_ngtdm_Busyness”). The loss curve of the BigBiGAN training is shown in Supplementary Figure S4. Details of the significant features are shown in Supplementary Material Appendix B and C. When using the combined 240-dimensional DL and radiomics features, the following features were selected by both classifiers (P < 0.0001): five radiomics features (diagnostics_Image-original_Mean, original_shape_Maximum2DDiameterSlice, original_firstorder_Skewness, original_firstorder_Uniformity, and original_ngtdm_Busyness) and six deep learning features (18th, 24th, 35th, 50th, 65th, and 79th feature). The distribution of the values of these features in COVID-19-positive and -negative images is shown in Supplementary Material Appendix D. The signatures constructed by the linear classifier and Lasso classifier based on the combined features are shown in Supplementary Figure S5; CT images of COVID-19 and non-COVID-19 with significant image signature values are shown in Fig. 4.

Fig. 4 CT images of COVID-19-positive (a) and COVID-19-negative (b) patients with significantly different signature values based on the combined feature matrix. Signature scores are given for the linear classifier and the Lasso classifier, respectively. a(1) A 35-year-old male with bilateral opacities on CT; signature scores 1.32 and 0.99. a(2) A 43-year-old female with bilateral ground-glass opacities, vascular thickening, and interlobular septal thickening; signature scores 1.23 and 0.99. a(3) A 62-year-old male with bilateral multifocal consolidations; signature scores 1.24 and 0.98. a(4) A 45-year-old female with bilateral peripheral multifocal lesions; signature scores 1.25 and 0.97. b(1) A 29-year-old male with multifocal ground-glass opacities in the left lung; signature scores −0.14 and 0.04. b(2) A 30-year-old female with multifocal, mixed ground-glass opacity and consolidation in the right lung; signature scores −0.11 and 0.08. b(3) A 30-year-old male with bilateral multifocal consolidation; signature scores 0.07 and 0.70. b(4) A 29-year-old male with mixed densities in the right lung; signature scores −0.17 and 0.03

Clinical use

Diagnostic sensitivity and specificity of 3 radiologists on two test datasets were 74.7% and 80.3%, respectively. With the prediction outputs from combined deep learning and radiomics features, their performance increased with sensitivity and specificity of 91.2% and 91.9%, respectively, as shown in Fig. 5.

Fig. 5 Sensitivity and specificity of the radiologists’ diagnosis on the test datasets without (first round of diagnosis) and with (second round of diagnosis) the assistance of our AI semantic features plus radiomics features

Discussion

In this study, we uncover some of the deep learning and radiomics features that contribute to differentiation of COVID-19 from non-COVID-19 viral pneumonia. Features extracted from both deep learning and radiomics showed similar performance with linear and Lasso classifiers, with sensitivity > 73% and specificity > 75% on the external cohort. DL features extracted from the pneumonia lesion outperformed those extracted from the whole lung on the external validation dataset. Prediction outputs generated from our combined deep learning-radiomics model further augmented human expert performance.

To our knowledge, this is the first study to compare performance of DL versus radiomics models for differentiation of COVID-19 pneumonia. Various studies have described performance of DL models for COVID-19 pneumonia [20, 21]. However, specific image features relevant to COVID-19 classification remain opaque. While it is well known that, with large datasets, DL models outperform hand-crafted feature extraction [22, 23], large datasets are not always available in medicine and may be limited by disease prevalence, obstacles to data procurement, and other clinical factors. For smaller datasets, studies have suggested that feature engineering may be a more suitable machine learning strategy, with notable advantages of radiomics for medical imaging analysis [15, 24, 25]. At present, direct comparisons of radiomics and deep learning clinical model performance remain relatively unexplored [26]. In this study, we address these questions and further aim to enhance interpretability of such machine learning models.

Recent studies have shown that image features learned from a BigBiGAN framework can achieve state-of-the-art performance for image classification [11, 12, 27]. Unlike traditional generative adversarial networks, which are typically used for image synthesis, de-noising, or generation of high-quality images, BigBiGAN has shown robust performance for learning high-dimensional semantic features. We identified 32 deep learning features that differed significantly between COVID-19-positive and -negative lesion images (P < 0.0001). Using PyRadiomics analysis, four radiomics features were selected by the two classifiers (P < 0.0001) to differentiate COVID-19 from other types of viral pneumonia. When we combined both approaches, 6 deep learning features and 5 radiomics features were selected (P < 0.0001) by the two classifiers. These results might suggest that the neural network learned a larger number of distinguishing features. Although the ROC curves might suggest slightly more robust performance for radiomics models, sensitivity and specificity did not differ among deep learning, radiomics, and combined features.

Among the four significant radiomics features, we found the mean intensity of COVID-19 lesions (−665.9) to be higher than that of non-COVID-19 lesions (−887.0), which might reflect more diffuse opacities or a greater degree of fluid or debris affecting the airspaces. Based on NGTDMB, which measures the rate of change in intensity between a pixel and its neighborhood, we also found less intense change between adjacent pixels for COVID-19 (0.39) compared to non-COVID-19 (0.72), which might also indicate that within an affected lung region, there is a more diffuse airspace process, sparing fewer of the neighboring alveoli in COVID-19 compared to other types of viral pneumonia. This might also explain why RMS, which measures the magnitude of the image values by calculating contributions of each gray value (absolute), was higher for non-COVID-19 (660.8 vs. 563.1), where a greater number of spared alveoli, i.e., air-filled rather than fluid-filled by disease, could contribute to a wider magnitude of absolute gray values in the pixels. Uniformity was lower for COVID-19 (0.03) compared to non-COVID-19 (0.06) lesions, indicating a larger range of irregular texture for COVID-19 and suggesting a more heterogeneous lung texture, possibly due to diversity in airspace disease phenotypes (consolidation, ground-glass opacities, etc.) that combine varying degrees of edema and vascular and interlobular septal thickening, sometimes described as “crazy paving” on visual inspection [28].
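To make the RMS and Uniformity interpretations more concrete, the Python sketch below computes both quantities for two toy lesions. It follows the general first-order definitions (RMS as the square root of the mean squared intensity, Uniformity as the sum of squared gray-level probabilities after binning) in simplified form. The bin width of 25, the zero intensity shift, the function names, and the toy intensity values are assumptions and do not reproduce the extraction settings of this study.

```python
import math
from collections import Counter


def root_mean_squared(values, shift=0.0):
    """Square root of the mean of squared (optionally shifted) intensities."""
    return math.sqrt(sum((v + shift) ** 2 for v in values) / len(values))


def uniformity(values, bin_width=25):
    """Sum of squared gray-level probabilities after fixed-width binning.

    Equals 1.0 when all values share one bin and decreases as the
    intensities spread over more bins (more heterogeneous lesion).
    """
    bins = Counter(math.floor(v / bin_width) for v in values)
    n = len(values)
    return sum((count / n) ** 2 for count in bins.values())


# Toy lesions with made-up intensities.
homogeneous_lesion = [-700, -700, -700, -700]
mixed_lesion = [-900, -700, -500, -300]

print(root_mean_squared(homogeneous_lesion), uniformity(homogeneous_lesion))
print(round(root_mean_squared(mixed_lesion), 1), uniformity(mixed_lesion))
```

In this toy case, the lesion whose intensities span several bins has the lower Uniformity, which matches the reading of lower Uniformity as a more heterogeneous texture.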

Although it is difficult to directly map image phenotype from DL features alone, signatures from combined DL and radiomics features provide some clues to image-based discrimination for COVID-19 versus non-COVID-19 lung disease. COVID-19 patients showed higher signature scores for irregular intensity changes, heterogeneous intensities, and wider range in textures within the lung lesions compared with non-COVID-19 patients. For example, despite a large area of mixed opacity that might raise suspicion for COVID-19 on visual inspection (Fig. 4b(2)), feature extraction revealed intensity changes that were relatively regular within the lesion, a pattern that was associated with non-COVID-19. In another example, although multiple, nodular opacities in peripheral consolidative pattern on CT might raise suspicion for COVID-19 (Fig. 4b(3)), strong uniformity of CT intensity values within the lesions suggested a non-COVID-19 process. This was supported by the linear signature score (0.07) that was consistent with non-COVID-19.

When predictions from a combined deep learning-radiomics model were available, we observed improved radiologist diagnostic performance, with sensitivity and specificity increasing by 16.5 and 11.6 percentage points, respectively, suggesting a potential role for machine learning in clinician decision support.
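The reported gains are absolute differences between the two reading rounds (74.7% to 91.2% sensitivity and 80.3% to 91.9% specificity). The short Python sketch below reproduces this arithmetic and, for comparison, also shows the corresponding relative changes, which are not reported in the text and are included only to distinguish the two ways of expressing a gain. The function names are hypothetical.

```python
def absolute_gain(before, after):
    """Difference in percentage points, rounded to one decimal place."""
    return round(after - before, 1)


def relative_gain(before, after):
    """Change relative to the baseline, in percent, rounded to one decimal."""
    return round(100 * (after - before) / before, 1)


# Values from the Results section (without vs. with model predictions).
sensitivity_before, sensitivity_after = 74.7, 91.2
specificity_before, specificity_after = 80.3, 91.9

print(absolute_gain(sensitivity_before, sensitivity_after))  # 16.5
print(absolute_gain(specificity_before, specificity_after))  # 11.6
print(relative_gain(sensitivity_before, sensitivity_after))  # 22.1
print(relative_gain(specificity_before, specificity_after))  # 14.4
```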

There are several limitations to this study. While we used the high-dimensional, semantic features from the encoder module of the BigBiGAN framework, we did not examine other features that can be produced in the framework or by a different architecture. Since there is no specific definition of these deep learning semantic features, in the future, we will explore the image encoding processes used to generate each of them to further enhance their interpretability. While deep learning and radiomics approaches performed comparably with a training cohort of around 180 patients with 9500 images, a larger dataset could show more robust performance for deep learning. Finally, while we used features extracted when the loss was minimum in the last training epoch, “deep learning features” might vary with training parameters or change with other experimental settings.

In conclusion, we uncover specific deep learning and radiomics features relevant to COVID-19 pneumonia to assist interpretability of machine learning algorithms and contribute to understanding of COVID-19 pneumonia imaging phenotypes. Furthermore, we compare performance of deep learning and radiomics models for COVID-19 pneumonia diagnosis using chest CT and show potential for augmenting radiologist diagnostic performance with the aid of machine learning predictions.