Decoding COVID-19 pneumonia: comparison of deep learning and radiomics CT image signatures

Purpose The high-dimensional image features that underlie COVID-19 pneumonia remain opaque. We aim to compare feature engineering and deep learning methods to gain insight into the image features that drive CT-based prediction of COVID-19 pneumonia, and to uncover CT image features significant for COVID-19 pneumonia from deep learning and radiomics frameworks. Methods A total of 266 patients with COVID-19 or other viral pneumonia with clinical symptoms and CT signs similar to those of COVID-19 during the outbreak were retrospectively collected from three hospitals in China and the USA. All pneumonia lesions on CT images were manually delineated by four radiologists. One hundred eighty-four patients (n = 93 COVID-19 positive; n = 91 COVID-19 negative; 24,216 pneumonia lesions from 12,001 CT image slices) from two hospitals in China served as the discovery cohort for model development. Thirty-two patients (17 COVID-19 positive, 15 COVID-19 negative; 7883 pneumonia lesions from 3799 CT image slices) from a US hospital served as the external validation cohort. A bi-directional adversarial network-based framework and the PyRadiomics package were used to extract deep learning and radiomics features, respectively. Linear and Lasso classifiers were used to develop models predictive of COVID-19 versus non-COVID-19 viral pneumonia. Results 120-dimensional deep learning features and 120-dimensional radiomics features were extracted. The linear and Lasso classifiers identified 32 high-dimensional deep learning features and 4 radiomics features associated with COVID-19 pneumonia diagnosis (P < 0.0001). Both models achieved sensitivity > 73% and specificity > 75% on the external validation cohort, with slightly superior performance for the radiomics Lasso classifier. Human expert diagnostic performance improved (sensitivity and specificity increased by 16.5% and 11.6%, respectively) when using a combined deep learning-radiomics model. 
Conclusions We uncover specific deep learning and radiomics features that add insight into the interpretability of machine learning algorithms, and compare deep learning and radiomics models for COVID-19 pneumonia that might serve to augment human diagnostic performance. Electronic supplementary material The online version of this article (10.1007/s00259-020-05075-4) contains supplementary material, which is available to authorized users.


Introduction
The coronavirus disease (COVID-19) pandemic has caused more than 10.1 million infections and 503,000 deaths worldwide as of June 30, 2020 [1]. The viral nucleic acid real-time reverse transcriptase polymerase chain reaction (RT-PCR) test is the currently recommended method for COVID-19 diagnosis [2,3]. However, with the rapid increase in the number of infections, RT-PCR tests may be fallible depending on viral load or sampling technique, and their availability varies across global regions. Multiple studies have shown utility of chest CT for diagnosis of COVID-19 [4][5][6], including reports of diagnostic accuracy of chest CT > 80% using deep learning (DL) approaches [7,8].
While these studies often report classification performance, i.e., positive or negative for COVID-19 pneumonia [8,9], investigations of the specific high-dimensional features unique to COVID-19 pneumonia compared to other similar-appearing lung diseases remain relatively unexplored. Furthermore, there is a paucity of research that directly compares the predictive performance of feature engineering (e.g., radiomics), deep learning, and other machine learning approaches [10]. While it is generally accepted that deep learning performs robustly with large datasets for image discrimination, its performance relative to radiomics has not been closely examined, a relevant question in medicine, where sample sizes are much smaller than those of non-medical image datasets such as ImageNet. The types of machine learning approaches that most optimally augment clinician performance also remain unknown.
A recently described neural network using a large-scale bi-directional adversarial network (BigBiGAN) has shown potential for end-to-end COVID-19 pneumonia diagnosis [11]. In this network, chest CT images are taken as the input, and high-dimensional semantic features representing specific characteristics of each image are produced based on the modules of image encoding, image generation, and image discrimination. A recent study has shown the potential utility of this method for distinguishing COVID-19 pneumonia from other viral pneumonias [12].
In this study, we aim to uncover image features of COVID-19 lung disease and further compare radiomics versus deep learning model performance using chest CT. Specifically, we target COVID-19 and non-COVID-19 viral pneumonia of patients who presented with similar clinical symptoms and CT chest findings and compare the performance of deep learning features, radiomics features, and combined approaches for COVID-19 pneumonia diagnosis.

Study cohort
The inclusion criteria of this retrospective study were: patients with symptoms suspicious for COVID-19 who were diagnosed with COVID-19 or non-COVID-19 viral pneumonia during the COVID-19 outbreak; patients who underwent chest CT with or without contrast at the time of diagnosis; and patients who received RT-PCR tests (based on samples of bronchoalveolar lavage, endotracheal aspirate, nasopharyngeal swab, or oropharyngeal swab) to determine COVID-19 status. Only patients who tested positive or negative on at least two RT-PCR tests were included. Patients with PCR-confirmed COVID-19 pneumonia and underlying lung disease (e.g., lung cancer) were included; lesion segmentation was performed only on lung lesions suspicious for pneumonia, excluding known sites of lung cancer or other chronic lung lesions. For non-COVID-19 pneumonia, we included patients clinically suspected to have a viral source of infection. Patients with tuberculosis, fungal, or bacterial pneumonia were excluded so that image features specific to non-COVID-19 viral pneumonia could be examined. This retrospective study was approved by the institutional review boards of the University of Science and Technology of China (IRB no. 2020-P-038) and Stanford University (IRB no. 51059), with waiver of informed consent or assent.

Chest CT technique and image annotations
Chest CT was performed at a slice thickness of 1.25-5 mm (NeuViz 64 or 128, Neusoft, Shenyang, China) with or without contrast. CT parameters for the external dataset were 1-3 mm slice thickness with or without contrast (LightSpeed VCT and Revolution, GE Healthcare, Milwaukee, WI; Aquilion, Toshiba Medical Systems, Otawara, Japan; SOMATOM, Siemens, Erlangen, Germany).
Four blinded attending radiologists in China (> 5 years' experience) independently segmented the boundaries of all lung lesions slice-by-slice using ITK-SNAP software (v.3.6.0). The detailed segmentation procedure is presented in Supplementary Figure S1. All segmentations underwent quality control for proper annotation by an expert chest radiologist (> 10 years' experience). Images were not segmented if the radiologists did not detect lung lesions.

Feature extraction and model development
The open-access Google Colab platform [13] and PyRadiomics [14] were used for DL and radiomics feature extraction, respectively. CT data from the two hospitals in China were randomly divided into training, validation, and test datasets (80:10:10) and processed on servers in China. The dataset from the US hospital served as external validation and was processed on servers in the USA.
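As a minimal sketch of the 80:10:10 split described above (illustrative only; the study's actual split procedure, e.g., slice-level versus patient-level grouping, is not specified beyond the ratio):

```python
import random

def split_dataset(slice_ids, train_frac=0.8, val_frac=0.1, seed=42):
    """Randomly split CT slice identifiers into training,
    validation, and test subsets (80:10:10 by default)."""
    ids = list(slice_ids)
    random.Random(seed).shuffle(ids)  # deterministic shuffle for reproducibility
    n = len(ids)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test

# 12,001 slices -> 9600 / 1200 / 1201 under this naive slice-level split
train, val, test = split_dataset(range(12001))
```

Note that a naive slice-level split like this can leak slices from one patient across subsets; a patient-level split avoids that, which may explain why the study's counts (9573/1209/1219) differ slightly from an exact 80:10:10 ratio.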

High-dimensional deep learning features
We used a BigBiGAN-based architecture to train on and extract high-dimensional deep learning features of COVID-19 versus non-COVID-19 pneumonia lesions. Two different data inputs were used: (1) original CT with segmentation masks, where pixels within the pneumonia lesions retained their original CT intensity and pixels outside were set to zero (Supplementary Figure S2); (2) CT images of the whole lung without a segmentation mask. The batch size and number of epochs for BigBiGAN training were set to 20 and 200, respectively. The 120-dimensional deep learning features were extracted by the encoder module of BigBiGAN at the point of minimum loss in the last training epoch, and then served as input for the subsequent classifier models.
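The first input option, retaining original intensities inside the lesion and zeroing everything outside, can be sketched as follows (a toy illustration on nested lists standing in for a 2D CT slice; the study's actual pipeline operates on full CT volumes):

```python
def apply_lesion_mask(ct_slice, mask):
    """Zero out pixels outside the lesion mask; keep original
    CT intensity (HU) values inside (input option 1 in the text)."""
    return [
        [hu if m else 0 for hu, m in zip(row, mrow)]
        for row, mrow in zip(ct_slice, mask)
    ]

# 2x2 toy slice: two lesion pixels kept, two background pixels zeroed
ct = [[-700, -650], [-800, 40]]
mask = [[1, 0], [0, 1]]
apply_lesion_mask(ct, mask)  # [[-700, 0], [0, 40]]
```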

Radiomics features
We used PyRadiomics [14], an open-source package recommended for a standardized radiomics analysis workflow [15], to extract 120-dimensional radiomics features of the segmented lung lesions. We extracted the following radiomics features: 19 first-order statistics features, 16 shape-based 3D features, 10 shape-based 2D features, 24 gray-level co-occurrence features, 16 gray-level run length features, 16 gray-level size zone features, 5 neighboring gray tone difference features, and 14 gray-level dependence features. Details of the feature extraction are presented on the PyRadiomics documentation webpage [16].
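To make the first-order features concrete, here are hand-rolled versions of three of them (mean, root mean squared, and uniformity, which reappear in the Discussion). This is an illustrative sketch only; PyRadiomics itself handles resampling, intensity discretization details, and the full 120-feature set:

```python
import math
from collections import Counter

def first_order_features(voxels, bin_width=25):
    """Simplified first-order features over a flat list of lesion
    voxel intensities (HU). Uniformity is computed over a fixed
    bin-width histogram, mirroring PyRadiomics' discretization idea."""
    n = len(voxels)
    mean = sum(voxels) / n
    # RMS: magnitude of intensities regardless of sign
    rms = math.sqrt(sum(v * v for v in voxels) / n)
    # Uniformity: sum of squared bin probabilities; higher = more homogeneous
    bins = Counter(int(v // bin_width) for v in voxels)
    uniformity = sum((c / n) ** 2 for c in bins.values())
    return {"Mean": mean, "RootMeanSquared": rms, "Uniformity": uniformity}
```

Under these definitions, a lesion whose voxels all fall into few intensity bins scores high uniformity, while a heterogeneous mix of ground-glass and consolidation scores low, the behavior discussed for COVID-19 versus non-COVID-19 lesions later in the paper.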

Classifier models
To determine the performance of the DL- and radiomics-extracted features, we used two widely used classifiers: a linear classifier typically used in supervised learning, and the least absolute shrinkage and selection operator (Lasso), often used in radiomics [17][18][19]. In addition, we combined the DL and radiomics features into a single input to determine performance for each of the two classifiers. Model performance was evaluated on the holdout test set as well as the external validation set. The overall study design is illustrated in Fig. 1.
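The reason Lasso doubles as a feature selector is that its L1 penalty drives the coefficients of uninformative features exactly to zero. A minimal cyclic coordinate-descent sketch (the study used R's `glmnet()`; this toy version omits intercepts, standardization, and convergence checks):

```python
def soft_threshold(rho, lam):
    """L1 shrinkage operator: pulls rho toward zero by lam."""
    if rho < -lam:
        return rho + lam
    if rho > lam:
        return rho - lam
    return 0.0

def lasso_coordinate_descent(X, y, lam=0.1, n_iter=100):
    """Minimize (1/2n)||y - Xw||^2 + lam*||w||_1 by cyclic coordinate
    descent. Features whose weight stays at 0 are effectively dropped,
    which is how Lasso performs feature selection."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # correlation of feature j with the residual (excluding j itself)
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * w[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, lam) / z
    return w
```

On a toy design where only the first feature drives the response, the second coefficient is shrunk exactly to zero, mirroring how the study's Lasso selected a small significant subset (e.g., 4 of 120 radiomics features).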

AI augmentation for clinical diagnosis
Three blinded radiologists from an independent hospital in China reviewed the CT images from the test dataset and external validation dataset and performed a first round of diagnosis of COVID-19 versus non-COVID-19 pneumonia. The radiologists were blinded to the clinical diagnosis and RT-PCR test results. After a 2-week washout period, the reviewers were provided the model predictions from combined deep learning and radiomics features for a second round of review. Clinical performance with and without the model was calculated.

Statistical analysis
The receiver operating characteristic (ROC) curve and area under the curve (AUC), sensitivity, and specificity were used to evaluate diagnostic accuracy for COVID-19 pneumonia. All statistical computing was performed in R (version 3.4.3, Vienna, Austria). Based on the image features extracted by PyRadiomics and BigBiGAN, the linear classifier and Lasso were implemented with "lm()" and "glmnet()," respectively, for significant feature selection. Chi-square and ANOVA tests were used to evaluate differences in demographics. To determine interobserver variability of the radiomics features with respect to manual segmentation, an additional radiologist performed manual segmentation of lung lesions in 10 random patients; the Mann-Whitney U test was performed on the radiomics features extracted from the two sets of manual segmentations. P < 0.05 was considered significant.

Patient characteristics

The mean patient age was 45 years (standard deviation 15.6), with no significant difference between male (n = 128) and female (n = 88) patients (P > 0.05). The median time interval from symptom onset to CT was 8 days for both COVID-19-positive and -negative patients. The chief complaints were cough and fever, comprising 95.4%. Detailed demographics are shown in Table 1.

Chest CT dataset
CT scan details of the study population are shown in Supplementary Material Appendix A. Within the discovery cohort, ten COVID-19-positive and 12 COVID-19-negative patients who did not have visible lung lesions were excluded from segmentation. Two patients whose lesion segmentations were disputed among the radiologists were also excluded. This resulted in a total of 12,001 CT slices (7173 COVID-19 positive and 4828 COVID-19 negative) and 24,216 pneumonia lesion segmentations. All slices were randomly divided into training (9573 images), validation (1209 images), and test (1219 images) sets. The external validation set comprised 3799 images (2349 COVID-19 positive and 1450 COVID-19 negative) containing 7883 lesion segmentations from 17 COVID-19-positive and 15 COVID-19-negative pneumonia patients. No significant difference was found by the Mann-Whitney U test in the radiomics features extracted from the two sets of manually segmented pneumonia images of the 10 random patients (P > 0.05).

Image features and model performance
The AUCs (sensitivity and specificity) of the linear and Lasso classifiers using the 120-dimensional DL features, the 120-dimensional radiomics features, and the combined 240-dimensional DL and radiomics features are shown in Fig. 3. Individual DL performances using the whole-lung images as inputs are shown in Supplementary Figure S3. Although the AUCs (sensitivity and specificity) using the whole lung were higher than those using the pneumonia lesions on the training dataset, features extracted from the pneumonia lesions performed better on the external validation dataset.

Clinical use
The diagnostic sensitivity and specificity of the three radiologists on the two test datasets were 74.7% and 80.3%, respectively. With the prediction outputs from combined deep learning and radiomics features, their performance increased to a sensitivity and specificity of 91.2% and 91.9%, respectively, as shown in Fig. 5.
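The metrics reported throughout (sensitivity, specificity, and AUC) can be computed directly from labels, binary predictions, and model scores. A minimal self-contained sketch, using the pairwise-comparison form of AUC (equivalent to the normalized Mann-Whitney U statistic, the same statistic used above for interobserver comparison):

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

def auc(y_true, scores):
    """AUC as the probability that a randomly chosen positive case
    scores higher than a randomly chosen negative case (ties count
    half): the Mann-Whitney U statistic divided by n_pos * n_neg."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise form is O(n_pos * n_neg) and fine for cohort-sized data; rank-based implementations scale better for very large sets.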

Discussion
In this study, we uncover some of the deep learning and radiomics features that contribute to differentiation of COVID-19 from non-COVID-19 viral pneumonia. Features extracted from both deep learning and radiomics showed similar performance with linear and Lasso classifiers, with sensitivity > 73% and specificity > 75% on the external cohort. DL features extracted from pneumonia lesions outperformed those extracted from the whole lung on the external validation dataset. Prediction outputs generated from our combined deep learning-radiomics model further augmented human expert performance.
To our knowledge, this is the first study to compare the performance of DL versus radiomics models for differentiation of COVID-19 pneumonia. Various studies have described the performance of DL models for COVID-19 pneumonia [20,21]. However, the specific image features relevant to COVID-19 classification remain opaque. While it is well known that with large datasets, DL models perform superior to hand-crafted feature extraction [22,23], large datasets are not always attainable in medicine and may be limited by disease prevalence, obstacles to data procurement, and other clinical factors. For smaller datasets, studies have suggested feature engineering may be a more suitable machine learning strategy, with notable advantages of radiomics for medical imaging analysis [15,24,25]. At present, studies that directly compare radiomics and deep learning clinical model performance remain scarce [26]. In this study, we address these questions and further aim to enhance the interpretability of such machine learning models.
Recent studies have shown that image features learned from a BigBiGAN framework can achieve state-of-the-art performance for image classification [11,12,27]. Unlike traditional generative adversarial networks, typically used for image synthesis, de-noising, or generation of high-quality images, BigBiGAN has shown robust performance for learning high-dimensional semantic features. We identified 32 deep learning features that differed significantly between COVID-19-positive and -negative lesion images (P < 0.0001). Using PyRadiomics analysis, four radiomics features were selected by the two classifiers (P < 0.0001) to differentiate COVID-19 from other types of viral pneumonia. When we combined both approaches, 6 deep learning features and 5 radiomics features were selected (P < 0.0001) by the two classifiers. These results might suggest that more distinguishing features were learned by the neural network. Although ROC analysis might suggest slightly more robust performance for the radiomics models, sensitivity and specificity did not differ among deep learning, radiomics, and combined features. Among the four significant radiomics features, we found the mean intensity of COVID-19 lesions (− 665.9) to be higher than that of non-COVID-19 lesions (− 887.0), which might reflect more diffuse opacities or a greater degree of fluid or debris affecting the airspaces. Based on NGTDM busyness, which measures the rate of change in intensity between a pixel and its neighborhood, we also found less intense change between adjacent pixels for COVID-19 (0.39) compared to non-COVID-19 (0.72), which might indicate that, within an affected lung region, there is a more diffuse airspace process sparing fewer of the neighboring alveoli in COVID-19 than in other types of viral pneumonia. This might also explain why root mean squared (RMS) intensity, which measures the magnitude of the image values by calculating the contribution of each (absolute) gray value, was higher for non-COVID-19 (660.8 vs. 563.1), where a greater number of spared, i.e., air-filled, alveoli, rather than alveoli fluid-filled by disease, could contribute to a wider magnitude of absolute gray values in the pixels. Uniformity was lower for COVID-19 (0.03) than for non-COVID-19 (0.06) lesions, with a larger range of irregular texture for COVID-19, suggesting a more heterogeneous lung texture possibly due to diversity in airspace disease phenotypes (consolidation, ground-glass opacities, etc.) that combine varying degrees of edema and vascular and interlobular septal thickening, sometimes described as "crazy paving" on visual inspection [28].
Although it is difficult to directly map image phenotype from DL features alone, signatures from combined DL and radiomics features provide some clues to image-based discrimination for COVID-19 versus non-COVID-19 lung disease. COVID-19 patients showed higher signature scores for irregular intensity changes, heterogeneous intensities, and wider range in textures within the lung lesions compared with non-COVID-19 patients. For example, despite a large area of mixed opacity that might raise suspicion for COVID-19 on visual inspection (Fig. 4b(2)), feature extraction revealed intensity changes that were relatively regular within the lesion, a pattern that was associated with non-COVID-19. In another example, although multiple, nodular opacities in peripheral consolidative pattern on CT might raise suspicion for COVID-19 ( Fig. 4b(3)), strong uniformity of CT intensity values within the lesions suggested a non-COVID-19 process. This was supported by the linear signature score (0.07) that was consistent with non-COVID-19.
When predictions from the combined deep learning-radiomics model were available, we observed improved radiologist diagnostic performance, with increases in sensitivity and specificity of 16.5% and 11.6%, respectively, suggesting a potential role for machine learning in augmenting clinician decision support.
There are several limitations to this study. While we used the high-dimensional semantic features from the encoder module of the BigBiGAN framework, we did not examine other features that could be produced within the framework or by a different architecture. Since there is no specific definition of these deep learning semantic features, in future work we will explore the image encoding processes used to generate each of the deep learning semantic features to further enhance their interpretability. While deep learning and radiomics approaches performed comparably with a training cohort of around 180 patients and 9500 images, a larger dataset could show more robust performance for deep learning. Finally, while we used the features extracted when the loss was minimum in the last training epoch, these "deep learning features" might vary with training parameters or other experimental settings.
In conclusion, we uncover specific deep learning and radiomics features relevant to COVID-19 pneumonia to assist interpretability of machine learning algorithms and contribute to understanding of COVID-19 pneumonia imaging phenotypes. Furthermore, we compare performance of deep learning and radiomics models for COVID-19 pneumonia diagnosis using chest CT and show potential for augmenting radiologist diagnostic performance with the aid of machine learning predictions.

Compliance with ethical standards
Conflict of interest The authors declare that they have no conflicts of interest.
Ethics approval All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards. The study was approved by the institutional review board of University of Science and Technology of China and Stanford University.
Code availability The code of this study is publicly accessible at https://github.com/MI-12/Comparison-of-AI-Semantic-features-and-radiomics-features. The source code and instructions on the use of BigBiGAN, radiomics, and the classifiers can be found at the same repository: https://github.com/MI-12/Comparison-of-AI-Semantic-features-and-radiomics-features.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.