Background

Since December 2019, the outbreak of a new coronavirus disease, named coronavirus disease 2019 (COVID-19), has rapidly spread across China and to other countries around the globe [1,2,3,4]. As of 19 July 2020, 14,043,176 cases of COVID-19 with 597,583 deaths have been reported [5]. Since the World Health Organization declared the COVID-19 outbreak a public health emergency of international concern, and later a pandemic, countries around the globe have heightened their surveillance to quickly diagnose potential new cases of COVID-19. As the outbreak grows, the early diagnosis of patients is crucial for the prompt and effective prevention and control of COVID-19. Presently, nucleic acid testing is generally considered the diagnostic ground truth. However, the stringent transportation and storage requirements of COVID-19 nucleic acid kits may pose an insurmountable challenge for many existing transportation and hospital facilities in crisis. Moreover, the methodology, the stage of disease development and the method of sample collection can affect the result of nucleic acid testing [6]. Reverse transcription polymerase chain reaction (RT-PCR) can be used to identify COVID-19, but it can hardly indicate the severity of the disease or predict whether a patient should be transferred to the ICU or will soon need a ventilator. These factors prolong the time needed to control the spread of COVID-19 and increase the recovery time of patients.

Chest CT, especially high-resolution CT, is an important tool for detecting the lung changes of 2019 novel coronavirus pneumonia (NCP) and for evaluating the nature and extent of lesions. In a recent report, Ai et al. [7] investigated the diagnostic value and consistency of chest CT in comparison with the RT-PCR assay for COVID-19. Of 1014 patients, 59% had positive RT-PCR results, while 88% had positive chest CT scans, which suggests that chest CT has a high sensitivity for the diagnosis of COVID-19. Hence, chest CT may be treated as a primary tool to detect COVID-19 in epidemic areas. Other investigators focused on understanding the pathogenesis of the viral infection by observing the imaging patterns on chest CT. Bernheim et al. [8] characterized chest CT findings in 121 COVID-19-infected patients in relation to the time between symptom onset and the initial CT scan. Pan et al. [9] investigated lung abnormalities by observing the changes in the chest CT of patients from initial diagnosis to recovery. The lung abnormalities on chest CT showed the greatest severity approximately 10 days after the initial onset of symptoms. Most recent reports are concerned with the diagnosis of COVID-19 or with clinical observation during therapeutic treatment [10,11,12,13].

Although the clinical symptoms are mild and the prognosis is good for most COVID-19 patients, about 20% develop severe disease with pneumonia, pulmonary edema, septic shock, metabolic acidosis, acute respiratory distress syndrome or even death [14]. Therefore, timely diagnosis and accurate assessment, followed by symptomatic treatment, are key to improving the prognosis and reducing mortality.

Convolutional neural networks (CNNs) have proved to be powerful in data mining, image classification and detection, and computer vision. Many research groups have applied deep learning methods to COVID-19 computer-aided diagnosis [15,16,17]. To the best of our knowledge, however, few studies have focused on identifying the severity of infected patients, although this identification is a crucial criterion for developing a proper therapeutic treatment strategy.

Therefore, developing a rapid, accurate and automatic tool for COVID-19 severity screening is an urgent and essential task that could help physicians anticipate the need for ICU admission. To achieve an accurate and efficient COVID-19 severity diagnosis, in this study we classified features obtained from pre-trained CNNs such as Inception v3 [18], ResNet [19] and DenseNet [20] to identify the severity of COVID-19 patients. Given the crisis in hospitals and the challenges of training a network from scratch (e.g., the need for a large dataset), we find this approach to be more practical and reliable.

Results

In this section, three experiments are performed to validate the feasibility of the proposed method: holdout validation, k-fold cross-validation and leave-one-out validation. All experiments were implemented in Matlab 2019b on an Intel Xeon Gold 6252 CPU @ 2.1 GHz with 16 GB RAM, and five classifiers were trained: linear discriminant, linear SVM (support vector machine), cubic SVM, K-nearest neighbor (KNN) and AdaBoost decision trees. We used the default parameters for these classifiers, and no extra optimization was performed. For the holdout validation experiment, 80% of the deep features were randomly selected as the training set and the remaining 20% were used for testing. Figure 1 reports the results of holdout validation in terms of accuracy, AUC, sensitivity and specificity. The linear discriminant (purple square) does not achieve good accuracy or AUC, while the AdaBoost decision trees (black diamond) and the linear SVM (green circle) perform worse than the other three classifiers in sensitivity and specificity. Among the five classifiers, the cubic SVM (red star) performs best in all cases with respect to accuracy, AUC, sensitivity and specificity. All deep models combined with the cubic SVM classifier achieve favorable results, and the DenseNet-201 model yields the best results in most cases.

Fig. 1

The performance of classified deep features based on holdout validation: a The accuracy performance; b The AUC performance; c The sensitivity performance; d The specificity performance
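As an illustration of the holdout protocol and of the four reported metrics, the following is a minimal Python sketch. It is not the Matlab code used in this study; the function names `holdout_split`, `binary_metrics` and `auc`, the fixed random seed and the label convention (1 for severe, 0 for non-severe) are assumptions made for the example.

```python
import random


def holdout_split(n_samples, train_fraction=0.8, seed=0):
    """Randomly split sample indices into a training list and a test list."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seed is an illustrative choice
    n_train = int(round(train_fraction * n_samples))
    return indices[:n_train], indices[n_train:]


def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity (assumed labels: 1 severe, 0 non-severe)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


def auc(y_true, scores):
    """AUC as the fraction of (severe, non-severe) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

With the 729 feature vectors of this dataset, `holdout_split(729)` returns 583 training and 146 test indices. The pairwise AUC is quadratic in the number of samples, which is acceptable at this dataset size.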

In another series of experiments, tenfold cross-validation was performed to validate the severity classification performance of four deep learning models. The deep features were split into ten folds. For each fold, the other nine folds were used to train the classifier and the held-out fold was used to assess the trained classifier. The average test error over all folds was taken as the final result. The performance of the cubic SVM with tenfold cross-validation is reported in Table 1. All four deep learning models identify severe and non-severe COVID-19 cases with an accuracy above 91.9%. Among the four topologies, DenseNet-201, which is believed to be more representative and semantically correct in extracting features, yields the best result with an accuracy of 95.2% and an AUC of 0.99. ResNet-101, which contains more layers, outperforms ResNet-50 in both accuracy and AUC. This may be because deeper layers have a better capacity to represent subtle changes such as ground-glass opacities in chest CT. In addition, DenseNet-201 also achieves the best sensitivity and specificity: compared with Inception-V3, sensitivity increases by about 7% and specificity improves from 95.84% to 96.87%. Sensitivity, also known as the "true positive" rate, is the proportion of actual positive cases that are correctly identified, and it may be more important than specificity in disease diagnosis under epidemic conditions. Thus, DenseNet-201 is preferable to the other three architectures for severity identification of COVID-19 in CT scans.

Table 1 Performance of different deep learning models with cubic SVM based on tenfold cross-validation
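The tenfold procedure described above can be sketched in a few lines of Python. This is an illustrative outline under stated assumptions, not the Matlab implementation used in the experiments: `kfold_indices` and `cross_validate` are hypothetical helper names, the classifier is passed in as two user-supplied callables (`fit` and `predict`), and the interleaved fold assignment is one of several reasonable choices.

```python
import random


def kfold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and deal them into k folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    return [indices[i::k] for i in range(k)]


def cross_validate(features, labels, fit, predict, k=10, seed=0):
    """Average test error over k folds.

    `fit(train_x, train_y)` returns a model and `predict(model, x)` returns a
    label; both are placeholders for whichever classifier is being assessed.
    """
    folds = kfold_indices(len(labels), k, seed)
    errors = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        model = fit([features[i] for i in train_idx],
                    [labels[i] for i in train_idx])
        wrong = sum(1 for i in test_idx if predict(model, features[i]) != labels[i])
        errors.append(wrong / len(test_idx))
    return sum(errors) / k
```

For the 729 samples of this dataset, the split produces nine folds of 73 samples and one fold of 72.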

To further investigate the performance, the leave-one-slice-out validation strategy was applied. In each round of this experiment, 728 of the 729 deep feature vectors were used to train the classifiers and the remaining sample was used for testing. This strategy is the logical extreme of the k-fold cross-validation method. It reduces the overall variability and bias relative to the validation-set method. The accuracy results of the leave-one-out validation strategy for the different deep learning methods and classifiers are shown in Tables 2 and 3. The DenseNet-201 features with the cubic SVM still perform best, with a classification accuracy of 95.34%.

Table 2 Classification accuracy performance of deep features based on leave-one-out strategy (%)
Table 3 Feature extraction time for feature extraction

To make the deep features of our pipeline more explainable, the attention maps from the last ‘pooling’ layer in DenseNet-201 are depicted in Fig. 2. These attention maps may show the discriminant 2D locations for the identification of COVID-19 severity, as produced by consecutive convolutional filtering and undersampling. The attention spots may or may not correspond to expert understanding. One way to improve the attention may be to restrict the filtering to the lungs by masking the CT scans through lung segmentation. In addition, we report the running time for deep feature extraction and prediction in Tables 4 and 5, respectively.

Fig. 2

Two sample attention maps from the last ‘pooling’ layer in DenseNet-201. Because the attention is generally rather non-exclusive, it may sometimes not contribute to human interpretation. Restricting deep feature learning or extraction to the lung regions is expected to improve the interpretability of the attention maps

Table 4 Feature extraction time for testing
Table 5 The clinical data analysis of COVID-19 confirmed patients

Discussion

The COVID-19 virus, first found in Wuhan, has spread across the globe, and the outbreak has been formally declared a pandemic by the World Health Organization. The absence of symptoms in the early stages of the disease and community transmission lead to the fast spread of the coronavirus (with an estimated reproduction number R0 of 2.2–6.4). Since the outbreak of COVID-19, nucleic acid testing has been treated as the ground truth for identifying the presence of the virus. However, some recent reports reveal that the accuracy of nucleic acid testing for COVID-19 is about 30–50% [6]. Hence, the Chinese government has changed the diagnostic protocol to use CT scans for the diagnosis of suspected cases [21]. Compared with conventional X-ray, CT scans allow radiologists to inspect internal structures in much more detail. Figure 3 shows the CT and DR images of a 76-year-old male COVID-19 patient with fever, cough and expectoration. It illustrates that multiple patchy regions with solid components in bilateral lung lobes are more easily detected in CT slices than in DR images. Thus, although no diagnostic test may provide complete certainty, and although this work is focused on severity identification, the CT scan seems to be an acceptable alternative diagnostic protocol for identifying COVID-19. A practical challenge that remains is that thorough disinfection of the CT machine is absolutely necessary after each scanning session.

Fig. 3

CT and DR images of a 76-year-old male with fever, cough and expectoration: a Chest CT scan. b–d Follow-up DR images

Deep learning, which has proven to be a powerful tool in medical image processing, has been employed for COVID-19 identification or diagnosis in recent reports. Some researchers used deep models to discriminate COVID-19 patients from bacterial pneumonia patients or healthy individuals. Xu et al. [15] developed a COVID-19 screening system that can identify COVID-19, Influenza-A viral pneumonia and healthy cases. A total of 618 CT samples were collected in the study, and the proposed deep model achieved 86.7% accuracy. Song et al. [13] proposed the DRENet architecture for COVID-19 screening, which achieved 96% accuracy and 0.99 AUC on 88 COVID-19 patients with 777 images and 86 healthy persons with 708 images. Li et al. [12] exploited ResNet50 to extract deep features to identify COVID-19 and other non-pneumonia cases. The dataset consisted of 4356 chest scans from 3322 patients, and the proposed method achieved 90% sensitivity and 96% specificity with an AUC of 0.96. Other researchers used deep models to segment or detect regions of interest. Fan et al. [15] developed a deep learning system for automatic segmentation and quantification of COVID-19-infected regions, where their proposed VB-Net achieved a dice similarity coefficient of 91.6% ± 10% between automatic and manual segmentations. Gozes et al. [16] proposed an AI-based automated CT image analysis tool for the detection and quantification of COVID-19. Zheng et al. [22] exploited a U-net to locate the infected region, the results of which were fed to a 3D deep neural network (DeCoVNet) to predict the probability of COVID-19 infection.

Although much effort has recently been devoted to the automatic identification of COVID-19, few studies have paid attention to automatic or semi-automatic severity assessment of COVID-19, which can track and measure the disease in a quantitative way. For instance, Tang [23] employed a random forest model to assess the severity of COVID-19 in CT images from 176 patients, achieving 87.5% accuracy and 0.91 AUC.

To identify the severity of COVID-19 rapidly and to provide an efficient and accurate prognosis to guide follow-up therapeutic treatment, the classification of deep features for severity identification was developed in this work. The experimental results demonstrate that the proposed approach discriminates well between severe and non-severe cases of COVID-19, with an accuracy of 95.2%, a sensitivity of 91.87% and a specificity of 96.87%. The proposed method can be applied for severity screening.

Although the proposed method shows promise, some limitations should be mentioned: (a) the CT data were collected from only three hospitals within one province; more varied samples at different disease stages and cases from other regions should be included in our future dataset; (b) the number of training samples was rather limited, especially for severe cases (which is why we abandoned the idea of training a network from scratch); and (c) only a few physicians were involved in labeling and identifying this dataset; the impact of inter-observer variability should be studied when a larger dataset is curated by more radiologists to represent the uncertainties of COVID-19 in CT scans more comprehensively.

Future research will focus on the following aspects: (a) volume CT scans will be explored to achieve a more reliable and accurate COVID-19 severity assessment by considering the overall evaluation from 3D CT data; (b) a pre-processing method will be introduced to locate or segment the region of interest to avoid the confusion caused by clothes or other artifacts; (c) the deep network will be applied to the observation of CT scans of COVID-19 patients in remission and recovery.

Conclusion

In summary, our study demonstrated the feasibility of classifying pre-trained deep features to assist physicians in identifying the severity of COVID-19. By achieving good performance in distinguishing severe from non-severe cases, the proposed pipeline may enable rapid identification and help physicians make more reliable decisions for treatment planning.

Methods

Patients

We collected the CT volume data of 202 COVID-19 patients from three hospitals in Anhui Province, China, captured from January 24 to February 12, 2020. These cases were provided by the First Affiliated Hospital of Bengbu Medical College, the First Affiliated Hospital of Anhui Medical University, and Fuyang Second People’s Hospital. All collected cases satisfied the following criteria: (a) the result of RT-PCR was positive for throat swab, and sputum or bronchoalveolar lavage (BAL) was confirmed; (b) thin-slice CT images were available; (c) the quality of the CT images was sufficient for radiological evaluation. Patients who met any of the following conditions were further marked as severely ill:

(a) shortness of breath with a respiratory rate of no less than 30 breaths/min;

(b) oxygen saturation of no more than 93% in a resting state;

(c) ratio of partial arterial oxygen pressure (PaO2) to fractional inspired oxygen concentration (FiO2) of no more than 300 mmHg;

(d) significant progression of pulmonary lesions (over 50%) within 24–48 h;

(e) respiratory failure with the requirement of mechanical ventilation;

(f) occurrence of shock;

(g) multiple organ failure.

The remaining patients were regarded as non-severely ill. Then, radiologists selected 729 axial slices from these 202 CT volumes to build the dataset. Finally, 41 severe cases with 246 axial slices and 161 non-severe cases with 483 axial slices were included in this dataset.
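The labeling rule above amounts to a logical OR over the seven conditions. A minimal Python sketch follows; the `Patient` record and its field names are hypothetical and are not part of the study data. Two readings are assumed for the example: criterion (a) is reduced to the respiratory rate threshold, and criterion (c) is read as a threshold on the PaO2/FiO2 ratio.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    """Hypothetical record; field names are illustrative only."""
    respiratory_rate: float        # breaths/min
    spo2_resting: float            # oxygen saturation at rest, in %
    pao2_fio2: float               # assumed PaO2/FiO2 ratio, in mmHg
    lesion_progression: float      # fractional lesion growth within 24-48 h
    mechanical_ventilation: bool   # respiratory failure requiring ventilation
    shock: bool
    multiple_organ_failure: bool


def is_severe(p):
    """Return True if the patient meets any of conditions (a)-(g)."""
    return (
        p.respiratory_rate >= 30        # (a)
        or p.spo2_resting <= 93         # (b)
        or p.pao2_fio2 <= 300           # (c)
        or p.lesion_progression > 0.5   # (d)
        or p.mechanical_ventilation     # (e)
        or p.shock                      # (f)
        or p.multiple_organ_failure     # (g)
    )
```

A patient for whom `is_severe` returns False corresponds to the non-severely ill group.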

This retrospective study (enrolled medical datasets) was approved by the ethics committees of the participating hospitals.

Clinical information

There were 110 males and 92 females (aged from 5 to 86 years) in the dataset, with an average age of 46.4 ± 15.5 years. A total of 92 cases had a history of travel to an epidemic area or of close contact with a COVID-19 patient. As many as 53 patients had an underlying coexisting illness, while no coexisting illness was reported for the remaining 149 cases. Most of the patients exhibited clinical symptoms or physical findings such as fever, cough, sputum production, sore throat, nausea or headache, myalgia or arthralgia and shortness of breath. Specifically, 14 patients had high fever and 169 patients had low fever. In most cases, abnormal biomedical indicators or laboratory findings were reported, such as abnormalities of white blood cells (WBCs), neutrophils and lymphocytes and increases in C-reactive protein (CRP) and nuclear cells. The details of the clinical characteristics are shown in Table 5.

Imaging protocol and analysis

In most COVID-19 cases, bilateral consolidation, ground-glass opacities and the crazy paving pattern can be found in the lungs. Limited or scattered nodular shadowing is observed in non-severe cases, while flaky or widespread lesions are observed in severe cases. Moreover, compared with non-severe cases, bronchial wall thickening, lymph node enlargement, pleural effusion, and the air bronchogram sign with thickened blood vessels are often observed in severe cases. Figure 4 shows some thumbnails of severe and non-severe COVID-19 chest CT scans. Figure 5 shows typical examples of severe and non-severe chest CT slices in different planes.

Fig. 4

Sample CT scans of COVID-19-infected patients: a non-severe cases; b severe cases

Fig. 5

Typical examples of severe and non-severe chest CT slices in axial, sagittal and coronal views: a non-severe cases; b severe cases

Feature extraction from the pre-trained deep learning models

Training deep networks from scratch would need a large and well-curated dataset, and the fine-tuning strategy has an advantage over the off-the-shelf strategy only when scan labels are plentiful. In this work, with a limited dataset and few labels, the off-the-shelf strategy was therefore exploited. Specifically, pre-trained deep models such as Inception [18], ResNet [19] and DenseNet [20], trained on the ImageNet dataset (about 1.28 million training images in 1000 classes), were exploited. The image transformation in each successive layer of the DenseNet-201 network is shown in Table 6.

Table 6 The DenseNet-201 architectures

The input image of dimension 224 × 224 × 3 is fed to the first convolutional layer, which consists of 64 kernels of size 7 × 7 with stride 2 and padding 3. The stride is the number of pixels by which the filter shifts over the image matrix. Convolving the image with the 64 kernels yields an output of size 112 × 112 × 64.
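The output size of the first convolutional layer can be checked with the standard relation for strided, padded convolution, floor((n + 2p - k) / s) + 1, where n is the input size, k the kernel size, s the stride and p the padding. The short Python sketch below is only a worked check; the helper name `conv_output_size` is an assumption made for the example.

```python
def conv_output_size(input_size, kernel_size, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


# First layer of DenseNet-201 as described above: 224-pixel input,
# 7 x 7 kernels, stride 2, padding 3, 64 kernels.
side = conv_output_size(224, kernel_size=7, stride=2, padding=3)
output_shape = (side, side, 64)
```

Here (224 + 6 - 7) // 2 + 1 gives 112, so `output_shape` is (112, 112, 64), in agreement with the text.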

The final fully connected layer (FC-1000) has a dimension of 1000. In the proposed method, the FC-1000 layer features from DenseNet-201 are extracted and fed into the various classifiers for the classification task. The output of a single convolutional layer is given by Eq. (1).

$$h_{i,j} = f\left( \sum_{k}^{P} \sum_{l}^{Q} w_{k,l}\, x_{i+k,\,j+l} + b_{k,l} \right), \tag{1}$$

where h represents the neuron output, x denotes the input, w represents the weight, b is the bias parameter and f is the activation function. Here P and Q represent the size of the weight kernel, k and l are the weight indices, and i and j are the input indices. Each convolutional layer is followed by rectified linear unit (ReLU) activation, normalization and max pooling operations; ReLU is the activation function used in DenseNet-201.
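To make Eq. (1) concrete, the following pure-Python sketch evaluates it for a single-channel input and a single kernel. It is an illustration under stated assumptions rather than the DenseNet-201 implementation: indices are zero-based, no padding or stride is applied, the bias is treated as one scalar added after the double sum, and the function names `relu` and `conv2d_single` are chosen for the example.

```python
def relu(z):
    """Rectified linear unit, used here as the activation f."""
    return max(0.0, z)


def conv2d_single(x, w, b=0.0, activation=relu):
    """Evaluate Eq. (1) at every valid position (i, j) of a 2D input.

    x is a list of rows, w is a P x Q kernel given as a list of rows, and
    b is a scalar bias (an assumption made for this sketch).
    """
    P, Q = len(w), len(w[0])
    rows = len(x) - P + 1
    cols = len(x[0]) - Q + 1
    h = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            # Double sum over the kernel indices k and l, as in Eq. (1).
            s = sum(w[k][l] * x[i + k][j + l]
                    for k in range(P) for l in range(Q))
            h[i][j] = activation(s + b)
    return h
```

For example, with the 3 × 3 input `[[1, 2, 3], [4, 5, 6], [7, 8, 9]]` and the 2 × 2 kernel `[[-1, 0], [0, 1]]`, every valid position sums to 4, so the output is a 2 × 2 map of 4.0; a bias of -5.0 pushes the sum below zero and ReLU clips it to 0.0.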

Figure 6 illustrates the details and the pipeline of the proposed method. During the training step, the pre-trained deep model (DenseNet-201) was employed to extract the deep features from COVID-19 CT scans. Subsequently, a binary SVM classifier with a cubic kernel was trained to distinguish severe from non-severe cases. In the testing step, an unseen COVID-19 scan was input, and its severity was predicted from its deep feature by the trained classifier.

Fig. 6

The pipeline of the proposed method
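The final stage of the pipeline, a binary SVM with a cubic kernel applied to a deep feature vector, can be sketched as follows. This is a hedged illustration and not the Matlab classifier trained in this study: the kernel is assumed to be a third-order polynomial of the form (gamma · ⟨u, v⟩ + c)^3 with placeholder values for gamma and c, and the support vectors, dual coefficients and bias passed to the functions are hypothetical stand-ins for the quantities learned during training.

```python
def cubic_kernel(u, v, gamma=1.0, coef0=1.0):
    """Third-order polynomial kernel; gamma and coef0 are assumed values."""
    dot = sum(a * b for a, b in zip(u, v))
    return (gamma * dot + coef0) ** 3


def svm_decision(x, support_vectors, dual_coefs, bias):
    """Signed score of a trained kernel SVM.

    dual_coefs[i] stands for alpha_i * y_i of support vector i.
    """
    return sum(c * cubic_kernel(s, x)
               for s, c in zip(support_vectors, dual_coefs)) + bias


def predict_severity(x, support_vectors, dual_coefs, bias):
    """Map the sign of the score to the two classes of this work."""
    score = svm_decision(x, support_vectors, dual_coefs, bias)
    return "severe" if score > 0 else "non-severe"
```

In practice `x` would be the 1000-dimensional FC-1000 feature of a CT slice; the two-dimensional toy vectors used to exercise this sketch are only for readability.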