1 Introduction

The novel coronavirus disease (COVID-19) is an infectious disease that has spread rapidly to most countries worldwide. To date (May 27, 2020), over 5,543,761 confirmed cases have been reported, and the disease has caused over 150,159 deaths. This situation has raised concerns about a serious public health emergency among many health organizations. The current test and diagnosis methods available for COVID-19 are based primarily on reverse transcription polymerase chain reaction (RT-PCR) [1, 2]; consequently, obtaining test and diagnosis results requires a minimum of 4–6 h. Considering the rate at which COVID-19 spreads, 4–6 h is a long time to wait for results. In addition, a shortage of RT-PCR test kits has been a huge challenge for many countries.

An alternative solution for clinical assessment of COVID-19 is to adopt computed tomography (CT) or X-ray scans, which are both faster and cheaper than RT-PCR. In addition, CT images are readily available and show an accuracy similar to that of RT-PCR. Thus, CT medical imaging could play a vital role in assessing patients who are positive for COVID-19. Several works [3,4,5] have demonstrated promising results regarding the effectiveness of using CT scans for COVID-19. Moreover, there is a high demand for rapid severity assessments of COVID-19 using CT scans during the early stage of the disease because such assessments are beneficial for determining follow-up treatments. Generally, clinical severity assessments for patients include two categories, slight and severe, which lead to different treatment plans. Figure 1 shows CT image examples from three cases: one assessed as normal, one as slight COVID-19 and one as severe COVID-19. Compared with the normal case, the severe case presents more obvious abnormal lesion areas on a larger scale. The symptoms of these three patients are given in Table 1. Shi et al. [6] conducted descriptive studies on 81 patients with COVID-19 in Wuhan, China. Pan et al. [7] reported on lung changes over time from chest CT images during COVID-19 recovery. Based on these radiological findings, chest CT images and clinical symptoms, including fever, cough and dyspnoea, of patients with COVID-19 can be beneficial for clinical assessments. Zhao et al. [8] compiled a COVID-CT dataset and proposed a deep learning method to predict whether a patient is affected by COVID-19. Although this method achieves promising performance on this open public dataset, it cannot predict specific severity assessments. Currently, many AI (artificial intelligence)-assisted diagnosis works have been proposed that focus on the classification task of determining whether a patient has been infected with COVID-19.
However, most of these networks are designed to perform classification based on a single CT image, and they ignore prior information, such as a patient’s clinical symptoms. As shown in Fig. 2a, the workflows of these networks typically consist of feature extraction from a single CT image followed by classification prediction for disease assessment. However, more specific diagnosis of clinical severity, including slight and severe assessments, is worthy of attention and is beneficial for determining follow-up treatments.

Fig. 1 CT image examples of three patients with COVID-19. The images in the first row are from a normal patient; the images in the second row are from a patient classified as slight; and the images in the third row are from a patient classified as severe. The red arrows denote several potential abnormal lesion areas

Fig. 2 The workflows of classification frameworks based on (a) a single CT image and (b) our proposed network (FaNet)

Table 1 Symptoms of three patient cases in Fig. 1

In this paper, we explore a fast assessment method for both diagnosis and severity assessments based on 3D CT imaging and clinical symptoms. The workflow is shown in Fig. 2b. First, to consider additional spatial tomography information, we apply 3D CT image sequences instead of single CT images. This may make the operation more convenient for radiologists, who can simply input continuous chest CT image sequences from patients without having to select a specific single CT image in advance. Second, the clinical symptoms of patients can function as additional prior information to improve assessment performance. Such clinical symptoms are easily and rapidly accessible compared with laboratory results. By employing both 3D CT image sequences and clinical symptoms, we are able to achieve rapid diagnosis and severity assessment in the early stage of COVID-19, which can act as a point of reference to assist radiologists in determining follow-up treatments.

The remainder of this paper is organized as follows. Related works are reviewed in Section 2. We describe our method in Section 3: first, we describe the details of 3D CT image sequences and clinical symptoms, and then we present the network workflow details, where four modules are introduced. In Section 4, we describe the COVID-19 datasets and the network training implementation and present the experimental results. In Section 5, we highlight several discussion points regarding our work and future research directions. Finally, we conclude the paper in Section 6.

2 Related works

Because of their fast data acquisition, X-ray and CT scans are widely applied to acquire imaging evidence for radiologists. Because COVID-19 shows quite similar manifestations in images, AI-assisted diagnosis is in high demand in medical imaging. In this section, we introduce several works that use X-ray and CT scans to perform COVID-19 diagnosis.

2.1 X-ray-based screening of COVID-19

Although the sensitivity of X-ray images is lower than that of CT images for chest sites, X-ray is usually employed as the typical first-line imaging modality. On a chest X-ray, patients infected with COVID-19 frequently exhibit bilateral low-zone consolidation that peaks 10–12 days from symptom onset [9]. Several approaches to COVID-19 classification have been explored. Narin et al. [10] proposed three different classical CNN-based models, ResNet50 [11], InceptionV3 [12] and InceptionResNetV2 [13], to predict COVID-19-infected patients from chest X-ray radiographs. Apostolopoulos et al. [14] introduced transfer learning [15] with CNNs for automatic detection of COVID-19 using a collection of 1,427 X-ray images. Zhang et al. [16] studied a deep learning method with anomaly detection using 14 public chest X-ray datasets and achieved an accuracy of 96% for COVID-19 cases. Hemdan et al. [17] compared the classification performances of 7 deep models, including VGG19 [18], DenseNet201 [19], ResNetV2, InceptionV3, InceptionResNetV2, Xception [20] and MobileNetV2 [21], and found that VGG19 and DenseNet201 achieved better accuracy, while MobileNetV2 obtained the fastest testing time. Sethy et al. [22] conducted statistical analysis and showed that ResNet50 plus SVM achieved a superior classification performance; their best model achieved an accuracy of 95.3% for COVID-19 detection. Zhang et al. [16] developed detection models using 1,531 X-ray images that detected COVID-19 cases with 96% accuracy and non-COVID-19 cases with an accuracy of 70.65%.

2.2 CT-based screening of COVID-19

Because characteristic CT imaging patterns are employed in the diagnosis of COVID-19 patients in Hubei, China, CT findings have gradually become recommended because they supply major evidence for early diagnosis and for assessing the course of the disease [23]. Li et al. [24] proposed CT visual quantitative analysis to explore the latent relationship between COVID-19 classification and imaging manifestations. Wang et al. [5] proposed an Inception migration-learning method and achieved an overall accuracy of 73.1% on the testing dataset. Song et al. [25] developed a deep learning-based CT diagnosis system to identify patients infected with COVID-19 that achieved an excellent AUC of 0.99. Barstugan et al. [26] applied support vector machines (SVMs) [27] and several feature extraction methods, including the grey level co-occurrence matrix (GLCM) [28, 29], local directional pattern (LDP) [30,31,32], grey level run length matrix (GLRLM) [33, 34], grey level size zone matrix (GLSZM) [35], and discrete wavelet transform (DWT) [36] algorithms, and obtained a classification accuracy of 99.68%.

3 Methods

In this section, we first introduce 3D CT image sequences and clinical symptoms. Then, we provide an overview of the model framework of FaNet. Finally, the parameter selection details are elaborated.

3.1 3D CT image sequences and clinical symptoms

We applied both 3D CT image sequences and clinical symptoms from COVID-19 patients to obtain fast clinical assessments. Instead of selecting single CT images for prediction, we considered that image sequences could provide more spatial information. In addition, in practice, this approach allows radiologists to process the CT scan directly without having to select a specific single image. The use of CT image sequences can also improve fault tolerance for clinical assessment. Similar to video classification [37, 38] or action recognition [39,40,41], we output specific classes based on the image sequences. In terms of clinical symptoms, several works [42, 43] have shown that latent relationships exist between the ultimate clinical assessment and COVID-19 patient symptoms. Moreover, clinical symptoms are easily accessible, which helps in obtaining a fast clinical assessment when they are treated as auxiliary information for CT image sequences. Figure 3 shows the statistical distribution of clinical symptoms for clinical severity assessment based on 209 COVID-19 patients (194 slight cases and 15 severe cases). Based on observation, cough and severe assessment are closely related. Some of these patients showed no obvious symptoms. For these asymptomatic patients, the CT image sequence could be the critical discriminatory information. By introducing information from both CT images and symptoms, we are able to perform a fast clinical assessment for COVID-19.

Fig. 3 The statistical distribution of clinical symptoms (fever, cough, muscle ache, fatigue, headache, nausea, diarrhea, stomachache and dyspnea) for clinical severity assessments, based on confirmed COVID-19 cases

A CT image sequence with a data shape of h × w × c × k first needs to be processed into a 3D matrix with a data shape of h × w × k, where h, w and c represent the height, width and number of channels of a CT image, respectively, and k represents the length of the image sequence. Because a CT image has a single channel (c = 1), this step amounts to dropping the channel dimension. The sequence length k then plays the role that the channel count plays for natural images, so the task can be treated as a problem similar to image classification [11, 18, 19, 44,45,46].
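As a minimal sketch of this reshaping step, the NumPy snippet below drops the singleton channel axis of a sequence stored as h × w × c × k. The array names and the small spatial size are illustrative assumptions chosen to keep the example light; they are not taken from our implementation.

```python
import numpy as np

# Illustrative sizes (assumption): small h and w keep the example light.
h, w, c, k = 64, 64, 1, 160

# A CT sequence stored as height x width x channels x slices.
sequence = np.zeros((h, w, c, k), dtype=np.float32)

# c == 1, so removing the channel axis loses no information.
volume = np.squeeze(sequence, axis=2)

print(volume.shape)  # (64, 64, 160)
```

After this step, the k slices can be fed to 2D convolutions as if they were the channels of a single image.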

For the clinical symptoms, we adopted 11 symptoms to act as prior information, including fever, cough, muscle ache, fatigue, headache, nausea, diarrhoea, stomachache and dyspnea. If a patient has a given symptom, that symptom is encoded as 1; otherwise, it is set to 0. In addition, we adopted patient gender and age as extra information alongside the symptoms. We then converted this prior information into a vector for each COVID-19 patient.
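The encoding can be illustrated with a short sketch. It uses only the nine symptoms named above, so the resulting vector has 11 entries once gender and age are appended. The ordering of the entries, the gender coding and the age scaling are assumptions made for illustration, and `encode_patient` is a hypothetical helper name.

```python
# Symptom names as listed in the text; their order in the vector is an assumption.
SYMPTOMS = [
    "fever", "cough", "muscle ache", "fatigue", "headache",
    "nausea", "diarrhoea", "stomachache", "dyspnea",
]


def encode_patient(present_symptoms, gender, age):
    """Return a flat prior-information vector for one patient."""
    present = set(present_symptoms)
    unknown = present - set(SYMPTOMS)
    if unknown:
        raise ValueError(f"unknown symptoms: {sorted(unknown)}")
    # 1.0 if the patient has the symptom, 0.0 otherwise.
    vector = [1.0 if name in present else 0.0 for name in SYMPTOMS]
    vector.append(1.0 if gender == "male" else 0.0)  # hypothetical gender coding
    vector.append(age / 100.0)                       # hypothetical age scaling
    return vector


example = encode_patient(["fever", "cough"], gender="female", age=45)
print(len(example), example)
```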

3.2 Fast assessment network (FaNet)

To enable fast and accurate diagnosis and severity assessments for COVID-19, we designed our network based on both CT image sequences x and clinical symptoms y from patients. As shown in Fig. 2b, the model framework consists of 4 modules: encoding for symptoms, feature extraction from CT image sequences, fusion, and prediction. The symptom encoding and feature extraction modules process two feature streams: the clinical symptoms and the CT image sequences, respectively. Next, the fusion module fuses the outputs of these two streams into a single fused feature. Finally, the prediction module predicts the clinical assessment based on the fused feature. More framework details are given in Fig. 4. Note that our network can conduct two tasks: diagnosis and severity assessment. The entire framework can be formulated as follows:

$$ \text{Task 1}: \quad G(x,y) = \left\{ \begin{array}{ll} 0, & \text{Non-COVID-19} \\ 1, & \text{COVID-19} \end{array} \right. $$
(1)

where G(⋅) represents our network. “Task 1” is the diagnosis assessment for COVID-19. Similar to “Task 1”, the severity assessment of COVID-19 is represented as “Task 2” as follows:

$$ \text{Task 2}: \quad G(x,y) = \left\{ \begin{array}{ll} 0, & \text{Normal} \\ 1, & \text{Slight} \\ 2, & \text{Severe} \end{array} \right. $$
(2)

where G(⋅) represents our network.

Fig. 4 Network structure details of the proposed FaNet. “×” denotes the element-wise product operation and “ + ” denotes the element-wise add operation. Symptoms, as well as the genders and ages of patients, are shared during feature extraction as prior information. Channel average pooling and max pooling are used to shrink the multi-dimensional tensor into a vector, which replaces fully connected layers in predicting the diagnosis and severity assessments

3.2.1 Symptom-fused channel attention modules

To fuse the symptom prior information, we inject the shared symptom information during feature extraction. Inspired by channel attention [47, 48] and self-attention [49, 50], we design the symptom-fused channel attention module (SCAM). First, two convolution layers in SCAM are employed to extract the shallow feature Fs. Max pooling halves the image size, and channel average pooling is used to obtain the channel-wise feature Fc.

$$ F_{c} = H_{cap}(F_{s}) $$
(3)

where Hcap denotes the channel average pooling operation. For the shared symptoms, a convolution layer with filter size 1 × 1 is used to expand the symptom vector so that the expanded vector Fe has the same length as Fc.

$$ F_{e} = H_{e}(y) $$
(4)

where He denotes the convolution operation and y denotes the original symptom vector. The channel attention FCA for Fs can be formulated as follows:

$$ F_{CA} = Sigmoid(H_{ca}(Sigmoid(F_{e}) * F_{c})) $$
(5)

where Hca denotes the convolution layer with filter size 1 × 1 and Sigmoid(⋅) denotes the activation function that maps values into (0, 1). “*” denotes the element-wise product operation. To avoid information loss, a skip connection is utilized. Finally, the output of the i-th SCAM \(M^{i}_{out}\) is formulated as follows:

$$ M^{i}_{out} = M^{i}_{in} + F_{s} * F_{CA} $$
(6)

where \(M^{i}_{in}\) denotes the input of the i-th SCAM and “*” denotes the element-wise product operation.
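Equations (3) to (6) can be traced in a few lines of NumPy. The sketch below is illustrative only and rests on several assumptions: the shallow feature Fs is taken as a precomputed input with the same shape as the module input (the two convolutions and the max pooling that produce it are omitted), the 1 × 1 convolutions are written as plain matrix products without bias, and all weights are random placeholders. The function name `scam_attention` is hypothetical.

```python
import numpy as np


def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))


def scam_attention(m_in, f_s, y, w_e, w_ca):
    """Apply Eqs. (3)-(6) for one SCAM.

    m_in, f_s: arrays of shape (C, H, W); y: symptom vector of shape (S,);
    w_e: (C, S) weights standing in for H_e; w_ca: (C, C) weights for H_ca.
    """
    f_c = f_s.mean(axis=(1, 2))                  # Eq. (3): channel average pooling
    f_e = w_e @ y                                # Eq. (4): expand symptoms to length C
    f_ca = sigmoid(w_ca @ (sigmoid(f_e) * f_c))  # Eq. (5): channel attention
    return m_in + f_s * f_ca[:, None, None]      # Eq. (6): skip connection


rng = np.random.default_rng(0)
C, H, W, S = 32, 8, 8, 11  # illustrative sizes (assumption)
m_in = rng.standard_normal((C, H, W))
f_s = rng.standard_normal((C, H, W))
y = rng.integers(0, 2, size=S).astype(float)
m_out = scam_attention(m_in, f_s, y,
                       rng.standard_normal((C, S)),
                       rng.standard_normal((C, C)))
print(m_out.shape)  # (32, 8, 8)
```

Because each attention weight lies in (0, 1), the module can only attenuate each channel of Fs before adding it back to the input; the symptom vector decides how strongly each channel is passed through.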

3.2.2 Prediction module

In our prediction module, channel pooling operations replace fully connected layers to reduce the number of parameters. For better performance, both channel average pooling and channel max pooling are employed. The resulting channel-wise features are combined by concatenation. Then, a convolution squeezes the combined channel-wise features to half their length. Finally, one convolution is used to predict the diagnosis assessment and another to predict the severity assessment. This prediction process can be formulated as follows:

$$ F_{SK} = H_{SK}(Concatenation(H_{cap}(M^{n}), H_{cmp}(M^{n}))) $$
(7)

where FSK denotes the shrunken channel-wise features, HSK denotes the convolution that shrinks the combined channel-wise features, and Hcap and Hcmp denote the channel average pooling and channel max pooling, respectively. Mn denotes the output of the last SCAM. Based on FSK, the prediction outcome of the diagnosis assessment z1 is formulated as

$$ z_{1} = H_{d}(F_{SK}) $$
(8)

where Hd denotes the convolution layer with filter size 1 × 1. Similar to z1, the prediction output of severity assessment z2 is formulated as:

$$ z_{2} = H_{s}(F_{SK}) $$
(9)

where Hs denotes the convolution layer with filter size 1 × 1.
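The prediction path of Eqs. (7) to (9) can be sketched in the same spirit. In the NumPy example below, the 1 × 1 convolutions are written as bias-free matrix products with random placeholder weights, and taking the arg max of z1 and z2 as the predicted class is an assumption about how the outputs map to the labels of Eqs. (1) and (2). The function name `predict` is hypothetical.

```python
import numpy as np

DIAGNOSIS = ("Non-COVID-19", "COVID-19")    # labels of Eq. (1)
SEVERITY = ("Normal", "Slight", "Severe")   # labels of Eq. (2)


def predict(m_n, w_sk, w_d, w_s):
    """m_n: (C, H, W); w_sk: (C, 2C); w_d: (2, C); w_s: (3, C)."""
    pooled = np.concatenate([
        m_n.mean(axis=(1, 2)),  # channel average pooling, H_cap
        m_n.max(axis=(1, 2)),   # channel max pooling, H_cmp
    ])                          # length 2C
    f_sk = w_sk @ pooled        # Eq. (7): squeeze 2C -> C
    z1 = w_d @ f_sk             # Eq. (8): diagnosis scores
    z2 = w_s @ f_sk             # Eq. (9): severity scores
    return z1, z2


rng = np.random.default_rng(0)
C = 32  # illustrative channel count (assumption)
z1, z2 = predict(rng.standard_normal((C, 8, 8)),
                 rng.standard_normal((C, 2 * C)),
                 rng.standard_normal((2, C)),
                 rng.standard_normal((3, C)))
diagnosis = DIAGNOSIS[int(np.argmax(z1))]
severity = SEVERITY[int(np.argmax(z2))]
print(diagnosis, severity)
```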

3.3 Parameter selection

In this part, we describe the parameter selection details. The input CT image data shape is set to 512 × 512 × 160, where h = 512, w = 512 and k = 160. Because the CT image slice thickness is 1 mm, this sequence length can cover most of the lung region of a patient. The convolution before the first SCAM has a fixed kernel size of 1 × 1 × 160 × 32, which shrinks the channel number from 160 to 32. The network parameter details are given in Tables 2 and 3. We set the number of SCAMs to 5, which is shown to give the best performance in Section 4.4.
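To make the effect of the 1 × 1 × 160 × 32 kernel concrete, the sketch below writes it as a per-pixel linear map over the 160 slices. The spatial size is reduced from 512 × 512 to keep the example light, and the weights are random placeholders; both are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
h, w = 32, 32                             # reduced from 512 x 512 for the example
volume = rng.standard_normal((h, w, 160))
kernel = rng.standard_normal((160, 32))   # 1 x 1 x 160 x 32 with spatial dims dropped

# A 1 x 1 convolution mixes channels independently at every pixel.
features = volume @ kernel                # (h, w, 160) @ (160, 32) -> (h, w, 32)

print(features.shape, kernel.size)        # (32, 32, 32) 5120
```

The kernel holds 160 × 32 = 5,120 weights regardless of the image size, since a 1 × 1 convolution has no spatial extent.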

Table 2 Network parameter details for symptom-fused channel attention module (SCAM)
Table 3 Network parameter details for prediction module

4 Experiments

In this section, we evaluate the performance of our method. First, we describe the datasets, which include data from 209 COVID-19 patients and 207 normal patients. Next, the data augmentation and network training implementation details are elaborated. Finally, we report the experimental results and ablation studies.

4.1 Patient data studies

We acquired data from 209 COVID-19 patients and 207 normal patients from Guizhou General People’s Hospital and the Second Xiangya Hospital. The ages of the patients ranged from 8 to 84. The lengths of the CT image sequences ranged from 202 to 653. We split the data into two groups, with 300 patients for training and 116 for testing. Because one input to our method consists of CT image sequences, the patient data were cut into multiple sequences of the same length (160) with uniform sampling, which covers most of the lung region. The other input is the vector of clinical symptoms for each corresponding image sequence. The CT images were obtained using a Siemens CT scanner with a slice thickness of 1.0 mm and a reconstruction matrix size of 512 × 512. The scan voltage ranged from 110 to 130 kV. In addition, the distances from the source to the detector and to the patient were set to 940 and 535, respectively. To reduce the number of CT image files, the patient data were exported as video files with an image size of 512 × 512 using Radiant software.
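One possible reading of the uniform sampling step is sketched below: 160 slice indices are spread evenly across a scan of any length between 202 and 653. The helper name `uniform_indices` and the integer spacing rule are assumptions; the exact sampling scheme is not specified here, and this sketch yields a single sequence per scan.

```python
def uniform_indices(num_slices, length=160):
    """Pick `length` evenly spaced slice indices from 0 to num_slices - 1."""
    if num_slices < length:
        raise ValueError("sequence is shorter than the requested length")
    # Integer arithmetic keeps the first and last slice and avoids rounding ties.
    return [i * (num_slices - 1) // (length - 1) for i in range(length)]


for n in (202, 653):  # shortest and longest sequence lengths in our data
    idx = uniform_indices(n)
    print(n, len(idx), idx[0], idx[-1])
# 202 160 0 201
# 653 160 0 652
```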

4.2 Data augmentation and network training implementation

We conducted data augmentation on our collected patient data. For testing, we randomly selected 116 image sequences from the total patient data; these sequences do not appear in the training dataset. For training, we randomly selected the initial index of the image sequence within the patient data during each epoch, which improves data diversity. For augmentation, we applied horizontal flip, vertical flip, and rotation (90°, 180° and 270°) operations to acquire richer samples.
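The two sources of variation described above, a random initial index and flip or rotation operations, can be sketched with NumPy as follows. Treating the selected sequence as a contiguous window of 160 slices is an assumption, as are the function names `random_window` and `augment` and the h × w × k axis layout.

```python
import numpy as np


def random_window(volume, length=160, rng=None):
    """Take `length` consecutive slices starting at a random initial index."""
    if rng is None:
        rng = np.random.default_rng()
    n = volume.shape[-1]
    start = int(rng.integers(0, n - length + 1))  # upper bound is exclusive
    return volume[..., start:start + length]


def augment(volume):
    """Return the original volume plus flipped and rotated copies (6 in total)."""
    out = [volume,
           np.flip(volume, axis=1),   # horizontal flip (width axis)
           np.flip(volume, axis=0)]   # vertical flip (height axis)
    # In-plane rotations by 90, 180 and 270 degrees.
    out += [np.rot90(volume, k=r, axes=(0, 1)) for r in (1, 2, 3)]
    return out


scan = np.arange(4 * 4 * 200).reshape(4, 4, 200)  # tiny stand-in for a CT scan
window = random_window(scan, rng=np.random.default_rng(0))
samples = augment(window)
print(window.shape, len(samples))  # (4, 4, 160) 6
```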

The Adam optimizer [51] with an initial learning rate of 0.001 was adopted to minimize the loss function. We trained the entire network for 1,000 epochs, both in the ablation studies and for parameter selection. The network was implemented in PyTorch on a computer equipped with a TITAN 2080Ti GPU. The running time on the test dataset for each clinical assessment was less than 104 s.

4.3 Experimental results

To validate the effectiveness of our proposed method, we design several versions of it, shown in Table 4. Inspired by deep neural networks for image classification in the natural image domain, we design our method for 3D CT image sequences and introduce the clinical symptom prior information. We therefore compare our method with several methods, including AlexNet [52], ResNet [11], MobileNet [21], VGG [18], SENet [53] and DenseNet [19], to evaluate the performance. Because CT image sequences are a new setting for these comparison methods, we redesign them with similar parameter counts (shown in Table 5) and adopt fully connected layers for their prediction outputs. Note that applying both channel average and max pooling significantly reduces the number of network parameters compared with fully connected layers. After training all models for the same number of epochs, the accuracy of FaNet on both diagnosis assessment and severity assessment was considerably better than that of the other methods on the test dataset (shown in Table 6).
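The parameter saving from channel pooling can be illustrated with a back-of-the-envelope count. The feature-map size below (32 channels of 16 × 16) is a hypothetical choice for illustration, not a value from our network; biases are ignored, and both helper functions are illustrative names.

```python
def fc_head_params(channels, height, width, num_outputs):
    """Weights of a fully connected layer on the flattened feature map."""
    return channels * height * width * num_outputs


def pooled_head_params(channels, num_outputs):
    """Weights of the pooling-based head: squeeze 2C -> C, then C -> outputs."""
    squeeze = (2 * channels) * channels       # 1 x 1 convolution after concatenation
    return squeeze + channels * num_outputs   # 1 x 1 convolutions for both tasks


C, H, W = 32, 16, 16   # hypothetical final feature-map size
outputs = 2 + 3        # diagnosis classes plus severity classes

print(fc_head_params(C, H, W, outputs))  # 40960
print(pooled_head_params(C, outputs))    # 2208
```

The pooled head does not depend on the spatial size of the feature map, whereas the fully connected head grows linearly with it.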

Table 4 Several methods based on FaNet and their descriptions
Table 5 Parameter counts for different methods
Table 6 Accuracy on diagnosis and severity assessments for different methods

4.4 Ablation studies

We conducted ablation studies to compare model performance under different numbers of symptom-fused channel attention modules and to validate the effectiveness of introducing symptom information.

4.4.1 The number of symptom-fused channel attention modules

The ablation studies compare the performance under different numbers (3, 4, 5, 6, 7) of SCAMs. As shown in Fig. 5, the accuracy on both diagnosis assessment and severity assessment is best with 5 SCAMs when the same number of training epochs is used. The network parameter count increases as more SCAMs are applied, and a larger parameter count may lead to worse fitting results on a limited dataset.

Fig. 5 Accuracy for different numbers of symptom-fused channel attention modules (SCAMs)

4.4.2 The effectiveness of clinical symptom prior information

We design two variants, FaNet-Res and FaNet-Rca, which are not fused with symptom prior information, and train them for the same number of epochs. As shown in Fig. 6, FaNet achieves clearly better performance than the other two methods. Simple channel attention alone may not improve the accuracy, whereas the symptom-fused attention does improve performance, which supports the effectiveness of clinical symptom prior information. In addition, we directly input random symptom information into the trained FaNet, which leads to worse performance (shown in Table 7); this also indicates that the symptoms play an important role in improving performance.

Fig. 6 Accuracy comparison among FaNet-Res, FaNet-Rca and FaNet

Table 7 Accuracy comparison between FaNet-Random and FaNet

4.4.3 The effectiveness of channel pooling in prediction module

The prediction module in FaNet is equipped with both channel average and max pooling operations, which shrink the feature maps for the two subsequent prediction tasks. Based on this original version, we design two further versions, FaNet-WA and FaNet-WM, each of which adopts only one of the two channel pooling operations (average or max). To validate the effectiveness of channel pooling, we train these three models for the same number of epochs. As shown in Fig. 7, FaNet achieves better performance than the other models, which demonstrates that employing both channel average and max pooling operations in our proposed network achieves better performance than applying only one channel pooling operation.

Fig. 7 Accuracy comparison among FaNet-WA, FaNet-WM and FaNet

5 Limitations and future works

We first discuss the setting of the length of the CT image sequences. To cover the lung region, we set the length of the CT sequence k to 160. A longer sequence would require more time to process; thus, a tradeoff exists between accuracy and running time. Second, motivated by the need for fast clinical assessments of COVID-19, we introduced the clinical symptoms of patients as prior information alongside CT image sequences because symptom information is quickly and readily accessible. However, we note that other prior information that might achieve better performance could also be considered, such as the laboratory results from real-time RT-PCR and sequencing of nucleic acids from the virus [54]. In future work, other network technologies such as recurrent neural networks [55] could also be explored to extract more latent spatial information. Furthermore, the total amount of data is somewhat limited, since the experiments in this work involved the data of only 416 patients with or without COVID-19. In future work, we will collect more patient data and seek to further improve these results. As new data are added, we will consider more strategies to avoid potential overfitting, such as cross validation or weight regularization.

6 Conclusion

In conclusion, this paper proposed a fast assessment network for both diagnosis and severity assessment of COVID-19. Based on previous findings for patients with COVID-19, CT image scans can form an effective solution for rapid clinical assessments. In addition, patient clinical symptoms show a latent relationship with the final assessment. Thus, we explored a fast assessment network that considers both 3D CT image sequences and clinical symptoms. The CT image sequences are much simpler for radiologists to use because they do not need to select a specific individual CT image, and the symptom data can be easily accessed. Ablation studies validate the effectiveness of introducing symptom prior information and of the network design. The experimental results illustrate that FaNet achieves fast clinical assessment for COVID-19 with an accuracy of 98.28% on diagnosis assessment and 94.83% on severity assessment. In future work, we will seek to further improve the current results.