Preprocessing
Let the original CCT image set be \({S}_{1}\), composed of \(n\) CCT images:
$$ S_{1} = \left\{ {s_{1} \left( 1 \right),s_{1} \left( 2 \right), \ldots ,s_{1} \left( i \right), \ldots s_{1} \left( n \right)} \right\}. $$
(1)
First, we convert each three-channel color image to grayscale, obtaining the grayscale image set \({S}_{2}\) as
$$ \begin{aligned} S_{2} & = G\left( {S_{1} |{\text{RGB}} \to {\text{Grayscale}}} \right) \\ & = \left\{ {s_{2} \left( 1 \right),s_{2} \left( 2 \right), \ldots ,s_{2} \left( i \right), \ldots ,s_{2} \left( n \right)} \right\}. \\ \end{aligned} $$
(2)
Second, the histogram stretching (HS) method is employed to increase image contrast. For the i-th image \({s}_{2}(i)\), the histogram-stretched image \({s}_{3}(i)\) is obtained as
$$ s_{3} \left( {i|\alpha ,\beta } \right) = \frac{{s_{2} \left( {i|\alpha ,\beta } \right) - \varepsilon_{2}^{\min } \left( i \right)}}{{\varepsilon_{2}^{\max } \left( i \right) - \varepsilon_{2}^{\min } \left( i \right)}}, $$
(3)
where \(1 \le \alpha \le 1024,1 \le \beta \le 1024\). Here, \(\left( {\alpha , \beta } \right)\) denotes the pixel coordinates of image \(s_{2} \left( i \right)\), while \(\varepsilon_{2}^{\min } \left( i \right)\) and \(\varepsilon_{2}^{\max } \left( i \right)\) denote the minimum and maximum gray values of image \(s_{2} \left( i \right)\), respectively:
$$ \varepsilon_{2}^{\min } \left( i \right) = \mathop {\min }\limits_{{\left( {\alpha , \beta } \right)}} \left[ {s_{2} (i|\alpha ,\beta )} \right] $$
(4a)
$$ \varepsilon_{2}^{\max } \left( i \right) = \mathop {\max }\limits_{{\left( {\alpha , \beta } \right)}} \left[ {s_{2} (i|\alpha ,\beta )} \right]. $$
(4b)
In summary, we obtain the histogram-stretched dataset \(S_{3}\) as
$$ \begin{aligned} S_{3} & = {\text{HS}}\left( {S_{2} } \right) \\ & = \left\{ {s_{3} \left( 1 \right),s_{3} \left( 2 \right), \ldots ,s_{3} \left( i \right), \ldots s_{3} \left( n \right)} \right\}. \\ \end{aligned} $$
(5)
Third, we crop the images to remove the text in the margins and the examination bed at the bottom. This yields the cropped dataset \(S_{4}\) as
$$ \begin{aligned} S_{4} & = C\left( {S_{3} ,\left[ {{\text{top}},{\text{bottom}},{\text{left}},{\text{right}}} \right]} \right) \\ & = \left\{ {s_{4} \left( 1 \right),s_{4} \left( 2 \right), \ldots ,s_{4} \left( i \right), \ldots s_{4} \left( n \right)} \right\}, \\ \end{aligned} $$
(6)
where C represents the crop operation, and the parameter vector \([\mathrm{top},\mathrm{bottom},\mathrm{left},\mathrm{right}]\) specifies the number of pixels to remove from the top, bottom, left, and right, respectively. In our study, we set \(\mathrm{top}=\mathrm{bottom}=\mathrm{left}=\mathrm{right}=150\).
Fourth, we downsample each image \(s_{4} \left( i \right)\) to size \(\left[ {\varpi ,\varpi } \right]\), obtaining the resized image set \(S_{5}\) as
$$ \begin{aligned} S_{5} & = D\left( {S_{4} ,\left[ {\varpi ,\varpi } \right]} \right) \\ & = \left\{ {s_{5} \left( 1 \right),s_{5} \left( 2 \right), \ldots ,s_{5} \left( i \right), \ldots ,s_{5} \left( n \right)} \right\}, \\ \end{aligned} $$
(7)
where \(D\) denotes the downsampling operation. \(\varpi = 128\) in this study. Figure 1 shows the above four preprocessing steps.
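To make the four steps concrete, a minimal Python sketch of the preprocessing pipeline is given below. It assumes OpenCV and NumPy are available; the function name `preprocess` and its defaults are illustrative, not taken from the original implementation.

```python
import cv2
import numpy as np

def preprocess(s1, margin=150, omega=128):
    """Four-step preprocessing: grayscale, histogram stretch, crop, resize."""
    # Step 1: compress the three-channel color image to grayscale, Eq. (2).
    s2 = cv2.cvtColor(s1, cv2.COLOR_BGR2GRAY).astype(np.float32)
    # Step 2: histogram stretching to [0, 1], Eqs. (3)-(4).
    eps_min, eps_max = s2.min(), s2.max()
    s3 = (s2 - eps_min) / (eps_max - eps_min)
    # Step 3: crop `margin` pixels from top, bottom, left, and right, Eq. (6).
    s4 = s3[margin:-margin, margin:-margin]
    # Step 4: downsample to [omega, omega], Eq. (7).
    s5 = cv2.resize(s4, (omega, omega), interpolation=cv2.INTER_AREA)
    return s5
```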
Table 3 compares the size and storage per image at each preprocessing step. After the four-step preprocessing procedure, each image occupies only about 0.52% of its original storage. The byte compression ratio (BCR) is calculated as \({\text{BCR}} = {\text{byte}}\left( {s_{5} } \right) \div {\text{byte}}\left( {s_{1} } \right) = 65{,}536 \div 12{,}582{,}912 = 0.52\%\).
Table 3 Image size and storage per image at each preprocessing step

Figure 2 shows two samples of our collected and preprocessed dataset \({S}_{5}\), in which the clinical biomarkers of COVID-19 can be clearly observed. Cui et al. [19] reported preliminary CT findings of COVID-19. Tuncer et al. [20] developed a local binary pattern and iterative ReliefF algorithm on chest CT images. Further open publications show that it is feasible to develop effective AI systems based on CCT images.
Basics of DCNN
A deep convolutional neural network (DCNN) is a kind of artificial neural network whose main feature is the use of multiple layers to build a deep model. Generally, a DCNN is composed of conv layers (CLs), pooling layers (PLs), and fully connected layers (FCLs) [21,22,23,24,25]. Figure 3 presents a simple instance consisting of 2 CLs, 2 PLs, and 2 FCLs; in the right part of Fig. 3, the blue rectangle denotes the FCL block and the red rectangle the softmax function. DCNNs can reach better performance than traditional AI methods because they learn features from the data during training, so there is no need to spend much time on feature engineering.
The essential operation in a DCNN is convolution. The CL performs 2D convolution along the width and height directions. Note that the weights in a CNN are initialized randomly and then learned from the data by network training. Figure 4 illustrates the pipeline of input feature maps passing through a CL. Assume there is an input matrix I, \(J\) kernels (\(K_{1} ,K_{2} , \ldots ,K_{j} , \ldots ,K_{J}\)), and an output O, with their sizes \({\mathbb{S}}\) defined as
$$ {\mathbb{S}}\left( x \right) = \left\{ {\begin{array}{*{20}c} {W_{I} \times X_{I} \times C_{I} } & {x = I} \\ {W_{K} \times X_{K} \times C_{K} } & {x = K_{j} \left( {j = 1, \ldots ,J} \right)} \\ {W_{O} \times X_{O} \times C_{O} } & {x = O} \\ \end{array}, } \right. $$
(8)
where \(\left( {W, X, C} \right)\) represent the height, width, and number of channels of the matrix, respectively. Subscripts I, K, and O denote input, kernel, and output, respectively. J denotes the total number of filters. Note that
$$ C_{I} = C_{K} ,\quad C_{O} = J, $$
(9)
which means the channel number of the input \(C_{I}\) should equal that of the kernel \(C_{K}\), and the channel number of the output \(C_{O}\) should equal the number of filters J.
Assuming the filters move with padding B and stride A, the output size follows by simple math:
$$ W_{O} = 1 + \left\lfloor {\frac{{2 \times B + W_{I} - W_{K} }}{A}} \right\rfloor $$
(10a)
$$ X_{O} = 1 + \left\lfloor {\frac{{2 \times B + X_{I} - X_{K} }}{A}} \right\rfloor, $$
(10b)
where \( {\lfloor } \cdot { \rfloor }\) represents the floor function. Afterward, the CL's outputs are passed into a nonlinear activation function (NLAF) \(\sigma\), usually chosen to be the rectified linear unit (ReLU) function:
$$ \begin{aligned} \sigma_{{{\text{ReLU}}}} \left( x \right) & = {\text{ReLU}}\left( x \right) \\ & = \max \left( {0,x} \right). \\ \end{aligned} $$
(11)
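As a quick sanity check of Eqs. (10a)-(11), the following sketch (NumPy assumed; all names are illustrative) computes a CL's output size with the floor function and applies ReLU:

```python
import numpy as np

def conv_output_size(w_in, x_in, w_k, x_k, b, a):
    """Output height/width of a conv layer with padding b and stride a,
    Eqs. (10a)-(10b); // is floor division."""
    w_out = 1 + (2 * b + w_in - w_k) // a
    x_out = 1 + (2 * b + x_in - x_k) // a
    return w_out, x_out

def relu(x):
    """ReLU nonlinear activation, Eq. (11)."""
    return np.maximum(0, x)

# A 128x128 input with a 3x3 kernel, padding 1, stride 1 keeps size 128x128.
print(conv_output_size(128, 128, 3, 3, b=1, a=1))  # (128, 128)
```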
ReLU is preferred over traditional NLAFs such as the hyperbolic tangent (HT) and sigmoid (SM) functions:
$$ \begin{aligned} \sigma_{{{\text{HT}}}} \left( x \right) & = \tanh \left( x \right) \\ & = \frac{{\left( {e^{x} - e^{ - x} } \right)}}{{\left( {e^{x} + e^{ - x} } \right)}} \\ \end{aligned} $$
(12)
$$ \sigma_{{{\text{SM}}}} \left( x \right) = \left( {1 + e^{ - x} } \right)^{ - 1}. $$
(13)
Improvement 1: Use SP to replace MP and AP
The activation maps (AMs) after each block within a DCNN are usually too large, i.e., their width, height, and channel dimensions are too large to handle, which causes (i) overfitting during training and (ii) large computational costs.
The pooling layer (PL) is a form of nonlinear downsampling (NLDS) that solves the above issues. Further, a PL provides translation invariance to the AMs. For a \(2 \times 2\) region \(\overline{\varphi }\), suppose the pixels within the region are
$$ \overline{\varphi } = \left[ {\begin{array}{*{20}c} {\varphi_{1,1} } & {\varphi_{1,2} } \\ {\varphi_{2,1} } & {\varphi_{2,2} } \\ \end{array} } \right]. $$
(14)
The average pooling (AP) calculates the mean value in the region \(\overline{\varphi }\). Assuming the output value after NLDS is z, we have
$$ \begin{aligned} z_{{\overline{\varphi }}}^{AP} & = {\text{average}}\left( {\overline{\varphi }} \right) \\ & = \frac{{\varphi_{1,1} + \varphi_{1,2} + \varphi_{2,1} + \varphi_{2,2} }}{{\left| {\overline{\varphi }} \right|}}, \\ \end{aligned} $$
(15)
where \(\left| {\overline{\varphi }} \right|\) means the number of elements of region \(\overline{\varphi }\); here \(\left| {\overline{\varphi }} \right| = 4\) for \(2 \times 2\) NLDS pooling. Using Fig. 5 as an example, assume the region \(\dot{\varphi }\) at the 2nd row and 1st column of the input AM I is chosen, i.e., \(\dot{\varphi } = I\left( {{\text{row}} = 2,{\text{col}} = 1} \right)\); thus, we have \(z_{{\dot{\varphi }}}^{{{\text{AP}}}} = {\text{average}}\left( {\dot{\varphi }} \right) = \left( {4 + 4 + 3 + 9} \right) \div 4 = 20 \div 4 = 5\).
The max pooling (MP) operates on the region \(\overline{\varphi }\) and selects the maximum value. Note that both AP and MP work on every slice separately.
$$ \begin{aligned} z_{{\overline{\varphi }}}^{MP} & = \max \left( {\overline{\varphi }} \right) \\ & = \max \nolimits_{i,j = 1}^{2} \varphi_{i,j}. \\ \end{aligned} $$
(16)
In Fig. 5, \(z_{{\dot{\varphi }}}^{{{\text{MP}}}} = \max \left( {\dot{\varphi }} \right) = \max \left( {4,4,3,9} \right) = 9\).
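Both classical pooling operators on the example region \(\dot{\varphi }\) can be reproduced in a few lines of NumPy (a sketch, not the original code):

```python
import numpy as np

phi = np.array([[4.0, 4.0],
                [3.0, 9.0]])  # region at row 2, col 1 of the input AM in Fig. 5

z_ap = phi.mean()  # average pooling, Eq. (15): (4 + 4 + 3 + 9) / 4 = 5.0
z_mp = phi.max()   # max pooling, Eq. (16): max(4, 4, 3, 9) = 9.0
```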
In practice, scholars observed that AP does not always work well: because all pixels in the region \(\overline{\varphi }\) enter the NLDS function, numerous near-zero pixels can down-weight (DW) an intense activation. For example, in our region \(\dot{\varphi }\), the strongest value is down-weighted as \(9\mathop{\to}\limits^{{{\text{DW}}}}5\). On the other hand, MP avoids the above DW problem; however, it easily overfits the training set and causes the lack-of-generalization (LoG) problem.
The stochastic pooling (SP) was introduced to conquer the DW, overfitting, and LoG problems caused by MP and AP. Instead of computing the average or the max, the output of the SP \(z^{{{\text{SP}}}}\) is calculated via sampling from a multinomial distribution generated from the activations of each region \(\overline{\varphi }\). Three steps of SP are depicted below:
(1) Estimate the probability \(\theta_{i,j} \in {\Theta }\) of each entry \(\left\{ {\varphi_{i,j} ,i,j = 1,2} \right\}\) within the region \(\overline{\varphi }\):
$$ \theta_{i,j} = \frac{{\varphi_{i,j} }}{{{\text{sum}}\left( {\overline{\varphi }} \right)}},\quad i,j = 1,2 $$
(17a)
$$ \mathop \sum \limits_{i,j = 1}^{2} \theta_{i,j} = 1 $$
(17b)
in which (i, j) is the element index of region \(\overline{\varphi }\). In matrix format, Eq. (17a) can be rewritten as
$$ \Theta = \overline{\varphi }/\sum \left( {\overline{\varphi }} \right). $$
(18)
(2) Select a location \(\beta\) within \(\overline{\varphi }\) according to the probabilities \(\left\{ {\theta_{i,j} } \right\}\):
$$ \beta \sim {\text{Prob}}\left( {\theta_{1,1} ,\theta_{1,2} , \theta_{2,1} , \theta_{2,2} } \right). $$
(19)
(3) The output is the value at location \(\beta\):
$$ z_{{\overline{\varphi }}}^{{{\text{SP}}}} = \varphi_{\beta } $$
(20)
Using the region \(\dot{\varphi }\) in Fig. 5 as an example, SP first calculates the probability map (PM),
$$ \begin{aligned} \Theta \left( {\dot{\varphi }} \right) & = \left[ {\begin{array}{*{20}c} 4 & 4 \\ 3 & 9 \\ \end{array} } \right]/\sum \left( {\left[ {\begin{array}{*{20}c} 4 & 4 \\ 3 & 9 \\ \end{array} } \right]} \right) \\ & = \left[ {\begin{array}{*{20}c} {0.2} & {0.2} \\ {0.15} & {0.45} \\ \end{array} } \right] \\ \end{aligned} $$
(21a)
$$ \beta \left( {\dot{\varphi }} \right) = \left( {2,2} \right). $$
(21b)
Using the probability map, we randomly select the position \(\beta = \left( {2,2} \right)\), which is associated with probability \(\theta_{2,2} \left( {\dot{\varphi }} \right) = 0.45\). Thus, the SP output of \(\dot{\varphi }\) is \(z_{{\dot{\varphi }}}^{{{\text{SP}}}} = \dot{\varphi }_{\beta } = \dot{\varphi }_{{\left( {2,2} \right)}} = 9\). In summary, SP can output non-maximal activations from the region \(\overline{\varphi }\), instead of always outputting the greatest value.
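The three SP steps map directly onto code. Below is a minimal NumPy sketch for a single \(2 \times 2\) region, assuming non-negative activations (e.g., after ReLU) so that the probability map is well defined:

```python
import numpy as np

def stochastic_pool(phi, rng):
    """Stochastic pooling of one region, Eqs. (17)-(20).

    phi holds non-negative activations; rng is a NumPy random generator."""
    theta = phi / phi.sum()                      # step 1: probability map, Eq. (18)
    idx = rng.choice(phi.size, p=theta.ravel())  # step 2: sample a location, Eq. (19)
    return phi.ravel()[idx]                      # step 3: value at that location, Eq. (20)

rng = np.random.default_rng(0)
phi = np.array([[4.0, 4.0],
                [3.0, 9.0]])
# Outputs 9 with probability 0.45, 4 with probability 0.4, and 3 with probability 0.15.
print(stochastic_pool(phi, rng))
```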
Improvement 2: batch normalization transform
The motivation of the batch normalization transform (BNT) is the so-called internal covariate shift (ICS), i.e., the change in the distribution of inputs to internal DCNN layers during training. ICS worsens the DCNN's performance.
This study introduces BNT to normalize each internal layer's inputs \(A = \left\{ {a_{i} } \right\}\) over every mini-batch (suppose its size is m), so that the batch-normalized output \(B = \left\{ {b_{i} } \right\}\) has a stable distribution across mini-batches. Mathematically, BNT learns a mapping
$$ \underbrace {{\left\{ {a_{i} ,i = 1,2, \ldots ,m} \right\}}}_{A} \mapsto \underbrace {{\{ b_{i} ,i = 1,2, \ldots ,m\} }}_{B}. $$
(22)
The empirical mean \(\mu\) and empirical variance \(\sigma^{2}\) over the mini-batch \(A\) are calculated as
$$ \mu_{A} = \frac{1}{m}\left( {\mathop \sum \limits_{i = 1}^{m} a_{i} } \right) $$
(23)
$$ \sigma_{A}^{2} = \frac{1}{m}\mathop \sum \limits_{i = 1}^{m} \left( {a_{i} - \mu_{A} } \right)^{2}. $$
(24)
The input \(a_{i} \in A\) is first normalized to \(\hat{a}_{i}\) as
$$ \hat{a}_{i} = \frac{{a_{i} - \mu_{A} }}{{\sqrt {\sigma_{A}^{2} + \Delta } }}, $$
(25)
where \(\Delta\) in the denominator of Eq. (25) enhances numerical stability; \(\Delta\) is a small constant, set to \(10^{ - 5}\) in this study. Now \(\hat{a}_{i}\) has zero-mean and unit-variance characteristics. In order to obtain a more expressive deep neural network [26], a transformation is usually carried out as
$$ b_{i} = {\mathfrak{C}} \times \hat{a}_{i} + {\mathfrak{D}}, $$
(26)
where \({\mathfrak{C}}\) and \({\mathfrak{D}}\) are two learnable parameters during training. The transformed output \(b_{i} \in B\) is then passed to the next layer, and the normalized \(\hat{a}_{i}\) remains internal to the current layer.
In the inference stage, mini-batches are no longer available, so instead of the empirical mean and variance, we use the population mean \(\underline{\mu }\) and population variance \(\underline{{\sigma^{2} }}\); the output \(\underline{{b_{i} }}\) at the inference stage is
$$ \underline{{b_{i} }} = {\mathfrak{C}} \times \left( {\frac{{a_{i} - \underline{\mu } }}{{\sqrt {\underline{{\sigma^{2} }} + \Delta } }}} \right) + {\mathfrak{D}}. $$
(27)
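A minimal NumPy sketch of the BNT at training and inference time follows; `gamma` and `beta` stand in for the learnable parameters \({\mathfrak{C}}\) and \({\mathfrak{D}}\), and all names are illustrative:

```python
import numpy as np

def bnt_train(a, gamma, beta, delta=1e-5):
    """Batch normalization over one mini-batch, Eqs. (23)-(26)."""
    mu = a.mean()                            # empirical mean, Eq. (23)
    var = a.var()                            # empirical variance (1/m), Eq. (24)
    a_hat = (a - mu) / np.sqrt(var + delta)  # zero-mean, unit-variance, Eq. (25)
    return gamma * a_hat + beta              # learnable rescale and shift, Eq. (26)

def bnt_infer(a, gamma, beta, pop_mu, pop_var, delta=1e-5):
    """Inference-stage transform with population statistics, Eq. (27)."""
    return gamma * (a - pop_mu) / np.sqrt(pop_var + delta) + beta
```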
We propose to use the convolution block (CB) as one of the building blocks of our DCNN. The CB consists of one conv layer and one batch normalization layer.
Improvement 3: fully connected block
In a traditional DCNN, the fully connected layer (FCL) serves as the classifier. We replace the FCL with a fully connected block (FCB), which includes one dropout layer (DOL) and one FCL. Srivastava et al. [27] proposed the concepts of dropout neurons (DONs) and the DOL: neurons are randomly dropped from the DCNN during training, and their adjacent weights \(s\) are set to zero.
A neuron's incoming and outgoing connections are frozen after it is dropped out. Figure 6 illustrates the neurons in a DOL. Neurons are dropped at random with a retention probability \(\theta_{rp}\):
$$ \mathop {\tilde{s}}\limits_{{{\text{training}}}} = \left\{ {\begin{array}{*{20}c} s & {{\text{with}}\, \theta_{rp} } \\ 0 & {{\text{otherwise}}} \\ \end{array}. } \right. $$
(28)
where \(\theta_{rp} = 0.5\), and \(\tilde{s}\) means the weights of the dropped-out neurons.
During inference, we run the entire DCNN without dropout, but the weights of the FCLs of the FCBs are downscaled (viz., multiplied) by \(\theta_{rp}\):
$$ \mathop {\tilde{s}}\limits_{{{\text{inference}}}} = \theta_{rp} \times s. $$
(29)
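Eqs. (28)-(29) can be sketched as follows (illustrative NumPy, with `theta_rp` the retention probability):

```python
import numpy as np

def dropout_train(s, theta_rp, rng):
    """Training: retain each weight with probability theta_rp, else zero, Eq. (28)."""
    mask = rng.random(s.shape) < theta_rp
    return np.where(mask, s, 0.0)

def dropout_infer(s, theta_rp):
    """Inference: keep all weights but downscale them by theta_rp, Eq. (29)."""
    return theta_rp * s
```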
Figure 7 shows a toy DCNN example with four FCL layers. Suppose there are \(N\left( k \right)\) neurons at the k-th layer, with \(N\left( 1 \right) = 12\), \(N\left( 2 \right) = 10\), \(N\left( 3 \right) = 8\), \(N\left( 4 \right) = 4\); thus, there are \(\mathop \sum \nolimits_{k = 1}^{4} N\left( k \right) = 34\) nodes in total. Ignoring connections to the rest of the network and the number of biases, the number of learnable weights \(S^{b} \left( {i,j} \right)\) between layer \(i\) and layer \(j\) before dropout is roughly \(S^{b} \left( {1,2} \right) = 12 \times 10 = 120\), \(S^{b} \left( {2,3} \right) = 10 \times 8 = 80\), and \(S^{b} \left( {3,4} \right) = 8 \times 4 = 32\), giving a total of \(S^{b} = \mathop \sum \nolimits_{k = 1}^{3} S^{b} \left( {k,k + 1} \right) = 232\) learnable weights before dropout. Using \(\theta_{rp} = 0.5\), and denoting the number of learnable weights after dropout between layer \(i\) and layer \(j\) by \({S}^{a}(i,j)\), the total number of learnable weights after dropout is \( S^{a} = \mathop\sum\nolimits_{k = 1}^{3} S^{a} \left( {k,k + 1} \right) = S^{a} \left( {1,2} \right) + S^{a} \left( {2,3} \right) + S^{a} \left( {3,4} \right) = 30 + 20 + 8 = 58 \).
The compression ratio of learnable weights (CRLW) is roughly \(58/232 = 0.25\), the square of the retention probability \(\theta_{rp}\):
$$ {\text{CRLW}} = \frac{{S^{a} }}{{S^{b} }} = \theta_{rp}^{2}, $$
(30)
where \(S^{a}\) and \(S^{b}\) denote the numbers of learnable weights after and before dropout, respectively.
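The toy calculation above can be verified in a few lines (illustrative Python):

```python
layer_sizes = [12, 10, 8, 4]  # N(1)..N(4) of the toy DCNN in Fig. 7
theta_rp = 0.5

s_before = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))  # 120 + 80 + 32 = 232
retained = [int(n * theta_rp) for n in layer_sizes]                  # [6, 5, 4, 2]
s_after = sum(a * b for a, b in zip(retained, retained[1:]))         # 30 + 20 + 8 = 58
print(s_after / s_before, theta_rp ** 2)                             # 0.25 0.25, Eq. (30)
```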
Proposed DCNN and its Implementation
We create a new five-layer DCNN with stochastic pooling for COVID-19 detection (5L-DCNN-SP-C), comprising three CBs and two FCBs. The structure of the proposed 5L-DCNN-SP-C is shown in Fig. 8, where SP is applied after each activation map. The numbers of CBs and FCBs (three and two, respectively) were set by manual trial and error; in the experiments, we compare this setting (3 CBs + 2 FCBs) against other settings.
The hyperparameters of each layer/block of the proposed 5L-DCNN-SP-C are listed in Table 4, where \(\left( {\alpha ,\beta \times \beta /\gamma } \right)\) means \(\alpha\) filters of size \(\beta \times \beta\), followed by a pooling layer with pooling size \(\gamma\). Meanwhile, W and B represent the sizes of the weight matrix and bias vector, respectively. The last column in Table 4 shows the activation map (AM).
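For orientation only, here is a hypothetical PyTorch sketch of a network with three CBs and two FCBs on a \(128 \times 128\) grayscale input. The channel counts and kernel sizes are placeholder assumptions (the actual hyperparameters are those of Table 4), and built-in max pooling stands in for stochastic pooling, which has no stock PyTorch layer:

```python
import torch.nn as nn

def conv_block(c_in, c_out):
    # CB = conv layer + batch normalization; pooling and ReLU follow the block.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),  # placeholder sizes
        nn.BatchNorm2d(c_out),
        nn.ReLU(),
        nn.MaxPool2d(2),  # stand-in for the stochastic pooling used in the paper
    )

model = nn.Sequential(
    conv_block(1, 16),   # CB 1 (channel counts are illustrative)
    conv_block(16, 32),  # CB 2
    conv_block(32, 64),  # CB 3
    nn.Flatten(),
    nn.Dropout(0.5), nn.Linear(64 * 16 * 16, 100), nn.ReLU(),  # FCB 1
    nn.Dropout(0.5), nn.Linear(100, 2),                        # FCB 2: 2 classes
)
```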
Table 4 Details of each layer in proposed 5L-DCNN-SP-C

Ten runs of tenfold cross-validation were employed. Suppose the confusion matrix \({\mathbb{C}}\) is defined as
$$ {\mathbb{C}}\left( {k,r} \right) = \left[ {\begin{array}{*{20}c} {c_{11} } & {c_{12} } \\ {c_{21} } & {c_{22} } \\ \end{array} } \right], $$
(31)
where \(\left( {c_{11} ,c_{12} ,c_{21} ,c_{22} } \right)\) represent TP, FN, FP, and TN, respectively; k is the index of the trial (in each trial, one fold is used for testing and the remaining folds for training), and r is the index of the run.
Note that \({\mathbb{C}}\) is calculated on each test fold and summed across all 10 trials, giving the confusion matrix of the r-th run:
$$ {\mathbb{C}}\left( r \right) = \mathop \sum \limits_{k = 1}^{10} {\mathbb{C}}\left( {k,r} \right). $$
(32)
Now we can calculate six indicators \(\vec{\eta }\left( r \right)\) based on the confusion matrix of the r-th run \({\mathbb{C}}\left( r \right)\):
$$ {\mathbb{C}}\left( r \right) \mapsto \underbrace {{\left[ {\eta_{1} \left( r \right),\eta_{2} \left( r \right), \ldots ,\eta_{6} \left( r \right)} \right]}}_{{\vec{\eta }\left( r \right)}}, $$
(33)
where \(\eta_{1}\) is sensitivity, \(\eta_{2}\) is specificity, \(\eta_{3}\) is precision, and \(\eta_{4}\) is accuracy. Ignoring variable r, we have:
$$ \eta_{1} = \frac{{c_{11} }}{{c_{11} + c_{12} }} $$
(34a)
$$ \eta_{2} = \frac{{c_{22} }}{{c_{22} + c_{21} }} $$
(34b)
$$ \eta_{3} = \frac{{c_{11} }}{{c_{11} + c_{21} }} $$
(34c)
$$ \eta_{4} = \frac{{c_{11} + c_{22} }}{{c_{11} + c_{12} + c_{21} + c_{22} }} $$
(34d)
\(\eta_{5}\) is the F1 score:
$$ \eta_{5} = 2 \times \frac{{\eta_{3} \times \eta_{1} }}{{\eta_{3} + \eta_{1} }} = \frac{{2 \times c_{11} }}{{2 \times c_{11} + c_{12} + c_{21} }} $$
(35)
and \(\eta_{6}\) is the Matthews correlation coefficient (MCC):
$$ \eta_{6} = \frac{{c_{11} \times c_{22} - c_{21} \times c_{12} }}{{\sqrt {\left( {c_{11} + c_{21} } \right) \times \left( {c_{11} + c_{12} } \right) \times \left( {c_{22} + c_{21} } \right) \times \left( {c_{22} + c_{12} } \right)} }}. $$
(36)
The mean and standard deviation (SD) of all six measures \(\vec{\eta }\) are calculated over all ten runs:
$$ {\text{mean}} \left( {\eta_{m} } \right) = \frac{1}{10} \times \mathop \sum \limits_{r = 1}^{10} \eta_{m} \left( r \right) $$
(37a)
$$ {\text{SD}}\left( {\eta_{m} } \right) = \sqrt {\frac{1}{9} \times \mathop \sum \limits_{r = 1}^{10} \left| {\eta_{m} \left( r \right) - {\text{mean}}\left( {\eta_{m} } \right)} \right|^{2} }, $$
(37b)
where \(1 \le m \le 6\) indexes the measures.
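The six indicators and their run-level statistics follow directly from the confusion-matrix counts; a NumPy sketch (illustrative naming) is:

```python
import numpy as np

def indicators(c11, c12, c21, c22):
    """Six measures from one run's confusion matrix (TP, FN, FP, TN)."""
    sen = c11 / (c11 + c12)                      # sensitivity, Eq. (34a)
    spc = c22 / (c22 + c21)                      # specificity, Eq. (34b)
    prc = c11 / (c11 + c21)                      # precision, Eq. (34c)
    acc = (c11 + c22) / (c11 + c12 + c21 + c22)  # accuracy, Eq. (34d)
    f1 = 2 * c11 / (2 * c11 + c12 + c21)         # F1 score, Eq. (35)
    mcc = (c11 * c22 - c21 * c12) / np.sqrt(
        (c11 + c21) * (c11 + c12) * (c22 + c21) * (c22 + c12))  # MCC, Eq. (36)
    return np.array([sen, spc, prc, acc, f1, mcc])

# Mean and SD over ten runs, Eqs. (37a)-(37b); `runs` is a list of (TP, FN, FP, TN).
# eta = np.stack([indicators(*run) for run in runs])
# eta.mean(axis=0), eta.std(axis=0, ddof=1)  # ddof=1 gives the 1/9 factor
```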