Improving ductal carcinoma in situ classification by convolutional neural network with exponential linear unit and rank-based weighted pooling

Ductal carcinoma in situ (DCIS) is a pre-cancerous lesion in the ducts of the breast, and early diagnosis is crucial for optimal therapeutic intervention. Thermography imaging is a non-invasive imaging tool that can be utilized for detection of DCIS and although it has high accuracy (~ 88%), it is sensitivity can still be improved. Hence, we aimed to develop an automated artificial intelligence-based system for improved detection of DCIS in thermographs. This study proposed a novel artificial intelligence based system based on convolutional neural network (CNN) termed CNN-BDER on a multisource dataset containing 240 DCIS images and 240 healthy breast images. Based on CNN, batch normalization, dropout, exponential linear unit and rank-based weighted pooling were integrated, along with L-way data augmentation. Ten runs of tenfold cross validation were chosen to report the unbiased performances. Our proposed method achieved a sensitivity of 94.08 ± 1.22%, a specificity of 93.58 ± 1.49 and an accuracy of 93.83 ± 0.96. The proposed method gives superior performance than eight state-of-the-art approaches and manual diagnosis. The trained model could serve as a visual question answering system and improve diagnostic accuracy.


Introduction
Ductal carcinoma in situ (DCIS), also named intra-ductal carcinoma is a pre-cancerous lesion of cells that line the breast milk ducts, but have not spread into the surrounding breast tissue. DCIS is considered the earliest stage of breast cancer (Stage 0) [1], and although cure rates are high the patients still need to be treated, since DCIS can become invasive. Note that there are four other stages: Stage 1 describes invasive breast cancer, the cancer cells of which are invading normal surrounding breast tissues. Stages 2 and 3 describe breast cancers that have invaded regional lymph nodes and Stage 4 represents metastatic cancer which spreads beyond the breast and regional lymph nodes to other distant organs [2]. Upon diagnosis of DCIS, treatment options include breast-conserving surgery (BCS), usually in combination with radiation therapy [3] or mastectomy.
Breast thermography (BT) is an alternative imaging tool to mammography, which is the traditional diagnostic tool for DCIS. Unlike mammography (which uses ionizing radiation to generate an image of the breast), BT utilizes infra-red (IR) images of skin temperature to assist in the diagnosis of numerous medical conditions, and has been suggested to detect breast cancer up to 10 years earlier than mammography [4]. Furthermore, due to its use of ionizing radiation, mammography can increase the risk of breast cancer by 2% with each scan [5].
Automatic interpretation of DCIS [6] by BT images consists of three phases: (1) segmentation of the region of interest, separating the breast from the image; (2) feature extraction, choosing distinguishing features that can help recognize the suspicious lesion; (3) classification, identifying the image as DCIS or healthy.
Previous studies have developed a number of effective artificial intelligence (AI) methods for DCIS detection using BT. Milosevic et al. [7] utilized 50 IR breast images to develop a co-occurrence matrix (COM) and run length matrix (RLM) as IR image descriptors. In the classification stage, a support vector machine (SVM) and naive Bayesian classifier (NBC) were used. Their methods are abbreviated as CRSVM and CRNBC. In addition, Nicandro et al. [8] employed NBC, whereas Chen [9] utilized wavelet energy entropy (WEE) as features to classify breast cancers with promising results. Zadeh et al. [10] combined self-organizing map and multilayer perceptron abbreviated as SMMP and Nguyen [11] introduced Hu moment invariant (HMI) to detect abnormal breasts. Finally, Muhammad [12] combined statistical measure and fractal dimension (SMFD), and Guo [13] proposed a wavelet energy support vector machine (WESVM) to detect breast cancer.
Nevertheless, the above methods require laborious feature engineering (FE), i.e., using domain knowledge to extract features from raw data. To help create an improved, automated AI model quickly and effectively, we proposed to use recent deep learning (DL) technologies, viz, convolutional neural networks (CNNs), which are a broad AI technique combining artificial intelligence and representation learning (RL).
Our contributions lie in four parts: (1) we proposed a novel 5-layer CNN; (2) we introduced exponential linear unit to replace traditional rectified linear unit; (3) we introduced rank-based weighted pooling to replace traditional pooling methods and (4) we used data augmentation to enhance the training set, so as to improve the test performance. Table 14 in "Appendix A" gives the abbreviations and their explanations for ease of reading.

Physical fundamentals
BT is a sub-science field within IR imaging sciences. IR cameras detect radiation in the long IR range (9-14 µm), with the thermal images generated being dubbed thermograms. Physically, Planck's law stated the spectral of a body for frequency ω at absolute temperature T is given as where B stands for the spectral radiance, o the Planck constant, k B the Boltzmann constant, and l s the light speed,. If replacing frequency ω by wavelength λ using l s λω, above equation can be written as: Both charge-coupled device (CCD) and complementary metal-oxide-semiconductor (CMOS) sensors in optical cameras detect visible light, and even near-infra-red (NIR) by utilizing parts of the IR spectrum. Basically, they could produce true thermograms with temperatures beyond 280°C.
In our breast thermogram cases, the thermal imaging cameras have a range of 15-45°C, and a sensitivity around 0.05°C. Furthermore, three emitted components (ECs) help generate the following breast thermogram images: (1) EC of the breast, (2) EC of the surrounding medium, and (3) EC in the neighboring tissue.

Physiological fundamentals
In healthy tissue, the major regulation and control of dermal circulation is neurovascular, i.e. through the sympathetic nervous system. Its sympathetic response includes both adrenergic and cholinergic. The former causes vasoconstriction (VC, narrowing of blood vessels); conversely, the latter leads to vasodilation (VD, widening of blood vessels). The difference between VC and VD is presented in Fig. 1.
In the early stages of cancer growth, cancer cells produce nitric oxide (NO), resulting in VD. Tumor cells then initiate angiogenesis, which is necessary to sustain breast tumor growth. Both VD and angiogenesis lead to increased blood flow; therefore, the increased heat released as a result of increased blood flow to the tumor results in hotter areas than healthy skin.
The thermogram of a healthy person is symmetrical across the midline. Asymmetry in the thermogram might signify an abnormality, or even a tumor. Therefore, the thermogram illustrates the status of the breast and presence of breast diseases by identifying asymmetric temperature distribution.
Despite this, previous studies [7,8,14] have not measured asymmetry directly. As an alternative, those papers employed texture or statistical measures. As a result, this study did not use asymmetry information, and treated each side image (left breast or right breast) as individual images.
Since our dataset is multi-source, we normalized all the collected images using preprocessing techniques. These included: (I) crop: remove background contents and only preserve the breast tissue and (II) resize: all images were re-sampled to the size of [128 × 128 × 3]. Suppose original image is x 1 (t), t ∈ [1, 480]. After Step I, we have where (l t , r t , t t , b t ) are four parameters denotes left, right, top, and bottom margins of t-th image to be cropped. Finally, after Step II, we have all the images x 3 (t) ∈ X 3 Note some BT images used different pseudo colormaps (PCMs). For example, some used yellow to denote high temperature while some used red; conversely, some used blue to denote low temperature while some used green. We did not apply the same PCM to all BT images within our dataset for four reasons: (1) we expected our AI model would learn to determine a diagnosis based on color difference, not the color itself; (2) humans can make a diagnosis regardless of the PCM configuration, so we believed AI can do the same; (3) we expected our AI model can be universal, i.e., PCMindependent and (4) mixing of PCM color schemes in the training set can help make our AI model more robust when analyzing the test set, i.e., it does not require a particular PCM scheme. Figure 2 shows a DCIS case, where we can clearly see the temperature difference of the lesion and the surrounding healthy tissues. All the images included in our dataset were checked by agreement of two professional radiologists (R 1 , R 2 ) with more than 10 years of experience. If their decisions [H (R 1 ), H (R 2 )] agreed, then the images were labelled correspondingly, otherwise, a senior radiologist (R 3 ) was consulted to achieve a consensus: Here H is the labelling result, M denotes the majority voting, H [x 3 (t), (R 1 , R 2 , R 3 )] denotes the labelling results by all three radiologists.

Improvement 1: exponential linear unit
The activation function mimics the influence of an extracellular field on a brain axon/neuron. The real activation function for an axon is quite complicated, and can be written as where n means the index of axon's compartment model, c the membrane capacity, R n the axonal resistance of compartment n, V e n the extra-cellular voltage outside compartment n relative to the ground [18]. This is difficult to determine in an "artificial neural network", and thus AI scientists designed some simplistic and ideal activation functions (AFs), which have no direct connection with the axon's activating function, but those AFs work well for ANNs [19].
An important property of AF is nonlinearity. The reason is stacks of linear function will also be linear, and those kinds of linear AFs can only solve trivial problems and cannot make decisions. Only nonlinear AF can allow neural networks to solve non-trivial problems, such as decision-making. Similar ideas were mentioned as "even our mind is governed by the nonlinear dynamics of complex systems" by Mainzer [20].
Suppose the input is t, traditional rectified linear unit (ReLU) [21] f ReLU is defined as with its derivative as When t < 0, the activation of f ReLU values are set to zero, so ReLU cannot train the networks via gradient-based learning. Clevert et al. [22] proposed the exponential linear unit (ELU) ELU's derivative is The default value of γ 1. Figure 3 represents the shapes of five different but common AFs. Each subplot has the same range on the x-axis and y-axis for easy comparison. Information regarding the three AFs (Sigmoid, HT, and LReLU) can be found in "Appendix B".

Improvement 2: rank-based weighted pooling
The activation maps (AMs) after conv layer are usually too large, i.e., the size of their width, length, and channels are too large to handle, which will cause (1) overfitting of the training set and (2) large computational costs. Instead pooling layer (PL) is a form of nonlinear downsampling (NLDS) used to solve the above issue. Further, PL can provide invariance-totranslation properties to the AMs.
For a 2 × 2 region, suppose the pixels within the region ϕ i j , (i 1, 2, j 1, 2) are Strided convolution (SC) can be regarded as a convolution followed by a special pooling. If the stride is set to 2, the output of SC is: The shortcoming of SC is that it will miss stronger activations if ϕ 1,1 is not the strongest activation. The advantage of SC is the convolution layer only needs to calculate 1/4 of all outputs in this case, so it can save computation.
L2P calculates the l 2 norm [23] of a given region . Assume the output value after NLDS is y, L2P output y L2P is defined as y L2P sqrt In this study, we add a constant 1/| |, where | | means the number of elements of region . Here | | 4 if we use a 2 × 2 NLDS pooling. This added new constant 1/4 does not influence training and inference.
The average pooling (AP) calculates the mean value in the region as The max pooling (MP) operates on the region and selects the max value. Note that L2P, AP and MP work on every slice separately.
Rank-based weighted pooling (RWP) was introduced to overcome the down-weight (DW), overfitting, and lack of generation (LG) caused by the above pooling methods (L2P, AP, and MP). Instead of computing the l 2 norm, average, or the max, the output of the RWP y RWP is calculated based on the rank matrix. First, rank matrix (RM) R {r m } is calculated based on the values of each element ϕ m ∈ , usually lower ranks r ∈ R are assigned to higher values (ϕ) as In case of tied values (ϕ m1 ϕ m2 ), a constraint is added as Second, (ER) map E {e m } is defined as where α is a hyper-parameter. α 0.5 for all RWP layers, so we do not need to tune α in this study. Equation (18) can be updated as Third, RWP [24] is defined as the summation of ϕ i j and e i j as below Figure 7 in "Appendix C" gives a schematic comparison of L2P, AP, MP, and RWP. Step 2 for = 1: % is row index For better understanding, a pseudocode of RWP is presented in Table 1. We suppose there is an activation map X AM with size of [R, C], where R means the number of rows, and C means the number of columns. Note row index is set to r and column index c. The RWP output of X AM is symbolized as y RWP with size of R 2 , C 2 . Table 2 itemizes the equations of every pooling methods.

Improvement 3: L-way data augmentation
Traditional data augmentation is a strategy that enables AI practitioners to radically increase the diversity of training data, without collecting new data actually. In this study, we proposed a L-way data augmentation (LDA) technology to further increase the diversity of the training data. The whole preprocessed image set X 3 , from Eq. (4), will separate into K folds: where k represents the fold index. At k-th trial, fold k will be used as the test set D k , and other folds will be used as the training set C k : If we do not consider the index k, and just simplify the situations as X 3 → {C, D}, for each training image c (k) ∈ C, k 1, . . . , |C|, we will do the following eight DA techniques. Here we suppose each DA technique will generate W new images.
(1) Gamma correction (GC). The equations are defined as: where η GC j ( j 1, . . . , W ) are GC factors. (2) Rotation. Rotation operation rotates the original image to produce W new images [25]: where η RO j ( j 1, . . . , W ) are rotation factors. (3) Scaling. All training images c(k) were scaled [25] as where η SC j ( j 1, . . . , W ) are scaling factors. (4) Horizontal shear (HS) transform. W new images were generated by HS transform where η HS j ( j 1, . . . , W ) are HS factors. (5) Vertical shear (VS) transform. VS transform was generated similarly to HS transform where where is the maximum shift factor. (7) Color jittering (CJ). CJ shifts the color values in original images [26] by adding or subtracting a random value. The advantage of CJ is it can help bring in randomness change to the color channels, so it can aid production of fake color images: where CC means color channel. means maximum color shift value. (8) Noise injection. The 0-mean 0.01-variance Gaussian noises [27] were added to all training images to produce W new noised images: where NO denotes the noise injection operation. (9) Mirror and concatenation. All the above L/2 results are mirrored, we have where M represents the mirror function. All the results are finally concatenated as The size of

Proposed models and algorithm
We proposed five models in total in this study.  Fig. 4a shows the activation maps of the proposed Model-0. Here the size of input was S 0 128 × 128 × 3, the first conv block is composed of one conv layer, one activation function layer, and one pooling layer. After conv layer, S 1 128 × 128 × 32. Then after the activation function layer, the output is the same as S 1 . After the pooling layer, the size is S 2 64 × 64 × 32. The conv block then repeats three times, we have S 3 64 × 64 × 64 and S 4 32 × 32 × 64 for the second conv block, S 5 32 × 32 × 128, and S 6 16 × 16 × 128 for the third conv block, S 7 16 × 16 × 256 and S 8 8 × 8 × 256 for the four conv block. Then S 8 was flattened and passed through the first fully connected layer with output as S 9 1 × 1 × 50. The output of the second fully connected layer was S 10 1 × 1 × 2.

Measures
The randomness effect of each run reduced performance reliability, so we used K -fold cross validation to analyze unbiased performances. The size of each fold is |X 3 |/K . Due to there being two balanced classes (DCIS and HB), each class will have |X 3 |/(2 × K ) images. The split setting of one trial is shown in Table 6. Within each trial, (K − 1) folds were used as training, and the rest fold were used as test. After combining all K trials, the test image grew to |X 3 |. If above K -fold cross validation repeats Z runs, the performance will be reported on |X 3 | × Z images.
Suppose the ideal confusion matrix E ideal over the test set at k-th trial and z-th run is where the constant 2 is because our dataset is a balanced, i.e., DCIS class has the same size of HB. After combining K trials, the ideal confusion matrix is at z-th run is In realistic inference, we cannot get the perfect diagonal matrix as shown in Eq. (36), suppose the z-th run real confusion matrix is where 0 ≤ a, b, c, d ≤ |X 3 |/2. The four variables (a, b, c, d) represent TP, FN, FP, and TN, respectively. Here P means DCIS and N means healthy breast (HB). Four simple measures ν 1 , ν 2 , ν 3 , ν 4 can be defined as where ν 1 (z), ν 2 (z), ν 3 (z), ν 4 (z) means sensitivity, specificity, precision, and accuracy at z-th run, respectively. Besides, F1 score ν 5 (z), Matthews correlation coefficient (MCC) ν 6 (z), and Fowlkes-Mallows index (FMI) ν 7 (z) can be defined as: After averaging Z runs, we can calculate the mean (M) and standard deviation (SD) of all k-th (∀k ∈ [1, 7]) measures as The result is reported in the format of M ± SD. For ease of typing, we write it in short as MSD. Table 4 shows the parameter setting of variables in this study. The values were obtained using trial-and-error. The total size    Table 6 shows the K-fold cross validation setting, which was used in the experiment to report unbiased performances [28]. For each trial, the training image set contained 216 DCIS and 216 HB images. Then after L-way data augmentation, the LDA training set contained 103,896 images for each class, and thus together |DA(C)| 207, 792 images. The size of the test set during each trial was only 48 images. Combining 10 trials, the final combined test set is the same as the original dataset of 480 images.

Statistical result of proposed model-4
The ten runs of our Model-4 results are shown in Table  7

Model comparison
We next compared the Model-4 CNN-BDER result with other four models (Model-0 BCNN, Model-1 CNN-BD, Model-2 CNN-BDE, and Model-3 CNN-BDR). The comparison results are shown in Table 8. Here, Model-4 CNN-BDER yielded the best results among all five models. Note that ν 2 and ν 3 of Model-3 CNN-BDR are quite close to those of Model-4 CNN-BDER, but considering the results were obtained using an average of ten runs, we can still conclude that Model-4 CNN-BDER has higher accuracy than Model-3 CNN-BDR in terms of all seven indicators. Kruskal-Wallis test was preformed based on Model-4 against Model-(m), where m 0, 1, 2, 3. The p value result matrix P is listed in Table 9. The null hypothesis is the indicator vector ν n (n 1, . . . , 7) of Z runs of Model-(m) and that of Model-4 come from the same distribution, and the alternative hypothesis that not all samples are obtained from the same distribution. Then we recorded the corresponding p value as p(m, n). The final matrix P [ p(m, n)], m 0, . . . , 3, n 1, . . . , 7. Note here we chose Z 30. The reason is our data are not normally distributed (see Table 7), so it is important to obtain a larger sample set.  The first row and second row of Table 9 show that all p values are < 0.05. So, the test rejects the null hypothesis at the 5% significance level, indicating that Model-4 is significantly better than Model-0 and Model-1 for all seven indicators. For the third row, the p values show that Model-4 is significantly better than Model-2 for all indicators other than sensitivity ν 1 . For the last row, the p values show that Model-4 is significantly better than Model-3 for all indicators other than specificity ν 2 and precision ν 3 . Table 10 presents the results of not using LDA, showing decreased accuracy compared to those using LDA and highlights the effectiveness of our proposed LDA. The future research direction is to explore more types of DA techniques and increase the diversity of LDA, hence, improving the generalization ability of our AI models. Note that Model-0 BCNN and Model-1 CNN-BD without LDA obtain performances lower than 90%, which are worse than traditional AI methods that do not utilize deep learning. This means deep learning with big data can improve performance, if we do not have big data (not using data augmentation means our training set is only 432 images as shown in Table 6), then deep models may not compete with traditional shallow models. Figure 5 summarizes and compares all ten models, where LDA and NLDA represent use and non-use of LDA, respectively. From Fig. 5 we can clearly observe that our Model-4 CNN-BDER using LDA can obtain the best performance among all six models.

Effect of LDA
Here we do not run hypothesis test, since all the models without LDA show reduced performance than the mod-els with LDA. We have already proven that the statement "Model-4 is better than Models-(0-3)" is statistically significant, so we can conclude that Model-4 is better than Models without LDA.

Comparison to state-of-the-art approaches
Our proposed Model-4 CNN-BDER was compared with state-of-the-art approaches. First, we used the 40-image dataset in Ref. [12]. The comparison results are presented in Table 11. Note here the performance of our Model-4 differs from previous experiments, because we analyzed a smaller dataset (40-images). The reason why our method is better than SMFD [12] is because SMFD, i.e., statistical measure and fractal dimension, can help extract statistical and global texture information, but it is inefficient in extracting local information.
The results in Table 12 showed that our Model-4 CNN-BDER method performed better than eight state-of-the-art approaches. Except MCC ν 6 , the other six indicators of our method are greater than 93%. While, the second best method is SMFD [12], whose seven indicator values are all less than 91%. SMFD [12] can help extract statistical and global texture information, but it is inefficient when  extracting local information. WEE [9] has a similar problem, since wavelet energy entropy uses wavelet to extract multi-resolution information, and a higher decomposition level of wavelet can extract finer-resolution. But it is difficult to run high-level decomposition in practice. Hence, the information from WEE [9] is mostly at a coarse level. CRNBC [7] and CRSVM [7] used co-occurrence matrix (COM) and run length matrix (RLM) as the feature extraction method, and employed naive Bayesian classifier (NBC) and support vector machine (SVM) as classifiers. COM computes the distribution of co-occurring pixel values at given offsets, while RLM computes the size of homogeneous runs for each grey level. Both features are easy to implement for computer scientists, but their capability of distinguishing tumors from surrounding healthy issues needs to be verified. Also, NBC and SVM are traditional classifiers, whose performances are not as high compared to recent deep learning approaches. SMMP [10] combined self-organizing map (SOM) and multilayer perceptron (MLP) methods. SOM used unsupervised learning to generate a low-dimensional discretized representation of the input space from the training image samples, while MLP has only one hidden layer that may limit its expressivity power. WESVM [13] used wavelet energy support vector machine as the classifier. However, wavelet energy is not a popular feature descriptor, whose improvements and modifications on wavelet energy are still in progress. The two worst methods are NBC [8] and HMI [11]. The former assumes the presence/absence of a feature of a class is unrelated to the presence/absence of any other features; however, this assumption is difficult to fulfil in practice. The latter employed seven Hu moment invariants as feature descriptors, which may be insufficient to capture information regarding breast cancer masses. The performance can be improved by combining with other feature descriptors. In all, Table 12 shows the improved performance of our Model-4 CNN-BDER method.

Comparison to manual diagnosis
Three experienced radiologists (P 1 , P 2 , P 3 ) were invited to independently inspect our dataset of 480 thermogram images. None of the radiologists had observed any of the images in advance.
The results of three radiologists are itemized in Table 13. The first radiologist (P 1 ) obtained a sensitivity of 71.67%, a specificity of 74.17%, a precision of 73.50%, and an accuracy of 72.92%. The second radiologist (P 2 ) obtained the four indicators as 81.25%, 73.75%, 75.58%, and 77.50%, respectively. The third radiologist P 3 obtained the four mea-sures as 75.42%, 82.50%, 81.17%, and 78.96%, respectively. Comparing Table 13 with our method Model-4, which is also illustrated in Fig. 6, from which we can see that our proposed CNN-BDER method can give higher performance than manual diagnosis. The reason may be DCIS is Stage 0 of breast cancer, so some lesions are difficult to discern by radiologists while AI can potentially capture those slight and minor lesions.

Conclusions
We built a new DCIS detection system based on breast thermal images. The method CNN-BDER is based on convolutional neural network, and CNN-BDER has three contributions: (1) use of exponential linear unit to replace traditional ReLU function; (2) use of rank-based weighted pooling to replace traditional max pooling and (3) A L-way data augmentation was proposed.
The shortcomings of our proposed Model-4 are threefold: (1) the model has not been verified clinically, but will certainly form the basis of future studies; (2) the model does not work with mammogram images, so we will aim to develop a hybrid model in the future which can help give predic- The future direction will be following aspects: (1) try to expand the dataset and introduce more thermal images; (2) move our AI system online and allow radiologists worldwide to test our algorithm and (3) test other advanced AI algorithms.
where parameter β 0.01 is the commonly pre-assigned value. Its derivative is defined as Appendix C Using Fig. 7 as an example, and assuming the region (1, 1) at 1st row 1st column of the input AM I is chosen as (1, 1) I (row 1, col 1), the row vector of (1, 1) is vec [ (1, 1)