Brain Tumor Classification Based on Attention Guided Deep Learning Model

Cancer is the second leading cause of death worldwide, and brain tumors account for one in every four cancer deaths. An accurate and timely diagnosis enables timely treatment. In recent years, the rapid development of image classification has facilitated computer-aided diagnosis. The convolutional neural network (CNN) is one of the most widely used neural network models for classifying images. However, its effectiveness is limited because it cannot accurately identify the focal point of a lesion. This paper proposes a novel brain tumor classification model that integrates an attention mechanism and a multipath network to address these issues. The attention mechanism selects the critical information belonging to the target region while ignoring irrelevant details. The multipath network assigns the input to multiple channels, transforms each channel, and merges the results of all branches; it is equivalent to grouped convolution, which reduces model complexity. Experimental evaluation of this model on a dataset of 3064 MR images achieved an overall accuracy of 98.61%, which outperforms previous studies on this dataset.


Introduction
According to the World Health Organization, cancer is the second leading cause of human death, and brain tumors account for one in every four cancer deaths. According to a study conducted in 2019, 17,000 people die of brain cancer each year in the United States, and the five-year survival rate after diagnosis was 34% for men and 36% for women. The earlier the diagnosis and treatment, the better the rehabilitation outcome and the longer the survival. Medical imaging is considered one of the most significant advances in improving clinical cancer diagnosis, and magnetic resonance imaging (MRI) is widely used to examine abnormalities in brain tumors. However, reading MRI scans is time-consuming, requires experienced radiologists, and is prone to human error. In recent years, computer-aided diagnosis has become popular in medical research, and with the rise of deep learning, deep neural networks are playing an increasingly important role in image classification. In a traditional convolutional neural network (CNN), the pooling layer directly uses max pooling or average pooling to compress the image information and reduce computation, so critical information may fail to be identified. In addition, a traditional CNN requires a deep network to extract features effectively, and the resulting large number of parameters is a challenge for limited computing resources. To solve these issues, this paper proposes a novel brain tumor classification model that integrates a multipath network and an attention mechanism [1,2]. By introducing an attention mechanism, our network can focus on the critical information among many input sources. The multipath network assigns the input to multiple channels, transforms each channel, and finally merges the results of all branches. The multipath network is equivalent to grouped convolution, which reduces the hyper-parameters and model complexity and thereby improves model performance.
Increasing the number of paths in the network can efficiently improve the accuracy compared with increasing the width and depth of the network. Our experiments showed that our model could obtain a satisfactory classification accuracy of 98.61% on the brain tumor dataset.
The paper's contributions are as follows: 1. The paper presents a novel brain tumor classification model that integrates a multipath network and an attention mechanism. 2. By introducing an attention mechanism, our network can focus on critical information and ignore the rest. 3. The multipath network assigns the input to multiple channels, thereby increasing the number of paths in the network and efficiently improving accuracy.
The remainder of the paper is arranged as follows: Sect. 2 reviews the related work, Sect. 3 presents our models for brain tumor classification, Sect. 4 evaluates our model's efficiency through experiments, and Sect. 5 concludes the paper.

Related Works
In the past few years, many researchers have proposed solutions for identifying brain tumors in MRI images, and many of the experiments were carried out on the figshare brain tumor dataset [3,4]. One approach took the tumor-enlarged region as the region of interest and divided it into sub-regions using an adaptive space-division method; using the extracted intensity histogram, gray-level co-occurrence matrix (GLCM), and Bag-of-Words features, it achieved an accuracy of 87.54%. [5] deployed the Fisher vector to aggregate the local characteristics of each sub-region, achieving a mean average precision (mAP) of 94.68%. Similarly, [6] extracted statistical features from MRI slices using the 2D discrete wavelet transform (DWT) and Gabor filtering; a backpropagation multilayer perceptron neural network was then used to classify brain tumors and achieved an accuracy of 91.9%. [7] deployed a probabilistic neural network (PNN) to classify brain tumors and achieved an accuracy of 83.33% after image filtering, sharpening, resizing, and extraction of GLCM features. Traditional machine-learning algorithms and models require specialist domain knowledge and experience. Moreover, these methods rely on segmentation and manual extraction of statistical features, which can reduce both accuracy and system performance. These issues can be addressed by using convolution layers to extract and distinguish features automatically. GLCM features have been used in CNNs [8] and can improve accuracy by 20%, to a reported accuracy of 82%. [9] proposed a capsule network (CapsNet) for brain tumor classification; the CapsNet was used with one convolution layer and 64 feature maps and achieved an accuracy of 86.56%. Over the past few years, attention models have been used in various deep-learning applications, such as natural language processing, image recognition, and speech recognition.
Attention mechanisms have been shown to improve performance on a range of tasks, from image localization to sequence-based models. They can also be integrated with gating functions and sequence techniques, and are usually placed on top of one or more layers to represent different levels of abstraction. A highway network uses a gating mechanism to regulate its shortcut connections, which makes it easier to apply in deep learning architectures.
Inspired by the success of semantic segmentation, [10] developed a powerful trunk-and-mask attention mechanism using an hourglass module; this high-capacity unit is inserted between intermediate stages of a deep residual network. [11] developed a channel attention mechanism that adaptively recalibrates channel-wise feature responses, while [12] applied attention to feature maps through two network branches. Our network integrates SE modules and Gated Channel Transformation to further enhance the extraction of high-level features. Experiments demonstrate that our network achieves highly desirable accuracy on the brain tumor dataset.

Brain Tumor Classification
Two issues complicate the classification of brain tumors in MRI images: background interference and insufficient data. This paper proposes a new type of deep neural network, whose structure is shown in Fig. 1. The network core consists of dual-attention blocks that are stacked to construct the network; each block is generated by adding an SE module and Gated Channel Transformation to a multipath block [13].

Dual-Attention Network Architecture
Our network uses a new type of Gated Channel Transformation (GCT) layer, which exploits channel relationships to improve the recognition capability of deep CNNs. By efficiently combining normalization with a gating mechanism, GCT promotes competition and cooperation between neurons while adding negligible parameter complexity. With this method, our network can better extract features from brain tumor images. After integrating SE attention modules, our network carries out the brain tumor classification task.

Gated Channel Transformation
A channel normalization layer can efficiently reduce both the number of parameters and the computational complexity. This lightweight layer contains a simple ℓ2 normalization, which allows the transformation unit to act directly on the feature channels without needing additional parameters. GCT constructs competition or cooperation relationships between channels through this normalization method [14]. GCT uses a global context embedding operation, which gathers context information, together with an adaptive gating operation based on the normalized output. The GCT unit thus helps to model the relationships between channels.
GCT units can help to efficiently process contextual information. Normalization builds a competitive or cooperative relationship between channels even though it is a parameter-less operation. To make GCT learnable, this paper adopts a global context embedding operation, which embeds context information and controllable parameters before the channels are normalized. In addition, an adaptive gating operation adjusts the channel characteristics based on the normalized output. Since the number of trainable parameters in GCT is small, GCT is relatively easy to deploy; moreover, the intuitive GCT parameters visually explain its behavior, as shown in Fig. 2. Assuming that x ∈ ℝ^(C×H×W) represents an activation feature in the CNN, GCT performs the transformation x̂ = F(x | α, γ, β), where α contributes to the adaptability of the embedding output, and γ and β are used to control the activation threshold; they determine the GCT behavior in each channel. The GCT complexity is O(C), while the complexity of the SE module is O(C²).
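As an illustration, the GCT forward pass can be sketched in NumPy, following the α-scaled ℓ2 embedding, channel normalization, and tanh gating described above (the formulation of the GCT paper [14]); the function and variable names are ours, not from the authors' implementation.

```python
import numpy as np

def gct(x, alpha, gamma, beta, eps=1e-5):
    """Gated Channel Transformation sketch for a (C, H, W) activation.

    alpha scales the per-channel l2-norm embedding, while gamma and
    beta control the tanh gate's activation threshold (all shape (C,)).
    """
    C = x.shape[0]
    # Global context embedding: per-channel l2 norm scaled by alpha.
    s = alpha * np.sqrt((x ** 2).sum(axis=(1, 2)) + eps)       # (C,)
    # Channel normalization: builds competition/cooperation across channels.
    s_hat = np.sqrt(C) * s / np.sqrt((s ** 2).sum() + eps)     # (C,)
    # Adaptive gating: 1 + tanh keeps an identity path when gamma = beta = 0.
    gate = 1.0 + np.tanh(gamma * s_hat + beta)                 # (C,)
    return x * gate[:, None, None]

x = np.random.randn(8, 16, 16)
# With gamma = beta = 0 the gate reduces to the identity mapping.
y = gct(x, alpha=np.ones(8), gamma=np.zeros(8), beta=np.zeros(8))
```

Note the identity initialization: the gate only departs from a plain pass-through as γ and β are learned, which is part of why GCT is easy to insert into an existing backbone.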

Multipath Network Architecture
Our network adopts a multipath structure similar to that of ResNeXt. The cardinality is the number of paths in each block of the multipath network; it is a distinct dimension besides width and depth. Experiments showed that increasing cardinality is a more effective way to improve accuracy than deepening or widening the network. As shown in Fig. 3, the multipath network module is composed of multiple parallel paths, each consisting of a 1 × 1 convolution layer, a 3 × 3 convolution layer, and a GCT module. In addition, the module uses a residual connection, and its output is obtained by adding the input to the merged outputs of all paths.
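Because the multipath design is equivalent to grouped convolution, its parameter savings can be checked with simple arithmetic. The helper below is a sketch with hypothetical names, comparing a standard 3 × 3 convolution against one split into 32 paths (cardinality 32):

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a k x k convolution split into `groups` paths.

    Each group convolves c_in/groups input channels to c_out/groups
    output channels, so the total parameter count shrinks by `groups`.
    """
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * (c_out // groups) * k * k * groups

dense   = conv_params(256, 256, 3, groups=1)   # standard 3x3 convolution
grouped = conv_params(256, 256, 3, groups=32)  # 32 parallel paths
print(dense, grouped, dense // grouped)        # the grouped form is 32x smaller
```

This is why increasing cardinality can raise accuracy without the parameter cost of widening or deepening the network.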

Dual-Attention Block
In a traditional CNN, max pooling or average pooling is used to compress image information and reduce the amount of computation, although key information may be lost as a result. A traditional CNN also requires a deep network to extract features effectively, and the resulting large number of parameters is a great challenge for limited resources. To overcome this problem, we propose a dual-attention block (see Fig. 1), which uses limited computing resources to quickly filter out high-value information. An attention module is added to each multipath network block to make full use of deep features: for each input tensor X ∈ ℝ^(W′×H′×C′), a new tensor U ∈ ℝ^(W×H×C) is obtained after the convolution operation F_tr. After fixing the channel dimension of U, each feature map is processed immediately.
The channel descriptor is of size 1 × 1 × C, i.e., one scalar describing each feature map. For each channel's feature map u_c, Global Average Pooling (GAP) is performed as shown in Formula (2):

z_c = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} u_c(i, j)   (2)

The average value z_c is obtained by summing all the values of the matrix represented by the feature map. Next, the previously obtained 1 × 1 × C channel descriptor is converted into a weight for each channel, as shown in Formula (3):

s = σ(W₂ δ(W₁ z))   (3)
x̃ is then regenerated from U, as shown in Formula (4):

x̃_c = s_c · u_c   (4)

In Formula (3), δ is the ReLU activation and σ is the sigmoid gate; the two FC weight matrices W₁ and W₂ form a 'bottleneck' around the nonlinear mapping and are used to limit the complexity of the model and improve its generalization capability. After this gate-like structure is obtained, the gate s_c of each channel is multiplied by the corresponding feature map u_c, so that the information flow of each feature map can be controlled and the deep information of the image data can be used hierarchically to obtain better results. GCT is then applied to the backbone network to improve the feature-extraction capability for visual recognition tasks.
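A minimal NumPy sketch of this squeeze-and-excitation style gating (GAP, a two-layer FC bottleneck, a sigmoid gate, then channel-wise rescaling), following the standard SE formulation; the weights here are random placeholders and the names are ours, not the authors' implementation:

```python
import numpy as np

def se_block(u, w1, w2):
    """Squeeze-and-Excitation sketch for a (C, H, W) feature map U.

    w1: (C/r, C) squeeze FC weights; w2: (C, C/r) excitation FC
    weights, where r is the bottleneck reduction ratio.
    """
    z = u.mean(axis=(1, 2))                   # GAP: one scalar per channel
    hidden = np.maximum(w1 @ z, 0.0)          # ReLU bottleneck
    s = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))  # sigmoid gate in (0, 1)
    return s[:, None, None] * u               # rescale each feature map

C, r = 16, 4
rng = np.random.default_rng(0)
u = rng.standard_normal((C, 8, 8))
out = se_block(u, rng.standard_normal((C // r, C)),
               rng.standard_normal((C, C // r)))
```

Because the gate values lie strictly between 0 and 1, each channel's feature map is attenuated in proportion to its learned importance rather than being discarded outright.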
Our model is shown in Fig. 4. Our network stacks eight dual-attention blocks. In the figure, C = 32 denotes a cardinality of 32, while the following number denotes how many times the block is repeated.

Network Optimization Algorithm
Several optimization methods are used to speed up convergence and improve classification accuracy during training. This paper uses Rectified Adam (RAdam) as the optimizer and stochastic weight averaging (SWA) for weight smoothing.

Fig. 3 The multipath network architecture

Rectified Adam is Used to Improve Convergence Accuracy
Balancing convergence speed and accuracy is difficult. Stochastic Gradient Descent (SGD) and Adaptive Moment Estimation (Adam) are two common optimization methods used to train deep neural network models [15]. RAdam is an improvement on the adaptive stochastic optimization algorithm Adam [16]: it decides whether to activate the adaptive learning rate according to its variance. RAdam converges as fast as Adam while exhibiting gradient-update behavior similar to SGD, which helps the network achieve greater accuracy. RAdam can therefore be used to effectively balance the convergence speed and accuracy of training.
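The variance-based decision can be sketched as follows, computing the rectification term from the RAdam paper [16]; the function name is ours, and returning `None` stands in for falling back to a plain SGD-style update:

```python
import math

def radam_rectifier(t, beta2=0.999):
    """Variance rectification term of RAdam at step t (sketch).

    Returns None while the variance of the adaptive learning rate is
    intractable (the early steps), signalling a non-adaptive update.
    """
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    rho_t = rho_inf - 2.0 * t * beta2 ** t / (1.0 - beta2 ** t)
    if rho_t <= 4.0:  # variance not tractable: keep adaptivity off
        return None
    return math.sqrt(((rho_t - 4) * (rho_t - 2) * rho_inf)
                     / ((rho_inf - 4) * (rho_inf - 2) * rho_t))

# Early steps use SGD-like updates; much later the rectifier nears 1,
# so the update approaches a standard Adam step.
early, late = radam_rectifier(1), radam_rectifier(10000)
```

This warm-up behavior is what lets RAdam combine Adam's convergence speed with SGD-like stability early in training.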

Weight Smoothing to Improve the Generalization
SWA can improve model generalization during convergence [17] and can find wide local optima during training, whether the trajectory is stable or fluctuates randomly. It is usually applied in the final stage of training; the smoothed weights are computed as follows:

w_SWA = (1 / (n − range)) Σ_{i=range+1}^{n} w_i

where n represents the total number of training rounds, range represents the number of initial rounds that do not participate in the weight smoothing process (together, n and range determine the round at which smoothing starts), and w_i represents the model weights at round i.
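This averaging step can be sketched directly, assuming `weights` holds the per-round weight vectors and the first `rng_rounds` rounds are excluded from smoothing (parameter names are ours; `range` is avoided because it shadows a Python builtin):

```python
import numpy as np

def swa_average(weights, n, rng_rounds):
    """Average the per-round weight vectors from round rng_rounds+1
    through round n (1-indexed rounds; weights is a 0-indexed list)."""
    tail = weights[rng_rounds:n]
    return np.mean(tail, axis=0)

# Toy run: one noisy warm-up round, then weights oscillating around 1.0.
weights = [np.array([10.0]), np.array([0.9]), np.array([1.1]),
           np.array([0.9]), np.array([1.1])]
w_swa = swa_average(weights, n=5, rng_rounds=1)  # averages rounds 2..5
```

Excluding the warm-up round keeps the large early weight from dragging the average away from the flat region the optimizer settles into.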

Environment Configuration and Parameter Setting
The experimental environment configuration and the internal parameter settings of the dual-attention network are shown in Tables 1 and 2. Several standard performance indicators are used to evaluate the classifier. Classification accuracy is the most widely used quality index and is defined as the ratio of the number of correctly classified samples to the total number of samples. The other essential metrics are precision, recall (or sensitivity), and specificity, calculated as follows:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Specificity = TN / (TN + FP)
Accuracy = (TP + TN) / (TP + TN + FP + FN)

A combined measure for each category, known as the F score, is F_c = 2 · Precision_c · Recall_c / (Precision_c + Recall_c). Due to the imbalance between the three categories, the average F score is calculated as F_avg = (1/3) Σ_c F_c.
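These metrics can all be derived from a multi-class confusion matrix via a one-vs-rest decomposition. The sketch below uses toy counts for a three-class problem (not our experimental results), with hypothetical helper names:

```python
import numpy as np

def per_class_metrics(cm):
    """Per-class precision, recall, specificity and F score from a
    confusion matrix cm where cm[i, j] counts true class i predicted
    as class j (one-vs-rest decomposition)."""
    total = cm.sum()
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp          # predicted c but actually other
    fn = cm.sum(axis=1) - tp          # actually c but predicted other
    tn = total - tp - fp - fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)           # a.k.a. sensitivity
    specificity = tn / (tn + fp)
    f_score = 2 * precision * recall / (precision + recall)
    accuracy = tp.sum() / total
    return precision, recall, specificity, f_score, accuracy

cm = np.array([[50,  2,  1],          # toy counts for three classes
               [ 3, 60,  2],
               [ 1,  1, 45]])
p, r, s, f, acc = per_class_metrics(cm)
f_avg = f.mean()                      # macro-average over the classes
```

The macro-averaged F score weights each tumor class equally, which is why it is the fairer summary under class imbalance.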

Dataset Description
The dataset used in this paper is the T1-weighted CE-MRI brain image dataset from figshare [3], which contains three types of tumor images: gliomas, meningiomas, and pituitary tumors (see Fig. 5). The dataset contains a total of 3064 image slices taken from 233 patients. In practical clinical applications, a limited number of brain slices are usually collected, with large intervals between the slices, so we extract features from two-dimensional images. Slices with larger tumor areas were chosen to form the dataset. The size of each image is 512 × 512 pixels, with a pixel size of 0.49 × 0.49 mm.
The tumor area in each image was manually marked by a radiologist.

Image Pre-Processing and Data Augmentation
The MRI images of brain tumors are grayscale, so each brain tumor image is replicated three times to form a three-channel image. These images form the input layer of the network and were normalized and resized to 224 × 224 pixels. To augment the dataset [18], the following data-enhancement operations were performed on the brain tumor images.
(1) Image vertical flip: the image is flipped by 180° about the horizontal axis. In this way, our dataset was expanded to 9192 images. The whole process is shown in Fig. 6. The designed network was evaluated on the figshare dataset using five-fold cross-validation [19]. The entire dataset, based on images from 233 patients, was divided into five disjoint subsets of approximately equal size. One subset was selected as the test dataset, while the rest formed the training dataset; this process was repeated five times so that every subset served as the test dataset once. The dataset was divided in this way to ensure that a given patient's images were never present in the test and training datasets simultaneously.
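The patient-wise split described above can be sketched as follows; the helper and its names are hypothetical, and the toy mapping stands in for the real slice-to-patient index:

```python
import random
from collections import defaultdict

def patient_grouped_folds(slice_to_patient, k=5, seed=0):
    """Split slice indices into k folds such that all slices of any
    one patient land in the same fold (no patient straddles the
    train/test boundary)."""
    by_patient = defaultdict(list)
    for idx, pid in slice_to_patient.items():
        by_patient[pid].append(idx)
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)      # deterministic shuffle
    folds = [[] for _ in range(k)]
    for i, pid in enumerate(patients):         # deal patients round-robin
        folds[i % k].extend(by_patient[pid])
    return folds

# Toy example: 10 slices from 5 patients, two slices each.
mapping = {i: f"p{i // 2}" for i in range(10)}
folds = patient_grouped_folds(mapping, k=5)
```

Grouping by patient rather than by slice is what prevents near-duplicate slices of the same tumor from leaking between the training and test sets.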

Experimental Design and Result Analysis
After the proposed model was trained and evaluated, we validated the results with a five-fold cross-validation strategy. In our experiment, the classification accuracy of the proposed network was 98.61%. Table 3 lists the classification results of our network for the specific brain tumor types. The specificity, precision, and recall for all categories were high, which reflects the effectiveness of our network in classifying specific types of brain tumors.
We compared the performance of our method with existing methods on the same three-class brain tumor classification task. Table 4 compares the models by classification accuracy and shows that our model surpasses the other reported models; the table contains only accuracy as a performance indicator because it is the key indicator reported in all related works. Table 5 provides a more detailed comparison based on sensitivity and specificity measurements. Our proposed method also showed an improvement over other methods in terms of the F_avg score. Figure 7 shows that the accuracy was highest once the number of iterations exceeded 100.

Conclusion
This paper proposes a novel CNN architecture for classifying brain tumors. The classification is performed on a T1-weighted contrast-enhanced MRI dataset containing three tumor types. The proposed network integrates a multipath network and an attention mechanism to extract features from brain MRI images. The attention mechanism focuses on critical information, while the multipath network improves accuracy more efficiently than increasing the width or depth of the network. Our network achieved the highest accuracy when compared with previous related works. Our study also has some limitations. First, there is a certain degree of misclassification for meningioma samples. Second, overfitting is often observed when the training dataset is small; in future work, data augmentation can be used to increase the number of samples.

Data Availability All the data and codes can be accessed from the Internet.

Declarations
Conflict of Interest Not applicable.

Ethical Approval
The data have been stripped of personally identifying information, so the research does not involve personal privacy.

Consent for Publication
Dr. Wen Jun carried out the deep learning algorithm and drafted the manuscript, Mr. Zheng participated in all experiments and performed the statistical analysis. All authors read and approved the final manuscript.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.