An FA-SegNet Image Segmentation Model Based on Fuzzy Attention and Its Application in Cardiac MRI Segmentation

To address the segmentation of medical images with low recognizability and high background noise, a deep convolution neural network image segmentation model based on a fuzzy attention mechanism, called FA-SegNet, is proposed. It takes SegNet as the basic framework. In the down-sampling module for image feature extraction, a fuzzy channel-attention module is added to strengthen the discrimination of different target regions. In the up-sampling module for image size restoration and multi-scale feature fusion, a fuzzy spatial-attention module is added to reduce the loss of image details and expand the receptive field. Fuzzy cognition is thereby introduced into the feature fusion of CNNs: based on the attention mechanism, fuzzy membership is used to recalibrate the importance of pixel values in local regions. This strengthens both the discriminative ability of image features and the fusion of contextual information, which improves the segmentation accuracy of the target regions. Taking cardiac MRI segmentation as an experimental example, the left ventricle, right ventricle, and left ventricular myocardium are selected as the segmentation targets. The pixel accuracy is 92.47%, the mean intersection over union is 86.18%, and the Dice coefficient is 92.44%, all of which improve on the compared methods. The results verify the accuracy and applicability of the proposed method for medical image segmentation, especially for targets with low recognizability and serious occlusion.


Introduction
Medical images can intuitively reflect the 2D and 3D morphological features of human organs and tissues, with complex structures and diverse contents. Due to the influence of noise, field drift effect, offset deformation, gray value distortion, and local posture effect, medical images are often blurred, and individual differences further increase the difficulty of feature differentiation [1,2]. In the process of image segmentation, the fuzziness of the image and the characteristics of human vision introduce uncertainty, which makes segmentation difficult [3]. Therefore, studying efficient and accurate segmentation methods suitable for complex medical images is a challenging subject, and it is also one of the hotspots in the field of medical image processing [4,5].
In recent years, deep learning has been widely applied to the segmentation of different medical images, including CT, X-ray, PET, ultrasound, MRI and OCT. The Fully Convolutional Network (FCN) [6] is one of the most successful deep learning techniques for semantic segmentation. Zhao et al. proposed a brain tumor segmentation technique based on deep learning, which integrates an FCN and conditional random fields into a combined framework to achieve segmentation with appearance and spatial consistency [7]. Bai et al. proposed an aortic sequence segmentation algorithm that combines an FCN with a recurrent neural network for MRI images with sparse annotations [8]. Huang et al. proposed a two-stage framework for predicting nasopharyngeal carcinoma stages, which used VGG16 as the basic segmentation model [9]. Zhang explored a method of multi-scale feature mapping based on FCNs to pre-screen radiographs quickly and accurately in the aided diagnosis of pneumoconiosis staging [10]. However, these methods, which directly up-sample the feature map or fuse the features of pooling layers, lead to insufficient segmentation accuracy. They do not consider the relationship between pixels in the up-sampling decoding stage, so the segmentation results are insensitive to image details and fail to preserve them [11].
Another mainstream class of deep segmentation models is based on the encoder-decoder architecture. Ronneberger et al. proposed the U-Net model for segmenting biological microscopic images, which consists of two paths in series: a contracting path for capturing context and a symmetrical expanding path for accurate localization [12]. Badrinarayanan et al. proposed SegNet, a convolutional encoder-decoder architecture for image segmentation whose decoder up-samples its low-resolution input feature maps [13]. Dai et al. proposed the Structure Correcting Adversarial Network (SCAN) to segment the lung fields and heart in CXR images. This network applied a GAN to image segmentation [14]. Gao et al. presented a hybrid Transformer architecture that integrates self-attention into a convolutional neural network to enhance medical image segmentation, capturing long-range dependencies at different scales with minimal overhead [15]. The above methods establish effective principles and strategies for image segmentation. However, for medical images with noise and diverse content, segmentation is still unstable on low-resolution and fuzzy images, and the accuracy of target pixel classification is still low [16]. Moreover, the target boundaries in medical and biological images are fuzzy, and complex situations such as abnormal proportions and deformation are common, which makes these models poorly suited to complex medical image segmentation [17].
At present, the attention mechanism is a widely studied and applied feature enhancement method, which can be used as a module of an image segmentation model to focus on the segmentation targets of interest [18]. Chen et al. introduced an attention mechanism into the FCN to realize semantic image segmentation, which can evaluate the importance of features at different positions and scales [19]. Fu et al. proposed a dual-attention-based network for scene segmentation, which can adaptively combine local features with global dependencies and capture rich context dependencies [20]. Fuzzy theory is well suited to describing the uncertainty of images. Introducing fuzzy theory into the attention mechanism can not only describe the image characteristics effectively, but also strengthen the model's attention to the local target [21]. Lu et al. proposed a fuzzy attention-based DenseNet-BiLSTM method for Chinese image captioning. The introduction of the fuzzy attention mechanism effectively improves the correspondence between image features and contextual information [22]. Yao et al. introduced an attention mechanism based on fuzzy weighted entropy into a deep learning network and applied it to multi-channel facial expression recognition [23]. However, the above methods emphasize the application of fuzzy attention to channel features and strengthen the model's understanding of the global content. They remain limited for medical image segmentation, which needs a fine division of local regions.
In this paper, to address the semantic segmentation of medical images with low recognizability and high background noise, a deep convolution neural network image segmentation model based on a fuzzy attention mechanism, called FA-SegNet, is proposed by drawing on cognitive science and deep learning. In this method, the fuzzy attention mechanism is introduced into the feature fusion of CNNs to improve the segmentation quality and accuracy of the model. The main contributions of this paper are as follows:
• An FA-SegNet segmentation model, which introduces a fuzzy attention mechanism into a deep convolution neural network, is proposed for semantic segmentation in medical images.
• Fuzzy logic theory is introduced into the attention module, and fuzzy weights based on fuzzy cognitive logic are used to recalibrate the importance of feature elements, so as to realize the deep utilization of feature information and the fusion of context information.
• Experimental results on the ACDC dataset show that the proposed model achieves state-of-the-art results and verify the generalization ability of FA-SegNet.
The rest of the paper is organized as follows: In Sect. 2, the FA-SegNet model is established and its theoretical properties are analyzed. In Sect. 3, a comprehensive learning algorithm of FA-SegNet is proposed. In Sect. 4, segmentation experiments and result analysis are performed based on medical MRI images. Finally, the work of the paper is summarized, and the advantages and limitations of this method are pointed out.

Deep Convolution Neural Network Image Segmentation Model Based on Fuzzy Attention
In this section, the FA-SegNet segmentation model, which introduces a fuzzy attention mechanism into a deep convolution neural network, is proposed for the semantic segmentation of medical images with low recognizability and high background noise.

The SegNet Segmentation Model
SegNet [13] is composed of an encoder network and a corresponding decoder network containing 13 convolution layers, and a pixel-level classification layer is cascaded at the output of the decoder. In SegNet, each encoder layer corresponds to a decoder layer, and the output of the last decoder is sent to the pixel-level softmax classifier to generate class probability for each pixel independently. The structure of SegNet is shown in Fig. 1.
In Fig. 1, the encoder part of SegNet is composed of several convolution blocks that include convolution, batch normalization (BN), ReLU, pooling and other operations. The receptive field is increased through the max-pooling operation to extract image features. The decoder part is composed of deconvolution, bilinear interpolation and up-sampling operations. Through deconvolution, the semantic features obtained after image classification are reproduced, and the feature maps are restored to the original size of the image. The softmax function serves as a pixel-level classifier: each pixel is assigned the class with the maximum probability, which realizes image segmentation.

The Fuzzy Attention Module
Fuzzy logic handles the dependence and correlation between fuzzy relations by imitating the uncertain thinking and reasoning of the human brain, with the help of concepts such as the membership function [24]. For MRI image segmentation, the fuzzy attention module is embedded on the feature channels in the down-sampling (encoding) stage and on the spatial regions in the up-sampling (decoding) stage, and the importance of features is recalibrated.

Fuzzy Channel Attention Module in the Down-Sampling Layer
The fuzzy attention module is introduced on the feature channels of the down-sampling layer. First, the embedding position of the fuzzy layer is determined. Let the dimension of the input data be (N, H, W, C). Two fully connected operations are performed in the channel attention module: the number of neurons in the first fully connected layer is N × 1 × 1 × (C∕16), and the number in the second is N × 1 × 1 × C. To control the correlation weights between the output feature channels through the membership degree determined by fuzzy logic, the fuzzy layer is embedded between the two fully connected layers; that is, the output of the first fully connected layer is used as the input of the fuzzy layer, and the output of the fuzzy layer is used as the input of the second fully connected layer.
An S-type function [25] with the fuzzy membership property is adopted as the membership function of the fuzzy layer. The input data are fuzzified, and the membership value, which lies in (0, 1), is calculated through the S-type membership function. The closer the value is to either end of this range, the higher the degree of certainty; the closer it is to the middle, the lower the degree of certainty. Two uncertainty thresholds, a lower one and an upper one, are determined from domain expert knowledge or by learning; the lower threshold lies in (0, 1) and the upper threshold lies between the lower threshold and 1. When the membership value falls between the two thresholds, the certainty is low; otherwise the certainty is high. The weights of feature channels with high certainty are amplified, while the weights with low certainty are kept unchanged to reduce the amount of computation. To strengthen the nonlinearity of the membership mapping, the result of the membership function is fed into the S-type membership function again to obtain a new membership value, for which 0.5 is the dividing point. A membership value close to 0.5 has low certainty and, through a set of affine transformations, receives a smaller annotation weight for the importance of the feature channel. Values far from 0.5 on either side have high certainty and receive a greater annotation weight. Finally, the annotation weights with membership measurement are normalized by batch normalization and output.
In the down-sampling module, a fuzzy channel attention module is added to strengthen the discrimination of different target regions. It is shown in Fig. 2.
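The thresholding behavior of the fuzzy layer can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the logistic curve stands in for the S-type membership function, whose exact form and coefficients are not reproduced here, the thresholds `low` and `high` and the amplification factor `gain` are hypothetical values, and the affine transformation and batch normalization steps are omitted.

```python
import math

def s_membership(x, slope=1.0, center=0.0):
    """Logistic curve used as a stand-in S-type membership function (assumption)."""
    return 1.0 / (1.0 + math.exp(-slope * (x - center)))

def certainty(membership):
    """Certainty is 0 at a membership of 0.5 and approaches 1 at either end of (0, 1)."""
    return abs(membership - 0.5) * 2.0

def fuzzy_layer(values, low=0.3, high=0.7, gain=1.5):
    """Amplify values whose membership lies outside (low, high); keep the rest unchanged."""
    out = []
    for v in values:
        m = s_membership(v)
        if low < m < high:
            out.append(v)          # low certainty: weight kept as it is
        else:
            out.append(gain * v)   # high certainty: weight amplified
    return out

activations = [-3.0, -0.2, 0.1, 2.5]
memberships = [s_membership(v) for v in activations]
recalibrated = fuzzy_layer(activations)
```

With these hypothetical thresholds, only the first and last activations have memberships outside (0.3, 0.7), so only they are amplified.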

Fuzzy Spatial Attention Module in the Up-Sampling Layer
For the spatial attention module in the up-sampling layer, the uncertainty of membership is used as the weight to evaluate the importance of pixels. Pixels with low uncertainty are given more influential weights, while the weights of pixels with high uncertainty are suppressed. In the processing flow, the fuzzy spatial attention layer is embedded after the last convolution operation of the up-sampling architecture and before the element-wise multiplication with the original feature map. In the fuzzy spatial attention module, the shallow features from down-sampling are combined with the deep features from up-sampling to distinguish the features of different regions more effectively.
In the up-sampling module, a fuzzy spatial attention module is added to reduce the loss of image detail features and expand the receptive field. The fuzzy spatial attention module mechanism is shown in Fig. 3.
In image semantic segmentation based on FA-SegNet, a feature map of a given size passes through two different attention mechanisms that constrain the correlation between its features. Unlike the original SegNet model, which extracts features only through convolution and activation to provide nonlinearity, the fuzzy attention mechanism not only provides annotation weights with uncertainty for feature channels and spatial regions, but also retains the context information activated by the original convolutions. It can therefore better mine the region and edge feature information in the segmentation task and suppress the interference of task-irrelevant information.
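The data flow of the fuzzy spatial attention module can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions rather than the published implementation: the logistic function stands in for the S-type membership function, the batch normalization steps are left out, and the function name `fuzzy_spatial_attention` is hypothetical.

```python
import math

def s_membership(x):
    """Logistic stand-in for the S-type membership function (assumption)."""
    return 1.0 / (1.0 + math.exp(-x))

def fuzzy_spatial_attention(feature_map):
    """feature_map: list of C channels, each an H x W list of lists."""
    channels = len(feature_map)
    height, width = len(feature_map[0]), len(feature_map[0][0])
    # Average over the channel axis to obtain a single H x W map.
    pooled = [[sum(feature_map[c][i][j] for c in range(channels)) / channels
               for j in range(width)] for i in range(height)]
    # Apply the membership function twice (normalization omitted in this sketch).
    weights = [[s_membership(s_membership(v)) for v in row] for row in pooled]
    # Rescale every channel by the per-pixel attention weight.
    output = [[[feature_map[c][i][j] * weights[i][j] for j in range(width)]
               for i in range(height)] for c in range(channels)]
    return weights, output

fmap = [[[1.0, -1.0], [0.0, 2.0]],
        [[3.0, -3.0], [0.0, 2.0]]]
weights, attended = fuzzy_spatial_attention(fmap)
```

Without the normalization steps, the doubly applied logistic function only produces weights between 0.5 and about 0.73, so the spread of attention values in a full model depends on the batch normalization that this sketch omits.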

Deep Convolution Neural Network Based on Fuzzy Attention Mechanism Image Segmentation Model
Based on the attention mechanism, fuzzy membership is used to recalibrate the importance of the pixel values in each local area. This strengthens the ability to distinguish the features of the image target and to fuse the contextual information and diverse content of the images, and it improves the accuracy with which the model segments the target regions. The overall structure of FA-SegNet is shown in Fig. 4. The information processing flow of the FA-SegNet model is as follows:
(1) Input layer: a C-channel image of size H × W.
(2) Down-sampling image feature extraction, that is, the encoding stage.
In the encoding stage, image features are extracted by convolution in 5 encoder modules. Each encoder module includes convolution layers, a fuzzy channel attention module and a pooling layer. All convolution layers use 'same' convolution, that is, the spatial size of the input is maintained after the convolution operation, and each convolution is followed by batch normalization and ReLU activation. Let the input feature map of the convolution layer be $X = [x^1, x^2, \ldots, x^{C'}]$, where $C'$ is the number of input channels, and let the output feature map be $U = [u_1, u_2, \ldots, u_C]$. The c-th output channel is
$$u_c = v_c * X = \sum_{s=1}^{C'} v_c^s * x^s,$$
where * is the convolution operation, and $v_c^s$ is the s-th convolution kernel of $v_c$, corresponding to the s-th channel of the input X.
Each encoder contains 2 or 3 consecutive convolution layers, and the feature map after the convolution operations is input to the fuzzy channel attention module. First, the global information of each channel is obtained by global average pooling of the feature map, which reduces each channel to a single 1 × 1 value. Formally, a statistic $z \in \mathbb{R}^{C}$ is generated by shrinking U through its spatial dimensions H × W, where the c-th element of z is calculated by:
$$z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j).$$
Then, the globally pooled information passes through two fully connected layers, with the fuzzification operation carried out between them. The uncertainty determined by the fuzzy membership function controls the correlation weights output for the feature channels: the output of the first fully connected layer is used as the input of the fuzzy layer, and the output of the fuzzy layer is used as the input of the second fully connected layer. The specific form is:
$$s = \sigma(\mathrm{Fuzzy}(g(z, W))) = \sigma(W_2\, \mathrm{Fuzzy}(\delta(W_1 z))), \qquad (4)$$
where $\delta$ is the ReLU function, Fuzzy represents the fuzzification operation, and $\sigma$ is the Sigmoid function. $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$ and $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$ are the weights of the two fully connected layers, respectively, where r is the reduction coefficient, which is used to reduce the number of nodes in the fully connected layers and the number of model parameters.
For the fuzzification process, the input data are first fuzzified: the membership value, which lies in (0, 1), is calculated through the S-type membership function. To strengthen the nonlinearity of the membership mapping, the result is fed into the S-type membership function again to obtain a new membership value. Finally, the feature channel annotation weights with membership measurement are normalized by batch normalization and output. The calculation formula is:
$$z'' = \mathrm{BN}(S(S(\mathrm{BN}(z')))), \qquad (5)$$
where $z'$ is the input of the fuzzy layer, $z''$ is its output, BN stands for batch normalization, and S stands for the S-type function.
Fig. 4 The overall structure of FA-SegNet
After the two fully connected operations and fuzzification, the weight s of each channel of the feature map, that is, its degree of attention, is finally obtained. It is multiplied by the corresponding feature map values, so that the importance of each feature channel is relabeled with a different weight. The specific calculation formula is:
$$\tilde{x}_c = s_c \cdot u_c,$$
where ⋅ represents the multiplication of the value $s_c$ by the corresponding channel of the feature map, $u_c \in \mathbb{R}^{H \times W}$.
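The squeeze, fuzzy excitation and scaling steps of the channel attention module can be traced on a toy feature map. The sketch below is a hedged illustration, not the authors' code: the logistic function is assumed as the S-type membership function, the BN steps of Eq. (5) are omitted, and the weights `w1` and `w2` are arbitrary hypothetical values for C = 2 and r = 2.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def fuzzy(values):
    """Double logistic membership; stands in for Eq. (5) with the BN steps omitted."""
    return [sigmoid(sigmoid(v)) for v in values]

def fuzzy_channel_attention(feature_map, w1, w2):
    """feature_map: C channels of H x W values; w1: (C/r) x C; w2: C x (C/r)."""
    # Squeeze: global average pooling per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_map]
    # Excitation: FC, ReLU, fuzzy layer, FC, sigmoid.
    hidden = [max(0.0, h) for h in matvec(w1, z)]
    s = [sigmoid(a) for a in matvec(w2, fuzzy(hidden))]
    # Scale: multiply every channel by its attention weight.
    scaled = [[[s[c] * v for v in row] for row in ch]
              for c, ch in enumerate(feature_map)]
    return z, s, scaled

fmap = [[[1.0, 3.0], [5.0, 7.0]],   # channel 0, mean 4.0
        [[0.0, 0.0], [0.0, 2.0]]]   # channel 1, mean 0.5
w1 = [[0.5, -1.0]]                  # one hidden unit (C = 2, r = 2)
w2 = [[1.0], [-1.0]]
z, s, scaled = fuzzy_channel_attention(fmap, w1, w2)
```

In this toy setting the two channels receive complementary weights, so channel 0 is emphasized and channel 1 is attenuated.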
A pooling layer is added at the end of each encoder. Max pooling with a 2 × 2 window is used, which reduces the resolution of the feature map to half of the original. At layer l, each output value is the maximum of the corresponding 2 × 2 window of the layer input, and the position of that maximum is recorded for use in the decoding stage.
(3) Up-sampling image segmentation, that is, the decoding stage.
The decoding part also includes five modules, corresponding to those of the encoding stage. Each decoder module includes an up-sampling layer, a fuzzy spatial attention module and convolution layers. The resolution of the feature map input to the decoder is increased by an up-sampling layer of size 2 × 2. Using the index values recorded during down-sampling in the encoder, each input value is placed at its original position, and all other positions are filled with 0. The feature map after up-sampling is input to the fuzzy spatial attention module for fuzzification and spatial attention adjustment. Different from the channel attention module, the fuzzy layer is embedded after the last convolution operation and before the element-wise multiplication with the original feature map. First, the feature map is average pooled along the channel direction, which yields an H × W × 1 map that summarizes the information of all channels at each spatial position.
Then, the pooled feature map is fuzzified in the same way as in the encoding stage, following Eq. (5), $z'' = \mathrm{BN}(S(S(\mathrm{BN}(z'))))$, to obtain the final spatial region importance weight matrix. The values in the matrix express different degrees of attention to the pixels at each spatial position. Finally, the spatial attention matrix is multiplied element-wise with the original input of the decoding stage, which recalibrates the importance of the value at each pixel position.
(4) The last part is a Softmax layer, which calculates the probability that each pixel of the image belongs to each category. The category with the maximum probability is the label of the pixel, which completes the pixel-level image classification. The activation function of the output layer is Softmax: for a pixel with class scores $a_1, \ldots, a_K$, the probability of class k is $p_k = e^{a_k} / \sum_{j=1}^{K} e^{a_j}$.
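The index-based pooling and up-sampling used between the encoder and decoder can be sketched as follows. This is a minimal pure-Python illustration for a single channel with even height and width; the helper names `max_pool_2x2` and `max_unpool_2x2` are hypothetical and do not refer to any library.

```python
def max_pool_2x2(x):
    """2 x 2 max pooling with stride 2; returns pooled values and argmax positions."""
    pooled, indices = [], []
    for i in range(0, len(x), 2):
        pooled_row, index_row = [], []
        for j in range(0, len(x[0]), 2):
            window = [(x[i + di][j + dj], (i + di, j + dj))
                      for di in range(2) for dj in range(2)]
            value, position = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            index_row.append(position)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices

def max_unpool_2x2(pooled, indices, height, width):
    """Place each pooled value back at its recorded position; fill the rest with 0."""
    out = [[0.0] * width for _ in range(height)]
    for value_row, index_row in zip(pooled, indices):
        for value, (i, j) in zip(value_row, index_row):
            out[i][j] = value
    return out

x = [[1.0, 2.0, 0.0, 1.0],
     [4.0, 3.0, 5.0, 0.0],
     [0.0, 0.0, 2.0, 2.0],
     [0.0, 6.0, 1.0, 7.0]]
pooled, indices = max_pool_2x2(x)
restored = max_unpool_2x2(pooled, indices, 4, 4)
```

The restored map is sparse: only the four recorded maxima are non-zero, which is why the decoder follows up-sampling with convolution layers that densify the feature map.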

The Learning Algorithm
In the training of FA-SegNet, the cross-entropy loss [26] is used as the loss function, and the back-propagation algorithm is used to learn the network parameters. The loss L(p, g) is the cross-entropy between the predicted class probabilities p and the label g of each pixel. The specific process is as follows:
Step 1: Determine the network training parameters. The FA-SegNet training parameters include the SegNet model parameters and the fuzzy attention module parameters, collected in a parameter vector $\theta$. It contains the network weights and the offset (bias) set, together with $k_0$, $k_1$ and $k_2$, which are the weight of the fully connected layers and the two coefficient matrices of the S-type membership function in the fuzzy attention module, respectively.
Step 2: Randomly sample N image slices $x_n$ (n = 1, 2, … , N) and obtain the N corresponding label samples $g_n$.
Step 3: Obtain the predicted segmented images $p_n$ from the input image slices $x_n$ using the network with parameters $\theta_{t-1}$, the training parameters at time (t − 1).
Step 4: Calculate the gradient of the network loss value:
$$\nabla L(\theta_{t-1}) = \nabla_{\theta} L(p_n, g_n; \theta_{t-1}), \qquad (15)$$
where $p_n$ is the predicted segmented image, and $g_n$ is the label sample corresponding to the input image.
Step 5: Use the SGD algorithm to optimize and update the parameters:
$$\theta_t = \theta_{t-1} - \eta \nabla L(\theta_{t-1}),$$
where $\eta$ is the learning rate.
Step 6: If the network has converged, the training ends; otherwise, return to Step 2.
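Steps 1 to 6 can be traced on a toy problem. The sketch below replaces the network with a hypothetical linear softmax classifier on scalar inputs, so it only illustrates the sample, predict, differentiate and update cycle with a cross-entropy loss; the data, the learning rate of 0.5 and the fixed budget of 200 steps used in place of a convergence test are all assumptions.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(theta, x):
    """Toy stand-in for the network: one linear unit per class on a scalar input."""
    return softmax([w * x + b for w, b in theta])

def loss_and_grad(theta, batch):
    """Mean cross-entropy and its gradient with respect to every (w, b) pair."""
    grad = [[0.0, 0.0] for _ in theta]
    loss = 0.0
    for x, label in batch:
        p = predict(theta, x)
        loss -= math.log(p[label])
        for k in range(len(theta)):
            delta = p[k] - (1.0 if k == label else 0.0)
            grad[k][0] += delta * x
            grad[k][1] += delta
    n = len(batch)
    return loss / n, [[gw / n, gb / n] for gw, gb in grad]

random.seed(0)
data = [(-1.0, 0), (-0.8, 0), (0.9, 1), (1.1, 1)]   # (input, label) pairs
theta = [[0.0, 0.0], [0.0, 0.0]]                    # Step 1: parameters
eta = 0.5                                           # learning rate
initial_loss, _ = loss_and_grad(theta, data)
for step in range(200):                             # Step 6: fixed budget here
    batch = random.sample(data, 2)                  # Step 2: sample a mini-batch
    _, grad = loss_and_grad(theta, batch)           # Steps 3-4: predict, gradient
    theta = [[w - eta * gw, b - eta * gb]           # Step 5: SGD update
             for (w, b), (gw, gb) in zip(theta, grad)]
final_loss, _ = loss_and_grad(theta, data)
```

The gradient of the softmax cross-entropy with respect to each class score is the predicted probability minus the one-hot label, which is what `delta` computes.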

Experiment and Result Analysis
In this section, the established FA-SegNet semantic segmentation network is experimentally evaluated and compared with common medical image segmentation methods.

The Experiment Dataset
The experimental dataset is the public dataset of the MICCAI 2017 Automated Cardiac Diagnosis Challenge (ACDC) [27]. The dataset contains clinical cardiac magnetic resonance imaging (MRI) data of 150 patients, with 12-35 frames of short-axis MRI per patient. By clinical diagnosis, the 150 patients are divided into five categories of 30 patients each: Normal (NOR), Dilated Cardiomyopathy (DCM), Hypertrophic Cardiomyopathy (HCM), previous Myocardial Infarction (MINF) and Abnormal Right Ventricle (RV). Details are given in Table 1.
Because the standard segmentation labels of the test set were not published, the 100 patient samples of the training set are used to both train and test the model in the experiment. The total number of end-diastolic and end-systolic slices in the 100 cases is 1902. The data were randomly split into a training set and a test set at a ratio of 4:1, yielding a training set of 1522 slices and a test set of 380 slices.
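The slice counts follow directly from the 4:1 ratio, as the short sketch below shows. The shuffling seed and the function name `split_indices` are hypothetical, and the sketch splits at the slice level, which is how the text describes the split.

```python
import random

def split_indices(num_samples, train_ratio=0.8, seed=0):
    """Shuffle sample indices and split them into training and test subsets."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    num_train = round(num_samples * train_ratio)
    return indices[:num_train], indices[num_train:]

train_ids, test_ids = split_indices(1902)
```

Here 1902 × 0.8 = 1521.6, which rounds to 1522 training slices and leaves 380 test slices.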

Data Preprocessing
The size of the MRI slices provided by the ACDC database ranges from 154 × 224 to 428 × 512. All image slices are rescaled to 256 × 256 by fitting the larger of the X and Y dimensions to 256 and filling the remaining area with the minimum value of each frame. In addition, the MRI datasets collected by different types of scanners have different voxel intensity ranges, which strongly affects the performance of the segmentation model. In this experiment, the Z-score standardization method [28] is used to normalize the voxel intensity of the dataset samples. The calculation formula is as follows:
$$x' = \frac{x - \mu}{\sigma}, \qquad (17)$$
where $\mu$ is the mean value of all voxels in a single MRI image, and $\sigma$ is the standard deviation of all voxels in a single MRI image. After this normalization, the voxel values of each MRI image have a mean of 0 and a standard deviation of 1.
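The padding and normalization steps can be sketched on a tiny image. This is an illustrative sketch only: the interpolation that rescales the larger dimension to 256 is omitted, centering the image on the canvas is an assumption (the text does not say where the padding goes), and the function names are hypothetical.

```python
import math

def pad_to_square(image, size):
    """Center a 2-D image on a size x size canvas filled with the image minimum."""
    height, width = len(image), len(image[0])
    fill = min(min(row) for row in image)
    top, left = (size - height) // 2, (size - width) // 2
    canvas = [[fill] * size for _ in range(size)]
    for i in range(height):
        for j in range(width):
            canvas[top + i][left + j] = image[i][j]
    return canvas

def z_score(image):
    """Normalize to zero mean and unit standard deviation, as in Eq. (17)."""
    values = [v for row in image for v in row]
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [[(v - mean) / std for v in row] for row in image]

image = [[10.0, 20.0, 30.0],
         [40.0, 50.0, 60.0]]
padded = pad_to_square(image, 4)
normalized = z_score(padded)
```

The sketch assumes the image already fits inside the target canvas and is not constant, since a constant image would have zero standard deviation.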

Model Structure and Parameter Settings
In the experiment, single-channel cardiac MRI images of size 256 × 256 are used as the input. The numbers of feature map channels output by the convolution layers of the 5 encoder modules are [64, 128, 256, 512, 512]. The convolution kernel size of all convolution layers is 3 × 3, the padding is 'SAME', and the size of the pooling layers is 2 × 2. The output layer outputs a 256 × 256 map with the same spatial size as the original input, in which each pixel carries a four-dimensional vector corresponding to the four segmentation targets: the background, left ventricle, left ventricular myocardium and right ventricle. The experiment is implemented in Python with the deep learning library PyTorch. The experimental environment is built on a Linux system, the CPU is an Intel Xeon E5-2630 V3, and the GPU is an NVIDIA Quadro P4000. The maximum number of iterations is 10000, the batch size is 2, and the number of epochs is 10. Stochastic gradient descent with a momentum term is selected to optimize the model. The learning rate is set to 0.001 and the momentum coefficient is 0.9. The network does not use dropout.
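The feature map sizes implied by these settings can be traced with a few lines of arithmetic. The sketch assumes that each of the five encoder modules ends with one 2 × 2 pooling step and that 'SAME' padding leaves the spatial size unchanged in the convolutions; the function name `encoder_shapes` is hypothetical.

```python
def encoder_shapes(input_size=256, channels=(64, 128, 256, 512, 512)):
    """Return (channels, height, width) after each encoder module."""
    shapes = []
    size = input_size
    for out_channels in channels:
        size //= 2                     # 2 x 2 max pooling halves the resolution
        shapes.append((out_channels, size, size))
    return shapes

shapes = encoder_shapes()
```

Under these assumptions the deepest encoder output is a 512-channel map of size 8 × 8, which the five decoder modules then restore to 256 × 256.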
To demonstrate the effectiveness of the proposed FA-SegNet model and to show that combining SegNet with the fuzzy attention module improves segmentation performance, three current mainstream segmentation methods are selected for comparison: FCN [6], U-Net [12] and the original SegNet [13]. The FCN model uses VGG16 as the backbone network, the numbers of feature channels output by its convolution layers are [64, 128, 256, 512, 512, 4096], and the FCN-8s skip structure is adopted. The U-Net model adopts an encoder with output channel numbers of [64, 128, 256, 512, 1024] and a decoder with [1024, 512, 256, 128, 64]. The structure of the SegNet model is the same as described above. For all segmentation models, the convolution kernel size is 3 × 3, the pooling size is 2 × 2, and SGD with a momentum term is used for optimization.

The Experiment Result and Analysis
The performance of the proposed FA-SegNet is evaluated with the following image segmentation criteria: Pixel Accuracy (PA), Class Pixel Accuracy (CPA), Mean Pixel Accuracy (MPA), Dice coefficient, Intersection over Union (IoU) and Hausdorff Distance (HD) [29]. The four models are used to segment the test set. For the qualitative analysis, the segmentation results are displayed directly: the yellow area is the left ventricle, the green area is the left ventricular myocardium, and the blue area is the right ventricle. The qualitative results of the four methods are shown in Fig. 5.
Fig. 5 The segmentation results of 4 models
Comparing the outputs of the four methods with the standard segmentation labels in Fig. 5 shows that, in most cases, the results of the proposed model agree well with the standard segmentation. For the left ventricle, whose shape and position are relatively fixed, the original SegNet and U-Net models also achieve good results, but they are less accurate on the right ventricle, whose morphology varies greatly. The FCN model segments every region poorly. In addition, in terms of overall contour preservation, FA-SegNet performs better: each region is clearly distinguished and closer to the benchmark result. The difficulty arises because different individuals have different heart shapes and sizes; during cardiac MRI examination, the systole and diastole of the heart and the angle of the slices make the acquired images vary greatly, and the available training samples are too few for the learned knowledge to cover the diversity of the heart regions. At the same time, the deep model is trained and tested on different data, resulting in large differences in the segmentation results of the heart region. Therefore, the existing comparison algorithms segment the heart region with poor accuracy, while FA-SegNet, a semantic segmentation model based on the fuzzy attention mechanism, not only provides annotation weights with uncertainty for feature channels and spatial regions but also retains the context information activated by the original convolutions, so that it better mines the region and edge feature information in the segmentation task and suppresses the interference of task-irrelevant information.
The evaluation indexes CPA, Dice coefficient, IoU and HD are used to quantitatively evaluate the segmentation results of FA-SegNet. They are calculated for the left ventricular cavity, left ventricular myocardium and right ventricle. Averaged over these regions, the MPA is 0.9247, the mean Dice coefficient is 0.9244, the mean IoU is 0.8618, and the mean HD is 9.78 mm. The results are shown in Table 2.
Every evaluation index shows that FA-SegNet achieves good segmentation results on cardiac MRI images. However, the results differ between regions. The left ventricle is segmented best, while the right ventricle is segmented least accurately. The reason is that the shape and intensity of the right ventricle vary greatly, and it is often occluded by the left ventricle and myocardium, so only part of the ventricle can be detected and it is easily confused with background information.
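The overlap-based indexes can all be derived from a confusion matrix, as the following sketch shows on a toy 2 × 3 label map. The definitions are the common ones and are assumptions with respect to this paper: in particular, CPA is computed here as the fraction of ground-truth pixels of a class that are labeled correctly, whereas some sources normalize by the predicted pixels instead. HD is not covered.

```python
def confusion_matrix(pred, truth, num_classes):
    """matrix[t][p] counts pixels of true class t predicted as class p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            matrix[t][p] += 1
    return matrix

def segmentation_metrics(matrix):
    k = len(matrix)
    total = sum(sum(row) for row in matrix)
    tp = [matrix[c][c] for c in range(k)]
    truth_count = [sum(matrix[c]) for c in range(k)]
    pred_count = [sum(matrix[r][c] for r in range(k)) for c in range(k)]
    cpa = [tp[c] / truth_count[c] for c in range(k)]
    iou = [tp[c] / (truth_count[c] + pred_count[c] - tp[c]) for c in range(k)]
    dice = [2 * tp[c] / (truth_count[c] + pred_count[c]) for c in range(k)]
    return {"PA": sum(tp) / total, "CPA": cpa, "MPA": sum(cpa) / k,
            "IoU": iou, "MIoU": sum(iou) / k, "Dice": dice}

truth = [[0, 0, 1],
         [1, 2, 2]]
pred = [[0, 1, 1],
        [1, 2, 0]]
metrics = segmentation_metrics(confusion_matrix(pred, truth, 3))
```

The sketch assumes every class occurs at least once in the ground truth; an absent class would need explicit handling to avoid division by zero.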
The results of various evaluation indexes are shown in Table 3.
The experimental results in Table 3 show that FA-SegNet obtains the best PA, MPA and MIoU among the compared algorithms. In particular, compared with the original SegNet, the proposed model is 0.04, 0.01 and 0.05 higher in PA, MPA and MIoU, respectively, which indicates that the proposed fuzzy attention module is effective in deep feature extraction.

Conclusion
In this paper, to address the problem of medical image segmentation, the FA-SegNet segmentation model, which introduces a fuzzy attention mechanism into SegNet, is proposed based on ideas from cognitive science and deep learning. By applying fuzzy cognition to the attention module, semantic concepts such as the membership function and uncertainty are combined with the degree of attention paid to feature channels and spatial regions, so that domain knowledge constrains the feature maps and the pixel-level segmentation. The experimental results show that FA-SegNet improves the segmentation quality and accuracy of the model and is applicable to both small and large targets. Compared with the other models, the evaluation indexes are improved. FA-SegNet makes comprehensive use of the advantages of the fuzzy attention mechanism and the deep convolution network. In its information processing mechanism, it retains the context information activated by the convolution operations, better mines the regional and edge feature information in the segmentation task, and suppresses the interference of task-irrelevant information. It provides a new deep learning method for image segmentation and has good prospects for wider application. In future work, this study can be extended to 3D medical image processing, which can better reflect the changes in the structure and morphology of organs and tissues in the clinic.
Author Contributions RY designed the study and took the lead in the manuscript writing. JGY supervised the study design. JY contributed to the study design. KL supervised the study design and helped in the writing of the final draft of the manuscript. SX made critical revisions of the final manuscript. All authors read and approved the final manuscript.
Funding This work was supported by the Shandong University of Science and Technology Research Fund under Grant 2019TDJH102.
Availability of data and materials Not applicable.

Conflict of interest
The authors declared that they have no conflicts of interest to this work.
Ethics approval and consent to participate Not applicable.