Abstract
Medical image segmentation is a challenging task due to the high variation in the shape, size and position of infections or lesions in medical images. It is necessary to construct multi-scale representations to capture image content at different scales. However, it remains challenging for a U-Net with simple skip connections to model the global multi-scale context. To overcome this, we propose a dense skip connection with cross co-attention in U-Net to bridge the semantic gaps for accurate automatic medical image segmentation. We name our method MCA-UNet, which enjoys two benefits: (1) it has a strong ability to model multi-scale features, and (2) it jointly explores spatial and channel attention. The experimental results on the COVID-19 and IDRiD datasets suggest that our MCA-UNet produces more precise segmentation of consolidation, ground-glass opacity (GGO), microaneurysms (MA) and hard exudates (EX). The source code of this work will be released via https://github.com/McGregorWwww/MCAUNet/.
Introduction
Medical image segmentation [1,2,3,4,5] of target objects provides valuable information for the analysis of pathologies. However, the high variation in the shape, size and position of infections or lesions is one of the key challenges in medical image segmentation. As observed in Fig. 1, consolidation and ground-glass opacity (GGO) lesions in CT vary significantly in size and shape and have irregular, blurred appearances. The microaneurysms and hard exudates in fundus photography are tiny and sparsely distributed, which easily results in false-negative detections.
Recently, deep learning has shown its strong power of feature learning in image segmentation. For medical image segmentation, U-Net-like encoder-decoder architectures have shown their power in many applications [6]. Although U-shaped networks have achieved good performance in many medical image segmentation applications [7,8,9], they still have several key limitations. (1) Insufficient capability of extracting the context information needed to reconstruct a fine-grained segmentation map. The global context information is generally captured by the deeper layers of the encoder and is gradually transmitted to the shallower layers, during which it may be progressively diluted. (2) Although skip connections can help recover the spatial information lost through the pooling layers, they are unnecessarily restrictive: they demand the fusion of encoder and decoder feature maps at the same level without considering the semantic gap between them [10, 11]. This raises an important question for U-Net methods: can we overcome this limitation and develop a new framework that improves over the restrictive skip connections in U-Net, which fuse only same-scale feature maps by simple concatenation?
To this end, we propose a U-shaped architecture with a more flexible multi-scale cross co-attention skip connection, enabling flexible feature fusion in the decoders for automatic segmentation. With the proposed dense connectivity, each node in the decoder is connected with the aggregation of all feature maps from the encoder, relaxing the unnecessarily restrictive skip connections in which only feature maps of the same scale are connected. This differs from UNet++, which fuses only the encoder features from the deeper layers without considering the fusion of the shallower layers (please refer to Fig. 2b). On the other hand, we design an attention mechanism [12, 13] from both the channel-wise and spatial-wise perspectives to reduce the semantic gap between the encoder and decoder, termed the co-attention mechanism. The co-attention mechanism can not only eliminate the semantic gap in feature fusion but also highlight the salient features that are passed through the skip connections. Due to the reuse of feature maps, no extra computations and parameters are required, compared with UNet++ and MultiResUNet, which address the semantic gap by combining a series of convolution blocks. To facilitate the learning of multi-scale feature fusion with cross co-attention connections, we employ deep supervision to guide the feature learning in different stages of the decoder. Our experimental results indicate that the deep supervision mechanism is effective in improving the segmentation performance of U-shaped networks, especially when the target objects have multiple scales. The performance of deep supervision highly depends on appropriate task weights. Therefore, we regard it as a multi-task learning problem and make the weights learnable through a balanced multi-task dynamic weight (BMTD) optimization algorithm. The contributions of this work are threefold:

We dissect the skip connections in U-Net and empirically demonstrate that appropriate connections are important for segmentation. We propose a multi-scale cross skip connection to boost semantic segmentation by bridging the semantic gaps between low-level and high-level features with an effective feature fusion scheme. Compared with plain skip connections, the multi-scale cross skip connection enlarges the receptive field of U-Net by jointly considering the multi-scale features, and is hence able to extract multi-scale features of the target object and incorporate larger context.

While encoders have been studied rigorously, relatively few studies focus on the decoder side. The proposed bi-decoder module differs from the original decoder in three ways: (1) cross co-attention, which bridges the semantic gap between encoder and decoder feature maps by highlighting regions of significant interest for the diseases; (2) dual upsampling, which improves the upsampling performance by exploiting finer spatial recovery in the decoder; and (3) deep supervision, which further facilitates the multi-scale feature fusion with direct supervision at each level. Based on a U-shaped network, the proposed decoder module can be easily embedded in other frameworks for medical image segmentation tasks.

The proposed MCA-UNet is evaluated on four lesion segmentation tasks from two different datasets, whose difficulties include large variations of shape/size, blurred boundaries and small lesions, and it is shown to achieve better performance than the related U-Net-based architectures.
Related works
Recently, deep learning has shown its strong power of feature learning in image segmentation applications, for example, brain lesion segmentation [14], organ segmentation [15] and electron microscopy image segmentation [6]. U-shaped architectures offer advantages in medical image segmentation applications [6]. However, they still have some limitations, such as a lack of ability to model the multi-scale global context and the semantic gap between the encoder and the decoder. To address these issues, methods with different skip connections for more flexible feature fusion have been proposed, as illustrated in Fig. 2. Zhou et al. [10] propose a nested U-shaped framework, UNet++, with nested dense skip pathways that replace the restrictive skip connections in U-Net, which fuse only same-scale feature maps. Ibtehaz et al. [11] propose MultiResUNet, which incorporates residual convolutional layers along the skip connections; the study hypothesizes that processing the features propagating from the encoder stage may balance the possible semantic gaps. Attention U-Net [13] reduces the semantic gap between the encoder and decoder with a spatial attention mechanism. The advantage of these methods is that they improve segmentation performance by alleviating the semantic gap, incorporating extra convolution layers or attention mechanisms. Despite achieving good performance, the works above are still incapable of effectively exploring sufficient information from all scales, because their skip connection designs ignore the correlation among multi-scale encoder features.
Methods
The overall framework of MCA-UNet
Our network consists of three parts: the encoder, the multi-scale cross skip connection and the bidirectional decoder (bi-decoder), which consists of dual upsampling, cross co-attention (CCA) and deep supervision. We also employ a BMTD algorithm to optimize the multi-task loss from the deeply supervised decoder layers. Figure 3 illustrates the architecture of our proposed MCA-UNet. To improve the representation capacity of the segmentation network, we replace the original two-layer convolution block with a Residual Block [16]. To better fuse features of inconsistent semantics and scales, we propose a cross co-attention guided multi-scale fusion scheme, which addresses the issues that arise when fusing features at different scales. To effectively fuse the multi-scale features from different encoder levels and produce the final segmentation mask, we propose a bi-decoder module that is directly enhanced by the multi-scale context extracted from the contracting path. The bi-decoder module also involves a dual upsampling process that improves the upsampling performance and a deep supervision scheme that facilitates backpropagation and convergence. We provide details of each step in the following sections.
Encoder
The encoder of the original U-Net consists of four double-convolution layers with an activation function, which is insufficient for feature extraction and representation. Thus, we replace each convolution block with a Residual Block [16], which has been proven useful for learning richer representations and mitigating the degradation problem. The details can be seen in Table 1.
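A Residual Block of this kind can be sketched in PyTorch as follows; the channel widths are illustrative, and the exact layer configuration is the one given in Table 1:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions with batch norm and an identity shortcut."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        # 1x1 conv aligns channel dimensions for the shortcut when needed
        self.shortcut = (nn.Conv2d(in_ch, out_ch, 1, bias=False)
                         if in_ch != out_ch else nn.Identity())
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.shortcut(x))
```

The shortcut lets gradients bypass the two convolutions, which is the property credited above with mitigating the degradation problem.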
Multi-scale cross skip connection
The skip connection was first proposed in U-Net, transmitting the low-level information (textures, shapes, etc.) in the shallower encoder stages to the corresponding stages of the decoder. However, through the original skip connection, each stage of the decoder receives features from only one scale, which may harm the decoder features due to the semantic gaps, and lacks the ability to capture multi-scale context information, which has been proven essential for lesion segmentation tasks [17, 18]. To solve these problems, we replace the original skip connection scheme with a multi-scale cross skip connection scheme. The proposed scheme transmits the resized (using upsampling or max-pooling) features from all four encoder stages to each decoder stage, then combines them with a bi-decoder block, which will be introduced in the next section. The cross skip paths between the encoder and the decoder aggregate features generated at multiple scales and thus lead to better segmentation predictions.
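The resizing step above can be sketched as follows. The choice of adaptive max-pooling for downsampling and nearest-neighbor interpolation for upsampling mirrors the operations named in the text; the original implementation may use different resizing operators:

```python
import torch
import torch.nn.functional as F

def resize_encoder_features(enc_feats, target_hw):
    """Resize feature maps from all four encoder stages to one decoder
    resolution: max-pooling for larger maps, upsampling for smaller ones."""
    resized = []
    for f in enc_feats:
        h = f.shape[-2]
        if h > target_hw:            # shallower stage -> downsample
            f = F.adaptive_max_pool2d(f, target_hw)
        elif h < target_hw:          # deeper stage -> upsample
            f = F.interpolate(f, size=(target_hw, target_hw), mode='nearest')
        resized.append(f)
    return resized
```

After this step every encoder feature shares the decoder stage's spatial size, so the bi-decoder can fuse all four in one pass.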
Bidirectional decoder (Bi-decoder)
The bi-decoder block is designed as a gating operation of the skip connection, based on a learned attention map applied to multiple feature maps from the encoder. Unlike the traditional decoder, the proposed decoder has two inputs and two outputs. Each decoder block is connected with all encoder blocks via attentional skip connections, rather than with only the same-level block as in the U-Net architecture. The input of the bi-decoder involves two parts: the multi-scale features from the encoder, and complementary dual-upsampled information from the deeper layers. The bi-decoder processes the two inputs along the horizontal and vertical paths, and learns a more powerful representation and finer recovery by handling the feature learning in both directions. Given the inputs of different scales, the decoder further encodes the feature maps, extracting global context with the attention mechanism to enhance finer details by recovering localized spatial information. The outputs are the dual-upsampled information passed to the shallower layers and the direct segmentation prediction, obtained with another upsampling to the original resolution.
In summary, we introduce three enhancements to the conventional decoder module in our proposed bi-decoder: (1) directly concatenating the feature maps from the encoder may cause redundancy, so we propose a channel- and spatial-wise cross co-attention module to guide the channel and spatial information filtration of the encoder feature maps through the skip connections, allowing fine spatial recovery in the decoder; (2) both deconvolution and upsampling are applied when splicing the high-resolution features in the contracting path, to leverage the complementarity between the two different upsampling operations; (3) finally, the incorporation of deep supervision further facilitates the multi-scale feature fusion.
Dual upsampling
The bi-decoder contains two upsampling components, nearest-neighbor upsampling and deconvolution, to recover the resolution from the previous layers. We argue that the two operations differ from each other in operation mode and can be complementary for the following cross correlation. In existing algorithms, either the upsampling or the deconvolution operation is used separately in the decoder. In our work, upsampling and deconvolution together comprise a dual-path decoder. The combination of upsampling and deconvolution can enhance the performance of the cross co-attention when the multi-scale encoder and decoder features are fused.
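A minimal sketch of the dual upsampling; the 1 × 1 convolution used to match channel counts on the parameter-free path is an assumed implementation detail not specified in the text:

```python
import torch
import torch.nn as nn

class DualUpsample(nn.Module):
    """Recover resolution along two complementary paths: parameter-free
    nearest-neighbor upsampling and a learned 2x2 transposed convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.Sequential(
            nn.Upsample(scale_factor=2, mode='nearest'),
            nn.Conv2d(in_ch, out_ch, 1))  # assumed: 1x1 conv to match channels
        self.deconv = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)

    def forward(self, x):
        # the two outputs are later concatenated as the decoder query feature
        return self.up(x), self.deconv(x)
```

Nearest-neighbor interpolation copies existing values, while the transposed convolution learns its interpolation kernel; the two paths therefore recover resolution in genuinely different ways.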
Cross co-attention (CCA)
Attention mechanisms for medical image segmentation have also been used recently [13, 19, 20], showing great potential in improving segmentation performance. In our work, we hypothesize that the information from the multi-scale encoder blocks differs across scales. We focus on the cross correlation between the feature maps from the encoder and decoder rather than on self-attention within a single feature map. Hence, to better fuse features of inconsistent semantics and scales, we propose a multi-scale channel-wise and spatial-wise attention module. The proposed module is incorporated into the bi-decoder to guide the channel and spatial information filtration of the encoder features through the skip connections and to eliminate ambiguity, using the decoder features as signals.
Specifically, instead of simply aggregating features from all levels, we propose to learn attention over four parallel different-level features. Unlike previously proposed attention modules, most of which explore only channel- or spatial-wise attention, the proposed multi-scale cross co-attention module applies channel-wise and spatial-wise attention to high-level and low-level features to exploit the complementary spatial and channel information simultaneously. With the cross co-attention, the decoder can learn the importance of each feature channel coming from the multi-level feature maps, and emphasize a meaningful feature selection in the spatial map to locate the critical structures.
Motivated by the Squeeze-and-Excitation (SE) block, we extend the self-attention mechanism to a cross co-attention in the multi-scale feature fusion to model the interactions between the encoder and decoder at different scales for better feature representations. We introduce a cross co-attention module, whose process is shown in Fig. 4. It involves channel and spatial attention branches. As illustrated in Fig. 4, the two branches are conducted simultaneously rather than sequentially, thus obtaining better feature representations for pixel-level prediction. It takes the concatenation of the two upsampled features \(\varvec{\hat{X}_U}\) and \(\varvec{\hat{X}_D}\) as the query feature \(\varvec{\hat{X}}\), and the encoder features from different scales as the key features \(\varvec{X^\ell }\), where \(\ell \in \{1,2,3,4\}\) indicates the level of the encoder from which the feature is skip-connected. For the \(\ell\)th encoder level, each pair of feature maps (\(\varvec{X^\ell }\), \(\varvec{\hat{X}}\)) is fed into the CCA module.
Mathematically, we consider the encoder feature maps \(\varvec{X^\ell }=[\varvec{x}_1^\ell ,\varvec{x}_2^\ell ,\ldots ,\varvec{x}_C^\ell ]\) and decoder feature maps \(\varvec{\hat{X}}=[\varvec{\hat{x}}_1,\varvec{\hat{x}}_2,\ldots ,\varvec{\hat{x}}_C]\) as combinations of channels \(\varvec{x}_k\in \mathbb {R}^{H\times W}\) and \(\varvec{\hat{x}}_k\in \mathbb {R}^{H\times W}\), where W, H and C indicate the width, height and channel dimension, respectively. Let \(\varvec{\tilde{P}^\ell }\in \mathbb {R}^{C\times 1 \times 1}\) and \(\varvec{\tilde{Q}^\ell }\in \mathbb {R}^{1\times H\times W}\) be the channel and spatial attention masks. A global average pooling layer \(g(\varvec{x}_k)=\frac{1}{H\times W} \sum _{i=1}^H\sum _{j=1}^W \varvec{x}_k(i,j)\) is used for spatial squeezing. This operation embeds the global spatial information in the vector \(\varvec{P}^\ell\). This vector is transformed by

$$\varvec{P}^\ell =\varvec{L}_2\,\delta \left( \varvec{L}_1\, g(\varvec{X}^\ell )+\varvec{L}_3\, g(\varvec{\hat{X}})\right),$$

where \(\varvec{L}_1\in \mathbb {R}^{\frac{C_{\hat{x}}}{2} \times C_x}\), \(\varvec{L}_{2}\in \mathbb {R}^{C_x\times \frac{C_{\hat{x}}}{2}}\) and \(\varvec{L}_{3}\in \mathbb {R}^{C_{\hat{x}}\times \frac{C_{\hat{x}}}{2}}\) (the latter applied to \(g(\varvec{\hat{X}})\) in transposed form) are the weights of three Linear layers, and \(\delta (\cdot )\) is the ReLU operator.
This operation encodes the channel-wise dependencies. The resultant vector is used to recalibrate or excite \(\varvec{X^\ell }\) as follows:

$$\varvec{\tilde{X}}_{c}^{\ell }=\varvec{\tilde{P}^\ell }\otimes \varvec{X}^\ell =\left[ \sigma ({P}_1^\ell )\varvec{x}_1^\ell ,\sigma ({P}_2^\ell )\varvec{x}_2^\ell ,\ldots ,\sigma ({P}_C^\ell )\varvec{x}_C^\ell \right] , \quad \varvec{\tilde{P}^\ell }=\sigma (\varvec{P}^\ell ),$$

where the activation \(\sigma ({P}_i^\ell )\) indicates the importance of the ith channel, by which the channels of \(\varvec{X}^\ell\) are rescaled.
The process of modeling the spatial relationship is similar to that of the channel attention. We consider an alternative slicing of the input feature maps \(\varvec{X}^\ell =[\varvec{x}_{1,1}^\ell ,\varvec{x}_{1,2}^\ell ,\ldots ,\varvec{x}_{i,j}^\ell ,\ldots ,\varvec{x}_{H,W}^\ell ]\) and \(\varvec{\hat{X}}=[\varvec{\hat{x}}_{1,1},\varvec{\hat{x}}_{1,2},\ldots ,\varvec{\hat{x}}_{i,j},\ldots ,\varvec{\hat{x}}_{H,W}]\), where \(\varvec{x}_{i,j}^\ell\) and \(\varvec{\hat{x}}_{i,j}\in \mathbb {R}^{1\times 1 \times C}\) correspond to the spatial location (i, j) with \(i\in \{1, 2, \ldots , H\}\) and \(j \in \{1, 2, \ldots , W\}\). The spatial squeeze operation is achieved through a convolution

$$\varvec{Q}^\ell =\varvec{W}_1 * \delta \left( \varvec{W}_2 * \varvec{X}^\ell +\varvec{W}_3 * \varvec{\hat{X}}\right),$$

where \(\varvec{W}_{1}\in \mathbb {R}^{1\times 1 \times C_{\hat{x}}\times 1}\) is the weight of the spatial squeeze convolution layer, and \(\varvec{W}_2\in \mathbb {R}^{C_x\times C_{\hat{x}}}\) and \(\varvec{W}_{3}\in \mathbb {R}^{C_{\hat{x}}\times C_{\hat{x}}}\) reduce the feature channels of \(\varvec{X}^\ell\) and \(\varvec{\hat{X}}\) to the same number \({C_{\hat{x}}}\). Each \(Q_{i,j}^\ell\) of the projection represents the linearly combined representation of all channels C at the spatial location (i, j). This projection is passed through a sigmoid layer \(\sigma (\cdot )\) to rescale the activations to [0, 1]:

$$\varvec{\tilde{X}}_{s}^{\ell }=\varvec{\tilde{Q}^\ell }\otimes \varvec{X}^\ell =\left[ \sigma ({Q}_{1,1}^\ell )\varvec{x}_{1,1}^\ell ,\ldots ,\sigma ({Q}_{i,j}^\ell )\varvec{x}_{i,j}^\ell ,\ldots ,\sigma ({Q}_{H,W}^\ell )\varvec{x}_{H,W}^\ell \right] , \quad \varvec{\tilde{Q}^\ell }=\sigma (\varvec{Q}^\ell ),$$

where each value \(\sigma (Q_{i,j}^\ell )\) corresponds to the relative importance of the spatial information at (i, j) of a given feature map.
With the channel and spatial attention masks computed from the relevance between the decoder and encoder features during fusion, we first perform an element-wise multiplication between each attention tensor and the original encoder features, and then aggregate the two excited features by an element-wise sum to obtain the final representations, reflecting effective fusion through the skip connections for better segmentation. The cleaned-up version is \(\varvec{\tilde{X}_{cs}^{\ell }}=\varvec{\tilde{P}^{\ell }}\otimes \varvec{X}^\ell +\varvec{\tilde{Q}^{\ell }}\otimes \varvec{X}^\ell\), the element-wise addition of the channel- and spatial-excited features, where \(\otimes\) denotes element-wise multiplication. The final output feature is obtained by concatenating all the features: \(\varvec{\tilde{X}_{out}}=\varvec{Concat}\left[ \varvec{\tilde{X}_{cs}}^{1},\varvec{\tilde{X}_{cs}}^{2},\varvec{\tilde{X}_{cs}}^{3},\varvec{\tilde{X}_{cs}}^{4},\varvec{\hat{X}_U},\varvec{\hat{X}_{D}}\right]\).
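The two parallel branches can be sketched as a single PyTorch module. The wiring of the linear and 1 × 1 convolution layers below is a plausible reading of the weight dimensions given above, not a verbatim reproduction of the authors' implementation; `x_enc` is assumed to be an encoder feature already resized to the decoder resolution:

```python
import torch
import torch.nn as nn

class CrossCoAttention(nn.Module):
    """Channel- and spatial-wise cross attention between one encoder
    feature map (key) and the decoder query feature (sketch)."""
    def __init__(self, enc_ch, dec_ch):
        super().__init__()
        mid = dec_ch // 2
        # channel branch: squeeze both maps by GAP, fuse, excite the encoder
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.l_enc = nn.Linear(enc_ch, mid)   # L1
        self.l_dec = nn.Linear(dec_ch, mid)   # L3 (transposed form)
        self.l_out = nn.Linear(mid, enc_ch)   # L2
        # spatial branch: 1x1 convs align channels, then squeeze to one map
        self.w_enc = nn.Conv2d(enc_ch, dec_ch, 1)  # W2
        self.w_dec = nn.Conv2d(dec_ch, dec_ch, 1)  # W3
        self.w_out = nn.Conv2d(dec_ch, 1, 1)       # W1
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x_enc, x_dec):
        b, c = x_enc.shape[:2]
        # channel attention mask, shape (B, C, 1, 1)
        p = self.l_out(self.relu(self.l_enc(self.gap(x_enc).flatten(1))
                                 + self.l_dec(self.gap(x_dec).flatten(1))))
        p = torch.sigmoid(p).view(b, c, 1, 1)
        # spatial attention mask, shape (B, 1, H, W)
        q = torch.sigmoid(self.w_out(self.relu(self.w_enc(x_enc)
                                               + self.w_dec(x_dec))))
        # element-wise sum of the channel- and spatial-excited features
        return p * x_enc + q * x_enc
```

Because both branches excite the same encoder tensor and are simply summed, they can run in parallel, as Fig. 4 indicates.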
Deep supervision
To improve backpropagation and make the decoder more stable, we introduce deep supervision [21] at the four stages of the decoder. Deep supervision guides the feature learning of the hidden layers directly under the supervision of the loss and labels. We upsample the features from the first three hidden stages to the size of the last prediction stage and add three more losses to supervise them. The final output of the decoder is then rescaled to the original input size. The rescaled output is further fed into a softmax layer to produce the class probability distribution. Note that deep supervision is not used in the inference stage; we only use the last layer of the decoder, Side Output 1, to produce the segmentation prediction.
Training and inference
Following the main idea of enhancing the decoder of U-Net, we add horizontal deep supervision at the four decoder levels. We choose deconvolutions with kernel sizes 2 × 2, 4 × 4 and 8 × 8 to resize the output of every decoder layer to the size of the ground truth. We then compute the losses of these four layers and use backpropagation to update their weights, deploying direct guidance to the decoder and further improving the accuracy of the reconstruction operation.
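The side-output heads can be sketched as follows, assuming illustrative decoder channel widths of 64/128/256/512 and a 256 × 256 input; each deconvolution's kernel and stride restore the ground-truth resolution:

```python
import torch
import torch.nn as nn

# Hypothetical channel widths; the deconv kernel/stride pairs upsample the
# three deeper side outputs by 2x, 4x and 8x to the ground-truth size.
side_heads = nn.ModuleList([
    nn.Conv2d(64, 1, 1),                                  # level 1: full res
    nn.ConvTranspose2d(128, 1, kernel_size=2, stride=2),  # level 2: 2x
    nn.ConvTranspose2d(256, 1, kernel_size=4, stride=4),  # level 3: 4x
    nn.ConvTranspose2d(512, 1, kernel_size=8, stride=8),  # level 4: 8x
])

feats = [torch.randn(1, 64, 256, 256), torch.randn(1, 128, 128, 128),
         torch.randn(1, 256, 64, 64), torch.randn(1, 512, 32, 32)]
side_outputs = [head(f) for head, f in zip(side_heads, feats)]
```

With kernel size equal to stride, each transposed convolution produces exactly the stated scale factor, so all four side outputs match the ground-truth resolution.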
For each layer, we employ the combined binary cross entropy loss and Dice loss as our loss function:

$$\mathcal {L}=-\frac{1}{N}\sum _{n=1}^{N}\left[ \varvec{Y}_n\log \varvec{\hat{Y}}_n+(1-\varvec{Y}_n)\log (1-\varvec{\hat{Y}}_n)\right] +\left( 1-\frac{2\sum _{n=1}^{N}\varvec{Y}_n\varvec{\hat{Y}}_n}{\sum _{n=1}^{N}\varvec{Y}_n+\sum _{n=1}^{N}\varvec{\hat{Y}}_n}\right),$$

where \(\varvec{Y}\) and \(\varvec{\hat{Y}}\) denote the ground-truth labels and predicted probabilities in the batch, \(\varvec{Y}_n\) and \(\varvec{\hat{Y}}_n\) denote the nth pixel of \(\varvec{Y}\) and \(\varvec{\hat{Y}}\), and N indicates the number of pixels within one batch. We empirically assign the two terms equal weight. The overall loss function for MCA-UNet is then defined as the weighted summation of the combined losses from each level of the decoder:

$$\mathcal {L}_{total}=\sum _{i=1}^{4} w_i\,\mathcal {L}_i,$$
where i indexes the level of the decoder and \(w_i\) is the weight of each loss.
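The combined loss and its deeply supervised weighted sum can be sketched as:

```python
import torch

def bce_dice_loss(y_pred, y_true, eps=1e-7):
    """Equally weighted sum of binary cross entropy and Dice loss over
    predicted probabilities y_pred and binary labels y_true."""
    y_pred = y_pred.clamp(eps, 1 - eps)  # numerical stability for the logs
    bce = -(y_true * y_pred.log() + (1 - y_true) * (1 - y_pred).log()).mean()
    inter = (y_true * y_pred).sum()
    dice = 1 - 2 * inter / (y_true.sum() + y_pred.sum() + eps)
    return bce + dice

def deep_supervision_loss(preds, y_true, weights):
    # weighted sum of the per-level combined losses
    return sum(w * bce_dice_loss(p, y_true) for w, p in zip(weights, preds))
```

In training, `preds` would be the four side outputs after the softmax/sigmoid layer, and `weights` the per-level weights produced by BMTD.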
The performance of deep supervision highly depends on an appropriate choice of weights among the different tasks, and setting these weights appropriately is a key issue. A naive approach is to assign each individual task an equal weight, which is inappropriate because the tasks to be optimized have different difficulty levels. In this work, we cast deep supervision as a multi-task learning formulation and assign different weights to different tasks. We propose a dynamic task weighting algorithm, named BMTD, which helps the model achieve balanced training automatically by dynamically tuning the weight of each task during optimization. The weight of each task changes every batch. We measure how well the model is trained for each task by the loss ratio between the current loss and the initial loss: a task that is not yet trained well has a larger loss ratio. Hence, the harder tasks are optimized with higher priority than the easier tasks.
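A minimal sketch of the BMTD weighting; the normalization step (scaling the weights to sum to the number of tasks) is an assumption, since the text specifies only that the per-batch weight is derived from the loss ratio:

```python
import torch

def bmtd_weights(current_losses, initial_losses):
    """Balanced multi-task dynamic weights (sketch): each task's weight is
    proportional to its loss ratio (current / initial), so tasks that have
    made less training progress receive higher priority."""
    ratios = torch.tensor([c / i for c, i in zip(current_losses, initial_losses)])
    # assumed normalization: weights sum to the number of tasks
    return (ratios / ratios.sum()) * len(ratios)
```

A task whose loss has barely decreased keeps a ratio near 1 and therefore receives a larger weight than a task whose loss has already dropped.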
Experiment and results
Implementation details
The proposed architecture is listed in Table 1. We used Adam as the optimizer and set the learning rate to 5e−3 and the batch size to 24. To avoid overfitting, we used early stopping with a patience of 50 epochs. The final number of training epochs was about 200. For all the compared methods, we used the same parameter settings.
Data and experimental setting
COVID-19 lung CT image segmentation
We used the public COVID-19 CT images collected by the Italian Society of Medical and Interventional Radiology (SIRM)^{Footnote 1}, which contains 100 training and 10 testing images. The ground-truth segmentation was done by a trained radiologist. The raw data are publicly available.^{Footnote 2} We performed 5-fold cross validation and augmented the data by rotating and rescaling. To improve the computational efficiency of the model, we resized the images to 256\(\times\)256 pixels. Three evaluation metrics were adopted: the Dice coefficient (Dice), Precision and Recall.
Retinal microaneurysms segmentation
For this task, we used the Indian Diabetic Retinopathy Image Dataset (IDRiD) [22], which contains 81 images: 54 for training and 27 for testing. The ground-truth segmentation has precise pixel-level annotations of abnormalities associated with DR. We chose microaneurysms (MA) and hard exudates (EX) as the target lesions in our experiments, since both are small and sparsely distributed. We computed the area under the precision-recall curve (AUC-PR), the area under the receiver operating characteristic curve (AUC-ROC) and the Dice coefficient (Dice) to quantitatively evaluate the segmentation results. We used online data augmentation including resizing, random cropping, random rotation and CLAHE.
The comparison on the COVID-19 dataset
We carried out experiments on the COVID-19 dataset to evaluate the effectiveness of our method. Note that the compared models have the same encoder-decoder framework as MCA-UNet, including the number of channels, network depth and training strategies. We chose U-Net with ResBlocks as our backbone segmentation architecture. The average Dice, Precision and Sensitivity of all the methods are listed in Table 2, which shows that these enhancements lead to notable improvements on the two segmentation tasks. Our model yields the overall highest performance, with an increase of 3.66% in Dice for GGO segmentation and 12.30% in Dice for consolidation segmentation compared to the baseline U-Net. Particularly for consolidation, the performance increase is striking. Compared with the U-Net with residual blocks, the cross co-attention module brings a 3.15% improvement. The attention information from different layers in the encoder provides complementary features, which clearly improves the segmentation accuracy. Meanwhile, the deep supervision module alone outperforms the baseline by 1.71%. Therefore, learning the feature representations with direct supervision in the deeper layers is important. When we integrate deep supervision and MCA together, the performance further improves to 52.60%, outperforming the individual DS and MCA components. With the BMTD optimization algorithm, improvements of 0.47% and 1.27% are achieved on ground glass and consolidation, respectively. These observations show the crucial role of the BMTD optimization, and also indicate that the side outputs cannot simply be used with equal weights.
To more comprehensively evaluate our model, we chose some typical methods for further comparison. For the COVID-19 dataset, we compared the proposed MCA-UNet with UNet++ (ResBlock) [10], MultiResUNet [11] and Attention U-Net [13]. All of these networks have an encoder-decoder architecture. We also compared with UNet++ using ResNet101 as a powerful encoder.
The experimental results obtained by several state-of-the-art segmentation networks are reported in Table 3. Comparing the results in Table 3, we can observe that MCA-UNet achieves the best segmentation performance. Compared with the other networks proposed for medical image segmentation, namely UNet++ (ResNet101), MultiResUNet and Attention U-Net, our network achieves average improvements of 6.59%, 4.89% and 5.21% in Dice, 5.40%, 3.33% and 6.06% in Precision, and 5.09%, 4.76% and 1.88% in Sensitivity, respectively. Except for the sensitivity, our model also obtains improvements of 4.21% and 6.85% in Dice and Precision compared with UNet++ (ResBlock). Based on the above quantitative analysis, we can see that the cross skip connections guided by the co-attention mechanism are helpful for the refinement and fusion of complementary information between multi-scale features. In particular, the proposed multi-scale guided attention network achieves better results than Attention U-Net, which also integrates attention modules. Besides, we visualize the segmentation results of the compared models in Fig. 5. The red boxes highlight regions where MCA-UNet performs better than the other methods by making better use of the multi-scale context fusion and attention scheme. Our MCA-UNet generates segmentation results that are more similar to the ground truth than those of the competing models. From the empirical results, we summarize the following findings:

1.
In the 1st and 2nd cases, the boundaries of GGO have low contrast and blurred appearances, making them difficult to identify. MCA-UNet predicts finer boundary information and maintains object coherence, which demonstrates the effectiveness of modeling global context representations. It indicates that the multi-scale fusion helps to discover more complete and accurate areas of low-contrast classes of interest.

2.
Consolidations vary significantly in size and shape and have irregular and ambiguous boundaries. In the 3rd and 4th cases, the consolidations have a narrow shape, and the predictions of MCA-UNet capture the boundary well. MCA-UNet clearly preserves more details owing to its multi-scale features from different encoder levels. In the 5th case, where the lesions have irregular boundaries, the segmentation results generated by our method are closer to the ground truth. Moreover, it also introduces fewer mislabeled pixels, which leads to better performance than the other methods. These visual results indicate that our approach can recover finer segmentation details while avoiding distraction in ambiguous regions. The other networks, in contrast, produce smoother segmentations, resulting in a loss of fine-grained details. As UNet++ and UNet++ (ResBlock) also employ a multi-scale architecture, these differences suggest that incorporating higher scales and an effective cross co-attention can indeed improve the performance of segmentation networks. Both methods tend to over-segment, which may be caused by the lack of higher-resolution features.
In summary, the previous approaches suffer from two main limitations in the segmentation of COVID-19: large variations of consolidation and the blurred boundary of GGO. The large variations of consolidation in CT lead to inaccurate predictions by the baseline and the compared methods, because their insufficient multi-scale features fail to cope with such variations. The blurred boundary of GGO leads to inaccurate predictions due to the lack of high-resolution spatial information, which is lost or distorted by the pooling and upsampling operations. Both the quantitative evaluation in Table 3 and the qualitative comparison in Fig. 5 demonstrate the effectiveness of the proposed MCA-UNet for COVID-19 segmentation.
The comparison on the IDRiD dataset
For the IDRiD dataset, we compared MCA-UNet with SESV-DLab [26], SSCL [24], DRU-Net [23], and three top-ranking methods on the IDRiD challenge leaderboard [22]. DRU-Net (Deep Recurrent U-Net) is a model that combines the deep residual model and recurrent convolutional operations into U-Net. SSCL is an advanced semi-supervised collaborative learning model. DeepLabv3+ is an extension of DeepLabv3 that introduces a decoder module to better recover the spatial resolution and further refine the final segmentation masks. Unlike the common approach of constructing a more accurate segmentation model, the aim of SESV-DLab is to predict the segmentation errors produced by an existing model and then correct them.
The performance of these methods is shown in Tables 4 and 5. The results show that our model achieves the highest AUC-PR and AUC-ROC. In particular, for the segmentation of MA in Table 4, our model beats the top-3 ranking methods by 3.77%, 5.15% and 9.83% in terms of AUC-PR, setting a new state of the art. This demonstrates again that our model is able to produce precise and reliable results for medical image segmentation.
Most of the existing U-shaped methods perform well on large-object segmentation but fail to detect small objects, which are particularly prevalent in eye diseases. Due to the downsampling and upsampling operations in U-Net, the feature maps in the hidden layers are sparser than the original inputs, which causes a loss of image details and makes the compared segmentation models yield inferior performance on the small lesions. Figure 6 shows some representative results of our method and the compared methods, exhibiting the superiority of the proposed method on the segmentation of MA and EX. As illustrated in Fig. 6, from the top three examples we find that the compared segmentation methods are limited in small-lesion segmentation and produce large numbers of false positives. From the bottom three examples, it can be observed that both UNet++ (ResNet101) and UNet++ (ResBlock) suffer from over-segmentation. On the contrary, the boundary of the EX is under-segmented by both Attention U-Net and MultiResUNet. None of the compared segmentation models is capable of precisely segmenting the small lesions. In the medical image domain, segmentation models need to learn multi-scale information to facilitate target segmentation. The results show that MCA-UNet can significantly reduce the false positives and correct some regions inaccurately segmented by the previous algorithms.
Discussion
Discussion on the number of dense skip connections
Multiscale dense connection and cross co-attention (CCA) are the two vital modules that allow our segmentation model to achieve better performance. To investigate the contribution of each component, we conduct a series of experiments on EX segmentation, varying the number of skip connections, the skip connection scheme, and the positions of the skip connections. By varying the number of skip connections in the bi-decoder, we explore their influence on EX segmentation performance. Moreover, to isolate the contribution of CCA, we replace it in all the bi-decoders with the simple concatenation fusion used in UNet. Fig. 7 illustrates the competing models; 'w/o up' and 'w/o down' mean that the upsampling or downsampling operations in the skip connections are removed, respectively.
As shown in Table 6, the proposed CCA consistently outperforms the simple concatenation fusion, demonstrating its robustness and flexibility for integrating information from earlier feature maps. Table 6 also shows that segmentation performance improves as the number of skip connections increases. Comparing models that retain the upsampled connections with those that remove them, the former are worse at the same connection count: for example, MCAUNet2 (w/o up) improves over MCAUNet2 (w/o down) by 1.05%, which indicates that higher resolution is important for fine spatial recovery, whereas connections from lower-resolution encoders do not help the decoders. These findings show that spatial information is especially critical for segmenting multiscale lesion objects, above all the small lesions. MCAUNet2 (w/o down) performs the worst, even worse than MCAUNet1; its skip connection scheme is similar to UNet++, where the decoders are connected to the lower-resolution feature maps of the encoders. Another interesting finding is that MCAUNet4 without CCA achieves a relatively poor AUPR compared with MCAUNet1 with CCA. These results again validate that simply concatenating same-level feature maps from the encoder and decoder is not an optimal solution.
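As a concrete illustration of the dense skip scheme studied above, each decoder stage gathers encoder features from every level, resampled to the decoder's resolution before fusion. The NumPy sketch below is a simplification under stated assumptions: the function names are our own, nearest-neighbour resampling stands in for the model's learned up/downsampling, and plain concatenation stands in for the CCA-based fusion.

```python
import numpy as np

def resize_nearest(x, size):
    """Nearest-neighbour resample a (C, H, W) feature map to (C, size, size)."""
    c, h, w = x.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return x[:, rows][:, :, cols]

def dense_skip(encoder_feats, target_size):
    """Gather features from all encoder levels at one decoder's scale.

    encoder_feats: list of (C_i, H_i, W_i) arrays, one per encoder level.
    Returns the channel-wise concatenation at the decoder's resolution.
    """
    resampled = [resize_nearest(f, target_size) for f in encoder_feats]
    return np.concatenate(resampled, axis=0)

# Four encoder levels at resolutions 64, 32, 16, 8 with 16, 32, 64, 128 channels.
feats = [np.random.rand(c, s, s) for c, s in [(16, 64), (32, 32), (64, 16), (128, 8)]]
fused = dense_skip(feats, 32)   # decoder D_2 operates at 32x32
print(fused.shape)              # (240, 32, 32)
```

Removing the upsampled connections ('w/o up') corresponds to dropping the levels whose resolution is below `target_size` from `encoder_feats`; 'w/o down' drops the levels above it.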
Discussion on different attention mechanisms
Building on the skip connections for information fusion, we systematically experiment with different attention mechanisms; the results are shown in Table 7. The comparisons include the spatial-and-channel-wise CCA vs. the spatial-wise CCA, the self-attention (SA) of the encoder or decoder vs. CCA, and the sequential vs. the concurrent CCA. Traditional self-attention captures dependencies within the same feature map from the spatial and channel-wise perspectives, whereas our CCA captures the correlation between two feature maps, one from the encoder and one from the decoder. The proposed concurrent CCA clearly improves upon the traditional self-attention methods in terms of Dice and precision. The channel maps help capture context information for the feature fusion, and integrating the spatial and channel-wise attention together further improves the Dice score to 62.48%. Furthermore, comparing the sequential and concurrent fashions of the encoder-decoder cross co-attention, the concurrent fashion improves Dice by 0.35% over the sequential model.
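To make the concurrent spatial- and channel-wise gating concrete, the sketch below derives a channel gate and a spatial gate from an encoder/decoder feature pair and applies both in parallel before merging. The gating functions here are deliberately simplified stand-ins (the paper's CCA uses learned projections); only the concurrent structure, as opposed to applying one gate after the other sequentially, is illustrated.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def concurrent_cross_attention(enc, dec):
    """Concurrent spatial + channel cross attention between two (C, H, W) maps.

    Both gates are derived from the encoder/decoder pair, applied to the
    encoder features in parallel, and the two recalibrated maps are summed.
    """
    # Channel gate: agreement of per-channel descriptors (global average pooling).
    enc_gap = enc.mean(axis=(1, 2))          # (C,)
    dec_gap = dec.mean(axis=(1, 2))          # (C,)
    ch_gate = sigmoid(enc_gap * dec_gap)     # (C,)
    # Spatial gate: per-pixel agreement of the channel-averaged maps.
    sp_gate = sigmoid(enc.mean(axis=0) * dec.mean(axis=0))   # (H, W)
    # Apply both gates concurrently and merge by summation.
    return enc * ch_gate[:, None, None] + enc * sp_gate[None, :, :]

enc = np.random.rand(8, 16, 16)   # encoder feature map
dec = np.random.rand(8, 16, 16)   # decoder feature map at the same scale
out = concurrent_cross_attention(enc, dec)
print(out.shape)  # (8, 16, 16)
```

A sequential variant would feed the channel-gated map into the spatial gating step; Table 7 indicates that the concurrent form above performs slightly better.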
Discussion on positions of the proposed dense skip connections
We performed a series of experiments on the positions of the proposed skip connections, summarized in Table 8 and illustrated in Fig. 8. Let \(\textbf{E}_i \rightarrow \textbf{D}_j\), where \(i,j=1,\ldots,4\), indicate how encoder features are connected to the decoders; for example, \(\textbf{E}_1 \rightarrow (\textbf{D}_1, \textbf{D}_2, \textbf{D}_3, \textbf{D}_4)\) means that encoder \(\textbf{E}_1\) is connected to decoders \(\textbf{D}_1, \textbf{D}_2, \textbf{D}_3\) and \(\textbf{D}_4\). Although the proposed CCA module contributes to the performance improvement shown in the previous results, it is interesting to investigate (1) which level of encoder is most important for the decoders, and (2) which layer of decoder benefits most from the same combination of multiscale encoder features. MCAUNet with the full dense connections outperforms the variants with certain connections removed. \(\textbf{E}_1 \rightarrow (\textbf{D}_1, \textbf{D}_2, \textbf{D}_3, \textbf{D}_4)\) obtains the best AUPR, indicating that the low-level, higher-resolution features are important: this configuration takes full advantage of the rich spatial information, which helps refine the predicted boundaries of lesions with complex structure. On the contrary, \(\textbf{E}_4 \rightarrow (\textbf{D}_1, \textbf{D}_2, \textbf{D}_3, \textbf{D}_4)\) performs the worst, possibly because spatial information is lost along the contracting path and the semantic gap is too large, resulting in poor fusion.
Deep supervision
To test the effectiveness of the deep supervision scheme, we report the performance of each individual side output. From Table 8, we observe only small differences among the side-output predictions except for \(\textbf{D}_4\), and the performance of \(\textbf{D}_1\) is slightly better than that of \(\textbf{D}_2\) and \(\textbf{D}_3\). We also tried an ensemble-based method in which multiple side outputs are combined into a final prediction; the ensemble of \(\textbf{D}_1+\textbf{D}_2\) achieves slightly better performance than any individual output.
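The side-output ensemble amounts to a (weighted) average of the per-head probability maps. A minimal sketch, with hypothetical names and uniform weights standing in for whatever combination rule the experiments actually used:

```python
import numpy as np

def ensemble_side_outputs(side_probs, weights=None):
    """Average side-output probability maps into one prediction.

    side_probs: list of (H, W) probability maps from decoder heads D_1..D_k.
    weights: optional per-head weights (uniform if omitted).
    """
    stack = np.stack(side_probs, axis=0)            # (k, H, W)
    if weights is None:
        weights = np.full(len(side_probs), 1.0 / len(side_probs))
    weights = np.asarray(weights, dtype=float)
    return np.tensordot(weights, stack, axes=1)     # weighted mean over heads

d1 = np.full((4, 4), 0.8)   # toy probability map from head D_1
d2 = np.full((4, 4), 0.6)   # toy probability map from head D_2
fused = ensemble_side_outputs([d1, d2])
print(fused[0, 0])          # approximately 0.7, the uniform average of 0.8 and 0.6
```

Thresholding `fused` then yields the final binary segmentation mask.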
Conclusion
In this work, we introduced a multiscale cross co-attentional skip connection UNet architecture for medical image segmentation. MCAUNet uses a multiscale feature fusion strategy to combine semantic information at different levels and a cross co-attention module to aggregate relevant global dependencies. To validate the approach, we conducted experiments on three segmentation tasks over two medical image datasets, covering consolidation, GGO, microaneurysm and hard exudate lesions, indicating that it can be broadly applied to other medical image segmentation tasks. We also provided extensive experiments evaluating the impact of the individual components of the proposed architecture. In future work, we will extend our 2D model to a 3D version to capture the inter-slice continuity of lesions.
Availability of data and materials
Data are publicly available.
References
Litjens G, Kooi T, Bejnordi BE, Setio AA, Ciompi F, Ghafoorian M, Van Der Laak JA, Van Ginneken B, Sánchez CI. A survey on deep learning in medical image analysis. Med Image Anal. 2017;42:60–88.
Pham DL, Xu C, Prince JL. Current methods in medical image segmentation. Ann Rev Biomed Eng. 2000;2(1):315–37.
Tan W, Huang P, Li X, Ren G, Chen Y, Yang J. Analysis of segmentation of lung parenchyma based on deep learning methods. J X-Ray Sci Technol. 2021;29(6):945–59.
Tan W, Liu P, Li X, Xu S, Chen Y, Yang J. Segmentation of lung airways based on deep learning methods. IET Image Process. 2022;16(5):1444–56.
Wang L, Gu J, Chen Y, Liang Y, Zhang W, Pu J, Chen H. Automated segmentation of the optic disc from fundus images using an asymmetric deep learning network. Pattern Recognit. 2021;112:107810.
Ronneberger O, Fischer P, Brox T. UNet: convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention (MICCAI), volume 9351 of LNCS, 2015;234–241. Springer.
Isensee F, Kickingereder P, Wick W, Bendszus M, MaierHein KH. Brain Tumor Segmentation and Radiomics Survival Prediction: Contribution to the BRATS 2017 Challenge. In: Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries, Lecture Notes in Computer Science, pp 287–297, Cham, 2018.
Falk T, Mai D, Bensch R, Ronneberger O. UNet: deep learning for cell counting, detection, and morphometry. Nat Methods. 2019;16(1):67–70.
Qian Y, Gao Y, Zheng Y, Zhu J, Dai Y, Shi Y. CrossoverNet: leveraging verticalhorizontal crossover relation for robust medical image segmentation. Pattern Recognit. 2021;113: 107756.
Zhou Z, Siddiquee MMR, Tajbakhsh N, Liang J. UNet++: redesigning skip connections to exploit multiscale features in image segmentation. IEEE Trans Med Imaging. 2020;39(6):1856–67.
Ibtehaz N, Rahman MS. MultiResUNet: rethinking the UNet architecture for multimodal biomedical image segmentation. Neural Netw. 2020;121:74–87.
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. Attention is all you need. In: Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
Oktay O, Schlemper J, Folgoc LL, Lee M, Heinrich M, Misawa K, Mori K, McDonagh S, Hammerla NY, Kainz B, Glocker B, Rueckert D. Attention UNet: Learning Where to Look for the Pancreas. arXiv:1804.03999 [cs], 2018.
Maier O, et al. ISLES 2015: a public evaluation benchmark for ischemic stroke lesion segmentation from multispectral MRI. Med Image Anal. 2017;35:250–69.
Li X, Chen H, Qi X, Dou Q, Fu CW, Heng PA. HDenseUNet: hybrid densely connected UNet for liver and tumor segmentation from CT volumes. IEEE Trans Med Imaging. 2018;37(12):2663–74.
He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 770–778, Las Vegas, NV, USA, 2016. IEEE.
Yang J, Wu B, Li L, Cao P, Zaiane O. MSDSUNet: a multiscale deeply supervised 3D UNet for automatic segmentation of lung tumor in CT. Comput Med Imaging Graph. 2021;92:101957.
Kamnitsas K, Ledig C, Newcombe VF, Simpson JP, Kane AD, Menon DK, Rueckert D, Glocker B. Efficient multiscale 3D CNN with fully connected CRF for accurate brain lesion segmentation. Med Image Anal. 2017;36:61–78.
Li X, Hu X, Yu L, Zhu L, Fu CW, Heng PA. CANet: cross-disease attention network for joint diabetic retinopathy and diabetic macular edema grading. IEEE Trans Med Imaging. 2020;39(5):1483–93.
Wang Q, Wu B, Zhu P, Li P, Zuo W, Hu Q. ECANet: efficient channel attention for deep convolutional neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
Lee CY, Xie S, Gallagher P, Zhang Z, Tu Z. Deeply-supervised nets. In: Lebanon G, Vishwanathan SVN, editors. Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38 of Proceedings of Machine Learning Research, pp 562–570, San Diego, 2015. PMLR.
Porwal P, Pachade S, Kamble R, Kokare M, Deshmukh G, Sahasrabuddhe V, Meriaudeau F. Indian diabetic retinopathy image dataset (IDRiD): a database for diabetic retinopathy screening research. Data. 2018;3(3):25.
Kou C, Li W, Liang W, Yu Z, Hao J. Microaneurysms segmentation with a UNet based on recurrent residual convolutional neural network. J Med Imaging. 2019;6(2):025008.
Zhou Y, He X, Huang L, Liu L, Zhu F, Cui S, Shao L. Collaborative learning of semisupervised segmentation and classification for medical images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2079–2088, 2019.
Chen LC, Papandreou G, Kokkinos I, Murphy K, Yuille AL. DeepLab: semantic image segmentation with deep convolutional nets, Atrous convolution, and fully connected CRFs. IEEE Trans Pattern Anal Mach Intell. 2018;40(4):834–48.
Xie Y, Zhang J, Hao L, Shen C, Xia Y. SESV: accurate medical image segmentation by predicting and correcting errors. IEEE Trans Med Imaging. 2021;40(1):286–96.
Funding
This research was supported by the National Natural Science Foundation of China (No. 62076059), the Science Project of Liaoning province (2021MS105) and the National Natural Science Foundation of China (No. 61971118).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Informed consent
Not applicable
Cite this article
Wang, H., Cao, P., Yang, J. et al. MCAUNet: multiscale cross co-attentional UNet for automatic medical image segmentation. Health Inf Sci Syst 11, 10 (2023). https://doi.org/10.1007/s13755-022-00209-4