CAT-Seg: cascaded medical assistive tool integrating residual attention mechanisms and Squeeze-Net for 3D MRI biventricular segmentation

Cardiac image segmentation is a critical step in the early detection of cardiovascular disease. The segmentation of the biventricular is a prerequisite for evaluating cardiac function in cardiac magnetic resonance imaging (CMRI). In this paper, a cascaded model, CAT-Seg, is proposed for segmentation of 3D-CMRI volumes. CAT-Seg addresses the problem of biventricular confusion with other regions and localizes the region of interest (ROI) to reduce the scope of processing. A modified DeepLabv3+ variant integrating SqueezeNet (SqueezeDeepLabv3+) is proposed as part of CAT-Seg. SqueezeDeepLabv3+ handles the varying shapes of the biventricular through the different cardiac phases, as the biventricular accounts for only a small portion of the volume slices. CAT-Seg also presents a segmentation approach that integrates attention mechanisms into a 3D Residual UNet architecture (3D-ResUNet), called 3D-ARU, to improve the segmentation of the three major structures: the left ventricle (LV), myocardium (Myo), and right ventricle (RV). The integration of the spatial attention mechanism into ResUNet handles the fuzzy edges of the three structures. The proposed model achieves promising results in training and testing on the Automatic Cardiac Diagnosis Challenge (ACDC 2017) dataset and in external validation on MyoPS. CAT-Seg demonstrates competitive performance with state-of-the-art models. On ACDC 2017, CAT-Seg segments LV, Myo, and RV with average minimum Dice similarity coefficient (DSC) performance gaps of 1.165%, 4.36%, and 3.115% respectively. The average maximum improvements in DSC when segmenting LV, Myo, and RV are 4.395%, 6.84%, and 7.315% respectively. On MyoPS external validation, CAT-Seg outperforms the state of the art in segmenting LV, Myo, and RV with average minimum performance gaps of 6.13%, 5.44%, and 2.912% respectively.


Introduction
Cardiovascular diseases (CVDs) are one of the top three causes of death globally, posing a serious threat to human health [1]. Early detection and evaluation of cardiovascular disease are critical to improving human life [1,2]. Diagnosis of CVDs involves an extensive examination of the cardiac system [2]. In clinical practice, a cardiac radiologist traces the biventricular contours during the end-systolic (ES) and end-diastolic (ED) phases, which typically requires a lot of time even for skilled cardiac radiologists analyzing the MRI slices of a single patient [3]. The physiological shape of the biventricular substructures (left ventricle (LV), myocardium (Myo), and right ventricle (RV)) is affected by most cardiovascular diseases [4]. It is possible to significantly reduce the risk of developing CVDs like heart failure and ischemic heart disease by detecting biventricular morphological structure changes over an extended period of time through repetitive contouring of cardiac structures to reveal abnormal ratios or dysfunction [2]. Hence, automated biventricular segmentation has a significant impact on the detection and treatment of CVDs [3]. Moreover, because the current delineation methods are very time-consuming, the development of fast, robust, precise, and clinician-friendly segmentation tools is essential in order to increase clinician productivity and enhance patient care [4].
Various semi-automatic and automatic cardiac segmentation methods have been developed. Early methods employed semi-automatic segmentation approaches such as those presented in the work of Ding et al. [9], Sharan et al. [10], and Decourt et al. [11]. Semi-automatic methods necessitate significant user intervention; as a result, they are unsuitable for applications requiring rapid segmentation. Therefore, recent studies have focused on automatic CMRI segmentation. Some focus on LV segmentation, while others consider the biventricular, performing this task in one or more stages. Lately, end-to-end deep learning segmentation models have frequently been used in conjunction with traditional methods. Table 1 summarizes the recent approaches developed to address cardiac segmentation. Some of the recent approaches lost the generalization of the model by removing patients with complex congenital intra-cardiac anatomies, such as patients with univentricular hearts and patients following surgical correction of transposition of the great vessels [14,16].
The majority of current segmentation models require biventricular prepositioning and redundant learning parameters, which results in poor segmentation performance. Moreover, some of the mentioned models [15,17] do not consider the ES phase. The difficulty of considering the ES phase is the need to handle different portions of the biventricular at varying scales. In addition, the biventricular suffers from distorted, unclear borders. To address these shortcomings, the framework proposed in this paper is inspired by the ResNet- and UNet-based methods mentioned above, which break down the segmentation process into two steps: localization and segmentation [2,10,14,15,17]. However, unlike previous methods, each step is designed with specific techniques capable of producing promising results while considering the segmentation time. An approach based on DeepLabv3+ and SqueezeNet is proposed for ROI localization. In addition, a 3D-ARU architecture is proposed that combines UNet and ResNet with a spatial attention mechanism for the segmentation process. As a result, CAT-Seg, the proposed framework, can achieve efficient segmentation results.

Methodology
In this section, we first introduce the details of the data sources used for biventricular segmentation. Then, the architecture of the proposed framework for segmenting the three cardiac substructures is presented.

Dataset
Two datasets are used to validate the performance of our proposed framework CAT-Seg: the Automatic Cardiac Diagnosis Challenge dataset (ACDC 2017) [23] from the 2017 MICCAI challenge and the MyoPS dataset from the 2020 MICCAI challenge [24]. MyoPS 2020 contains 25 multi-sequence CMR images (102 slices) as a training set and 20 images (72 slices) as a testing set, collected using a Philips Achieva 1.5T scanner. The short-axis slices of the three CMR sequences were all breath-hold and multi-slice. All patients are males suffering from myocardial infarction (MI). Three observers manually labeled the LV, RV, and Myo from each of the three CMR sequences to create the ground truth segmentation. Before being employed in the creation of the ground truth, all of the manual segmentation results were approved by three experts in cardiac anatomy. The multiple manual delineations were then averaged using a shape-based method to produce the final segmentation.

Model
The proposed framework consists of two stages to segment the three biventricular substructures (LV, Myo, and RV) in both cardiac phases (ED and ES). The first stage reduces the image's scope by roughly extracting an initial region of interest (ROI) using SqueezeDeepLabv3+ to overcome the problem of class imbalance, as the biventricular accounts for only a small portion of the MRI slices. The second stage comprises the generation of the final LV, Myo, and RV segmentations by 3D-ARU, overcoming the problem of fuzzy edges due to heart movement. The details of the proposed segmentation framework are shown in Fig. 2.

ROI localization
For the first stage of the proposed framework, SqueezeDeepLabv3+ is proposed to extract the initial contours of the LV, Myo, and RV. A relatively small region of interest (ROI) that includes the LV, Myo, and RV is extracted. This step reduces the scope of each volume by removing background regions that could impede the segmentation model's learning. It also reduces the computations performed by the proposed framework by reducing the slice size, as it focuses on the ROI only. Another advantage is the alleviation of pixel class imbalance, a prevalent issue in medical image processing [25]. In the ROI localization step, each volume is input to SqueezeDeepLabv3+, which is based on the DeepLabv3+ [21] semantic segmentation network with its encoder-decoder structure. SqueezeDeepLabv3+ generates masks that are used as a guide to locate the most appropriate segments for the ROI. The details of the architecture are described in more depth below.
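The mask-guided cropping described above can be sketched as follows. This is a minimal 2D sketch; the function name `crop_roi` and the margin size are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def crop_roi(image, mask, margin=8):
    """Crop `image` to the bounding box of the predicted binary `mask`,
    padded by `margin` pixels and clipped to the image borders."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:  # no foreground predicted: keep the full slice
        return image, (0, image.shape[0], 0, image.shape[1])
    y0 = max(ys.min() - margin, 0)
    y1 = min(ys.max() + 1 + margin, image.shape[0])
    x0 = max(xs.min() - margin, 0)
    x1 = min(xs.max() + 1 + margin, image.shape[1])
    return image[y0:y1, x0:x1], (y0, y1, x0, x1)
```

The recorded bounding box allows the final segmentation to be pasted back into the original slice coordinates after the second stage.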
SqueezeDeepLabv3+ enriches the encoder by incorporating SqueezeNet to capture essential information from the image, as shown in Fig. 3. To detect small objects with a limited number of parameters, the proposed architecture's encoder employs SqueezeNet rather than the Xception backbone of the original DeepLabv3+. Han et al. [22] proposed SqueezeNet, a lightweight and efficient CNN model. It has fewer parameters than Xception, with single-model accuracy comparable to Xception. SqueezeNet is primarily optimized and compressed through CNN microstructure optimization: it employs many small 1 × 1 convolution kernels in place of 3 × 3 convolution kernels to optimize the design of a single convolution layer, resulting in a ninefold reduction in parameter count. It also employs CNN macrostructure optimization by reducing the 3 × 3 convolution kernel's input channel count and convolution kernel parameters, splitting the convolution layer into a squeeze layer and an expand layer, and encapsulating them in the fire module. The fire module is the basic unit of the SqueezeNet network and uses modular convolution. It primarily consists of two layers of convolution operations, each of which connects to a ReLU activation layer: the squeeze layer, which contains only 1 × 1 convolution kernels, and the expand layer, with 1 × 1 and 3 × 3 convolution kernels. The SqueezeNet model consists of nine fire modules, with three max-pooling layers interspersed throughout. Furthermore, it enlarges the convolution layers' receptive field.
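The fire module described above can be sketched in PyTorch. The channel counts are illustrative; the structure (a 1 × 1 squeeze layer feeding parallel 1 × 1 and 3 × 3 expand layers, each followed by ReLU, with the expand outputs concatenated) follows the description in the text:

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """SqueezeNet fire module: a 1x1 'squeeze' convolution followed by
    parallel 1x1 and 3x3 'expand' convolutions whose outputs are
    concatenated along the channel axis. Each conv feeds a ReLU."""
    def __init__(self, in_ch, squeeze_ch, expand_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        s = self.relu(self.squeeze(x))
        return torch.cat([self.relu(self.expand1x1(s)),
                          self.relu(self.expand3x3(s))], dim=1)
```

The squeeze layer reduces the channel count before the more expensive 3 × 3 kernels are applied, which is where the parameter savings over a plain 3 × 3 layer come from.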
The high-level semantic characteristics are then merged by an atrous spatial pyramid pooling (ASPP) module to better capture the overall semantic information of the image before the low-level features of the backbone network are fed into the decoder. The ASPP technique was inspired by the success of atrous convolution operations and spatial pyramid pooling (SPP) [19]. ASPP resamples the feature maps produced by the encoder at various atrous rates. The results of applying parallel convolution filters to the feature maps at various atrous rates are then concatenated in order to precisely and efficiently capture large multiscale information, as shown in Fig. 3. In this study, the ASPP module comprises a 1 × 1 convolution followed by 3 × 3 convolutions with different dilation rates and a max-pooling layer in parallel. The suitable dilation rates for the problem under study are determined experimentally and found to be d = 4, 8, and 12. Depth-wise convolution is used rather than standard convolution to segment biventricular structures of different densities and sizes with high sensitivity.
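An ASPP head matching this description could be sketched as follows. This is an assumption-laden sketch: the dilation rates 4, 8, and 12 and the depth-wise 3 × 3 branches come from the text, while the channel widths and the image-level average-pooling branch (the standard DeepLabv3+ choice; the text mentions a pooling layer) are illustrative:

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Parallel branches: a 1x1 conv, three depth-wise atrous 3x3 convs
    (each followed by a point-wise projection), and an image-level
    pooling branch; outputs are concatenated and projected."""
    def __init__(self, in_ch, out_ch, rates=(4, 8, 12)):
        super().__init__()
        self.conv1x1 = nn.Conv2d(in_ch, out_ch, 1)
        self.branches = nn.ModuleList([
            nn.Sequential(
                # depth-wise atrous 3x3 (groups == channels) ...
                nn.Conv2d(in_ch, in_ch, 3, padding=r, dilation=r, groups=in_ch),
                # ... then a point-wise 1x1 projection
                nn.Conv2d(in_ch, out_ch, 1))
            for r in rates])
        self.pool = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(in_ch, out_ch, 1))
        self.project = nn.Conv2d(out_ch * (2 + len(rates)), out_ch, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [self.conv1x1(x)] + [b(x) for b in self.branches]
        pooled = nn.functional.interpolate(self.pool(x), size=(h, w),
                                           mode='bilinear', align_corners=False)
        return self.project(torch.cat(feats + [pooled], dim=1))
```

Padding each atrous branch by its dilation rate keeps all branch outputs the same spatial size, so they concatenate cleanly.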

Segmentation
In the second stage, the proposed 3D multiple-attention ResUNet is used to segment the three cardiac structures (LV, Myo, and RV) from the slices localized by SqueezeDeepLabv3+. Because the LV, Myo, and RV have distinct characteristics, primarily in terms of shape and size, the ROI localization step is able to extract the area where all three structures are located. However, it occasionally fails to capture each shape, particularly in the ES cardiac phase. To improve the segmentation process and contour each of the three structures (LV, RV, and Myo), only the extracted ROI portion of the original slice is sent to 3D-ARU in this phase.
The proposed 3D-ARU architecture, as illustrated in Fig. 4, integrates both the spatial attention mechanism and the residual module with full pre-activation. The residual module improves the channel inter-dependencies while reducing the computational cost. It also facilitates network training. Furthermore, the rich skip connections in the ResUNet [26] contribute to a better flow of information between different layers, which enhances gradient flow during training. Due to these benefits, we use ResUNet as the foundational architecture. The encoder feature maps and the decoder feature maps are directly concatenated in the combined U-Net [30] and ResNet methods. Despite the effectiveness of ResUNet, the fuzzy boundaries in cardiac images present a challenge to the model. Therefore, the attention module is incorporated to allow focusing on the crucial regions of the feature maps.

Fig. 3 The proposed SqueezeDeepLabv3+ with SqueezeNet as the backbone to enrich the network encoder, with modified atrous rates to localize small objects like the RV in the ES phase
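A full pre-activation residual module for the 3D setting can be sketched as follows. This is a minimal sketch: the BN → ReLU → Conv ordering is the standard full pre-activation pattern, while the layer widths and the 1 × 1 × 1 projection on the identity path are illustrative assumptions:

```python
import torch
import torch.nn as nn

class PreActResBlock(nn.Module):
    """3D residual block with full pre-activation (BN -> ReLU -> Conv),
    the building unit assumed for the 3D-ARU encoder/decoder stages."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm3d(in_ch), nn.ReLU(inplace=True),
            nn.Conv3d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm3d(out_ch), nn.ReLU(inplace=True),
            nn.Conv3d(out_ch, out_ch, 3, padding=1))
        # 1x1x1 projection when channel counts differ on the identity path
        self.skip = (nn.Identity() if in_ch == out_ch
                     else nn.Conv3d(in_ch, out_ch, 1))

    def forward(self, x):
        return self.body(x) + self.skip(x)
```

Placing the normalization and activation before each convolution keeps the identity path clean, which is what eases gradient flow during training.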
Motivated by the success of the attention mechanism, we incorporate an attention block in the decoder portion of our architecture so that it can concentrate on the crucial regions of the feature maps. The attention mechanism narrows the focus to a subset of the input: it attends to a specific area of the image while ignoring the others [31], similar to human visual perception, which can focus on a specific point or area while suppressing the surrounding areas. By suppressing feature activations in irrelevant areas of the image, attention gates can reduce false positives [31]. In Fig. 5, the attention gate is placed where the skip connection connects the encoder to the associated decoder. Two inputs are provided to the attention gate. The first comes from the skip connection of the associated encoder and contains all the contextual and spatial information at that layer. The second input is the gating signal from the decoder layer beneath it; because it originates from a deeper part of the network, it has a better feature representation. The gate improves the learning of target regions relevant to the segmentation task while suppressing non-target regions. First, both inputs are passed through convolution operations and added. The first activation function, ReLU, is then applied, followed by another convolution operation. The output is resampled and passed through the second activation function, sigmoid, to obtain the attention map, after which the encoder feature is multiplied pixel by pixel by the attention map to obtain the output. Figure 5 depicts the structure of the attention gate. Figure 6 depicts sample slices and their ground truth together with the output of CAT-Seg. As shown in Fig. 6, the final segmentation phase identifies the contours of each of the three structures and solves the problem of fuzzy boundaries. It also excludes other cardiac subsections, as the attention module gives more attention to the boundaries and intensities of the three structures.
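The attention-gate computation described above (project both inputs, add, ReLU, reduce to a single channel, resample, sigmoid, then scale the encoder feature) can be sketched in PyTorch. The class name and channel sizes are illustrative, not the paper's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionGate(nn.Module):
    """Additive attention gate on a skip connection: encoder feature x and
    gating signal g are projected by 1x1x1 convs, summed, passed through
    ReLU, reduced to one channel, resampled to x's size, squashed by a
    sigmoid, and used to scale x pixel-wise."""
    def __init__(self, x_ch, g_ch, inter_ch):
        super().__init__()
        self.theta_x = nn.Conv3d(x_ch, inter_ch, 1)
        self.phi_g = nn.Conv3d(g_ch, inter_ch, 1)
        self.psi = nn.Conv3d(inter_ch, 1, 1)

    def forward(self, x, g):
        # the gating signal comes from a coarser decoder level, so it is
        # upsampled to match the skip-connection feature's spatial size
        g_proj = F.interpolate(self.phi_g(g), size=x.shape[2:],
                               mode='trilinear', align_corners=False)
        att = torch.sigmoid(self.psi(F.relu(self.theta_x(x) + g_proj)))
        return x * att  # attention map broadcasts over channels
```

In 3D-ARU this gated output would replace the plain encoder feature in the decoder's concatenation, so irrelevant regions contribute little to the final prediction.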

Training
Each model (SqueezeDeepLabv3+ and 3D-ARU) was trained for 100 epochs using the Adam optimizer with a learning rate of 10⁻³, a decay factor of 0.1 per epoch, and a weight decay (L2 regularization) of 1 × 10⁻⁴. The training set used in this case is composed of all classes of slices. The proposed 3D-ARU has 97,831,734 trainable parameters and the proposed SqueezeDeepLabv3+ has 7,051,556 trainable parameters.
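Under these stated hyperparameters, the training loop might be configured as follows. This is a sketch: `model` is a stand-in module, and `StepLR` is one way to realize a per-epoch decay factor of 0.1:

```python
import torch
import torch.nn as nn

model = nn.Conv3d(1, 4, 3, padding=1)  # stand-in for 3D-ARU / SqueezeDeepLabv3+

# Adam with lr = 1e-3 and weight decay (L2) = 1e-4, as stated in the text
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

# decay factor of 0.1 applied once per epoch
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1)

for epoch in range(100):
    # one pass over the training set would go here
    # (forward, loss, backward, optimizer.step())
    scheduler.step()
```

Note that a 0.1 decay every epoch shrinks the learning rate very aggressively; the schedule is reproduced here exactly as described in the text.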

Evaluation and statistical analysis
In biventricular segmentation from MRI, the region of interest (ROI), represented by true positives (TP), is very small compared to the entire slice, while true negatives represent the background. Therefore, it is necessary to focus on the Dice similarity coefficient (DSC) and intersection over union (IoU), which robustly and reliably reflect model performance [28]. These metrics evaluate the similarity between the proposed model's segmentation masks and the ground truth. In this study, the performance of the proposed CAT-Seg framework is evaluated in terms of the following metrics.
The Dice similarity coefficient (DSC) measures the overlap between the foreground pixels of the segmented image and the ground truth foreground pixel region. It is the metric most commonly used to gauge how effectively a medical image segmentation method works:

DSC = 2|R ∩ G| / (|R| + |G|) = 2TP / (2TP + FP + FN)    (1)

Another metric, the Intersection over Union (IoU), indicates the degree of overlap between the segmented image's foreground pixels and the ground truth foreground pixel region:

IoU = |R ∩ G| / |R ∪ G| = TP / (TP + FP + FN)    (2)

Here R indicates the predicted result and G indicates the ground truth. The true positives (TP) are the pixels correctly associated with the ROI; the false positives (FP) are the pixels indicated as ROI by the proposed model but as background by the ground truth; and the false negatives (FN) are the pixels associated with the ROI by the ground truth but missed by the proposed model. All these values are used to determine the DSC and IoU.
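Equations (1) and (2) reduce to simple counting over binary masks; a minimal sketch (the function name `dice_iou` is illustrative):

```python
import numpy as np

def dice_iou(pred, gt):
    """DSC and IoU for a binary prediction against a ground-truth mask,
    computed from the TP/FP/FN counts as in Eqs. (1) and (2)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()    # true positives
    fp = np.logical_and(pred, ~gt).sum()   # false positives
    fn = np.logical_and(~pred, gt).sum()   # false negatives
    dsc = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    return dsc, iou
```

For multi-class evaluation (LV, Myo, RV), this would be applied per class by binarizing the label map against each class index and averaging the results.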

Results
In this section, the performance of the proposed architectures is verified for single-stage and multi-stage segmentation.
The performance of the proposed architectures, SqueezeDeepLabv3+ and the 3D-ARU variants, is tested individually in single-stage segmentation. They are compared to the available architectures listed in Table 2. The architectures in Table 2 are chosen as the direct counterparts of the proposed models, as they can be considered components of the proposed architectures. The obtained results are shown in Table 3. The results validate the positive effect of the proposed modifications on the standard 3D-ResUNet and DeepLabv3+. As shown in Table 3, the proposed 3D-ARU improves the mean DSC of the ResUNet by 1.060%, attention UNet by 2.180%, and the original UNet by 3.405%. Moreover, the proposed 3D-ARU improves the mIoU of the ResUNet by 2.050%, attention UNet by 7.080%, and the original UNet by 13.815%. In addition, the proposed SqueezeDeepLabv3+ improves the mean DSC and mIoU of the original DeepLabv3+ by 1.235% and 6.180% respectively.
Figure 7 depicts sample segmentation results of the existing architectures and the two proposed variants, SqueezeDeepLabv3+ and 3D-ARU, to allow visual inspection. The ground truth shows that the thickness of the myocardium wall is uneven and that the edge contour of the biventricular is fuzzy and difficult to extract, along with irregularity in the biventricular shape. With the use of an attention mechanism, the proposed 3D-ARU model is able to extract the edge information effectively, and the reconstructed LV and Myo contours are significantly better than those of the UNet, attention UNet, and ResUNet models. This demonstrates that the incorporation of the attention mechanism solves the problem of fuzzy edges, but the problem of segmenting small objects such as the RV persists. In the bottom row, the role of the modified SqueezeDeepLabv3+ with different atrous rates in detecting small objects such as the RV is elucidated. DeepLabv3+ misses some tissues, such as Myo and LV, due to its larger atrous rate. Moreover, ResUNet is unable to segment Myo and RV due to fuzzy boundaries. In addition, UNet is able to segment Myo and LV, but with an enlarged LV and a thinner Myo contour. 3D-ARU solves some of the problems of UNet, attention UNet, and ResUNet, such as fuzzy boundaries, but fails to extract the RV. Hence, 3D-ARU and SqueezeDeepLabv3+ complement each other, so a two-stage segmentation model would be expected to yield better results. The CAT-Seg output is shown in the proposed framework column, which depicts the favorable effect of their combination.

Table 2 Model versions for the ablation experiment

Method: Description
DeepLabv3+ [27]: The original DeepLabv3+
UNet [28]: The original four-layer UNet
UNet + Attention Mechanism [29]: UNet with a spatial attention mechanism
ResUNet [26]: The original ResUNet
3D-ARU: ResUNet incorporating an attention mechanism
SqueezeDeepLabv3+: DeepLabv3+ with a SqueezeNet backbone and modified atrous rates

In the following, the effectiveness of CAT-Seg is experimentally verified against various two-stage segmentation approaches. The ROI localization is performed by either 3D-ARU or SqueezeDeepLabv3+, followed by fine segmentation. The localized ROIs are input to four architectures for segmentation, namely 3D-UNet, attention 3D-UNet, 3D-ARU, and SqueezeDeepLabv3+. 3D-UNet and attention 3D-UNet are selected for this experiment as they are frequently used in similar studies [14, 17-20, 30, 31]. All sets comprise the volumes of the same patients.
Table 4 presents the segmentation results (DSC and IoU) of the different combinations for multistage ROI extraction and segmentation. First, 3D object detector frameworks, namely Mask R-CNN [27] and Retina U-Net [28], are deployed to automatically detect a bounding box encompassing the heart in CMRI. The detected bounding box is then used to crop the full images, and object detection performance is contrasted with the multistage segmentation. Mask R-CNN is an extension of the Faster R-CNN [29] architecture that adds a branch for predicting object masks in parallel with the existing branch for bounding box recognition. This allows it to provide more precise object localization and instance segmentation. Retina U-Net 3D is a 3D extension of the RetinaNet architecture designed for volumetric medical image analysis. It uses a U-Net-like architecture with a feature pyramid network to detect 3D objects in medical images. CAT-Seg outperforms the use of Mask R-CNN as a 3D detection framework instead of SqueezeDeepLabv3+ in segmenting LV and Myo by 0.8909% and 0.3526% respectively. It also outperforms the combination of Mask R-CNN with SqueezeDeepLabv3+ in segmenting LV, Myo, and RV by 0.9775%, 0.8515%, and 0.558% respectively. Although using Mask R-CNN instead of SqueezeDeepLabv3+ outperforms the CAT-Seg framework in segmenting RV by 0.0528%, it increases the testing time by 0.4210%. Moreover, the CAT-Seg framework outperforms the combination of Retina U-Net with 3D-ARU in segmenting all the substructures. For localization, cascading two consecutive 3D-ARUs presents a higher DSC when segmenting Myo and RV in the ES phase; however, the differences compared to CAT-Seg are limited to 0.24% and 0.04% for Myo and RV respectively, and the cascaded 3D-ARU testing time is 2.4 × that of the proposed CAT-Seg. In addition, the testing time of using 3D-ARU for localization and then segmenting with SqueezeDeepLabv3+ is 1.2368 × that of CAT-Seg. CAT-Seg outperforms the cascaded SqueezeDeepLabv3+ by 0.11% and 0.46% in terms of mean DSC and mIoU respectively. The proposed CAT-Seg presents a performance gap of 4.87% and 15.78% in terms of mean DSC and mIoU respectively compared to using 3D-ARU for localization and UNet for segmentation. Although the combination of SqueezeDeepLabv3+ for localization and UNet for segmentation has the lowest testing time, CAT-Seg outperforms it by 4.88% and 15.8% in terms of mean DSC and mIoU respectively. Moreover, CAT-Seg has approximately the same testing time as the combination of SqueezeDeepLabv3+ and attention UNet, but draws a performance gap of 4.29% and 9.66% in terms of mean DSC and mIoU respectively. While the testing time of the cascaded SqueezeDeepLabv3+ is 0.9210 × that of CAT-Seg, the mean DSC and mIoU of CAT-Seg are 3.29% and 2.22% better than those of the cascaded SqueezeDeepLabv3+. Therefore, CAT-Seg is elected as the proposed model rather than any other cascaded approach.
Figure 8 shows the training and validation learning curves for both cardiac phases (ES and ED) using CAT-Seg.It demonstrates that both cardiac cycles have a similar trend in the training and validation stage with small performance gap diminishing the possibility of overfitting.
In addition, to make full use of the limited training data and to show the performance stability and robustness, the training and testing sets were combined to apply fivefold cross-validation, where each fold consists of 30 patients, with 6 patients from each pathology. The experimental results show that the DSC and IoU of the segmentation results for the biventricular regions on the test set increase significantly when cross-validation is used for both stages of the CAT-Seg framework and the overall pipeline. Table 5 illustrates the improvement for each of the cardiac structures when fivefold cross-validation is applied.
Another aspect is investigated to show the stability of CAT-Seg's performance: the mean and range of the results are shown as boxplots in Fig. 9. They demonstrate that the range of segmentation results in terms of both DSC and IoU is compact and consistent for all three substructures. In Fig. 9a, the segmentation results on ACDC 2017 are presented. The LV segmentation results show that the DSC results are symmetric in both cardiac cycles. The LV segmentation results are also symmetric in terms of IoU in the ES phase, but negatively skewed in the ED phase. Moreover, for both cardiac phases, the myocardium shows a positive skew in the DSC results but a negative skew in the IoU results. Additionally, the RV shows a spread in both cardiac phases, but most of the results are symmetric; its segmentation results are more consistent in terms of IoU than DSC. It is notable that the results in all cases are consistent, with no outliers shown. The mean IoU in the ED cardiac phase is 0.8946 ± 0.0190, and the IoU results for segmenting the RV are 0.870285 ± 0.041033 and 0.817455 ± 0.055544 respectively. Figure 10 depicts the importance of the localization phase, as it compares using 3D-ARU to segment different types of slices in terms of mean DSC and mIoU. First, it uses the full slice without any localization or annotation, which results in relatively low segmentation quality due to the complex structure of the cardiac MRI and surrounding objects. Then, manually cropped slices were extracted as 128 × 128 blocks taken from the center, following the standard used in the literature [14, 16]. These slices are input to the proposed 3D-ARU model, but this also yields a low segmentation evaluation. Moreover, the cascaded 3D-ARU and the proposed model compete in the segmentation evaluation, as both show approximately the same results in terms of mean DSC and mIoU. However, the proposed model takes roughly less than half of the testing time of the cascaded 3D-ARU.

Discussion
The performance of CAT-Seg is compared to existing approaches on the ACDC and MyoPS 2020 datasets for further validation. The comparison of biventricular segmentation results on the ACDC dataset is shown in Table 6. CAT-Seg significantly outperformed all other methods in terms of DSC and IoU on the ACDC test dataset. Since most of the state-of-the-art methods used DSC to evaluate the segmentation results, Table 6 details the evaluation comparison in terms of DSC. It is worth noting that the segmentation effect is particularly good for the more difficult ES phase of the heart. CAT-Seg is able to segment LV, Myo, and RV with an average minimum performance gap of 1.165%, 4.36%, and 3.115% respectively, while the average maximum improvement in segmenting LV, Myo, and RV is 4.395%, 6.84%, and 7.315% respectively. The proposed model outperforms Li et al. [30] in LV, Myo, and RV segmentation by 0.32%, 6.40%, and 1.15% respectively in the ED cardiac phase. In the ES cardiac phase, compared to Li et al. [30], the proposed model shows outstanding performance in segmenting LV, Myo, and RV, with performance gaps of around 3.87%, 4.28%, and 5.08%. Furthermore, the proposed model is able to segment LV with a DSC that is 1.295% higher than that of Yang et al. [13] and RV with a DSC that is 4.065% higher than that of the Yang et al. [13] model; the improvement in segmenting Myo is 4.36% in DSC compared to the Yang et al. [13] model. Moreover, CAT-Seg outperforms the model of Silva et al. [32] in segmenting the three substructures in both the ED and ES phases. It segments LV with a DSC that is 1.3% and 3.5% higher than that of the Silva et al. [32] model in the ED and ES phases respectively; the improvement in Myo segmentation in DSC is 6.38% for ED and 6.57% for ES; and it segments RV with a DSC that is 2.58% and 8.65% higher in the ED and ES phases respectively.

The performance of CAT-Seg is also compared to existing approaches on the MyoPS dataset for further validation. The comparison of biventricular segmentation results is shown in Table 7. CAT-Seg significantly outperformed all other methods in terms of DSC on the MyoPS test dataset. CAT-Seg is able to segment LV, Myo, and RV with an average minimum performance gap of 6.13%, 5.44%, and 2.912% respectively, while the average maximum improvement in segmenting LV, Myo, and RV is 14.26%, 10.37%, and 8.544% respectively. It is worth emphasizing that the results shown in Table 7 for CAT-Seg were obtained without training on the training set of MyoPS 2020, and they still surpass the performance of the state of the art. This elucidates the generalization and robustness of the framework.
CAT-Seg attempts to balance the number of parameters against the accuracy. The proposed SqueezeDeepLabv3+ uses SqueezeNet, a lightweight and efficient CNN model with fewer parameters than Xception, so SqueezeDeepLabv3+ decreases the number of parameters by 40.1173% while improving the accuracy by 1.3623% over the original DeepLabv3+. The proposed 3D-ARU increases the number of parameters by 23.9719% over the original ResUNet, but improves the accuracy by 1.1615% compared to the original architecture. Thus, the CAT-Seg framework trades off the parameter count: SqueezeNet decreases the number of parameters, while the attention mechanism improves the accuracy at the cost of additional parameters.

Conclusion
In this study, a fully automatic multi-stage segmentation framework, CAT-Seg, is proposed. The framework is composed of two proposed architectures. In the first, the ROI is localized by the modified variant SqueezeDeepLabv3+ to minimize processing and address the issue of pixel class imbalance. The proposed SqueezeDeepLabv3+ architecture uses SqueezeNet to enrich the encoder path and modifies the atrous rates to localize small structures like the RV in ES. The second step involves submitting the ROI to 3D-ARU for segmentation. The proposed 3D-ARU uses a ResUNet incorporating a spatial attention mechanism.
The results of the experiments show that the proposed method produces a mean DSC of 0.9595 in ED and 0.9541 in ES. In comparison to the single-stage segmentation process, the division into steps performed better. This is supported by the evaluation on the ACDC 2017 test dataset, where the proposed method achieves higher performance than state-of-the-art segmentation approaches. CAT-Seg achieved an average maximum improvement in segmenting LV, Myo, and RV of 4.395%, 6.84%, and 7.315% respectively. Similar results are achieved when it is applied to the test set of MyoPS 2020 only, producing a mean DSC of 0.9163 and an mIoU of 0.8581. In conclusion, CAT-Seg offers a useful assistive tool to aid the early detection and treatment planning of cardiovascular diseases, which is critical for a better prognosis. For future work, this study can be extended and applied to 3D medical image augmentation, which can address the limitation of small datasets and reflect changes in more samples.

Fig. 1 Samples from ACDC Dataset during End-diastolic and End-systolic for the four different pathologies and normal heart (LV: Green, Myo: Blue, and RV: Red)

Fig. 5
Fig. 5 Structure of attention mechanism

Fig. 6
Fig. 6 CAT-Seg final segmentation results, where the RV is marked in blue, the LV in yellow, and the myocardium in green, showing that the segmentation results on ROI-localized images solve the problem by removing the noisy regions that have the same intensity

Fig. 7
Fig. 7 The effect of the 3D-ARU model in terms of fitting the shape of the cardiac substructures (LV in yellow, Myo in green, and RV in blue). From left to right, the images are the original cardiac MRI

Fig. 8
Fig. 8 CAT-Seg DSC accuracy and loss during the training and validation process of segmenting the cardiac biventricular during both cardiac phases: ED in (a) and ES in (b)

Fig. 9
Fig. 9 Box plots of the CAT-Seg framework results in terms of DSC and IoU a on ACDC dataset for the three cardiac substructures (LV, Myo, and RV) and the mean IoU and DSC in both cardiac phases (ED and ES), b on MyoPs 2020 dataset for external validation

Fig. 10
Fig. 10 Mean DSC and mIoU comparison on the ACDC dataset between the whole slice and the three types of localized slices

Table 1
Previous approaches to address biventricular cardiac segmentation

Table 5
Evaluation of the CAT-Seg Framework and each stage separately in terms of DSC, IoU for fivefold cross-validation on ACDC dataset

Table 6
Comparison with state-of-the-art cardiac segmentation methods on the ACDC dataset in segmenting LV, Myo, and RV in terms of DSC for both cardiac phases

Although the proposed model shows a lower average DSC for LV in ED, it draws an average improvement of 4.5316% in segmenting the three cardiac substructures in the ES cardiac phase. Moreover, the proposed model shows outstanding performance in segmenting Myo and RV in the ES cardiac phase, reflecting its strength in solving the mentioned challenge of ES segmentation, especially for the RV.