1 Introduction

Atrial fibrillation (AF), the most common cardiac arrhythmia, accounts for almost half of all diagnosed arrhythmia cases and causes lasting damage to the heart and body [1, 2]. In 2021 alone, atrial fibrillation and flutter caused 366,000 cardiovascular deaths worldwide, and the cumulative death toll attributed to these conditions had reached 8,200,000 by that year [3]. Recurrent atrial attacks cause various problems, such as enlargement of the atrial structure, myofiber changes, and fibrosis [4, 5]. Therefore, it is crucial to precisely segment the atrium in AF cases so that it can be monitored [6]. Late gadolinium enhancement (LGE) MRI is one of the best methods for distinguishing scarred from unscarred atrial walls. It is also well suited to measuring the extent of fibrotic tissue when examining atrial scar formation after ablation [7,8,9].

Manual segmentation of LA walls from LGE MRI is time-consuming and tedious [8, 10,11,12], and segmentation results vary considerably among experts. Deep learning-based segmentation models have therefore become increasingly important for eliminating these difficulties and supporting clinicians' decision-making [13]. However, many deep learning segmentation models add extra modules to maximize performance on a given dataset, which inflates the parameter count; the resulting models can exceed the computational capacity of the available hardware. In this study, standard convolution and depthwise separable convolution are used together within the layers to reduce the model's parameter count, and the experiments show that this layer-based hybrid structure reduces it significantly. A further issue in medical image segmentation is that organ boundaries are captured in the early layers of the architecture but the boundary features fade in the deeper layers. In organ segmentation, robustly delineating the organ's borders matters more than segmenting its interior in a purely sequential fashion. The proposed study uses the bottleneck attention module (BAM) to prevent the loss of boundary features [14]. The proposed architecture is obtained by modifying the V-Net architecture of Milletari et al. [15].

The contributions of the proposed methodology to the literature are as follows:

  • The second standard convolution layer was replaced with a depthwise separable convolution to ease computational limitations by reducing the network's parameter count and to learn feature representations better [16, 17].

  • In the encoder layers of the deep learning architecture, BAM is used to strengthen the network's learning by emphasizing high-level features and suppressing low-level features after the convolution operations.

  • A fusion loss function increased the network’s test accuracy in LA segmentation from LGE MRI.

  • In addition, based on experimental studies, PReLU activation functions were used in the convolution layers and ReLU activation functions in the BAM layer [18, 19].

Sample MRI slices from the STACOM 2018 dataset used to train the proposed architecture are shown in Fig. 1, together with the ground truths of the axial images.

Fig. 1 Illustration of the challenges in STACOM 2018. The first row shows slices of the original MR images, and the second row shows the corresponding ground truths

The remainder of the paper is organized as follows. Section 2 briefly reviews state-of-the-art approaches to deep learning-based segmentation of atrium images. Section 3 describes the methodology. Section 4 presents the dataset used for training, the performance metrics, and the training parameters. Section 5 presents and discusses the experimental analyses. Finally, Sect. 6 evaluates the model's performance and outlines future work.

2 Related Works

One of the most critical problems in deep learning-based organ segmentation is that organ boundaries are captured in the first layers of the network, while the later layers learn them less effectively. Recent studies in the literature attempt to solve this problem by adding various attention modules, especially at the downsampling stages.

The human visual system performs a series of visual processing steps to capture the important features in an image, focusing on the regions where the essential features lie [20, 21]. Attention modules have been developed for deep learning architectures to mimic this behavior and increase segmentation performance; they focus on high-level features and filter out low-value ones [22,23,24,25]. There are many examples of attention modules in deep learning-based organ segmentation. For instance, Vernikouskaya et al. proposed a U-Net-based fully automatic segmentation model with a multi-stage pipeline for detecting arrhythmia from cardiac magnetic resonance images (CMRI) [26]. Jabdaragh et al. segmented the left atrium with the multi-task fractal dimension (MTFD-Net) architecture, which combines fractal geometry with a multi-task network and aims to increase segmentation performance by mapping images into fractal dimensions [27]. Zhou et al. placed a cross-modal attention module between the encoder and decoder layers for cardiac segmentation, enabling the network to better learn interrelated high-level information [28]. Uslu et al. proposed the multi-task LA-Net, which simultaneously produces segmentation and edge masks of the left atrium from MRI to detect atrial fibrillation; a combination of cross-attention modules (CAM) and enhanced decoder modules (EDM) incorporates boundary information into the model [29]. Zhang et al. developed three attention modules, spatial, channel, and regional, for ventricle segmentation [30]. Zhao et al., in segmenting the left atrium, focused on tissue boundaries and the tissue region, using attention modules based on the ResNet-101 architecture in the model's layers together with a hybrid of regional and boundary loss functions [31]. Li et al. proposed a U-Net architecture with hierarchical aggregation and attention modules for precise segmentation of the LA [32].

Elsewhere in the left atrium segmentation literature, Chen et al. proposed a fully automatic segmentation model based on a deep U-Net architecture obtained by modifying U-Net [33]. Yang et al. used transfer learning and a deep supervision strategy to focus on spatial dependence in the segmentation region of left atrium images [34]. Uslu and Bharath proposed a segmentation model with a quality control system that uses a single encoder and three decoders and predicts the run-time quality of the segmentation masks [35]. Yang et al. proposed an atlas-based end-to-end segmentation model for cardiac images obtained from LGE MRI [36]. Tao et al. proposed a fully automated deep learning-based model for the segmentation of the LA and PV [37]. Xiong et al. proposed the AtriaNet architecture, a multi-scale dual-pathway 2-D CNN model for LA segmentation [6]. Puybareau et al. proposed a VGG-Net-based transfer learning model for LA segmentation [38].

Many semi-supervised learning studies in the literature exploit the unlabeled data in medical datasets to train deep learning architectures. One recent example is the CA-Net architecture proposed by Zhao et al. [39], in which a Trans V module is added to the V-Net architecture to learn contextual information. Studies in the literature mainly focus on training and testing 2D and 3D deep convolutional neural networks on datasets of LGE MRI images. Luo et al. proposed a U-Net-based semi-supervised uncertainty rectified pyramid consistency (URPC) model for medical image segmentation [40]. Li et al. added a signed distance map of object surfaces (SDM) module to a semi-supervised V-Net backbone to exploit abundant unlabeled data when segmenting atrium images [41]. Wang et al. analyzed a semi-supervised dual-consistency network (DC-Net) on 3D atrium images to achieve high performance on datasets with limited labeled data [42]. Luo et al. proposed a new semi-supervised pixel-based dual-task consistency learning strategy (DTCV-Net) that learns from unlabeled data [43]. The V-Net architecture constitutes the backbone of the proposed architecture. The major challenge in most of these studies is to segment the LA boundaries and region accurately and robustly. This study proposes a fully automatic pipeline that segments the LA region and its boundaries more precisely. In addition, it presents a layer-based hybrid convolution that significantly reduces the parameter count, addressing computational limitations. The proposed approach's architecture and performance analysis are explained in detail in the remainder of the study.

3 Methodology

The proposed model is an encoder–decoder-based, fully automatic, V-shaped segmentation architecture that combines standard convolution, depthwise separable convolution, and a BAM module.

3.1 Depthwise Separable Convolution

Figure 2 shows the block diagram of the depthwise separable convolution. While standard convolution performs channel and spatial computations in one step, depthwise separable convolution consists of two parts: depthwise convolution and pointwise convolution. Depthwise convolution applies a separate filter to each input channel, while pointwise convolution forms a linear combination of these outputs and feeds it to the BAM module. Using depthwise separable convolution instead of standard convolution in the second convolution layer lets the proposed model use approximately 20 times fewer parameters, and it significantly reduces the computational cost. However, applying depthwise separable convolution in all layers reduces training accuracy in architectures whose parameter count is already very low. The computational costs in Eq. 1 (depthwise separable) and Eq. 2 (standard convolution) make the difference clear. In Eqs. 1 and 2, M is the number of input channels, N is the number of filters, Dp is the spatial size of the output feature map, Dk is the kernel size of the depthwise convolution, and Dg is the kernel size of the standard convolution.

$$ M \times D_{p}^{2} \times \left( D_{k}^{2} + N \right) $$
(1)
$$ N \times D_{p}^{2} \times D_{g}^{2} \times M $$
(2)
Fig. 2 Depthwise separable convolution layers' architecture
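As a concrete illustration, the following PyTorch sketch implements a 3D depthwise separable convolution of the kind described above and compares its parameter count with a standard 3D convolution. The class name and the 128-channel example are illustrative assumptions, not taken from the paper's released code.

```python
# A minimal sketch of a 3D depthwise separable convolution following the
# two-stage structure in Fig. 2; names and channel sizes are illustrative.
import torch
import torch.nn as nn

class DepthwiseSeparableConv3d(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        # Depthwise: one 3x3x3 filter per input channel (groups=in_channels).
        self.depthwise = nn.Conv3d(in_channels, in_channels, kernel_size=3,
                                   padding=1, groups=in_channels, bias=False)
        # Pointwise: 1x1x1 convolution forms linear combinations across channels.
        self.pointwise = nn.Conv3d(in_channels, out_channels, kernel_size=1,
                                   bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

standard = nn.Conv3d(128, 128, kernel_size=3, padding=1, bias=False)
separable = DepthwiseSeparableConv3d(128, 128)
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard), count(separable))  # 442368 19840
```

For 128 input and output channels with 3 × 3 × 3 kernels, the standard convolution uses 27 × 128 × 128 = 442,368 weights, while the separable version uses 27 × 128 + 128 × 128 = 19,840, roughly a 22-fold reduction, consistent with the approximately 20-fold figure reported above.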

3.2 Bottleneck attention module (BAM)

The BAM used in the last layer of each downsampling block of the proposed architecture is shown in Fig. 3. For a feature map F ∈ R^{C×H×W} from the depthwise separable convolution layer, BAM infers a 3D attention map M(F) ∈ R^{C×H×W}. The refined feature map F′ is then calculated as in Eq. 3.

$$ F^{\prime} = F + F \otimes M(F) $$
(3)

where ⊗ denotes element-wise multiplication. In the BAM architecture, the attention mechanism and the learning block are used together to facilitate gradient flow. First, the channel attention Mc(F) ∈ R^C and spatial attention Ms(F) ∈ R^{H×W} are calculated to produce an efficient yet powerful module. Then, as in Eq. 4, the attention map M(F) is obtained from the sum of the channel and spatial attention values, where σ is the sigmoid function.

$$ M(F) = \sigma (M_{c} (F) + M_{s} (F)) $$
(4)
Fig. 3 BAM layers' architecture
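The following minimal 3D sketch follows Eqs. 3 and 4: channel attention from a pooled bottleneck MLP and spatial attention from dilated convolutions are summed, passed through a sigmoid, and applied residually. The reduction ratio and dilation values are defaults from the original BAM paper and are assumptions here, not values confirmed by this study.

```python
# A minimal 3D BAM sketch implementing F' = F + F (x) sigmoid(Mc(F) + Ms(F)).
import torch
import torch.nn as nn

class BAM3d(nn.Module):
    def __init__(self, channels, reduction=16, dilation=4):
        super().__init__()
        # Channel attention Mc(F): global average pool -> bottleneck MLP.
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),
            nn.Conv3d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),  # ReLU worked best inside BAM (Sect. 3.3)
            nn.Conv3d(channels // reduction, channels, kernel_size=1),
        )
        # Spatial attention Ms(F): channel reduction + dilated 3x3x3 convs.
        self.spatial_att = nn.Sequential(
            nn.Conv3d(channels, channels // reduction, kernel_size=1),
            nn.Conv3d(channels // reduction, channels // reduction,
                      kernel_size=3, padding=dilation, dilation=dilation),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels // reduction, 1, kernel_size=1),
        )

    def forward(self, f):
        # The (B,C,1,1,1) channel map and (B,1,D,H,W) spatial map broadcast
        # to the full feature-map shape before the sigmoid, as in Eq. 4.
        m = torch.sigmoid(self.channel_att(f) + self.spatial_att(f))
        return f + f * m  # residual refinement, Eq. 3
```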

3.3 DSBAV-Net

The proposed DSBAV-Net architecture is a V-shaped deep learning network consisting of an encoder and a decoder; its structure is shown in Fig. 4. Compared with a baseline V-Net, in each convolutional block the second standard convolution layer is removed and, after the standard convolution layer with 2 × 2 × 2 filters, a depthwise separable convolution layer is added, with 3 × 3 × 3 filters for the depthwise convolution and 1 × 1 × 1 filters for the pointwise convolution, so that features are extracted both spatially and in depth and the network's performance increases. The depthwise separable convolution also reduces the network's parameter count, helping to overcome computational limitations. In addition, adding BAM to the last layer of each encoder block increases segmentation performance by suppressing unwanted features and highlighting the high-level features in each layer's feature maps. A 5 × 5 × 5 convolution is used in the input layer of the proposed model. Only standard and depthwise separable convolution layers are used in the decoder; each block uses 2 × 2 × 2 filters for the standard convolution and, as seen in Fig. 2, 3 × 3 × 3 filters for the depthwise convolution with 1 × 1 × 1 filters for the pointwise convolution. The ReLU activation function performed better within BAM, while PReLU showed higher performance in the other convolutional layers. A fivefold cross-validation procedure was used while designing the layers of the proposed architecture.
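The sketch below assembles one illustrative encoder block from the pieces above, reusing the DepthwiseSeparableConv3d and BAM3d sketches from Sects. 3.1 and 3.2: a standard 2 × 2 × 2 convolution with PReLU, the depthwise separable convolution that replaces the second standard layer, and BAM at the end of the block. The exact block layout is our reading of Fig. 4, and the padding/stride bookkeeping for downsampling is omitted.

```python
# An illustrative DSBAV-Net encoder block; not the authors' released code.
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.block = nn.Sequential(
            # Standard convolution with 2x2x2 filters, as stated in Sect. 3.3
            # (spatial size bookkeeping omitted in this sketch).
            nn.Conv3d(in_channels, out_channels, kernel_size=2),
            nn.PReLU(out_channels),  # PReLU in the convolution layers
            # Depthwise separable convolution replacing the second standard
            # layer (3x3x3 depthwise + 1x1x1 pointwise, see Sect. 3.1).
            DepthwiseSeparableConv3d(out_channels, out_channels),
            nn.PReLU(out_channels),
            BAM3d(out_channels),  # attention at the end of the encoder block
        )

    def forward(self, x):
        return self.block(x)
```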

Fig. 4 Proposed architecture

3.4 Cross-Entropy Dice Fusion Loss Function

A fusion loss combining categorical cross-entropy (CE) and dice loss is proposed to compute the loss of the methodology in this study. Dice loss is 1 − dice score; the dice loss and CE loss functions are given in Eqs. 6 and 8. To fuse the two losses, loss1 is obtained by multiplying the dice loss by a coefficient α, loss2 by multiplying the CE loss by 1 − α, and the total loss is the sum of these two values. Experimental studies determined the ideal α value to be 0.5.

$$ {\text{Dice}} = \frac{{2*{\text{TP}}}}{{2*{\text{TP}} + {\text{FP}} + {\text{FN}}}} $$
(5)
$$ {\text{Dice loss}} = 1 - {\text{Dice}} $$
(6)
$$ {\text{Loss}}1 = \alpha *{\text{Dice loss}} $$
(7)

True positive (TP) denotes lesion pixels predicted correctly, false positive (FP) denotes background pixels incorrectly predicted as lesion, and false negative (FN) denotes lesion pixels incorrectly predicted as background.

In Eq. 8, s_p is the score of the positive class, s_j is the score of class j, and C is the number of classes.

$$ {\text{CE}} = - \log \left( \frac{e^{s_{p}}}{\mathop \sum \nolimits_{j}^{C} e^{s_{j}}} \right) $$
(8)
$$ {\text{Loss}}2 = \left( {1 - \alpha } \right)*{\text{CE}} $$
(9)
$$ {\text{Loss}} = {\text{loss}}1 + {\text{loss}}2 $$
(10)
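A minimal PyTorch sketch of the fusion loss in Eqs. 5–10 follows, assuming binary foreground/background segmentation with the tensor shapes noted in the docstring; the function name and shapes are our assumptions.

```python
# A sketch of the cross-entropy dice fusion loss, Eqs. 5-10, with alpha = 0.5.
import torch
import torch.nn.functional as F

def fusion_loss(logits, target, alpha=0.5, eps=1e-6):
    """logits: (B, 2, D, H, W) raw class scores; target: (B, D, H, W) labels."""
    # Loss2 = (1 - alpha) * CE, Eqs. 8-9.
    ce = F.cross_entropy(logits, target)
    # Dice on the foreground probability map, Eqs. 5-6.
    prob = torch.softmax(logits, dim=1)[:, 1]
    tgt = (target == 1).float()
    inter = (prob * tgt).sum()
    dice = (2 * inter + eps) / (prob.sum() + tgt.sum() + eps)
    dice_loss = 1 - dice
    # Loss = loss1 + loss2 = alpha * dice_loss + (1 - alpha) * CE, Eq. 10.
    return alpha * dice_loss + (1 - alpha) * ce
```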

4 Materials

4.1 Preparing the Dataset

The segmentation performance of DSBAV-Net has been analyzed on the STACOM 2018 dataset [44], which includes 154 late gadolinium-enhanced (LGE) MRI-based AF images with an isotropic resolution of 0.625 mm × 0.625 mm × 0.625 mm. Of the 154 images, the ground truths of only 100 volumes have been shared publicly. For performance testing of the proposed methodology, the first 60 of the 100 volumes are allocated for training, 20 for testing, and the remaining 20 for validation.

The volumes in the dataset are resized to 112 × 112 × 80. The STACOM 2018 dataset contains data from different imaging centers. Segmentation masks include the LA region, mitral valve, LA appendage, and parts of the pulmonary vessels. The numbers of low- and high-quality images in the dataset are almost equal.
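A hedged preprocessing sketch of the 60/20/20 split and the resize to 112 × 112 × 80 is given below; the case identifiers and the use of scipy for interpolation are assumptions, since the paper does not describe its preprocessing code.

```python
# Illustrative preprocessing: fixed-shape resize and the 60/20/20 split.
import numpy as np
from scipy.ndimage import zoom

def resize_volume(vol, target=(112, 112, 80)):
    # Trilinear-style (order=1) interpolation to the fixed training shape.
    factors = [t / s for t, s in zip(target, vol.shape)]
    return zoom(vol, factors, order=1)

volumes = [f"case_{i:03d}" for i in range(100)]  # hypothetical case IDs
train, test, val = volumes[:60], volumes[60:80], volumes[80:]
```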

4.2 Performance Metrics

The segmentation performance of DSBAV-Net was evaluated using the most commonly used performance metrics in the literature: dice, intersection over union (IoU), 95% Hausdorff distance (95HD), average surface distance (ASD), precision (Prec), and recall (Rec) [45]. These metrics were calculated with the Python library MedPy. Equations 5 and 11–15 give the corresponding mathematical definitions; in Eqs. 5, 11, 14, and 15, TP stands for true positive, FP for false positive, and FN for false negative. The Hausdorff distance (HD) measures how far apart two subsets of a space are in terms of Euclidean distance. In Eqs. 12 and 13, A is the predicted set, B is the ground truth, and d(·) is the distance between points a and b of the sets A and B. Using the 95% HD, as is common in the literature, eliminates outliers.

$$ {\text{Jaccard}}\;({\text{IoU}}) = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FN}} + {\text{FP}}}} $$
(11)
$$ {\text{HD}} = \max \left\{ \mathop {\max }\limits_{a \in A} \mathop {\min }\limits_{b \in B} d\left( a, b \right),\; \mathop {\max }\limits_{b \in B} \mathop {\min }\limits_{a \in A} d\left( a, b \right) \right\} $$
(12)
$$ {\text{ASD}} = \frac{1}{\left| S_{A} \right| + \left| S_{B} \right|}\left( \mathop \sum \nolimits_{a \in S_{A}} d\left( a, S_{B} \right) + \mathop \sum \nolimits_{b \in S_{B}} d\left( b, S_{A} \right) \right) $$
(13)
$$ {\text{Prec}} = \frac{{\text{TP}}}{{\text{TP}} + {\text{FP}}} $$
(14)
$$ {\text{Rec}} = \frac{{\text{TP}}}{{\text{TP}} + {\text{FN}}} $$
(15)
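Since the text states that the metrics were computed with MedPy, a short usage sketch follows. The binary-metric functions below exist in medpy.metric.binary; the placeholder prediction and ground-truth arrays are ours.

```python
# Computing the reported metrics with MedPy on placeholder binary masks.
import numpy as np
from medpy.metric.binary import dc, jc, hd95, asd, precision, recall

pred = np.zeros((112, 112, 80), dtype=bool)  # model prediction (placeholder)
gt = np.zeros((112, 112, 80), dtype=bool)    # ground truth (placeholder)
pred[40:70, 40:70, 20:50] = True
gt[42:72, 42:72, 22:52] = True

spacing = (0.625, 0.625, 0.625)  # isotropic voxel size of STACOM 2018
print("Dice:", dc(pred, gt))
print("IoU :", jc(pred, gt))
print("95HD:", hd95(pred, gt, voxelspacing=spacing))
print("ASD :", asd(pred, gt, voxelspacing=spacing))
print("Prec:", precision(pred, gt))
print("Rec :", recall(pred, gt))
```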

4.3 Model’s Training Details

The training details of the model are shown in Table 1. As seen there, the parameters of all networks were optimized with the Adam algorithm and a learning rate of 0.0001 [46]. Training was set to 100 epochs for DSBAV-Net and the other architectures used in the comparative performance analysis, and was stopped manually at signs of overfitting. The batch size was set to 2 for the STACOM 2018 dataset, and image dimensions were resized to 112 × 112 × 80. Random crop, center crop, random rotation (90 degrees), and flip (axis = np.random.randint(0, 2)) augmentations were applied to the STACOM 2018 dataset. The proposed methodology and all networks used in the comparative analyses were run on a computer with a GTX 1080 Ti GPU in an Anaconda environment with PyTorch (CUDA) version 1.13.1.
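A sketch of the augmentations listed above and the optimizer settings from Table 1 follows; the random/center crop handling and the exact call order are assumptions.

```python
# Illustrative augmentation and optimizer setup matching Sect. 4.3.
import numpy as np
import torch

def augment(volume, label):
    # Random 90-degree rotation in the axial plane.
    if np.random.rand() < 0.5:
        k = np.random.randint(1, 4)
        volume = np.rot90(volume, k, axes=(0, 1)).copy()
        label = np.rot90(label, k, axes=(0, 1)).copy()
    # Flip along a randomly chosen axis, mirroring the text's
    # flip(axis=np.random.randint(0, 2)) call.
    axis = np.random.randint(0, 2)
    volume, label = np.flip(volume, axis).copy(), np.flip(label, axis).copy()
    return volume, label

# Optimizer settings from Table 1: Adam with a learning rate of 1e-4.
model = torch.nn.Conv3d(1, 2, kernel_size=3, padding=1)  # stand-in model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```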

Table 1 Models’ training details

5 Experimental Analysis and Discussion

5.1 Ablation Study

Learning performance was observed to decrease when the kernel size was increased in the standard convolution layers of DSBAV-Net; likewise, increasing the kernel size in its depthwise separable convolution layers degraded training on the STACOM 2018 dataset. The imbalance in image quality within STACOM 2018 is one of the most important reasons. The success of DSBAV-Net on STACOM 2018 increased significantly, mainly due to the spatial and channel attention of the BAM layer. The training and validation losses obtained when PReLU and ReLU activation functions are used in the BAM module are shown in Figs. 5 and 6. As shown in Table 2, the training performance of DSBAV-Net decreased when the standard convolution layer in the proposed model was also converted into a depthwise separable convolution layer. In addition, the depthwise separable convolution layer used after the standard convolution layer significantly reduced the computational cost.

Fig. 5 Training and validation losses when using ReLU in the BAM module of the proposed architecture

Fig. 6 Training and validation losses when using PReLU in the BAM module of the proposed architecture

Table 2 Case-based performance analysis results on the test set of the proposed architecture

5.2 Comparative Performance Analysis of the Model

The final training performance of the proposed methodology and the other models is given in Fig. 7. As the graph shows, the proposed architecture consistently increased its training dice score over the 100 epochs, while the DTCV-Net and CA-Net architectures achieved the lowest dice scores.

Fig. 7 Models' final training dice performance scores

As seen in Table 3, the proposed methodology was compared against the V-Net, BAM V-Net, depthwise separable V-Net (DSV-Net), CA-Net, URPCU-Net, and DTCV-Net models from the literature. Among these, DSV-Net is proposed for the first time in this study; it uses depthwise separable and standard convolution together in each layer. This hybrid convolution significantly reduced the number of parameters and substantially increased the training and testing speed of the architecture. The architectures used in the comparative analysis were chosen with attention to their segmentation performance. The comparative results in Table 3 show that the proposed model is very robust.

Table 3 Quantitative analysis between the proposed methodology and state-of-the-art approaches

5.2.1 Performance Analysis at STACOM 2018

Table 2 shows the case-based performance of the proposed architecture on the 20 test images of the STACOM 2018 dataset. The architecture showed remarkable success in the HD metric in all cases except cases 2 and 3; the hierarchical attention mechanism of its BAM module yielded this high HD performance. Successful results were also obtained on the 20 test images in the dice, Jaccard, and ASD metrics.

Table 3 compares the proposed methodology with state-of-the-art approaches from the literature on the STACOM 2018 dataset. As the table shows, the proposed model performed considerably better than the other models. Moreover, thanks to the layer-based hybrid convolution, the network's parameter count was significantly reduced, helping to overcome hardware-related computational limitations. Compared with URPCU-Net, which came closest in performance, the proposed methodology achieved margins of about 1 point in dice score, 2 points in IoU, and 5 points in 95HD. In addition, DSV-Net was tried for the first time in this study and produced successful results.

Figure 8 shows the qualitative analysis of the proposed methodology against the state-of-the-art approaches. Adding the depthwise separable convolution and BAM to the network keeps it focused on the atrium region. The sections enclosed in red boxes in Fig. 8 represent highly inaccurate predictions outside the organ region.

Fig. 8 Qualitative analysis of the proposed methodology with state-of-the-art approaches in the literature. The sections shown in red boxes in the images are highly incorrect predictions

Figure 9 shows the qualitative analysis of the proposed methodology and the state-of-the-art methods on slices of another test case. The proposed model performed more robustly than the others, especially in the transition between slices 49 and 50. The sections enclosed in red boxes in Fig. 9 represent highly inaccurate predictions outside the organ region.

Fig. 9 Qualitative analysis of the proposed methodology with state-of-the-art approaches in the literature. The sections shown in red boxes in the images are highly incorrect predictions

5.3 Discussion

In this section, the advantages and disadvantages of the proposed method are discussed.

5.3.1 Advances in Operation Time

Thanks to the depthwise separable convolution layers in the proposed model, the parameter count was reduced from approximately 44.5 million to 2.4 million. As Table 3 shows, the proposed model significantly reduces both parameters and time; the table reports the per-epoch step time in seconds for each architecture. One reason for the long run times is that the MR images are three-dimensional, so three-dimensional matrix operations are performed. On computers with larger GPU VRAM, the operation time can be reduced further by increasing the batch size. The proposed architecture also completed each epoch almost 30 s faster than the BAMV-Net and baseline V-Net architectures. However, as shown in the ablation study, the proposed model fails in the training phase when its standard convolution layer is also changed to a depthwise separable convolution layer.

5.3.2 Segmentation Performance Limitations

Thanks to the BAM module, the model obtained excellent results even on low-resolution LA images, although, like the other models, it had difficulty lowering the HD metric. The proposed architecture also significantly reduces the computational cost relative to the parameter counts of the BAM V-Net and baseline V-Net models, as seen in Table 3. On the other hand, when the number of channels in the architecture was reduced, the layer-based hybrid convolution failed to extract high-level features from LA images during training. As the qualitative analyses show, the proposed architecture performed strongly on MRI slices thanks to BAM and the depthwise separable convolution.

6 Conclusion

This study used standard and depthwise separable convolution together in convolutional blocks for the first time. In addition, by using a BAM module after each convolutional block in the encoder, high-level channel and spatial features are emphasized while low-level features are suppressed. Thanks to the depthwise separable convolution, the number of parameters is reduced by approximately 20 times while the image features are still captured both spatially and in depth. The ReLU activation function performed better in the BAM module, while PReLU performed better in the other blocks of the model. Fusing the cross-entropy and dice loss functions further increased the robustness of the proposed model. Comparative analysis on the STACOM 2018 dataset showed the model to be robust: DSBAV-Net achieved a dice score of 91.49 on the 20% test split of STACOM 2018. The dice score and the qualitative analyses show that the proposed model is highly robust for LA segmentation. Future studies will investigate the model's success in segmenting other organs and explore a new loss function to further improve performance.