Introduction

China's diverse ethnic groups possess a rich cultural heritage, making the study of traditional ethnic clothing patterns a pivotal component of preserving Chinese ethnic culture. Applying image segmentation techniques to extract and isolate the elemental patterns within ethnic attire provides crucial technological support for work such as interpreting the symbolic meanings of these patterns and building large-scale ethnic pattern databases.

Currently, existing semantic segmentation methods predominantly rely on fully convolutional neural networks (FCNNs) with a U-shaped architecture [1,2,3]. A typical U-shaped network, such as U-Net [1], comprises a symmetric encoder–decoder structure with skip connections. In the encoder, a series of convolutional layers and consecutive downsampling layers are employed to extract deep features with large receptive fields. The decoder then upsamples the extracted deep features to the input resolution for pixel-level semantic prediction. Meanwhile, skip connections fuse high-resolution features from different scales in the encoder to alleviate the spatial information loss caused by downsampling. With this architectural design, U-Net has achieved significant success in various semantic segmentation tasks. Following this technological trajectory, numerous algorithms, including 3D U-Net [4], Res-Unet [5], U-Net++ [6], and Unet3+ [7], have been developed for diverse semantic segmentation tasks in various scenarios. These FCNN-based methods demonstrate outstanding performance across a range of semantic segmentation tasks, showcasing the powerful capability of convolutional neural networks in learning discriminative features.

Despite the outstanding performance achieved by methods based on convolutional neural networks (CNNs) in the field of semantic segmentation, they still fall short of the stringent accuracy requirements of certain applications. Segmenting ethnic clothing patterns with intricate textures remains a challenging task. Due to the inherent locality of convolutional operations, CNN methods struggle to learn explicit global and long-range semantic interactions [8]. Some studies attempt to address this issue by employing dilated convolutional layers [9, 10], self-attention mechanisms [11, 12], and image pyramids [13]. However, these methods still have limitations in modeling long-range dependencies. Recently, inspired by the tremendous success of Transformers in Natural Language Processing (NLP) [14], researchers have begun exploring the integration of Transformers into computer vision [15]. Dosovitskiy et al. [16] proposed the Vision Transformer (ViT) for image recognition tasks; by using 2D image patches with position embeddings as input and pre-training on large-scale datasets, ViT achieved performance comparable to CNN-based methods. Additionally, Touvron et al. [17] introduced the Data-efficient Image Transformer (DeiT), demonstrating that Transformers can be trained on medium-sized datasets and, when combined with distillation methods, achieve more robust performance. Liu et al. [18] proposed the Swin Transformer, a hierarchical Vision Transformer using a shifted window mechanism, which achieved excellent performance in image classification, object detection, and semantic segmentation. The success of ViT, DeiT, and the Swin Transformer in image recognition underscores the potential of applying Transformers to the field of computer vision.

Inspired by the Swin Transformer [18], this paper introduces the Mixed Swin Transformer U-Net, aiming to fully leverage the advantages of Transformers in image segmentation. The encoder, bottleneck, and decoder are constructed from classical convolutional layers and Swin Transformer modules [18]. Input ethnic clothing images first pass through shallow classical convolutional layers, which capture the local relationships and higher-resolution details present in the initial layers. Swin Transformer modules then process the deeper, more abstract semantic features, reducing the computational load of the network. In the decoder, an Attention Gate mechanism, coupled with multi-scale features from the encoder, performs adaptive feature fusion to restore the spatial resolution of the feature maps and further facilitate segmentation predictions. Experiments conducted on a custom-built dataset of ethnic clothing patterns demonstrate that the proposed method exhibits outstanding segmentation accuracy and robust generalization capability.

The contributions of this paper can be summarized in three aspects:

  1. Symmetric encoder–decoder structure with local-to-global self-attention: a symmetric encoder–decoder structure is established based on both traditional convolutional layers and Swin Transformer modules. The encoder incorporates local-to-global self-attention, enabling the model to capture information from local to global scales. In the decoder, global features are upsampled to the input resolution, facilitating pixel-level segmentation predictions.

  2. Adaptive feature fusion using attention gate mechanism: the decoder utilizes an Attention Gate mechanism for adaptive feature fusion. This mechanism enhances the network's focus on target regions while suppressing irrelevant areas in the images, thereby improving segmentation accuracy.

  3. Quantitative and qualitative experiments on ethnic clothing pattern dataset: quantitative and qualitative comparative experiments are conducted on an ethnic clothing pattern dataset, demonstrating superior performance compared to similar semantic segmentation algorithms. The proposed method achieves optimal results in terms of segmentation accuracy on the ethnic clothing pattern dataset.

Related work

CNN-based methods

The early methods for semantic segmentation were primarily based on contours and traditional machine learning algorithms [19, 20]. With the development of deep convolutional neural networks (CNNs), U-Net was introduced for semantic segmentation [1]. Due to the simplicity and excellent performance of the U-shaped architecture, various U-Net-like methods emerged, including Res-Unet [5], Dense-Unet [21], U-Net++ [6], and Unet3+ [7]. Chen et al. [22] proposed the use of fully convolutional networks for semantic segmentation, achieving pixel-level predictions. Zhao et al. [13] introduced the pyramid pooling module and pyramid scene parsing network, enhancing the model's ability to capture global contextual information. These methods have been applied to various application scenarios, such as 3D medical image segmentation [4, 23], video image segmentation [24], and 3D point cloud segmentation [25]. Currently, methods based on convolutional neural networks have achieved significant success in various segmentation domains, owing to their powerful representational capabilities. Gou et al. [26] introduced a novel approach named Multilevel Attention-based Sample Correlations for Knowledge Distillation (MASCKD). MASCKD utilizes attention maps from multiple hierarchical layers to construct sample correlations, focusing on the most critical sample regions and thereby enhancing the effectiveness of relational knowledge distillation.

Transformer-based methods

The Transformer architecture was originally proposed for machine translation tasks [14]. In the field of natural language processing, Transformer-based methods have demonstrated excellent performance across various tasks [27]. Inspired by this success, Dosovitskiy et al. [16] pioneered the introduction of the Vision Transformer (ViT) into computer vision, achieving competitive accuracy in image recognition tasks. Compared with CNN-based methods, ViT's drawback lies in its reliance on pre-training with large-scale datasets. To alleviate the difficulty of training ViT, DeiT [17] proposed several training strategies enabling ViT to perform well on ImageNet. Recently, notable works have expanded upon ViT [28, 29]. Among them, Liu et al. [18] introduced the Swin Transformer as a backbone network for vision tasks. Based on a shifted window mechanism, the Swin Transformer has achieved state-of-the-art performance in various visual tasks, including image classification, object detection, and semantic segmentation. In this study, Swin Transformer modules are employed as units within the network, constructing a U-shaped encoder–decoder structure for the segmentation of ethnic clothing patterns.

In recent years, researchers have explored integrating self-attention mechanisms into CNNs to enhance network performance [12]. Chen et al. [8] combined Transformers with CNNs to construct a powerful encoder for 2D medical image segmentation. Valanarasu et al. [30] and Zhang et al. [31] leveraged the complementarity of Transformers and CNNs to improve the segmentation capabilities of their models. Various combinations of Transformers and CNNs have also been applied to tasks such as multimodal brain tumor segmentation [32] and 3D medical image segmentation [33, 34]. Wang et al. [35] introduced the Mixed Transformer Module (MTM), which efficiently computes self-affinities through a carefully designed local–global Gaussian-weighted self-attention mechanism (LGG-SA) and further explores inter-sample relationships using an external attention mechanism (EA). The Mixed Transformer Module was then used to construct a U-shaped network for medical image segmentation.

Dataset

This study collected a total of 1600 images of ethnic clothing patterns through on-site photography, book scanning, and internet sources. These images encompass four major categories of classic patterns, namely flowers, birds, butterflies, and ornaments. Survey results indicate that these four types of ethnic clothing patterns exhibit prominent characteristics, featuring rich texture details and a high frequency of occurrence in ethnic attire, making them suitable as representative types for studying ethnic clothing patterns. Additionally, manual annotation was performed on the entire dataset, generating corresponding mask images for each photograph. The four types of ethnic clothing patterns are illustrated in Fig. 1.

Fig. 1 Four classic ethnic clothing patterns

Due to variations in environmental conditions such as ambient lighting during on-site collection and book scanning, this study employed Gaussian filtering to denoise the dataset. This denoising process aims to enhance the quality of images, thereby further improving the accuracy and reliability of the model's image processing results. Additionally, to generate more training samples and enhance the model's robustness and generalization capabilities, various data augmentation techniques such as random rotation, cropping, contrast variations, and changes in image size were applied to preprocess the dataset.
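
As an illustration, the denoising and augmentation steps described above can be sketched with OpenCV and torchvision; the kernel size, rotation range, crop scale, and jitter strength below are illustrative assumptions rather than the exact settings used in this study.

```python
import cv2
import torchvision.transforms as T
from PIL import Image

# Gaussian filtering to suppress acquisition noise (kernel size and sigma are assumed).
def denoise(path: str) -> Image.Image:
    img = cv2.imread(path)                          # BGR, uint8
    img = cv2.GaussianBlur(img, (3, 3), 0)          # 3 x 3 Gaussian kernel
    return Image.fromarray(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))

# Augmentation pipeline: random rotation, cropping, contrast variation, and resizing.
augment = T.Compose([
    T.RandomRotation(degrees=30),                   # random rotation (range assumed)
    T.RandomResizedCrop(224, scale=(0.7, 1.0)),     # random crop and change of image size
    T.ColorJitter(contrast=0.3),                    # contrast variation
    T.ToTensor(),
])
```

For segmentation, the same geometric transforms must also be applied to the corresponding mask image; paired image–mask augmentation utilities handle this automatically.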

Methods

To address challenges arising from diverse elements in ethnic clothing pattern semantic segmentation, such as complex backgrounds, intricate details, and potential label confusion, this paper introduces a Mixed Swin Transformer U-Net (MST-Unet).

The model is based on a U-shaped encoder–decoder network structure and utilizes an Attention Gate mechanism for adaptive feature fusion during decoding. Classic convolutional operations are retained in the shallow layers of the network because these layers carry high-resolution details, such as the shape, color, and texture of the target, and convolutions are well suited to learning such features. Additionally, classic convolutional operations possess excellent locality and parameter-sharing properties, which reduce model complexity, improve training efficiency, and enhance the handling of large-scale datasets.

In the deeper layers, where semantic information is more complex and features are highly abstract, this paper employs two consecutive Swin Transformer modules for feature extraction. Two consecutive modules are used because they incorporate complementary attention mechanisms: the first applies window-based multi-head self-attention (W-MSA) and the second applies shifted window-based multi-head self-attention (SW-MSA). Employing exactly two successive modules also strikes a balance between model complexity, computational efficiency, and the desired feature extraction capability.

By incorporating both W-MSA and SW-MSA modules in two consecutive layers, the model aims to achieve effective hierarchical feature representation while mitigating the computational burden associated with a higher number of modules. This design choice enables the model to capture intricate patterns and details in the data while maintaining computational feasibility. Such an approach ensures the effectiveness of the Swin Transformer module within a broader architectural context, emphasizing the thoughtful consideration of complexity, computational efficiency, and feature extraction requirements.

The Local–Global Attention mechanism within these modules aids in effectively integrating local features and global information. This allows the model to better capture spatial relationships and context information in the images, thereby enhancing the model's feature representation capability. Furthermore, the Local–Global Attention mechanism helps reduce computational and storage costs. Patch Merging and Patch Expanding are employed for down-sampling and up-sampling of data, respectively, to ensure dimensional consistency of features across different stages in the model.

The Swin Transformer module integrates the Window-based Multi-Head Self-Attention (W-MSA) and Shifted Window-based Multi-Head Self-Attention (SW-MSA) mechanisms, while the Attention Gate incorporates an attention gating mechanism. These diverse attention mechanisms are synergistically fused within the model, facilitating a comprehensive integration of multiple attention mechanisms. This amalgamation enables the model to leverage the specific strengths of W-MSA and SW-MSA within the Swin Transformer module for effective feature extraction and modeling. Simultaneously, the attention gating mechanism in the Attention Gate module dynamically adjusts feature fusion weights based on the correlation between encoder and decoder feature maps. This amalgamation of various attention mechanisms enhances the model's adaptability and performance in tasks such as semantic segmentation of ethnic clothing patterns. The overall structure of the network model is illustrated in Fig. 2.

Fig. 2 Architecture of the MST-Unet network

Upper encoder–decoder structure

In the encoder section of the upper network [35], three downsampling modules (Conv 3 × 3 Block) are employed to progressively downsample the input image and extract features at different scales. Each downsampling module comprises a dual convolution module and a pooling layer. The dual convolution module enhances the representational capacity of the feature maps, while the pooling layer downsamples the feature maps by a factor of two to reduce resolution. The network structure is illustrated in Fig. 3.

Fig. 3 Architecture of the downsampling module (a Conv 3 × 3 Block; b DoubleConv Block)

In each downsampling module, feature maps of different scales, processed through dual convolution layers, are stored in a list for subsequent decoding use.

The DoubleConv Block consists of two ConvBNRelu layers and a residual connection. Each ConvBNRelu layer includes a convolution, batch normalization (BatchNorm), and a ReLU activation function. The input feature map undergoes convolution to produce the output feature map, denoted as y:

$$ {y_{i,j}} = \mathop \sum \limits_{m = 0}^{k - 1} \mathop \sum \limits_{n = 0}^{k - 1} {w_{m,n}}{x_{i + m,j + n}} + b $$
(1)

x represents the input feature map, w denotes the convolutional kernel, b represents the bias term, and k indicates the size of the convolutional kernel. The indices i and j denote the row and column of the output feature map, while m and n index the positions of the convolutional kernel as it slides over the input feature map.

The feature map y, obtained through the convolution operation, undergoes batch normalization, and the process is represented by the following formula (2):

$$ BatchNorm\;(y) = \gamma \frac{y - \mu }{{\sqrt {{\sigma^2} + \varepsilon } }} + \beta $$
(2)

γ and β represent the learnable parameters for scaling and shifting, respectively. ε is a very small constant used to avoid division by zero. μ and σ² denote the per-channel mean and variance computed over the batch, calculated as follows:

$$ {\mu_j} = \frac{1}{B \times H \times W}\mathop \sum \limits_{i = 1}^B \mathop \sum \limits_{h = 1}^H \mathop \sum \limits_{\omega = 1}^W {x_{i,j,h,\omega }} $$
(3)
$$ \sigma_j^2 = \frac{1}{B \times H \times W}\mathop \sum \limits_{i = 1}^B \mathop \sum \limits_{h = 1}^H \mathop \sum \limits_{\omega = 1}^W {\left( {{x_{i,j,h,\omega }} - {\mu_j}} \right)^2} $$
(4)

Finally, the output after batch normalization undergoes a non-linear mapping using the ReLU activation function to extract features and introduce non-linear characteristics.
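
For concreteness, Eqs. (2)–(4) amount to the following per-channel computation over the batch and spatial dimensions; this sketch mirrors the training-time behavior of PyTorch's nn.BatchNorm2d.

```python
import torch

def batch_norm(x: torch.Tensor, gamma: torch.Tensor, beta: torch.Tensor,
               eps: float = 1e-5) -> torch.Tensor:
    """Eqs. (2)-(4): normalize each channel of a (B, C, H, W) tensor over batch and space."""
    mu = x.mean(dim=(0, 2, 3), keepdim=True)                  # Eq. (3), per-channel mean
    var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)  # Eq. (4), per-channel variance
    x_hat = (x - mu) / torch.sqrt(var + eps)
    return gamma.view(1, -1, 1, 1) * x_hat + beta.view(1, -1, 1, 1)  # Eq. (2), scale and shift
```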

The first ConvBNRelu layer employs a 3 × 3 kernel for feature extraction, producing an output with a specified number of channels. The second ConvBNRelu layer also uses a 3 × 3 kernel but does not apply the ReLU activation; it performs only convolution and batch normalization, without altering the spatial dimensions or the channel count. Following the second ConvBNRelu layer, a 1 × 1 convolution applies a channel-wise linear transformation to the feature map, enhancing the model's expressive power. Finally, this output is combined through a residual connection with the output of the two preceding ConvBNRelu layers, and the sum passes through a ReLU activation to introduce non-linearity.

The role of the DoubleConv Block is to extract advanced features from the input feature map while preserving its low-level features using a residual connection. This preservation of low-level features is crucial for subsequent feature fusion and classification operations.
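
As a concrete illustration, a DoubleConv Block consistent with the description above can be sketched in PyTorch as follows; the placement of the 1 × 1 convolution on the shortcut path and the channel handling are assumptions, since they are not fully specified here.

```python
import torch
import torch.nn as nn

class DoubleConvBlock(nn.Module):
    """Two 3x3 Conv-BN layers with a residual connection and a final ReLU (sketch)."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # First ConvBNRelu layer: 3x3 convolution + BatchNorm + ReLU.
        self.conv1 = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        # Second layer: 3x3 convolution + BatchNorm, no ReLU before the residual sum.
        self.conv2 = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # 1x1 convolution on the shortcut so the residual branch matches the channel count (assumed).
        self.shortcut = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv2(self.conv1(x))
        return self.relu(out + self.shortcut(x))   # residual fusion followed by ReLU
```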

In the decoder section of the shallow network, three upsampling modules (DeConv 3 × 3 Block) are employed to progressively upsample the feature maps. Each module consists of a deconvolutional layer and a DoubleConv Block. The deconvolutional layer upsamples the feature map by a factor of two to increase resolution, while the DoubleConv Block enhances the representational capacity of the feature map. In each upsampling module, the output of the deconvolutional layer is fused with the feature map saved from the corresponding stage in the encoder. The output after the third upsampling module yields the segmentation result.
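
Continuing the sketch, one possible form of the upsampling module is shown below. It reuses the DoubleConvBlock defined above and assumes channel-wise concatenation as the fusion operator; the Attention Gate variant used for adaptive fusion is shown later.

```python
import torch
import torch.nn as nn

class UpBlock(nn.Module):
    """DeConv 3x3 Block: 2x transposed-convolution upsampling, skip fusion, DoubleConv (sketch)."""
    def __init__(self, in_ch: int, skip_ch: int, out_ch: int):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, in_ch // 2, kernel_size=2, stride=2)
        # Concatenation along the channel axis is assumed as the fusion operator here.
        self.conv = DoubleConvBlock(in_ch // 2 + skip_ch, out_ch)

    def forward(self, x: torch.Tensor, skip: torch.Tensor) -> torch.Tensor:
        x = self.up(x)                        # double the spatial resolution
        x = torch.cat([x, skip], dim=1)       # fuse with the saved encoder feature map
        return self.conv(x)
```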

Deeper swin transformer module

The Swin Transformer module [36] is constructed based on the shifted windows approach. The structure of two consecutive Swin Transformer modules is illustrated in Fig. 4.

Fig. 4 Two successive Swin Transformer blocks

Each Swin Transformer module consists of a LayerNorm (LN) layer, a multi-head self-attention module that includes window-based multi-head self-attention (W-MSA) and shifted window-based multi-head self-attention (SW-MSA), a residual connection, and a two-layer perceptron with GELU non-linearity. The computation process of the Swin Transformer module based on this window partitioning mechanism can be represented as:

$$ {\hat z^l} = \text{W-MSA}\left( {\text{LN}\left( {{z^{l - 1}}} \right)} \right) + {z^{l - 1}} $$
(5)
$$ {z^l} = \text{MLP}\left( {\text{LN}\left( {{{\hat z}^l}} \right)} \right) + {\hat z^l} $$
(6)
$$ {\hat z^{l + 1}} = \text{SW-MSA}\left( {\text{LN}\left( {z^l} \right)} \right) + {z^l} $$
(7)
$$ {z^{l + 1}} = \text{MLP}\left( {\text{LN}\left( {{{\hat z}^{l + 1}}} \right)} \right) + {\hat z^{l + 1}} $$
(8)

\( {\hat z^l} \) and \( {z^l} \) represent the outputs of the l-th (S)W-MSA module and the MLP module, respectively. The self-attention calculation method for the (S)W-MSA module in the network is as follows:

$$ Attention\left( {Q,K,V} \right) = SoftMax\left( {\frac{{Q{K^T}}}{\sqrt d } + B} \right)V $$
(9)

Q, K, V ∈ \( {R^{{M^2} \times d}} \) denote the query, key, and value matrices, respectively, where M² is the number of patches within a window and d is the dimension of the query or key. The values in B are taken from the relative position bias matrix \( \hat B \in {R^{\left( {2M - 1} \right) \times \left( {2M - 1} \right)}} \).
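
Equation (9) corresponds to scaled dot-product attention computed within each M × M window plus a learned relative position bias. The following single-head PyTorch sketch assumes the input has already been partitioned into windows; the class and variable names are illustrative.

```python
import torch
import torch.nn as nn

class WindowAttention(nn.Module):
    """Window-based self-attention with relative position bias (single-head sketch)."""
    def __init__(self, dim: int, window_size: int):
        super().__init__()
        self.scale = dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        # One learnable bias per possible relative offset: a (2M-1)^2 table.
        self.bias_table = nn.Parameter(torch.zeros((2 * window_size - 1) ** 2))
        # Precompute the index mapping each (query, key) pair to its bias entry.
        coords = torch.stack(torch.meshgrid(
            torch.arange(window_size), torch.arange(window_size), indexing="ij"))
        coords = coords.flatten(1)                          # 2 x M^2
        rel = coords[:, :, None] - coords[:, None, :]       # 2 x M^2 x M^2
        rel = rel.permute(1, 2, 0) + (window_size - 1)      # shift offsets to be non-negative
        index = rel[..., 0] * (2 * window_size - 1) + rel[..., 1]
        self.register_buffer("index", index)                # M^2 x M^2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_windows, M^2, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale       # QK^T / sqrt(d)
        attn = attn + self.bias_table[self.index]           # + B
        return attn.softmax(dim=-1) @ v
```

In the SW-MSA variant used by the second of the two consecutive blocks, the feature map is cyclically shifted (e.g., with torch.roll) before window partitioning, so that successive blocks exchange information across window boundaries.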

In the deep encoder section of the network, a feature map with a resolution of H/4 × W/4 and a channel count of C is fed into two consecutive Swin Transformer modules for representation learning, during which the feature dimension and resolution remain unchanged. A patch merging layer then performs 2× downsampling to reduce the number of tokens while doubling the feature dimension. This process is repeated twice in the encoder.
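
For reference, a patch merging layer that halves the spatial resolution and doubles the channel dimension, following the standard Swin Transformer formulation, can be sketched as follows; the (B, H, W, C) layout is an assumption.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """2x downsampling: gather 2x2 neighboring patches, then project 4C -> 2C."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) with H and W even
        x = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                       x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)   # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))                           # (B, H/2, W/2, 2C)
```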

An overly deep Transformer network can suffer from convergence issues. Therefore, a bottleneck for learning deep feature representations is constructed from two consecutive Swin Transformer blocks placed after the encoder. In this bottleneck, the channel count and resolution of the feature map remain unchanged.

The deeper decoder section of the network is likewise constructed from Swin Transformer modules. Unlike the patch merging layer used in the encoder, the decoder employs a patch expanding layer to upsample the extracted deep features. This layer performs 2× upsampling by reshaping feature maps along adjacent dimensions into higher-resolution maps, while reducing the feature dimension to half of the original.
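
Conversely, a patch expanding layer doubles the spatial resolution while halving the channel dimension. One common realization, as used in Swin-Unet-style decoders and assumed here, is:

```python
import torch
import torch.nn as nn

class PatchExpanding(nn.Module):
    """2x upsampling: project C -> 2C, then rearrange channels into a 2x2 spatial grid of C/2."""
    def __init__(self, dim: int):
        super().__init__()
        self.expand = nn.Linear(dim, 2 * dim, bias=False)
        self.norm = nn.LayerNorm(dim // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C)
        B, H, W, C = x.shape
        x = self.expand(x)                                    # (B, H, W, 2C)
        x = x.view(B, H, W, 2, 2, C // 2)                     # split channels into a 2x2 block
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, 2 * H, 2 * W, C // 2)
        return self.norm(x)                                   # (B, 2H, 2W, C/2)
```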

Attention gate mechanism

In the decoder part of the network, an attention gate mechanism (Attention Gate) is utilized for adaptive feature fusion [37]. In this mechanism, the encoder's feature map is first transformed through convolutional operations to match the shape of the decoder's feature map. Attention weights are then computed from the element-wise sum of the transformed encoder and decoder features followed by activation functions, determining the importance of each position in the feature fusion process. The attention weights are multiplied by the encoder's feature map, and the result is added to the transformed decoder's feature map to achieve feature fusion. Finally, the fused feature map serves as the input to the decoder for generating the final prediction results. The schematic diagram of the Attention Gate is shown in Fig. 5, and it is formulated as follows:

$$ q_{att}^l = {\psi^T}\left( {{\sigma_1}\left( {W_x^Tx_i^l + W_g^Tg_i^l + {b_g}} \right)} \right) + {b_\psi } $$
(10)
$$ \alpha_i^l = {\sigma_2}\left( {q_{att}^l\left( {x_i^l,\;{g_i};\;{\Theta_{att}}} \right)} \right) $$
(11)
Fig. 5 Schematic of the Attention Gate

\( {\sigma_1} \) represents the ReLU function, \( {\sigma_2} \) represents the Sigmoid function, \( {\psi^T} \), \( W_x^T \), \( W_g^T \) all represent convolution operations, \( {b_g} \), \( {b_\psi } \) both correspond to the bias terms of the convolution.
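
A minimal PyTorch sketch of Eqs. (10) and (11) is given below; the use of 1 × 1 convolutions and the intermediate channel count inter_ch are assumptions, as the kernel sizes are not stated above.

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Attention Gate (Eqs. 10-11): gate encoder features x with the decoder signal g (sketch)."""
    def __init__(self, x_ch: int, g_ch: int, inter_ch: int):
        super().__init__()
        self.W_x = nn.Conv2d(x_ch, inter_ch, kernel_size=1)   # W_x^T (bias term b_g folded in)
        self.W_g = nn.Conv2d(g_ch, inter_ch, kernel_size=1)   # W_g^T
        self.psi = nn.Conv2d(inter_ch, 1, kernel_size=1)      # psi^T with bias b_psi
        self.relu = nn.ReLU(inplace=True)                     # sigma_1
        self.sigmoid = nn.Sigmoid()                           # sigma_2

    def forward(self, x: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
        # x: encoder feature map, g: gating signal from the decoder (same spatial size assumed)
        q = self.psi(self.relu(self.W_x(x) + self.W_g(g)))    # Eq. (10)
        alpha = self.sigmoid(q)                                # Eq. (11), attention coefficients
        return x * alpha                                       # re-weighted encoder features
```

The gated features x·α are then combined with the transformed decoder features, as described above, before the next decoding stage.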

Feature fusion under the attention gate mechanism, compared to the common concatenation operation along specified dimensions, allows for dynamic adjustment of feature weights based on the correlation between encoder and decoder feature maps.

This dynamic adjustment involves computing attention weights, typically via similarity measures such as the dot product or scaled dot product, that capture the relevance between encoder and decoder features. These attention weights are then applied to the respective feature maps, emphasizing task-relevant information at each position or channel, and the fused features are obtained through operations such as element-wise multiplication with the dynamically adjusted weights. The dynamism derives from the learnable parameters of the attention mechanism, which allow the network, through training, to adapt the importance of features to the contextual information in the input data. In essence, the Attention Gate mechanism lets the network dynamically regulate the emphasis placed on different parts of the input, improving the capture of task-related information, sharpening target boundaries and fine details, and thereby enhancing the quality of the segmentation results.

Experiments

Implementation details

MST-Unet is implemented using Python 3.9 and PyTorch 1.11.0. To enhance the diversity of the training set, various data augmentation techniques such as flipping and rotating are applied. The input image size and batch size are set to 224 × 224 and 24, respectively. Training is performed on an NVIDIA A4000 GPU with 32GB of memory. The SGD optimizer is employed during the training process for backpropagation, with a momentum setting of 0.9 and weight decay set to 0.0001.
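
The reported optimizer settings translate directly into PyTorch; the learning rate, loss function, and the MSTUnet and train_set names below are illustrative placeholders.

```python
import torch
from torch.utils.data import DataLoader

model = MSTUnet(num_classes=2).cuda()            # hypothetical model class name
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.01,             # assumed learning rate
                            momentum=0.9,        # as reported
                            weight_decay=1e-4)   # as reported
criterion = torch.nn.CrossEntropyLoss()          # assumed segmentation loss

loader = DataLoader(train_set, batch_size=24, shuffle=True)   # 224 x 224 inputs
for images, masks in loader:
    optimizer.zero_grad()
    loss = criterion(model(images.cuda()), masks.cuda())
    loss.backward()
    optimizer.step()
```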

Experiment results on ethnic clothing pattern dataset

To further demonstrate the superiority of the MST-Unet network model, this paper compares the performance of different models in the semantic segmentation of sub-categories of ethnic clothing patterns, as shown in Table 1.

Table 1 Comparison of segmentation performance among different models in various subclasses

The Dice coefficient is defined as twice the intersection of the predicted and true segmentation areas, divided by the sum of the areas of the predicted and true segmentations. It ranges from 0 to 1, with a value of 1 indicating perfect overlap between the predicted and true segmentations. It serves as a quantitative measure of segmentation accuracy, with higher values reflecting better agreement between the predicted and ground truth segmentations.
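
For reference, the Dice coefficient for a binary mask can be computed as follows; the small smoothing term eps is an implementation detail added to avoid division by zero.

```python
import torch

def dice_coefficient(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> float:
    """Dice = 2 * |pred ∩ target| / (|pred| + |target|) for binary masks."""
    pred = pred.float().flatten()
    target = target.float().flatten()
    intersection = (pred * target).sum()
    return ((2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)).item()
```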

Each model in Table 1 was trained with the data augmentation process described above. The augmented data were fed into each model for semantic segmentation of ethnic clothing patterns, and segmentation results were obtained after an equal number of iterations. The segmentation outcomes of each model were evaluated with the Dice coefficient, yielding the data presented in Table 1.

From Table 1, it can be observed that the MST-Unet network model performs optimally in predicting the patterns of flowers, birds, ornaments, and butterflies. Additionally, the MST-Unet model has the fewest parameters among the compared models, and its average Dice score for predicting the four categories of ethnic clothing patterns is as high as 89.80%, achieving the best results among the compared models.

Ablation study

As shown in Fig. 2, the proposed method in this paper includes the deep Swin Transformer modules (ST) and the Attention Gate mechanism (AG) in the decoder. To test the effectiveness of each component, individual components were omitted from the proposed MST-Unet network, and ablation experiments were conducted on the subcategories of the ethnic clothing pattern dataset. Table 2 presents the results of the ablation experiments with different components.

Table 2 Comparative analysis of segmentation performance among different components in various subclasses

From Table 2, it can be observed that, compared to Basic-Unet without any components, using Swin Transformer modules in the deep layers of the network improves the experimental results. For flower, bird, butterfly, and ornament patterns, the Dice scores increase by 3.6%, 4.1%, 1.6%, and 0.4%, respectively, and the average Dice score improves by 2.4%.

The second component of the proposed method is the Attention Gate adaptive feature fusion mechanism. As shown in Table 2, the network with the Attention Gate mechanism in the decoder outperforms Basic-Unet without any components: the Dice scores for flower, bird, butterfly, and ornament patterns improve by 4.4%, 5.2%, 2.4%, and 0.5%, respectively, and the average Dice score improves by 3.1%.

Moreover, combining the Swin Transformer module with the Attention Gate adaptive feature fusion mechanism further improves the results. As shown in Table 2, compared to Basic-Unet without any components, MST-Unet increases the Dice scores for the flower, bird, butterfly, and ornament patterns by 9.2%, 8.2%, 5.0%, and 0.9%, respectively, and the average Dice score improves by 5.8%.

Furthermore, MST-Unet demonstrated superior segmentation performance for the four subclasses of ethnic clothing patterns compared to networks using only a single component, providing further evidence of the effectiveness of the proposed method.

Ablation experiment visual comparison

To further validate the effectiveness of the components used in this proposed method, a visual comparison of the attention region distribution maps generated by models with different components is conducted, as shown in Fig. 6. From the figure, it can be observed that the components used in the proposed method have a noticeable effect on more accurately segmenting the target region in the image and suppressing irrelevant areas.

Fig. 6 Comparative visualization of attention distribution maps for networks with different components

In the attention distribution maps of the model, darker colors indicate areas where the model's attention is more concentrated. As shown in Fig. 6, the network without any components (Basic-Unet) exhibits a more scattered attention distribution in the target region, and attention also spreads to irrelevant areas of the image, indicating some misjudgment.

The network using the Swin Transformer module (S-Unet) and the network using the Attention Gate adaptive feature fusion mechanism (Unet-A) both exhibit more concentrated attention distribution in the target region of the attention map. They also show a relative reduction in attention distribution for irrelevant areas in the image. Meanwhile, the network using both components (MST-Unet) demonstrates attention that is entirely concentrated in the target region of the image in the distribution map, with no attention distributed to irrelevant areas, indicating an absence of misjudgment.

This observation highlights the effectiveness of the proposed method in assisting the network to concentrate attention on the target region, suppress irrelevant areas, and enhance segmentation accuracy.

Experiment results visual comparison

In order to intuitively validate the effectiveness of the MST-Unet model, this paper compares the visualized results of the MST-Unet model with other semantic segmentation models on the prediction of ethnic clothing patterns, as shown in Fig. 7. From the figure, it can be observed that the MST-Unet model has a significant advantage in the segmentation of ethnic clothing patterns compared to other segmentation methods.

Fig. 7 Comparison of segmentation performance between MST-Unet and other models on ethnic clothing patterns: a original image, b ground truth, c Deeplab_v3+, d U-Net, e Att-Unet, f MST-Unet

For flower patterns, the contour of the MST-Unet segmentation result is clearer, and there are fewer misjudgments in non-pattern areas compared to other models. For bird patterns, the MST-Unet result retains more complete details, such as the contour of the wings, the shape of the claws, and protruding feathers, with no misjudgment of background elements. For butterfly patterns, the MST-Unet result has fewer misjudgments in the background and other negative-sample patterns, as well as fewer noise points compared to other models. For ornament patterns, the MST-Unet result has a smoother and clearer contour and fewer misjudgments in negative-sample areas compared to other models.

The visual comparison of segmentation results shows that the problems of label confusion caused by diverse background elements, difficulty in retaining edge details, and the tendency to lose small targets in ethnic clothing pattern semantic segmentation are well addressed.

Conclusion

The proposed MST-Unet semantic segmentation model is based on an encoder–decoder structure. It utilizes a double convolution module in the shallow network and employs the Swin Transformer module in the deep network to extract features from ethnic clothing patterns. In the decoder section, an Attention Gate is applied for adaptive feature fusion, thereby enhancing the model's accuracy at target boundaries and in fine details. Finally, comparative experiments validate the effectiveness of the MST-Unet model on the ethnic clothing pattern dataset.

Given the diverse types of ethnic clothing patterns, this study focused on four representative categories. Future research could enhance the model's generalization capability by expanding the dataset, enabling high-quality semantic segmentation across a wider range of ethnic clothing pattern types while maintaining segmentation accuracy.