Introduction

China's diverse ethnic groups possess a rich cultural heritage, making the study of traditional ethnic clothing patterns a pivotal component of preserving Chinese ethnic culture. Applying image segmentation techniques to extract and isolate the elemental patterns within ethnic attire provides crucial technological support for work such as interpreting the symbolic meanings of these patterns and building large-scale ethnic pattern databases.

Currently, existing semantic segmentation methods predominantly rely on fully convolutional neural networks (FCNNs) with a U-shaped architecture [1,2,3]. A typical U-shaped network, such as U-Net [1], comprises a symmetric encoder–decoder structure with skip connections. In the encoder, a series of convolutional layers and consecutive downsampling layers are employed to extract deep features with large receptive fields. The decoder then upsamples the extracted deep features to the input resolution for pixel-level semantic prediction. Meanwhile, skip connections fuse high-resolution features from different scales in the encoder to alleviate the spatial information loss caused by downsampling. With this architectural design, U-Net has achieved significant success in various semantic segmentation tasks. Following this technological trajectory, numerous algorithms, including 3D U-Net [4], Res-Unet [5], U-Net++ [6], and Unet3+ [7], have been developed for diverse semantic segmentation tasks in various scenarios. These FCNN-based methods demonstrate outstanding performance across a range of semantic segmentation tasks, showcasing the powerful capability of convolutional neural networks in learning discriminative features.

Despite the outstanding performance achieved by methods based on convolutional neural networks (CNNs) in the field of semantic segmentation, they still fall short of the stringent accuracy requirements of certain applications. Segmenting ethnic clothing patterns with intricate textures remains a challenging task. Due to the inherent locality of convolutional operations, CNN methods struggle to learn explicit global and long-range semantic interactions [8]. Some studies attempt to address this issue by employing dilated convolutional layers [9, 10], self-attention mechanisms [11, 12], and image pyramids [13]. However, these methods still have limitations in modeling long-range dependencies. Recently, inspired by the tremendous success of Transformers in Natural Language Processing (NLP) [14], researchers have begun exploring the integration of Transformers into computer vision [15]. Dosovitskiy et al. [16] proposed the Vision Transformer (ViT) for image recognition tasks; by using 2D image patches with position embeddings as input and pre-training on large-scale datasets, ViT achieved performance comparable to CNN-based methods. Additionally, Touvron et al. [17] introduced the Data-efficient Image Transformer (DeiT), demonstrating that Transformers can be trained on medium-sized datasets and, when combined with distillation methods, achieve more robust performance. Liu et al. [18] proposed the Swin Transformer, a hierarchical Vision Transformer using a shifted window mechanism, which achieved excellent performance in image classification, object detection, and semantic segmentation. The success of ViT, DeiT, and the Swin Transformer in image recognition underscores the potential of applying Transformers to the field of computer vision.

Inspired by the Swin Transformer [18], this paper introduces the Mixed Swin Transformer U-Net, aiming to fully leverage the advantages of Transformers in image segmentation. The encoder, bottleneck, and decoder are constructed from classical convolutional layers and Swin Transformer modules [18]. Input ethnic clothing images first pass through shallow classical convolutional layers, which capture the local relationships and higher-resolution details present in the initial layers. Swin Transformer modules then process the deeper, more abstract semantic features, reducing the computational load of the network. In the decoder, an Attention Gate mechanism, coupled with multi-scale features from the encoder, performs adaptive feature fusion to restore the spatial resolution of the feature maps and further facilitate segmentation predictions. Experiments conducted on a custom-built dataset of ethnic clothing patterns demonstrate that the proposed method exhibits outstanding segmentation accuracy and robust generalization capability.

The contributions of this paper can be summarized in three aspects:

  1. Symmetric encoder–decoder structure with local-to-global self-attention: a symmetric encoder–decoder structure is established based on both traditional convolutional layers and Swin Transformer modules. The encoder incorporates local-to-global self-attention, enabling the model to capture information from local to global scales. In the decoder, global features are upsampled to the input resolution, facilitating pixel-level segmentation predictions.

  2. Adaptive feature fusion using attention gate mechanism: the decoder utilizes an Attention Gate mechanism for adaptive feature fusion. This mechanism enhances the network's focus on target regions while suppressing irrelevant areas in the images, thereby improving segmentation accuracy.

  3. Quantitative and qualitative experiments on ethnic clothing pattern dataset: quantitative and qualitative comparative experiments are conducted on an ethnic clothing pattern dataset, demonstrating superior performance compared to similar semantic segmentation algorithms. The proposed method achieves optimal results in terms of segmentation accuracy on the ethnic clothing pattern dataset.

Related work

CNN-based methods

The early methods for semantic segmentation were primarily based on contours and traditional machine learning algorithms [19, 20]. With the development of deep convolutional neural networks (CNNs), U-Net was introduced for semantic segmentation [1]. Due to the simplicity and excellent performance of the U-shaped architecture, various U-Net-like methods emerged, including Res-Unet [5], Dense-Unet [21], U-Net++ [6], and Unet3+ [7]. Chen et al. [22] proposed the use of fully convolutional networks for semantic segmentation, achieving pixel-level predictions. Zhao et al. [13] introduced the pyramid pooling module and pyramid scene parsing network, enhancing the model's ability to capture global contextual information. These methods have been applied to various application scenarios, such as 3D medical image segmentation [4, 23], video image segmentation [24], and 3D point cloud segmentation [25]. Currently, methods based on convolutional neural networks have achieved significant success in various segmentation domains, owing to their powerful representational capabilities. Gou et al. [26] introduced a novel approach named Multilevel Attention-based Sample Correlations for Knowledge Distillation (MASCKD). MASCKD utilizes attention maps from multiple hierarchical layers to construct sample correlations, focusing on the most critical sample regions and thereby enhancing the effectiveness of relational knowledge distillation.

Transformer-based methods

The Transformer architecture was originally proposed for machine translation tasks [14]. In the field of natural language processing, Transformer-based methods have demonstrated excellent performance across various tasks [27]. Inspired by this success, Dosovitskiy et al. [16] pioneered the introduction of the Vision Transformer (ViT) into computer vision, achieving competitive accuracy in image recognition tasks. Compared with CNN-based methods, ViT's drawback lies in its reliance on pre-training with large-scale datasets. To alleviate the difficulty of training ViT, DeiT [17] proposed several training strategies enabling ViT to perform well on ImageNet. Recently, notable works have expanded upon ViT [28, 29]. Among them, Liu et al. [18] introduced the Swin Transformer as a backbone network for vision tasks. Based on a shifted window mechanism, the Swin Transformer has achieved state-of-the-art performance in various visual tasks, including image classification, object detection, and semantic segmentation. In this study, Swin Transformer modules are employed as units within the network, constructing a U-shaped encoder–decoder structure for the segmentation of ethnic clothing patterns.

In recent years, researchers have explored integrating self-attention mechanisms into CNNs to enhance network performance [12]. Chen et al. [8] combined Transformers with CNNs to construct a powerful encoder for 2D medical image segmentation. Valanarasu et al. [30] and Zhang et al. [31] leveraged the complementarity of Transformers and CNNs to improve the segmentation capabilities of their models. Various combinations of Transformers and CNNs have also been applied to tasks such as multimodal brain tumor segmentation [32] and 3D medical image segmentation [33, 34]. Wang et al. [35] introduced the Mixed Transformer Module (MTM), which efficiently computes self-affinities through a carefully designed local–global Gaussian-weighted self-attention mechanism (LGG-SA) and further explores inter-sample relationships using an external attention mechanism (EA). The Mixed Transformer Module was then used to construct a U-shaped network for medical image segmentation.

Dataset

This study collected a total of 1600 images of ethnic clothing patterns through on-site photography, book scanning, and internet sources. These images encompass four major categories of classic patterns, namely flowers, birds, butterflies, and ornaments. Survey results indicate that these four types of ethnic clothing patterns exhibit prominent characteristics, featuring rich texture details and a high frequency of occurrence in ethnic attire, making them suitable as representative types for studying ethnic clothing patterns. Additionally, manual annotation was performed on the entire dataset, generating corresponding mask images for each photograph. The four types of ethnic clothing patterns are illustrated in Fig. 1.

Fig. 1 Four classic ethnic clothing patterns

Due to variations in environmental conditions such as ambient lighting during on-site collection and book scanning, this study employed Gaussian filtering to denoise the dataset. This denoising process aims to enhance the quality of images, thereby further improving the accuracy and reliability of the model's image processing results. Additionally, to generate more training samples and enhance the model's robustness and generalization capabilities, various data augmentation techniques such as random rotation, cropping, contrast variations, and changes in image size were applied to preprocess the dataset.
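
As an illustration, the denoising and augmentation steps described above can be sketched with OpenCV and torchvision; the kernel size, rotation range, crop scale, and jitter strength below are illustrative assumptions rather than the exact settings used in this study.

```python
import cv2
import torchvision.transforms as T
from PIL import Image

# Gaussian filtering to suppress acquisition noise (kernel size and sigma are assumed).
def denoise(path: str) -> Image.Image:
    img = cv2.imread(path)                          # BGR, uint8
    img = cv2.GaussianBlur(img, (3, 3), 0)          # 3 x 3 Gaussian kernel
    return Image.fromarray(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))

# Augmentation pipeline: random rotation, cropping, contrast variation, and resizing.
augment = T.Compose([
    T.RandomRotation(degrees=30),                   # random rotation (range assumed)
    T.RandomResizedCrop(224, scale=(0.7, 1.0)),     # random crop and change of image size
    T.ColorJitter(contrast=0.3),                    # contrast variation
    T.ToTensor(),
])
```

For segmentation, the same geometric transforms must also be applied to the corresponding mask image; paired image–mask augmentation utilities handle this automatically.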

Methods

To address challenges arising from diverse elements in ethnic clothing pattern semantic segmentation, such as complex backgrounds, intricate details, and potential label confusion, this paper introduces a Mixed Swin Transformer U-Net (MST-Unet).

The model is based on a U-shaped encoder–decoder network structure and utilizes an Attention Gate mechanism for adaptive feature fusion during decoding. Classic convolutional operations are retained in the shallow layers of the network because these layers carry high-resolution details, such as the shape, color, and texture of the target, and convolutions are well suited to learning such features. Additionally, classic convolutional operations possess excellent locality and parameter-sharing properties, which reduce model complexity, improve training efficiency, and enhance the handling of large-scale datasets.

In the deeper layers, where semantic information is more complex and features are highly abstract, this paper employs two consecutive Swin Transformer modules for feature extraction. Two consecutive modules are used because they incorporate complementary attention mechanisms: the first applies window-based multi-head self-attention (W-MSA) and the second applies shifted window-based multi-head self-attention (SW-MSA). Employing exactly two successive modules also strikes a balance between model complexity, computational efficiency, and the desired feature extraction capability.

By incorporating both W-MSA and SW-MSA modules in two consecutive layers, the model aims to achieve effective hierarchical feature representation while mitigating the computational burden associated with a higher number of modules. This design choice enables the model to capture intricate patterns and details in the data while maintaining computational feasibility. Such an approach ensures the effectiveness of the Swin Transformer module within a broader architectural context, emphasizing the thoughtful consideration of complexity, computational efficiency, and feature extraction requirements.

The Local–Global Attention mechanism within these modules aids in effectively integrating local features and global information. This allows the model to better capture spatial relationships and context information in the images, thereby enhancing the model's feature representation capability. Furthermore, the Local–Global Attention mechanism helps reduce computational and storage costs. Patch Merging and Patch Expanding are employed for down-sampling and up-sampling of data, respectively, to ensure dimensional consistency of features across different stages in the model.

The Swin Transformer module integrates the Window-based Multi-Head Self-Attention (W-MSA) and Shifted Window-based Multi-Head Self-Attention (SW-MSA) mechanisms, while the Attention Gate incorporates an attention gating mechanism. These diverse attention mechanisms are synergistically fused within the model, facilitating a comprehensive integration of multiple attention mechanisms. This amalgamation enables the model to leverage the specific strengths of W-MSA and SW-MSA within the Swin Transformer module for effective feature extraction and modeling. Simultaneously, the attention gating mechanism in the Attention Gate module dynamically adjusts feature fusion weights based on the correlation between encoder and decoder feature maps. This amalgamation of various attention mechanisms enhances the model's adaptability and performance in tasks such as semantic segmentation of ethnic clothing patterns. The overall structure of the network model is illustrated in Fig. 2.

Fig. 2 Architecture of the MST-Unet network

Upper encoder–decoder structure

In the encoder section of the upper network [35], three downsampling modules (Conv 3 × 3 Block) are employed to progressively downsample the input image and extract features at different scales. Each downsampling module comprises a dual convolution module and a pooling layer. The dual convolution module enhances the representational capacity of the feature maps, while the pooling layer downsamples the feature maps by a factor of two to reduce resolution. The network structure is illustrated in Fig. 3.

Fig. 3 Architecture of the downsampling module (a Conv 3 × 3 Block; b DoubleConv Block)

In each downsampling module, feature maps of different scales, processed through dual convolution layers, are stored in a list for subsequent decoding use.

The DoubleConv Block consists of two ConvBNRelu layers and a residual connection. Each ConvBNRelu layer includes a convolution, batch normalization (BatchNorm), and a ReLU activation function. The input feature map undergoes convolution to produce the output feature map, denoted as y:

$$ {y_{i,j}} = \mathop \sum \limits_{m = 0}^{k - 1} \mathop \sum \limits_{n = 0}^{k - 1} {w_{m,n}}{x_{i + m,j + n}} + b $$
(1)

x represents the input feature map, w denotes the convolutional kernel, b represents the bias term, and k indicates the size of the convolutional kernel. The indices i and j denote the row and column of the output feature map, while m and n index the positions of the convolutional kernel as it slides over the input feature map.

The feature map y, obtained through the convolution operation, undergoes batch normalization, and the process is represented by the following formula (2):

$$ BatchNorm\;(y) = \gamma \frac{y - \mu }{{\sqrt {{\sigma^2} + \varepsilon } }} + \beta $$
(2)

γ and β represent the learnable parameters for scaling and shifting, respectively. ε is a very small constant used to avoid division by zero. μ and σ² denote the per-channel mean and variance computed over the batch, calculated as follows:

$$ {\mu_j} = \frac{1}{B \times H \times W}\mathop \sum \limits_{i = 1}^B \mathop \sum \limits_{h = 1}^H \mathop \sum \limits_{\omega = 1}^W {x_{i,j,h,\omega }} $$
(3)
$$ \sigma_j^2 = \frac{1}{B \times H \times W}\mathop \sum \limits_{i = 1}^B \mathop \sum \limits_{h = 1}^H \mathop \sum \limits_{\omega = 1}^W {\left( {{x_{i,j,h,\omega }} - {\mu_j}} \right)^2} $$
(4)

Finally, the output after batch normalization undergoes a non-linear mapping using the ReLU activation function to extract features and introduce non-linear characteristics.
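
For concreteness, Eqs. (2)–(4) amount to the following per-channel computation over the batch and spatial dimensions; this sketch mirrors the training-time behavior of PyTorch's nn.BatchNorm2d.

```python
import torch

def batch_norm(x: torch.Tensor, gamma: torch.Tensor, beta: torch.Tensor,
               eps: float = 1e-5) -> torch.Tensor:
    """Eqs. (2)-(4): normalize each channel of a (B, C, H, W) tensor over batch and space."""
    mu = x.mean(dim=(0, 2, 3), keepdim=True)                  # Eq. (3), per-channel mean
    var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)  # Eq. (4), per-channel variance
    x_hat = (x - mu) / torch.sqrt(var + eps)
    return gamma.view(1, -1, 1, 1) * x_hat + beta.view(1, -1, 1, 1)  # Eq. (2), scale and shift
```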

The first ConvBNRelu layer employs a 3 × 3 kernel for feature extraction, producing an output with a specified number of channels. The second ConvBNRelu layer also uses a 3 × 3 kernel but does not apply the ReLU activation; it performs only convolution and batch normalization, without altering the spatial dimensions or the channel count. Following the second ConvBNRelu layer, a 1 × 1 convolution applies a channel-wise linear transformation to the feature map, enhancing the model's expressive power. Finally, this output is combined through a residual connection with the output of the two preceding ConvBNRelu layers, and the sum passes through a ReLU activation to introduce non-linearity.

The role of the DoubleConv Block is to extract advanced features from the input feature map while preserving its low-level features using a residual connection. This preservation of low-level features is crucial for subsequent feature fusion and classification operations.
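
As a concrete illustration, a DoubleConv Block consistent with the description above can be sketched in PyTorch as follows; the placement of the 1 × 1 convolution on the shortcut path and the channel handling are assumptions, since they are not fully specified here.

```python
import torch
import torch.nn as nn

class DoubleConvBlock(nn.Module):
    """Two 3x3 Conv-BN layers with a residual connection and a final ReLU (sketch)."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # First ConvBNRelu layer: 3x3 convolution + BatchNorm + ReLU.
        self.conv1 = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        # Second layer: 3x3 convolution + BatchNorm, no ReLU before the residual sum.
        self.conv2 = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # 1x1 convolution on the shortcut so the residual branch matches the channel count (assumed).
        self.shortcut = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv2(self.conv1(x))
        return self.relu(out + self.shortcut(x))   # residual fusion followed by ReLU
```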

In the decoder section of the shallow network, three upsampling modules (DeConv 3 × 3 Block) are employed to progressively upsample the feature maps. Each module consists of a deconvolutional layer and a DoubleConv Block. The deconvolutional layer upsamples the feature map by a factor of two to increase resolution, while the DoubleConv Block enhances the representational capacity of the feature map. In each upsampling module, the output of the deconvolutional layer is fused with the feature map saved from the corresponding stage in the encoder. The output after the third upsampling module yields the segmentation result.
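
Continuing the sketch, one possible form of the upsampling module is shown below. It reuses the DoubleConvBlock defined above and assumes channel-wise concatenation as the fusion operator; the Attention Gate variant used for adaptive fusion is shown later.

```python
import torch
import torch.nn as nn

class UpBlock(nn.Module):
    """DeConv 3x3 Block: 2x transposed-convolution upsampling, skip fusion, DoubleConv (sketch)."""
    def __init__(self, in_ch: int, skip_ch: int, out_ch: int):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, in_ch // 2, kernel_size=2, stride=2)
        # Concatenation along the channel axis is assumed as the fusion operator here.
        self.conv = DoubleConvBlock(in_ch // 2 + skip_ch, out_ch)

    def forward(self, x: torch.Tensor, skip: torch.Tensor) -> torch.Tensor:
        x = self.up(x)                        # double the spatial resolution
        x = torch.cat([x, skip], dim=1)       # fuse with the saved encoder feature map
        return self.conv(x)
```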

Deeper swin transformer module

The Swin Transformer module [36] is constructed based on the shifted windows approach. The structure of two consecutive Swin Transformer modules is illustrated in Fig. 4.

Fig. 4 Two successive Swin Transformer blocks

Each Swin Transformer module consists of a LayerNorm (LN) layer, a multi-head self-attention module that includes window-based multi-head self-attention (W-MSA) and shifted window-based multi-head self-attention (SW-MSA), a residual connection, and a two-layer perceptron with GELU non-linearity. The computation process of the Swin Transformer module based on this window partitioning mechanism can be represented as:

$$ {\hat z^l} = \text{W-MSA}\left( {\text{LN}\left( {{z^{l - 1}}} \right)} \right) + {z^{l - 1}} $$
(5)
$$ {z^l} = \text{MLP}\left( {\text{LN}\left( {{{\hat z}^l}} \right)} \right) + {\hat z^l} $$
(6)
$$ {\hat z^{l + 1}} = \text{SW-MSA}\left( {\text{LN}\left( {z^l} \right)} \right) + {z^l} $$
(7)
$$ {z^{l + 1}} = \text{MLP}\left( {\text{LN}\left( {{{\hat z}^{l + 1}}} \right)} \right) + {\hat z^{l + 1}} $$
(8)

\( {\hat z^l} \) and \( {z^l} \) represent the outputs of the l-th (S)W-MSA module and the MLP module, respectively. The self-attention calculation method for the (S)W-MSA module in the network is as follows:

$$ Attention\left( {Q,K,V} \right) = SoftMax\left( {\frac{{Q{K^T}}}{\sqrt d } + B} \right)V $$
(9)

Q, K, V ∈ \( {R^{{M^2} \times d}} \) denote the query, key, and value matrices, respectively, where M² is the number of patches within a window and d is the dimension of the query or key. The values in B are taken from the relative position bias matrix \( \hat B \in {R^{\left( {2M - 1} \right) \times \left( {2M - 1} \right)}} \).
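
Equation (9) corresponds to scaled dot-product attention computed within each M × M window plus a learned relative position bias. The following single-head PyTorch sketch assumes the input has already been partitioned into windows; the class and variable names are illustrative.

```python
import torch
import torch.nn as nn

class WindowAttention(nn.Module):
    """Window-based self-attention with relative position bias (single-head sketch)."""
    def __init__(self, dim: int, window_size: int):
        super().__init__()
        self.scale = dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        # One learnable bias per possible relative offset: a (2M-1)^2 table.
        self.bias_table = nn.Parameter(torch.zeros((2 * window_size - 1) ** 2))
        # Precompute the index mapping each (query, key) pair to its bias entry.
        coords = torch.stack(torch.meshgrid(
            torch.arange(window_size), torch.arange(window_size), indexing="ij"))
        coords = coords.flatten(1)                          # 2 x M^2
        rel = coords[:, :, None] - coords[:, None, :]       # 2 x M^2 x M^2
        rel = rel.permute(1, 2, 0) + (window_size - 1)      # shift offsets to be non-negative
        index = rel[..., 0] * (2 * window_size - 1) + rel[..., 1]
        self.register_buffer("index", index)                # M^2 x M^2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_windows, M^2, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale       # QK^T / sqrt(d)
        attn = attn + self.bias_table[self.index]           # + B
        return attn.softmax(dim=-1) @ v
```

In the SW-MSA variant used by the second of the two consecutive blocks, the feature map is cyclically shifted (e.g., with torch.roll) before window partitioning, so that successive blocks exchange information across window boundaries.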

In the deep encoder section of the network, a feature map with a resolution of H/4 × W/4 and a channel count of C is fed into two consecutive Swin Transformer modules for representation learning, during which the feature dimension and resolution remain unchanged. A patch merging layer then performs 2× downsampling to reduce the number of tokens while doubling the feature dimension. This process is repeated twice in the encoder.
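
For reference, a patch merging layer that halves the spatial resolution and doubles the channel dimension, following the standard Swin Transformer formulation, can be sketched as follows; the (B, H, W, C) layout is an assumption.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """2x downsampling: gather 2x2 neighboring patches, then project 4C -> 2C."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) with H and W even
        x = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                       x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)   # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))                           # (B, H/2, W/2, 2C)
```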

An overly deep Transformer network can suffer from convergence issues. Therefore, a bottleneck for learning deep feature representations is constructed from two consecutive Swin Transformer blocks placed after the encoder. In this bottleneck, the channel count and resolution of the feature map remain unchanged.

The deeper decoder section of the network is likewise constructed from Swin Transformer modules. Unlike the patch merging layer used in the encoder, the decoder employs a patch expanding layer to upsample the extracted deep features. This layer performs 2× upsampling by reshaping feature maps along adjacent dimensions into higher-resolution maps, while reducing the feature dimension to half of the original.
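
Conversely, a patch expanding layer doubles the spatial resolution while halving the channel dimension. One common realization, as used in Swin-Unet-style decoders and assumed here, is:

```python
import torch
import torch.nn as nn

class PatchExpanding(nn.Module):
    """2x upsampling: project C -> 2C, then rearrange channels into a 2x2 spatial grid of C/2."""
    def __init__(self, dim: int):
        super().__init__()
        self.expand = nn.Linear(dim, 2 * dim, bias=False)
        self.norm = nn.LayerNorm(dim // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C)
        B, H, W, C = x.shape
        x = self.expand(x)                                    # (B, H, W, 2C)
        x = x.view(B, H, W, 2, 2, C // 2)                     # split channels into a 2x2 block
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, 2 * H, 2 * W, C // 2)
        return self.norm(x)                                   # (B, 2H, 2W, C/2)
```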

Attention gate mechanism

In the decoder part of the network, an attention gate mechanism (Attention Gate) is utilized for adaptive feature fusion [37]. In this mechanism, the encoder's feature map is first transformed through convolutional operations to match the shape of the decoder's feature map. Attention weights are then computed from the element-wise sum of the transformed encoder and decoder features followed by activation functions, determining the importance of each position in the feature fusion process. The attention weights are multiplied by the encoder's feature map, and the result is added to the transformed decoder's feature map to achieve feature fusion. Finally, the fused feature map serves as the input to the decoder for generating the final prediction results. The schematic diagram of the Attention Gate is shown in Fig. 5, and it is formulated as follows:

$$ q_{att}^l = {\psi^T}\left( {{\sigma_1}\left( {W_x^Tx_i^l + W_g^Tg_i^l + {b_g}} \right)} \right) + {b_\psi } $$
(10)
$$ \alpha_i^l = {\sigma_2}\left( {q_{att}^l\left( {x_i^l,\;{g_i};\;{\Theta_{att}}} \right)} \right) $$
(11)
Fig. 5 Schematic of the Attention Gate

\( {\sigma_1} \) represents the ReLU function, \( {\sigma_2} \) represents the Sigmoid function, \( {\psi^T} \), \( W_x^T \), \( W_g^T \) all represent convolution operations, \( {b_g} \), \( {b_\psi } \) both correspond to the bias terms of the convolution.
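
A minimal PyTorch sketch of Eqs. (10) and (11) is given below; the use of 1 × 1 convolutions and the intermediate channel count inter_ch are assumptions, as the kernel sizes are not stated above.

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Attention Gate (Eqs. 10-11): gate encoder features x with the decoder signal g (sketch)."""
    def __init__(self, x_ch: int, g_ch: int, inter_ch: int):
        super().__init__()
        self.W_x = nn.Conv2d(x_ch, inter_ch, kernel_size=1)   # W_x^T (bias term b_g folded in)
        self.W_g = nn.Conv2d(g_ch, inter_ch, kernel_size=1)   # W_g^T
        self.psi = nn.Conv2d(inter_ch, 1, kernel_size=1)      # psi^T with bias b_psi
        self.relu = nn.ReLU(inplace=True)                     # sigma_1
        self.sigmoid = nn.Sigmoid()                           # sigma_2

    def forward(self, x: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
        # x: encoder feature map, g: gating signal from the decoder (same spatial size assumed)
        q = self.psi(self.relu(self.W_x(x) + self.W_g(g)))    # Eq. (10)
        alpha = self.sigmoid(q)                                # Eq. (11), attention coefficients
        return x * alpha                                       # re-weighted encoder features
```

The gated features x·α are then combined with the transformed decoder features, as described above, before the next decoding stage.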

Feature fusion under the attention gate mechanism, compared to the common concatenation operation along specified dimensions, allows for dynamic adjustment of feature weights based on the correlation between encoder and decoder feature maps.

This dynamic adjustment involves computing attention weights, typically via similarity measures such as the dot product or scaled dot product, that capture the relevance between encoder and decoder features. These attention weights are then applied to the respective feature maps, emphasizing task-relevant information at each position or channel, and the fused features are obtained through operations such as element-wise multiplication with the dynamically adjusted weights. The dynamism derives from the learnable parameters of the attention mechanism, which allow the network, through training, to adapt the importance of features to the contextual information in the input data. In essence, the Attention Gate mechanism lets the network dynamically regulate the emphasis placed on different parts of the input, improving the capture of task-related information, sharpening target boundaries and fine details, and thereby enhancing the quality of the segmentation results.

Experiments

Implementation details

MST-Unet is implemented using Python 3.9 and PyTorch 1.11.0. To enhance the diversity of the training set, various data augmentation techniques such as flipping and rotating are applied. The input image size and batch size are set to 224 × 224 and 24, respectively. Training is performed on an NVIDIA A4000 GPU with 32GB of memory. The SGD optimizer is employed during the training process for backpropagation, with a momentum setting of 0.9 and weight decay set to 0.0001.
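
The reported optimizer settings translate directly into PyTorch; the learning rate, loss function, and the MSTUnet and train_set names below are illustrative placeholders.

```python
import torch
from torch.utils.data import DataLoader

model = MSTUnet(num_classes=2).cuda()            # hypothetical model class name
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.01,             # assumed learning rate
                            momentum=0.9,        # as reported
                            weight_decay=1e-4)   # as reported
criterion = torch.nn.CrossEntropyLoss()          # assumed segmentation loss

loader = DataLoader(train_set, batch_size=24, shuffle=True)   # 224 x 224 inputs
for images, masks in loader:
    optimizer.zero_grad()
    loss = criterion(model(images.cuda()), masks.cuda())
    loss.backward()
    optimizer.step()
```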

Experiment results on ethnic clothing pattern dataset

To further demonstrate the superiority of the MST-Unet network model, this paper compares the performance of different models in the semantic segmentation of sub-categories of ethnic clothing patterns, as shown in Table 1.

Table 1 Comparison of segmentation performance among different models in various subclasses

The Dice coefficient is defined as twice the intersection of the predicted and true segmentation areas, divided by the sum of the areas of the predicted and true segmentations. It ranges from 0 to 1, with a value of 1 indicating perfect overlap between the predicted and true segmentations. It serves as a quantitative measure of segmentation accuracy, with higher values reflecting better agreement between the predicted and ground truth segmentations.
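
For reference, the Dice coefficient for a binary mask can be computed as follows; the small smoothing term eps is an implementation detail added to avoid division by zero.

```python
import torch

def dice_coefficient(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> float:
    """Dice = 2 * |pred ∩ target| / (|pred| + |target|) for binary masks."""
    pred = pred.float().flatten()
    target = target.float().flatten()
    intersection = (pred * target).sum()
    return ((2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)).item()
```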

Each model in Table 1 was trained with the data augmentation process described above. The augmented data were fed into each model for semantic segmentation of ethnic clothing patterns, and segmentation results were obtained after an equal number of iterations. The segmentation outcomes of each model were evaluated with the Dice coefficient, yielding the data presented in Table 1.

From Table 1, it can be observed that the MST-Unet network model performs optimally in predicting the patterns of flowers, birds, ornaments, and butterflies. Additionally, the MST-Unet model has the fewest parameters among the compared models, and its average Dice score for predicting the four categories of ethnic clothing patterns is as high as 89.80%, achieving the best results among the compared models.

Ablation study

As shown in Fig. 2, the proposed method in this paper includes the deep Swin Transformer modules (ST) and the Attention Gate mechanism (AG) in the decoder. To test the effectiveness of each component, individual components were omitted from the proposed MST-Unet network, and ablation experiments were conducted on the subcategories of the ethnic clothing pattern dataset. Table 2 presents the results of the ablation experiments with different components.

Table 2 Comparative analysis of segmentation performance among different components in various subclasses

From Table 2, it can be observed that, compared to Basic-Unet without any components, using Swin Transformer modules in the deep layers of the network improves the experimental results. For flower, bird, butterfly, and ornament patterns, the Dice scores increase by 3.6%, 4.1%, 1.6%, and 0.4%, respectively, and the average Dice score improves by 2.4%.

The second component of the proposed method is the Attention Gate adaptive feature fusion mechanism. As shown in Table 2, the network with the Attention Gate mechanism in the decoder outperforms Basic-Unet without any components: the Dice scores for flower, bird, butterfly, and ornament patterns improve by 4.4%, 5.2%, 2.4%, and 0.5%, respectively, and the average Dice score improves by 3.1%.

Moreover, combining the Swin Transformer module with the Attention Gate adaptive feature fusion mechanism further improves the results. As shown in Table 2, compared to Basic-Unet without any components, MST-Unet increases the Dice scores for the flower, bird, butterfly, and ornament patterns by 9.2%, 8.2%, 5.0%, and 0.9%, respectively, and the average Dice score improves by 5.8%.

Furthermore, MST-Unet demonstrated superior segmentation performance for the four subclasses of ethnic clothing patterns compared to networks using only a single component, providing further evidence of the effectiveness of the proposed method.

Ablation experiment visual comparison

To further validate the effectiveness of the components used in this proposed method, a visual comparison of the attention region distribution maps generated by models with different components is conducted, as shown in Fig. 6. From the figure, it can be observed that the components used in the proposed method have a noticeable effect on more accurately segmenting the target region in the image and suppressing irrelevant areas.

Fig. 6 Comparative visualization of attention distribution maps for networks with different components

In the attention distribution maps of the model, darker colors indicate areas where the model's attention is more concentrated. As shown in Fig. 6, the network without any components (Basic-Unet) exhibits a more scattered attention distribution in the target region, and attention also spreads to irrelevant areas of the image, indicating some misjudgment.

The network using the Swin Transformer module (S-Unet) and the network using the Attention Gate adaptive feature fusion mechanism (Unet-A) both exhibit more concentrated attention distribution in the target region of the attention map. They also show a relative reduction in attention distribution for irrelevant areas in the image. Meanwhile, the network using both components (MST-Unet) demonstrates attention that is entirely concentrated in the target region of the image in the distribution map, with no attention distributed to irrelevant areas, indicating an absence of misjudgment.

This observation highlights the effectiveness of the proposed method in assisting the network to concentrate attention on the target region, suppress irrelevant areas, and enhance segmentation accuracy.

Experiment results visual comparison

In order to intuitively validate the effectiveness of the MST-Unet model, this paper compares the visualized results of the MST-Unet model with other semantic segmentation models on the prediction of ethnic clothing patterns, as shown in Fig. 7. From the figure, it can be observed that the MST-Unet model has a significant advantage in the segmentation of ethnic clothing patterns compared to other segmentation methods.

Fig. 7 Comparison of segmentation performance between MST-Unet and other models on ethnic clothing patterns: a original image, b ground truth, c Deeplab_v3+, d U-Net, e Att-Unet, f MST-Unet

For flower patterns, the contour of the MST-Unet segmentation result is clearer, and there are fewer misjudgments in non-pattern areas compared to other models. For bird patterns, the MST-Unet result retains more complete details, such as the contour of the wings, the shape of the claws, and protruding feathers, with no misjudgment of background elements. For butterfly patterns, the MST-Unet result has fewer misjudgments in the background and other negative-sample patterns, as well as fewer noise points compared to other models. For ornament patterns, the MST-Unet result has a smoother and clearer contour and fewer misjudgments in negative-sample areas compared to other models.

The visual comparison of segmentation results shows that the problems of label confusion caused by diverse background elements, difficulty in retaining edge details, and the tendency to lose small targets in ethnic clothing pattern semantic segmentation are well addressed.

Conclusion

The proposed MST-Unet semantic segmentation model is based on an encoder–decoder structure. It utilizes a double convolution module in the shallow network and employs the Swin Transformer module in the deep network to extract features from ethnic clothing patterns. In the decoder section, an Attention Gate is applied for adaptive feature fusion, thereby enhancing the model's accuracy at target boundaries and in fine details. Finally, comparative experiments validate the effectiveness of the MST-Unet model on the ethnic clothing pattern dataset.

Given the diverse types of ethnic clothing patterns, this study focused on four representative categories. Future research could enhance the model's generalization capability by expanding the dataset, enabling high-quality semantic segmentation across a wider range of ethnic clothing pattern types while maintaining segmentation accuracy.