Multi-scale feature flow alignment fusion with Transformer for the microscopic images segmentation of activated sludge

Zhao, Lijie; Zhang, Yingying; Wang, Guogang; Huang, Mingzhong; Zhang, Qichun; Karimi, Hamid Reza

doi:10.1007/s11760-023-02836-0


Original Paper
Open access
Published: 02 November 2023

Volume 18, pages 1241–1248, (2024)

Signal, Image and Video Processing


Lijie Zhao¹,
Yingying Zhang¹,
Guogang Wang¹,
Mingzhong Huang¹,
Qichun Zhang² &
…
Hamid Reza Karimi³


Abstract

Accurate segmentation of microscopic images of activated sludge is essential for monitoring wastewater treatment processes. However, the task is challenging because of poor contrast, artifacts, morphological similarities, and distribution imbalance. In this work, a novel Transformer-based image segmentation model (FafFormer) was developed that incorporates pyramid pooling and flow alignment fusion. In the encoder, a Pyramid Pooling Module extracts multi-scale features of flocs and filamentous bacteria with different morphologies. In the decoder, a flow alignment fusion module fuses the multi-scale features, using the generated semantic flow as auxiliary information to restore boundary details and facilitate fine-grained upsampling. The Focal–Lovász Loss was designed to handle the class imbalance between filamentous bacteria and flocs. Image-segmentation experiments were conducted on an activated sludge dataset from a municipal wastewater treatment plant. Compared with existing models, FafFormer was more accurate and reliable, especially for filamentous bacteria.


1 Introduction

The activated sludge (AS) process is a biological wastewater treatment method that removes organic matter and relies on a community of microorganisms. However, sludge settles poorly when filamentous bacteria become too abundant [1]. This filamentous bulking seriously degrades effluent quality, may violate environmental discharge standards [2], and compromises process safety.

Phase-contrast microscopic (PCM) imaging is one of the most direct and effective ways to observe the microorganisms and microstructure of activated sludge, and image segmentation is key to its quantitative analysis. Several classical segmentation methods have been developed and applied to microscopic images of activated sludge over the years, e.g., threshold-based [3], edge-based [4], region-based [5], and clustering-based approaches [6]. Although these classical methods have advanced the segmentation of activated sludge phase-contrast microscopic (ASPCM) images, challenges remain, including the complex and diverse morphology of flocs and filamentous bacteria, overlapping structures, noise, and unbalanced data distribution.

Deep learning, and convolutional neural networks (CNNs) in particular, has significantly advanced semantic image segmentation. Models such as fully convolutional networks (FCN) [7], U-Net [8], SegNet [9], and DeepLab [10,11,12,13] have demonstrated their effectiveness in accurately segmenting images. Applications of deep-learning-based image segmentation have thrived in various fields, e.g., clinical medical image segmentation [14, 15] and eye image segmentation [16].

Deep learning has also attracted considerable interest for ASPCM image segmentation. Zhao et al. [17] first applied deep learning to the task, implementing a U-Net segmentation model trained with a combination of the binary Cross-Entropy loss and the Dice coefficient loss. More recently, the advent of the Transformer [18] and the vision Transformer (ViT) [19] has opened a promising alternative for image segmentation. SETR [20], which is based on ViT, has achieved excellent segmentation results, confirming the effectiveness of ViT for the task. However, the fixed-resolution position encoding used in ViT degrades performance and reduces efficiency. Segformer [21] addresses these limitations with a Transformer-based semantic segmentation model whose encoder, the more powerful Mix Transformer (MiT), provides semantic features with richer global information. The segmentation accuracy reported for Segformer surpasses that of earlier models. Overall, advances in deep learning have greatly improved the accuracy and efficiency of image segmentation.

ASPCM image segmentation poses several challenges, e.g., poor contrast, artifacts, similar shapes, variations in size and shape, and imbalanced distribution. The main contributions of this work in addressing these challenges are as follows:

A novel semantic segmentation model (FafFormer) was proposed for ASPCM segmentation. Pyramid pooling and flow alignment fusion based on Transformer were incorporated to enhance its segmentation capabilities.
Flow Alignment Fusion Module (FAFM) was proposed in the decoder to restore boundary information using semantic flows for fine upsampling and fusion with multi-scale features.
A hybrid loss function was designed as a weighted sum of the Focal Loss [22] and Lovász-Softmax Loss [23] to handle class imbalance and mIoU optimization for ASPCM segmentation.
Experiments on a real activated sludge dataset from a municipal wastewater treatment plant demonstrated the superior accuracy and reliability of FafFormer, especially in the filamentous bacteria segmentation. It outperformed existing semantic segmentation models.

2 Proposed network architecture: FafFormer

FafFormer follows an encoder–decoder structure. In the encoder, MiT and PPM extract multi-scale features from ASPCM images while progressively reducing the spatial resolution. ASPCM images are divided into overlapping patches, and the pixel values of each patch are transformed to obtain the feature representation. After the MiT, feature map ${F_4}$ is pooled and merged by the PPM to capture further multi-scale information. The decoder consists of FAFM and an upsampling operation. FAFM is composed of the Flow Alignment Module (FAM) [24, 25] and a concatenation operation. FAM takes the low- and high-level feature maps of adjacent encoder layers, predicts the flow field between them within the network, and uses this semantic flow to align them. The FAM output is then concatenated channel-wise with the encoder output, which compensates for the local information lost during feature extraction. The feature maps from each stage of the encoder are also fed into a multilayer perceptron (MLP) for further feature extraction and concatenated with the decoder feature maps. The prediction map is generated by a $1 \times 1$ convolutional layer and upsampled to the resolution of the original image. The Focal–Lovász Loss is proposed to overcome the category imbalance and improve the accuracy of filamentous bacteria segmentation. Figure 1 shows the network architecture of FafFormer.

2.1 Encoder

The encoder, composed of MiT and PPM, extracts multi-scale features from ASPCM images. Specifically, ASPCM images are used as input for the overlap patch embedding in MiT, which captures position encoding information. Four Transformer blocks then produce feature maps at 1/4, 1/8, 1/16, and 1/32 of the original image resolution. Each Transformer block consists of efficient self-attention, Mix-FFN, and overlap patch merging. Efficient self-attention improves on the standard multi-head self-attention mechanism by reducing its computational complexity. In multi-head self-attention, each element in the sequence interacts with every other element, yielding a context-aware representation of each element. The self-attention is calculated as follows:

$$\begin{aligned} \textrm{Attention}(Q,K,V)=\textrm{Softmax}\left( \frac{ QK ^{\textrm{T}}}{{\sqrt{d_{k} }} } \right) {V}\end{aligned}$$

(1)

where Q, K, and V are the query, key, and value matrices, respectively, and $d_k$ is the vector dimension of Q and K. The computational complexity of this process is $O(N^2)$. Efficient self-attention utilizes sequence reduction [26] with the following process:

$$\begin{aligned} {\hat{K}}= & {} \textrm{Reshape}\left( \frac{N}{R},C\cdot R\right) (K) \end{aligned}$$

(2)

$$\begin{aligned} K= & {} \textrm{Linear}(C\cdot R,C)({\hat{K}}) \end{aligned}$$

(3)

where N is the sequence length; C is the channel dimension; R denotes the reduction ratio; K on the right-hand side of Eq. (2) is the sequence to be reduced; ${\hat{K}}$ denotes K reshaped to $\frac{N}{R} \times (C\cdot R)$ by $\textrm{Reshape}(\frac{N}{R}, C\cdot R)(K)$; and $\textrm{Linear}(C\cdot R,C)({\hat{K}})$ is a linear layer that takes a $(C\cdot R)$-dimensional input and produces a C-dimensional output, so that the reduced K has size $\frac{N}{R} \times C$. R is set to [64, 16, 4, 1] in the respective stages of the encoder. The computational complexity of the efficient self-attention is thereby reduced from $O(N^2)$ to $O(\frac{N^2}{R})$. Position encoding supplies location information to the network, and it is typically interpolated when the resolution differs between the training and inference stages. However, interpolation decreases segmentation accuracy. Since a CNN is capable of learning positional information implicitly [27], a zero-padded $3\times 3$ convolutional layer is used in the Mix-FFN to provide the network with the required positional information.
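
Equations (1) to (3) can be illustrated with a small single-head NumPy sketch. It is not the MiT implementation: the weight matrices are random placeholders, the names `attention` and `reduce_sequence` are hypothetical, and applying the same reduction to V (so that its length matches the reduced K) is an assumption of this sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # stabilise the exponentials
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Eq. (1): scaled dot-product attention for a single head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (N_q, N_k) score matrix
    return softmax(scores, axis=-1) @ V      # (N_q, d_v)

def reduce_sequence(X, R, W, b):
    """Eqs. (2)-(3): reshape (N, C) -> (N/R, C*R), then project back to C."""
    N, C = X.shape
    assert N % R == 0, "sequence length must be divisible by R"
    X_hat = X.reshape(N // R, C * R)         # Eq. (2)
    return X_hat @ W + b                     # Eq. (3); W has shape (C*R, C)

rng = np.random.default_rng(0)
N, C, R = 64, 8, 4                           # toy sizes, not the paper's
X = rng.standard_normal((N, C))
Wq, Wk, Wv = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))
Wr_k = rng.standard_normal((C * R, C)) * 0.1  # placeholder reduction weights
Wr_v = rng.standard_normal((C * R, C)) * 0.1
bias = np.zeros(C)

Q = X @ Wq
K_red = reduce_sequence(X @ Wk, R, Wr_k, bias)  # (N/R, C)
V_red = reduce_sequence(X @ Wv, R, Wr_v, bias)  # assumed: V reduced like K
out = attention(Q, K_red, V_red)                # (N, C)

# Size of the score matrix with and without sequence reduction.
score_entries_full = N * N
score_entries_reduced = N * (N // R)
```

The score matrix shrinks from N × N to N × N/R entries, which is where the factor R in the complexity comes from.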

Feature map $F_4$, which captures the most abstract semantic information, is obtained from the deepest layer of MiT. It serves as input to the PPM where further encoding is performed to capture multi-scale information. Feature maps containing multi-scale information are extracted in parallel from $F_4$ using Max Pooling of sizes $1 \times 1$, $2\times 2$, $3\times 3$, and $6\times 6$, respectively. The resulting multi-scale feature maps are subsequently upsampled to match the $F_4$ size for concatenation. The segmentation performance for flocs and filamentous bacteria with different morphology is improved by integrating the feature information from these multi-scale feature maps.
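
The pooling-and-concatenation step can be sketched in NumPy as follows. This is a simplified illustration under several assumptions: the bin sizes are interpreted as output grid sizes of an adaptive max pooling, upsampling uses nearest-neighbour replication for brevity, $F_4$ itself is kept in the concatenation (as in PSPNet-style designs), and any per-branch convolutions are omitted. The function names are hypothetical.

```python
import numpy as np

def adaptive_max_pool(x, bins):
    """Max-pool a (C, H, W) map into a (C, bins, bins) grid."""
    C, H, W = x.shape
    out = np.empty((C, bins, bins), dtype=x.dtype)
    for i in range(bins):
        r0 = (i * H) // bins                  # floor of region start
        r1 = -((-(i + 1) * H) // bins)        # ceil of region end
        for j in range(bins):
            c0 = (j * W) // bins
            c1 = -((-(j + 1) * W) // bins)
            out[:, i, j] = x[:, r0:r1, c0:c1].max(axis=(1, 2))
    return out

def upsample_nearest(x, H, W):
    """Resize a (C, h, w) map to (C, H, W) by nearest-neighbour replication."""
    _, h, w = x.shape
    rows = (np.arange(H) * h) // H
    cols = (np.arange(W) * w) // W
    return x[:, rows[:, None], cols[None, :]]

def pyramid_pooling(f4, bin_sizes=(1, 2, 3, 6)):
    """Pool f4 at several scales, upsample to f4's size, and concatenate."""
    _, H, W = f4.shape
    branches = [f4]                           # assumed: original map is kept
    for b in bin_sizes:
        branches.append(upsample_nearest(adaptive_max_pool(f4, b), H, W))
    return np.concatenate(branches, axis=0)

rng = np.random.default_rng(0)
f4 = rng.standard_normal((4, 12, 12))         # toy stand-in for F_4
fused = pyramid_pooling(f4)                   # (4 * 5, 12, 12)
```

The $1 \times 1$ branch carries a single global maximum per channel, while the $6 \times 6$ branch retains coarse spatial layout, so the concatenated map mixes context at several scales.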

2.2 Decoder

FAFM is proposed in the decoder to restore image details. FAFM is composed of FAM and a concatenation operation. FAM consists of a sub-network that generates the semantic flow field and a warp operation. The sub-network employs convolutional layers and upsampling to obtain flow-field information. Specifically, a $1\times 1$ convolutional layer is applied to each feature map to unify the channel dimension. The low-level and high-level feature maps of an adjacent pair are denoted as $F_L\in {\mathbb {R}}^{H \times W \times C}$ and $F_H\in {\mathbb {R}}^{h \times w \times c}$, respectively. After the $1\times 1$ convolution, $F_H$ is upsampled to the size of $F_L$ using bilinear interpolation, and the two maps are concatenated along the channel dimension to fuse features of different scales. Then, a $3\times 3$ convolutional layer is applied to the fused features to obtain the flow field. The entire process can be stated as follows:

$$\begin{aligned} \Delta _{L}=\textrm{C}_{3\times 3}(\textrm{Cat}(\textrm{U}(\textrm{C}_{1 \times 1}( F_H )),\textrm{C}_{1\times 1}(F_L) )) \end{aligned}$$

(4)

where $\textrm{C}$ denotes the convolutional layer; $\textrm{U}$ denotes upsampling operation; $\textrm{Cat}$ denotes the concatenation operation; $\Delta _{L}$ denotes the flow field between $F_L$ and $F_H$ generated by the sub-network; $\Delta _{L} \in {\mathbb {R}}^{H \times W \times 2}$. The flow field represents the semantic flow offset between two adjacent feature maps.
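
The flow-generation sub-network described above (a $1\times 1$ convolution on each map, bilinear upsampling of $F_H$, channel-wise concatenation with $F_L$, then a $3\times 3$ convolution) can be sketched in NumPy. The weights below are random placeholders rather than trained parameters, the arrays are channel-first (so the flow is stored as 2 × H × W), the bilinear resize uses an align-corners convention chosen for simplicity, and all function names are hypothetical.

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution: x is (C_in, H, W), w is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)

def conv3x3(x, w):
    """Zero-padded 3x3 convolution: w is (C_out, C_in, 3, 3)."""
    _, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], H, W))
    for dy in range(3):
        for dx in range(3):
            patch = xp[:, dy:dy + H, dx:dx + W]
            out += np.einsum("oc,chw->ohw", w[:, :, dy, dx], patch)
    return out

def upsample_bilinear(x, H, W):
    """Bilinear resize of (C, h, w) to (C, H, W), corners aligned."""
    _, h, w = x.shape
    ys, xs = np.linspace(0, h - 1, H), np.linspace(0, w - 1, W)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, h - 1), np.minimum(x0 + 1, w - 1)
    wy, wx = (ys - y0)[None, :, None], (xs - x0)[None, None, :]
    top = x[:, y0][:, :, x0] * (1 - wx) + x[:, y0][:, :, x1] * wx
    bot = x[:, y1][:, :, x0] * (1 - wx) + x[:, y1][:, :, x1] * wx
    return top * (1 - wy) + bot * wy

def semantic_flow(f_low, f_high, w_low, w_high, w_flow):
    """Flow field between F_L and F_H: C3x3(Cat(U(C1x1(F_H)), C1x1(F_L)))."""
    _, H, W = f_low.shape
    high = upsample_bilinear(conv1x1(f_high, w_high), H, W)
    low = conv1x1(f_low, w_low)
    return conv3x3(np.concatenate([high, low], axis=0), w_flow)

rng = np.random.default_rng(0)
f_low = rng.standard_normal((16, 8, 8))      # toy F_L
f_high = rng.standard_normal((32, 4, 4))     # toy F_H
w_low = rng.standard_normal((8, 16)) * 0.1   # unify channels to 8
w_high = rng.standard_normal((8, 32)) * 0.1
w_flow = rng.standard_normal((2, 16, 3, 3)) * 0.1
flow = semantic_flow(f_low, f_high, w_low, w_high, w_flow)  # (2, 8, 8)
```

The two output channels hold one horizontal and one vertical offset per position of the low-level grid.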

The warp operation is performed on each feature point of the high-level feature map of the adjacent pair. Offset feature points are created by adding the generated flow field to each feature point position on the low-level feature map grid. A differentiable bilinear sampling mechanism [28] then obtains the value of the high-level feature map at each offset position and fills it in at the corresponding point, so the final result is obtained by bilinear interpolation. In this way, the high-level feature map is brought to the same spatial resolution as the low-level feature map, which helps preserve the original global information and recover more local feature information. FAM is a more effective upsampling technique than plain bilinear interpolation. It restores lost details in the feature maps and enhances the boundary information of flocs and filamentous bacteria, mitigating the edge smoothing and jaggedness that often occur with traditional upsampling techniques.
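
A minimal NumPy sketch of the warp step is given below. It assumes a channel-first layout, flow offsets expressed in pixels of the low-level grid, corner-aligned coordinate scaling, and clamping of sampling positions at the border; these conventions and the name `flow_warp` are illustrative choices, not details taken from the paper.

```python
import numpy as np

def flow_warp(f_high, flow):
    """Sample a (C, h, w) map on an (H, W) grid shifted by a (2, H, W) flow.

    flow[0] holds x offsets and flow[1] holds y offsets, both measured in
    pixels of the output (low-level) grid.
    """
    _, h, w = f_high.shape
    _, H, W = flow.shape
    gy, gx = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Offset grid positions, rescaled to high-level coordinates.
    sx = (gx + flow[0]) * (w - 1) / max(W - 1, 1)
    sy = (gy + flow[1]) * (h - 1) / max(H - 1, 1)
    sx = np.clip(sx, 0, w - 1)                # assumed border handling: clamp
    sy = np.clip(sy, 0, h - 1)
    x0 = np.floor(sx).astype(int)
    y0 = np.floor(sy).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx, wy = sx - x0, sy - y0
    # Bilinear interpolation between the four neighbouring feature points.
    top = f_high[:, y0, x0] * (1 - wx) + f_high[:, y0, x1] * wx
    bot = f_high[:, y1, x0] * (1 - wx) + f_high[:, y1, x1] * wx
    return top * (1 - wy) + bot * wy

# Toy example: a horizontal ramp 0..3 upsampled from 4x4 to 8x8.
ramp = np.tile(np.arange(4.0), (1, 4, 1))     # (1, 4, 4), value = column index
zero_flow = np.zeros((2, 8, 8))
plain = flow_warp(ramp, zero_flow)            # equals bilinear upsampling

shift_flow = np.zeros((2, 8, 8))
shift_flow[0] = 1.0                           # sample one output pixel to the right
shifted = flow_warp(ramp, shift_flow)
```

With a zero flow the warp reduces to ordinary bilinear upsampling; a non-zero flow lets each output position draw its value from a displaced location, which is what allows misaligned boundaries to be corrected.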

The semantic flow information between adjacent high-level and low-level feature maps is obtained by FAFM. The semantic flow is used as auxiliary information for image recovery. This process enhances the network's ability to capture semantic features and improves its ability to recover images accurately. The shallow feature maps from the encoder and the deep feature maps aligned by FAM are concatenated. The fusion of local and global feature information compensates for the feature loss during the downsampling of the encoder. Finally, the multi-scale feature maps from the encoder are fed into the MLP and then concatenated with the feature maps from the decoder. The resulting feature maps are upsampled to restore the original resolution. The decoder is designed to recover lost feature information at the boundary, which enables more accurate ASPCM image segmentation.

2.3 Focal–Lovász Loss

Low contrast and artifacts in ASPCM images make it difficult to identify the boundaries of flocs and filamentous bacteria. Meanwhile, the two classes are unevenly represented, with filamentous bacteria accounting for only a small share. As a result, the segmentation model is limited in accurately identifying filamentous bacteria, which leads to potential misclassification. The Focal–Lovász Loss is proposed to improve the classification and segmentation performance of the model. In the loss formulation, the prediction probability is denoted as p and the true label as y. The Focal–Lovász Loss is expressed by:

$$\begin{aligned} {\mathcal {L}}(p,y)=a{\mathcal {L}}_{{\text {Lov}\acute{\textrm{a}}\text {sz}}}(p,y)+ b{\mathcal {L}}_{\textrm{FL}}(p,y) \end{aligned}$$

(5)

where a and b are the weights assigned to the two loss functions; ${\mathcal {L}}_{{\text {Lov}\acute{\textrm{a}}\text {sz}}}(p,y)$ denotes the Lovász-Softmax Loss; ${\mathcal {L}}_{\textrm{FL}}(p,y)$ denotes the Focal Loss. The Focal Loss addresses category imbalance: it adjusts the weighting of the loss so that training focuses more on the minority class and on difficult categories, which improves minority-class prediction and segmentation accuracy. $p^{s}$ denotes the predicted probability for category s, where $s \in [1, S]$ indexes the target categories and S denotes the total number of categories. The Focal Loss is expressed by:

$$\begin{aligned} {\mathcal {L}}_{\textrm{FL} }(p,y)=-\sum _{s=1}^{S}\alpha (1-p^{s})^{\gamma }\log (p^{s}) \end{aligned}$$

(6)

where $\alpha $ denotes a coefficient that balances positive and negative samples; $\gamma $ denotes an adjustable parameter that controls the weight of the difficult-to-categorize categories in the loss function. The Lovász-Softmax Loss operates on the pixel-level classification and improves the segmentation performance of the model. By exploiting the ordering of pixel errors and the properties of the Jaccard index, it encourages the model to preserve boundary details and reduce segmentation errors, especially in complex or ambiguous regions. $p_i^{s}\in [0,1]$ denotes the predicted probability that pixel i in the image belongs to category s. The vector of pixel errors $m^{s}$ for category s is constructed from the probabilities $p_i^s$, with elements defined as:

$$\begin{aligned} m_{i}^{s}=\left\{ \begin{aligned} 1-p_{i}^{s}&\quad \textrm{if} \ s =y_i, \\ p_{i}^{s}&\quad \ \textrm{otherwise}. \end{aligned} \right. \end{aligned}$$

(7)

where $y_i$ is the ground truth label of pixel i. The surrogate to the Jaccard loss is constructed using $m^{s}$, and the loss for category s is:

$$\begin{aligned} {\mathcal {L}}(p^{s})={\bar{\Delta }}_{Jc}(m^{s}) \end{aligned}$$

(8)

where ${\bar{\Delta }}_{Jc}$ denotes the surrogate to the Jaccard loss. Given that Mean Intersection over Union (mIoU) is a common metric in semantic segmentation, the loss is averaged over the categories. The final Lovász-Softmax Loss is defined by:

$$\begin{aligned} {\mathcal {L}}_{\text {Lov}\acute{\textrm{a}}\text {sz}}(p,y)=\frac{1}{S}\sum _{s=1}^{S}{\bar{\Delta }}_{Jc}(m^{s}) \end{aligned}$$

(9)
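
The hybrid loss of Eqs. (5) to (9) can be sketched in NumPy on flattened pixels. Several choices here are assumptions of the sketch rather than details from the paper: the focal term is evaluated at the true-class probability with a leading minus sign (the common multi-class form, so the loss is non-negative) and averaged over pixels, and the Lovász term is averaged over all classes. The default weights and parameters are the values reported in the training setup; the function names are hypothetical.

```python
import numpy as np

def focal_loss(probs, labels, alpha=0.25, gamma=2.0, eps=1e-12):
    """Focal term: mean over pixels of -alpha * (1 - p_t)^gamma * log(p_t)."""
    p_true = np.clip(probs[np.arange(labels.size), labels], eps, 1.0)
    return float(np.mean(-alpha * (1.0 - p_true) ** gamma * np.log(p_true)))

def lovasz_grad(gt_sorted):
    """Gradient of the Lovasz extension of the Jaccard loss w.r.t. sorted errors."""
    total = gt_sorted.sum()
    intersection = total - np.cumsum(gt_sorted)
    union = total + np.cumsum(1.0 - gt_sorted)
    jaccard = 1.0 - intersection / union
    jaccard[1:] = jaccard[1:] - jaccard[:-1]  # successive differences
    return jaccard

def lovasz_softmax(probs, labels):
    """Eqs. (7)-(9): class-averaged Lovasz surrogate of the Jaccard loss."""
    losses = []
    for s in range(probs.shape[1]):
        fg = (labels == s).astype(float)
        errors = np.abs(fg - probs[:, s])     # Eq. (7): pixel errors m^s
        order = np.argsort(-errors)           # sort errors in decreasing order
        losses.append(np.dot(errors[order], lovasz_grad(fg[order])))
    return float(np.mean(losses))

def focal_lovasz(probs, labels, a=0.35, b=0.65, alpha=0.25, gamma=2.0):
    """Eq. (5): weighted sum of the Lovasz-Softmax and focal terms."""
    return a * lovasz_softmax(probs, labels) + b * focal_loss(probs, labels, alpha, gamma)

# Toy example: 4 pixels, 3 classes.
labels = np.array([0, 0, 1, 2])
perfect = np.eye(3)[labels]                   # one-hot predictions
uniform = np.full((4, 3), 1.0 / 3.0)          # uninformative predictions
loss_perfect = focal_lovasz(perfect, labels)
loss_uniform = focal_lovasz(uniform, labels)
```

A perfect prediction drives both terms to zero, while uninformative predictions are penalised by both; the weights a and b set how strongly each term contributes.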

3 Experimental study

This section presents experimental results that demonstrate FafFormer's effectiveness.

3.1 Dataset of ASPCM images

ASPCM images used in this experiment were obtained from the aeration tank of a municipal wastewater treatment plant. The dataset consisted of 323 finely annotated images, each containing three semantic classes: flocs, filamentous bacteria, and background. The images had a resolution of 2048 $\times $ 1536. The training set contained 256 images. The test set contained 67 images. Figure 2 shows the original image and the corresponding ground truth segmentation of an ASPCM image.

Table 1 Image segmentation comparison of different models


3.2 Model training

Experiments were conducted with the PaddlePaddle deep learning framework using Python. They were executed on AI Studio with a Tesla V100 GPU with 32 GB of memory. The GPU memory consumption of the model was around 24.5 GB during training. The AdamW optimizer was used with a momentum of 0.01 and a batch size of 2 during training. The input image size was set to 1024 $\times $ 1000 pixels. The total number of iterations was set to 55,000, and the training process took approximately 15 h. The initial learning rate was set to 0.0001 and gradually decreased following the “poly” strategy with a power of 2. MiT-B3 was selected as the encoder for better results. In the Focal–Lovász Loss, a and b were 0.35 and 0.65, respectively; $\gamma $ was 2; $\alpha $ was 0.25.
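
For reference, the “poly” learning-rate strategy is commonly written as lr = lr0 · (1 − iter / max_iter)^power. The sketch below plugs in the reported values (initial rate 0.0001, 55,000 iterations, power 2); the final learning rate of 0 and the function name `poly_lr` are assumptions, since the paper does not state them.

```python
def poly_lr(step, base_lr=1e-4, max_steps=55_000, power=2.0, end_lr=0.0):
    """Polynomial ("poly") decay from base_lr to end_lr over max_steps."""
    step = min(max(step, 0), max_steps)       # clamp to the schedule range
    decay = (1.0 - step / max_steps) ** power
    return (base_lr - end_lr) * decay + end_lr

# Learning rate at the start, the midpoint, and the end of training.
lr_start = poly_lr(0)
lr_mid = poly_lr(27_500)
lr_end = poly_lr(55_000)
```

With a power of 2 the rate falls to a quarter of its initial value by the midpoint of training, so most of the schedule is spent at small step sizes.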

Figure 3 presents the performance and training status curves of the model (the loss value, learning rate, mIoU, and accuracy). The loss decreases gradually during training and converges at around 55,000 iterations. A higher learning rate is set initially for fast convergence, and the “poly” strategy then gradually lowers it for fine-tuning. Both mIoU and accuracy gradually increase as training progresses.

3.3 Experiment results and comparative analysis

Table 1 presents the segmentation results of the different models. FafFormer performs strongly in terms of mIoU relative to the other models. It performs particularly well on filamentous bacteria in terms of precision, yielding highly accurate segmentation. In terms of FLOPs, this accuracy comes with only a slight increase in computational complexity. Bold font highlights the best performance for each segmentation metric.
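
Since mIoU is the main metric in these comparisons, a short NumPy sketch of how it is usually computed from a confusion matrix may be helpful. The tiny label maps are made up for illustration, and the class indices (0 for background, 1 for flocs, 2 for filamentous bacteria) are a hypothetical encoding, not the dataset's actual one.

```python
import numpy as np

def confusion_matrix(pred, target, num_classes):
    """Rows index the true class, columns the predicted class."""
    idx = target.ravel() * num_classes + pred.ravel()
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def per_class_iou(cm):
    """IoU_s = TP_s / (TP_s + FP_s + FN_s); NaN for classes that never occur."""
    tp = np.diag(cm).astype(float)
    union = cm.sum(axis=0) + cm.sum(axis=1) - tp
    return np.where(union > 0, tp / np.maximum(union, 1), np.nan)

# Made-up 2x3 label maps (hypothetical encoding: 0 bg, 1 floc, 2 filament).
target = np.array([[0, 0, 1],
                   [2, 2, 1]])
pred = np.array([[0, 1, 1],
                 [2, 0, 1]])

cm = confusion_matrix(pred, target, num_classes=3)
iou = per_class_iou(cm)                       # [1/3, 2/3, 1/2]
miou = float(np.nanmean(iou))                 # 0.5
```

Because every class contributes equally to the mean, a rare class such as filamentous bacteria affects mIoU as much as the dominant background, which is why the metric is sensitive to class imbalance.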

Figure 4 shows the segmentation obtained by each model individually. Regions (I), (II), and (III) are selected for comparative analysis. The presence of artifacts poses challenges in the identification of flocs and filamentous bacteria within regions (II) and (III). Additionally, poor contrast and artifacts further obscure the boundaries of filamentous bacteria in region (I). FafFormer incorporates MiT and PPM to extract features, and FAFM is proposed to recover image-boundary features. It allows for more accurate segmentation of flocs and filamentous bacteria boundaries compared to other methods that are less effective at capturing boundary details. FafFormer can enhance the segmentation performance of flocs and filamentous bacteria with diverse morphology as well as the recovery of crucial boundary information.

To evaluate the generality of our approach, we conducted experiments on two additional datasets, Cityscapes and ADE20K, which are widely used to evaluate and compare semantic segmentation models. Table 2 reports the performance (mIoU) and computational demands of various models on the ADE20K and Cityscapes datasets. FafFormer surpasses its counterparts in mIoU scores. Although its computational load (FLOPs) exceeds that of Segformer, it remains below that of DeepLabV3+ and SFNet. FafFormer thus achieves a good balance between model performance and computational cost.

3.4 Ablation experiments

The PPM is added to the encoder to improve segmentation performance by extracting semantic feature information at different scales. The ablation results of the encoder are shown in Table 3. After the addition of PPM, the model's ability to segment filamentous bacteria and background improves, while its performance on flocs decreases slightly; overall performance still improves. Table 4 presents ablation experiments conducted with different modules. It shows that combining PPM with the encoder and FAFM increases computational complexity and the number of model parameters only slightly. FafFormer enhances the model's segmentation performance, which could be crucial in accurately handling the various morphologies of flocs and filamentous bacteria.

Table 2 Comparison results of different datasets


Table 3 Ablation experiments of the encoder


Ablation experiments are conducted using different upsampling techniques in the decoder, e.g., bilinear interpolation, nearest neighbor interpolation, FAM, and FAFM. Table 5 shows the experimental segmentation metrics. FAFM exhibits optimal performance in terms of mIoU compared to traditional upsampling methods, followed by FAM. The concatenation operation in FAFM fuses shallow and deep semantic information.

The class imbalance between flocs and filamentous bacteria makes it difficult to segment filamentous bacteria accurately. A comparative analysis is therefore conducted using different loss functions (the Cross-Entropy Loss, Focal Loss, Lovász-Softmax Loss, and Focal–Lovász Loss) to optimize the segmentation accuracy of the model. The Focal–Lovász Loss demonstrates superior performance in terms of mIoU compared to the other loss functions (Table 6). It addresses the class imbalance and significantly improves the segmentation accuracy of filamentous bacteria. It mitigates the negative effects of class imbalance by assigning higher weights to the minority categories during training, so that the model focuses on learning filamentous bacteria features, which improves segmentation accuracy under unbalanced class distributions.

Table 4 Ablation experiments with different modules


Table 5 Ablation experiments of the decoder


Table 6 Ablation experiments of model Loss


4 Conclusion

ASPCM images present many challenges, such as low contrast, artifacts, and class imbalance, which significantly affect their segmentation. FafFormer was developed as a Transformer-based model to improve the segmentation performance on ASPCM images through pyramid pooling and flow alignment fusion. MiT and PPM were applied within the encoder to extract the features of flocs and filamentous bacteria, taking their different morphologies into account. FAFM was designed within the decoder to restore the boundary information of flocs and filamentous bacteria. FAFM used the generated semantic flow as additional information for fine-grained upsampling and fused the result with the multi-scale features from the encoder. The Focal–Lovász Loss combined the Focal Loss and the Lovász-Softmax Loss to improve segmentation accuracy and address the class imbalance. The experimental evaluation was performed on a dataset obtained from a municipal wastewater treatment plant. Compared with existing models, FafFormer proved superior in accuracy and reliability, particularly in filamentous bacteria segmentation. A lightweight version of the model will be explored in future research.

Data availability

The datasets generated and/or analyzed during the current study are available from the corresponding author on reasonable request. All other public datasets used are available and cited in the references.

References

1. Khan, M.B., Lee, X.Y., Nisar, H., Ng, C.A., Yeap, K.H., Malik, A.S.: Digital image processing and analysis for activated sludge wastewater treatment. Signal Image Anal. Biomed. Life Sci. 227–248 (2015)
2. Zhang, Y., Cui, J., Xu, C., Yang, J., Liu, M., Ren, M., Tan, X., Lin, A., Yang, W.: The formation of discharge standards of pollutants for municipal wastewater treatment plants needs adapt to local conditions in China. Environ. Sci. Pollut. Res. 30(20), 57207–57211 (2023)
3. Jenné, R., Banadda, E.N., Philips, N., Van Impe, J.: Image analysis as a monitoring tool for activated sludge properties in lab-scale installations. J. Environ. Sci. Health Part A 38(10), 2009–2018 (2003)
4. Nisar, H., Yong, L.X., Ho, Y.K., Voon, Y.V., Siang, S.C.: Application of imaging techniques for monitoring flocs in activated sludge. In: 2012 International Conference on Biomedical Engineering (ICoBE), pp. 6–9 (2012). IEEE
5. Lee, X.Y., Khan, M.B., Nisar, H., Ho, Y.K., Ng, C.A., Malik, A.S.: Morphological analysis of activated sludge flocs and filaments. In: 2014 IEEE International Instrumentation and Measurement Technology Conference (I2MTC) Proceedings, pp. 1449–1453 (2014). IEEE
6. Khan, M.B., Nisar, H., Aun, N.C.: Segmentation and quantification of activated sludge flocs for wastewater treatment. In: 2014 IEEE Conference on Open Systems (ICOS), pp. 18–23 (2014). IEEE
7. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440 (2015)
8. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pp. 234–241 (2015). Springer
9. Badrinarayanan, V., Kendall, A., Cipolla, R.: Segnet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39(12), 2481–2495 (2017)
10. Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 40(4), 834–848 (2017)
11. Chen, L.-C., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for semantic image segmentation. arXiv:1706.05587 (2017)
12. Wang, P., Chen, P., Yuan, Y., Liu, D., Huang, Z., Hou, X., Cottrell, G.: Understanding convolution for semantic segmentation. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1451–1460 (2018). IEEE
13. Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 801–818 (2018)
14. Huang, T., Chen, J., Jiang, L.: DS-UNeXt: depthwise separable convolution network with large convolutional kernel for medical image segmentation. Signal Image Video Process. 17(5), 1775–1783 (2022)
15. Chen, L., Cui, Y., Song, H., Huang, B., Yang, J., Zhao, D., Xia, B.: Femoral head segmentation based on improved fully convolutional neural network for ultrasound images. Signal Image Video Process. 14, 1043–1051 (2020)
16. Wang, Y., Wang, J., Guo, P.: Eye-UNet: a UNet-based network with attention mechanism for low-quality human eye image segmentation. Signal Image Video Process. 17(4), 1097–1103 (2022)
17. Zhao, L.-J., Zou, S.-D., Zhang, Y.-H., Huang, M.-Z., Zuo, Y., Wang, J., Lu, X.-K., Wu, Z.-H., Liu, X.-Y.: Segmentation of activated sludge phase contrast microscopy images using u-net deep learning model. Sens. Mater. 31(6), 2013–2028 (2019)
18. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017)
19. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth $16\times 16$ words: transformers for image recognition at scale. arXiv:2010.11929 (2020)
20. Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H., et al.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6881–6890 (2021)
21. Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 34, 12077–12090 (2021)
22. Lin, T.-Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988 (2017)
23. Berman, M., Triki, A.R., Blaschko, M.B.: The lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4413–4421 (2018)
24. Lee, J., Kim, D., Ponce, J., Ham, B.: Sfnet: Learning object-aware semantic correspondence. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2278–2287 (2019)
25. Li, X., You, A., Zhu, Z., Zhao, H., Yang, M., Yang, K., Tan, S., Tong, Y.: Semantic flow for fast and accurate scene parsing. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16, pp. 775–793 (2020). Springer
26. Wang, W., Xie, E., Li, X., Fan, D.-P., Song, K., Liang, D., Lu, T., Luo, P., Shao, L.: Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 568–578 (2021)
27. Islam, M.A., Jia, S., Bruce, N.D.: How much position information do convolutional neural networks encode? arXiv:2001.08248 (2020)
28. Jaderberg, M., Simonyan, K., Zisserman, A., et al.: Spatial transformer networks. Adv. Neural Inf. Process. Syst. 28 (2015)


Acknowledgements

We would like to thank the editor and the reviewers for their valuable comments.

Funding

Open access funding provided by Politecnico di Milano within the CRUI-CARE Agreement. This work was supported in part by the National Key Research and Development Program (2018YFB1700200), the 2020 Liaoning Provincial Higher Education Innovative Talent Support Program, and the 2021 Basic Research Project of Higher Education Key Projects (LJKZ0442).

Author information

Authors and Affiliations

College of Information Engineering, Shenyang University of Chemical Technology, Shenyang, 110142, Liaoning, China
Lijie Zhao, Yingying Zhang, Guogang Wang & Mingzhong Huang
Department of Computer Science, University of Bradford, Bradford, BD7 1DP, UK
Qichun Zhang
Department of Mechanical Engineering, Politecnico di Milano, Milan, 32 20133, Italy
Hamid Reza Karimi

Contributions

L. Zhao, Y. Zhang and G. Wang prepared the main contents of the manuscript. M. Huang contributed to addressing the comments and revising the manuscript. Q. Zhang and H.R. Karimi contributed to the discussion and analysis of the experimental results. All authors revised and proofread the submission.

Corresponding author

Correspondence to Hamid Reza Karimi.

Ethics declarations

Conflict of interest

The authors have no relevant financial or non-financial interests to disclose.

Consent for publication

All authors agreed on the final approval of the version to be published.

Ethics approval and consent to participate

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Cite this article

Zhao, L., Zhang, Y., Wang, G. et al. Multi-scale feature flow alignment fusion with Transformer for the microscopic images segmentation of activated sludge. SIViP 18, 1241–1248 (2024). https://doi.org/10.1007/s11760-023-02836-0


Received: 27 August 2023
Revised: 06 October 2023
Accepted: 08 October 2023
Published: 02 November 2023
Issue Date: March 2024
DOI: https://doi.org/10.1007/s11760-023-02836-0
