1 Introduction

The activated sludge (AS) process is a biological wastewater treatment method that removes organic matter and relies on a community of microorganisms. However, sludge settles poorly when filamentous bacteria become too abundant [1], a condition known as filamentous bulking. Bulking degrades effluent quality, may violate environmental discharge standards [2], and threatens process safety.

Phase-contrast microscopic (PCM) images are among the most direct and effective ways to examine the microorganisms and microstructure of activated sludge. Image segmentation is key to their quantitative analysis and processing. Several classical segmentation methods have been developed and employed for analyzing microscopic images of activated sludge over the years, e.g., threshold-based [3], edge-based [4], region-based [5], and clustering-based approaches [6]. Although classical methods have advanced the segmentation of activated sludge phase-contrast microscopic (ASPCM) images, challenges remain, including the complex and diverse morphology of flocs and filamentous bacteria, overlapping structures, noise, and unbalanced data distribution.

Deep learning has significantly advanced semantic image segmentation with widely used convolutional neural networks (CNN). Models (e.g., fully convolutional networks (FCN) [7], U-Net [8], SegNet [9], and DeepLab [10,11,12,13]) have demonstrated their effectiveness in accurately segmenting images. Applications derived from deep-learning-based image segmentation have thrived in various fields, e.g., clinical medical image segmentation [14, 15] and eye image segmentation [16].

Deep learning has attracted considerable interest for ASPCM image segmentation. Zhao et al. [17] first applied deep learning to this task, implementing a U-Net segmentation model trained with a combination of the binary cross-entropy loss and the Dice coefficient loss. Moreover, the advent of the Transformer [18] and the vision Transformer (ViT) [19] for computer vision offered a promising alternative for image segmentation. SETR [20], based on ViT, achieved excellent results in image segmentation, which confirms the effectiveness of ViT in the task. However, the fixed-resolution position encoding used in ViT leads to degraded performance and reduced efficiency. Segformer [21] addressed these limitations with a Transformer-based semantic segmentation model whose encoder, the more powerful Mix Transformer (MiT), provides semantic features with richer global information. Segformer achieved higher segmentation accuracy than earlier models. These advances in deep learning have greatly benefited the accuracy and efficiency of image segmentation.

ASPCM image segmentation poses several challenges, e.g., poor contrast, artifacts, similar shapes, variations in size and shape, and imbalanced class distribution. The main contributions of this work to address these challenges are as follows:

  • A novel semantic segmentation model (FafFormer) was proposed for ASPCM segmentation. Pyramid pooling and flow alignment fusion based on Transformer were incorporated to enhance its segmentation capabilities.

  • Flow Alignment Fusion Module (FAFM) was proposed in the decoder to restore boundary information using semantic flows for fine upsampling and fusion with multi-scale features.

  • A hybrid loss function was designed as a weighted sum of the Focal Loss [22] and Lovász-Softmax Loss [23] to handle class imbalance and mIoU optimization for ASPCM segmentation.

  • Experiments on a real activated sludge dataset from a municipal wastewater treatment plant demonstrated the superior accuracy and reliability of FafFormer, especially in the filamentous bacteria segmentation. It outperformed existing semantic segmentation models.

2 Proposed network architecture: FafFormer

FafFormer follows an encoder–decoder structure. In the encoder, the MiT and a pyramid pooling module (PPM) extract multi-scale features from ASPCM images while progressively reducing the spatial resolution. ASPCM images are divided into overlapping patches, and the pixel values of each patch are transformed to obtain the feature representation. After the MiT, feature map \({F_4}\) is pooled and merged by the PPM to capture further multi-scale information. The decoder consists of FAFM and an upsampling operation. FAFM is composed of the Flow Alignment Module (FAM) [24, 25] and a concatenation operation. FAM takes the low- and high-level feature maps of adjacent encoder layers, predicts the flow field between them within the network, and uses this semantic flow to align them. The FAM output is concatenated channel-wise with the encoder output, which compensates for the local information lost during feature extraction. The feature maps from each stage of the encoder are also fed into a multilayer perceptron (MLP) for further feature extraction and concatenated with the decoder feature maps. The predicted feature map is generated using a \(1 \times 1\) convolutional layer and upsampled to the resolution of the original image. The Focal–Lovász Loss is proposed to overcome the category imbalance and improve the accuracy of filamentous bacteria segmentation. Figure 1 shows the network architecture of FafFormer.

Fig. 1 Network architecture of FafFormer

2.1 Encoder

The encoder, composed of MiT and PPM, extracts multi-scale features from ASPCM images. Specifically, ASPCM images are used as input for the overlap patch embedding in MiT to capture position encoding information. Subsequently, four Transformer blocks produce feature maps at 1/4, 1/8, 1/16, and 1/32 of the original image resolution. Each Transformer block consists of efficient self-attention, Mix-FFN, and overlap patch merging. Efficient self-attention improves on the multi-head self-attention mechanism by reducing its computational complexity. In multi-head self-attention, each element in the sequence interacts with all other elements, which yields a representation of each element in the sequence. The self-attention is calculated as follows:

$$\begin{aligned} \textrm{Attention}(Q,K,V)=\textrm{Softmax}\left( \frac{ QK ^{\textrm{T}}}{{\sqrt{d_{k} }} } \right) {V}\end{aligned}$$
(1)

where Q, K, and V are the query matrix, key matrix, and value matrix, respectively; \(d_k\) is the vector dimension of matrices Q and K; \(O(N^2)\) is the computational complexity of the process. Efficient self-attention utilizes sequence reduction [26] with the following process:

$$\begin{aligned} {\hat{K}}= & {} \textrm{Reshape}\left( \frac{N}{R},C\cdot R\right) (K) \end{aligned}$$
(2)
$$\begin{aligned} K= & {} \textrm{Linear}(C\cdot R,C)({\hat{K}}) \end{aligned}$$
(3)

where N is the sequence length; C is the channel dimension; K is the sequence to be reduced; \({\hat{K}}\) denotes K after reshaping; R denotes the reduction ratio; \(\textrm{Reshape}(\frac{N}{R}, C\cdot R)(K)\) reshapes K to \(\frac{N}{R} \times (C\cdot R)\); \(\textrm{Linear}(C\cdot R,C)({\hat{K}})\) is a linear layer that takes \({\hat{K}}\) with input dimension \(C\cdot R\) and produces output dimension C, so that the new K has size \(\frac{N}{R} \times C\); R is set to [64, 16, 4, 1] in the respective stages of the encoder. The computational complexity of the efficient self-attention is thereby reduced from \(O(N^2)\) to \(O(\frac{N^2}{R})\). Position encoding provides location information to the network, and it is typically interpolated when the resolution differs between the training and inference stages. However, interpolation decreases segmentation accuracy. A CNN is capable of learning positional information implicitly [27]. Therefore, a zero-padded \(3\times 3\) convolutional layer is used in the Mix-FFN to provide the network with the required positional information.
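The sequence reduction of Eqs. (2) and (3) followed by the attention of Eq. (1) can be sketched in plain Python as below. This is a minimal single-head illustration, not the authors' PaddlePaddle implementation: the function names, the weight matrices `W_k` and `W_v`, and the choice to reduce V in the same way as K are assumptions made for the example.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def reduce_sequence(X, R, W):
    """Eqs. (2)-(3): reshape N x C to (N/R) x (C*R), then project to (N/R) x C.

    W is a hypothetical (C*R) x C weight matrix standing in for the linear layer.
    """
    N = len(X)
    assert N % R == 0, "sequence length must be divisible by R"
    # Row-major reshape: R consecutive tokens are flattened into one row.
    X_hat = [[v for tok in X[i:i + R] for v in tok] for i in range(0, N, R)]
    cols = list(zip(*W))  # C columns, each of length C*R
    return [[dot(row, col) for col in cols] for row in X_hat]

def efficient_attention(Q, K, V, R, W_k, W_v):
    """Eq. (1) with keys and values shortened from N to N/R tokens."""
    K_r = reduce_sequence(K, R, W_k)
    V_r = reduce_sequence(V, R, W_v)  # assumption: V is reduced like K
    scale = math.sqrt(len(Q[0]))      # sqrt(d_k)
    # Each query attends to N/R keys, so the score matrix is N x (N/R).
    weights = [softmax([dot(q, k) / scale for k in K_r]) for q in Q]
    out = [[dot(w, col) for col in zip(*V_r)] for w in weights]
    return out, weights
```

As a worked example under a hypothetical 1024 × 1024 input, the first stage (1/4 scale) has N = 256 × 256 = 65,536 tokens; with R = 64 the key sequence shrinks to 1,024 tokens, so the score matrix has 65,536 × 1,024 entries instead of 65,536 × 65,536.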

Feature map \(F_4\), which captures the most abstract semantic information, is obtained from the deepest layer of MiT. It serves as input to the PPM where further encoding is performed to capture multi-scale information. Feature maps containing multi-scale information are extracted in parallel from \(F_4\) using Max Pooling of sizes \(1 \times 1\), \(2\times 2\), \(3\times 3\), and \(6\times 6\), respectively. The resulting multi-scale feature maps are subsequently upsampled to match the \(F_4\) size for concatenation. The segmentation performance for flocs and filamentous bacteria with different morphology is improved by integrating the feature information from these multi-scale feature maps.
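The pooling-and-upsampling step of the PPM can be illustrated on a single-channel map as follows. This is a sketch under stated assumptions: the adaptive window bounds and the nearest-neighbour upsampling are choices made for the example (the text does not specify the upsampling mode), and the function names are hypothetical.

```python
def adaptive_max_pool(fmap, bins):
    """Max-pool an H x W map into a bins x bins grid using adaptive windows."""
    H, W = len(fmap), len(fmap[0])
    pooled = []
    for i in range(bins):
        r0 = (i * H) // bins                 # floor(i * H / bins)
        r1 = -((-(i + 1) * H) // bins)       # ceil((i + 1) * H / bins)
        row = []
        for j in range(bins):
            c0 = (j * W) // bins
            c1 = -((-(j + 1) * W) // bins)
            row.append(max(fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)))
        pooled.append(row)
    return pooled

def upsample_nearest(fmap, H, W):
    """Resize a small map back to H x W by nearest-neighbour lookup."""
    h, w = len(fmap), len(fmap[0])
    return [[fmap[(r * h) // H][(c * w) // W] for c in range(W)] for r in range(H)]

def pyramid_pool(fmap, bin_sizes=(1, 2, 3, 6)):
    """Return the input map plus one upsampled branch per bin size.

    The returned list plays the role of the channel-wise concatenation.
    """
    H, W = len(fmap), len(fmap[0])
    branches = [fmap]
    for b in bin_sizes:
        branches.append(upsample_nearest(adaptive_max_pool(fmap, b), H, W))
    return branches
```

The 1 × 1 branch carries a single global summary of \(F_4\), while the 6 × 6 branch keeps coarse spatial layout, which is why the concatenated result mixes context at several scales.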

2.2 Decoder

FAFM is proposed in the decoder to restore the images. FAFM is composed of FAM and a concatenation operation. FAM consists of a sub-network that generates semantic flow fields and a warp production step. The sub-network employs convolutional layers and upsampling to obtain flow-field information. Specifically, a \(1\times 1\) convolutional layer is applied to each feature map to unify the channel dimension. The low-level and high-level feature maps of an adjacent pair are denoted as \(F_L\in {\mathbb {R}}^{H \times W \times C}\) and \(F_H\in {\mathbb {R}}^{h \times w \times c}\), respectively. \(F_H\) is upsampled to the size of \(F_L\) using bilinear interpolation. After the \(1\times 1\) convolutional layers and the upsampling of \(F_H\), the two maps are concatenated along the channel dimension to fuse features of different scales. Then, a \(3\times 3\) convolutional layer is applied to the fused features to obtain the flow field information. The entire process can be stated as follows:

$$\begin{aligned} \Delta _{L}=\textrm{C}_{3\times 3}(\textrm{Cat}(\textrm{U}(\textrm{C}_{1 \times 1}( F_H )),\textrm{C}_{1\times 1}(F_L) )) \end{aligned}$$
(4)

where \(\textrm{C}\) denotes the convolutional layer; \(\textrm{U}\) denotes upsampling operation; \(\textrm{Cat}\) denotes the concatenation operation; \(\Delta _{L}\) denotes the flow field between \(F_L\) and \(F_H\) generated by the sub-network; \(\Delta _{L} \in {\mathbb {R}}^{H \times W \times 2}\). The flow field represents the semantic flow offset between two adjacent feature maps.

Warp production is performed on each feature point of the high-level feature map in an adjacent pair. A differentiable bilinear sampling mechanism [28] is used to obtain the corresponding value for each feature point. These values are filled in at the offset positions, which helps preserve the original global information and recover more local feature information. In addition, the high-level feature maps are scaled to the same spatial resolution as the low-level feature maps. Offset feature points are created by adding the generated flow field to each feature point on the low-level feature map. The final result is obtained by bilinear interpolation. FAM is a more effective upsampling technique than plain bilinear interpolation. It restores lost details in the feature maps and enhances the boundary information of flocs and filamentous bacteria. This mitigates the edge smoothing and jaggedness that often occur with traditional upsampling techniques.
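The warp step described above can be sketched for a single-channel map as follows. This is an illustrative version only: the pixel-unit offsets, the corner-aligned mapping between the two grids, and the border clamping are assumptions made for the example, and the function names are hypothetical.

```python
def bilinear_sample(fmap, y, x):
    """Bilinearly interpolate a 2D map at fractional (y, x), clamped to the border."""
    h, w = len(fmap), len(fmap[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bottom = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

def flow_warp(f_high, flow):
    """Resample f_high (h x w) onto the H x W grid of the flow field.

    flow[r][c] = (d_row, d_col) is the offset, in low-level pixel units,
    added to grid point (r, c) before sampling the high-level map.
    """
    h, w = len(f_high), len(f_high[0])
    H, W = len(flow), len(flow[0])
    # Scale factors that map low-level coordinates onto the high-level grid.
    sy = (h - 1) / (H - 1) if H > 1 else 0.0
    sx = (w - 1) / (W - 1) if W > 1 else 0.0
    out = []
    for r in range(H):
        row = []
        for c in range(W):
            d_r, d_c = flow[r][c]
            row.append(bilinear_sample(f_high, (r + d_r) * sy, (c + d_c) * sx))
        out.append(row)
    return out
```

With an all-zero flow field the function reduces to ordinary bilinear upsampling; a learned, non-zero flow shifts each sampling location, which is how the alignment differs from fixed interpolation.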

FAFM obtains the semantic flow information between adjacent high-level and low-level feature maps. The semantic flow is used as auxiliary information for image recovery. This process enhances the network’s ability to capture semantic features and improves its ability to recover images accurately. The shallow feature maps from the encoder and the deep feature maps output by FAM are concatenated. The fusion of local and global feature information compensates for the feature loss during the downsampling of the encoder. Finally, the multi-scale feature maps from the encoder are fed into the MLP and then concatenated with the feature maps from the decoder. The resulting feature maps are upsampled to restore the original resolution. The decoder is designed to recover lost feature information at the boundary, which enables more accurate ASPCM image segmentation.

2.3 Focal–Lovász Loss

The low contrast and artifacts in ASPCM images make it difficult to identify the boundaries of flocs and filamentous bacteria. Meanwhile, flocs and filamentous bacteria are unevenly represented, with filamentous bacteria in the minority. As a result, the segmentation model has difficulty identifying filamentous bacteria accurately, which leads to potential misclassification. The Focal–Lovász Loss is proposed to improve the classification and segmentation performance of the model. In the loss formulation, the prediction probability is denoted as p and the true label as y. The Focal–Lovász Loss is expressed by:

$$\begin{aligned} {\mathcal {L}}(p,y)=a{\mathcal {L}}_{{\text {Lov}\acute{\textrm{a}}\text {sz}}}(p,y)+ b{\mathcal {L}}_{\textrm{FL}}(p,y) \end{aligned}$$
(5)

where a and b are the weights assigned to the two loss functions; \({\mathcal {L}}_{{\text {Lov}\acute{\textrm{a}}\text {sz}}}(p,y)\) denotes the Lovász-Softmax Loss; \({\mathcal {L}}_{\textrm{FL}}(p,y)\) denotes the Focal Loss. The Focal Loss addresses category imbalance: it adjusts the weight of each term in the loss so that training focuses more on the minority class and on hard examples, which improves minority class prediction performance. This refocuses the model on learning difficult categories and improves segmentation accuracy. \(p^{s}\) denotes the predicted probability for category s, where s denotes the target category, \(s \in [1, S]\), and S denotes the total number of categories. The Focal Loss is expressed by:

$$\begin{aligned} {\mathcal {L}}_{\textrm{FL} }(p,y)=-\sum _{s=1}^{S}\alpha (1-p^{s})^{\gamma }\textrm{log} (p^{s}) \end{aligned}$$
(6)

where \(\alpha \) denotes a coefficient that balances the positive and negative samples; \(\gamma \) denotes an adjustable parameter that controls the weight of the difficult-to-classify categories in the loss function. The Lovász-Softmax Loss is applied to the pixel-level classification in the segmentation to improve the classification and segmentation performance of the model. By exploiting the ordering of pixel errors and the Jaccard index, the Lovász-Softmax Loss encourages the model to preserve boundary details and reduce segmentation errors, especially in complex or ambiguous regions. \(p_i^{s}\) denotes the predicted probability (\(p_i^{s}\in [0,1]\)) that pixel i in the image belongs to category s. The vector of pixel errors \(m^{s}\) for category s is constructed from the probabilities \(p_i^s\), with entries defined as:

$$\begin{aligned} m_{i}^{s}=\left\{ \begin{aligned} 1-p_{i}^{s}&\quad \textrm{if} \ s =y_i, \\ p_{i}^{s}&\quad \ \textrm{otherwise}. \end{aligned} \right. \end{aligned}$$
(7)

where \(y_i\) is the ground truth label of pixel i. The surrogate to the Jaccard loss is constructed using \(m^{s}\), and the loss for category s is:

$$\begin{aligned} {\mathcal {L}}(p^{s})={\bar{\Delta }}_{Jc}(m^{s}) \end{aligned}$$
(8)

where \({\bar{\Delta }}_{Jc}\) denotes the surrogate to the Jaccard loss. Given that Mean Intersection over Union (mIoU) is a common metric in semantic segmentation, the loss is averaged over categories. The final Lovász-Softmax Loss is defined by:

$$\begin{aligned} {\mathcal {L}}_{\text {Lov}\acute{\textrm{a}}\text {sz}}(p,y)=\frac{1}{S}\sum _{s=1}^{S}{\bar{\Delta }}_{Jc}(m^{s}) \end{aligned}$$
(9)
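Equations (5) to (9) can be illustrated on a flat list of pixels with the sketch below. It is not the authors' implementation. Two assumptions are made and marked in the comments: the Focal Loss is evaluated on the true-class probability of each pixel with a leading minus sign (the standard form), and the losses are averaged over pixels. The default weights a = 0.35, b = 0.65, \(\alpha \) = 0.25, and \(\gamma \) = 2 are the values reported in Sect. 3.2; the function names are hypothetical.

```python
import math

def focal_loss(probs, labels, alpha=0.25, gamma=2.0):
    """Focal Loss; probs[i][s] is the predicted probability of class s at pixel i."""
    total = 0.0
    for p, y in zip(probs, labels):
        p_true = max(p[y], 1e-12)  # assumption: only the true-class term is used
        total += -alpha * (1.0 - p_true) ** gamma * math.log(p_true)
    return total / len(labels)     # assumption: mean over pixels

def lovasz_grad(gt_sorted):
    """Gradient of the Lovász extension of the Jaccard loss.

    gt_sorted holds 0/1 foreground flags ordered by decreasing pixel error.
    """
    gts = sum(gt_sorted)
    grads, cum_fg, cum_bg, prev = [], 0, 0, 0.0
    for g in gt_sorted:
        cum_fg += g
        cum_bg += 1 - g
        jaccard = 1.0 - (gts - cum_fg) / (gts + cum_bg)
        grads.append(jaccard - prev)
        prev = jaccard
    return grads

def lovasz_softmax(probs, labels, num_classes):
    """Eqs. (7)-(9): class-averaged Lovász surrogate of the Jaccard loss."""
    losses = []
    for s in range(num_classes):
        fg = [1 if y == s else 0 for y in labels]
        errors = [abs(f - p[s]) for f, p in zip(fg, probs)]  # Eq. (7)
        order = sorted(range(len(errors)), key=lambda i: errors[i], reverse=True)
        grad = lovasz_grad([fg[i] for i in order])
        losses.append(sum(errors[i] * g for i, g in zip(order, grad)))  # Eq. (8)
    return sum(losses) / num_classes                                     # Eq. (9)

def focal_lovasz_loss(probs, labels, num_classes, a=0.35, b=0.65,
                      alpha=0.25, gamma=2.0):
    """Eq. (5): weighted sum of the Lovász-Softmax Loss and the Focal Loss."""
    return (a * lovasz_softmax(probs, labels, num_classes)
            + b * focal_loss(probs, labels, alpha, gamma))
```

A perfect one-hot prediction gives a loss of zero in both terms, while a uniform prediction is penalized by both, which is the behaviour the weighted sum relies on.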

3 Experimental study

This section presents experimental results that demonstrate the effectiveness of FafFormer.

3.1 Dataset of ASPCM images

ASPCM images used in this experiment were obtained from the aeration tank of a municipal wastewater treatment plant. The dataset consisted of 323 finely annotated images, each containing three semantic classes: flocs, filamentous bacteria, and background. The images had a resolution of 2048 \(\times \) 1536. The training set contained 256 images. The test set contained 67 images. Figure 2 shows the original image and the corresponding ground truth segmentation of an ASPCM image.

Fig. 2 Original image and ground truth of an ASPCM image

Fig. 3 Model performance and training status curves

Table 1 Image segmentation comparison of different models

3.2 Model training

Experiments were conducted on the PaddlePaddle deep learning framework using Python. The experiments were executed on AI Studio with a Tesla V100 GPU with 32 GB of memory. The GPU memory consumption of the model was around 24.5 GB during training. The AdamW optimizer was used with a momentum of 0.01 and a batch size of 2 during training. The input image size was set to 1024 \(\times \) 1000 pixels. The total number of iterations was set to 55,000, and the training process took approximately 15 h. The initial learning rate was set to 0.0001 and gradually decreased following the “poly” strategy with a power of 2. MiT-B3 was selected as the encoder for better results. In the Focal–Lovász Loss, a and b were 0.35 and 0.65, respectively; \(\gamma \) was 2; \(\alpha \) was 0.25.

Figure 3 presents the performance and training status curves of the model (the loss value, learning rate, mIoU, and accuracy). The loss gradually converges during training, with convergence around 55,000 iterations. A higher learning rate is initially set for fast convergence, and then the “poly” strategy is used to fine-tune the learning rate. Both mIoU and accuracy gradually increase as training progresses.
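For reference, the “poly” schedule with the reported settings (initial learning rate 0.0001, 55,000 iterations, power 2) can be written as the short function below. It uses the usual polynomial decay formula; the final learning rate `end_lr` of 0 is an assumption, since the text does not state it, and the function name is hypothetical.

```python
def poly_lr(iteration, base_lr=1e-4, max_iters=55_000, power=2.0, end_lr=0.0):
    """Polynomial ("poly") learning-rate decay.

    lr = (base_lr - end_lr) * (1 - iteration / max_iters) ** power + end_lr
    end_lr = 0.0 is an assumed value, not taken from the paper.
    """
    frac = min(iteration, max_iters) / max_iters
    return (base_lr - end_lr) * (1.0 - frac) ** power + end_lr
```

With a power of 2 the rate falls quickly at first: halfway through training (iteration 27,500) it is already one quarter of the initial value, 0.000025.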

3.3 Experiment results and comparative analysis

Table 1 presents the segmentation results of the different models. FafFormer performs well in terms of mIoU compared with the other models. It segments filamentous bacteria with particularly high precision. In terms of FLOPs, it achieves this accuracy with only a slight increase in computational complexity. Bold font highlights the best performance for each segmentation metric.
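Since mIoU is the main metric in the comparison, a minimal sketch of how it is commonly computed from flattened label maps is given below. The class indexing (0 = background, 1 = flocs, 2 = filamentous bacteria) and the function names are assumptions for illustration; the evaluation code used in the experiments is not described in the text.

```python
def confusion_matrix(pred, target, num_classes):
    """cm[t][p] counts pixels with ground-truth class t predicted as class p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        cm[t][p] += 1
    return cm

def iou_per_class(cm):
    """IoU = TP / (TP + FP + FN) for each class; NaN if the class never occurs."""
    n = len(cm)
    ious = []
    for s in range(n):
        tp = cm[s][s]
        fp = sum(cm[t][s] for t in range(n)) - tp
        fn = sum(cm[s]) - tp
        denom = tp + fp + fn
        ious.append(tp / denom if denom else float("nan"))
    return ious

def mean_iou(cm):
    """Average the per-class IoU values, ignoring classes with no pixels."""
    vals = [v for v in iou_per_class(cm) if v == v]  # v == v filters out NaN
    return sum(vals) / len(vals)
```

Because every class contributes equally to the mean, a rare class such as filamentous bacteria affects mIoU as much as the background does, which is why the metric is sensitive to minority-class errors.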

Figure 4 shows the segmentation obtained by each model individually. Regions (I), (II), and (III) are selected for comparative analysis. The presence of artifacts poses challenges in the identification of flocs and filamentous bacteria within regions (II) and (III). Additionally, poor contrast and artifacts further obscure the boundaries of filamentous bacteria in region (I). FafFormer incorporates MiT and PPM to extract features, and FAFM is proposed to recover image-boundary features. It allows for more accurate segmentation of flocs and filamentous bacteria boundaries compared to other methods that are less effective at capturing boundary details. FafFormer can enhance the segmentation performance of flocs and filamentous bacteria with diverse morphology as well as the recovery of crucial boundary information.

To evaluate the generality of our approach, we conduct experiments on two further datasets, Cityscapes and ADE20K. These two datasets are widely used to evaluate and compare the performance of different semantic segmentation models. Table 2 reports the performance (mIoU) and computational demands of various models on the ADE20K and Cityscapes datasets. FafFormer surpasses its counterparts in mIoU scores. Though its computational load (FLOPs) exceeds that of Segformer, it remains below that of DeepLabV3+ and SFNet. FafFormer achieves a good balance between model performance and computational cost.

Fig. 4 Comparison of segmentation results for sludge microscopic images. a Original image, b FCN, c U-Net, d SegNet, e SFNet, f SFSegNets, g DeepLabV3+, h Segformer, i FafFormer

3.4 Ablation experiments

The PPM module is added into the encoder to improve the model segmentation performance by extracting semantic feature information at different scales. The ablation results of the encoder are shown in Table 3. It shows that after the addition of PPM, the model’s ability to segment filaments and background targets is improved, while the floc target is slightly weakened. There is still an improvement in overall performance. Table 4 presents ablation experiments conducted with different modules. Table 4 illustrates the slight increase in computational complexity and model parameters after combining PPM with encoders and FAFM. FafFormer enhances the model’s segmentation performance, which could be crucial in accurately handling various morphologies of flocs and filamentous bacteria.

Table 2 Comparison results of different datasets
Table 3 Ablation experiments of the encoder

Ablation experiments are conducted using different upsampling techniques in the decoder, e.g., bilinear interpolation, nearest neighbor interpolation, FAM, and FAFM. Table 5 shows the experimental segmentation metrics. FAFM exhibits optimal performance in terms of mIoU compared to traditional upsampling methods, followed by FAM. The concatenation operation in FAFM fuses shallow and deep semantic information.

Filamentous bacteria are difficult to segment accurately in the experiments because of the class imbalance between flocs and filamentous bacteria. A comparative analysis is conducted using different loss functions (e.g., the Cross-Entropy Loss, Focal Loss, Lovász-Softmax Loss, and Focal–Lovász Loss) to optimize the segmentation accuracy of the model. The Focal–Lovász Loss demonstrates superior performance in terms of mIoU compared to the other loss functions (Table 6). The Focal–Lovász Loss addresses the class imbalance and significantly improves the segmentation accuracy of filamentous bacteria. It mitigates the negative effects of class imbalance by assigning higher weights to the minority categories during training. The model thus focuses on learning filamentous bacteria features, which improves segmentation accuracy under an unbalanced class distribution.

Table 4 Ablation experiments with different modules
Table 5 Ablation experiments of the decoder
Table 6 Ablation experiments of model Loss

4 Conclusion

ASPCM images pose many challenges, such as low contrast, artifacts, and class imbalance, which significantly affect their segmentation. FafFormer was developed as a Transformer-based model to improve the segmentation performance of ASPCM images through pyramid pooling and flow alignment fusion. MiT and PPM were applied within the encoder to extract the features of flocs and filamentous bacteria considering their different morphology. FAFM was designed within the decoder to restore the boundary information of flocs and filamentous bacteria. FAFM used the generated semantic flow as additional information for fine-grained upsampling and fused it with the multi-scale features from the encoder. The Focal–Lovász Loss combined the Focal Loss and the Lovász-Softmax Loss to improve segmentation accuracy and address the class imbalance. The experimental evaluation of image segmentation was performed on a dataset obtained from a municipal wastewater treatment plant. The superiority of FafFormer over existing models was validated in terms of accuracy and reliability, particularly in filamentous bacteria segmentation. A lightweight version of the model will be explored as part of our future research.