1 Introduction

Semantic segmentation is a fundamental vision task that aims to classify every pixel in an image correctly. It supports many real-world applications, including autonomous driving, robot navigation, and image editing. The seminal work of Long et al. (2015) built a deep Fully Convolutional Network (FCN), mainly composed of convolutional layers, to learn strong semantic representations. However, detailed object boundary information, which is also crucial to performance, is usually lost due to the down-sampling layers.

To alleviate this problem, state-of-the-art methods (Zhao et al., 2017, 2018; Jun et al., 2019; Zhu et al., 2019) apply atrous convolutions (Yu & Koltun, 2016) at the last several stages of their networks to yield feature maps with strong semantic representation while maintaining high resolution. Meanwhile, several state-of-the-art approaches (Xiao et al., 2018; Chen et al., 2018; Li et al., 2020) adopt multiscale feature representation to enhance the final segmentation results. Recently, several methods (Cheng et al., 2021; Wang et al., 2021; Zheng et al., 2021) adopt vision transformer architectures and model semantic segmentation as a per-segment prediction problem. In particular, they achieve stronger performance on long-tailed datasets, including ADE-20k (Zhou et al., 2016) and COCO-stuff (Caesar et al., 2018), due to stronger pre-trained models (Liu et al., 2021) and query-based mask representation (Carion et al., 2020).

Although these methods achieve state-of-the-art results on various benchmarks, one fundamental problem remains: real-time inference speed, particularly for high-resolution inputs. For example, an FCN using ResNet-18 (He et al., 2016) as the backbone runs at 57.2 FPS for a \(1024\times 2048\) image, but after applying atrous convolutions (Yu & Koltun, 2016) to the network as done in Zhao et al. (2017, 2018), the modified network runs at only 8.7 FPS. Moreover, on a single GTX 1080Ti GPU with no other ongoing programs, the previous state-of-the-art model PSPNet (Zhao et al., 2017) has a frame rate of only 1.6 FPS for \(1024 \times 2048\) input images. This is problematic for many advanced real-world applications, such as self-driving cars and robot navigation, which demand real-time online data processing.

In order to not only maintain detailed resolution information but also obtain features with strong semantic representation, another direction is to build FPN-like models (Lin et al., 2017; Kirillov et al., 2019; Ronneberger et al., 2015) that leverage lateral paths to fuse feature maps in a top-down manner. In this way, the deep features of the last several layers strengthen the shallow, high-resolution features, so the refined features can satisfy both requirements and benefit accuracy. Such designs are mainly adopted by real-time semantic segmentation models. However, the accuracy of these methods (Ronneberger et al., 2015; Badrinarayanan & Kendall, 2017; Orsic et al., 2019; Peng et al., 2022) still lags behind networks that hold large feature maps in the last several stages. Is there a better solution for high-accuracy, high-speed semantic segmentation? We suspect that the low accuracy arises from the ineffective propagation of semantics from deep layers to shallow layers, where the semantics are not well aligned across different stages.

Fig. 1
figure 1

Inference speed versus mIoU performance on the Cityscapes test set. Previous models are marked as blue points, while our models are shown as red and green points, which achieve the best speed/accuracy trade-off. Note that our methods with ResNet-18 as the backbone achieve accuracy comparable to the accurate models at a much faster speed. SFNet methods are the green nodes, while SFNet-Lite methods are the red nodes (Color figure online)

To mitigate this issue, we propose explicitly learning the Semantic Flow between two network layers of different resolutions. Semantic Flow is inspired by optical flow, which is widely used in video processing tasks (Zhu et al., 2017) to represent the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by relative motion. We observe that the relationship between two feature maps of arbitrary resolutions from the same image can also be represented by the “motion” of every pixel from one feature map to the other. In this case, once precise Semantic Flow is obtained, the network is able to propagate semantic features with minimal information loss. It should be noted that Semantic Flow differs from optical flow, since Semantic Flow takes feature maps from different levels as input and assesses the discrepancy between them to find a suitable flow field that dynamically indicates how to align these two feature maps effectively.

Based on the concept of Semantic Flow, we design a novel network module called the Flow Alignment Module (FAM) to utilize Semantic Flow in semantic segmentation. Feature maps after FAM carry both rich semantics and abundant spatial information. Because FAM effectively transmits semantic information from deep to shallow layers through elementary operations, it improves accuracy while keeping high efficiency. Moreover, FAM is end-to-end trainable and can be plugged into any backbone network to improve the results with minor computational overhead. For simplicity, we refer to the networks that incorporate FAM, regardless of backbone, as SFNet. As depicted in Fig. 1, SFNets with different backbone networks outperform competitors by a large margin at the same speed. In particular, our method adopting ResNet-18 as the backbone achieves 79.8% mIoU on the Cityscapes test server with a frame rate of 33 FPS. When adopting DF2 (Li et al., 2019) as the backbone, our method achieves 77.8% mIoU at 103 FPS, and 74.5% mIoU at 134 FPS when equipped with the DF1 backbone (Li et al., 2019). The results are shown in Fig. 1 (green nodes).

The original SFNet (Li et al., 2020) achieves a satisfactory speed/accuracy trade-off, and several follow-up works (Huang et al., 2021) generalize the idea of SFNet to other domains. However, the inference speed of SFNet is still limited by the multi-stage feature fusion involved. To speed up SFNet while maintaining its accuracy, we propose a new version named SFNet-Lite. In particular, we design a new flow alignment module named Gated Dual Flow Alignment Module (GD-FAM). Following FAM, GD-FAM takes two features as inputs and learns two semantic flows to refine both high-resolution and low-resolution features simultaneously. Meanwhile, we also generate a shared gate map to dynamically control the flow warping process before the final addition. The newly proposed GD-FAM is appended at the end of the SFNet backbone only once, directly refining the highest- and lowest-resolution features. Such a design avoids repeated multiscale feature fusion and speeds up SFNet by a large margin. Moreover, to preserve the original accuracy, we carry out extensive experiments on Cityscapes by introducing more balanced dataset training (Zhu et al., 2019). As a result, our SFNet-Lite with a ResNet-18 backbone achieves 80.1% mIoU on the Cityscapes test set at 49 FPS (a 16 FPS improvement with slightly better performance over the original SFNet (Li et al., 2020)). Moreover, when adopting the STDCv1 backbone, our method achieves 78.7% mIoU while running at 120 FPS. The results are shown in Fig. 1 (red nodes).

Since various driving datasets (Fisher et al., 2020; Varma et al., 2019; Cordts et al., 2016) come from different domains, previous real-time semantic segmentation methods train a separate model on each dataset, which makes the trained models sensitive to their training domain and unable to generalize well to unseen domains (Choi et al., 2021). Recently, MSeg (Lambert et al., 2020) proposed a mixed dataset for multi-dataset semantic segmentation to achieve one model for multi-dataset training and testing. Motivated by the above, we verify whether our SFNet series can be effective on a unified dataset benchmark. Firstly, we benchmark our SFNet and SFNet-Lite on various driving datasets (Fisher et al., 2020; Neuhold et al., 2017; Varma et al., 2019) in the experiment part. Secondly, we create a challenging benchmark by mixing four challenging driving datasets: Cityscapes, Mapillary, BDD, and IDD. We term the merged dataset Unified Driving Segmentation (UDS). As shown in Fig. 2, our goal is to train a unified model to perform semantic segmentation on various scenes. To the best of our knowledge, UDS is the largest public semantic segmentation dataset for the driving scene. In particular, we adopt the 19 semantic classes defined by Cityscapes and BDD and merge several classes in Mapillary. We further benchmark representative works on UDS. Our SFNet again achieves the best accuracy and speed trade-off, which indicates the generalization ability of semantic flow. In particular, using DFNet (Li et al., 2019) as the backbone, our SFNet and SFNet-Lite achieve 7–9% mIoU improvements on UDS. This indicates that our proposed FAM and GD-FAM are well suited to multi-dataset training.

Fig. 2
figure 2

Illustration of the merged Unified Driving Segmentation (UDS) benchmark. It contains four datasets: a Cityscapes (Cordts et al., 2016), b IDD (Varma et al., 2019), c Mapillary (Neuhold et al., 2017) and d BDD (Fisher et al., 2020). These datasets have various styles and texture information, which makes the merged UDS dataset more challenging

A preliminary version of this work was published in Li et al. (2020). In this paper, we make the following significant extensions: (1) We introduce a new flow alignment module (GD-FAM) to increase the speed of SFNet while maintaining the original performance. Experiments show that this new design consistently outperforms our previous module with higher inference efficiency. (2) We conduct more comprehensive ablation studies to verify the proposed method, including quantitative improvements over baselines and visualization analysis. (3) We extend SFNet to panoptic segmentation, where we achieve 1.0–1.5% PQ improvements over three strong baselines. (4) We further benchmark SFNet and several recent representative methods on two more challenging datasets, Mapillary (Neuhold et al., 2017) and IDD (Varma et al., 2019). Our SFNet series improves significantly over different baselines and achieves the best speed and accuracy trade-off. In particular, we propose a new setting for training a unified real-time semantic segmentation model by merging existing driving datasets (UDS). Our SFNet series also achieves the best accuracy and speed trade-off on this setting and can serve as a solid baseline for mixed driving segmentation. We further verify the effectiveness of SFNet and SFNet-Lite with transformer architectures on the ADE20k dataset. Moreover, aided by RobustNet (Choi et al., 2021), we further show the effectiveness of SFNet in the domain generalization setting.

2 Related Work

Generic Semantic Segmentation Current state-of-the-art methods for semantic segmentation are based on the FCN framework, which treats semantic segmentation as a dense pixel classification problem. Many methods focus on global context modeling with a dilated backbone. Global average pooled features are concatenated into existing feature maps in Liu et al. (2015). In PSPNet (Zhao et al., 2017), average pooled features of multiple window sizes, including global average pooling, are upsampled to the same size and concatenated together to enrich global information. The DeepLab variants (Chen et al., 2015, 2017, 2018) propose atrous or dilated convolutions and atrous spatial pyramid pooling (ASPP) to increase the effective receptive field. DenseASPP (Yang et al., 2018) improves on Chen et al. (2018) by densely connecting convolutional layers with different dilation rates to further increase the receptive field of the network. In addition to concatenating global information into feature maps, multiplying global information into feature maps also shows better performance (Zhang et al., 2018; Woo et al., 2018; Yue et al., 2018; Changqian et al., 2018). Moreover, several works adopt the self-attention design to encode the global information of the scene. Using the non-local operator (Wang et al., 2018), impressive results are achieved in Yuan and Wang (2021); Zhang et al. (2019); Jun et al. (2019). CCNet (Huang et al., 2019) models long-range dependencies by considering the surrounding pixels on a criss-cross path in a recurrent manner to save computation and memory cost. Meanwhile, several works (Ronneberger et al., 2015; Xiao et al., 2018; Kirillov et al., 2019; Li et al., 2021; He et al., 2021) adopt an encoder-decoder architecture to learn multi-level feature representation. RefineNet (Lin et al., 2017) and DFN (Changqian et al., 2018) adopt encoder-decoder structures that fuse information from low-level and high-level layers for dense prediction. Following such architecture designs, GFFNet (Li et al., 2020), CCLNet (Ding et al., 2018), and G-SCNN (Takikawa et al., 2019) use gates for feature fusion to avoid noise and feature redundancy. AlignSeg (Huang et al., 2021) proposes to refine multiscale features via a bottom-up design. IFA (Hanzhe et al., 2022) proposes an implicit feature alignment function to refine the multiscale feature representation. In contrast, our method transmits semantic information top-down and focuses on real-time applications. Moreover, few of these works can perform inference in real time, which makes them hard to employ in practical applications.

Vision Transformer based Semantic Segmentation Recently, transformer-based approaches (Dosovitskiy et al., 2021; Liu et al., 2021; Zheng et al., 2021; Yuan et al., 2022) replace CNN backbones with vision transformers and achieve more robust results. Several works (Zheng et al., 2021; Liu et al., 2021; Xie et al., 2021; Strudel et al., 2021) show that the vision transformer backbone leads to better results on long-tailed datasets due to better feature representation and stronger pre-training on ImageNet classification. SETR (Zheng et al., 2021) replaces pixel-level modeling with token-based modeling, while Segformer (Xie et al., 2021) proposes a new efficient backbone for segmentation. Moreover, several works (Wang et al., 2021; Cheng et al., 2021; Zhang et al., 2021) adopt the Detection Transformer (DETR) (Carion et al., 2020) to turn per-pixel prediction into per-mask prediction. In particular, Maskformer (Cheng et al., 2021) treats pixel-level dense prediction as a set prediction problem. However, none of these works can perform inference in real time due to their huge computation cost.

Fast Semantic Segmentation Fast (real-time) semantic segmentation algorithms attract increasing attention from practical applications that demand fast inference and response, and several works are designed for this setting. ICNet (Zhao et al., 2018) uses multiscale images as input and a cascade network to be more efficient. DFANet (Li et al., 2019) utilizes a light-weight backbone to speed up its network and proposes a cross-level feature aggregation to boost accuracy, while SwiftNet (Orsic et al., 2019) uses lateral connections as a cost-effective solution to restore the prediction resolution while maintaining speed. ESPNets (Mehta et al., 2018, 2019) save computation by decomposing standard convolution into point-wise convolution and a spatial pyramid of atrous convolutions. BiSeNets (Changqian et al., 2018, 2021) introduce a spatial path and a semantic path to reduce computation. Recently, several methods (Nekrasov et al., 2019; Zhang et al., 2019; Li et al., 2019) use AutoML techniques to search efficient architectures for scene parsing. Moreover, several works (Fan et al., 2021; Si et al., 2019) use multi-branch architectures to improve real-time segmentation results. However, these works yield inferior segmentation results compared with the general methods on multiple benchmarks such as Cityscapes (Cordts et al., 2016) and Mapillary (Neuhold et al., 2017). Our previous work SFNet (Li et al., 2020) achieves high accuracy by learning semantic flow between multiscale features while running in real time. However, its inference speed is still limited since multiple multiscale features are involved. Moreover, the capacity of multiscale features needs to be better explored via stronger data augmentation and pre-training. Thus, simultaneously achieving high speed and high accuracy remains challenging and is of great importance for real-time applications.

Panoptic Segmentation Earlier works (Kirillov et al., 2019; Li et al., 2019; Chen et al., 2020; Porzi et al., 2019; Yang et al., 2020) model both stuff segmentation and thing segmentation in one model with different task heads. Detection-based methods (Xiong et al., 2019; Kirillov et al., 2019; Qiao et al., 2021; Hou et al., 2020) usually represent things with box predictions, while several bottom-up models (Cheng et al., 2020; Wang et al., 2020) group instances via pixel-level affinity or center heat maps on top of semantic segmentation results. The former introduces a complex merging process, while the latter suffers from performance drops in complex scenarios. Recently, several works (Wang et al., 2021; Zhang et al., 2021; Cheng et al., 2021) propose to directly obtain segmentation masks without box supervision. However, all of these works ignore the speed issue. In the experiments, we further show that our method also leads to better panoptic segmentation results.

Lightweight Architecture Design Another critical research direction is to design more efficient backbones for downstream tasks via various approaches (Howard et al., 2017; Sandler et al., 2018; Ma et al., 2018; Fan et al., 2021). These methods focus on efficient representation learning through carefully designed or searched network architectures. Our work is orthogonal to them, since we aim to design a lightweight and aligned segmentation head.

Multi-dataset Segmentation MSeg (Lambert et al., 2020) first proposes to merge most existing datasets into one unified taxonomy and trains a unified segmentation model for various scenes. Meanwhile, several follow-up works (Zhou et al., 2022; Li et al., 2022) explore multi-dataset segmentation or detection. Compared with MSeg, our UDS dataset focuses mainly on the driving scene and has only 19 classes, compared with more than 100 classes in MSeg. The input images are high resolution and target autonomous-driving applications.

Domain Generalization in Segmentation Domain generalization (DG) methods (Wang et al., 2022) assume that the model cannot access the target domain during training and aim to improve its generalization ability so that it performs well in an unseen target domain. DG is slightly different from multi-dataset segmentation. For segmentation, several works (Pan et al., 2018; Yue et al., 2019; Kim et al., 2022; Choi et al., 2021) adopt synthetic data such as GTAV for training and real datasets such as Cityscapes for testing. Recently, RobustNet (Choi et al., 2021) disentangles the domain-specific style and domain-invariant content encoded in higher-order statistics. Our method can also be applied in DG segmentation settings by combining it with RobustNet (Choi et al., 2021), where we also observe significant improvements over various baselines.

Fig. 3
figure 3

Visualization of feature maps and semantic flow field in FAM. Feature maps are visualized by averaging along the channel dimension. Larger values are denoted by hot colors and vice versa. We use the color code proposed in Baker et al. (2011) to visualize the Semantic Flow field. The orientation and magnitude of flow vectors are represented by hue and saturation, respectively. As shown in this figure, using our proposed semantic flow results in more structural feature representation

3 Method

In this section, we will first provide some preliminary knowledge about real-time semantic segmentation and introduce the misalignment problem therein. Then, we propose the Flow Alignment Module (FAM) to resolve the misalignment issue by learning Semantic Flow and warping top-layer feature maps accordingly. We also present the design of SFNet. Next, we introduce the proposed SFNet-Lite and the improved GD-FAM to speed up SFNet. Finally, we describe the building process of our UDS dataset and several improvement details for SFNet-Lite training.

3.1 Preliminary

The task of scene parsing is to map an RGB image \({\textbf{X}}\in {\mathbb {R}}^{H\times W \times 3}\) to a semantic map \({\textbf{Y}}\in {\mathbb {R}}^{H\times W \times C}\) with the same spatial resolution \(H\times W\), where C is the number of predefined semantic categories. Following the setting of FPN (Lin et al., 2017), the input image \({\textbf{X}}\) is first mapped to a set of feature maps \(\{{\textbf{F}}_l\}_{l=2,\ldots ,5}\) from each network stage, where \({\textbf{F}}_l \in {\mathbb {R}}^{H_l \times W_l \times C_l}\) is a \(C_l\)-dimensional feature map defined on a spatial grid \(\varOmega _l\) with size of \(H_l \times W_l, H_l = \frac{H}{2^l}, W_l = \frac{W}{2^l}\). The coarsest feature map \({\textbf{F}}_5\) comes from the deepest layer with the strongest semantics. FCN-32s directly predicts upon \({\textbf{F}}_5\) and produces over-smoothed results without fine details, and some improvement can be achieved by fusing predictions from lower levels (Long et al., 2015). FPN takes a step further and gradually fuses high-level feature maps with low-level feature maps in a top-down pathway through \(2\times \) bilinear upsampling; it was originally proposed for object detection (Lin et al., 2017) and recently introduced to scene parsing (Xiao et al., 2018; Kirillov et al., 2019). The whole FPN framework relies heavily on the upsampling operator to enlarge the spatially smaller but semantically stronger feature maps. However, bilinear upsampling recovers the resolution of downsampled feature maps by interpolating a set of uniformly sampled positions (i.e., it can only handle one fixed, predefined kind of misalignment), whereas the misalignment between feature maps, caused by residual connections and repeated downsampling and upsampling operations, is far more complex. Therefore, position correspondence between feature maps needs to be explicitly and dynamically established to resolve their actual misalignment.
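For concreteness, the following PyTorch sketch shows the standard FPN-style top-down fusion with fixed bilinear upsampling that the rest of this section refines; the channel widths and layer names are illustrative assumptions rather than any released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNTopDown(nn.Module):
    """Minimal FPN-style top-down fusion with fixed bilinear upsampling.

    Lateral 1x1 convs project every stage to a common channel depth; the
    coarser map is then upsampled and added element-wise. The fixed bilinear
    kernel is the source of the misalignment discussed above.
    """

    def __init__(self, in_channels=(64, 128, 256, 512), out_channels=128):
        super().__init__()
        self.laterals = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels
        )

    def forward(self, feats):  # feats: [F2, F3, F4, F5], fine to coarse
        laterals = [lat(f) for lat, f in zip(self.laterals, feats)]
        out = laterals[-1]
        outputs = [out]
        for lateral in reversed(laterals[:-1]):
            out = lateral + F.interpolate(
                out, size=lateral.shape[-2:], mode="bilinear", align_corners=False
            )
            outputs.insert(0, out)
        return outputs  # refined maps, fine to coarse
```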

3.2 Original Flow Alignment Module and SFNet

Design Motivation For more flexible and dynamic alignment, we thoroughly investigate the idea of optical flow, which is very effective and flexible for aligning features of two adjacent video frames in video processing tasks (Brox et al., 2004; Zhu et al., 2017). The idea of optical flow motivates us to design a flow-based alignment module (FAM) to align feature maps of two adjacent levels by predicting a flow field inside the network. We define such a flow field as Semantic Flow, which is generated between different levels of a feature pyramid.

Module Details FAM is built within the FPN framework, where the feature map of each level is compressed into the same channel depth by two 1\(\times \)1 convolution layers before being passed on to the next level. Given two adjacent feature maps \({\textbf{F}}_{l}\) and \({\textbf{F}}_{l-1}\) with the same channel number, we up-sample \({\textbf{F}}_{l}\) to the same size as \({\textbf{F}}_{l-1}\) via a bilinear interpolation layer. Then, we concatenate them and feed the concatenated feature map into a sub-network that contains two convolutional layers with a kernel size of \(3\times 3\). The output of the sub-network is the prediction of the semantic flow field \(\varDelta _{l-1} \in {\mathbb {R}}^{H_{l-1} \times W_{l-1} \times 2}\). Mathematically, the aforementioned steps can be written as:

$$\begin{aligned} \varDelta _{l-1} = \text {conv}_l(\text {cat}({\textbf{F}}_{l}, {\textbf{F}}_{l-1})), \end{aligned}$$
(1)

where \(\text {cat}(\cdot )\) represents the concatenation operation and \(\text {conv}_l(\cdot )\) denotes the \(3\times 3\) convolutional layers. Since our network adopts strided convolutions, which yield feature maps of low resolution, the receptive field of the \(3\times 3\) convolutions in \(\text {conv}_l\) is in most cases sufficient to cover large objects in the feature map. Note that we discard the correlation layer proposed in FlowNet-C (Dosovitskiy et al., 2015), where positional correspondence is calculated explicitly: there exists a large semantic gap between higher-level and lower-level layers, so explicit correspondence calculation on such features is difficult and tends to fail for offset prediction. Furthermore, including a correlation layer would increase the computational cost substantially, which contradicts our objective of developing a fast and accurate network.
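To make Eq. 1 concrete, below is a minimal PyTorch sketch of the flow-prediction sub-network; the channel width, normalization, and activation layers are assumptions for illustration rather than the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlowHead(nn.Module):
    """Semantic-flow prediction of Eq. (1): a sketch that assumes both inputs
    were already compressed to the same channel depth `channels`."""

    def __init__(self, channels=128):
        super().__init__()
        # two 3x3 convs on the concatenated features, ending in a 2-channel flow
        self.flow = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 2, kernel_size=3, padding=1),
        )

    def forward(self, f_high, f_low):
        # f_high: coarse feature F_l, f_low: finer feature F_{l-1}
        f_high_up = F.interpolate(
            f_high, size=f_low.shape[-2:], mode="bilinear", align_corners=False
        )
        return self.flow(torch.cat([f_high_up, f_low], dim=1))  # Delta_{l-1}
```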

Fig. 4
figure 4

a The details of the Flow Alignment Module. We combine the transformed high-resolution feature map and low-resolution feature map to generate the semantic flow field, which is utilized to warp the low-resolution feature map to a high-resolution feature map. b Warp procedure of the Flow Alignment Module. The value of the high-resolution feature map is the bilinear interpolation of the neighboring pixels in the low-resolution feature map, where the neighborhoods are defined according to the learned semantic flow field. c Overview of our proposed SFNet. A ResNet-18 backbone with four stages is used for exemplar illustration. FAM: Flow Alignment Module. PPM: Pyramid Pooling Module (Zhao et al., 2017). Best viewed in color and zoomed in

After computing \(\varDelta _{l-1}\), each position \(p_{l-1}\) on the spatial grid \(\varOmega _{l-1}\) is mapped to a point \(p_{l}\) on the upper level l via a simple addition operation. Since there is a resolution gap between the features and the flow field, as shown in Fig. 4b, the warped grid and its offset should be halved as in Eq. 2,

$$\begin{aligned} p_{l} = \frac{p_{l-1}+\varDelta _{l-1}(p_{l-1})}{2}. \end{aligned}$$
(2)

We then use the differentiable bilinear sampling mechanism proposed in spatial transformer networks (Jaderberg et al., 2015), which linearly interpolates the values of the 4 neighbors (top-left, top-right, bottom-left, and bottom-right) of \(p_{l}\) to approximate the final output of the FAM, denoted by \({{\widetilde{{\textbf{F}}}}}_l(p_{l-1})\). Mathematically,

$$\begin{aligned} {{\widetilde{{\textbf{F}}}}}_l(p_{l-1}) = {\textbf{F}}_l(p_{l}) = \sum _{p\in {\mathcal {N}}(p_{l})} w_p{\textbf{F}}_{l}(p), \end{aligned}$$
(3)

where \({\mathcal {N}}(p_{l})\) represents the neighbors of the warped point \(p_l\) in \({\textbf{F}}_l\) and \(w_p\) denotes the bilinear kernel weights estimated from the distances on the warped grid. This warping procedure may look similar to the convolution operation with deformable kernels in the deformable convolution network (DCN) (Dai et al., 2017). However, our method has noticeable differences from DCN. First, our predicted offset field incorporates both higher-level and lower-level features to align the positions between high-level and low-level feature maps, while the offset field of DCN moves kernel positions according to the predicted location offsets in order to obtain larger and more adaptive receptive fields. Second, our module focuses on aligning features, while DCN works more like an attention mechanism that attends to the salient parts of objects. A more detailed comparison can be found in the experiment part.
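In practice, the warping of Eqs. 2 and 3 can be realized with PyTorch's built-in grid_sample, which performs exactly this differentiable bilinear sampling. The sketch below assumes an (x, y) channel order for the flow and a simple pixel-to-normalized-coordinate scaling; since the flow is learned end-to-end, any fixed scale is absorbed during training.

```python
import torch
import torch.nn.functional as F

def flow_warp(feature, flow):
    """Warp a coarse feature map to the resolution of `flow` (Eqs. 2-3).

    feature: (N, C, h, w) coarse map F_l.
    flow:    (N, 2, H, W) semantic flow at the fine resolution, in pixel units.
    Returns the warped feature of shape (N, C, H, W). Sampling in grid_sample's
    normalized [-1, 1] coordinates implicitly handles the resolution gap of Eq. 2.
    """
    n, _, out_h, out_w = flow.size()
    # regular sampling grid over the fine resolution, normalized to [-1, 1]
    ys = torch.linspace(-1.0, 1.0, out_h, device=flow.device, dtype=flow.dtype)
    xs = torch.linspace(-1.0, 1.0, out_w, device=flow.device, dtype=flow.dtype)
    grid_y = ys.view(-1, 1).expand(out_h, out_w)
    grid_x = xs.view(1, -1).expand(out_h, out_w)
    grid = torch.stack((grid_x, grid_y), dim=2).unsqueeze(0).expand(n, -1, -1, -1)
    # shift every position by the predicted flow, converted to normalized units
    norm = torch.tensor([out_w, out_h], device=flow.device, dtype=flow.dtype)
    grid = grid + flow.permute(0, 2, 3, 1) / norm
    # differentiable bilinear sampling of the 4 neighbors (Eq. 3)
    return F.grid_sample(feature, grid, mode="bilinear", align_corners=True)
```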

On the whole, the proposed FAM module is light-weight and end-to-end trainable because it only contains one 3\(\times \)3 convolution layer and one parameter-free warping operation in total. Besides these merits, it can be plugged into networks multiple times with only minor extra computational overhead. Figure 4a gives the detailed settings of the proposed module, while Fig. 4b shows the warping process. Figure 3 visualizes the feature maps of two adjacent levels, their learned semantic flow, and the finally warped feature map. As shown in Fig. 3, the warped feature is structurally neater than the normal bilinearly upsampled feature and leads to a more consistent representation of objects, such as the bus and car.

Figure 4c illustrates the whole network architecture, which contains a bottom-up pathway as the encoder and a top-down pathway as the decoder. While the encoder has a backbone network offering feature representations at different levels, the decoder can be seen as an FPN equipped with several FAMs.

Fig. 5
figure 5

a The details of GD-FAM (Gated Dual Flow Alignment Module). We combine the transformed high-resolution feature map and low-resolution feature map to generate two semantic flow fields and one shared gate map. The semantic flows are utilized to warp both the low-resolution feature map and the high-resolution feature map. The gate controls the fusion process. b Overview of our proposed SFNet-Lite. A ResNet-18 backbone with four stages is used for exemplar illustration. GD-FAM: Gated Dual Flow Alignment Module. PPM: Pyramid Pooling Module (Zhao et al., 2017). Best viewed in color and zoomed in

Encoder Part We choose standard networks pre-trained on ImageNet (Russakovsky et al., 2015) for image classification as our backbone network, removing the last fully connected layer. Specifically, our experiments use and compare the ResNet series (He et al., 2016) and DF series (Li et al., 2019). All backbones consist of 4 stages with residual blocks, and each stage starts with a convolutional layer with a stride of 2 that downsamples the feature map, which yields both computational efficiency and larger receptive fields. We additionally adopt the Pyramid Pooling Module (PPM) (Zhao et al., 2017) for its superior power to capture contextual information. In our setting, the output of PPM shares the same resolution as that of the last residual module, so we treat PPM and the last residual module together as the last stage for the upcoming FPN. Other modules like ASPP (Chen et al., 2017) can also be plugged into our network, which is ablated in the experiment part.
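For reference, a minimal sketch of the Pyramid Pooling Module used here is given below, following the original PSPNet design with {1, 2, 3, 6} pooling bins; the reduction width and the projection convolution are assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PPM(nn.Module):
    """Pyramid Pooling Module (Zhao et al., 2017): pool the input at several
    bin sizes, project each pooled map, upsample, and concatenate with the input."""

    def __init__(self, in_channels, out_channels, bins=(1, 2, 3, 6)):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Sequential(
                nn.AdaptiveAvgPool2d(bin_size),
                nn.Conv2d(in_channels, in_channels // len(bins), 1, bias=False),
                nn.ReLU(inplace=True),
            )
            for bin_size in bins
        )
        self.project = nn.Conv2d(in_channels * 2, out_channels, 3, padding=1)

    def forward(self, x):
        size = x.shape[-2:]
        pooled = [
            F.interpolate(stage(x), size=size, mode="bilinear", align_corners=False)
            for stage in self.stages
        ]
        return self.project(torch.cat([x] + pooled, dim=1))
```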

Aligned FPN Decoder Our SFNet decoder takes feature maps from the encoder and uses the aligned feature pyramid for final scene parsing. By replacing normal bilinear up-sampling with FAM in the top-down pathway of FPN (Lin et al., 2017), \(\{{\textbf{F}}_l\}_{l=2}^4\) is refined to \(\{{\widetilde{{\textbf{F}}}}_l\}_{l=2}^4\), where top-level feature maps are aligned and fused into their bottom levels via element-wise addition and l indexes the feature pyramid level. For scene parsing, \(\{{\widetilde{{\textbf{F}}}}_l\}_{l=2}^4 \cup \{{\textbf{F}}_5\}\) are up-sampled to the same resolution (i.e., 1/4 of the input image) and concatenated together for prediction, as sketched below. Since misalignment can still exist in this step, we also replace these up-sampling operations with the proposed FAM. Note that we only verify the effectiveness of this design in the ablation studies; our final models for real-time application do not contain this replacement, for a better speed/accuracy trade-off.
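The following sketch assembles the aligned top-down pathway by reusing the FlowHead and flow_warp sketches above; channel numbers, the number of classes, and the final classifier are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignedFPNDecoder(nn.Module):
    """Aligned FPN decoder sketch: FAM (flow prediction + warping) replaces the
    fixed bilinear upsampling of the plain FPN top-down pathway."""

    def __init__(self, in_channels=(64, 128, 256, 512), channels=128, num_classes=19):
        super().__init__()
        self.laterals = nn.ModuleList(
            nn.Conv2d(c, channels, kernel_size=1) for c in in_channels
        )
        self.fams = nn.ModuleList(FlowHead(channels) for _ in in_channels[:-1])
        self.classifier = nn.Conv2d(channels * len(in_channels), num_classes, 1)

    def forward(self, feats):  # feats: [F2, F3, F4, F5], fine to coarse
        lats = [lat(f) for lat, f in zip(self.laterals, feats)]
        out = lats[-1]
        fused = [out]
        for lat, fam in zip(reversed(lats[:-1]), reversed(self.fams)):
            flow = fam(out, lat)              # Eq. (1)
            out = lat + flow_warp(out, flow)  # Eqs. (2)-(3) + element-wise addition
            fused.insert(0, out)
        # upsample everything to the finest (1/4 input) resolution and concatenate
        size = fused[0].shape[-2:]
        fused = [
            F.interpolate(f, size=size, mode="bilinear", align_corners=False)
            for f in fused
        ]
        # the 1/4-resolution prediction is finally upsampled to the input size
        return self.classifier(torch.cat(fused, dim=1))
```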

3.3 Gated Dual Flow Alignment Module and SFNet-Lite

Motivation The original SFNet adopts a multi-stage flow-based alignment process, which leads to a slower speed than several representative networks (Changqian et al., 2018; Zhao et al., 2018). Since lightweight backbone design is not our main focus, we explore a more compact decoder with only one flow alignment module. Simply decreasing the number of FAMs leads to inferior results (shown in the experiment part, see Table 9(d)). To close this gap, motivated by the recent success of gating designs in segmentation (Takikawa et al., 2019; Li et al., 2020), we propose a new FAM variant named Gated Dual Flow Alignment Module (GD-FAM) to directly align and fuse the highest-resolution and lowest-resolution features. Since there is only one alignment step, fewer operators are involved and the inference time decreases.

Gated Dual Flow Alignment Module Like FAM, GD-FAM takes two features, \({\textbf{F}}_4\) and \({\textbf{F}}_1\), as inputs and directly outputs a refined high-resolution feature. We up-sample \({\textbf{F}}_{4}\) to the same size as \({\textbf{F}}_{1}\) via a bilinear interpolation layer. Then, we concatenate them and feed the concatenated feature map into a sub-network \(conv_{F}\) that contains two convolutional layers with a kernel size of \(3\times 3\). This sub-network directly outputs a new flow map \(\varDelta _{F} \in {\mathbb {R}}^{H_{1} \times W_{1} \times 4}\):

$$\begin{aligned} \varDelta _{F} = \text {conv}_F(\text {cat}({\textbf{F}}_{4}, {\textbf{F}}_{1})). \end{aligned}$$
(4)

We split this map \(\varDelta _{F}\) into \(\varDelta _{F1}\) and \(\varDelta _{F4}\) to jointly align both \({\textbf{F}}_{1}\) and \({\textbf{F}}_{4}\). Moreover, we generate a shared gate map to highlight the most important areas of both aligned features. Our key insight is to make full use of the high-level semantic feature and let the low-level feature serve as its supplement. In particular, we adopt another sub-network \(conv_{g}\), which contains one convolutional layer with a kernel size of \(1 \times 1\) and one Sigmoid layer, to generate the gate map. To highlight the most important regions of both features, we apply max pooling (\(\textrm{Maxpool}\)) and average pooling (\(\textrm{Avepool}\)) over both features and concatenate the resulting four maps as the input for generating the learnable gate map. This process is formulated as:

$$\begin{aligned} \varDelta _{G} = \text {conv}_g(\text {cat}(\textrm{Avepool}({\textbf{F}}_{4}, {\textbf{F}}_{1}), \textrm{Maxpool}({\textbf{F}}_{4}, {\textbf{F}}_{1}))), \end{aligned}$$
(5)

Then we adopt \(\varDelta _{G}\) to weight one of the aligned features and its inversion \(1 - \varDelta _{G}\) to weight the other during fusion. The key insights are two-fold. Firstly, sharing the same gate better highlights the most salient regions. Secondly, the subtracted gate supplies the details missing from the low-resolution feature. This process is formulated as:

$$\begin{aligned} {\textbf{F}}_{fuse} = \varDelta _{G}\,\text {Warp}(\varDelta _{F1}, {\textbf{F}}_{1}) + (1 - \varDelta _{G})\,\text {Warp}(\varDelta _{F4}, {\textbf{F}}_{4}), \end{aligned}$$
(6)

where the \(\text {Warp}\) operation is the same as in Eq. 3. Our key insight is that a better fusion of both features leads to a more fine-grained feature representation: a semantically rich, high-resolution feature map. The entire process is shown in Fig. 5a, and a sketch is given below.
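Below is a minimal PyTorch sketch of GD-FAM corresponding to Eqs. 4–6, reusing the flow_warp sketch above; the channel width, normalization layers, and pooling details are assumptions for illustration rather than the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GDFAM(nn.Module):
    """Gated Dual Flow Alignment Module sketch: one 4-channel flow map split
    into two flows, plus a shared gate that blends the two aligned features."""

    def __init__(self, channels=128):
        super().__init__()
        # conv_F: two 3x3 convs predicting a 4-channel map (Eq. 4)
        self.conv_flow = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 4, 3, padding=1),
        )
        # conv_g: 1x1 conv + Sigmoid on the four pooled maps (Eq. 5)
        self.conv_gate = nn.Sequential(nn.Conv2d(4, 1, 1), nn.Sigmoid())

    def forward(self, f1, f4):
        # f1: high-resolution feature F_1, f4: low-resolution feature F_4
        f4_up = F.interpolate(f4, size=f1.shape[-2:], mode="bilinear",
                              align_corners=False)
        flows = self.conv_flow(torch.cat([f4_up, f1], dim=1))      # Eq. (4)
        flow_1, flow_4 = flows[:, :2], flows[:, 2:]
        # shared gate from channel-wise average and max pooling (Eq. 5)
        pooled = torch.cat([
            f4_up.mean(dim=1, keepdim=True), f1.mean(dim=1, keepdim=True),
            f4_up.amax(dim=1, keepdim=True), f1.amax(dim=1, keepdim=True),
        ], dim=1)
        gate = self.conv_gate(pooled)
        # gated fusion of the two aligned features, following Eq. (6)
        aligned_f1 = flow_warp(f1, flow_1)
        aligned_f4 = flow_warp(f4, flow_4)
        return gate * aligned_f1 + (1.0 - gate) * aligned_f4
```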

Lite Aligned Decoder The Lite Aligned Decoder is a simplified version of the Aligned Decoder, containing only one GD-FAM and one PPM. As shown in Fig. 5b, the final segmentation head takes \({\textbf{F}}_{fuse}\) and the upsampled deep features of the last stage as inputs and produces the final segmentation map via one \(1\times 1\) convolution over the combined inputs. The Lite Aligned Decoder speeds up the Aligned Decoder by involving fewer multiscale features (only two scales). Avoiding shortcut connections also leads to faster speed when deploying the models on devices for practical usage. More results can be found in the experiment part.

Table 1 Speed comparison (FPS) on different devices for SFNet and SFNet-Lite
Table 2 Dataset information of our merged UDS dataset

Speed Comparison Analysis In Table 1, we compare the speed of SFNet and SFNet-Lite on different devices. SFNet-Lite runs faster on all of them. In particular, when both are deployed with TensorRT, SFNet-Lite is much faster than SFNet since it involves fewer cross-scale branches, which allows better optimization for acceleration.

3.4 The Unified Driving Segmentation Dataset

Motivation Learning a unified driving-scene segmentation model is useful since the environment may change substantially as a self-driving car moves. MSeg (Lambert et al., 2020) presents a more challenging setting, while we focus only on high-resolution outdoor driving scenes. Since the concepts in road scenes are limited, our label space is small compared with MSeg, which also covers several common-scene datasets (COCO (Lin et al., 2014), ADE20k (Zhou et al., 2016)).

We verify the effectiveness of our SFNet series on this new setting, performing feature alignment across various domains without introducing domain-aware learning (Choi et al., 2021). The goal of UDS is to provide a fairer comparison for driving-scene segmentation. To our knowledge, we are the first to benchmark such large-scale driving datasets using one model.

Data Process and Results We merge four challenging datasets: Mapillary (Neuhold et al., 2017), Cityscapes (Cordts et al., 2016), IDD (Varma et al., 2019) and BDD (Fisher et al., 2020). Since Mapillary has 65 class labels, we merge several semantic labels into one label; the merging process follows previous work (Choi et al., 2021), and all other labels are set as the ignore region. In this way, we keep the same label definition as Cityscapes and IDD. For the IDD dataset, we use the same class definition as Cityscapes and BDD. For the BDD and Cityscapes datasets, we keep the original settings. The merged UDS dataset has 34,968 images for training and 6,500 images for testing in total. The details of the UDS dataset are shown in Table 2. Moreover, we find that several recent self-attention based methods (Jun et al., 2019; Yuan et al., 2020; Li et al., 2019) do not perform better than the earlier DeeplabV3+ (Chen et al., 2018), which implies that a method with better generalization is needed for this setting. We provide the code and models on our GitHub page.
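The label-merging step amounts to a per-pixel remapping of each source dataset's ids into the shared 19-class taxonomy. A minimal sketch is shown below; the actual Mapillary merge table follows Choi et al. (2021) and is not reproduced here, so the mapping dictionary is a placeholder.

```python
import numpy as np

def remap_labels(mask, id_mapping, ignore_index=255):
    """Remap a dataset-specific label mask to the unified 19-class taxonomy.

    mask:       (H, W) array of source label ids.
    id_mapping: {source_id: unified_id} dict (placeholder; the real Mapillary
                merge table follows Choi et al., 2021).
    Unmapped ids are assigned to the ignore region.
    """
    out = np.full_like(mask, ignore_index)
    for src_id, dst_id in id_mapping.items():
        out[mask == src_id] = dst_id
    return out
```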

Discussion Note that although designing more balanced sampling methods or including domain-generalization-based methods could further improve the results on UDS, the goal of this work is only to verify the effectiveness of our SFNet and SFNet-Lite in this challenging setting. Both GD-FAM and FAM perform feature-level alignment, which is not sensitive to domain variations. Moreover, we also show the effectiveness of SFNet in the domain generalization setting using RobustNet (Choi et al., 2021). More details can be found in the experiment part.

3.5 Improvement Details and Extension

Improvement Details We use a deeply supervised loss (Zhao et al., 2017) to supervise intermediate outputs of the decoder for easier optimization. In addition, following Changqian et al. (2018), online hard example mining (Shrivastava et al., 2016) is also used, training only on the \(10\%\) hardest pixels sorted by cross-entropy loss. During inference, we only use the results from the main head. We also use uniform sampling to balance rare classes during training on all benchmarks. For the Cityscapes dataset, we additionally use the coarse-data boosting trick (Zhu et al., 2019) to boost rare classes. For the backbone, we also deploy the recent STDC backbone (Fan et al., 2021) to further increase the inference speed on devices.
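For clarity, the online hard example mining loss described above can be sketched as follows; the ignore index and the guard for empty batches are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def ohem_cross_entropy(logits, target, top_ratio=0.1, ignore_index=255):
    """Online hard example mining: rank pixels by per-pixel cross-entropy and
    average the loss over only the hardest `top_ratio` fraction of valid pixels."""
    loss = F.cross_entropy(
        logits, target, ignore_index=ignore_index, reduction="none"
    ).flatten()
    valid = target.flatten() != ignore_index
    loss = loss[valid]
    if loss.numel() == 0:          # all pixels ignored in this batch
        return logits.sum() * 0.0
    k = max(1, int(top_ratio * loss.numel()))
    hardest, _ = torch.topk(loss, k)
    return hardest.mean()
```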

Extending SFNet into Panoptic Segmentation Panoptic segmentation unifies semantic segmentation and instance segmentation and is a more challenging task. We also explore the proposed SFNet on this task with the panoptic segmentation baseline K-Net (Zhang et al., 2021). K-Net is a state-of-the-art panoptic segmentation method in which each thing and stuff class is represented by a kernel in its decoder head. In particular, we replace the backbone part of K-Net with our proposed SFNet backbone and aligned decoder, and then train the modified model using the same setting as K-Net.

Table 3 Comparison on Cityscapes test set with state-of-the-art real-time models
Table 4 Comparison on Mapillary validation set with state-of-the-art models
Table 5 Comparison on IDD validation set with state-of-the-art models
Table 6 Comparison on BDD validation set with state-of-the-art models
Table 7 Comparison on UDS testing set with state-of-the-art models
Table 8 Ablation studies on SFNet architecture design using Cityscapes validation set

4 Experiment

4.1 Experiment Settings

Overview We first review the datasets and training settings for SFNet. Then, we compare results on five road-driving datasets for both the original SFNet and the newly proposed SFNet-Lite. After that, we present detailed ablation studies and analysis of our SFNet. Finally, we show the generalization ability of SFNet on the Cityscapes panoptic segmentation dataset.

Datasets We mainly carry out experiments on road-driving datasets, including Cityscapes, Mapillary, IDD, BDD, and our proposed merged driving dataset. We also report panoptic segmentation results on the Cityscapes validation set. Cityscapes (Cordts et al., 2016) is a benchmark densely annotated for 19 categories of urban scenes. It contains 5,000 finely annotated images in total, divided into 2,975, 500, and 1,525 images for training, validation, and testing, respectively. In addition, 20,000 coarsely labeled images are provided to enrich the training data. All images share the same high resolution of \(1024 \times 2048\) in the road-driving scene. Note that we use the finely annotated data for the ablation studies and comparisons with previous methods, and we also use the coarse data to boost the final results of SFNet-Lite. Mapillary (Neuhold et al., 2017) is a large-scale road-driving dataset that is more challenging than Cityscapes since it contains more classes and more varied scenes. It contains 18,000 images for training and 2,000 images for validation. IDD (Varma et al., 2019) is another road-driving dataset that mainly contains Indian scenes. It contains more images than Cityscapes, with 6,993 training images and 981 validation images. To our knowledge, we are the first to benchmark real-time segmentation models on the Mapillary and IDD datasets. The BDD dataset mainly contains various scenes from American areas. It has 7,000 training images and 1,000 validation images. All the datasets, including the UDS dataset, are available online.

Implementation Details We use the PyTorch (Paszke et al., 2017) framework to carry out all experiments. All networks are trained with the same setting: stochastic gradient descent (SGD) with a batch size of 16 is used as the optimizer, with a momentum of 0.9 and a weight decay of 5e-4. All models are trained for 50K iterations with an initial learning rate of 0.01. As a common practice, the “poly” learning rate policy is adopted, decaying the initial learning rate by a factor of \((1 -\frac{\text {iter}}{\text {total}\_\text {iter}})^{0.9}\) during training. Data augmentation contains random horizontal flipping, random resizing with a scale range of [0.75,  2.0], and random cropping with a crop size of \(1024 \times 1024\) for the Cityscapes, Mapillary, BDD, IDD, and UDS datasets. For quantitative evaluation, the mean of class-wise Intersection-over-Union (mIoU) is used for accuracy comparison, and the number of floating-point operations (FLOPs) and frames per second (FPS) are adopted for speed comparison. Moreover, we also adopt the class-balanced sampling strategy proposed in Zhu et al. (2019) to obtain stronger baselines. For the Cityscapes dataset, we also adopt coarse-annotated data boosting to improve segmentation quality for rare classes. Our code and models are available for reference. Also note that several non-real-time segmentation methods on the Mapillary, BDD, IDD, and UDS datasets are implemented with our codebase and trained under the same setting.
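As a reference for the optimization setting above, the sketch below builds the SGD optimizer and the "poly" schedule in PyTorch; the scheduler is stepped once per iteration, and the clamp at zero is an assumption for numerical safety.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer_and_scheduler(model, base_lr=0.01, total_iters=50000,
                                  momentum=0.9, weight_decay=5e-4, power=0.9):
    """SGD with the 'poly' decay: lr(it) = base_lr * (1 - it / total_iters) ** power."""
    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                                momentum=momentum, weight_decay=weight_decay)
    scheduler = LambdaLR(
        optimizer,
        lambda it: max(0.0, 1.0 - it / total_iters) ** power,
    )
    return optimizer, scheduler

# usage sketch: call scheduler.step() after every training iteration
```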

TensorRT Deployment Device The testing environment is TensorRT 8.2.0 with CUDA 11.2 on a single TITAN-RTX GPU. In addition, we re-implement the grid sampling operator in CUDA so that it can be used with TensorRT; this operator is provided by PyTorch and is used in the warping operation of the Flow Alignment Module. We report the average inference time over 100 images. Moreover, we also deploy SFNet and SFNet-Lite on other devices, including a 1080-Ti and an RTX-3090, and report the results in the next part.

4.2 Main Results

Results on Cityscapes Test Set We first report our SFNet results on the Cityscapes dataset in Table 3. With ResNet-18 as the backbone, our method achieves 79.8% mIoU and even reaches the performance of the accurate models, which will be discussed next. Adopting STDC as the backbone, our method achieves 79.8% mIoU with full-resolution inputs while running at 80 FPS, which suggests that our method benefits from a well-designed backbone. The improved SFNet-Lite achieves even better results than the original SFNet while running faster with ResNet-18 as the backbone. With the STDC backbone, SFNet-Lite achieves a much faster speed while maintaining similar accuracy. In particular, using STDC-v1, our method achieves 78.8% mIoU while running at 120 FPS, a new state-of-the-art result in balancing speed and accuracy. This indicates the effectiveness of our proposed GD-FAM.

Note that for a fair comparison, in Table 3, following previous works (Fan et al., 2021; Changqian et al., 2021), we report the speed measured with TensorRT. For the remaining datasets, we only report the average GPU inference time. The original SFNet with ResNet-18 achieves 78.9% mIoU; adopting uniform sampling, coarse boosting, and longer training leads to an extra 0.9% gain on the test set. The details can be found in the following sections.

Results on Mapillary Validation Set In Table 4, we report speed and accuracy results on the more challenging Mapillary dataset. Since this dataset contains very high resolution images and direct inference may cause out-of-memory issues, we resize the short side of each image to 1536 and center-crop the image and ground truth following Zhu et al. (2019).

As shown in Table 4, our methods also achieve the best speed and accuracy trade-off across various backbones. Even though Deeplabv3+ (Chen et al., 2018) and EMANet (Li et al., 2019) achieve higher accuracy, their speed does not reach the real-time standard. In particular, for the DFNet-based backbone (Li et al., 2019), our SFNet achieves almost 5–6% mIoU improvements. SFNet-Lite also achieves competitive results while running faster.

Results on IDD Validation Set In Table 5, our methods achieve the best speed and accuracy trade-off. Compared with the previous work STDCNet, our method achieves better accuracy and faster speed, as shown in the last row of Table 5. For the DFNet backbone, our methods achieve nearly 12% relative mIoU improvements. Such results indicate that the proposed FAM and GD-FAM accurately align low-resolution features into more accurate high-resolution, high-semantic feature maps.

Results on BDD Validation Set In Table 6, we further benchmark the representative works on the BDD dataset. From the table, Deeplabv3+ (Chen et al., 2018) achieves the top performance but at a much slower speed. Again, our methods, including both the original SFNet and the improved SFNet-Lite, achieve the best speed and accuracy trade-off. Compared with the recent state-of-the-art method STDCNet (Fan et al., 2021), our SFNet-Lite achieves a 5% mIoU improvement while running slower. When adopting the ResNet-18 backbone, our SFNet-Lite achieves 60.6% mIoU while running at 44.5 FPS without TensorRT acceleration.

Results on UDS Testing Set Finally, we benchmark recent works on the merged UDS dataset in Table 7. To fit GPU memory, we resize both the images and the ground truth to 1024 \(\times \) 2048. From the table, we find that Deeplabv3+ (Chen et al., 2018) achieves the top performance. Several self-attention-based models (Li et al., 2019; Yuan et al., 2020; Jun et al., 2019) achieve even worse results than the earlier Deeplabv3+ on such a domain-variant dataset. This shows that the UDS dataset still leaves huge room for improvement.

As shown in Table 7, our methods using DFNet backbones achieve about 10% relative mIoU improvements over the DF-Seg baselines. When equipped with the ResNet-18 backbone, our SFNet achieves 76.5% mIoU while running at 20 FPS. When adopting the STDC-V2 backbone, our SFNet-Lite achieves the best speed and accuracy trade-off.

Table 9 Ablation results on FAM design using Cityscapes validation set
Table 10 Ablation experiment results on SFNet-Lite and GD-FAM design using Cityscapes validation set
Table 11 Generalization on various backbone

4.3 Ablation Studies

Effectiveness of FAM and GD-FAM Table 8(a) reports the comparison against baselines on the Cityscapes validation set (Cordts et al., 2016), where ResNet-18 (He et al., 2016) serves as the backbone. Compared with the naive FCN, dilated FCN improves mIoU by 1.1%. By appending the FPN decoder to the naive FCN, we get 74.8% mIoU, an improvement of 3.2%. By replacing bilinear upsampling with the proposed FAM, mIoU is boosted to 77.2%, which improves the naive FCN and the FPN decoder by 5.7% and 2.4%, respectively. Finally, we append PPM (Pyramid Pooling Module) (Zhao et al., 2017) to capture global contextual information, which achieves the best mIoU of 78.7% together with FAM. Meanwhile, FAM is complementary to PPM, improving it from 76.6 to 78.7%. In Table 10(a), we compare the effectiveness of GD-FAM and FAM. As shown in that table, our newly proposed GD-FAM performs better (by 0.4%) while running faster than the original FAM under the same settings.

Positions to Insert FAM or GD-FAM We insert FAM at different stage positions in the FPN decoder and report the results in Table 8(b). From the first three rows, FAM improves all stages and brings the greatest improvement at the last stage, demonstrating that misalignment exists in all stages of the FPN and is more severe in coarse layers. This is consistent with the fact that coarse layers contain stronger semantics but lower resolution, and can greatly boost segmentation performance when appropriately upsampled to high resolution. The best result is achieved by adding FAM to all stages, as shown in the last row. For GD-FAM, we aim to align the high-resolution and low-resolution features directly; we choose to align \(F_{3}\) and the output of PPM by default.

Ablation Study on Network Architecture Design Considering that current state-of-the-art contextual modules are used as heads on dilated backbone networks (Chen et al., 2017; Yang et al., 2018), we further try different contextual heads in our method, where the coarse feature map is used for contextual modeling. Table 8(c) reports the comparison results, where PPM (Zhao et al., 2017) delivers the best result, while more recently proposed methods such as non-local-based heads (Wang et al., 2018) perform worse. Therefore, we choose PPM as our contextual head due to its better performance and lower computational cost.

Ablation on FAM Design We first explore the effect of upsampling in FAM in Table 9(a). Replacing bilinear upsampling with deconvolution or nearest-neighbor upsampling achieves 77.9% mIoU and 78.2% mIoU, respectively, which are similar to the 78.3% mIoU achieved by bilinear upsampling. We also try various kernel sizes in Table 9(b): a larger kernel size of \(5\times 5\) gives a similar result (78.2%) but introduces more computation cost. In Table 9(c), replacing the FlowNet-S style design with the correlation layer of FlowNet-C leads to slightly worse results (77.2%) and increases the inference time. The results show that the lightweight FlowNet-S style design is sufficient for aligning feature maps in FPN. In Table 9(d), we compare our results with DCN (Dai et al., 2017). We apply DCN on the concatenation of the bilinearly upsampled feature map and the feature map of the next level. We first insert one DCN at the higher layer \({\textbf{F}}_{5}\), where our FAM performs better. After applying DCN to all layers, the performance gap becomes much larger. This indicates that our method also aligns low-level edges for better boundaries in lower layers, which is shown in the visualization part.

Ablation on GD-FAM Design In Table 10(b), we explore the effect of each component in GD-FAM. In particular, adding the Dual Flow (DF) design brings about a 1.2% improvement. Using attention to generate gates rather than convolution leads to a 0.2% improvement. Finally, the shared gate design improves the strong baseline by another 0.3%.

Ablation on Improvement Details In Table 10(c), we explore the training tricks, including Uniform Sampling (US), Long Training (LT) and Coarse Boosting (CB). Performing US leads to a 0.3% improvement on our SFNet-Lite. Using LT (1,000 training epochs) rather than short training (300 epochs) results in another 0.4% mIoU improvement. Finally, adopting coarse-data boosting on several rare classes leads to another 0.7% improvement.

Generalization on Various Backbones We further carry out experiments with different backbone networks, including both deep and light-weight networks, where the FPN decoder with the PPM head is used as a strong baseline in Table 11. For heavy networks, we choose ResNet-50 and ResNet-101 (He et al., 2016) to extract representations. For light-weight networks, ShuffleNetv2 (Ma et al., 2018), DF1/DF2 (Li et al., 2019) and STDC-Net (Fan et al., 2021) are employed. Both FAM and GD-FAM significantly improve mIoU on all backbones with little extra computational cost.

Table 12 Quantitative per-category comparison results on Cityscapes validation set, where ResNet-101 backbone with the FPN decoder and PPM head serves as the strong baseline

Aligned Feature Representation In this part, we present more visualizations of the aligned feature representation, as shown in Fig. 7. We visualize the upsampled feature in the final stage of ResNet-18. Compared with DCN (Dai et al., 2017), our FAM feature is more structured and has much more precise object boundaries, which is consistent with the results in Table 9(d). This indicates that FAM does not merely apply an attention effect to the feature, as DCN does, but aligns the feature towards a more precise shape, as highlighted in the red boxes.

4.4 More Detailed Analysis

Detailed Improvements Table 12 compares the detailed results of each category on the validation set, where ResNet-101 is used as the backbone and the FPN decoder with the PPM head serves as the baseline. SFNet improves almost all categories, especially ‘truck’ with more than 19% mIoU improvement. Adopting GD-FAM leads to more consistent improvements over FAM on each class.

Fig. 6
figure 6

Visualization of the learned semantic flow fields. Column a lists three exemplary images. Columns b–d show the semantic flow of the three FAMs in ascending order of resolution during the decoding process, following the same color coding as Fig. 3. Column e is the arrowhead visualization of the flow fields in column d. Column f contains the segmentation results

Visualization of Semantic Flow Fig. 6 visualizes semantic flow from FAM in different stages. Similar to optical flow, semantic flow is visualized by color coding and is bilinearly interpolated to image size for a quick overview. Besides, vector fields are also visualized for detailed inspection. From the visualization, we observe that semantic flow tends to diffuse out from some positions inside objects. These positions are generally near the object centers and have better receptive fields to activate top-level features with pure and strong semantics. Top-level features at these positions are then propagated to appropriate high-resolution positions following the guidance of semantic flow. In addition, semantic flows also have coarse-to-fine trends from the top level to the bottom level. This phenomenon is consistent with the fact that semantic flows gradually describe offsets between gradually smaller patterns.

Visual Improvements on Cityscapes Dataset Figure 8a visualizes the prediction errors of both methods, where FAM considerably resolves ambiguities inside large objects (e.g., truck) and produces more precise boundaries for small and thin objects (e.g., poles, edges of walls). Figure 8b shows that our model handles small objects with sharper boundaries better than dilated PSPNet, thanks to the alignment of lower layers.

Fig. 7 Visualization of the aligned feature. Compared with DCN, our module outputs a more structural feature representation

Fig. 8 a Qualitative comparison in terms of prediction errors, where correctly predicted pixels are shown as black background while wrongly predicted pixels are colored with their ground-truth label colors. b Scene parsing results compared against PSPNet (Zhao et al., 2017), where the improved regions are marked with red dashed boxes. Our method performs better on both small-scale and large-scale objects

Visual Comparison on the Mapillary Dataset In Fig. 9, we show the visual comparison results on the Mapillary dataset. Compared with the previous ICNet and BiSegNet, our SFNet-Lite with a ResNet-18 backbone produces more accurate classification and more structural segmentation outputs.

Fig. 9 Qualitative comparison on the Mapillary dataset. Top-left: original images. Top-right: results of BiSegNet (Changqian et al., 2018). Bottom-left: results of ICNet (Zhao et al., 2018). Bottom-right: results of our SFNet-Lite. Improved regions are marked with yellow boxes. Best viewed in color (Color figure online)

Fig. 10 Visualization results on the UDS validation dataset, including BDD, Mapillary, IDD and Cityscapes. Our methods achieve better visual results in terms of clear object boundaries, inner-object consistency and better structural outputs. We adopt single-scale inference, and all the models are trained under the same setting. Best viewed on screen with zoom

Visual Comparison on the Proposed UDS Dataset In Fig. 10, we present several samples from different datasets. Compared with the original DFNet baseline, our method achieves better segmentation results in terms of clear object boundaries and inner-object consistency. We also show SFNet-Lite with a ResNet-18 backbone in the fourth row and the overlapped images in the last row. The figure shows that our methods (SFNet with a DFV2 backbone and SFNet-Lite with a ResNet-18 backbone) achieve good segmentation quality across different domains.

Speed Effect on Different Devices In Table 13, we explore the effect of the deployment device. Compared with the original SFNet (Li et al., 2020), which was benchmarked on a 1080-Ti, using a more advanced device leads to a much higher speed: the RTX-3090 is almost twice as fast as the 1080-Ti with the ResNet-18 backbone and four times faster with STDCNet. Moreover, we find that SFNet with the STDCNet (Fan et al., 2021) backbone is more friendly to TensorRT deployment.
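The following sketch illustrates a typical FPS measurement protocol in plain PyTorch (warm-up, device synchronization, averaging over many forward passes); the TensorRT numbers in Table 13 are produced with a different runtime, so this is only illustrative.

import time
import torch

@torch.no_grad()
def measure_fps(model, input_size=(1, 3, 1024, 2048), warmup=10, iters=100):
    """Rough GPU FPS estimate for a segmentation model (sketch only)."""
    model = model.eval().cuda()
    x = torch.randn(*input_size, device="cuda")
    for _ in range(warmup):          # warm-up to exclude one-off costs (cudnn tuning, allocation)
        model(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()         # wait for all kernels before stopping the clock
    return iters / (time.perf_counter() - start)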

UDS Used for Pre-training We further show the effectiveness of our UDS dataset in Table 14. Compared with ImageNet (Russakovsky et al., 2015), pre-training on the UDS dataset boosts SFNet results on the Camvid dataset (Brostow et al., 2008) by a significant margin (3–4% mIoU). This implies that the UDS dataset can be an excellent pre-training source to boost model performance.

Table 13 Speed comparison on TensorRT deployment testing with different devices
Table 14 Pretraining effect of UDS dataset
Table 15 Experiment results on the Cityscapes Panoptic validation set

4.5 Extension on Efficient Panoptic Segmentation

Experiment Setting In this section, we show the generalization ability of our semantic flow on the more challenging task of panoptic segmentation. We choose K-Net (Zhang et al., 2021) as the prediction head, while our SFNet serves as the backbone and neck of the feature extractor. The whole network is first trained on the COCO dataset and then fine-tuned on the Cityscapes dataset. For COCO (Lin et al., 2014) pre-training, all the models are trained following the detectron2 settings (Wu et al., 2019). We adopt multiscale training by resizing the input images such that the shortest side lies between 480 and 800 pixels, while the longest side is at most 1333 pixels. We also apply random crop augmentation during training, where a training image is cropped with a probability of 0.5 to a random rectangular patch and then resized again to 800–1333 pixels. All the models are trained for 36 epochs. For Cityscapes fine-tuning, we resize the images with a scale ranging from 0.5 to 2.0 and randomly crop the whole image during training with batch size 16. All the results are obtained via single-scale inference. We also report results using the ResNet-50 backbone for reference. We report the FPS on V100 devices by averaging over 100 input images; the FPS measurement also includes the panoptic post-processing time.
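For clarity, a minimal sketch of the multiscale resize described above is given below; the sampling of the shortest side and the interpolation mode are assumptions rather than the exact detectron2 implementation.

import random
from PIL import Image

def multiscale_resize(img, min_short=480, max_short=800, max_long=1333):
    """Resize so the shortest side falls in [480, 800] and the longest side
    stays within 1333 pixels (sketch only, aspect ratio preserved)."""
    w, h = img.size
    short = random.randint(min_short, max_short)
    scale = short / min(w, h)
    if max(w, h) * scale > max_long:          # cap the long side at the budget
        scale = max_long / max(w, h)
    new_size = (int(round(w * scale)), int(round(h * scale)))
    return img.resize(new_size, Image.BILINEAR)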

Results of Various Baselines on Cityscapes Panoptic Segmentation As shown in Table 15, our SFNet backbone improves the baseline models by around 0.5–1.0% in terms of the Panoptic Quality (PQ) metric. The results show the generalization ability of the semantic flow, since our aligned feature representation preserves more fine-grained information. Moreover, we compare our methods using a stronger ResNet-50 backbone. Compared with K-Net (Zhang et al., 2021), our method still achieves a 0.5% PQ improvement with only a 1.2 FPS drop. Our method with the STDCv2 backbone achieves a strong speed and accuracy trade-off (60.3 PQ at 18.6 FPS).

4.6 More Analysis on SFNet and SFNet-Lite

Experiment Setting In this section, we perform more extensive experiments using SFNets. (1) We first conduct further comparisons with DCN (Dai et al., 2017) on Cityscapes and UDS, comparing one added DCN layer against one GD-FAM. (2) Then, we perform domain generalization experiments using RobustNet with different SFNet baselines, where we train the models on the Cityscapes dataset and test them on the BDD and IDD datasets. (3) Next, we present results on the ADE20k dataset using different baselines, including Semantic FPN (Kirillov et al., 2019) and SegFormer (Xie et al., 2021). For the experiments on the ADE20k dataset, we follow the default settings of OCRNet (Yuan et al., 2020), where the crop size is set to 512 with 160k training iterations. The GFLOPs are calculated with \(512 \times 512\) inputs.

More Detailed Comparison with DCN We carry out a more detailed comparison between DCN and our proposed GD-FAM. In particular, we replace GD-FAM or FAM with a simple concatenation followed by a deformable convolution, where GD-FAM and FAM are inserted in the last stage to align the last two features for comparison; the DCN directly replaces FAM or GD-FAM. As shown in Table 16, our method achieves better results (1.0–2.0% mIoU gains) on both the Cityscapes and UDS datasets, which is consistent with the findings in Table 9(d).
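A minimal sketch of this DCN baseline is shown below, assuming hypothetical channel sizes and a simple offset predictor; it illustrates the concatenation-plus-deformable-convolution variant rather than reproducing the exact code.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import DeformConv2d

class ConcatDCNFusion(nn.Module):
    """DCN baseline sketch: upsample the coarse feature, concatenate with the
    fine feature, and refine with one 3x3 deformable convolution."""
    def __init__(self, channels=128):
        super().__init__()
        # 18 offset channels = 2 (x, y) * 3 * 3 sampling locations.
        self.offset = nn.Conv2d(2 * channels, 18, kernel_size=3, padding=1)
        self.dcn = DeformConv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, low, high):
        # low: fine-resolution feature; high: coarse, semantically strong feature.
        high = F.interpolate(high, size=low.shape[-2:], mode="bilinear",
                             align_corners=False)
        x = torch.cat([low, high], dim=1)
        return self.dcn(x, self.offset(x))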

Domain Generalization Testing Using RobustNet (Choi et al., 2021) We further verify the domain generalization ability of SFNet and SFNet-Lite. Our experiments are based on the previous works RobustNet (Choi et al., 2021) and Semantic-FPN (Kirillov et al., 2019). In particular, we follow the original open-source RobustNet code and settings, applying the whitening operation to different backbones to build the baselines. As shown in Table 17, our methods achieve consistent 2–3% mIoU improvements over the RobustNet baselines on both the IDD and BDD datasets.
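As a rough illustration of the whitening idea (a simplification, not RobustNet's exact instance selective whitening loss), the sketch below standardizes each instance's channels and penalizes the off-diagonal entries of the channel covariance.

import torch

def instance_whitening_penalty(feat, eps=1e-5):
    """Generic instance-whitening penalty on an N x C x H x W feature (sketch only)."""
    n, c, h, w = feat.shape
    x = feat.reshape(n, c, -1)
    x = (x - x.mean(dim=-1, keepdim=True)) / (x.std(dim=-1, keepdim=True) + eps)
    cov = torch.bmm(x, x.transpose(1, 2)) / (h * w - 1)                  # (N, C, C)
    off_diag = cov - torch.diag_embed(torch.diagonal(cov, dim1=1, dim2=2))
    return off_diag.pow(2).mean()   # push channel correlations toward zero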

Table 16 More detailed comparison between GD-FAM and DCN
Table 17 Domain generalization experiments using SFNet and SFNet-Lite using RobustNet (Choi et al., 2021)
Table 18 Effectiveness on ADE20k dataset on Semantic-FPN with different backbones
Table 19 Effectiveness on ADE20k dataset on Transformer-based methods

Experiment Results on the ADE20k Dataset In Table 18, we verify the effectiveness of FAM and GD-FAM on the more challenging ADE20k dataset. For a fair comparison, we re-implement the baselines in the same codebase and report our reproduced results for Semantic-FPN. As shown in that table, we observe about 1.2–2.2% improvements over different baselines. In particular, the improvements on the real-time models are larger, which suggests that the semantic gaps in small models are more severe. This finding is consistent with the results on the road-driving scene datasets (see Tables 6, 7).

Experiment Results on ADE20k Using Transformer-Based Models In Table 19, we also report results using the transformer-based model SegFormer (Xie et al., 2021). We observe about 0.8–1.3% mIoU improvements over different backbones. These results indicate that our proposed approach can also be applied to transformer-based segmenters.

5 Conclusion

In this paper, we propose to use the learned semantic flow to align the multi-level feature maps generated by the feature pyramid for semantic segmentation. We propose a Flow Alignment Module (FAM) to fuse high-level and low-level feature maps. Moreover, to speed up inference, we propose a novel Gated Dual Flow Alignment Module (GD-FAM) to align high- and low-resolution feature maps directly. By discarding atrous convolutions to reduce computational overhead and employing the flow alignment modules to enrich the semantic representation of low-level features, our network achieves the best trade-off between semantic segmentation accuracy and running-time efficiency. Experiments on multiple challenging datasets illustrate the efficacy of our method. Moreover, we merge four challenging driving datasets into one Unified Driving Segmentation (UDS) dataset, which covers various domains, and benchmark several representative works on it. Experimental results show that our SFNet series achieves the best speed and accuracy trade-off. In particular, our SFNet improves the original DFNet on the UDS dataset by a large margin (9.0% mIoU). These results indicate that SFNet can serve as a fast and accurate baseline for semantic segmentation.