1 Introduction

The task of instance segmentation presents a significant challenge within the field of computer vision, requiring not only pixel-level classification but also precise localization of individual instances, even when they belong to the same category. Its applications span fields such as autonomous driving [1, 2], assisted medical diagnosis [3, 4], and information hiding [5, 6]. With the continuous development of deep learning, the accuracy of instance segmentation models has improved greatly.

In recent years, several instance segmentation algorithms have been proposed, demonstrating excellent performance. He et al. [7] proposed Mask R-CNN, which adds a mask branch on top of Faster R-CNN [8] to predict binary masks and has become the most representative framework for two-stage instance segmentation. However, because Mask R-CNN [7] relies mainly on pixel-level masks for instance segmentation, it tends to ignore the boundary information of objects, making it difficult to handle objects with complex boundary structures, such as very thin or twisted ones. Subsequently, many scholars have improved Mask R-CNN [7], and a large number of frameworks [9,10,11] have been proposed that achieve better performance. Additionally, one-stage instance segmentation methods have received significant attention, spurred by the success of one-stage detectors [12, 13]. There is also work on using graph neural networks for segmentation [14,15,16], which offers inspiring ideas.

However, the binary instance masks predicted by existing methods cannot fully cover the object instance and often have rough boundaries. This problem can be attributed to several factors. Firstly, the mask head uses high-level features for mask prediction, which carry rich semantic information but often miss important low-level spatial information such as boundaries. The usual solution is to fuse high-level and low-level information [17,18,19,20], or to optimize the loss function so that the model pays more attention to boundary information [11, 21]. Secondly, existing methods treat all pixels equally during classification, yet boundary pixels account for less than 1% of the entire image and are difficult to classify [22]. Existing improvements strengthen instance boundary information [9, 23,24,25], for example by thickening boundaries or expanding the boundary region. Finally, Cheng et al. [24] found that Mask IoU is insensitive to the quality of object boundary segmentation: even if boundary pixels are segmented poorly, Mask IoU can still yield high scores. Therefore, simply improving mask AP does not necessarily improve mask quality, as it may reflect the interior of the mask more than the boundary.

Although the above methods have improved the accuracy of mask boundary prediction to some extent, several limitations remain. Firstly, existing methods optimize the mask segmentation results but do not fundamentally solve the problem of rough boundary segmentation: they still do not make full use of the object’s boundary information, resulting in insufficiently smooth mask boundaries. In addition, in dense scenes with occlusion and complex instance shapes, boundary cues may become unreliable, leading to inaccurate segmentation. Objects with complex shapes may contain many curves, angles, and fine details, making their boundaries difficult to delineate accurately; the model may mistakenly include background regions inside the object or classify part of the object as background, introducing inaccuracies into the segmentation results.

To address these issues, we propose a Boundary-guided Global Feature Fusion (BGF) method that extracts boundary information at different stages and fuses it to obtain high-quality mask boundaries. Specifically, we design a Boundary Feature Extraction (BFE) module that extracts boundary information at different stages to enhance the quality of predicted masks. We use a feature fusion module to merge the features extracted by the Mask R-CNN [7] network with the boundary information extracted by the BFE module for mask prediction. We also introduce a Global Attention Module (GAM) to extract representative global features for feature fusion, which greatly improves the model’s performance. Finally, we predict the masks using the fused features.

The key achievements and contributions of this research paper are highlighted below:

  1. We introduce a Boundary Feature Extraction (BFE) module within Mask R-CNN, augmenting the network with a dedicated branch that improves the accuracy of instance mask predictions through the incorporation of boundary information.

  2. We propose a feature fusion module that includes a Global Attention Module (GAM) for fusing features derived from the Feature Pyramid Network (FPN) and the BFE module. This module enhances the network’s performance in both bounding box and mask prediction.

  3. We leverage an existing dataset to derive instance boundaries, which serve as training data for our boundary branch. Additionally, we introduce a novel boundary loss function, enhancing the effectiveness of our approach.

2 Related Work

2.1 Instance Segmentation

Existing methods for instance segmentation can be divided into two categories based on the number of stages: one-stage and two-stage methods. Two-stage instance segmentation typically involves detection followed by segmentation: detection-based methods use detection heads [8, 26] to generate region proposals and then predict masks. The representative architecture is Mask R-CNN [7], which adds a mask prediction branch to the two-stage object detector Faster R-CNN and uses RoIAlign instead of RoI pooling. After Mask R-CNN [7], PANet [9] adds a bottom-up path to FPN [26] and develops adaptive feature pooling, which shortens the information pathway between feature maps and restores the information connection between candidate regions and feature layers. DetNet [27] adds dilated convolutions to the backbone, which preserves feature resolution while increasing the receptive field, and proposes retraining the backbone network for detection and segmentation tasks to improve feature representation. MaskLab [10] is built on top of Faster R-CNN [8] and adds direction features for segmenting instances of the same class, using direction prediction to estimate the direction of each pixel with respect to its corresponding instance center, thereby achieving instance segmentation within the same semantic class. Mask Scoring R-CNN [11] observes that existing mask scoring strategies rely on classification scores and lack a targeted evaluation mechanism; building on Mask R-CNN [7], it modifies the mask evaluation criteria and improves instance segmentation performance by adding a Mask IoU branch that scores the predicted masks. These methods have all improved performance to some extent. Ye et al. [28] proposed a lightweight hybrid model that combines a supervised hierarchical granularity parsing task with an unsupervised image matting task. The model includes a scalable hierarchical semantic segmentation module and an image matting module composed of guided filters. Through multi-stage inference and feedback mechanisms, the model achieves fine-grained segmentation and improves the segmentation results.

2.2 Boundary Segmentation

Recently, more and more researchers have focused on improving the quality of mask boundaries. Hayder et al. [21] proposed Boundary-Aware Instance Segmentation (BAIS) to enhance instance segmentation mask boundaries. BAIS uses a novel boundary loss function that combines the internal information of the object with its edge information to encourage the model to pay more attention to the object’s boundary. Kirillov et al. [23] introduced the Point-based Rendering (PointRend) module, which uses a new upsampling method to refine segmentation at object edges, achieving good performance on hard-to-segment edge regions. Some methods improve boundary quality through post-processing. Liang et al. [29] use a polygon-based method (PolyTransform) to represent instance masks as polygons, which better adapts to complex shape changes. Yuan et al. [30] replace rough predictions of boundary pixels with interior predictions, but this relies on accurate boundary prediction. Tang et al. [22] use the Boundary Patch Refinement (BPR) framework, which crops and refines patches along the boundaries of the instance masks predicted by segmentation models. For these post-processing methods, no boundary information is used during mask prediction itself, which hinders further improvement of segmentation performance. Chen et al. [10] propose MaskLab, a model that simultaneously performs instance segmentation, keypoint detection, and segmentation edge prediction. Specifically, MaskLab [10] generates instance mask, keypoint, and edge branches using a multi-branch convolutional network and then integrates them into a unified framework. In addition, MaskLab introduces an attention-based feature selection module to enhance the network’s attention and response to important areas. Cheng et al. [24] proposed Boundary-preserving Mask R-CNN (BMask R-CNN), which uses boundary information to improve mask localization accuracy: by limiting the distance between the segmentation mask and the edge, BMask R-CNN preserves boundary information so that the predicted mask aligns well with the instance boundary. Hu et al. [25] proposed an attention-based module to enhance the boundary information in the segmentation feature map and improve segmentation quality. Feng et al. [31] proposed Boundary Knowledge Translation (BKT), a task that aims to accurately segment new categories with a small number of labeled samples by transferring visual boundary knowledge from labeled categories. To accomplish this, they propose the Translation Segmentation Network (Trans-Net), which consists of a segmentation network and two boundary discriminators; through self-supervised mechanisms and adversarial training, Trans-Net achieves segmentation results comparable to fully supervised methods with only tens of labeled samples for guidance. Cheng et al. [32] proposed Ref-Net, a reference semantic segmentation network that achieves accurate object segmentation by introducing reference object features and boundary knowledge transfer. Unlike traditional segmentation networks that rely heavily on labeled data, Ref-Net requires only a small number of finely annotated samples as guidance and achieves results comparable to fully supervised methods on multiple datasets.

3 Methods

We propose the BGF, which is based on the Mask R-CNN framework and uses ResNet [33]+FPN [26] as the feature extraction network. The overall framework is illustrated in Fig. 1. In this section, we provide an overview of the boundary branch (Sect. 3.1), and then we explain how we perform feature fusion to enhance the model’s performance (Sect. 3.2), followed by a description of our methods for learning and optimization (Sect. 3.3).

Fig. 1

The overall architecture of the proposed boundary-guided global feature fusion (BGF)

3.1 Boundary branch

Our chosen baseline network is Mask R-CNN [7], and we adopt its configurations. The network architecture is depicted in Fig. 1, with C1 representing a convolutional layer and S2–S5 denoting stages 2–5 of the ResNet network. Each stage consists of multiple residual blocks, and the feature map size progressively decreases from 1/4 to 1/32 of the original input. Moreover, F2 to F5 refer to the top-down feature pyramid network (FPN) [26], constructed through lateral connections: the output feature map of each FPN layer is upsampled and element-wise added to the laterally connected feature map of the corresponding ResNet stage. Furthermore, F6 is obtained via max pooling from F5 and fed directly into the RPN to generate Regions of Interest (RoIs). Each RoI is subsequently fed into the classification and bounding box regression branches, alongside the newly introduced mask branch. For more details, please refer to [7, 8, 26].
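For concreteness, the following is a minimal PyTorch sketch of the top-down FPN pathway summarized above; the class and parameter names (SimpleFPN, in_channels) are illustrative assumptions rather than the exact implementation used in this work.

```python
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Top-down pathway with lateral connections (channel sizes illustrative)."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1x1 lateral convolutions that align ResNet stage channels
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        # 3x3 smoothing convolutions applied after the top-down fusion
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, feats):  # feats: [S2, S3, S4, S5] from ResNet
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        for i in range(len(laterals) - 2, -1, -1):  # top-down: upsample and add
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest")
        outs = [s(x) for s, x in zip(self.smooth, laterals)]          # F2-F5
        outs.append(F.max_pool2d(outs[-1], kernel_size=1, stride=2))  # F6 from F5
        return outs
```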

Despite the effectiveness of Mask R-CNN [7] in achieving favorable instance segmentation outcomes, the challenge of imprecise mask boundaries remains unresolved. Therefore, we integrate boundary information into the network to enhance the accuracy of mask boundary prediction.

As illustrated in Fig. 2, we have developed a Boundary Feature Extraction (BFE) module for extracting instance boundaries at various stages of the ResNet network. The BFE module leverages features from four distinct stages of ResNet, each consisting of multiple conv_blocks and identity_blocks. The input and output sizes of the BFE module are presented in Table 1. To detect boundaries, we transform the output features of the last layer in each conv_block and identity_block of every stage using a \( 1 \times 1 \) convolution with 32 channels. Subsequently, we combine the adjusted features from different blocks by pixel-wise summation. Finally, a \( 1 \times 1 \) convolution with a single channel is employed to generate boundary detection results for each stage.

Table 1 Input and output size of the BFE module

At the same time, we employ a \( 1 \times 1 \) convolution with a single channel to refine the feature maps obtained from the C1 stage. Subsequently, we upsample the feature maps from the S2 to S5 stages, which are outputs of the BFE module, in order to match the size of C1. We then combine the refined C1 feature map with the upsampled feature maps, resulting in a fused boundary feature map that covers all five stages (C1–S5). Finally, for boundary detection results, we utilize a \( 1 \times 1 \) convolution with a single channel and apply a Sigmoid activation function.
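To make the two steps above concrete, here is a minimal PyTorch sketch of the BFE pipeline under our reading of Fig. 2 and Table 1; the module and parameter names (StageBoundaryHead, block_channels) are hypothetical, and the exact block outputs tapped may differ from the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StageBoundaryHead(nn.Module):
    """Per-stage boundary detection: a 32-channel 1x1 conv on each block's
    output, pixel-wise summation, then a single-channel 1x1 conv."""
    def __init__(self, block_channels, num_blocks):
        super().__init__()
        self.reduce = nn.ModuleList(
            nn.Conv2d(block_channels, 32, kernel_size=1) for _ in range(num_blocks))
        self.detect = nn.Conv2d(32, 1, kernel_size=1)

    def forward(self, block_feats):  # list of [B, C, H, W], one per block
        fused = sum(r(f) for r, f in zip(self.reduce, block_feats))
        return self.detect(fused)    # [B, 1, H, W] boundary map for this stage

class BFE(nn.Module):
    """Fuses the per-stage boundary maps (S2-S5) with a refined C1 map."""
    def __init__(self, c1_channels, stage_channels, stage_blocks):
        super().__init__()
        self.c1_refine = nn.Conv2d(c1_channels, 1, kernel_size=1)
        self.stage_heads = nn.ModuleList(
            StageBoundaryHead(c, n) for c, n in zip(stage_channels, stage_blocks))
        self.final = nn.Conv2d(1, 1, kernel_size=1)

    def forward(self, c1, stage_block_feats):
        out = self.c1_refine(c1)                       # refined C1 map
        for head, feats in zip(self.stage_heads, stage_block_feats):
            b = head(feats)                            # per-stage boundary map
            out = out + F.interpolate(b, size=c1.shape[-2:], mode="bilinear",
                                      align_corners=False)  # upsample to C1 size
        return torch.sigmoid(self.final(out))          # fused boundary prediction
```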

In instance segmentation tasks, the precise identification of object boundaries holds significant importance. The inclusion of boundary information enables the network to gain a better understanding of object shapes and structures, leading to improved instance segmentation accuracy. The boundary branch extracts valuable boundary information, supplying additional details regarding boundary positions and shapes. This, in turn, enhances the network’s ability to perceive boundaries effectively.

Fig. 2

An illustration of the proposed boundary feature extraction (BFE) module

3.2 Feature Fusion

Boundary information exhibits a certain degree of locality and specificity, while semantic features tend to be more global and abstract. By integrating these two types of features through feature fusion, we can enhance the performance of segmentation. Feature fusion capitalizes on the complementary nature of high-level semantic features and low-level boundary features. This combination allows for better delineation of object boundaries and extraction of semantic information, consequently improving the accuracy of instance segmentation.

To achieve feature fusion, we propose a Global Attention Module (GAM). The network architecture of GAM is depicted as the purple solid box in Fig. 3. Specifically, for the i-th layer of FPN [26], we initially employ a \( 1 \times 1 \) convolution to derive the boundary feature map \(B^{\prime }\). Subsequently, we utilize three pooling methods to downsample \( B^{\prime }\). One of these is Adaptive Max Pooling (AMP), which downsamples \( B^{\prime }\) into four distinct spatial sizes, namely \( \left\{ 1\times 1, 5\times 6, 9\times 12, 14\times 22\right\} \). Finally, we employ a flatten-concatenate operation to produce a feature matrix \( B^{M} \in R^{447 \times 256 } \). This process is formulated as follows:

$$\begin{aligned} B^{M}=fc({\textit{AMP}}_{j}(B^{\prime })) \end{aligned}$$
(1)

where fc denotes the flatten-concatenate operation, and j indexes the different output sizes of the adaptive max pooling. Global max pooling (GMP) and global average pooling (GAP) are also applied to downsample \( B^{\prime }\), each followed by a fully connected layer, producing two \( 256\times 1\)-d attention feature maps. These maps are added pixel by pixel to obtain the channel-refined feature \( A^{C} \), which is then multiplied by \( B^{M} \) and passed through a Softmax activation function to obtain \( A_{i}^{B} \). Subsequently, we apply a transposition operation on \( A_{i}^{B} \) and multiply it with \( B^{M} \), generating the global attention feature \( B_{i}^{G} \). This feature is then integrated channel-wise with \( F_{i} \). Ultimately, the fused feature \( F_{i}^{\prime }\) is used as input for the subsequent networks. The generation of \( F_{i}^{\prime }\) can be formulated as follows:

$$\begin{aligned} A^{C}= & {} {\textit{FC}}({\textit{GMP}}(B^{\prime }))+{\textit{FC}}({\textit{GAP}}(B^{\prime }))\end{aligned}$$
(2)
$$\begin{aligned} B_{i}^{G}= & {} ({\textit{Softmax}}(B^{M} \otimes A^{C}))^{T} \otimes B^{M}\end{aligned}$$
(3)
$$\begin{aligned} F_{i}^{\prime }= & {} {\textit{sum}}(F_{i},B_{i},B_{i}^{G}) \end{aligned}$$
(4)

where FC denotes a fully connected layer, \( \otimes \) denotes matrix multiplication, Softmax denotes the Softmax activation function, sum denotes channel-wise summation, and \( B_{i} \) denotes the boundary feature map of the i-th layer.
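The following is a minimal PyTorch sketch of the GAM computation in Eqs. (1)–(4), assuming 256-channel inputs and interpreting the channel-wise sum in Eq. (4) as element-wise addition with \( B_{i}^{G} \) broadcast over spatial positions; all names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GAM(nn.Module):
    """Global Attention Module: builds the multi-scale token matrix B^M, the
    channel attention A^C, and the global attention feature B^G (Eqs. (1)-(4))."""
    def __init__(self, channels=256):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)  # -> B'
        self.pool_sizes = [(1, 1), (5, 6), (9, 12), (14, 22)]     # 447 tokens total
        self.fc_max = nn.Linear(channels, channels)
        self.fc_avg = nn.Linear(channels, channels)

    def forward(self, f_i, b_i):
        # f_i: FPN feature [B, C, H, W]; b_i: boundary feature [B, C, H, W]
        b = self.proj(b_i)                                        # B'
        # Eq. (1): adaptive max pooling at four sizes, then flatten-concatenate.
        tokens = [F.adaptive_max_pool2d(b, s).flatten(2) for s in self.pool_sizes]
        b_m = torch.cat(tokens, dim=2).transpose(1, 2)            # [B, 447, C]
        # Eq. (2): channel-refined feature from GMP and GAP branches.
        gmp = F.adaptive_max_pool2d(b, 1).flatten(1)              # [B, C]
        gap = F.adaptive_avg_pool2d(b, 1).flatten(1)              # [B, C]
        a_c = (self.fc_max(gmp) + self.fc_avg(gap)).unsqueeze(2)  # [B, C, 1]
        # Eq. (3): softmax attention over the 447 tokens, then re-aggregation.
        attn = torch.softmax(torch.bmm(b_m, a_c), dim=1)          # [B, 447, 1]
        b_g = torch.bmm(attn.transpose(1, 2), b_m).squeeze(1)     # [B, C]
        # Eq. (4): channel-wise sum, broadcasting B^G over spatial positions.
        return f_i + b_i + b_g[:, :, None, None]
```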

The feature fusion module can facilitate the network to be more effective in information transmission and contextual association. The boundary information extracted by the boundary branch and the original features interact and merge in the feature fusion module. This can increase the network’s attention to boundary information and optimize the instance segmentation results by utilizing the contextual association of boundary information. Through feature fusion, the network can better utilize global and local contexts, improving the accuracy and robustness of instance segmentation.

Fig. 3

Network architecture of the proposed feature fusion module, which contains a global attention module (GAM)

3.3 Learning and Optimization

Inspired by edge detection [34, 35], we view boundary prediction as a pixel-level classification problem. We fuse the learned boundary features with Mask R-CNN [7] features to provide shape information for mask prediction.

Boundary Groundtruths Popular instance segmentation datasets, such as the COCO dataset [36] and the PASCAL VOC dataset [37], do not provide annotated ground truths for boundaries. However, the boundary branch requires boundary annotations in the form of binary boundary maps as training labels. We therefore used the mask groundtruths to generate boundary groundtruths that contain only the boundary of each instance. As shown in Fig. 4, we extracted instance masks from the annotated images and computed the corresponding boundary information from the edges of the masks. During training, we fed the images and their corresponding binary boundary maps into the model, calculated the loss function, and updated the parameters. Through iterative training, the boundary branch learns more precise and robust boundary feature representations, thereby improving the model’s overall performance.
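As one concrete way to realize this conversion (not necessarily the authors' exact procedure), a binary boundary map can be derived from an instance mask with a morphological gradient:

```python
import numpy as np
import cv2

def mask_to_boundary(mask: np.ndarray, thickness: int = 1) -> np.ndarray:
    """Convert a binary instance mask (H x W, uint8 in {0, 1}) into a binary
    boundary map via the morphological gradient (dilation minus erosion)."""
    kernel = np.ones((3, 3), np.uint8)
    dilated = cv2.dilate(mask, kernel, iterations=thickness)
    eroded = cv2.erode(mask, kernel, iterations=thickness)
    return (dilated - eroded).astype(np.uint8)  # 1 on the boundary, 0 elsewhere
```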

Boundary Loss To enhance the training of the boundary branch, we combine a boundary loss [38] with a dice loss [39]. We define the boundary loss as \( L_{{\textit{bou}}}\) and the dice loss as \( L_{{\textit{dice}}}\), described as follows:

$$\begin{aligned} L_{{\textit{bou}}}&= -\frac{1}{N} \sum _{i=1}^{N} \omega _{i}[y_{i}(1-\hat{y_{i}})+(1-y_{i})\hat{y_{i}}],\end{aligned}$$
(5)
$$\begin{aligned} L_{{\textit{dice}}}&= 1 - \frac{2\sum _{i=1}^{N} (\hat{y_{i}} \cdot y_{i}) +\varepsilon }{\sum _{i=1}^{N} \hat{y_{i}}^{2}+\sum _{i=1}^{N} y_{i}^{2}+\varepsilon }, \end{aligned}$$
(6)

where N represents the number of pixels in the image, y and \( {\hat{y}} \) represent the model prediction and the ground-truth label, respectively, \( \omega _{i} \) is a weighting factor that balances the effect of boundary and non-boundary pixels, and \( \varepsilon \) is a constant used to avoid division by zero, which we set to 1.

Our boundary branch loss function \( L_{b} \) is defined as:

$$\begin{aligned} L_{b} = L_{{\textit{bou}}} + \alpha L_{{\textit{dice}}}, \end{aligned}$$
(7)

where \( \alpha \) is a hyperparameter, and here we set \( \alpha = 1 \).
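The following is a direct PyTorch transcription of Eqs. (5)–(7), keeping the paper's convention that y is the prediction and \( {\hat{y}} \) the ground-truth label; the per-pixel weights \( \omega _{i} \) are assumed to be supplied by the caller.

```python
import torch

def boundary_branch_loss(pred: torch.Tensor, target: torch.Tensor,
                         weight: torch.Tensor, eps: float = 1.0,
                         alpha: float = 1.0) -> torch.Tensor:
    """L_b = L_bou + alpha * L_dice, as in Eqs. (5)-(7).
    pred, target, weight: flattened [N] tensors; pred in [0, 1] after sigmoid,
    target a binary boundary label, weight the per-pixel balancing factor."""
    # Eq. (5), transcribed as written: weighted disagreement between
    # prediction (y) and ground truth (y_hat), averaged over N pixels.
    l_bou = -(weight * (pred * (1 - target) + (1 - pred) * target)).mean()
    # Eq. (6): dice loss with smoothing constant eps (set to 1 in the paper).
    inter = (pred * target).sum()
    l_dice = 1 - (2 * inter + eps) / ((pred ** 2).sum() + (target ** 2).sum() + eps)
    # Eq. (7): weighted combination with alpha = 1 by default.
    return l_bou + alpha * l_dice
```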

4 Experiments

4.1 Dataset and Evaluation Metric

To demonstrate the efficacy of BGF, we performed comprehensive experiments on the COCO [36] and Cityscapes [40] datasets. The COCO dataset consists of a training set (train2017), a validation set (val2017), and a test set (test-dev2017), which contain 118,287, 5000, and 40,670 images, respectively. We trained our model on train2017, compared it with other advanced methods on test-dev2017, and conducted ablation experiments on val2017. The Cityscapes dataset contains 2975 training images, 500 validation images, and 1525 test images. For instance segmentation, Cityscapes covers eight object categories and provides more accurate annotations than COCO. The model was trained on the designated training set and evaluated against state-of-the-art techniques on the test set. We employed AP, \({\textit{AP}}_{50} \), and \({\textit{AP}}_{75} \) as evaluation metrics for the COCO dataset, where AP is averaged over IoU thresholds ranging from 0.5 to 0.95 with a step size of 0.05. For the Cityscapes dataset, our evaluation was limited to AP and \({\textit{AP}}_{50} \).

Fig. 4

Visualization of a portion of the dataset. a Original image, b Mask image, c Boundary ground truth

4.2 Implementation Details

Our proposed method is built upon Mask R-CNN [7], which serves as the baseline. We utilize ResNet-FPN as the backbone network, initialize its parameters with pre-trained weights from ImageNet [41], and initialize the remaining layers randomly. To enhance boundary information learning, we additionally employ the PASCAL VOC Context dataset [42] to train the boundary branch separately. A smaller learning rate and stochastic gradient descent are used for boundary-branch training to prevent excessive adjustment of the pre-trained network parameters and to maintain the original feature extraction function. Following common practice, we train all models on 4 NVIDIA GPUs.

Our approach to training on the Microsoft COCO dataset [36] involved utilizing a batch size of 16 and a maximum of 60,000 iteration steps. To optimize performance, we set the initial learning rate to 0.02 and applied a reduction factor of 0.1 and 0.01 at the 30,000th and 50,000th iterations, respectively. Additionally, we set the weight decay to 0.0001. For the Cityscapes dataset, we adopted a batch size of 4 and a maximum of 30,000 iteration steps, with the initial learning rate set to 0.004. The learning rate was then reduced to 0.0004 at the 20,000th iteration while retaining the same hyperparameters as in the Mask R-CNN [7] experiments.
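For reference, the two schedules above can be summarized as follows; this is a hypothetical configuration dictionary for illustration, not the authors' actual config format.

```python
# Training schedules from Sect. 4.2 (values taken from the text above).
coco_schedule = dict(
    batch_size=16, max_iters=60_000, base_lr=0.02, weight_decay=1e-4,
    lr_steps={30_000: 0.002, 50_000: 0.0002},  # 0.1x and 0.01x of the base lr
)
cityscapes_schedule = dict(
    batch_size=4, max_iters=30_000, base_lr=0.004,
    lr_steps={20_000: 0.0004},  # reduced to 0.0004 at the 20,000th iteration
)
```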

4.3 Experiments on COCO Dataset

4.3.1 Ablation Study

In this part, we demonstrate the effectiveness of our proposed modules by evaluating them on COCO val2017 using \({\textit{AP}}\), \({\textit{AP}}_{50} \), and \({\textit{AP}}_{75} \); the backbone network is ResNet-50-FPN. The results are presented in Table 2.

Table 2 Ablation study on COCO val2017 datasets of adding components to Mask R-CNN. PVC represents training the boundary branch separately using the PASCAL VOC Context dataset

The symbol “✓” in Table 2 indicates that the corresponding module is added to Mask R-CNN [7], while “–” means the opposite. The first row represents the baseline Mask R-CNN [7] model without our modules. Comparing the first two rows, we observe that adding the BFE module alone does not significantly affect the evaluation metrics, suggesting that the BFE module by itself cannot enhance the model’s performance. The reason is that the information extracted by our boundary branch must be integrated with the information in FPN to take effect; adding the BFE module alone therefore does not influence performance.

Further comparison of rows I and III shows a slight improvement in evaluation metrics when adding both the BFE and Fusion modules to Mask R-CNN. Specifically, \({\textit{AP}}\), \({\textit{AP}}_{50}\), and \({\textit{AP}}_{75} \) increased from 34.5%, 55.8%, and 36.7% to 35.3%, 57.1%, and 37.6%, respectively. However, directly fusing the features extracted by the BFE module with those output by the FPN [26] network has a limited impact on performance because of the high degree of coupling between them. To overcome this issue, we introduced the GAM module into the Fusion module. Row V of Table 2 clearly shows that including the GAM module significantly improves network performance: compared to row III, \({\textit{AP}}\), \({\textit{AP}}_{50} \), and \({\textit{AP}}_{75} \) increase from 35.3%, 57.1%, and 37.6% to 37.6%, 58.7%, and 39.6%, respectively. This indicates that our GAM module, which extracts multi-scale global attention features, fuses more effectively with the FPN output features, resulting in a significant performance boost.

Table 3 Comparison with state-of-the-art methods on COCO test-dev2017

We also analyzed whether training the boundary branch independently on an additional dataset could enhance model performance. Rows III and VI in Table 2 both use the PASCAL VOC Context dataset [42] to train the boundary branch independently. Compared to the results in row II, each metric improves in row III. The most significant improvement occurs in row VI, where we add all our modules to Mask R-CNN and use the PASCAL VOC Context dataset [42] to train the boundary branch separately, achieving the best \({\textit{AP}}\), \({\textit{AP}}_{50}\), and \({\textit{AP}}_{75}\) of 38.7%, 59.8%, and 40.9%, respectively.

4.3.2 Comparative Experiments

To validate the superiority of our proposed method, we used ResNet-50-FPN and ResNet-101-FPN as feature extraction networks. We trained on COCO train2017 and tested our BGF on COCO test-dev2017. We compared our method with other advanced instance segmentation methods, including Mask R-CNN [7], Mask Scoring R-CNN [11], BMask R-CNN [24], PointRend [23], Mask R-CNN with BEM [25], and MaskLab [10]. As shown in Table 3, when the backbone network is ResNet-50-FPN and the boundary branch is trained with the additional PASCAL VOC Context dataset [42], our model achieves optimal performance, with \({\textit{AP}}\), \({\textit{AP}}_{50}\), and \({\textit{AP}}_{75} \) of 40.8%, 61.2%, and 43.4%, respectively.

Fig. 5

Visual comparison of the proposed BGF and baseline methods. GT denotes the groundtruth mask, Boundary denotes the boundary generated by GT, MRCNN denotes the mask predicted by Mask R-CNN

Fig. 6

Visualization of feature maps at different stages. a Original image, b C1 layer, c S2 layer, d Boundary, e F2 layer, f F2 with boundary

Table 4 Comparison with state-of-the-art methods on the Cityscapes val (AP column) and test (remaining columns) sets
Fig. 7

Visualization of the results predicted by different methods on the Cityscapes dataset. a Original image, b Mask R-CNN, c Our BGF

We also visualized the prediction results of our BGF and Mask R-CNN [7]. As shown in Fig. 5, the left side presents the predictions of the two methods. To compare the two methods more intuitively, we binarized the prediction results (shown on the right side of Fig. 5). The yellow dashed boxes show that the boundaries predicted by our BGF are more accurate and align better with the target instances. Through precise boundary segmentation, the model can better distinguish the boundaries between different objects and thus identify them more accurately.

We also visualized the feature maps of different stages in ResNet101. As shown in Fig. 6f, after the fusion of boundary information into the F2 layer of FPN, the output feature map of the F2 layer contains more instance boundary information, thereby improving the accuracy of mask boundary prediction.

4.4 Results on Cityscapes Dataset

In addition, we present instance segmentation results for the 8 categories in the Cityscapes dataset [40]. We used only the fine annotations to train our model, without the 20k coarsely annotated training images. All images were sized at \(2048 \times 1024\). During training, to reduce overfitting, we randomly scaled the shorter edge of each image to a value in [800, 1024].

The Cityscapes dataset [40] contains a large number of occluded instances in the person and car categories, which makes instance segmentation difficult. Compared to Mask R-CNN [7], our BGF achieves significant improvements in these two categories, from 30.5% to 38.5% for person and from 46.9% to 55.7% for car, as detailed in Table 4.

As shown in Fig. 7, when there are overlaps in the person and car categories, our method can obtain more accurate instance boundaries, as evidenced by the red box in Fig. 7. Our method also performs better than Mask R-CNN [7] in cases where the target is small or blurry, as shown in the blue box in Fig. 7.

Fig. 8

Comparison of results predicted by different methods in dense scenes. a Original image, b Mask R-CNN, c Our BGF

Table 5 Real-time setting comparison of speed and accuracy with other methods on COCO test-dev

4.5 Discussions

In this study, we introduced a boundary branch into Mask R-CNN to extract boundary information and to determine its influence on the prediction accuracy and quality of masks. Previous research on improving mask boundary quality mainly focused on optimizing boundaries during prediction [10, 24] or refining them with post-processing techniques [23, 29, 30]. These works have improved mask prediction accuracy to some extent. However, prediction accuracy and mask quality are not necessarily positively correlated: masks with high prediction accuracy may still have rough boundaries. Unlike previous methods, we first extract boundary information and then integrate it into the network for subsequent detection and segmentation. Extensive quantitative and qualitative results demonstrate that our proposed boundary branch not only improves mask prediction accuracy (a 5.1-point increase in mask AP over Mask R-CNN) but also yields high-quality, accurate masks.

Fig. 9

Failure cases. We provide four examples on the COCO dataset to illustrate the model limitations of our method

We summarize the possible reasons as follows. Firstly, the boundary branch effectively improves boundary localization and captures edge details, learning the features of target boundaries and producing masks with better continuity and smoothness while improving segmentation accuracy. Secondly, our proposed GAM module further improves most metrics, especially \({\textit{AP}}_{75} \), indicating that GAM helps generate more accurate instance masks. Notably, without GAM, fusing features through simple element-wise summation yields only minimal improvement; GAM resolves this problem. In addition, if an image contains multiple overlapping instances, the network may fail to locate and segment each object, as shown in Fig. 8b. Our method partially addresses instance segmentation in dense scenes by providing more accurate boundary information, matching instance boundaries with their internal regions more precisely, as shown in Fig. 8c.

Fig. 10

Image with camouflage target

To discuss the limitations of our proposed instance segmentation model, we compared its inference speed with other methods in Table 5. All speeds were measured on an RTX 3090, so some listed speeds may be faster than those originally reported in the corresponding papers. Compared to Mask R-CNN, our BGF improves inference speed by approximately 20 ms. One possible reason is that the RoI pooling operation used in the boundary branch effectively maps regions of interest of different sizes onto a fixed-size feature map, reducing the need for repeated sampling operations on the feature map and thus increasing processing speed. Despite the higher accuracy and quality of our mask predictions, our inference speed still lags the real-time algorithm YOLACT [44] by approximately 60 ms, making real-time applications challenging. Additionally, we present some failure cases in Fig. 9: our method may struggle to segment accurately when strong contours exist within the object, as demonstrated in the first to third columns of Fig. 9 (e.g., the long fur on the bear's hindquarters, the long fur along the cat's boundary, and the region around the zebra's ears). Finally, detecting and segmenting images without obvious targets, such as camouflaged objects, remains challenging for our proposed network. We tested our BGF model on the example images in Fig. 10, but no targets were detected. Although the detection and segmentation of camouflaged objects is beyond the scope of this paper, it still falls under the category of instance segmentation.

In future research, we will explore ways to optimize the boundary branch and integrate it into lightweight networks to further improve the performance and application scope of our models.

5 Conclusion

This article proposes a boundary-guided global attention feature fusion instance segmentation method based on the Mask R-CNN network, which obtains more accurate mask boundaries and higher detection accuracy. Specifically, we introduce a boundary branch and use the Boundary Feature Extraction (BFE) module to extract instance boundary information. In addition, to better perform feature fusion, we propose a Global Attention Module (GAM), which generates multi-scale global attention features and fuses them with the baseline features. Experimental results on two popular datasets show that our proposed method outperforms other instance segmentation methods in mask prediction accuracy. Our future work will further optimize the boundary branch and incorporate it into lightweight networks to improve real-time performance and broaden the application scope.