1 Introduction

The task of instance segmentation presents a significant challenge within the field of computer vision, requiring not only pixel-level classification but also precise localization of individual instances, even when they belong to the same category. Its applications span fields such as autonomous driving [1, 2], assisted medical diagnosis [3, 4], and information hiding [5, 6]. With the continuous development of deep learning, the accuracy of instance segmentation models has improved greatly.

In recent years, several instance segmentation algorithms have been proposed, demonstrating excellent performance. He et al. [7] proposed Mask R-CNN, which adds a mask branch on top of Faster R-CNN [8] to predict binary masks and has become the most representative framework for two-stage instance segmentation. However, because Mask R-CNN [7] relies mainly on pixel-level masks for instance segmentation, it tends to ignore the boundary information of objects, making it difficult to handle objects with complex boundary structures, such as very thin or twisted ones. Subsequently, many scholars have improved Mask R-CNN [7], and a large number of frameworks [9,10,11] have been proposed that achieve better performance. Additionally, one-stage instance segmentation methods have received significant attention, spurred by the success of one-stage detectors [12, 13]. There is also work on using graph neural networks for segmentation [14,15,16], which offers inspiring ideas.

However, the binary instance masks predicted by existing methods cannot fully cover the object instance and often have rough boundaries. This problem can be attributed to several factors. Firstly, the mask head uses high-level features for mask prediction, which carry rich semantic information but often miss important low-level spatial information such as boundaries. The usual solution is to fuse high-level and low-level information [17,18,19,20], or to optimize the loss function so that the model pays more attention to boundary information [11, 21]. Secondly, existing methods treat all pixels equally during classification, yet boundary pixels account for less than 1% of the entire image and are difficult to classify [22]. Existing improvements strengthen instance boundary information [9, 23,24,25], for example by thickening boundaries or expanding the boundary region. Finally, Cheng et al. [24] found that Mask IoU is insensitive to the quality of object boundary segmentation: even if boundary pixels are segmented poorly, Mask IoU can still yield high scores. Therefore, simply improving mask AP does not necessarily improve mask quality, as it may reflect the interior of the mask more than the boundary.

Although the above methods have improved the accuracy of mask boundary prediction to some extent, several limitations remain. Firstly, existing methods optimize the mask segmentation results but do not fundamentally solve the problem of rough boundary segmentation: they still do not make full use of the object’s boundary information, resulting in insufficiently smooth mask boundaries. In addition, in dense scenes with occlusion and complex instance shapes, boundary cues may become unreliable, leading to inaccurate segmentation. Objects with complex shapes may contain many curves, angles, and fine details, making their boundaries difficult to delineate accurately; the model may mistakenly include background regions inside the object or classify part of the object as background, introducing inaccuracies into the segmentation results.

To address these issues, we propose a Boundary-guided Global Feature Fusion (BGF) method that extracts boundary information at different stages and fuses it to obtain high-quality mask boundaries. Specifically, we design a Boundary Feature Extraction (BFE) module that extracts boundary information at different stages to enhance the quality of predicted masks. We use a feature fusion module to merge the features extracted by the Mask R-CNN [7] network with the boundary information extracted by the BFE module for mask prediction. We also introduce a Global Attention Module (GAM) to extract representative global features for feature fusion, which greatly improves the model’s performance. Finally, we predict the masks using the fused features.

The key achievements and contributions of this research paper are highlighted below:

  1. We introduce a Boundary Feature Extraction (BFE) module within Mask R-CNN, augmenting the network with a dedicated branch that improves the accuracy of instance mask predictions through the incorporation of boundary information.

  2. We propose a feature fusion module that includes a Global Attention Module (GAM) for fusing features derived from the Feature Pyramid Network (FPN) and the BFE module. This module enhances the network’s performance in both bounding box and mask prediction.

  3. We leverage an existing dataset to derive instance boundaries, which serve as training data for our boundary branch. Additionally, we introduce a novel boundary loss function, enhancing the effectiveness of our approach.

2 Related Work

2.1 Instance Segmentation

Existing methods for instance segmentation can be divided into two categories based on the number of stages: one-stage and two-stage methods. Two-stage instance segmentation typically involves detection followed by segmentation: detection-based methods use detection heads [8, 26] to generate region proposals and then predict masks. The representative architecture is Mask R-CNN [7], which adds a mask prediction branch to the two-stage object detector Faster R-CNN and uses RoIAlign instead of RoI pooling. After Mask R-CNN [7], PANet [9] adds a bottom-up path to FPN [26] and develops adaptive feature pooling, which shortens the information pathway between feature maps and restores the information connection between candidate regions and feature layers. DetNet [27] adds dilated convolutions to the backbone, which preserves feature resolution while increasing the receptive field, and proposes retraining the backbone network for detection and segmentation tasks to improve feature representation. MaskLab [10] is built on top of Faster R-CNN [8] and adds direction features for segmenting instances of the same class, using direction prediction to estimate the direction of each pixel with respect to its corresponding instance center, thereby achieving instance segmentation within the same semantic class. Mask Scoring R-CNN [11] observes that existing mask scoring strategies rely on classification scores and lack a targeted evaluation mechanism; building on Mask R-CNN [7], it modifies the mask evaluation criteria and improves instance segmentation performance by adding a Mask IoU branch that scores the predicted masks. These methods have all improved performance to some extent. Ye et al. [28] proposed a lightweight hybrid model that combines a supervised hierarchical granularity parsing task with an unsupervised image matting task. The model includes a scalable hierarchical semantic segmentation module and an image matting module composed of guided filters. Through multi-stage inference and feedback mechanisms, the model achieves fine-grained segmentation and improves the segmentation results.

2.2 Boundary Segmentation

Recently, more and more researchers have focused on improving the quality of mask boundaries. Hayder et al. [21] proposed Boundary-Aware Instance Segmentation (BAIS) to enhance instance segmentation mask boundaries. BAIS uses a novel boundary loss function that combines the internal information of the object with its edge information to encourage the model to pay more attention to the object’s boundary. Kirillov et al. [23] introduced the Point-based Rendering (PointRend) module, which uses a new upsampling method to refine segmentation at object edges, achieving good performance on hard-to-segment edge regions. Some methods improve boundary quality through post-processing. Liang et al. [29] use a polygon-based method (PolyTransform) to represent instance masks as polygons, which better adapts to complex shape changes. Yuan et al. [30] replace rough predictions of boundary pixels with interior predictions, but this relies on accurate boundary prediction. Tang et al. [22] use the Boundary Patch Refinement (BPR) framework, which crops and refines patches along the boundaries of the instance masks predicted by segmentation models. For these post-processing methods, no boundary information is used during mask prediction itself, which hinders further improvement of segmentation performance. Chen et al. [10] propose MaskLab, a model that simultaneously performs instance segmentation, keypoint detection, and segmentation edge prediction. Specifically, MaskLab [10] generates instance mask, keypoint, and edge branches using a multi-branch convolutional network and then integrates them into a unified framework. In addition, MaskLab introduces an attention-based feature selection module to enhance the network’s attention and response to important areas. Cheng et al. [24] proposed Boundary-preserving Mask R-CNN (BMask R-CNN), which uses boundary information to improve mask localization accuracy: by limiting the distance between the segmentation mask and the edge, BMask R-CNN preserves boundary information so that the predicted mask aligns well with the instance boundary. Hu et al. [25] proposed an attention-based module to enhance the boundary information in the segmentation feature map and improve segmentation quality. Feng et al. [31] proposed Boundary Knowledge Translation (BKT), a task that aims to accurately segment new categories with a small number of labeled samples by transferring visual boundary knowledge from labeled categories. To accomplish this, they propose the Translation Segmentation Network (Trans-Net), which consists of a segmentation network and two boundary discriminators; through self-supervised mechanisms and adversarial training, Trans-Net achieves segmentation results comparable to fully supervised methods with only tens of labeled samples for guidance. Cheng et al. [32] proposed Ref-Net, a reference semantic segmentation network that achieves accurate object segmentation by introducing reference object features and boundary knowledge transfer. Unlike traditional segmentation networks that rely heavily on labeled data, Ref-Net requires only a small number of finely annotated samples as guidance and achieves results comparable to fully supervised methods on multiple datasets.

3 Methods

We propose the BGF, which is based on the Mask R-CNN framework and uses ResNet [33]+FPN [26] as the feature extraction network. The overall framework is illustrated in Fig. 1. In this section, we provide an overview of the boundary branch (Sect. 3.1), and then we explain how we perform feature fusion to enhance the model’s performance (Sect. 3.2), followed by a description of our methods for learning and optimization (Sect. 3.3).

Fig. 1

The overall architecture of the proposed boundary-guided global feature fusion (BGF)

3.1 Boundary branch

Our chosen baseline network is Mask R-CNN [7], and we adopt its configurations. The network architecture is depicted in Fig. 1, with C1 representing a convolutional layer and S2–S5 denoting stages 2–5 of the ResNet network. Each stage consists of multiple residual blocks, and the feature map size progressively decreases from 1/4 to 1/32 of the original input. Moreover, F2 to F5 refer to the top-down feature pyramid network (FPN) [26], constructed through lateral connections: the output feature map of each FPN layer is upsampled and element-wise added to the laterally connected feature map of the corresponding ResNet stage. Furthermore, F6 is obtained via max pooling from F5 and fed directly into the RPN to generate Regions of Interest (RoIs). Each RoI is subsequently fed into the classification and bounding box regression branches, alongside the newly introduced mask branch. For more details, please refer to [7, 8, 26].
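For concreteness, the following is a minimal PyTorch sketch of the top-down FPN pathway summarized above; the class and parameter names (SimpleFPN, in_channels) are illustrative assumptions rather than the exact implementation used in this work.

```python
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Top-down pathway with lateral connections (channel sizes illustrative)."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1x1 lateral convolutions that align ResNet stage channels
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        # 3x3 smoothing convolutions applied after the top-down fusion
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, feats):  # feats: [S2, S3, S4, S5] from ResNet
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        for i in range(len(laterals) - 2, -1, -1):  # top-down: upsample and add
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest")
        outs = [s(x) for s, x in zip(self.smooth, laterals)]          # F2-F5
        outs.append(F.max_pool2d(outs[-1], kernel_size=1, stride=2))  # F6 from F5
        return outs
```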

Despite the effectiveness of Mask R-CNN [7] in achieving favorable instance segmentation outcomes, the challenge of imprecise mask boundaries remains unresolved. Therefore, we integrate boundary information into the network to enhance the accuracy of mask boundary prediction.

As illustrated in Fig. 2, we have developed a Boundary Feature Extraction (BFE) module for extracting instance boundaries at various stages of the ResNet network. The BFE module leverages features from four distinct stages of ResNet, each consisting of multiple conv_blocks and identity_blocks. The input and output sizes of the BFE module are presented in Table 1. To detect boundaries, we transform the output features of the last layer in each conv_block and identity_block of every stage using a \( 1 \times 1 \) convolution with 32 channels. Subsequently, we combine the adjusted features from different blocks by pixel-wise summation. Finally, a \( 1 \times 1 \) convolution with a single channel is employed to generate boundary detection results for each stage.

Table 1 Input and output size of the BFE module

At the same time, we employ a \( 1 \times 1 \) convolution with a single channel to refine the feature maps obtained from the C1 stage. Subsequently, we upsample the feature maps from the S2 to S5 stages, which are outputs of the BFE module, in order to match the size of C1. We then combine the refined C1 feature map with the upsampled feature maps, resulting in a fused boundary feature map that covers all five stages (C1–S5). Finally, for boundary detection results, we utilize a \( 1 \times 1 \) convolution with a single channel and apply a Sigmoid activation function.
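To make the two steps above concrete, here is a minimal PyTorch sketch of the BFE pipeline under our reading of Fig. 2 and Table 1; the module and parameter names (StageBoundaryHead, block_channels) are hypothetical, and the exact block outputs tapped may differ from the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StageBoundaryHead(nn.Module):
    """Per-stage boundary detection: a 32-channel 1x1 conv on each block's
    output, pixel-wise summation, then a single-channel 1x1 conv."""
    def __init__(self, block_channels, num_blocks):
        super().__init__()
        self.reduce = nn.ModuleList(
            nn.Conv2d(block_channels, 32, kernel_size=1) for _ in range(num_blocks))
        self.detect = nn.Conv2d(32, 1, kernel_size=1)

    def forward(self, block_feats):  # list of [B, C, H, W], one per block
        fused = sum(r(f) for r, f in zip(self.reduce, block_feats))
        return self.detect(fused)    # [B, 1, H, W] boundary map for this stage

class BFE(nn.Module):
    """Fuses the per-stage boundary maps (S2-S5) with a refined C1 map."""
    def __init__(self, c1_channels, stage_channels, stage_blocks):
        super().__init__()
        self.c1_refine = nn.Conv2d(c1_channels, 1, kernel_size=1)
        self.stage_heads = nn.ModuleList(
            StageBoundaryHead(c, n) for c, n in zip(stage_channels, stage_blocks))
        self.final = nn.Conv2d(1, 1, kernel_size=1)

    def forward(self, c1, stage_block_feats):
        out = self.c1_refine(c1)                       # refined C1 map
        for head, feats in zip(self.stage_heads, stage_block_feats):
            b = head(feats)                            # per-stage boundary map
            out = out + F.interpolate(b, size=c1.shape[-2:], mode="bilinear",
                                      align_corners=False)  # upsample to C1 size
        return torch.sigmoid(self.final(out))          # fused boundary prediction
```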

In instance segmentation tasks, the precise identification of object boundaries holds significant importance. The inclusion of boundary information enables the network to gain a better understanding of object shapes and structures, leading to improved instance segmentation accuracy. The boundary branch extracts valuable boundary information, supplying additional details regarding boundary positions and shapes. This, in turn, enhances the network’s ability to perceive boundaries effectively.

Fig. 2

An illustration of the proposed boundary feature extraction (BFE) module

3.2 Feature Fusion

Boundary information exhibits a certain degree of locality and specificity, while semantic features tend to be more global and abstract. By integrating these two types of features through feature fusion, we can enhance the performance of segmentation. Feature fusion capitalizes on the complementary nature of high-level semantic features and low-level boundary features. This combination allows for better delineation of object boundaries and extraction of semantic information, consequently improving the accuracy of instance segmentation.

To achieve feature fusion, we propose a Global Attention Module (GAM). The network architecture of GAM is depicted as the purple solid box in Fig. 3. Specifically, for the i-th layer of FPN [26], we initially employ a \( 1 \times 1 \) convolution to derive the boundary feature map \(B^{\prime }\). Subsequently, we utilize three pooling methods to downsample \( B^{\prime }\). One of these is Adaptive Max Pooling (AMP), which downsamples \( B^{\prime }\) into four distinct spatial sizes, namely \( \left\{ 1\times 1, 5\times 6, 9\times 12, 14\times 22\right\} \). Finally, we employ a flatten-concatenate operation to produce a feature matrix \( B^{M} \in R^{447 \times 256 } \). This process is formulated as follows:

$$\begin{aligned} B^{M}=fc({\textit{AMP}}_{j}(B^{\prime })) \end{aligned}$$
(1)

where fc denotes the flatten-concatenate operation, and j indexes the different output sizes of the adaptive max pooling. Global max pooling (GMP) and global average pooling (GAP) are also applied to downsample \( B^{\prime }\), each followed by a fully connected layer, producing two \( 256\times 1\)-d attention feature maps. These maps are added pixel by pixel to obtain the channel-refined feature \( A^{C} \), which is then multiplied by \( B^{M} \) and passed through a Softmax activation function to obtain \( A_{i}^{B} \). Subsequently, we apply a transposition operation on \( A_{i}^{B} \) and multiply it with \( B^{M} \), generating the global attention feature \( B_{i}^{G} \). This feature is then integrated channel-wise with \( F_{i} \). Ultimately, the fused feature \( F_{i}^{\prime }\) is used as input for the subsequent networks. The generation of \( F_{i}^{\prime }\) can be formulated as follows:

$$\begin{aligned} A^{C}= & {} {\textit{FC}}({\textit{GMP}}(B^{\prime }))+{\textit{FC}}({\textit{GAP}}(B^{\prime }))\end{aligned}$$
(2)
$$\begin{aligned} B_{i}^{G}= & {} ({\textit{Softmax}}(B^{M} \otimes A^{C}))^{T} \otimes B^{M}\end{aligned}$$
(3)
$$\begin{aligned} F_{i}^{\prime }= & {} {\textit{sum}}(F_{i},B_{i},B_{i}^{G}) \end{aligned}$$
(4)

where FC denotes a fully connected layer, \( \otimes \) denotes matrix multiplication, Softmax denotes the Softmax activation function, sum denotes channel-wise summation, and \( B_{i} \) denotes the boundary feature map of the i-th layer.
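The following is a minimal PyTorch sketch of the GAM computation in Eqs. (1)–(4), assuming 256-channel inputs and interpreting the channel-wise sum in Eq. (4) as element-wise addition with \( B_{i}^{G} \) broadcast over spatial positions; all names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GAM(nn.Module):
    """Global Attention Module: builds the multi-scale token matrix B^M, the
    channel attention A^C, and the global attention feature B^G (Eqs. (1)-(4))."""
    def __init__(self, channels=256):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)  # -> B'
        self.pool_sizes = [(1, 1), (5, 6), (9, 12), (14, 22)]     # 447 tokens total
        self.fc_max = nn.Linear(channels, channels)
        self.fc_avg = nn.Linear(channels, channels)

    def forward(self, f_i, b_i):
        # f_i: FPN feature [B, C, H, W]; b_i: boundary feature [B, C, H, W]
        b = self.proj(b_i)                                        # B'
        # Eq. (1): adaptive max pooling at four sizes, then flatten-concatenate.
        tokens = [F.adaptive_max_pool2d(b, s).flatten(2) for s in self.pool_sizes]
        b_m = torch.cat(tokens, dim=2).transpose(1, 2)            # [B, 447, C]
        # Eq. (2): channel-refined feature from GMP and GAP branches.
        gmp = F.adaptive_max_pool2d(b, 1).flatten(1)              # [B, C]
        gap = F.adaptive_avg_pool2d(b, 1).flatten(1)              # [B, C]
        a_c = (self.fc_max(gmp) + self.fc_avg(gap)).unsqueeze(2)  # [B, C, 1]
        # Eq. (3): softmax attention over the 447 tokens, then re-aggregation.
        attn = torch.softmax(torch.bmm(b_m, a_c), dim=1)          # [B, 447, 1]
        b_g = torch.bmm(attn.transpose(1, 2), b_m).squeeze(1)     # [B, C]
        # Eq. (4): channel-wise sum, broadcasting B^G over spatial positions.
        return f_i + b_i + b_g[:, :, None, None]
```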

The feature fusion module can facilitate the network to be more effective in information transmission and contextual association. The boundary information extracted by the boundary branch and the original features interact and merge in the feature fusion module. This can increase the network’s attention to boundary information and optimize the instance segmentation results by utilizing the contextual association of boundary information. Through feature fusion, the network can better utilize global and local contexts, improving the accuracy and robustness of instance segmentation.

Fig. 3

Network architecture of the proposed feature fusion module, which contains a global attention module (GAM)

3.3 Learning and Optimization

Inspired by edge detection [34, 35], we view boundary prediction as a pixel-level classification problem. We fuse the learned boundary features with Mask R-CNN [7] features to provide shape information for mask prediction.

Boundary Groundtruths Popular instance segmentation datasets, such as the COCO dataset [36] and the PASCAL VOC dataset [37], do not provide annotated ground truths for boundaries. However, the boundary branch requires boundary annotations in the form of binary boundary maps as training labels. We therefore used the mask groundtruths to generate boundary groundtruths that contain only the boundary of each instance. As shown in Fig. 4, we extracted instance masks from the annotated images and computed the corresponding boundary information from the edges of the masks. During training, we fed the images and their corresponding binary boundary maps into the model, calculated the loss function, and updated the parameters. Through iterative training, the boundary branch learns more precise and robust boundary feature representations, thereby improving the model’s overall performance.
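As one concrete way to realize this conversion (not necessarily the authors' exact procedure), a binary boundary map can be derived from an instance mask with a morphological gradient:

```python
import numpy as np
import cv2

def mask_to_boundary(mask: np.ndarray, thickness: int = 1) -> np.ndarray:
    """Convert a binary instance mask (H x W, uint8 in {0, 1}) into a binary
    boundary map via the morphological gradient (dilation minus erosion)."""
    kernel = np.ones((3, 3), np.uint8)
    dilated = cv2.dilate(mask, kernel, iterations=thickness)
    eroded = cv2.erode(mask, kernel, iterations=thickness)
    return (dilated - eroded).astype(np.uint8)  # 1 on the boundary, 0 elsewhere
```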

Boundary Loss To enhance the training of the boundary branch, we combine a boundary loss [38] with a dice loss [39]. We define the boundary loss as \( L_{{\textit{bou}}}\) and the dice loss as \( L_{{\textit{dice}}}\), described as follows:

$$\begin{aligned} L_{{\textit{bou}}}&= -\frac{1}{N} \sum _{i=1}^{N} \omega _{i}[y_{i}(1-\hat{y_{i}})+(1-y_{i})\hat{y_{i}}],\end{aligned}$$
(5)
$$\begin{aligned} L_{{\textit{dice}}}&= 1 - \frac{2\sum _{i=1}^{N} (\hat{y_{i}} \cdot y_{i}) +\varepsilon }{\sum _{i=1}^{N} \hat{y_{i}}^{2}+\sum _{i=1}^{N} y_{i}^{2}+\varepsilon }, \end{aligned}$$
(6)

where N represents the number of pixels in the image, y and \( {\hat{y}} \) represent the model prediction and the ground-truth label, respectively, \( \omega _{i} \) is a weighting factor that balances the effect of boundary and non-boundary pixels, and \( \varepsilon \) is a constant used to avoid division by zero, which we set to 1.

Our boundary branch loss function \( L_{b} \) is defined as:

$$\begin{aligned} L_{b} = L_{{\textit{bou}}} + \alpha L_{{\textit{dice}}}, \end{aligned}$$
(7)

where \( \alpha \) is a hyperparameter, and here we set \( \alpha = 1 \).
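The following is a direct PyTorch transcription of Eqs. (5)–(7), keeping the paper's convention that y is the prediction and \( {\hat{y}} \) the ground-truth label; the per-pixel weights \( \omega _{i} \) are assumed to be supplied by the caller.

```python
import torch

def boundary_branch_loss(pred: torch.Tensor, target: torch.Tensor,
                         weight: torch.Tensor, eps: float = 1.0,
                         alpha: float = 1.0) -> torch.Tensor:
    """L_b = L_bou + alpha * L_dice, as in Eqs. (5)-(7).
    pred, target, weight: flattened [N] tensors; pred in [0, 1] after sigmoid,
    target a binary boundary label, weight the per-pixel balancing factor."""
    # Eq. (5), transcribed as written: weighted disagreement between
    # prediction (y) and ground truth (y_hat), averaged over N pixels.
    l_bou = -(weight * (pred * (1 - target) + (1 - pred) * target)).mean()
    # Eq. (6): dice loss with smoothing constant eps (set to 1 in the paper).
    inter = (pred * target).sum()
    l_dice = 1 - (2 * inter + eps) / ((pred ** 2).sum() + (target ** 2).sum() + eps)
    # Eq. (7): weighted combination with alpha = 1 by default.
    return l_bou + alpha * l_dice
```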

4 Experiments

4.1 Dataset and Evaluation Metric

To demonstrate the efficacy of BGF, we performed comprehensive experiments on the COCO [36] and Cityscapes [40] datasets. The COCO dataset consists of a training set (train2017), a validation set (val2017), and a test set (test-dev2017), which contain 118,287, 5000, and 40,670 images, respectively. We trained our model on train2017, compared it with other advanced methods on test-dev2017, and conducted ablation experiments on val2017. The Cityscapes dataset contains 2975 training images, 500 validation images, and 1525 test images. For instance segmentation, Cityscapes covers eight object categories and provides more accurate annotations than COCO. The model was trained on the designated training set and evaluated against state-of-the-art techniques on the test set. We employed AP, \({\textit{AP}}_{50} \), and \({\textit{AP}}_{75} \) as evaluation metrics for the COCO dataset, where AP is averaged over IoU thresholds ranging from 0.5 to 0.95 with a step size of 0.05. For the Cityscapes dataset, our evaluation was limited to AP and \({\textit{AP}}_{50} \).

Fig. 4

Visualization of a portion of the dataset. a Original image, b Mask image, c Boundary ground truth

4.2 Implementation Details

Our proposed method is built upon Mask R-CNN [7], which serves as the baseline. We utilize ResNet-FPN as the backbone network, initialize its parameters with pre-trained weights from ImageNet [41], and initialize the remaining layers randomly. To enhance boundary information learning, we additionally employ the PASCAL VOC Context dataset [42] to train the boundary branch separately. A smaller learning rate and stochastic gradient descent are used for boundary-branch training to prevent excessive adjustment of the pre-trained network parameters and to maintain the original feature extraction function. Following common practice, we train all models on 4 NVIDIA GPUs.

Our approach to training on the Microsoft COCO dataset [36] involved utilizing a batch size of 16 and a maximum of 60,000 iteration steps. To optimize performance, we set the initial learning rate to 0.02 and applied a reduction factor of 0.1 and 0.01 at the 30,000th and 50,000th iterations, respectively. Additionally, we set the weight decay to 0.0001. For the Cityscapes dataset, we adopted a batch size of 4 and a maximum of 30,000 iteration steps, with the initial learning rate set to 0.004. The learning rate was then reduced to 0.0004 at the 20,000th iteration while retaining the same hyperparameters as in the Mask R-CNN [7] experiments.
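For reference, the two schedules above can be summarized as follows; this is a hypothetical configuration dictionary for illustration, not the authors' actual config format.

```python
# Training schedules from Sect. 4.2 (values taken from the text above).
coco_schedule = dict(
    batch_size=16, max_iters=60_000, base_lr=0.02, weight_decay=1e-4,
    lr_steps={30_000: 0.002, 50_000: 0.0002},  # 0.1x and 0.01x of the base lr
)
cityscapes_schedule = dict(
    batch_size=4, max_iters=30_000, base_lr=0.004,
    lr_steps={20_000: 0.0004},  # reduced to 0.0004 at the 20,000th iteration
)
```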

4.3 Experiments on COCO Dataset

4.3.1 Ablation Study

In this part, we demonstrate the effectiveness of our proposed modules by evaluating them on COCO val2017 using \({\textit{AP}}\), \({\textit{AP}}_{50} \), and \({\textit{AP}}_{75} \); the backbone network is ResNet-50-FPN. The results are presented in Table 2.

Table 2 Ablation study on COCO val2017 datasets of adding components to Mask R-CNN. PVC represents training the boundary branch separately using the PASCAL VOC Context dataset

The symbol “✓” in Table 2 indicates that the corresponding module is added to Mask R-CNN [7], while “–” means the opposite. The first row represents the baseline Mask R-CNN [7] model without our modules. Comparing the first two rows, we observe that adding the BFE module alone does not significantly affect the evaluation metrics, suggesting that the BFE module by itself cannot enhance the model’s performance. The reason is that the information extracted by our boundary branch must be integrated with the information in FPN to take effect; adding the BFE module alone therefore does not influence performance.

Further comparison of rows I and III shows a slight improvement in evaluation metrics when adding both the BFE and Fusion modules to Mask R-CNN. Specifically, \({\textit{AP}}\), \({\textit{AP}}_{50}\), and \({\textit{AP}}_{75} \) increased from 34.5%, 55.8%, and 36.7% to 35.3%, 57.1%, and 37.6%, respectively. However, directly fusing the features extracted by the BFE module with those output by the FPN [26] network has a limited impact on performance because of the high degree of coupling between them. To overcome this issue, we introduced the GAM module into the Fusion module. Row V of Table 2 clearly shows that including the GAM module significantly improves network performance: compared to row III, \({\textit{AP}}\), \({\textit{AP}}_{50} \), and \({\textit{AP}}_{75} \) increase from 35.3%, 57.1%, and 37.6% to 37.6%, 58.7%, and 39.6%, respectively. This indicates that our GAM module, which extracts multi-scale global attention features, fuses more effectively with the FPN output features, resulting in a significant performance boost.

Table 3 Comparison with state-of-the-art methods on COCO test-dev2017

We also analyzed whether training the boundary branch independently on an additional dataset could enhance model performance. Rows III and VI in Table 2 both use the PASCAL VOC Context dataset [42] to train the boundary branch independently. Compared to the results in row II, each metric improves in row III. The most significant improvement occurs in row VI, where we add all our modules to Mask R-CNN and use the PASCAL VOC Context dataset [42] to train the boundary branch separately, achieving the best \({\textit{AP}}\), \({\textit{AP}}_{50}\), and \({\textit{AP}}_{75}\) of 38.7%, 59.8%, and 40.9%, respectively.

4.3.2 Comparative Experiments

To validate the superiority of our proposed method, we used ResNet-50-FPN and ResNet-101-FPN as feature extraction networks. We trained on COCO train2017 and tested our BGF on COCO test-dev2017. We compared our method with other advanced instance segmentation methods, including Mask R-CNN [7], Mask Scoring R-CNN [11], BMask R-CNN [24], PointRend [23], Mask R-CNN with BEM [25], and MaskLab [10]. As shown in Table 3, when the backbone network is ResNet-50-FPN and the boundary branch is trained with the additional PASCAL VOC Context dataset [42], our model achieves optimal performance, with \({\textit{AP}}\), \({\textit{AP}}_{50}\), and \({\textit{AP}}_{75} \) of 40.8%, 61.2%, and 43.4%, respectively.

Fig. 5

Visual comparison of the proposed BGF and baseline methods. GT denotes the groundtruth mask, Boundary denotes the boundary generated by GT, MRCNN denotes the mask predicted by Mask R-CNN

Fig. 6

Visualization of feature maps at different stages. a Original image, b C1 layer, c S2 layer, d Boundary, e F2 layer, f F2 with boundary

Table 4 Comparison with state-of-the-art methods on the Cityscapes val (AP column) and test (remaining columns) sets
Fig. 7

Visualization of the results predicted by different methods on the Cityscapes dataset. a Original image, b Mask R-CNN, c Our BGF

We also visualized the prediction results of our BGF and Mask R-CNN [7]. As shown in Fig. 5, the left side presents the predictions of the two methods. To compare the two methods more intuitively, we binarized the prediction results (shown on the right side of Fig. 5). The yellow dashed boxes show that the boundaries predicted by our BGF are more accurate and align better with the target instances. Through precise boundary segmentation, the model can better distinguish the boundaries between different objects and thus identify them more accurately.

We also visualized the feature maps of different stages in ResNet101. As shown in Fig. 6f, after the fusion of boundary information into the F2 layer of FPN, the output feature map of the F2 layer contains more instance boundary information, thereby improving the accuracy of mask boundary prediction.

4.4 Results on Cityscapes Dataset

In addition, we present instance segmentation results for the 8 categories in the Cityscapes dataset [40]. We used only the fine annotations to train our model, without the 20k coarsely annotated training images. All images were sized at \(2048 \times 1024\). During training, to reduce overfitting, we randomly scaled the shorter edge of each image to a value in [800, 1024].

The Cityscapes dataset [40] contains a large number of occluded instances in the person and car categories, which makes instance segmentation difficult. Compared to Mask R-CNN [7], our BGF achieves significant improvements in these two categories, from 30.5% to 38.5% for person and from 46.9% to 55.7% for car, as detailed in Table 4.

As shown in Fig. 7, when there are overlaps in the person and car categories, our method can obtain more accurate instance boundaries, as evidenced by the red box in Fig. 7. Our method also performs better than Mask R-CNN [7] in cases where the target is small or blurry, as shown in the blue box in Fig. 7.

Fig. 8

Comparison of results predicted by different methods in dense scenes. a Original image, b Mask R-CNN, c Our BGF

Table 5 Real-time setting comparison of speed and accuracy with other methods on COCO test-dev

4.5 Discussions

In this study, we introduced a boundary branch into Mask R-CNN to extract boundary information and to determine its influence on the prediction accuracy and quality of masks. Previous research on improving mask boundary quality mainly focused on optimizing boundaries during prediction [10, 24] or refining them with post-processing techniques [23, 29, 30]. These works have improved mask prediction accuracy to some extent. However, prediction accuracy and mask quality are not necessarily positively correlated: masks with high prediction accuracy may still have rough boundaries. Unlike previous methods, we first extract boundary information and then integrate it into the network for subsequent detection and segmentation. Extensive quantitative and qualitative results demonstrate that our proposed boundary branch not only improves mask prediction accuracy (a 5.1-point increase in mask AP over Mask R-CNN) but also yields high-quality, accurate masks.

Fig. 9

Failure cases. We provide four examples on the COCO dataset to illustrate the model limitations of our method

We summarize the possible reasons as follows. Firstly, the boundary branch effectively improves boundary localization and captures edge details, learning the features of target boundaries and producing masks with better continuity and smoothness while improving segmentation accuracy. Secondly, our proposed GAM module further improves most metrics, especially \({\textit{AP}}_{75} \), indicating that GAM helps generate more accurate instance masks. Notably, without GAM, fusing features through simple element-wise summation yields only minimal improvement; GAM resolves this problem. In addition, if an image contains multiple overlapping instances, the network may fail to locate and segment each object, as shown in Fig. 8b. Our method partially addresses instance segmentation in dense scenes by providing more accurate boundary information, matching instance boundaries with their internal regions more precisely, as shown in Fig. 8c.

Fig. 10

Image with camouflage target

To discuss the limitations of our proposed instance segmentation model, we compared its inference speed with other methods in Table 5. All speeds were measured on an RTX 3090, so some listed speeds may be faster than those originally reported in the corresponding papers. Compared to Mask R-CNN, our BGF improves inference speed by approximately 20 ms. One possible reason is that the RoI pooling operation used in the boundary branch effectively maps regions of interest of different sizes onto a fixed-size feature map, reducing the need for repeated sampling operations on the feature map and thus increasing processing speed. Despite the higher accuracy and quality of our mask predictions, our inference speed still lags the real-time algorithm YOLACT [44] by approximately 60 ms, making real-time applications challenging. Additionally, we present some failure cases in Fig. 9: our method may struggle to segment accurately when strong contours exist within the object, as demonstrated in the first to third columns of Fig. 9 (e.g., the long fur on the bear's hindquarters, the long fur along the cat's boundary, and the region around the zebra's ears). Finally, detecting and segmenting images without obvious targets, such as camouflaged objects, remains challenging for our proposed network. We tested our BGF model on the example images in Fig. 10, but no targets were detected. Although the detection and segmentation of camouflaged objects is beyond the scope of this paper, it still falls under the category of instance segmentation.

In future research, we will explore ways to optimize the boundary branch and integrate it into lightweight networks to further improve the performance and application scope of our models.

5 Conclusion

This article proposes a boundary-guided global attention feature fusion instance segmentation method based on the Mask R-CNN network, which obtains more accurate mask boundaries and higher detection accuracy. Specifically, we introduce a boundary branch and use the Boundary Feature Extraction (BFE) module to extract instance boundary information. In addition, to better perform feature fusion, we propose a Global Attention Module (GAM), which generates multi-scale global attention features and fuses them with the baseline features. Experimental results on two popular datasets show that our proposed method outperforms other instance segmentation methods in mask prediction accuracy. Our future work will further optimize the boundary branch and incorporate it into lightweight networks to improve real-time performance and broaden the application scope.