Introduction

Gesture interaction is an intuitive and natural communication method and can provide simple, intuitive, and concise human–machine interaction [1, 2]. Thus, hand gesture interaction [3, 4] has attracted the attention of many researchers. Generally, gesture interaction methods can be divided into wearable device-based methods [5] and machine vision-based methods [6,7,8]. Because of their convenience, machine vision-based methods have become mainstream.

Machine vision-based gesture interaction usually involves three steps: gesture segmentation, feature extraction, and gesture recognition [9]. Because it removes the background from gesture images, gesture segmentation is the prerequisite of the entire gesture interaction system, and its result directly affects subsequent feature extraction and recognition accuracy. However, in practical applications, it is a challenging task because of the small difference between foreground and background, uneven lighting conditions, and the varied shapes of gestures. To minimize the interference of background noise and obtain a complete hand posture for subsequent gesture recognition, we focus on gesture segmentation.

For gesture segmentation, both machine learning-based and deep learning-based methods have been proposed. In practical applications, conventional machine learning methods cannot fully address the difficulties of gesture segmentation. These methods [10, 11] rely on single, predefined operators to obtain gesture features, such as skin color [12, 13], the histogram of oriented gradients [14], Haar features [15], the scale-invariant feature transform [16], and hand motion [17]. However, gesture images with complex backgrounds and hands of various shapes can hardly be segmented or detected using a single feature [9]. To fuse multiple features of hand gestures, several ensemble methods [18, 19] were proposed for gesture segmentation. However, the weights of the weak classifiers strongly affect the segmentation and detection results, and training efficiency remains a problem for ensemble-based hand gesture detection.

With the development and application of deep learning in many areas, deep learning-based methods have become the mainstream approach to hand gesture segmentation [20, 21]. Several deep networks have been proposed for hand segmentation. For instance, Al-Hammadi et al. [22] utilized multiple deep learning architectures to segment hand regions. Dadashzadeh et al. [9] increased gesture segmentation precision by leveraging residual network structures and dilated spatial pyramid pooling. Wang et al. [23] used MSF and LWMS modules to enhance the network’s multiscale feature extraction capability but ignored the parameter size of the gesture segmentation network.

Hand gesture interaction is typically deployed on edge devices, so lightweight and real-time networks are important for this application. Several methods [24] have been proposed to obtain lightweight networks. Dang et al. [25] proposed gesture recognition methods based on DeeplabV3+ and U-Net, which reduce the overall parameter volume by replacing the backbone network. However, because of their imprecise feature representation of the hand, the overall segmentation accuracy is still low. To improve segmentation accuracy, Dayananda [26] proposed a new hybrid approach based on RGB-D gesture images; however, compared with RGB images, RGB-D data limit the available datasets. Similar to U-Net, Das et al. [27] used an encoder–decoder architecture for real-time pixel-level semantic segmentation. ICNet [28] uses image cascading to accelerate inference, while DFANet [29] builds a lightweight backbone based on depth-wise separable convolution. To reduce the computational cost of multi-scale, high-resolution processing, ESPNet [30, 31] was proposed based on spatial pyramid pooling modules. Although these methods can run in real time, they adopt only a single feature processing approach; because low-level details are neglected, their accuracy is severely reduced.

To fuse multiple features for hand gesture segmentation, several dual-branch models were proposed. BiSeNet [32] uses a dual-branch network for detail analysis and context analysis. To strengthen the features from the context and detail branches, DDRNet [33] adds bridges between the two branches of its dual-branch structure. However, these dual-branch methods ignore the diversity of features and the gap between the context branch and the detail branch, so many details are lost at lower resolutions, leading to inaccurate gesture segmentation results.

Gesture segmentation is a dense pixel-level prediction task that relies heavily on the local details and texture of gestures. However, traditional encoder–decoder networks and dual-branch networks ignore the connections among gesture details, shape, and context information. To refine gesture features, we propose a new structure that divides the extracted gesture features into three types: boundary features, local features, and semantic features. By categorizing gesture features into different types, the proposed network obtains more comprehensive and diverse features and can flexibly adjust and exploit them to improve the accuracy and robustness of gesture prediction. In addition, dividing the features reduces the difficulty of feature extraction. Based on this structure, we propose BLSNet for gesture segmentation, which includes three branches: a local feature branch, a semantic feature branch, and a boundary branch. The local feature branch has wide channels and shallow levels and is mainly used for extracting detailed gesture features. The semantic feature branch has narrow channels and a deep hierarchical structure, learning high-level semantic context through a considerable degree of downsampling. The boundary branch focuses on extracting boundary information of gestures. To construct a lightweight architecture, the Ghost bottleneck is used as the backbone of our method.

The main contributions are summarized as follows.

1. A multibranch segmentation structure is proposed, showing that accurate segmentation can be achieved by extracting multiple categories of features with deep neural networks.

2. Based on this structure, a tri-branch lightweight network named BLSNet is proposed for gesture segmentation. BLSNet contains a boundary branch, a local feature branch, and a semantic branch to extract three different types of gesture features.

3. To refine gesture features, we propose the BW and MDSC modules for gesture boundary and texture features and use the ASPP module for semantic features. To fuse the three branches, the Bag module is employed to promote network optimization.

The article is organized as follows. “BLSNet” section provides a detailed introduction to the proposed BLSNet. To demonstrate the effectiveness of the proposed BLSNet, corresponding experiments are conducted, and the results are presented in “Experiments” section. The conclusion is drawn in “Conclusion” section.

BLSNet

In this section, we propose a lightweight network, named BLSNet, based on three branches for hand segmentation. The three branches are designed to obtain three different types of gesture features.

Overview

Let \(x \in X\), where x is a given pixel in a hand gesture image and X is the set of pixels. The predicted label of x is denoted as \({l}'_x\), with \({l}'_x \in \left\{ 0,1\right\} \); \({l}'_x = 1\) indicates that x belongs to the foreground of the image, while \({l}'_x = 0\) indicates that x belongs to the background. The probability structure for hand segmentation can be written as Eq. 1.

$$\begin{aligned} P\left( {l}'_x \mid x \right) = \sum _{i=1}^{n} P\left( {l}'_x \mid x,\theta _i \right) P(\theta _i ), \end{aligned}$$
(1)

where \(\theta _i\) is the parameter of the i-th branch in the segmentation model and n is the number of branches. A key condition for Eq. 1 is that the features obtained with the parameters \(\theta _i\) are independent. However, it is difficult to determine these parameters. To obtain the predicted label of x, we therefore introduce slack parameters \(\widetilde{\theta _i}\) that approximate the branch parameters \(\theta _i\), yielding the prediction structure \(\sum _{i=1}^{m} P({l}'' _x \mid x,\widetilde{\theta _i} )P(\widetilde{\theta _i})\), where \({l}'' _x\) is the label of x predicted with the parameters \(\widetilde{\theta _i}\). Because of the independence of the features, we have Eq. 2.

$$\begin{aligned} P\left( {l}'_x \mid x \right) \le \sum _{i=1}^{m} P\left( {l}'' _x \mid x,\widetilde{\theta _i} \right) P\left( \widetilde{\theta _i}\right) \end{aligned}$$
(2)

The two probability structures \(\sum _{i=1}^{n} P({l}'_x \mid x,\theta _i )P(\theta _i )\) and \(\sum _{i=1}^{m} P({l}'' _x \mid x,\widetilde{\theta _i} )P(\widetilde{\theta _i})\) can yield the same label. Since deep neural networks can attain global minima [34], the label obtained with the slack parameters \(\widetilde{\theta _i}\) approximates the true label of the predicted pixel, as shown in Eq. 3.

$$\begin{aligned} l = \underset{{l}' _x}{\textrm{argmax}}\, P\left( {l}'_x \mid x\right) \approx \underset{{l}'' _x}{\textrm{argmax}}\sum _{i=1}^{m} P\left( {l}'' _x \mid x,\widetilde{\theta _i} \right) P\left( \widetilde{\theta _i}\right) , \end{aligned}$$
(3)

where l is the label predicted using our structure. Because the label set is binary and the networks are designed accordingly, the final structure is given in Eq. 4.

$$\begin{aligned} l = \underset{{l}'' _x}{\textrm{argmax}}\sum _{i=1}^{m} P\left( {l}'' _x \mid x,\widetilde{\theta _i} \right) \end{aligned}$$
(4)

Based on this structure, and to obtain independent features for gesture segmentation, we propose a tri-branch network composed of a local feature branch, a boundary branch, and a semantic branch, named BLSNet. An overview of the proposed method is shown in Fig. 1. Specifically, after simple feature extraction through a convolutional layer and two Ghost bottlenecks [35], the resolution of the input image is reduced to 1/4 of its original size. The resulting feature map is then fed into the three branches, which perform different degrees of downsampling. In the three branches, the different cube shapes represent the height, width, and number of channels of the feature maps, and the number in the top right corner of each cube indicates the size of the current feature map relative to the original input image. The cube colors distinguish the three branches. Finally, the outputs of the three branches are fused using the Bag module. In brief, the Bag (boundary-attention-guided fusion) module uses the boundary feature maps to obtain a gesture weight score map and then fuses the local and semantic feature maps according to this score. The details are explained in “Feature fusion” section. To construct a lightweight network, we use the Ghost bottleneck [35] as the building block, which greatly reduces the computation and parameters compared with other mainstream backbones.
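For concreteness, the listing below sketches this data flow in PyTorch (the framework used in our experiments). The class and function names, channel widths, strides, and the plain convolution blocks standing in for the Ghost bottlenecks, branch bodies, and Bag module are illustrative assumptions, not the exact BLSNet configuration; the gating at the end follows Eq. 9 introduced in “Feature fusion” section.

```python
# A minimal, hypothetical sketch of the BLSNet data flow: stem -> three branches -> boundary-guided fusion.
import torch
import torch.nn as nn


def conv_bn_relu(cin, cout, stride=1):
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, stride, 1, bias=False),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )


class BLSNetSkeleton(nn.Module):
    def __init__(self, num_classes=2, c=32):
        super().__init__()
        # Stem: stand-in for one convolution plus two Ghost bottlenecks -> 1/4 resolution
        self.stem = nn.Sequential(conv_bn_relu(3, c, stride=2), conv_bn_relu(c, c, stride=2))
        self.local_branch = conv_bn_relu(c, c)                 # high resolution, shallow
        self.boundary_branch = conv_bn_relu(c, c, stride=2)    # 1/8 resolution (MDSC would sit here)
        self.semantic_branch = nn.Sequential(                  # heavy downsampling (ASPP would follow)
            conv_bn_relu(c, 2 * c, stride=2), conv_bn_relu(2 * c, 2 * c, stride=2))
        self.sem_proj = nn.Conv2d(2 * c, c, 1)
        self.head = nn.Conv2d(c, num_classes, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        s = self.stem(x)                                       # 1/4 resolution
        d = self.local_branch(s)                               # local/detail features
        b = self.boundary_branch(s)                            # boundary features
        sem = self.sem_proj(self.semantic_branch(s))           # semantic features
        b = nn.functional.interpolate(b, size=d.shape[-2:], mode="bilinear", align_corners=False)
        sem = nn.functional.interpolate(sem, size=d.shape[-2:], mode="bilinear", align_corners=False)
        sigma = torch.sigmoid(b)                               # boundary attention (cf. Eq. 9)
        out = self.head(sigma * d + (1 - sigma) * sem)
        return nn.functional.interpolate(out, size=(h, w), mode="bilinear", align_corners=False)


if __name__ == "__main__":
    print(BLSNetSkeleton()(torch.randn(1, 3, 320, 320)).shape)  # torch.Size([1, 2, 320, 320])
```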

Fig. 1

An overview of the proposed network (BLSNet)

The boundary branch

We set up a boundary branch and use it as the main branch to coordinate the feature extraction work of the local feature branch and the semantic branch.

In addition, gestures come in various shapes, sizes, and directions and are easily affected by cluttered backgrounds. A multi-scale feature extraction method is therefore better suited to the characteristics of the gesture segmentation task. Accordingly, we design a Multi-scale Depth-wise Strip Convolution (MDSC) module, as shown in Fig. 2. The module has five depth-wise strip convolution branches and one \(1\times 1\) convolution branch, and the outputs of the six branches are combined to obtain the output feature map of the module. The module can be represented as Eq. 5.

$$\begin{aligned} Output_{MDSC} = Conv_{1\times 1}(F) + \sum _{i=0}^{4} Scale_i\left( DW\_Conv(F)\right) , \end{aligned}$$
(5)

where F represents the input feature map, \(Conv_{1\times 1}\) represents the \(1 \times 1\) convolution, \(DW\_Conv\) represents depth-wise convolution, and \(Scale_i\) with \(i\in \left\{ 0,1,2,3,4\right\} \) represents the i-th strip-convolution branch in Fig. 2. Each branch contains two depth-wise strip convolutions with kernel sizes of \(1 \times n\) and \(n \times 1\), respectively, which are used to approximate a standard 2D convolution with a kernel size of \(n \times n\). Here, n takes the values 3, 7, 11, 15, and 21.

Fig. 2

Multi-scale depth-wise strip convolution (MDSC) Module

The pseudocode for MDSC is shown below.

Algorithm 1

The MDSC module algorithm
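Since the algorithm listing is given as a figure, the following PyTorch sketch illustrates one possible implementation of the MDSC module consistent with Eq. 5 and Fig. 2. Two points are assumptions of this sketch rather than statements of the original design: the six branch outputs are aggregated by summation (following Eq. 5), and the depth-wise convolution \(DW\_Conv\) is shared across the strip branches.

```python
import torch
import torch.nn as nn


class MDSC(nn.Module):
    """Multi-scale Depth-wise Strip Convolution (sketch following Eq. 5 / Fig. 2).

    One 1x1 convolution branch plus five depth-wise strip branches with
    n in {3, 7, 11, 15, 21}; each strip branch applies a 1xn followed by an
    nx1 depth-wise convolution to approximate an nxn kernel. Aggregation by
    summation and the shared depth-wise convolution are assumptions.
    """

    def __init__(self, channels, kernel_sizes=(3, 7, 11, 15, 21)):
        super().__init__()
        self.point = nn.Conv2d(channels, channels, kernel_size=1)
        # Shared depth-wise convolution applied before the strip branches (DW_Conv in Eq. 5)
        self.dw = nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=channels)
        self.scales = nn.ModuleList()
        for n in kernel_sizes:
            self.scales.append(nn.Sequential(
                nn.Conv2d(channels, channels, (1, n), padding=(0, n // 2), groups=channels),
                nn.Conv2d(channels, channels, (n, 1), padding=(n // 2, 0), groups=channels),
            ))

    def forward(self, x):
        dw = self.dw(x)
        out = self.point(x)
        for branch in self.scales:
            out = out + branch(dw)
        return out


# Shape check: the module preserves spatial size and channel count.
if __name__ == "__main__":
    print(MDSC(64)(torch.randn(1, 64, 40, 40)).shape)  # torch.Size([1, 64, 40, 40])
```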

The reasons for choosing depth-wise strip convolution to extract gesture boundary features are as follows. First, depth-wise strip convolution is a lightweight convolution method: compared with the model without the MDSC module, the GFLOPs of the network increase by only 0.26, as discussed in the ablation experiment section. Second, the gesture boundary can be a linear, nonlinear, or complex curve; strip convolution can accurately locate the gesture boundary and extract such band-shaped features more efficiently. Finally, by changing the size of the strip convolution kernel, the network can adapt to different boundary shapes and scales. Since strip convolution operates in only one direction, it can capture longer-range information than standard 2D convolution at the same computational cost, which is crucial for modeling long-distance dependencies and improves the robustness and accuracy of boundary feature extraction.

The local feature branch

The local feature branch is designed to maintain the feature map resolution and reduce the loss of fine features, as shown in Fig. 3. Although the feature maps of this branch have a higher resolution, they have fewer channels, which helps keep the model lightweight. In addition, this branch ensures the sensitivity of the gesture segmentation network to subtle features and provides better detail discrimination. By working in synergy with the semantic and boundary branches, more accurate and detailed gesture segmentation can be achieved.

Fig. 3

The illustration of the proposed local branch

The semantic branch

To extract contextual semantic information from gestures, we adopt a continuous downsampling strategy in the semantic branch and use the ASPP [36] context module to rapidly expand the model’s receptive field in the final stage and extract high-level semantic features of the gestures. As shown in Fig. 4, the semantic branch is spatially narrow and deep, and its depth enables it to store sufficient contextual information. Although the number of channels in this branch is relatively large, the low resolution limits parameter growth, in line with our goal of decomposing the unified segmentation task and ultimately designing a lightweight network.
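For reference, the listing below sketches a generic ASPP-style context module of the kind cited above [36]: parallel dilated convolutions plus image-level pooling, concatenated and projected. The class name, dilation rates, and channel projection are assumptions for illustration, not the exact configuration used in BLSNet.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ASPPSketch(nn.Module):
    """Generic ASPP-style context module; dilation rates (6, 12, 18) are assumed."""

    def __init__(self, cin, cout, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([nn.Conv2d(cin, cout, 1, bias=False)])
        for r in rates:
            self.branches.append(nn.Conv2d(cin, cout, 3, padding=r, dilation=r, bias=False))
        self.image_pool = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(cin, cout, 1, bias=False))
        self.project = nn.Conv2d(cout * (len(rates) + 2), cout, 1, bias=False)

    def forward(self, x):
        feats = [b(x) for b in self.branches]
        pooled = F.interpolate(self.image_pool(x), size=x.shape[-2:], mode="bilinear", align_corners=False)
        feats.append(pooled)                       # image-level context
        return self.project(torch.cat(feats, dim=1))


if __name__ == "__main__":
    print(ASPPSketch(64, 64)(torch.randn(1, 64, 20, 20)).shape)  # torch.Size([1, 64, 20, 20])
```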

Fig. 4

The illustration of the proposed semantic branch

Feature fusion

This section primarily introduces the fusion between the boundary branch and the local branch, the fusion between the boundary branch and the semantic branch, as well as the final fusion process of the three branches.

The fusion between the boundary branch and the local branch

Since the local feature branch and boundary branch have different feature learning tasks under different supervision, certain information differences and complementarities exist between them. The local feature branch focuses more on the local detail features of the feature map, such as texture and shape; the boundary branch focuses more on the decision boundary between the gesture boundary and background. The local feature branch can selectively learn gesture boundary features, thereby optimizing the focus on each local detail to achieve better segmentation results.

Fig. 5

Boundary weight (BW) module

Therefore, we propose a new boundary weight (BW) module, as shown in Fig. 5. First, the boundary-branch features are passed through a \(1 \times 1\) convolution and batch normalization. Then, a sigmoid function is applied to obtain the attention map. The vector at pixel position (x, y) in the feature map provided by the boundary branch is defined as \(\buildrel {\rightharpoonup } \over v_b(x,y)\), where \(0\le x < H\), \(0\le y < W\), and H and W are the height and width of the boundary feature maps, respectively. The output of the sigmoid function can be represented as Eq. 6.

$$\begin{aligned} \sigma =Sigmoid\left( f\left( \buildrel {\rightharpoonup } \over v_b\right) \right) \end{aligned}$$
(6)
Fig. 6

Boundary-attention-guided fusion (Bag) Module

We multiply the vector \(\buildrel {\rightharpoonup } \over v_d\) at pixel position (x, y) in the local feature map by the value \(\sigma \) at the corresponding position of the boundary attention map produced by the sigmoid function. This makes the local feature map focus more on gesture boundaries. Then, we add the boundary-attended local feature map to the original local feature map to obtain the output of the BW module. This addition prevents excessive boundary weights from damaging the details in the original feature map. The process can be written as Eq. 7:

$$\begin{aligned} Output_{BW}=\sigma \buildrel { \rightharpoonup } \over v_d + \buildrel {\rightharpoonup } \over v_d. \end{aligned}$$
(7)
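A minimal PyTorch sketch of the BW module, following Eqs. 6 and 7, is given below. Mapping the boundary channels to the local-branch channel count with the \(1 \times 1\) convolution is an assumption of the sketch, since the exact channel configuration is not restated here.

```python
import torch
import torch.nn as nn


class BoundaryWeight(nn.Module):
    """Boundary Weight (BW) module sketch: a 1x1 convolution and batch normalization
    on the boundary features produce a sigmoid attention map (Eq. 6), which re-weights
    the local features; the original local features are added back (Eq. 7)."""

    def __init__(self, boundary_channels, local_channels):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(boundary_channels, local_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(local_channels),
        )

    def forward(self, v_b, v_d):
        sigma = torch.sigmoid(self.f(v_b))   # Eq. 6
        return sigma * v_d + v_d             # Eq. 7
```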

The fusion between the boundary branch and semantic branch

We adopt a direct downsampling-and-addition method to merge the boundary features into the semantic features, thereby enhancing the semantic branch’s ability to distinguish gestures from the background.
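A minimal sketch of this step, assuming the two feature maps already share the same channel count (otherwise a \(1 \times 1\) convolution would be needed), is shown below.

```python
import torch.nn.functional as F


def fuse_boundary_into_semantic(v_b, v_s):
    """Downsample the boundary features to the semantic branch's resolution and add them.
    Channel counts are assumed to match already."""
    return v_s + F.interpolate(v_b, size=v_s.shape[-2:], mode="bilinear", align_corners=False)
```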

The final fusion process of the three branches

Because the local feature branch, boundary branch, and semantic branch represent gesture features differently, directly fusing their outputs would produce inconsistent features. To balance the weights of the three branches, as shown in Fig. 6, a boundary-attention-guided fusion (Bag) module [37] is utilized to coordinate the fusion of the feature maps of the three branches.

Specifically, the outputs of the local feature branch, boundary branch, and semantic branch are defined as \(\buildrel {\rightharpoonup } \over v_d\), \(\buildrel {\rightharpoonup } \over v_b\), and \(\buildrel {\rightharpoonup } \over v_s\), respectively. The output of the sigmoid and Bag can be expressed as Eqs. 8 and 9.

$$\begin{aligned} \sigma = Sigmoid\left( \buildrel {\rightharpoonup } \over v_b\right) , \end{aligned}$$
(8)
$$\begin{aligned} Output_{Bag} = f_{Bag}\left( \sigma \buildrel {\rightharpoonup } \over v_d + (1-\sigma )\buildrel {\rightharpoonup } \over v_s\right) \end{aligned}$$
(9)

As shown in Fig. 6, when \(\sigma \) is greater than 0.5, the fusion relies more on the detailed features from the local and boundary branches; otherwise, priority is given to the semantic features with boundary information.
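A minimal PyTorch sketch of the Bag module following Eqs. 8 and 9 is given below. Modeling \(f_{Bag}\) as a \(3 \times 3\) convolution with batch normalization, and assuming the three branch outputs have already been brought to a common resolution and channel count, are assumptions of this sketch.

```python
import torch
import torch.nn as nn


class Bag(nn.Module):
    """Boundary-attention-guided fusion (Bag) sketch: boundary features gate between
    local (detail) and semantic features (Eqs. 8-9); f_Bag is assumed to be Conv3x3 + BN."""

    def __init__(self, channels):
        super().__init__()
        self.f_bag = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, v_d, v_b, v_s):
        sigma = torch.sigmoid(v_b)                           # Eq. 8
        return self.f_bag(sigma * v_d + (1 - sigma) * v_s)   # Eq. 9
```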

In addition, multiple auxiliary loss functions are employed to help optimize the network model, as shown in Fig. 1. The S-Head (semantic head) and B-Head (boundary head) are placed before the loss functions, and their specific construction is shown in Fig. 7. Both consist of convolutional layers, batch normalization layers, activation functions, and an upsampling layer. This structure transforms feature maps with many channels into feature maps with the specified number of channels, adjusts their resolution, and compares the transformed feature maps with the labels for loss calculation.

Fig. 7

The construction of the S/B-Head

The total loss function of the network is the sum of each loss function multiplied by their respective coefficients, and the calculation process is shown in Eq. 10.

$$\begin{aligned} F_{Loss}=\lambda _0l_0+\lambda _1l_1+\lambda _2l_2+\lambda _3l_3, \end{aligned}$$
(10)

where the specific positions of \(l_0\), \(l_1\), \(l_2\) and \(l_3\) are shown in Fig. 1. Specifically, \(l_0\), \(l_2\) and \(l_3\) all use the cross-entropy loss function, which is widely applied in semantic segmentation tasks. The specific definition can be seen in Eq. 11.

$$\begin{aligned} l_j=-\sum _{i}^{H \times W} G^i \log \left( P^i_j \right) , \end{aligned}$$
(11)

where \(j \in \left\{ 0,2,3 \right\} \), \(G^i\) denotes the i-th pixel value of the ground truth, and \(P^i_j\) denotes the i-th pixel value of the predicted output associated with the loss function \(l_j\).

For edge supervision, a boundary head (B-Head) is applied after the MDSC module, and the Dice loss [38] function \(l_1\) is used to optimize its weights. Compared with conventional loss functions such as the cross-entropy loss, the Dice loss can handle the imbalance between positive and negative samples. The calculation of \(l_1\) is given by Eq. 12.

$$\begin{aligned} l_1=1-\frac{2\cdot \sum _{i}^{H \times W} \left( P^i_1 \cdot G^i_1\right) + \epsilon }{\sum _{i}^{H \times W} P^i_1 + \sum _{i}^{H \times W} G^i_1 + \epsilon } \end{aligned}$$
(12)

where \(P^i_1\) denotes the i-th pixel value on the predicted edge output images, \(G^i_1\) denotes the i-th pixel value on the corresponding edge target images and \(\epsilon \) is set as 1.

The parameters in \(F_{Loss}\) in this paper are empirically [33, 39] set to \(\lambda _0 = 1\), \(\lambda _1 = 0.5\), \(\lambda _2= 0.4\) and \(\lambda _3 = 0.4\). In Fig. 1, the dashed lines and related blocks are ignored during inference.
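For illustration, the sketch below combines the losses as in Eqs. 10–12 with the coefficients above. The function names are hypothetical, and exactly where the auxiliary heads tap the network follows Fig. 1 and is only summarized here; the boundary prediction is assumed to be a probability map.

```python
import torch.nn.functional as F


def dice_loss(pred_edge, target_edge, eps=1.0):
    """Dice loss for the boundary head (Eq. 12); pred_edge holds probabilities in [0, 1]."""
    inter = (pred_edge * target_edge).sum(dim=(-2, -1))
    denom = pred_edge.sum(dim=(-2, -1)) + target_edge.sum(dim=(-2, -1))
    return (1 - (2 * inter + eps) / (denom + eps)).mean()


def total_loss(logits_main, logits_aux2, logits_aux3, edge_prob, target, edge_target,
               lambdas=(1.0, 0.5, 0.4, 0.4)):
    """Weighted sum of Eq. 10 with the coefficients reported above."""
    l0 = F.cross_entropy(logits_main, target)   # main segmentation output
    l1 = dice_loss(edge_prob, edge_target)      # boundary supervision (Eq. 12)
    l2 = F.cross_entropy(logits_aux2, target)   # auxiliary segmentation head
    l3 = F.cross_entropy(logits_aux3, target)   # auxiliary segmentation head
    return lambdas[0] * l0 + lambdas[1] * l1 + lambdas[2] * l2 + lambdas[3] * l3
```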

Experiments

To test the BLSNet proposed in this paper, we compare it with several segmentation methods on the two datasets of OUHANDS and HGR1 with the criteria of PixAcc, mIOU, GFLOPs, and the model parameters.

Datasets, computation platform and evaluation criteria

The OUHANDS dataset [40] contains 10 different hand gestures from 23 subjects, of which 2000 were selected for the training set and the remaining 1000 for the test set. The photos in this dataset exhibit complex background and lighting variations, the shapes and sizes of the subjects’ hands vary, and their skin color varies as well. Additionally, the images exhibit varying degrees of occlusion between hands and faces.

The HGR1 dataset [41] contains a total of 899 RGB images of 25 different hand gestures performed by 12 subjects. Among them, we selected 630 images as the training set and the remaining 269 images as the test set. The hand gesture images in this dataset are greatly influenced by background variations, but there is no occlusion between the hands and faces.

The model was trained and tested on a platform composed of a single NVIDIA RTX 3080, PyTorch 1.11, CUDA 11.3, cuDNN 8.0, and Anaconda. For training on the OUHANDS dataset, the batch size was set to 4 and the input image resolution was fixed at 320 × 320. We use the Adam optimizer for weight optimization, initialize the learning rate to 0.005, and set the weight decay to 0.0001. The total number of training epochs is 300. For the HGR1 dataset, the batch size was set to 16, the learning rate was initialized to 0.001, the weight decay was set to 0.0001, and all other settings were kept the same as for the OUHANDS dataset.
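For reference, a minimal sketch of this training setup is shown below. Only the hyperparameters mirror the OUHANDS values reported above; the model and data are stand-ins rather than the actual BLSNet and dataset loader.

```python
import torch
import torch.nn as nn

# Stand-in model and data; hyperparameters follow the reported OUHANDS setup
# (Adam, lr 0.005, weight decay 1e-4, batch size 4, 320x320 inputs, 300 epochs).
model = nn.Conv2d(3, 2, kernel_size=1)                  # stand-in for BLSNet
optimizer = torch.optim.Adam(model.parameters(), lr=0.005, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()

for epoch in range(300):
    images = torch.randn(4, 3, 320, 320)                # stand-in batch
    targets = torch.randint(0, 2, (4, 320, 320))        # stand-in pixel labels
    optimizer.zero_grad()
    loss = criterion(model(images), targets)
    loss.backward()
    optimizer.step()
```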

Mean intersection over union (mIOU) and pixel accuracy (PixAcc) are used as evaluation metrics for BLSNet. PixAcc measures pixel classification accuracy as the ratio of correctly classified pixels to the total number of pixels. mIOU is the mean of the per-class IOU (intersection over union) values. This metric provides a comprehensive evaluation of a segmentation algorithm’s performance across categories and reflects its overall effectiveness. mIOU is defined by Eq. 13.

$$\begin{aligned} mIOU=\frac{1}{C}\sum _{i=1}^{C}\frac{TP_i}{TP_i+FP_i+FN_i}, \end{aligned}$$
(13)

where \(TP_i\), \(FP_i\) and \(FN_i\) represent the number of pixels predicted as class i and correctly classified, the number of pixels predicted as class i but misclassified, and the number of pixels that belong to class i but are incorrectly classified as other categories, respectively. C is the number of categories.

Overall, PixAcc is a simple and intuitive evaluation metric, while mIOU focuses more on the matching degree between predicted results and ground truth. If the model has made progress in improving the results of target boundary segmentation, the corresponding mIOU will increase accordingly, while pixel accuracy may only show a slight increase. Thus, mIOU better reflects segmentation performance.
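For clarity, the following sketch computes PixAcc and mIOU (Eq. 13) from integer label maps; the function name is illustrative.

```python
import numpy as np


def pixacc_miou(pred, target, num_classes=2):
    """PixAcc and mIOU (Eq. 13) from integer label maps of the same shape."""
    pred, target = pred.reshape(-1), target.reshape(-1)
    pixacc = (pred == target).mean()
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (target == c))
        fp = np.sum((pred == c) & (target != c))
        fn = np.sum((pred != c) & (target == c))
        ious.append(tp / (tp + fp + fn + 1e-10))
    return float(pixacc), float(np.mean(ious))


# Tiny worked example: 3 of 4 pixels correct -> PixAcc 0.75; class IOUs 0.5 and 0.667 -> mIOU ~0.583.
pred = np.array([[0, 1], [1, 1]])
gt = np.array([[0, 1], [0, 1]])
print(pixacc_miou(pred, gt))
```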

GFLOPs are commonly used to evaluate the computational complexity of convolutional neural networks. Parameters refer to the number of parameters in a neural network model, including weights and biases.
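As an illustration, parameters can be counted directly from the model, and GFLOPs can be estimated with a FLOP-counting library such as fvcore; the choice of fvcore here is an assumption (the counts reported in this paper may come from a different tool), and the model below is a stand-in.

```python
import torch
import torch.nn as nn
from fvcore.nn import FlopCountAnalysis  # assumed tool choice; counts are approximate

# Stand-in model; in practice this would be the segmentation network under test.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 2, 1))
params_m = sum(p.numel() for p in model.parameters()) / 1e6
flops_g = FlopCountAnalysis(model, torch.randn(1, 3, 320, 320)).total() / 1e9
print(f"Params: {params_m:.2f} M, FLOPs: {flops_g:.2f} G")
```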

Ablation experiments

A series of ablation experiments were conducted on the OUHANDS dataset, including the loss function, BW module, MDSC module, and Bag module.

Effectiveness of extra losses

To investigate the impact of additional training supervision on network performance, we conduct ablation experiments by combining \(l_1\), \(l_2\), and \(l_3\). The results are shown in Table 1. We find that without adding extra semantic supervision, the model’s mIOU is only 86.28%. After adding each loss function separately, the model accuracy improved, with the most significant improvement (+ 2.53% mIOU) observed when adding \(l_1\). This provides strong evidence for the importance of boundary loss functions and boundary branches. In the case of adding only two loss functions, the model performance further improves. It is worth noting that the experiments of adding \(l_1\), \(l_3\) and adding \(l_2\), \(l_3\) yielded the same PixAcc but different mIOU. This is normal because their calculation processes are different. mIOU measures the average overlap between predicted segmentation results and true labels, while PixAcc measures the model’s ability to correctly classify pixels. mIOU is more accurate in evaluating image segmentation performance compared to PixAcc because it considers interclass relationships and spatially overlapping areas. When all three auxiliary loss functions are added, the model achieves its highest mIOU at 89.42%.

Table 1 Ablation study of extra losses

The effectiveness of BW, MDSC, and Bag, and the threefold cross-validation

The BW, MDSC, and Bag modules were combined in different ways to explore the impact of each module on the overall network performance. As shown in Table 2, we find that when none of the three modules participate in training, the overall model accuracy is the lowest (97.16%), and mIOU is only 87.61%. After adding the MDSC or Bag module, the model’s mIOU increased by 0.59% or 0.75%, respectively. When both MDSC and Bag modules are added, the model’s mIOU reaches 89.16%, a relative improvement of 1.55% compared to the lowest accuracy. When only the BW module is included in the model, the mIOU improves by 0.28%, but the effect of the BW module is not as significant as that of the MDSC and Bag modules. The combination of MDSC and Bag modules with the BW module improves network performance, and the overall performance is highest when all three are added to the network, with PixAcc at 97.65% and mIOU at 89.42%.

Table 2 BW, MDSC and Bag’s ablation study and threefold cross-validation

From the results, our network model already achieves a high PixAcc without the BW, MDSC, and Bag modules. However, for some samples, the lack of fine-grained feature processing by these three modules leads to inaccurate segmentation or larger artifacts at gesture edges, as shown in samples 1 and 2 in Fig. 8. Such errors barely affect PixAcc even though the results are visibly poorer. Therefore, on the overall dataset, incorporating these three modules does not bring a large improvement in PixAcc, but it has a more noticeable effect on the visual results and mIOU.

Fig. 8

Comparison of visual results between BLSNet and BLSNet without added modules

In the threefold cross-validation, we randomly and uniformly partitioned the dataset. The model performed well, with an average PixAcc of 98.11% and an average mIOU of 91.24%.

In addition, we evaluated the GFLOPs and parameters of the model after adding each module. The addition of the BW module, MDSC module, and Bag module increased the GFLOPs of the model by 0.11, 0.26, and 0.95, respectively, while the parameters increased by 0.03 M, 0.65 M, and 0.15 M, respectively. Although the MDSC module adds more parameters, it only slightly increases the computational complexity of the model. Moreover, it significantly improves the mIOU of the model (+ 0.75). This shows that, with the help of the MDSC module, the gesture features extracted by the boundary branch are more effective while the lightweight requirement is still met. When the three modules are added to the model in pairs, the GFLOPs and parameters increase accordingly. However, when all three modules are added, the model achieves optimal performance with only a slight increase of 1.32 GFLOPs and 0.83 M parameters compared with not adding any modules at all.

We visualize the features of the MDSC module and Bag module. Figure 9 shows the feature visualization of the MDSC module. The top row from left to right shows the original input image and the visualization results of three random channels in the output feature map of the MDSC module. The bottom row from left to right shows the ground truth and the random three-channel output results of branches without the MDSC module. Obviously, after adding the MDSC module, the boundary features of the hand are more prominent, and the contour is more complete. The output of the boundary branch without this module does not have a clear division in the gesture boundary area.

Fig. 9

Feature visualization of the MDSC module

Table 3 Performances of different approaches on the OUHANDS dataset

Figure 10 shows the feature visualization of the Bag module, with the same layout as Fig. 9. The rightmost three images in the top row are the Bag module output, and the rightmost three images in the bottom row are the output feature maps of some branches without the Bag module. When not using the Bag module, we directly add the three branches together. From Fig. 10, it can be seen that under the guidance of the Bag module, the network model can clearly determine the hand position and gesture boundary, while the result of directly adding the feature maps of the three branches together does not have smooth and complete gesture edges, and the overall judgment of the foreground and background is also fuzzy and confused. Obviously, boundary features in the Bag module can guide the integration of local detail features and semantic features correctly.

Fig. 10

Feature visualization of the Bag module

Comparison with other methods

To evaluate our model, we compare BLSNet with HGRNet, the DDRNet series, SegFormer, and the PP-LiteSeg series in terms of PixAcc, mIOU, GFLOPs, and parameters.

Table 3 shows the statistical results of several segmentation methods on the OUHANDS data with the criteria of PixAcc, mIOU, GFLOPs, and parameters. From Table 3, we can see that our model achieves the highest accuracy among the compared segmentation models. Specifically, although our model’s parameter size of 3.66 M is second smallest after HGRNet’s 0.28 M, its mIOU is 12.21% higher than that of HGRNet. DDRNet-23-slim is the lightest and fastest model in the DDRNet series; despite having the lowest GFLOPs at 1.85, its mIOU is only 80.14%. Meanwhile, DDRNet-39, with a parameter size of 32.65 M, reaches an mIOU of only 80.02%. SegFormer-B0 is a lightweight model in the SegFormer series that uses self-attention to coordinate contextual information; its parameter size is 3.71 M, but its mIOU is only 80.77%, much lower than that of our model. PP-LiteSeg is frequently used as a benchmark for lightweight semantic segmentation, yet its highest mIOU reaches only 86.23%, 3.19% lower than our model, and the difference in parameter size is significant. Thus, our model achieves a good balance between computational complexity and accuracy with a low number of parameters, indicating that our method can meet edge-device interaction requirements, such as human–robot interaction.

Figure 11 presents several segmentation results of the comparison methods on typical images. We can see that our model still accurately locates the hand even under significant changes in background lighting, while the other models fail. In Sample 2, the hand region is clearly overexposed, and the other models mistakenly classify background regions as hands, but our model can still accurately segment the gesture. We attribute this largely to the accurate extraction of boundary information and its guidance of the local feature and semantic branches: contextual information fills the object regions inside the boundaries, and detailed features complete the gesture edges.

Fig. 11

An illustration of the segmentation performance of different methods on the OUHANDS dataset

In this section, HGR1 is used to test the proposed method. Table 4 shows the segmentation results on the HGR1 dataset. BLSNet achieves the highest accuracy with an mIOU of 96.83%, which is 0.81% higher than that of PP-LiteSeg-B, the second-most accurate of the compared methods. Furthermore, despite having a higher computational complexity of 4.73 GFLOPs than methods such as DDRNet-23-slim, SegFormer-B0, and HGRNet-seg, our method far surpasses them in accuracy. The parameter count is an indicator of model size, and our model has 3.66 M parameters, second only to HGRNet-seg’s 0.28 M among the compared methods. The segmentation results for several typical images in HGR1 are presented in Fig. 12, from which we can see that BLSNet performs significantly better than the other methods in restoring finger details, recovering the overall contours, and removing the background.

Table 4 Performances of different approaches on the HGR1 dataset
Fig. 12

An illustration of the segmentation performance of different methods on the HGR1 dataset

Conclusion

Gesture segmentation in cluttered backgrounds poses a significant challenge: traditional encoder–decoder networks are susceptible to information loss through repeated downsampling, and dual-branch architectures are inadequate at fusing the detailed and contextual information of gestures. Thus, based on a Bayesian framework, we propose BLSNet for hand gesture segmentation, comprising three branches devoted to extracting the boundary, local, and semantic information of gestures. Extra semantic supervision is used to direct each branch’s task. By establishing bridges between the branches through the BW module, the feature representation and learning ability of each branch are enhanced. We further exploit the MDSC module to improve the feature extraction ability of the boundary branch. Finally, the Bag module blends the semantic and local features under the guidance of the boundary information, yielding accurate gesture segmentation results. Experiments demonstrate that our network delivers higher accuracy than other lightweight networks while striking a good balance between computational complexity and accuracy.