1 Introduction

In today’s digital age, human–computer interaction has become an indispensable part of people’s daily lives. Traditional gesture recognition algorithms are mainly based on image processing and machine learning techniques, which extract gesture features and train classifiers to achieve gesture recognition [1, 2]. However, traditional algorithms have limitations in dealing with complex gestures and background interference, and the accuracy and real-time performance of recognition need to be improved. In recent years, the rapid development of deep learning technology has opened up new opportunities for gesture recognition. As an advanced object detection algorithm, You Only Look Once version 5 (YOLOv5) offers high accuracy and good real-time performance, and has achieved significant results in the field of object detection. However, the YOLOv5 algorithm still faces some challenges in gesture recognition tasks, such as insufficient optimization of gesture feature extraction and background interference handling [3]. Attention mechanisms can help models focus on important regions in images, thereby improving the performance of gesture recognition. Combining an attention mechanism with the YOLOv5 algorithm improves gesture localization and effective feature extraction, thereby enhancing both the real-time performance and the accuracy of gesture recognition. In view of this, this paper carries out a series of innovations. First, the study draws on the ideas of densely connected networks and feature map concatenation. The Improved You Only Look Once v5 (IYOLOv5) algorithm is then combined with the Convolutional Block Attention Module (CBAM) to design a CBAM–YOLOv5 gesture recognition model. The motivation of the research is to improve the accuracy and robustness of the gesture interaction feature extraction model, and to provide a reference for the development of the feature extraction field.

This paper mainly consists of five parts. Part 1 is the introduction. Part 2 presents the research background, which mainly outlines the research status of YOLOv5 and the field of attention mechanisms, and summarizes research results and methods at home and abroad. Part 3 presents the improvement of the YOLOv5 algorithm and the design of the human–machine interaction gesture recognition model, and is divided into three sub-sections. The first sub-section mainly introduces the design of a gesture recognition algorithm based on YOLOv5. In the second sub-section, some improvements are made to address the shortcomings of the YOLOv5 algorithm. The third sub-section combines the IYOLOv5 algorithm and a mixed attention mechanism to design the CBAM–YOLOv5 model. In Part 4, the optimization effect of CBAM–YOLOv5 is evaluated through comparative experiments and efficiency verification, and various performance aspects of the technology are tested in practical applications. Part 5 is the summary and future prospects of this paper.

2 Research Background

Intelligent systems have garnered increasing attention from enterprises and researchers across various fields due to the continuous advancement of society and rapid technological progress. Liu et al. proposed a new image-adaptive YOLO framework to address the challenge of locating targets in low-quality images captured under adverse weather conditions. Experiments showed that this model enabled each image to be adaptively enhanced [4]. Dewi et al. designed a target detection method that combines YOLOv4 with pyramid pooling to solve the problem of traffic sign detection in intelligent vehicles. The accuracy of this model reached 99.4% [5]. Du et al. aimed to improve the efficiency of road damage detection during road maintenance and repair, and designed a deep learning-based object detection framework built on the YOLO network to predict possible distress locations and categories. The comprehensive detection accuracy of this model reached 73.64% [6]. Cheng et al. designed a small attention-based landslide detection model (YOLO-SA) to improve the accuracy of landslide detection in high-resolution remote sensing imagery. Its accuracy was improved to 94.08% [7]. Lee and Hwang proposed a YOLO architecture with adaptive frame control to address the high hardware requirements of YOLO for real-time object detection. This framework could ensure the high accuracy and convenience of YOLO [8].

Li et al. proposed a step-by-step domain-adaptive YOLO framework to address the issue of poorly performing supervised object detection models based on deep learning technology. The framework significantly improved target detection performance in various domain-shift scenarios of autonomous driving applications [9]. Lv et al. proposed a deep learning framework based on a multi-attention-mechanism Convolutional Neural Network (CNN) to address the challenges of surface electromyography (sEMG) in remote-control gesture recognition. By introducing an adaptive channel weighting method and an improved fast path, the framework achieved efficient processing of sEMG signals. Experimental results showed that the framework achieved significant performance improvements on multiple datasets, with an average accuracy improvement of 0.46–18.88%, and it could classify seven gestures with 99.92% accuracy [10]. Zhu et al. proposed depth-wise separable, grouped, and shuffled convolutions to replace the convolutional structures in ConvLSTM, in order to investigate the redundancy of ConvLSTM and the impact of attention mechanisms in gesture recognition. Four variants of ConvLSTM were designed for attention analysis. The findings indicated that spatio-temporal feature fusion was minimally impacted by the spatial convolutions in the three gates, and that the attention mechanism integrated within the input and output gates did not enhance feature fusion [11]. Gu et al. proposed the WiGRUNT system, based on a dual-attention network, to address the issues caused by environmental changes in WiFi gesture recognition. The system dynamically focused on domain-independent features of gestures, used deep residual networks to evaluate the importance of spatio-temporal cues, and used their built-in sequential correlation for fine-grained gesture recognition. On the Widar3 dataset, WiGRUNT significantly outperformed existing technologies, achieving the best performance [12]. Peng et al. proposed a dynamic gesture recognition model based on CBAM-C3D to solve the problems of high computational complexity, complex feature extraction, and numerous network parameters in dynamic gesture recognition. The model utilized key-frame extraction, multi-modal joint training, and BN-layer network optimization to improve network performance. The experiments showed that the 3D CNN combined with the attention mechanism achieved a recognition accuracy of 72.4% on the EgoGesture dataset, a significant improvement over existing methods, verifying the effectiveness of the algorithm [13]. Li proposed a dual-stream neural network to address the problem of complex spatial features and similar temporal patterns in gesture recognition. A self-attention graph convolutional network was used to extract short-term temporal and hierarchical spatial information, and a residual-connected bidirectional independent recurrent neural network was used to extract long-term temporal information. Experiments showed that the method achieved 96.31%, 94.05%, and 90.26% recognition accuracy on widely used gesture datasets, demonstrating state-of-the-art performance [14]. Wu proposed a deep learning method called Da HAR to address the issue of inaccurate attribute localization in human attribute recognition. It adopted a coarse-to-fine attention mechanism, eliminated interference from irrelevant regions through self-masked blocks and a masked attention branch, and improved feature learning accuracy.
Experiments were conducted on WIDER Attribute and RAP databases, achieving state-of-the-art performance and proving the effectiveness of the method [15].

Although the above methods have achieved some success in the field of gesture recognition, they generally share certain defects, such as difficulty in balancing real-time performance with accuracy, insufficient environmental adaptability, and limited feature extraction ability. The method proposed in this study effectively addresses these problems by combining the attention mechanism with the YOLOv5 algorithm. This method employs an adaptive enhancement process that strengthens the key features, expressive ability, and discriminability of gesture features, thereby improving recognition accuracy. At the same time, the YOLOv5 algorithm itself is highly robust to changes in lighting, scale, rotation, and other factors. Integrating the attention mechanism into the model enables more precise focus on the gesture area, attenuation of background noise, and enhanced gesture recognition performance across diverse environmental contexts.

3 Improvement of YOLOv5 Algorithm and Design of Human–Machine Interaction Gesture Recognition Model

This chapter is divided into three sub-parts. The first part mainly introduces the design of gesture recognition algorithm based on YOLOv5. In the second part, some improvements are made to address the shortcomings of the YOLOv5 algorithm. The third part combines the IYOLOv5 algorithm and mixed attention mechanism to design the CBAM–YOLOv5 model.

3.1 Hand Gesture Recognition Algorithm Based on YOLOv5

YOLOv5 is a deep learning model used for object detection, which is improved based on the YOLO series of models. The design goal of YOLOv5 is to improve speed and efficiency while maintaining high detection accuracy [16, 17]. Compared to previous versions, YOLOv5 offers advantages in gesture recognition such as efficiency, accuracy, real-time performance, robustness, scalability, and ease of use, and it can meet the needs of gesture recognition in different scenarios. First, YOLOv5 applies optimization strategies such as a more efficient model structure, faster training speed, and a smaller model size. These strategies enable it to efficiently complete numerous gesture recognition tasks in a short amount of time while maintaining accuracy [18, 19]. Second, YOLOv5 exhibits high accuracy and can precisely detect gestures within images, as well as classify and recognize them. In addition, YOLOv5 adopts an end-to-end structure that can achieve fast gesture detection and complete gesture recognition in milliseconds, thus achieving real-time gesture recognition. Finally, YOLOv5 is strongly robust to changes in lighting, scale, rotation, and other factors, and can perform gesture recognition in different environments. The schematic diagram of its structure is shown in Fig. 1.

Fig. 1

Schematic diagram of YOLOv5 structure

Backbone refers to a basic structure or skeleton commonly used in neural networks or machine learning models. It is usually a backbone network composed of multiple layers or modules, used to extract features from input data. In YOLOv5, backbone refers to the main part of the neural network model responsible for extracting advanced features. It usually consists of multiple convolutional and pooling layers and is the core component of the entire network. In the neural network architecture of YOLOv5, backbone networks are the main body of the model, responsible for extracting feature representations with high semantic information from input images. By combining with other modules, meaningful features are extracted from the data, which are crucial for subsequent object detection tasks. This design and usage enable deep learning models to extract more semantic and expressive features from raw data, thereby improving the performance and generalization ability of the model. Its basic structure is shown in Fig. 2.

Fig. 2

Schematic diagram of backbone structure

The skeleton structure of Fig. 2 mainly includes convolution and group convolution. First, the input features enter the backbone structure through a standard convolution layer. This layer filters the input features with a sliding window to extract preliminary feature information. Subsequently, these features are fed into the group convolution layer, where they are divided into multiple groups for processing, with independent convolution operations performed within each group. The feature maps within each group represent abstract representations of the input data at different levels and from different perspectives. Finally, the output features of each group are aggregated to form an output feature map containing multiple groups for further processing and analysis by subsequent network layers. The backbone network is tightly integrated with the neck module (the FPN + PAN structure) to form an efficient feature transfer and processing system. The neck module is responsible for the multi-scale fusion and enhancement of the features extracted by the backbone network, so as to adapt to targets of different sizes. The Neck module adopts the FPN + PAN structure, a composite design that combines a feature pyramid network (FPN) with a path aggregation network (PAN). FPN is a network architecture used for multi-scale object detection and semantic segmentation tasks. It combines multi-scale feature maps by establishing a feature pyramid across different layers of the CNN, fusing information from multiple scales so that the network can accurately detect targets of different scales. PAN adds a bottom-up path aggregation branch on top of FPN, shortening the information path between low-level localization features and high-level semantic features and further strengthening feature fusion. Combining FPN and PAN can therefore further improve the performance of object detection and semantic segmentation models. The structure of FPN + PAN is shown in Fig. 3.
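To make the data flow concrete, the following is a minimal PyTorch-style sketch of a YOLOv5-like neck with a top-down (FPN) path followed by a bottom-up (PAN) path. The module names, channel counts, and the simple Conv–BN–SiLU blocks are illustrative assumptions rather than the exact YOLOv5 implementation, which uses CSP/C3 blocks.

```python
import torch
import torch.nn as nn

def conv_bn_act(c_in, c_out, k=1, s=1):
    # Simple Conv-BN-SiLU block standing in for YOLOv5's CBL/CBS blocks.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(),
    )

class TinyFPNPAN(nn.Module):
    """Top-down (FPN) path followed by a bottom-up (PAN) path."""
    def __init__(self, c3=256, c4=512, c5=1024):
        super().__init__()
        self.lat5 = conv_bn_act(c5, c4)            # reduce C5 before upsampling
        self.fuse4 = conv_bn_act(c4 + c4, c4, 3)   # fuse upsampled P5 with C4
        self.lat4 = conv_bn_act(c4, c3)
        self.fuse3 = conv_bn_act(c3 + c3, c3, 3)   # fuse upsampled P4 with C3
        self.down3 = conv_bn_act(c3, c3, 3, 2)     # bottom-up downsampling
        self.pan4 = conv_bn_act(c3 + c3, c4, 3)
        self.down4 = conv_bn_act(c4, c4, 3, 2)
        self.pan5 = conv_bn_act(c4 + c4, c5, 3)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")

    def forward(self, c3, c4, c5):
        p5 = self.lat5(c5)                                      # 1/32 scale
        p4 = self.fuse4(torch.cat([self.up(p5), c4], 1))        # top-down, 1/16
        p4_lat = self.lat4(p4)
        p3 = self.fuse3(torch.cat([self.up(p4_lat), c3], 1))    # 1/8, small targets
        n4 = self.pan4(torch.cat([self.down3(p3), p4_lat], 1))  # bottom-up, 1/16
        n5 = self.pan5(torch.cat([self.down4(n4), p5], 1))      # bottom-up, 1/32
        return p3, n4, n5   # multi-scale features for the detection heads

# Example with typical feature map sizes for a 640x640 input.
c3, c4, c5 = (torch.randn(1, 256, 80, 80),
              torch.randn(1, 512, 40, 40),
              torch.randn(1, 1024, 20, 20))
p3, n4, n5 = TinyFPNPAN()(c3, c4, c5)
```

The three returned maps at 1/8, 1/16, and 1/32 resolution correspond to the multi-scale detection heads.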

Fig. 3

Schematic diagram of FPN + PAN structure

A bounding box is a standard concept used in object detection and computer vision tasks to describe the location and extent of objects or targets in an image [20, 21]. In YOLOv5, the loss function of the bounding box mainly uses the Intersection over Union (IoU) to measure the degree of matching between the predicted bounding box and the actual bounding box. Specifically, based on the size of the IoU, predicted bounding boxes are divided into positive samples (with a higher IoU relative to the real bounding box) and negative samples (with a lower IoU relative to the real bounding box). Regression loss is then used to optimize the position and size of the bounding boxes of the positive samples, making them closer to the true bounding boxes. The calculation formula for IoU is given in Eq. (1).

$$ IoU = \frac{u}{x} $$
(1)

In Eq. (1), \(u\) represents the intersection area. \(x\) is the union area.
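As a concrete illustration of Eq. (1), the sketch below computes the IoU of two axis-aligned boxes given in (x1, y1, x2, y2) pixel coordinates; the function name and box format are assumptions made only for illustration.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Intersection area u: overlap of the two boxes (zero if they are disjoint).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    u = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # Union area x: sum of both box areas minus the intersection.
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    x = area_a + area_b - u
    return u / x if x > 0 else 0.0

# Example: a predicted box compared with a ground-truth box.
print(iou((10, 10, 50, 50), (30, 30, 70, 70)))  # about 0.143
```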

3.2 Optimization and Real-Time Research of Gesture Recognition Algorithm Based on YOLOv5

The standard YOLOv5 can be used for gesture detection. However, YOLOv5 is a general-purpose object detection algorithm, whereas gesture recognition is a specific task. Without a sufficiently large and diverse gesture dataset for training, YOLOv5 may not be able to fully learn the subtle differences and features of gestures, resulting in lower accuracy and robustness. In addition, although the relatively small network structure of YOLOv5s can provide fast real-time inference, some gesture recognition applications require even higher real-time performance, and when dealing with complex gestures or large-scale data, YOLOv5s may not meet real-time requirements. YOLOv5s also adapts poorly to the environment, especially under low lighting conditions, which easily leads to missed and false detections and a low gesture recognition rate. In response to the above issues, the study makes some improvements to the standard version of YOLOv5. The first is to improve the real-time performance of the algorithm. The feature extraction method used in standard YOLOv5 is the CNN, whose computational complexity increases as the number of network layers increases [22, 23]. The network structure of the standard YOLOv5 is relatively large and may not meet real-time requirements on resource-limited devices, which can lead to slower inference and an inability to meet the needs of real-time applications. Therefore, the study first uses a Field Programmable Gate Array (FPGA) to improve parallel processing capability and reduce power consumption. FPGAs are suitable for accelerating intensive operations such as convolutional computation. By deploying the YOLOv5 model on an FPGA, inference speed can be further improved, allowing the gesture recognition algorithm to run in real time on devices with limited resources [24, 25]. Second, the Ghost module is used for feature extraction. The Ghost module expands the original feature maps through lightweight sub-networks rather than directly adding convolutional layers, which can significantly reduce computational complexity. The application of the Ghost module in YOLOv5 provides an effective way to improve model performance, especially for real-time gesture recognition tasks on resource-limited devices. It is a lightweight module that can generate more feature maps with fewer parameters and less computation, thereby improving the model’s expressive power. Using the Ghost module in YOLOv5 can increase network width and improve model performance without adding too much computational burden. Specifically, the Ghost module generates a set of “phantom” feature maps by performing a series of linear transformations and nonlinear activation operations on the original feature maps. These feature maps have similar semantic information to the original feature maps, but with smaller parameter and computational costs. Therefore, using the Ghost module can to some extent reduce the complexity of the model and improve its generalization ability, as shown in Fig. 4.

Fig. 4

Schematic diagram of the Ghost module structure

Figure 4 shows the structural diagram of the Ghost module. Assume that in the original convolution operation, the input is convolved with \(n\) sets of \(k \times k\) kernels to generate an output with \(n\) channels and a spatial size of \(h \times w\). In the Ghost module, the input is convolved with only \(m\) sets of \(k \times k\) kernels to generate intrinsic feature maps of size \(m \times h \times w\). The intrinsic maps then undergo a linear transformation to generate ghost feature maps, and the intrinsic and ghost maps together form the output. This is achieved by decomposing a convolutional layer into two parts: a main path (MP) and a ghost path (GP). The MP performs normal convolution operations, while the GP is a lightweight convolution operation that typically has a smaller number of channels. The conventional convolution calculation formula is shown in Eq. (2).

$$ Y = X*f + b $$
(2)

In Eq. (2), \(f\) denotes the convolution kernels, \(b\) represents the bias, and \(X\) is the input of the convolution operation, i.e., the output feature map of the previous network layer. To obtain the gesture feature map, a linear transformation is also applied; its calculation formula is shown in Eq. (3).

$$ y_{ij} = \varphi_{ij} (\overline{{y_{i} }} ) $$
(3)

In Eq. (3), \(\overline{{y_{i} }}\) is the \(i\)-th intrinsic feature map, \(y_{ij}\) represents the \(j\)-th ghost feature map generated from it, and \(\varphi_{ij}\) is a linear transformation. Floating point operations (FLOPs) are a metric used to estimate the computational workload of an algorithm. The FLOPs of a standard CNN layer are given in Eq. (4).

$$ Flops_{cnn} = n \cdot h \cdot w \cdot c \cdot k \cdot k $$
(4)

In Eq. (4), \(h \cdot w\) is the height and width of the output feature, \(k \cdot k\) is the size of the convolutional kernel. The FLOPS of the Ghost module is given in Eq. (5).

$$ Flops_{ghost} = \frac{n}{s} \cdot h \cdot w \cdot c \cdot k \cdot k + (s - 1) \cdot \frac{n}{s} \cdot h \cdot w \cdot c \cdot d \cdot d $$
(5)

In Eq. (5), \(c\) is the number of channels, \(d \cdot d\) is the dimension of the second part of the convolutional kernel. Equation (4) is divided by Eq. (5) to obtain the acceleration ratio of the Ghost module, as shown in Eq. (6).

$$ r_{s} = \frac{n \cdot h \cdot w \cdot c \cdot k \cdot k}{{\frac{n}{s} \cdot h \cdot w \cdot c \cdot k \cdot k + (s - 1) \cdot \frac{n}{s} \cdot h \cdot w \cdot c \cdot d \cdot d}} \approx s $$
(6)

In Eq. (6), \(s\) is the number of Ghost feature maps. From Eq. (6), compared with a standard CNN layer, the Ghost module reduces the computation by a factor of approximately \(s\). This module can generate feature maps using simple linear transformations, effectively extracting key features while reducing network parameters and computational complexity [26, 27]. To improve the efficiency of feature transfer, a densely connected network is adopted in the study. Each convolutional layer in a densely connected network combines its input with the outputs of all previous layers to form a densely connected structure. The advantage of doing so is that subsequent layers can utilize both the input features of the current layer and the features of all previous layers, thereby making fuller use of the network’s feature representation ability [28]. Gestures may exhibit different features at different scales. Therefore, this study adopts multi-scale detection methods based on YOLOv5s to improve the detection ability of the model. An FPN module is incorporated into YOLOv5s, integrating high-level semantic information with low-level detail information to create multi-scale feature maps on which detection is performed at each scale. FPN is a commonly used multi-scale detection method, which constructs feature maps of different scales and detects gestures of different scales on them. In multi-scale detection, the contribution of gestures at different scales to the loss function should differ. Therefore, a scale-aware loss function is designed that assigns varying weights to gestures of differing scales, allowing the model to prioritize the detection of small-scale gestures. Assuming there are \(N\) scales of gestures, each scale has a certain number of positive and negative samples. The loss of positive samples at each scale is shown in Eq. (7).

$$ L_{i}^{p} = \frac{1}{{P_{i} }}\sum\limits_{j = 1}^{{P_{i} }} {L_{ce} (p_{j} ,c_{j} )} + aL_{giou} (b_{j} ,b_{j}^{\prime } ) $$
(7)

In Eq. (7), \(L_{i}^{p}\) represents positive sample loss, \(L_{ce}\) is the cross-entropy loss function used for calculating classification loss, \(L_{giou}\) is the GIoU loss function used for calculating regression losses, \(p_{j}\) is the predicted probability of the j-th positive sample, \(c_{j}\) is the true category of the j-th positive sample, \(b_{j}\) is the predicted bounding box of the j-th positive sample, \(b_{j}^{\prime }\) is the true bounding box of the j-th positive sample, and \(a\) is a weighting factor that balances classification loss and regression loss. The expression for negative sample loss is shown in Eq. (8).

$$ L_{i}^{n} = \frac{1}{{N_{i} }}\sum\limits_{k = 1}^{Ni} {L_{ce} (n_{k} ,0)} $$
(8)

In Eq. (8), \(n_{k}\) is the predicted probability of the k-th negative sample, from which the total loss expression for each scale can be obtained as shown in Eq. (9).

$$ L_{i} = w_{i}^{p} L_{i}^{p} + w_{i}^{n} L_{i}^{n} $$
(9)

In Eq. (9), the number of positive samples at the i-th scale is \(P_{i}\), the number of negative samples is \(N_{i}\), the weight of positive samples is \(w_{i}^{p}\), and the weight of negative samples is \(w_{i}^{n}\). To improve the recognition accuracy of the model, the idea of dense connected networks is used for feature map stitching. The schematic diagram of IYOLOv5 is shown in Fig. 5.
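The following sketch illustrates one way to combine Eqs. (7)–(9) into a single training loss, assuming the per-sample classification and GIoU terms have already been computed elsewhere. It interprets the regression term of Eq. (7) as being applied to every positive sample; the function name, data layout, and the parameter alpha (the weighting factor \(a\)) are hypothetical.

```python
def scale_aware_loss(per_scale, alpha=1.0):
    """Combine per-scale losses following Eqs. (7)-(9).

    per_scale is a list of dicts, one per detection scale i, holding
    precomputed per-sample terms:
      'ce_pos'   - cross-entropy of each positive sample, L_ce(p_j, c_j)
      'giou_pos' - GIoU regression loss of each positive, L_giou(b_j, b'_j)
      'ce_neg'   - cross-entropy of each negative sample, L_ce(n_k, 0)
      'w_pos', 'w_neg' - scale-dependent weights w_i^p, w_i^n
    """
    total = 0.0
    for scale in per_scale:
        P = max(len(scale["ce_pos"]), 1)
        N = max(len(scale["ce_neg"]), 1)
        # Eq. (7): positive-sample loss = classification + alpha * regression
        l_pos = sum(ce + alpha * g
                    for ce, g in zip(scale["ce_pos"], scale["giou_pos"])) / P
        # Eq. (8): negative-sample loss = classification only
        l_neg = sum(scale["ce_neg"]) / N
        # Eq. (9): weighted sum per scale; small-gesture scales get larger weights
        total += scale["w_pos"] * l_pos + scale["w_neg"] * l_neg
    return total
```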

Fig. 5

Schematic diagram of the IYOLOv5

When concatenating the intrinsic feature map with the convolutional feature map, a mechanism similar to a densely connected network is actually introduced. The concatenated feature map not only contains the features of the original image, but also the features extracted through the first convolutional layer. This can enhance the diversity and richness of feature representations, helping the network better learn and capture important information in images. Through this splicing method, the association between low-level and high-level features can be utilized to improve the network’s expression ability and performance. This densely connected structure has achieved significant improvements in many computer vision tasks, making the network easier to train and possessing stronger modeling capabilities.
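A minimal PyTorch sketch of such a Ghost-style block is given below: a standard convolution produces the intrinsic maps (Eq. (2)), a cheap depthwise convolution plays the role of the linear transformation \(\varphi\) (Eq. (3)), and the two sets of maps are concatenated in the dense-connection spirit described above. The block and its parameter names are illustrative assumptions, not the exact IYOLOv5 implementation.

```python
import torch
import torch.nn as nn

class GhostBlock(nn.Module):
    """Ghost-style convolution: n output channels are produced from only
    n/s 'intrinsic' maps plus cheap depthwise 'ghost' maps (cf. Eqs. (2)-(6))."""
    def __init__(self, c_in, c_out, k=3, s=2, d=3):
        super().__init__()
        c_intr = c_out // s                       # intrinsic channels (n/s)
        self.primary = nn.Sequential(             # ordinary convolution, Eq. (2)
            nn.Conv2d(c_in, c_intr, k, 1, k // 2, bias=False),
            nn.BatchNorm2d(c_intr), nn.ReLU(inplace=True),
        )
        self.cheap = nn.Sequential(               # linear transform phi, Eq. (3),
            nn.Conv2d(c_intr, c_intr * (s - 1), d, 1, d // 2,
                      groups=c_intr, bias=False),  # realised as a depthwise conv
            nn.BatchNorm2d(c_intr * (s - 1)), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        intrinsic = self.primary(x)               # n/s intrinsic feature maps
        ghost = self.cheap(intrinsic)             # (s-1)*n/s ghost feature maps
        # Dense-style concatenation: keep both intrinsic and ghost maps as output.
        return torch.cat([intrinsic, ghost], dim=1)

# Example: 64 -> 128 channels with roughly 1/s of the FLOPs of a plain 3x3 conv.
y = GhostBlock(64, 128)(torch.randn(1, 64, 80, 80))
print(y.shape)  # torch.Size([1, 128, 80, 80])
```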

3.3 Optimization Algorithm for Gesture Recognition Based on CBAM and YOLOv5

Attention mechanism is a computational method that simulates human attention and is widely used in deep learning to process sequence data and extract important information. It enables the model to selectively focus on relevant parts based on the importance of inputs by assigning different weights or degrees of attention to different input elements. The application of attention mechanisms in gesture detection can improve the model’s ability to attend to and understand gestures, helping the model better capture the details, shape, and temporal changes of gestures. A classic example can illustrate the role of attention mechanisms. When people want to translate the sentence “who are you”, the traditional approach is a sequence-to-sequence (seq2seq) model, which includes an encoder end and a decoder end. The encoder end encodes “who are you” and then passes the information of the entire sentence to the decoder end, which decodes the translated sentence word by word. If too much information is received during each decoding step, it may lead to internal confusion in the model, resulting in incorrect results. To address this issue, attention mechanisms are used. When generating the translation of “you”, the output is closely related to the word “you” and not to “who are”. Therefore, an attention mechanism is used in this process to focus more on “you” rather than on “who are”, improving the overall performance of the model. By introducing attention mechanisms, the expression ability and accuracy of gesture detection models can be enhanced, and the performance of gesture recognition can be improved. The study uses a mixed attention mechanism. It provides more comprehensive and accurate feature representation capabilities through the combination of channel attention modules and spatial attention modules. It can enhance the learning ability of CNNs for channel and spatial correlations in computer vision tasks, thereby improving the performance and robustness of the model, as shown in Fig. 6.

Fig. 6

Structural diagram of mixed attention mechanism

Figure 6 is a schematic diagram of the structure of the mixed attention mechanism. CBAM combines two attention mechanisms: channel attention and spatial attention. The combination of channel attention and spatial attention mechanisms in CBAM indeed provides powerful tools for gesture recognition tasks. Through channel attention, the model can learn the importance of gestures in different channels (i.e., different feature maps). For instance, certain channels may prioritize the shape of gestures, while others may prioritize the color or texture of gestures. Through this approach, the model can adaptively enhance gesture related features, thereby improving recognition accuracy. The spatial attention mechanism enables the model to focus on key regions in the input image, which is particularly important for gesture recognition. Since gestures typically occupy only a small portion of the image, it is essential for the model to precisely locate and concentrate on these areas. Utilizing spatial attention, the model can learn the spatial position of gestures and process them with accuracy. Gesture recognition technology is often used in challenging and diverse backgrounds in practical applications. The spatial attention mechanism of CBAM can help the model suppress irrelevant information in the background, thereby better focusing on the gestures themselves. This is particularly important for improving gesture recognition performance in complex backgrounds. Due to its adaptability, CBAM can learn the optimal combination of channel and spatial attention in different datasets and scenes. This means that regardless of how the appearance, size, or position of gestures change, the model can adapt to these changes by adjusting its attention mechanism, thereby maintaining high recognition rates. The CBAM is designed to be lightweight and can be easily inserted into existing neural network structures. This means that it can improve gesture recognition performance without significantly increasing computational burden. Overall, CBAM provides a powerful and flexible tool for gesture recognition tasks by combining channel attention and spatial attention mechanisms. It can help models better understand and represent the features of gestures, thereby improving recognition performance in various scenarios and conditions. It compresses the channel domain by averaging and maximizing, as shown in Eq. (10).

$$ Z_{i,j} = \frac{1}{C}\sum\limits_{K = 1}^{C} {F_{i,j} (K)} $$
(10)

In Eq. (10), \(C\) represents the number of channels and \(F\) represents the input feature map. A convolution operation and a batch normalization operation are then applied, as shown in Eq. (11).

$$ M_{s} = BN(\varphi (f(F))) $$
(11)

In Eq. (11), \(BN\) represents the batch normalization operation and \(\varphi\) represents a convolution operation. After these two steps, the spatial attention feature map is obtained. Channel domain attention mainly focuses on the channel dimension of the input data, namely the correlation and importance between different feature channels. The weight of each channel is calculated to adjust feature contributions and selectively utilize information from different channels in the network, with the aim of enhancing network performance. Channel domain attention can help networks learn the relationships between the feature representations of different channels and enhance attention to important feature channels. The channel attention algorithm first applies average pooling to the input feature map, as shown in Eq. (12).

$$ F_{C} = \frac{1}{H \times W}\sum\limits_{i = 1}^{H} {\sum\limits_{j = 1}^{W} {F(i,j)} } $$
(12)

In Eq. (12), \(H\) represents the height of the compressed original image, and \(W\) is the width of the compressed original image. The hidden layer size of the channel attention algorithm is shown in Eq. (13).

$$ R^{1 \times 1 \times C/r} $$
(13)

In Eq. (13), \(r\) is the reduction ratio, and \(1 \times 1 \times C/r\) represents the size of the feature map after dimensionality reduction. The formula for calculating the channel attention map is given in Eq. (14).

$$ M_{C} = BN(\varphi (W_{1} (\delta (W_{0} Avgpool(F) + b_{0} ) + b_{1} ))) $$
(14)

In Eq. (14), \(Avgpool()\) represents the average pooling operation, \(W_{0} \in R^{C/r \times C}\) and \(W_{1} \in R^{C \times C/r}\) are weight matrices, \(\varphi\) represents the ReLU function, and \(\delta\) represents the sigmoid activation function, as shown in Eq. (15).

$$ sigmoid(x) = \frac{1}{{(1 + {\text{e}}^{ - x} )}} $$
(15)

In Eq. (15), \(x\) is the input value. To further improve the performance and accuracy of the model in gesture recognition tasks, YOLOv5 and the attention mechanism are fused for gesture recognition. YOLOv5 is an excellent object detection model that can accurately locate target objects (such as people) in images, including their hand positions, which provides a good localization basis for subsequent gesture classification. The attention mechanism can help the model pay more attention to important features and channels, thereby improving the model’s ability to understand key parts of gestures. By introducing attention mechanisms, the model can learn gesture features in a more targeted manner, further improving the performance of gesture recognition. The resulting model can effectively capture key information of gestures, reduce sensitivity to irrelevant backgrounds, and improve the accuracy and robustness of gesture recognition. The structure of CBAM–YOLOv5 is shown in Fig. 7.
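For reference, the sketch below shows a widely used CBAM formulation in PyTorch: channel attention built from global average and max pooling followed by a shared MLP (cf. Eqs. (12)–(15)), and spatial attention built from channel-wise pooling followed by a convolution (cf. Eqs. (10)–(11)). It uses the common sigmoid-based variant, so it may differ in detail from the BN-based formulation in Eqs. (11) and (14); all class names are assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention: global pooling + shared MLP (cf. Eqs. (12)-(15))."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // r, 1, bias=True),   # W0, b0
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, 1, bias=True),   # W1, b1
        )

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        return torch.sigmoid(avg + mx)            # per-channel weights M_c

class SpatialAttention(nn.Module):
    """Spatial attention: channel-wise pooling + 7x7 conv (cf. Eqs. (10)-(11))."""
    def __init__(self, k=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, k, padding=k // 2, bias=False)

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)      # average over channels
        mx, _ = torch.max(x, dim=1, keepdim=True)     # maximum over channels
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # M_s

class CBAM(nn.Module):
    """Refine a feature map with channel attention followed by spatial attention."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.ca = ChannelAttention(channels, r)
        self.sa = SpatialAttention()

    def forward(self, x):
        x = x * self.ca(x)     # reweight channels (e.g., shape vs. texture cues)
        x = x * self.sa(x)     # focus on the image regions containing the hand
        return x

# A CBAM block can be dropped in after a backbone or neck stage, e.g.:
refined = CBAM(256)(torch.randn(1, 256, 40, 40))
```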

Fig. 7

Structural schematic diagram of CBAM–YOLOv5

First, it is necessary to input the original image of the gesture to be recognized into the entire recognition system. This image should be of high resolution and quality to ensure accurate capture of gesture details. The CBAM module processes the image through the channel attention mechanism, enabling the model to learn the most crucial channels for recognizing gestures across multiple channels. The next step is to process the spatial attention mechanism to help the model focus on the specific positions of gestures in the image. After these two steps of processing, the output of the CBAM module will be an enhanced feature map. The feature maps processed by CBAM are fed into the prediction module. The model will make preliminary predictions and classifications of gestures. The prediction module may include a series of convolutional layers, pooling layers, etc., to extract more abstract and advanced features. After the prediction module, the feature map will be further processed through a series of convolutional layers. The convolutional layers aim to decrease the dimensionality of feature maps and extract essential features of gestures. The Neck module, a crucial component, obtains the output generated by the convolutional layers, then performs additional processing. In this module, the feature maps will undergo CONCAT operation to merge feature maps from different layers together, forming a richer feature representation. The next step is the processing of the CBL module, which may include batch normalization, activation functions, and other operations to enhance the model’s expressive power. The CONCAT operation receives the output from the CBL module and merges it with other feature maps. The merged feature maps will be fed into the Sampling layer, where up-sampling or down-sampling operations may be performed to adjust the resolution and size of the feature maps. The feature map will undergo additional processing through a CBL module after passing through the Sampling layer. The purpose of this is to improve and adjust the feature map for better suitability in recognizing subsequent gestures. Following this processing, the feature map is sent to the backbone structure. This is a deep neural network typically composed of a series of convolutional layers, pooling layers, activation functions, and so on. Its function is to perform deep processing and analysis on feature maps, extracting more advanced and abstract gesture features. Finally, after processing with the backbone structure, the model will output the result of gesture recognition. This result may be a classification label (such as “OK”, “fist”, etc.), or a specific gesture posture description (such as finger joint angles, etc.). This result can be directly used for subsequent applications or analysis. The algorithm process is shown in Fig. 8.

Fig. 8

CBAM–YOLOv5 algorithm flowchart

First, image preprocessing is performed on the input original gesture image, including adjusting the image size, normalization, and other operations to ensure image quality and reduce computational complexity. Next is label conversion, which converts gesture labels into a format suitable for model training, including gesture categories, bounding box information, etc. The third step is to feed the preprocessed gesture image into the CBAM–YOLOv5 algorithm. The fourth step is CBAM processing, where the image enters the CBAM and is processed sequentially through the channel attention mechanism and the spatial attention mechanism. This enables the model to recognize which channels carry more significance for recognizing gestures across multiple channels and to attend to the precise positions of gestures in the image. The enhanced feature maps are the output of the CBAM. The fifth step is the prediction module and feature extraction. The CBAM-processed feature map is sent to the prediction module for preliminary prediction and classification. The prediction module may comprise convolutional layers, pooling layers, and other components to extract advanced and abstract features. The feature map then undergoes processing via a sequence of convolutional layers to decrease its dimensionality and extract and abstract key gesture features. The sixth step is feature merging in the Neck module, which receives the output of the convolutional layers and performs further processing. The feature maps undergo a CONCAT operation to merge feature maps from different layers, forming a richer feature representation. Within this step, the CBL module applies batch normalization, activation functions, and other operations to enhance the model’s expressive power, and the CONCAT operation merges the CBL output with other feature maps. The seventh step is the Sampling layer and resolution adjustment: the merged feature maps are sent to the Sampling layer for up-sampling or down-sampling operations to adjust the resolution and size of the feature maps. The output of the Sampling layer is processed again by a CBL module to further optimize and adjust the feature map, making it more suitable for subsequent gesture recognition tasks. The eighth step is the backbone structure and deep processing. The optimized feature map is fed into the backbone structure, which is a deep neural network typically composed of a series of convolutional layers, pooling layers, activation functions, etc. Its function is to perform deep processing and analysis of feature maps, extracting more advanced and abstract gesture features. Finally, after processing by the backbone structure, the model outputs the result of gesture recognition. This result may be a classification label (such as “OK”, “fist”, etc.) or a specific gesture posture description (such as finger joint angles), and can be directly used for subsequent applications or analysis. The ninth step involves optimizing the model according to actual application scenarios and performance requirements, including adjusting the network architecture, enhancing the loss function, optimizing training parameters, and other techniques to improve the real-time performance and accuracy of gesture recognition. A minimal sketch of the first two steps is given below.
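The sketch covers image preprocessing and label conversion, assuming OpenCV for image handling and a YOLO-style normalized label format; the input size, function names, and example values are illustrative assumptions only.

```python
import cv2
import numpy as np

def preprocess_image(path, size=640):
    """Step 1: resize to the network input size and normalize to [0, 1].
    A plain resize is used here; letterboxing is omitted for brevity."""
    img = cv2.imread(path)                       # BGR image, H x W x 3
    img = cv2.resize(img, (size, size))
    img = img.astype(np.float32) / 255.0         # normalization
    return np.transpose(img, (2, 0, 1))          # CHW layout for the model

def convert_label(class_id, box, img_w, img_h):
    """Step 2: convert a pixel-space box (x1, y1, x2, y2) into the YOLO-style
    format (class, x_center, y_center, width, height), all normalized."""
    x1, y1, x2, y2 = box
    xc = (x1 + x2) / 2.0 / img_w
    yc = (y1 + y2) / 2.0 / img_h
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    return [class_id, xc, yc, w, h]

# Example: a hypothetical gesture class 3 ("OK") occupying part of a 1280x720 frame.
print(convert_label(3, (400, 200, 600, 440), 1280, 720))
```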
In this algorithm, the attention mechanism helps the model focus on the region of interest, improves the ability to extract key features, and reduces the impact of background interference. Drawing on the idea of densely connected networks, feature map stitching can be performed. This stitching method helps preserve more spatial information, enabling the model to better handle gesture recognition in complex backgrounds. In addition, the algorithm combines a mixed attention mechanism, which can better focus on gesture areas and suppress background noise. Through the application of the mixed attention mechanism, this algorithm can more accurately identify and handle the interference caused by occlusion, dynamic backgrounds, and the presence of other objects or individuals. However, it may still be limited when handling gesture recognition under extreme occlusion or in highly dynamic backgrounds. For these problems, future studies can explore more advanced feature extraction methods and attention mechanisms to improve the performance and robustness of the algorithm. In terms of privacy protection, to avoid issues such as privacy breaches, lack of consent, and potential misuse, a series of technical measures are adopted in the experiment to ensure that user privacy is not leaked. First, during data transmission, encryption techniques, including Transport Layer Security (TLS), are utilized to encrypt data, preventing theft or tampering and ensuring the security of data in transit. Second, during data storage and processing, anonymization is performed in the experiment, separating users’ personal information from identification results. Specifically, technologies such as de-identification and anonymization are utilized to remove personal user information from recognition results and retain only the results themselves. In this way, even in the event of a data leak, the user’s personal information cannot be recovered, further protecting user privacy. Therefore, the use of the proposed method does not raise additional ethical concerns.

4 Algorithm Performance Verification and Application Effect Analysis

This chapter is divided into two sub-sections. The first section mainly verified the performance of the IYOLOv5 algorithm with the help of 40 volunteers. The second section mainly compared the CBAM–YOLOv5 gesture recognition model algorithms horizontally and analyzed their applicability.

4.1 Performance Verification of IYOLOv5 Algorithm

To verify the optimization effect of the proposed IYOLOv5 algorithm, the paper first conducted volunteer recruitment, which successfully attracted 400 volunteers to participate in the experiment. Each volunteer was instructed to perform ten distinct gestures in actual settings with various backgrounds, lighting conditions, obstacles, and user contexts, and record data. Following the data collection, the research team curated a database of 40,000 samples to test the algorithm’s adaptability to diverse scenarios. Figure 9 illustrates some of the samples from the dataset.

Fig. 9

Partial gesture data samples

Figure 9 shows some gesture data samples collected in the experiment. These samples may include various gestures, such as the shape, posture, and movements of the gestures. By collecting these diverse data samples, the team could gain a more comprehensive understanding of the performance of the IYOLOv5 algorithm in handling gesture recognition tasks. The optimization effect was evaluated by comparing the performance differences between the IYOLOv5 algorithm and the original YOLOv5 algorithm. The team conducted experiments using the collected dataset and combined quantitative and qualitative evaluation indicators to measure the degree of improvement in accuracy, speed, and robustness of the algorithm. The ratio of the training set to the test set was 4:1, and the number of training rounds was set to 350. The experimental results are shown in Fig. 10.
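For clarity, the sketch below shows the standard way precision and recall are computed for a detector at a fixed IoU threshold (here 0.5), reusing the iou() helper sketched earlier; MAP is then obtained by averaging the per-class average precision. The exact evaluation protocol used in the experiment is not specified, so this is only an illustrative assumption.

```python
def precision_recall(detections, ground_truths, iou_thr=0.5):
    """Detection precision/recall at a fixed IoU threshold.

    detections: list of (image_id, box, score), sorted by descending score.
    ground_truths: dict image_id -> list of boxes; each GT may be matched once.
    """
    matched = {img: [False] * len(boxes) for img, boxes in ground_truths.items()}
    tp = fp = 0
    for img_id, box, _score in detections:
        gts = ground_truths.get(img_id, [])
        best, best_j = 0.0, -1
        for j, gt in enumerate(gts):
            o = iou(box, gt)
            if o > best:
                best, best_j = o, j
        if best >= iou_thr and not matched[img_id][best_j]:
            matched[img_id][best_j] = True   # true positive: first match of this GT
            tp += 1
        else:
            fp += 1                          # false positive: unmatched or duplicate
    n_gt = sum(len(b) for b in ground_truths.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / n_gt if n_gt else 0.0
    return precision, recall
```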

Fig. 10

MAP and recall for IYOLOv5 and YOLOv5

Figure 10 shows the changes in Mean Average Precision (MAP) and recall for IYOLOv5 and YOLOv5 as the number of iterations increased. In Fig. 10a, the MAP value of the IYOLOv5 algorithm tended to converge after 160 iterations, and the final MAP value converged to 92.19%. The standard YOLOv5 algorithm only converged after 180 iterations and ultimately converged to 87.56%. Therefore, after improvement, the iteration speed of YOLOv5 in gesture recognition tasks was increased by 12.5%, and the MAP value was increased by 4.63%. In Fig. 10b, the recall rate of the IYOLOv5 algorithm improved rapidly at the beginning of the iteration, but in the end, the recall rates of both algorithms converged to around 91%, with little difference between the two. To further understand the various performance aspects of the IYOLOv5 algorithm, the classification loss curve (CL), generalized intersection and union ratio loss curve (GIOU loss, GL), and object loss curve (OL) of the algorithm were also statistically analyzed in the experiment. The results are shown in Fig. 11.

Fig. 11

CL, GL, OL variation curves of algorithm IYOLOv5

Figure 11 shows the CL, GL, and OL variation curves during the training process of the IYOLOv5 algorithm. As shown in Fig. 11a, the CL curve of the IYOLOv5 algorithm decreased rapidly from the beginning of training, tended to converge after 200 rounds of training, and finally converged to 0.003. A lower classification loss meant that the model was more accurate in predicting target categories. The GL curve in Fig. 11b converged after 210 iterations and finally reached 0.007, indicating that the predictive ability of the model gradually improved and stabilized. GL not only considered the overlapping area between the predicted box and the real box, but also the non-overlapping area between the two, thus comprehensively evaluating the prediction quality of the bounding box. A lower GL value indicated a better fit between the predicted box and the actual box, as well as higher positioning accuracy of the model. This high-precision positioning was crucial for the gesture recognition model. According to Fig. 11c, the OL of the IYOLOv5 algorithm ultimately converged to 0.0025. This indicated that the model could better determine whether there were targets in the image, successfully addressing the shortcoming of the traditional YOLOv5 algorithm in recognizing gestures against fuzzy and dim backgrounds. To compare the IYOLOv5 algorithm horizontally, three algorithms, YOLOv4, YOLOv3, and SSD, were also introduced in the experiment. The collected dataset was used to train the above four algorithms for 350 rounds, and the experimental results are shown in Fig. 12.

Fig. 12

Performance of four algorithms

Figure 12 shows the precision, recall, and MAP values of the IYOLOv5, YOLOv4, YOLOv3, and SSD algorithms after training. The precision, recall, and MAP of the IYOLOv5 algorithm were 94.5%, 95.1%, and 95.1%, respectively, the highest values among the four algorithms. The corresponding values for the other three algorithms all fell within the range of 78–89%. In summary, the IYOLOv5 algorithm performed well in terms of precision, recall, and MAP, while the other three algorithms required further optimization. These data and conclusions are of great significance for selecting algorithms suitable for gesture target detection tasks.

4.2 Performance Testing and Application Experiment of CBAM–YOLOv5 Gesture Recognition Model

Gesture recognition is one of the important issues in the field of computer vision. It has broad application prospects in fields such as human–computer interaction, smart home, virtual reality, etc. This study used YOLOv5 as the basic model and introduced CBAM to improve and design a CBAM–YOLOv5 gesture recognition model. The experiment first verified the applicability of the model. In the experiment, multiple volunteers participated and interacted with various gestures, including numerical gestures, action gestures, etc. The CBAM–YOLOv5 model was deployed on a system with real-time inference capabilities, utilizing cameras to capture real-time gesture images of volunteers. The video data were transmitted to the model for real-time interaction with volunteers. The experimental results are shown in Fig. 13.

Fig. 13

Schematic diagram of human–computer interaction experiment results

Figure 13 shows the results of the human–computer interaction experiment. Through real-time interaction with volunteers, the CBAM–YOLOv5 model could accurately detect and recognize the different gestures displayed by volunteers. The model could quickly locate and classify gesture bounding boxes, achieving real-time gesture recognition. The interaction experiment between volunteers and the CBAM–YOLOv5 model showed that the model had good application performance. It could accurately classify and detect different types of gestures, meeting practical application requirements. At the same time, the model also exhibited fast inference speed, making real-time interaction possible. To investigate the impact of different components on the performance of object detection models, the experiment started with the basic model YOLOv5 and gradually added different components such as the Ghost module, FPN, PAN, and CBAM. After each addition, the accuracy, inference time, robustness, user acceptance, and comfort of the model were evaluated. During the experiment, the robustness evaluation of the model was completed by testing the performance of the model in complex situations such as different scenarios, lighting conditions, and partial occlusion. User satisfaction with and acceptance of the model’s detection results were measured through user surveys, feedback, or behavioral data. Comfort was assessed by evaluating the ease of use, fluency, and user interface friendliness of the model in practical application. These evaluation indicators together constituted a comprehensive assessment of the model’s performance. The experimental results are shown in Table 1.

Table 1 Ablation experiments of the CBAM–YOLOv5 model

The experimental results in Table 1 show that as different components were gradually added, the accuracy of the model gradually improved, and the inference time, robustness, user acceptance, and comfort also improved. This demonstrated the significant impact of the collaborative effects of various components in deep learning models on performance. The backbone network in particular, here based on YOLOv5, played a crucial role in practical applications: it was the core structure of the model, responsible for extracting effective features from the input data. By continuously optimizing the network, such as adding the Ghost module, FPN, PAN, and CBAM, the detection accuracy and efficiency of the model could be significantly improved. To test the performance of this model, the experiment used the ChaLearn Gesture Dataset (CGRD) to train and test the above five models. This dataset contains a large number of gesture video clips involving multiple subjects and different scenes, and is widely used for the evaluation and comparison of gesture recognition algorithms. The experimental results are shown in Fig. 14.

Fig. 14

Change of loss function of each algorithm

Figure 14 shows the accuracy curves of the various algorithms as the number of iterations increases. Among the five accuracy curves, the CBAM–YOLOv5 algorithm converged after 170 iterations, with the fastest convergence speed, ultimately converging at 95.1%. The Stacking LSTM algorithm converged after 360 iterations, with an accuracy rate of 90.6%. GRU converged after 290 iterations, and its final accuracy converged to 88.1%. However, the performance of the Support Vector Regression (SVR) and LSTM algorithms was poor, with both requiring more than 300 iterations before reaching convergence and converging at a low accuracy. In summary, the CBAM–YOLOv5 algorithm had the fastest convergence speed and the highest accuracy. To exclude the impact of dataset selection on the experiment, the Microsoft Research Cambridge-12 Kinect gesture dataset (MSRC-12 KGD) was also used to train the five algorithms for 420 iterations. The experimental results are shown in Fig. 15.

Fig. 15

Precision and average error values of each dataset

Figure 15 shows the comparison of the accuracy of the five algorithms on different datasets. In Fig. 15a, the CBAM–YOLOv5 algorithm outperformed the other algorithms in accuracy, demonstrating its outstanding performance. LSTM ranked last in accuracy and performed worse than the other algorithms. In Fig. 15b, the accuracy of the CBAM–YOLOv5 algorithm differed little between the models trained on different datasets, demonstrating its good stability. However, the accuracy of the Stacking LSTM and SVR algorithms varied greatly across datasets, showing significant instability. Compared with the CGRD dataset, SVR showed a decrease in accuracy of 7.4% when switching to the MSRC-12 KGD dataset for training. Conversely, the Stacking LSTM algorithm achieved an increase in accuracy of 3.8% under the same circumstances, ultimately reaching an accuracy of 94.3%. In summary, the CBAM–YOLOv5 algorithm had strong adaptability to datasets, and its accuracy remained relatively stable across different datasets. Therefore, in the pursuit of high accuracy and stability in gesture recognition tasks, the CBAM–YOLOv5 algorithm may be the more suitable choice. The experiment further introduced other datasets: the ImageNet dataset (Data 1), the ChaLearn dataset (Data 2), the EgoGesture dataset (Data 3), the HaGRID dataset (Data 4), and the MSTAR dataset (Data 5). These datasets provide a large number of high-quality gesture samples covering different body features and cultural backgrounds. They were used to evaluate the proposed CBAM–YOLOv5 algorithm, the algorithm proposed in reference [29], and the algorithm proposed in reference [30]. The models were trained on these datasets and evaluated using the F1 value; the experimental results are shown in Fig. 16.

Fig. 16

F-measure for the three models

Figure 16 shows the training results of CBAM–YOLOv5, the algorithm proposed in Ref. [29], and the algorithm proposed in Ref. [30] on five different datasets. The CBAM–YOLOv5 model performed very well in F1 scores on all datasets, with F1 values significantly higher than those of the other two models, reaching 93.22%, 90.56%, 91.46%, 92.66%, and 93.46%, respectively. The proposed CBAM–YOLOv5 gesture recognition model demonstrated high recognition accuracy, robustness, and adaptability to varying body features, cultural backgrounds, and recognition tasks for different gestures. Although the algorithm proposed in Ref. [29] achieved a higher F1 score than the method proposed in this study on the ChaLearn dataset, its performance varied considerably across datasets, indicating poor robustness. The method proposed in Ref. [30] had low F1 scores on all five datasets and poor overall performance. It should be noted that different datasets may have different data distributions and characteristics, such as gesture categories, background complexity, and image quality. If a dataset deviates significantly from other datasets in these aspects, a model trained on that dataset may perform poorly on the others. For instance, if the gestures in the ImageNet dataset were primarily executed against a stationary background, while those in the EgoGesture dataset occurred against a dynamic background, the models trained on these two datasets might exhibit varied performance. In addition, because the datasets come from different physical characteristics and cultural backgrounds, there may be differences in the expression and understanding of gestures. For example, certain gestures may be common in some cultures, while in other cultures they may be rare or carry different meanings. This difference may lead to a decline in the performance of the model in certain cultural contexts.

5 Conclusion

To solve the problem of high error rate and poor robustness in traditional gesture recognition models, this study combined the YOLOv5 algorithm with a mixed attention mechanism to design a CBAM–YOLOv5 model. The experiment recorded the CL, GL, and OL change curves during the training process of the IYOLOv5 algorithm. The results showed that the CL curve of the IYOLOv5 algorithm decreased rapidly from the beginning of training, tended to converge after 200 rounds of training, and finally converged to 0.003. The GL curve of the IYOLOv5 algorithm tended to converge after 210 iterations, and finally converged to 0.007. The OL of the IYOLOv5 algorithm ultimately converged to 0.0025, indicating that the model could better determine whether there were targets in the image. CBAM–YOLOv5 had the highest accuracy compared with the other algorithms, while LSTM had the lowest accuracy. The accuracy of CBAM–YOLOv5 models trained on different datasets did not differ significantly. However, there was a significant difference in the accuracy of the models trained on different datasets for Stacking LSTM and SVR. Compared with using CGRD, the accuracy of SVR trained on MSRC-12 KGD decreased by 7.4%, while the accuracy of Stacking LSTM increased by 3.8%, resulting in a final accuracy of 94.3%. In this case, its accuracy was 0.2% higher than that of the proposed CBAM–YOLOv5 algorithm, which is an area that needs improvement in future research. In summary, the proposed algorithm successfully addresses the shortcomings of the traditional YOLOv5 algorithm in recognizing gestures against fuzzy and dim backgrounds, improves the accuracy and robustness of gesture recognition, and shows strong adaptability to datasets, with its accuracy remaining relatively stable across them. Therefore, to achieve both high accuracy and stability in gesture recognition tasks, the CBAM–YOLOv5 algorithm presents a preferable option. However, there is still room for improvement. It is important to recognize that in real-world scenarios, the viable implementation of gesture recognition systems depends not only on technological precision but also heavily on user adoption and comfort. Human factors, specifically user fatigue, can have a notable impact on the system’s practical efficacy. Extended use of gesture recognition systems can cause user fatigue, thereby impairing accuracy and efficiency. To address this issue, developers can design efficient interaction interfaces and operation processes in subsequent applications. It is also important to periodically remind users to rest and stretch to sustain their focus and comfort. In addition, gesture recognition systems may be limited in specific environments, such as under insufficient lighting or noise interference. To overcome these limitations, a number of measures can be taken, including enhancing image contrast, using filters to reduce noise, or introducing a design robust to environmental factors into the algorithm. This can be achieved through data augmentation, which simulates gesture images in different environments and thereby improves the stability and accuracy of the model under various conditions. By fully considering these human factors, gesture recognition systems can be better designed and implemented, improving user acceptance and comfort and thus better adapting to practical application needs.
In the future, further research will be conducted to investigate the impact of fine-tuning model structural parameters on the performance of CBAM–YOLOv5. By adjusting parameters such as network layers, convolution kernel size, and learning rate, the study aims to identify optimal configurations that can enhance the accuracy, robustness, and efficiency of the model, thereby enabling it to better adapt to various complex scenarios.