Introduction

The exchange of information is the fundamental means through which individuals engage with the outside world. To explore their surroundings and learn about their environment, people rely on their sensory organs. At the same time, people communicate information to those around them through their motions, eyes, faces, gestures, and other physical cues. Human interaction is built on this exchange of information in its various forms.

Gestures are an integral part of everyday human life. Vision-based gesture recognition is a technique that combines sophisticated perception with computer pattern recognition. It is used in many different sectors, including engineering and research, and is essential for enhancing human–machine interaction. Because natural gestures change constantly, the current gesture detection technology is unable to fully achieve genuine human–machine communication.

The history of human-computer interaction is one of humans adapting to machines and machines continuously adjusting to humans. Each new type of human-computer interaction has resulted in substantial changes in the associated industries. The advent of the computer mouse made computer operation more natural and contributed to the rapid spread of computers, and the debut of the touch screen-based iPhone considerably improved the user experience and transformed the cell phone sector. In recent years, there has been a surge of interest in natural human-computer interaction approaches such as facial recognition, human motion analysis, and gesture recognition. Gestures are a natural and intuitive means of human-computer communication that has emerged as a significant technology in contemporary human-computer interaction. A successful implementation of gesture recognition will provide a revolutionary human-computer interaction experience.

Taxonomy of hand gesture recognition

Gesture recognition methods are divided into sensor-based and vision-based.

Sensor-based methods

Depending on the type of sensor, sensor-based recognition may be grouped into three categories: methods using data gloves, methods using EMG signals, and methods using Wi-Fi or radar.

  1. (1)

    Data gloves With the widespread use of sensors, wearable sensor-based gesture recognition has advanced quickly. To recognize gestures using wearable technology, a glove with numerous sensors must be worn on the hand, and the data from the glove must be analyzed. In particular, gesture recognition based on data gloves can intuitively gather three-dimensional spatial information on hand posture by utilizing many sensors and is not limited by the surrounding environment. A sensing glove was proposed by Komura and Lam [1] for controlling 3D game characters. Kim et al. [2] suggested a data glove-based sign language recognition system and achieved a 99.26% motion detection rate and an approximately 98% finger state recognition rate. Using data gloves, Rodriguez et al. [3] investigated the use of gestures for human-computer interaction (HCI) in virtual reality (VR) applications. Helen Jenefa and Gokulakrishnan [4] designed gloves that individuals with hearing and speech impairments can readily wear and use. Such a glove is equipped with bending sensors, accelerometer sensors, and touch sensors that measure the bending and movement of the user's hand and enable nonmute persons to comprehend the gestures produced by a speech-impaired individual. To recognize several motions, including move-ready, grip, loose, landing, takeoff, and hover motions, as well as to enable remote control of a six-axis vehicle, Huang et al. [5] created a data glove and achieved an overall recognition rate of 84.3%. Using five fundamental classification algorithms (decision trees, support vector machines, logistic regression, Gaussian naive Bayes, and a multilayer perceptron), Antillon et al. [6] created a smart diving glove that was trained and tested; an additional study was performed underwater to determine whether the environment had any impact on how each algorithm classified gestures. Mummadi et al. [7] demonstrated a data glove prototype with a glove-embedded gesture classifier that used information from inertial measurement units on the user's fingertips.

  2. (2)

    Electromyography (EMG) Electromyography records the electrical activity of muscle tissue using electrodes affixed to the skin or inserted into the muscles. This technique primarily uses sensors to gather electrical signals from the skin and muscles on the surface of the human body. After amplifying the signals and processing them further, the method screens the information that might be contained in each gesture before recognizing gestures. In 2002, Vuskovic and Du [8] used two-channel sEMG signals simultaneously to identify six different gestures with an accuracy of 78%. Nazarpour et al. [9] performed feature extraction using high-order statistics in 2005 and correctly identified four forearm movements using a clustering method with an accuracy of 91%. In 2018, Wu et al. [10] considered 15 features, including integral electromyographic data, to recognize five hand motions with an average accuracy of 90% using upgraded k-nearest neighbor algorithms. In 2015, Guo et al. [11] used four different types of data as feature input; the maximum accuracy of the random forest classifier was 88.7%, and the best accuracy of the support vector machine (SVM) classifier was 85.9%. To identify four different types of gesture motions, Kim et al. [12] considered power spectral density as feature input and chose the SVM classifier, reaching an accuracy of 91.97%.

  3. (3)

    Wi-Fi and radar Due to the growing use of Wi-Fi devices in indoor settings, gesture recognition technology based on Wi-Fi signals has drawn increasing attention. Wi-Fi signals were first used for sensing in 2000 by Bahl and Padmanabhan [13], who suggested a system for indoor localization based on the received signal strength of such signals. In 2013, Pu et al. [14] proposed WiSee, which used the Doppler shift as a feature to identify nine gestures, such as pushing and pulling, with an accuracy of 94%. WiSee, however, relied on specialized software-defined radio devices and could not be immediately implemented on existing Wi-Fi devices. Mudra [15] developed a Wi-Fi-based finger-level gesture detection method with 96% accuracy by utilizing the difference in signals between antennas in different locations, although the method required a distance of more than 10 cm between the two antennas to obtain good results. WiFall [16] applied the random forest method and an SVM classifier to categorize various human activities and implement fall detection, yielding an average false alarm rate of 18% and a detection accuracy of 87%. To address the issue of signal propagation through walls, Wu et al. [17] suggested a passive human activity identification system based on Wi-Fi signals that requires no extra equipment.

Users of sensor-based methods typically must wear gloves with sensors or have probes attached to their arms. Additionally, the methods are frequently constrained to laboratory settings by the instruments that must be set up before recognition.

Computer vision-based methods

Image sensor technology has undergone continuous updating and iteration since its inception. Due to the inability of 2D-based image sensors to provide the additional information required to fulfill the needs of the contemporary society, the interest of academics in such sensors is currently declining, and the AI internet-of-things (IoT) field is shifting toward 3D. Monocular, binocular, and depth (RGB-D) cameras are the three main types of cameras used in vision-based gesture detection systems.

Microsoft’s Kinect V1 (the first generation of Kinect) is a depth camera that was first unveiled on June 14, 2010, combining OpenNI and the SDK library to monitor the bones of human joints as a foundation for gesture recognition research. A dynamic Arabic sign language recognition system for Kinect was introduced by Hisham and Hamouda [18]. It combined decision trees with Bayesian classifiers for gesture identification and then used the AdaBoost method to improve the system, resulting in a recognition rate of approximately 93.7%.

Leap Motion, a body controller manufacturer focused on the PC and Mac platforms, introduced its body controller on February 27, 2013, utilizing the stereo vision principle and two cameras to determine coordinates of spatial objects similarly to the human eye.

RealSense cameras from Intel are also depth cameras with gesture recognition capabilities. De Smedt et al. [19] proposed a skeleton-based 3D gesture recognition method that extracts valid descriptors from the connected hand-skeleton joints returned by the Intel RealSense depth camera. Each descriptor was encoded as a Fisher vector obtained with a Gaussian mixture model, and the encodings were combined into the final feature vector. The original data were not filtered, and SVM was the only classifier used, leading to a comparatively poor recognition rate.

Gesture recognition processes

Static gesture recognition and dynamic gesture recognition are two categories of gesture recognition technologies [20]. The former implies that the hand is fixed for recognition and that aspects such as hand posture, shape, and location do not change [21]. Dynamic gestures are composed of sequential frames of static gestures, implying that the latter are a subset of dynamic gestures [22].

The main process of visual gesture recognition is as follows:

  • Data acquisition: acquiring gesture images with a video camera and preprocessing the images;

  • Gesture detection and segmentation: detecting the position of the hand in the gesture image and segmenting the hand region;

  • Gesture recognition: extracting image features from the hand region and recognizing the gesture type based on those features. In “Hand gesture recognition process”, the discussion will be divided into these respective parts.

“Hand gesture recognition process” will thoroughly analyze and define the essential techniques involved in gesture recognition, following the basic steps of gesture recognition as the main thread. “Experimental Evaluation” will present several evaluation metrics for gesture recognition and segmentation. With the emergence of depth cameras, there has been significant growth in studies of gesture recognition based on depth data; “Hand gesture recognition based on RGB-D cameras” will detail the research of various scholars on gesture recognition based on RGB-D cameras. The applications of gesture recognition, particularly in robotics and human-computer interaction, will be described in “Hand gesture recognition applications”. “Problems, outlook, and conclusion” will discuss the current difficulties and the future directions in gesture recognition.

Hand gesture recognition process

Designed to enable information exchange between users and intelligent devices, vision-based gesture recognition technology acquires video images containing operational gestures through devices such as cameras and then processes those images through steps such as gesture segmentation, gesture feature extraction, and gesture feature classification.

Data acquisition

The data collected for vision-based gesture recognition are image frames. An image needs to be preprocessed because it cannot be recognized directly after being acquired by the camera. To improve the overall performance of the system, the image preprocessing stage modifies the input image or video. The following are some common preprocessing steps.

Image grayscaling

Image grayscaling is the conversion of color images to grayscale images for better processing of image information. Ĉadik et al. [23] and Benedetti et al. [24] reviewed the research on grayscaling of color images. In grayscaling, the value of each pixel in an image is calculated from the values of its red, green, and blue channels by a specific algorithm to obtain a gray value that represents the luminance of that pixel [25]. The purpose of grayscaling is to simplify image processing and reduce computational and storage requirements. Common image grayscaling algorithms include the average method, the weighted average method, the maximum value method, and the minimum value method. In general, the average method and the weighted average method are used more often because they are simpler and more effective.
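As a brief illustration (the file name is a placeholder and the weights are the standard ITU-R BT.601 luma coefficients rather than values from the paper), the average, weighted average, and maximum value methods can be written as follows.

```python
import cv2
import numpy as np

bgr = cv2.imread("hand.png")                      # OpenCV loads color images in B, G, R order
b, g, r = cv2.split(bgr.astype(np.float64))

gray_average = (r + g + b) / 3.0                          # average method
gray_weighted = 0.299 * r + 0.587 * g + 0.114 * b         # weighted average (BT.601 luma weights)
gray_maximum = np.maximum(np.maximum(r, g), b)            # maximum value method

gray = gray_weighted.astype(np.uint8)
# cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY) applies the same weighted-average formula internally.
```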

Image smoothing

Noise is removed from images using image smoothing techniques (e.g., Gaussian filtering, median filtering, etc.) to extract gesture information more accurately [26]. Gaussian filtering is a smoothing method based on a Gaussian function that can effectively smooth an image while preserving edge information. Such filtering involves convolving the image and replacing the value of each pixel with the weighted average of the pixels in the region around that pixel. The kernel size and standard deviation of the Gaussian filter determine the degree of image smoothing; in gesture preprocessing, an appropriate Gaussian kernel size and standard deviation are usually chosen to achieve the best smoothing effect. Median filtering, on the other hand, is a smoothing method based on rank-order statistics that can eliminate outliers such as salt-and-pepper noise while preserving the details of the image. The basic idea is to replace the value of a pixel with the median of its neighbors. Certain objects with hue and saturation characteristics similar to those of skin can produce salt-and-pepper noise in images generated by skin region detection, and this noise can be suppressed using median filtering and morphological methods [27]. In [28, 29], researchers used median filtering to process images.
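As a minimal illustration of these two filters (not tied to any cited study), the following Python/OpenCV sketch applies Gaussian and median smoothing to a grayscale frame; the file name and kernel sizes are placeholder choices.

```python
import cv2

gray = cv2.imread("hand.png", cv2.IMREAD_GRAYSCALE)   # placeholder input image

# Gaussian filtering: the kernel size and sigma control how strongly the image is smoothed.
smoothed_gauss = cv2.GaussianBlur(gray, (5, 5), 1.0)

# Median filtering: replaces each pixel with the median of its neighborhood,
# which suppresses salt-and-pepper noise while keeping edges sharp.
smoothed_median = cv2.medianBlur(gray, 5)
```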

Edge detection

Edge detection algorithms (e.g., the Canny algorithm, the Sobel algorithm, etc.) are used to detect edge information in an image for better segmentation of gestures. Edge detection is the process of identifying and locating points of sharp discontinuity in an image that represent the boundary between an object and the background or between adjacent objects [30, 31]. The Canny algorithm detects edges by calculating the gradient magnitude and orientation of pixels in an image, and its main features are high accuracy, low sensitivity to noise, and clear details in the detected edges [30]. The advantage of this algorithm is that it can effectively remove noise from an image and extract clear and continuous edges. The disadvantage is that it is computationally intensive, requires multiple filtering and processing operations, and is highly complex. The Sobel algorithm is similar to the Canny algorithm in that it detects edges by calculating the gradients in the x and y directions of each pixel; its advantages are that it is computationally simple, fast, and capable of detecting fine edges [32]. Its disadvantages are that it is not sufficiently accurate at detecting straight edges and that it easily amplifies noise. Therefore, the image is usually smoothed using methods such as Gaussian filtering in the early stages of edge detection to reduce the effect of noise.
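The sketch below (illustrative only; the thresholds and kernel size are assumptions, not values from the paper) shows both operators applied after a Gaussian pre-smoothing step, as recommended above.

```python
import cv2
import numpy as np

gray = cv2.imread("hand.png", cv2.IMREAD_GRAYSCALE)   # placeholder input image
blur = cv2.GaussianBlur(gray, (5, 5), 1.0)            # smooth first to suppress noise

# Canny: double (hysteresis) thresholds; 50/150 are common starting values.
edges_canny = cv2.Canny(blur, 50, 150)

# Sobel: gradients along x and y, combined into a gradient-magnitude image.
gx = cv2.Sobel(blur, cv2.CV_64F, 1, 0, ksize=3)
gy = cv2.Sobel(blur, cv2.CV_64F, 0, 1, ksize=3)
edges_sobel = cv2.convertScaleAbs(np.sqrt(gx ** 2 + gy ** 2))
```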

Morphological image processing

Morphological operations (e.g., dilation, erosion, etc.) are used to morphologically process an image to better extract the shape information of a gesture [26]. Commonly used morphological operations include dilation, erosion, opening, and closing. The dilation operation expands the target region in the image to make it more visible, which is suitable for extracting gesture edge information. The erosion operation shrinks the target region in the image, which facilitates removing noise and small details. The opening and closing operations can remove burrs and fill holes in the image, respectively, to make the shape of the gesture clearer.
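For concreteness, a brief OpenCV sketch of these four operations on an assumed binary hand mask follows; the mask file and the 5 x 5 elliptical structuring element are illustrative choices.

```python
import cv2

mask = cv2.imread("hand_mask.png", cv2.IMREAD_GRAYSCALE)          # assumed binary hand mask
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))

dilated = cv2.dilate(mask, kernel)                                # grow the hand region
eroded = cv2.erode(mask, kernel)                                  # shrink it, removing specks
opened = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)           # erosion then dilation: removes burrs
closed = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)          # dilation then erosion: fills holes
```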

Optimum thresholding

Threshold segmentation algorithms (such as the Otsu algorithm, the Niblack algorithm, etc.) are used to divide an image into two parts, foreground and background, to better segment the gestures [33]. The Otsu algorithm is a global threshold segmentation algorithm; its basic idea is to divide the pixel gray values of an image into two classes such that the weighted sum of the within-class variances is minimized (equivalently, the between-class variance is maximized) [34]. This algorithm can adaptively determine the threshold and is hence suitable for image segmentation tasks in various scenes. The Niblack algorithm is a local binarization algorithm that divides the image into several small regions and then binarizes each region; it uses a gray value threshold computed from each local window to decide whether each pixel belongs to the foreground or the background [35]. This algorithm is more suitable than global thresholding for segmenting images with highly diverse gray value distributions and can effectively handle images with uneven illumination and complex backgrounds.
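A minimal comparison of the two thresholding styles is sketched below with OpenCV and scikit-image; the Niblack window size is an illustrative choice rather than a value recommended by the paper.

```python
import cv2
from skimage.filters import threshold_niblack

gray = cv2.imread("hand.png", cv2.IMREAD_GRAYSCALE)   # placeholder input image

# Otsu: a single global threshold chosen from the gray-level histogram.
_, mask_otsu = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Niblack: a per-pixel local threshold computed from the mean and standard
# deviation inside a sliding window.
local_t = threshold_niblack(gray, window_size=25)
mask_niblack = (gray > local_t).astype("uint8") * 255
```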

In general, an input image is first thresholded into a binary image, noise is then removed using median and Gaussian filters, and morphological operations complete the preprocessing stage.

Gesture detection and segmentation

The gesture must first be separated from the background for the computer to recognize it, because the computer records the details of both the gesture and the scene in which it occurs. Hand segmentation is the division of the set of pixel coordinates obtained in the earlier gesture detection phase, which reduces the computation over pixels and facilitates the subsequent operations. Gesture segmentation is the first crucial stage in a gesture recognition algorithm, and its successful completion is necessary for accurate gesture identification. There are many gesture segmentation techniques, but from a practical and application standpoint, almost all of them still face great difficulties in terms of accuracy, stability, and speed, for example, when segmenting gestures in complex backgrounds or when the distance between the camera and the person varies. A comparison of gesture detection and segmentation methods is shown in Table 1.

Table 1 Comparison of gesture detection and segmentation methods

Skin color segmentation

The most fundamental apparent characteristic of a human hand is skin color. Even though everyone has a unique skin tone, human skin tones are concentrated in a certain region of a particular color space. In addition, the orientation, size, and perspective of an image have very little impact on skin color, which is highly invariant to rotation, translation, and scaling. Hence, a significant portion of current studies of gesture recognition rely on skin tone information for gesture segmentation. The three most commonly used color spaces are the RGB, HSV, and YCbCr color systems.

In [36], the hand skin tone was segmented using the threshold method. In [37], a skin color detection method was used to detect hands and faces. A skin tone model was utilized in [38] for segmentation, and the HSV color model was selected after a comparison with the RGB model because of the influence of luminance. A simple segmentation technique based on calculating the maximum and minimum skin probabilities of the input RGB image was used in [39]. In [40], a skin detection model and an approximate median model were applied to segment the image. The approximate median model was utilized for background subtraction, and the skin detection model was used to identify the hands and fingers in the image. At the same time, without relying on any artificial neural network training, Dhule determined the precise sequence of moving hands and fingers by calculating the change in RGB color pixel values in a video and controlled the mouse movement in a window in real time according to the hand and finger movements. A gesture recognition scheme based on the skin color model approach and the threshold approach, combined with effective template matching using principal component analysis, was proposed in [41]. Veluchamy et al. [42] used a skin color thresholding model for segmentation; numerous characteristics were extracted using the scale-invariant feature transform and monogenic binary coding algorithms before being identified using an efficient classifier. In [43], the problem of segmenting the hands involved in gesture production was solved in two different ways: the first used a ribbon-based segmentation algorithm with special color stickers on the fingers, and the second was based on normal skin color segmentation. Wang et al. [44] segmented the hand region by locating the skin tone region in the CbCr plane of the YCbCr color space using the “elliptical boundary model”. Considering the YCbCr color space, Patel [45] noted that the hand region could be cropped from all the images in the dataset by threshold segmentation. The skin color detection algorithm used in [46] facilitated communication between the user and the computer.
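As an illustrative sketch (not the pipeline of any specific cited work), skin-color segmentation in the YCbCr space can be written as follows; the Cr/Cb bounds are commonly quoted starting values and normally need tuning for the camera and lighting at hand.

```python
import cv2
import numpy as np

bgr = cv2.imread("frame.png")                         # placeholder input frame
ycrcb = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb)        # note: OpenCV orders channels Y, Cr, Cb

# Illustrative skin bounds on (Y, Cr, Cb); tune for your own data.
lower = np.array([0, 133, 77], dtype=np.uint8)
upper = np.array([255, 173, 127], dtype=np.uint8)
skin_mask = cv2.inRange(ycrcb, lower, upper)

# Clean the mask with median filtering and an opening, as discussed above.
skin_mask = cv2.medianBlur(skin_mask, 5)
skin_mask = cv2.morphologyEx(skin_mask, cv2.MORPH_OPEN,
                             cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5)))
hand_region = cv2.bitwise_and(bgr, bgr, mask=skin_mask)
```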

Contour information segmentation

Another crucial component of gesture segmentation is detection of the presence of contours and edges. A new technological advance in gesture segmentation has been provided by target segmentation techniques based on contour information. Edge detection operators, template matching, and active contour models are the three primary categories of conventional gesture segmentation techniques based on contour information.

Traditionally, gesture contours have been extracted from photos by using edge detection operators to identify edges in images. In the context of gesture segmentation, template matching—a traditional target localization technique—has also been applied to some extent. To discover the best match, template matching requires placing a preset template on a point in the image, calculating how well the template matches the image at that point, and then iteratively moving through the entire image.

A segmentation of gestures’ grayscale images was performed in [47] using a histogram thresholding segmentation technique, and a morphological filtering technique was created to represent the gesture contours and successfully filter out the background and target noise from the segmented image. The wrist cropping approach in [48] used a width and contour heuristic to estimate the wrist position among the segmented hand patches and then extracted the hand from the estimated wrist position to separate the segmented hand from the rest of the arm. The nodes of the human body were identified in [49] by using background subtraction, contour detection, and edge detection with the Sobel operator. Generating a population gesture feature collection for multiview gesture photos was proposed in [50], along with a new Pareto optimum frontier-based multiview gesture detection method.

Of course, there are other methods of gesture segmentation based on appearance features apart from skin color and contour. The shape and direction of the hand, taken from an input video stream captured under stable lighting and simple background conditions, were proposed in [51] as a basis for recognition of static gesture images. In [52], an adaptive threshold binarization-based homomorphic filter was used in the construction of a system resistant to changing lighting conditions. Additionally, edge-based grayscale segmentation ensured that the method could be applied to users with a range of skin tones and backgrounds. In [53], a hand detection method that incorporated skin filtering and three-frame differencing was proposed.

Other segmentation approaches

The depth sensor-based gesture segmentation method uses a depth camera to gather hand depth structure information and then segments the hand region, as detailed in “Hand gesture recognition based on RGB-D cameras”.

The gesture segmentation methods mentioned above require manually designed features of the target images; such hand-crafted features can achieve gesture segmentation in simple contexts, but it is difficult to design effective gesture features in complex environments, and thus this approach is difficult to apply in natural human-computer interaction systems. The development of deep learning opens up new options for gesture segmentation, since a model can be trained on vast amounts of data to automatically learn target gesture attributes and then complete target gesture detection and segmentation [54]. Paul et al. [55] offered a method for extending the convolutional neural network-based hand segmentation approach from still photos to video images. The proposed technique was more resistant to distortion and occlusion issues, resulting in improved accuracy and delay tradeoffs. Compared to traditional approaches, the deep learning-based gesture segmentation method eliminates the need for manual analysis of gesture data, making segmentation more convenient and a promising direction [56, 57]. However, in its current state of development, this method has flaws: first, some network layers are complicated, slowing gesture segmentation; second, edge detection may yield blurred results, and its accuracy still needs to be improved [56, 57].

Multiple other methods have also been used to acquire hand segmentation images. For example, a single Gaussian model was applied in [58] to describe the hand color in the HSV color space, and to achieve reliable hand tracking, Ding and Su [58] combined optical flow and color cues. Zhao and Jia [59] proposed a hand segmentation technique for depth images based on a random decision forest architecture.

Tracking

In this paper, tracking is considered a component of segmentation because the goal of both tracking and segmentation is to separate the hand from the background. The frame-by-frame analysis of temporally continuous images and the determination of the tracked target during the image change interval are the fundamentals of gesture tracking.

Fukunaga and Hostetler proposed the MeanShift algorithm in 1975; its basic idea is to use the gradient ascent of the probability density to find a local optimum [60]. It is a straightforward algorithm with excellent real-time performance; if the target size changes, however, it is prone to tracking drift [61]. Khan et al. [62] combined the spatial information of moving targets with the traditional MeanShift algorithm to effectively solve the degradation in tracking effectiveness caused by occlusion of the moving target.

The CAMShift algorithm is a modification of the MeanShift algorithm, and its full name is “Continuously Adaptive MeanShift”. Its basic idea is to run the MeanShift operation on every frame of the video and use the result of the previous frame (i.e., the search window's center and size) as the initial search window for the MeanShift operation on the next frame, and so on iteratively [63]. It can adapt to target deformation by changing the window size adaptively, but if the surrounding environment is complex, the tracking window will diverge and the target will be lost [64]. Several studies have used the CAMShift method to track the position of gestures; examples are the applications in [65, 66] to detect and track gestures. The CAMShift method tracks the position of gestures by continuously repositioning and resizing the search window.
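A compact OpenCV sketch of this idea is shown below, assuming a webcam at index 0 and a hand-picked initial window; the hue histogram of that window drives the back-projection that CAMShift follows from frame to frame.

```python
import cv2
import numpy as np

cap = cv2.VideoCapture(0)                              # assumes a webcam at index 0
ok, frame = cap.read()
track_window = (200, 150, 100, 100)                    # initial hand window (x, y, w, h), chosen by hand
x, y, w, h = track_window

hsv_roi = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2HSV)
roi_hist = cv2.calcHist([hsv_roi], [0], None, [180], [0, 180])    # hue histogram of the hand region
cv2.normalize(roi_hist, roi_hist, 0, 255, cv2.NORM_MINMAX)
term = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)   # iteration stopping criteria

while True:
    ok, frame = cap.read()
    if not ok:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    back_proj = cv2.calcBackProject([hsv], [0], roi_hist, [0, 180], 1)
    # CAMShift re-estimates both the window position and its size on each frame.
    rot_rect, track_window = cv2.CamShift(back_proj, track_window, term)
    cv2.polylines(frame, [np.int32(cv2.boxPoints(rot_rect))], True, (0, 255, 0), 2)
    cv2.imshow("hand tracking", frame)
    if cv2.waitKey(30) & 0xFF == 27:                   # press Esc to quit
        break
cap.release()
cv2.destroyAllWindows()
```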

Motion detection is a popular technique for segmenting dynamic targets. Its fundamental principle is to fully localize and extract currently moving targets by combining the visual information of previous moments in a video. The temporal difference approach [53, 67], the background subtraction method [68, 69] and the optical flow method [70, 71] are the three primary categories of conventional gesture segmentation techniques based on motion information.

The temporal difference method’s primary premise is to choose a number of adjacent frames in a video sequence, perform a difference operation, and then extract the moving target by separating it from the backdrop using a predetermined threshold. The pixels of two consecutive frames are subtracted from one another; if the difference is negligible relative to changes in environmental brightness, the object can be assumed to be stationary, whereas a significant change in pixel values anywhere in the image region is assumed to be the result of a moving object. If a moving object was previously stationary or blocked by another object, the temporal difference approach and its upgraded algorithms frequently lose object information. Additionally, because the temporal difference method assumes that the image backdrop is invariant, it is inappropriate if the background is moving. In [72], the issue of matching frame difference pairs between the stored gestures and the query gestures was resolved; the hand trajectory around the center was monitored according to the direction between consecutive frames and the distance from the center of the frame.
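A minimal two-frame differencing loop is sketched below; the video path and the difference threshold of 25 gray levels are placeholder assumptions.

```python
import cv2

cap = cv2.VideoCapture("gesture.mp4")                  # placeholder video path
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(gray, prev_gray)                # pixel-wise difference of adjacent frames
    _, motion_mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)   # illustrative threshold
    motion_mask = cv2.medianBlur(motion_mask, 5)       # suppress isolated noise pixels
    prev_gray = gray
cap.release()
```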

The background subtraction method’s fundamental concept is similar to that of the temporal difference method: the input image is compared to a background model, and the moving target is segmented by looking for changes in statistical information such as histograms or in features such as a grayscale representation. A background model is constructed and the backdrop image is stored in advance. The current frame is then subtracted from the background image, and a pixel is assigned to the foreground target if the difference exceeds a certain threshold T. After thresholding, points with a value of 0 represent the background, and points with a value of 1 represent pixels in motion in the scene. If the entire backdrop image is available, this approach can capture objects effectively with good real-time performance. However, the method is less reliable, and the outcome is significantly affected by changes in a dynamic scene. Simple background subtraction was avoided in [73] due to the complex background and potential dynamic hand movements; instead, certain morphological approaches and two-stage skin color identification were applied to reduce noise. The method suggested in [74] used background subtraction techniques and skin color-based schemes to identify palms in video feeds, exploring the potential for a usable computer vision framework for gesture detection. To overcome the restriction on gesture input caused by gesture background subtraction, hand, and rotation invariance, a new effective recognition elimination method was provided in [75].

An optical flow field is a velocity field that depicts the three-dimensional movements of object points via a two-dimensional map. Optical flow is defined by the pixel-level changes in an image caused by motion over a time interval. The pixel regions that best match the motion model can be found using the optical flow method, and these regions can then be combined into moving objects for object detection. The optical flow method can detect the object independently without additional camera data, but in most cases it is time-consuming, computationally complex, and susceptible to noise; real-time detection can only be accomplished with specialized hardware support, so the method has a significant time overhead and poor real-time performance in practice. In a related study, the hand was separated from the background by using the inverse projection of the color model and motion cues in [76]. Ganokratanaa and Pumrin [77] proposed a dynamic gesture identification system for older individuals that tracked six dynamic movements and categorized their meanings using optical flow and speckle analysis.
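For illustration, a dense Farneback optical flow loop in OpenCV is sketched below; the video path, the Farneback parameters, and the motion-magnitude threshold are placeholder assumptions.

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("gesture.mp4")                  # placeholder video path
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Dense Farneback flow: one 2D motion vector per pixel.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    moving_mask = (magnitude > 1.0).astype(np.uint8) * 255   # illustrative speed threshold
    prev_gray = gray
cap.release()
```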

For visual tracking, the particle filter method is commonly applied. It is a sequential Monte Carlo importance-sampling method for estimating the latent state variables of a dynamic system from a series of observations. Particle filtering is commonly used in conjunction with other techniques for gesture tracking. The combination of particle filtering and the MeanShift algorithm was shown in [78, 79] to accurately identify hands. In [80], the Kalman particle filter was introduced as an improvement to particle filtering in gesture tracking. The Kalman filter was also applied for gesture tracking in [81, 82].

Feature Extraction

Gesture feature extraction is the key step in gesture recognition. The feature extraction part involves processing the input gesture image and then extracting the features that can represent the gesture from the image. Gesture features are the features that can characterize the gesture form and motion state extracted through the analysis of hand movements and postures [83]. The selection and design of gesture features are related not only to the accuracy of gesture recognition but also to the complexity and real-time performance of the system. Gesture features are mainly divided into global and local features.

  1. (1)

    Global features Global features are features that describe the morphology and movement of the entire hand. They include the size, shape, direction, speed, acceleration, rotation angle, etc., of the hand. These features can describe the overall action state of the hand and are applicable to some simple gesture recognition tasks. Global features also usually include color histograms, grayscale histograms, Gabor filters, etc. A color histogram refers to the statistics of the occurrence frequency of various colors in an image, represented in the form of a histogram. A gray histogram is a count of the frequency of occurrence of each gray level in an image, represented in the form of a histogram. A Gabor filter is a filter capable of extracting image texture information; it detects the texture direction and frequency in an image and represents it in the form of a feature vector [84]. Global features are simple to compute and have intuitive representations and good invariance properties. However, such features mostly have pixel point-based feature representations, and hence there are problems such as high feature dimensionality and large computational effort. In addition, such feature descriptions are inapplicable in the case of image blending and occlusion.

  2. (2)

    Local features Local features are features extracted by analyzing local areas such as the fingers, palm, and wrist. They include the curvature of the fingers, the degree of palm protrusion, the rotation angle of the wrist, etc. These features can more accurately characterize the detailed information about the hand and are more applicable to gesture tasks that require high-precision recognition. Commonly used local features include the histogram of oriented gradients (HOG), the local binary pattern (LBP), the scale-invariant feature transform (SIFT), speeded-up robust features (SURF), and features derived from principal component analysis (PCA) and linear discriminant analysis (LDA); a short HOG and LBP extraction sketch follows this list. HOG is a feature descriptor proposed by Navneet Dalal and Bill Triggs in 2005 [85]. It is used as a feature descriptor for target detection and is a statistic computed from the directional information of local image gradients in computer vision and image processing [85, 86]. LBP is a feature extraction method applied in image processing and computer vision. It converts a pixel into a binary code by comparing the pixel's gray value with those of the surrounding pixels, yielding a feature that describes the texture of the image. LBP features have significant advantages such as grayscale invariance and rotation invariance [86, 87]. SIFT is a scale- and rotation-invariant feature extraction technique proposed by Lowe [88] that has been widely used in gesture recognition. It is a feature detection and description algorithm for image processing that can detect and describe feature points in images at different scales and rotation angles. SURF is a descriptor developed from SIFT that improves the speed and robustness of feature detection and description by using techniques such as integral images and fast Hessian matrix computation. SURF has the absolute advantage of being computationally fast compared to SIFT [89]. Sykora et al. [90] applied a support vector machine classifier to classify SIFT and SURF features extracted from 500 test images, with recognition rates of 81.2% and 82.8%, respectively. PCA is a commonly used data dimensionality reduction and feature extraction method that converts high-dimensional data into low-dimensional data while preserving as much information as possible about the original data [83]. LDA is a common classification algorithm and feature extraction method that converts high-dimensional data into low-dimensional data while maximizing the variability between different categories and minimizing the variability within each category [91].
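The sketch below extracts HOG and LBP descriptors from a segmented hand image using scikit-image; the input path, the 64 x 64 resize, and all descriptor parameters are illustrative assumptions.

```python
import cv2
import numpy as np
from skimage.feature import hog, local_binary_pattern

gray = cv2.imread("hand.png", cv2.IMREAD_GRAYSCALE)    # placeholder segmented hand image
gray = cv2.resize(gray, (64, 64))

# HOG: histograms of gradient orientations computed over small cells.
hog_vec = hog(gray, orientations=9, pixels_per_cell=(8, 8),
              cells_per_block=(2, 2), block_norm="L2-Hys")

# LBP: each pixel is coded by comparing it with its 8 neighbors; the histogram
# of codes describes local texture and is robust to monotonic gray-level changes.
lbp = local_binary_pattern(gray, P=8, R=1, method="uniform")
lbp_hist, _ = np.histogram(lbp, bins=np.arange(11), density=True)

feature_vector = np.concatenate([hog_vec, lbp_hist])   # a simple local-feature fusion
```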

In the study of gesture image features, global features have difficulty capturing the information of interest within the hand region because the gesture region occupies only a small proportion of the image, and their descriptive performance is therefore poor. In contrast, local image features are numerous and stable, have lower interfeature correlation than global features, can avoid occlusion of the hand region to some extent, and are robust to image transformations such as illumination, rotation, and viewpoint change. Therefore, to enrich gesture feature information, most researchers tend to fuse local features with global features to achieve higher recognition rates. In this paper, we summarize the gesture features and the accuracy rates mentioned in several recent studies; the results are shown in Table 2.

Table 2 Partial gesture feature extraction techniques

Feature elimination and selection is an important step in machine learning that aims to select the most useful features for a classification or regression task to improve the accuracy and generalizability of the model. Features can be eliminated by the following methods:

  • Variance filtering: Eliminate features with variance below a certain threshold because they have less impact on the classification or regression task.

  • Correlation filtering: Eliminate features that have a low correlation with the target variable.

  • Regularization method: Eliminate features by making the weights of some features converge to zero through L1 or L2 regularization.

Features can be selected by the following methods:

  • Filtering: Evaluate each feature according to its dispersion or relevance, set a threshold or the number of features to be selected, and select the features accordingly.

  • Wrapper: Select a number of features or exclude a number of features each time according to the objective function until the best subset is selected.

  • Embedding: First, use machine learning algorithms and models for training to obtain the weight coefficients of each feature, and then select features according to those coefficients from largest to smallest. This approach is similar to the filtering method, but training is used to determine the utility of features. (A brief scikit-learn sketch of these selection strategies follows this list.)
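The following sketch illustrates variance filtering, relevance-based filtering, and embedded (L1-regularized) selection with scikit-learn; the random feature matrix stands in for real gesture features, and all thresholds are illustrative.

```python
import numpy as np
from sklearn.feature_selection import (VarianceThreshold, SelectKBest,
                                        f_classif, SelectFromModel)
from sklearn.linear_model import LogisticRegression

# X: one row of gesture features per sample (e.g., HOG + LBP), y: gesture labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 60))
y = rng.integers(0, 5, size=200)

X_var = VarianceThreshold(threshold=0.1).fit_transform(X)             # variance filtering
X_rel = SelectKBest(score_func=f_classif, k=20).fit_transform(X, y)   # filter by relevance to y

# Embedded selection: L1 regularization drives uninformative weights toward zero.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
X_emb = SelectFromModel(l1_model).fit_transform(X, y)
```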

In traditional methods, after extracting multiple features such as HOG, Haar, and skin color features, classifiers such as SVM are used to segment or recognize gestures, which involves feature elimination and selection. In deep learning methods, multiple branches can be used to extract different features; for example, one branch can extract the motion trajectory of the gesture, and another can extract its color information. These branches are then combined to obtain the final gesture features, which also involves feature elimination and selection.

Gesture classification

Following the acquisition of the segmented image, important information from the image is extracted via feature extraction, and the gesture type is recognized using these features. Gesture classification is the classification of the extracted spatiotemporal features of gestures and is the last stage of gesture recognition. The main methods of gesture classification are listed and compared in Table 3.

Table 3 Comparison of gesture classification methods

Template matching

The first suggested recognition technique was a very simple template matching technique, usually used for static gesture recognition. The approach involves classifying an input image in accordance with how closely it matches a template (a point, a curve, or a shape). To calculate the matching degree, one can use the coordinate distance, the point set distance, contour edge matching, elastic map matching, etc. Although the classification accuracy is not very high and the types of gestures that can be recognized are limited, the template matching method has the advantages of being very quick in the case of small samples, being adaptable to lighting and background changes, and having a wide range of applications.
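As a toy illustration of template matching by distance (the labels and descriptor values below are invented for demonstration), classification can be as simple as finding the stored template with the smallest Euclidean distance to the input feature vector.

```python
import numpy as np

def classify_by_template(feature, templates):
    # `templates` maps a gesture label to a stored 1-D feature vector (e.g., a
    # contour descriptor); Euclidean distance serves as the matching degree here.
    best_label, best_dist = None, np.inf
    for label, template in templates.items():
        dist = np.linalg.norm(feature - template)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Invented 4-D descriptors for three static gestures, for demonstration only.
templates = {"fist": np.array([0.1, 0.9, 0.2, 0.0]),
             "palm": np.array([0.8, 0.1, 0.7, 0.3]),
             "point": np.array([0.4, 0.4, 0.9, 0.6])}
print(classify_by_template(np.array([0.75, 0.15, 0.65, 0.35]), templates))   # -> palm
```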

Similarity of gestures was determined in [47] by assessing the similarity of sequences of local gesture outlines. Bhame et al. [39] used variable distance features and straightforward logic to calculate the active fingers involved in a gesture, which sped up recognition and made the technique suitable for real-time human-computer interaction applications in contrast to the conventional method of extracting input features and comparing them with all database features. In [74], an iterative polygonal shape approximation technique was proposed and combined with a unique chain coding scheme for shape similarity matching to recognize gestures. RGB and depth descriptors were combined to classify the movements in [106]. Chaudhary and Raheja [107] argued that lighting inconsistencies and backdrop irregularities had an impact on image segmentation, and the scholars provided a method for recognizing gestures based on constant light intensity. The system was tested using the Euclidean distance approach and artificial neural networks, and the method relied on a gesture image database to match the test movements.

Methods based on geometric information

Gestures can also be recognized using geometric information, e.g., by fingertip detection, convex hull detection, the circle drawing method, the cross-hatch method, etc.

Depending on the application, fingertip detection can be divided into single-fingertip detection and multiple-fingertip detection. The “distance from the center of gravity” method can be used to detect a single fingertip. This method involves finding the point in the hand area that is furthest from the center of gravity and then determining whether that point is the fingertip: if its distance to the center of gravity is greater than 1.6 times the average distance from the edge to the center of gravity, the point is taken to be the fingertip; otherwise, it is not.

Wen and Niu [108] suggested a fingertip angle calculation method to detect the fingers of the hand after discovering that the fingertip angle values of the turning points of the curve were much larger than the fingertip angle values of other points. Shin and Kim [109] detected fingertips based on the coordinates of the hand position derived from the skeletal information. Meng and Wang [110] read and preprocessed gesture sample templates, which included filtering, segmenting using the HSV color space threshold, and extracting contours. The contours were then calculated approximately to produce polygons, enabling the detection of the fingertips. The Hu moment and the number of fingertips were finally determined.

A convex hull, which contains all the points of a contour, is a convex polygon created by linking the outermost points. Convex hull computation is frequently performed after contour analysis: a convex hull can be built for each contour of a binary image, and the set of points contained in the hull is returned once the construction is finished. The hull matching the contour can then be drawn using the returned collection of points. A convex hull is convex everywhere or at least flat; a convexity defect is a concavity of the contour relative to the hull in at least one location. In a related study, Wang et al. [111] applied the Douglas-Peucker technique for contour approximation to produce polygons throughout the feature recognition procedure. The type of gesture was then determined by detecting convexity defects on the polygons.
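A short OpenCV 4 sketch of this idea follows; the mask path and the 20-pixel defect-depth threshold are illustrative, and the finger-count heuristic at the end is a rough rule of thumb rather than a method from the cited work.

```python
import cv2

mask = cv2.imread("hand_mask.png", cv2.IMREAD_GRAYSCALE)      # assumed binary hand mask
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
hand = max(contours, key=cv2.contourArea)                     # largest contour = hand

hull_idx = cv2.convexHull(hand, returnPoints=False)           # hull as indices into the contour
defects = cv2.convexityDefects(hand, hull_idx)                # concavities between fingers

deep_defects = 0
if defects is not None:
    for start, end, far, depth in defects[:, 0]:
        if depth / 256.0 > 20:            # depth is in 1/256-pixel units; 20 px is illustrative
            deep_defects += 1
# Rough heuristic: deep convexity defects sit between extended fingers.
print("estimated extended fingers:", deep_defects + 1)
```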

Dynamic time warping

Dynamic time warping (DTW) is a nonlinear time-normalization matching technique that is frequently used in speech recognition, image matching, gesture classification, etc. It overcomes the matching problem of two sequences with inconsistent lengths. To identify the optimal alignment between the two sequences, a dynamic programming approach is used that allows the points of the input sequence and the template sequence to be matched one-to-many or one-to-one [112]. The DTW algorithm performs well at matching and recognizing gestures with various motion speeds if the gesture sample template library is not too large. However, if there is a vast number of gesture sample templates, especially if the gestures are complicated or two-handed gestures are mixed in, the identification speed and stability of the algorithm decrease significantly.
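A minimal DTW implementation is sketched below to make the dynamic programming recurrence concrete; the two toy sequences mimic the same trajectory performed at different speeds.

```python
import numpy as np

def dtw_distance(a, b):
    # Accumulated-cost matrix; cost[i][j] is the best alignment cost of a[:i] and b[:j].
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # A point may align with one or several points of the other sequence.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

slow = [0.0, 0.1, 0.3, 0.6, 0.9, 1.0, 1.0]     # the same motion at two speeds
fast = [0.0, 0.4, 0.9, 1.0]
print(dtw_distance(slow, fast))                # small despite the different lengths
```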

The DTW technique was utilized in [113] to determine the optimal alignment between the query features and the stored database features for the recognition of Indian sign language. Zhi et al. [114] implemented and trained two classifiers for static and dynamic gesture recognition: an N-dimensional DTW classifier and a multiclass support vector machine classifier. The running time was greatly reduced, and the average recognition rate was 95.5%.

Hidden Markov model

The hidden Markov model (HMM) is a statistical model whose underlying Markov chains are named after the Russian mathematician Andrey Markov; it was developed in the 1960s and 1970s [115, 116]. HMM offers a broad range of applications and a great learning capacity and is efficient at modeling time-varying and nonstationary time series. Due to the context-sensitive nature of gesture actions, HMM is better suited for continuous gesture recognition scenarios. However, HMM training and recognition are computationally demanding in continuous signal applications, where the state transitions require a significant amount of probability density computation, and as the number of parameters rises, the pace of model training and target identification declines. The discrete HMM is therefore frequently utilized in generic gesture recognition systems to overcome this problem.
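To make the idea concrete, the sketch below scores a quantized observation sequence against one discrete HMM per gesture class using the forward algorithm and picks the most likely class; all model parameters are invented for illustration rather than trained values.

```python
import numpy as np

def log_likelihood(obs, pi, A, B):
    # Scaled forward algorithm: log P(obs | HMM with start probabilities pi,
    # transition matrix A, and discrete emission matrix B).
    alpha = pi * B[:, obs[0]]
    log_p = 0.0
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        scale = alpha.sum()
        log_p += np.log(scale)        # accumulate the log of each scaling factor
        alpha /= scale                # rescale to avoid numerical underflow
    return log_p + np.log(alpha.sum())

# One small two-state, three-symbol HMM per gesture class (illustrative numbers).
models = {
    "swipe": (np.array([0.9, 0.1]),
              np.array([[0.7, 0.3], [0.2, 0.8]]),
              np.array([[0.8, 0.1, 0.1], [0.1, 0.2, 0.7]])),
    "circle": (np.array([0.5, 0.5]),
               np.array([[0.5, 0.5], [0.5, 0.5]]),
               np.array([[0.3, 0.4, 0.3], [0.3, 0.3, 0.4]])),
}
observation = [0, 0, 1, 2, 2]          # a quantized gesture feature sequence
print(max(models, key=lambda g: log_likelihood(observation, *models[g])))
```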

Numerous studies [37, 48, 117,118,119] used HMM for gesture classification. In [37], HMM was used to compute the log-likelihood of the symbols and to identify the most likely route through a network. In [48], discrete-density and continuous-density HMMs were trained and tested, and it was demonstrated that the continuous HMM performed better than the discrete HMM at classifying data. In [119], the retrieved symbols were classified and identified using HMM, which was chosen after testing alternative techniques such as independent Bayesian classifier combinations.

Machine learning

With the use of machine learning, computers can create models that capture the fundamental patterns of data in big datasets. For example, Tutsoy [120] proposed an artificial intelligence-based multidimensional policy-making algorithm, an advanced predictive model developed under a large number of uncertain factors and time-varying dynamics and aimed at controlling epidemic casualties. Tutsoy [121] also proposed a new high-order, multidimensional, strongly coupled, parametric suspicious-infected-death model. This model uses machine learning algorithms to learn from large datasets, including epidemiological data, demographic information, and environmental factors, to identify complex relationships and make accurate predictions. There are many well-known machine learning classification algorithms, such as support vector machines, neural networks, conditional random fields, and k-nearest neighbor algorithms, which can be applied to the gesture recognition problem.

  1. (1)

    Support vector machines Support vector machines, first described in [122] in 1995, are a class of generalized linear classifiers for binary classification of data via supervised learning [123]; they are mainly governed by the idea of structural risk minimization and Vapnik–Chervonenkis dimension theory. SVM-based gesture recognition is currently an important research direction in gesture recognition technology [124]. SVM is a learning technique that optimizes both empirical risk and model complexity: the training error is the constraint of the optimization problem, and the goal is to minimize the confidence range. Because increasing the dimensionality of the sample space has little effect on computational complexity, the SVM approach is frequently applied to high-dimensional problems. The main challenges currently faced by research on gesture-driven interaction are how to process the acceleration values of gesture signals, build a multiclassification model using the SVM algorithm, improve the accuracy of gesture recognition, and create a gesture-based interaction model. In [125], predicted wrist positions were used to extract HOG image descriptors, and a multiclass SVM trained offline was used to categorize the hand shapes. In [126], a previously trained SVM was used to normalize and classify the collected feature vectors. In [127], a method for RGB video-based gesture identification using SVM was proposed. In [128], real-time video-based signs were retrieved using a skin tone segmentation method, appropriate feature vectors were produced from motion sequences, and the features were then categorized using an SVM. In [129], two modules, namely an SVM model for static gesture classification and an HMM for dynamic single-stroke gesture detection, were used to understand a user command consisting of a set of static and dynamic gestures; the classifier for hand gesture recognition in that system was a linear SVM described in [130]. Local binary patterns and a binary SVM classifier were utilized as feature vectors and classifier in [131] to look for probable hand motions in every frame of a video stream.

  2. (2)

    Neural networks Deep learning algorithms rely on neural networks, which are a subset of machine learning. The name and structure of such algorithms are inspired by the human brain, and they are designed to mimic the way biological neurons communicate with one another [132]. A common structure in artificial neural networks, the multilayer perceptron (MLP), is a feedforward artificial neural network that is typically trained with the backpropagation (BP) algorithm. Artificial neural networks are highly parallel, have a powerful ability to process information, establish a nonlinear mapping from the input space to the output space, have good fault tolerance and memory function, and store memory information in neurons [133]. The MLP, which accepts the feature set of clusters as input, correctly classifies the clusters, and outputs their intensity levels, was the learning algorithm used for classification in [49]. A context-aware gesture-based intelligent system architecture was presented in [134]. Connection weights between the hidden layer and the input layer in radial basis function (RBF) neural networks, which can be of both approximate and exact varieties, are determined in a fixed manner rather than at random [135]. A classification of gestures in photographs using chosen combinatorial characteristics was proposed in [136] based on an upgraded version of the RBF neural network. In the latter, the estimated weight matrix was iteratively updated for better hand gesture image identification using the least-mean-square algorithm, and the centers were automatically determined using the k-means algorithm.

  3. (3)

    Conditional random fields The conditional random field (CRF) model, first proposed in 2001, was quickly applied to a variety of problems because its undirected graph structure can characterize data dependence more accurately than other models. CRF, a probabilistic graph model, was first described by Lafferty et al. [137]. The original construction of the conditional random field was based on HMM in terms of model structure and was influenced by the maximum entropy model (MEM) in terms of model probability representation. In [138], a gesture identification approach was suggested to detect forward and backward movement toward and away from the camera, respectively, using CRF as a classifier and using the center-of-mass motion derived from a parallax map and the fluctuation of its intensity as features.

  4. (4)

    K-nearest neighbors In 1967, Cover and Hart proposed the K-nearest neighbors (KNN) algorithm. Although the idea behind this traditional classification algorithm is straightforward and easy to understand, the algorithm is now very mature and stable [139, 140]. The classification effectiveness of this algorithm, which is one of the fundamental classification algorithms, is good, but due to the numerous square root operations involved in distance computation, its computational efficiency is insufficient compared to that of other classification algorithms when dealing with complex classification scenarios, particularly image classification [141, 142]. Jasim et al. [143] used the KNN algorithm to classify static hand gestures and the longest common subsequence (LCS) algorithm to classify dynamic hand gestures.

  5. (5)

    Naive Bayes classifiers The Naive Bayes classifier [144] is a classification algorithm based on Bayes' theorem. It assumes that the extracted features are uncorrelated and independent of each other, an assumption that simplifies the subsequent operations. Hands matching skin color patches were identified using a Bayesian classifier, and spots with the desired color distribution were recognized and modeled in [145]. In addition to the common machine learning algorithms for gesture recognition described above, many researchers have compared different classifiers (a brief scikit-learn sketch of such a comparison follows this list). HOG characteristics of cropped images were generated in [45] and used to train the classifier; the study discussed the recognition rates of classifiers such as LDA, SVM, and KNN. In [53], the chosen features were inputs to ANN, SVM, and KNN models, which were then fused to create a classifier fusion model with an accuracy of 92.23%. In [146], three classification techniques, the nearest mean classifier (NMC), KNN, and the Naive Bayes classifier, were applied to categorize and compare data. In [147], five classifiers, namely SVM, KNN, Naive Bayes, ANN, and extreme learning machines (ELM), were utilized for a comparative analysis; the reported accuracy rates were 96.17% (ELM), 96.95% (SVM), 96.60% (KNN), and 96.38% (NB). Neural networks and Naive Bayes classification methods based on data mining were applied in [148] for gesture learning and recognition, with the neural network attaining an accuracy of 98.11% and the Naive Bayes classification method an accuracy of 88.84%.
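In the spirit of the classifier comparisons cited above (but using randomly generated stand-in features rather than any published dataset), the sketch below trains and scores an SVM, KNN, Naive Bayes, and a small MLP with scikit-learn.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier

# X: gesture feature vectors (e.g., HOG), y: gesture labels; random data is a stand-in.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 36))
y = rng.integers(0, 4, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

classifiers = {
    "SVM": SVC(kernel="rbf", C=1.0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": GaussianNB(),
    "MLP": MLPClassifier(hidden_layer_sizes=(64,), max_iter=500),
}
for name, clf in classifiers.items():
    clf.fit(X_tr, y_tr)
    print(name, "test accuracy:", clf.score(X_te, y_te))
```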

Deep learning

A recent area of research in machine learning is deep learning, the process of discovering the intrinsic patterns and deep representations of sample data; the knowledge gained from such learning can be very useful in understanding the meaning of data such as text, photos, and sounds. Its ultimate objective is to provide machines with analytical learning abilities similar to those of humans, enabling machines to recognize data types including text, images, and sounds. Deep learning does not require manual feature engineering, in contrast to conventional learning algorithms, making it possible to utilize the rapidly expanding amounts of data and computational power that are currently available [149]. Convolutional neural networks and recurrent neural networks are two popular deep learning architectures.

  1. (1)

    Recurrent neural network An early form of recurrent neural network (RNN), the Hopfield network, was proposed by John Hopfield in 1982 [150]. Each moment t in a recurrent neural network is processed sequentially and is closely related to the moments before it. The RNN's potent temporal modeling ability introduces a novel strategy for gesture recognition. However, if there are more than 10 time steps between the relevant input and the target event, it is challenging to train a simple RNN structure [151]. Neverova et al. [152] were the first to use recurrent neural networks for gesture recognition; their proposal was a multimodal gesture recognition system combining speech, skeleton pose, and depth. Before a long time-dependent model was built by an RNN for data fusion and final classification, each modality was first processed independently on a short time series, and its features were manually extracted or obtained by learning. To examine the benefits of RNN under various training methods and to suggest an efficient learning process based on suitable adjustments to the real-time recurrent learning algorithm, RNN was also utilized to recognize gestures in [48] and compared with HMM. For skeleton-based action recognition, Geng et al. [153] proposed a sequence-to-sequence hierarchical RNN structure. Shin and Kim [154] separated the features into various components and fed each hand's input into a GRU-RNN, which enhanced performance and lowered the number of parameters needed for the neural network. Zhang et al. [155] proposed a variant of the long short-term memory (LSTM) model for dynamic gesture recognition by combining ResC3D and ConvLSTM.

  2. (2)

    Convolutional neural network A convolutional neural network (CNN) [156] is a feedforward neural network that has developed rapidly in the field of image analysis and processing. Compared with traditional image processing algorithms, a CNN avoids much of the preprocessing and manual feature engineering. However, much gesture data consists of video rather than single images; to process video data, the 3D CNN [157] was introduced, initially for the task of behavior recognition in surveillance videos. Two-dimensional convolutional neural networks are mostly used to process static gestures or dynamic gesture sequences on a frame-by-frame basis. John et al. [158] used a long-term recurrent neural network to classify gesture video sequences. All 24 motions from Thomas Moeslund’s gesture recognition database were used to apply deep learning to the gesture identification problem in [159], demonstrating that deep neural networks are capable of learning complicated gesture classification tasks with a low error rate. The approach in [160] combined a skeletonization algorithm with a CNN, which lessened the impact of the capture angle and surroundings on recognition effectiveness and increased the precision of gesture recognition in complicated contexts. In [161], a vision- and CNN-based system for converting Arabic sign language letters into Arabic speech was suggested. In [162], a comparison of various gesture recognition techniques showed that the CNN outperformed other classification systems. Noreen et al. [163] proposed a multiparallel-stream two-dimensional CNN model to recognize hand gestures. Several three-dimensional convolutional neural network (3D-CNN) models have been proposed for gesture recognition. To address the lack of large labeled gesture datasets, an efficient deep convolutional neural network method called 3D-CNN was proposed in [164]. A 3D-CNN model was suggested by Molchanov et al. [165] to identify driving gestures from depth and intensity data and to combine information from various spatial scales for the final prediction. Molchanov et al. [166] later enhanced the 3D-CNN model with a recurrent mechanism for dynamic gesture detection and classification; the network comprised a 3D-CNN structure for extracting spatiotemporal features and a recurrent layer for global temporal modeling. Li et al. [167] enhanced the 3D-CNN model of Tran et al. [168] for large-scale gesture recognition using depth and RGB videos. Similarly, Camgoz et al. [169] developed an end-to-end 3D-CNN model for large-scale gesture recognition. In recent years, lightweight convolutional neural networks have been developed by numerous researchers; a lightweight network is a smaller model that performs on par with a heavier model while remaining hardware-friendly. Baumgartl et al. [170] proposed a lightweight, robust, and fast CNN for hand gesture recognition by image classification; the MobileNetV2- and CNN-based gesture recognition in [170] reached an accuracy of 99.96%. In [171], a hybrid structure combining a lightweight VGG16 model and a random forest was presented for vision-based gesture recognition.
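
As a hedged illustration of the two families discussed above (not a reproduction of any cited architecture), the following PyTorch sketch applies a small 2D CNN to each frame of a dynamic gesture clip and feeds the per-frame features to a GRU for temporal modeling; all tensor sizes, layer widths, and the class count are illustrative assumptions.

```python
# Illustrative sketch: per-frame 2D CNN encoder + GRU for dynamic gesture clips.
import torch
import torch.nn as nn

class CNNGRUGestureNet(nn.Module):
    def __init__(self, num_classes=10, hidden=64):
        super().__init__()
        self.frame_encoder = nn.Sequential(            # 2D CNN applied to each frame
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.temporal = nn.GRU(32, hidden, batch_first=True)  # recurrent temporal model
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, clips):                          # clips: (batch, time, 1, H, W)
        b, t = clips.shape[:2]
        feats = self.frame_encoder(clips.flatten(0, 1))    # (b*t, 32, 1, 1)
        feats = feats.flatten(1).view(b, t, -1)            # (b, t, 32)
        _, last_hidden = self.temporal(feats)              # (1, b, hidden)
        return self.head(last_hidden[-1])                  # (b, num_classes)

model = CNNGRUGestureNet()
dummy_clip = torch.randn(4, 16, 1, 64, 64)   # 4 clips of 16 grayscale 64x64 frames
print(model(dummy_clip).shape)               # torch.Size([4, 10])
```

In the 3D-CNN and ConvLSTM works cited above, the separate frame encoder and recurrent layer are replaced or augmented by spatiotemporal convolutions, but the overall pipeline—frame-level feature extraction followed by temporal aggregation—remains the same.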

Experimental evaluation

The process of gesture recognition was described in “Hand gesture recognition process”. This section presents several evaluation metrics for gesture recognition and segmentation.

Accuracy

Accuracy is the ratio of the number of samples correctly classified by a classifier to the total number of samples. In gesture recognition and segmentation, the accuracy rate can be used to measure the overall performance of the classifier. The calculation formula is

$$\begin{aligned} \textrm{Accuracy}=\frac{\textrm{TP}+\textrm{TN}}{\textrm{TP}+\textrm{TN}+\textrm{FP}+\textrm{FN}}, \end{aligned}$$

where TP denotes true-positive cases, TN denotes true-negative cases, FP denotes false-positive cases, and FN denotes false-negative cases.

Precision

The precision rate is the percentage of samples identified by the classifier as belonging to the positive class that indeed belong to the positive class. In gesture recognition and segmentation, the precision rate can be used to measure how reliable the classifier’s positive predictions are. The calculation formula is

$$\begin{aligned} \textrm{Precision}=\frac{\textrm{TP}}{\textrm{TP}+\textrm{FP}}. \end{aligned}$$

Recall

The recall rate is the proportion of samples that indeed belong to positive classes and that the classifier correctly identifies as belonging to positive classes. In gesture recognition and segmentation, recall can be used to measure the completeness of the classifier. The formula is as follows:

$$\begin{aligned} \textrm{Recall}=\frac{\textrm{TP}}{\textrm{TP}+\textrm{FN}}. \end{aligned}$$

F1 score

The F1 score is the harmonic mean of precision and recall, and hence assesses both the exactness and the completeness of the classifier. In gesture recognition and segmentation, the F1 score can be used as an evaluation metric to help select the optimal classifier. The calculation formula is

$$\begin{aligned} \textrm{F1} = 2 \times \frac{\textrm{Precision} \times \textrm{Recall}}{\textrm{Precision} + \textrm{Recall}}. \end{aligned}$$
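
The following Python sketch computes the four metrics above from predicted and true labels for a single positive class; the function name and the binary setting are illustrative assumptions.

```python
# Illustrative helper computing Accuracy, Precision, Recall, and F1 from labels,
# treating one gesture class as "positive" and everything else as "negative".
def classification_metrics(y_true, y_pred, positive=1):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

print(classification_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
# accuracy 0.6; precision, recall, and F1 are all 2/3
```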

Intersection over union (IoU)

IoU is the ratio of the area of overlap between the predicted region and the ground-truth region to the area of their union. In gesture segmentation, IoU can be used to measure the segmentation effectiveness of the model. The calculation formula is

$$\begin{aligned} \textrm{IoU}=\frac{\textrm{Intersection}}{\textrm{Union}}, \end{aligned}$$

where Intersection denotes the overlap between the predicted region and the ground-truth region, and Union denotes their combined area. The larger the IoU is, the closer the prediction is to the ground truth.
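
As a small illustration, the following Python function computes the IoU of two axis-aligned bounding boxes; the (x1, y1, x2, y2) convention and the function name are assumptions, and the same ratio applies to segmentation masks by counting overlapping and combined pixels.

```python
# Illustrative IoU for two axis-aligned boxes given as (x1, y1, x2, y2).
def bounding_box_iou(box_a, box_b):
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)           # overlap area (0 if disjoint)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

print(bounding_box_iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```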

Hand gesture recognition based on RGB-D cameras

The use of three widely used depth cameras, namely, Kinect, Leap Motion, and RealSense, was introduced in “Introduction”. The data structure of RGB-D images produced by depth cameras is more complex than that of earlier 2D images, opening new possibilities for gesture recognition studies. Using depth data, a simple thresholding algorithm can accurately separate the hand region in the depth map, reducing the gesture detection problem to the problem of recognizing the 3D shape of the hand.

Given the popularity of depth sensors, scholars have conducted extensive research on gesture segmentation based on depth information. In [172], Jiang used the Kinect sensor to gather depth information, established a threshold for each frame according to each pixel’s depth value, extracted the largest region as the foreground, and then removed the remaining patches with smaller areas. Kane and Khanna [173] described an acquisition module that used depth thresholding and velocity tracking to detect pen-lifting and pen-falling movements. The hand needs to be in the foreground of the camera for depth thresholding-based hand segmentation to work well. Zhao and Jia [59] presented an enhanced hand segmentation approach based on the random decision forest framework for depth sensor-acquired images by manually integrating the essentials of depth thresholding-based segmentation methods. To ensure the accuracy of hand segmentation, the method generated new depth features from the centroid of the hand structure, improved the generalizability of earlier depth features, and preserved the depth invariance of hand pixels as much as possible.
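
In the spirit of the depth-thresholding approach described above (a hedged sketch, not the exact implementation of [172]), the following Python/OpenCV code keeps pixels within an assumed depth range and retains only the largest connected component as the hand region; the depth cutoffs and image size are illustrative.

```python
# Hedged sketch of depth-threshold hand segmentation: threshold by depth, then
# keep only the largest connected blob as the hand region.
import numpy as np
import cv2

def segment_hand_by_depth(depth_mm, near=300, far=800):
    """depth_mm: HxW depth map in millimetres; near/far are illustrative cutoffs."""
    mask = ((depth_mm > near) & (depth_mm < far)).astype(np.uint8)
    num, labels, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)
    if num <= 1:                                     # only background present
        return np.zeros_like(mask)
    # index 0 is background; pick the largest remaining component as the hand
    largest = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])
    return (labels == largest).astype(np.uint8)

depth = np.random.randint(200, 2000, size=(240, 320)).astype(np.uint16)  # fake frame
hand_mask = segment_hand_by_depth(depth)
```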

Using a depth camera, more information about appearance features can be obtained. In [174], a hand contour model incorporating the Kinect sensor was suggested to make gesture matching easier and less computationally complex. The phonetic structure of Brazilian sign language was investigated in [175], relying on RGB-D sensors to collect intensity, position, and depth data. Almeida et al. [175] extracted seven vision-based features from RGB-D images and studied the relationship between the extracted features and the structural elements of Brazilian sign language based on hand shape, motion, and location. In [176], a depth map of hand gestures recorded by the Kinect sensor was used to separate the 3D hand shape from a cluttered background and extract 3D shape features, and a 3D shape context description approach was proposed for 3D gesture representation. A TOF depth camera was used in [177] to gather depth data, determine the wrist cut edge, and capture palm images. In [178], the coordinates of 21 key points of the human hand were recorded using Leap Motion, and the motion images were captured using an RGB camera.

One of the more popular state space-based techniques for matching time-varying data is the hidden Markov model (HMM), which involves two main stages, namely, training and classification. The Baum–Welch algorithm is a fundamental algorithm used to solve the training problem, whereas the Viterbi algorithm is a fundamental algorithm used to solve the classification problem [179]. Hoque et al. [180] proposed a real-time Kinect-based gesture recognition system that could manipulate desktop objects by identifying the hand’s 3D position using Kinect’s depth sensor; these position points were then analyzed to identify predefined gestures. The HMM was trained using the Baum–Welch algorithm, achieving an accuracy rate of 89%. A dynamic hand gesture detection system based on an RGB-D camera was proposed by Simao et al. [181]. Hand segmentation in color images was performed using a broad illumination-invariant skin tone model, and hand detection in depth images was performed using a chamfer distance matching-based technique. Hand movements were modeled and classified using an HMM with a left-right banded state topology.
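
The following hedged sketch (using the hmmlearn library, not any cited system) trains one Gaussian HMM per gesture class with Baum–Welch and classifies a new feature sequence by the highest log-likelihood; the synthetic per-frame features, class names, and model sizes are assumptions, and a Viterbi-based score (via decode) is a common alternative to the forward likelihood used here.

```python
# Hedged sketch of HMM-based dynamic gesture classification with hmmlearn:
# one GaussianHMM per gesture class, trained with Baum-Welch (fit), and a new
# sequence assigned to the class whose model gives the highest log-likelihood.
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)

def make_sequences(offset, n_seq=20, length=30, dim=6):
    """Synthetic per-frame feature sequences for one gesture class (illustrative)."""
    return [rng.normal(loc=offset, size=(length, dim)) for _ in range(n_seq)]

train_data = {"wave": make_sequences(0.0), "fist": make_sequences(2.0)}
models = {}
for label, seqs in train_data.items():
    X = np.concatenate(seqs)                 # stacked frames of all sequences
    lengths = [len(s) for s in seqs]         # per-sequence lengths
    m = hmm.GaussianHMM(n_components=4, covariance_type="diag", n_iter=50)
    m.fit(X, lengths)                        # Baum-Welch training
    models[label] = m

test_seq = rng.normal(loc=2.0, size=(30, 6))
pred = max(models, key=lambda k: models[k].score(test_seq))  # forward log-likelihood
print("predicted gesture:", pred)
```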

Table 4 shows the results of the studies mentioned above.

Table 4 Results of the studies

Hand gesture recognition applications

Gesture recognition has a wide range of applications, such as healthcare, safe driving, sign language awareness, virtual reality, and device control. This section mainly focuses on human-robot interaction using vision-based hand gestures captured by monocular cameras and RGB-D cameras. The main application areas for gesture recognition technologies are listed below.

  • Healthcare: Emergency rooms and operating rooms can be chaotic, with a significant amount of noise from individuals and equipment. In such an environment, voice commands are not as effective as hand gestures. Touchscreens are also not an option because of the strict boundaries between sterile and nonsterile domains. However, accessing information and images during surgery or other procedures is possible with gesture recognition technology, as demonstrated by Microsoft. GestSure, a gesture control technology that can be used to control medical devices, allows physicians to examine MRI, CT and other images with simple gestures without scrubbing. This touch-free interaction reduces the number of times doctors and nurses touch patients, reducing the risk of cross-contamination.

  • Safe driving: Advanced driver assistance systems that incorporate gesture recognition can somewhat increase driving safety. Through an advanced driver assistance system, drivers can modify many parameters inside the automobile using gestures, allowing them to focus more on the road and perhaps reducing traffic accidents. The BMW 7 Series has an integrated hand gesture recognition system that recognizes five gestures to control music, incoming calls, etc. Reducing interaction with the touchscreen makes the driving experience safer and more convenient.

  • Sign language awareness: The primary means of communication for hearing-impaired individuals is sign language; however, understanding sign language is difficult for those who have not received formal instruction. Sign language recognition technology can substantially enhance communication between hearing-impaired individuals and others. The Italian startup Limix combines IoT and dynamic gesture recognition technology to record sign language, translate it to text, and then play it back on a smartphone via a voice synthesizer.

  • Virtual reality: Gesture recognition allows users to interact with and control virtual reality scenes more naturally, enhancing users’ immersion and experience. In 2016, Leap Motion demonstrated updated gesture recognition software that allowed users to track gestures in virtual reality in addition to controlling computers. ManoMotion’s hand-tracking application recognizes 3D gestures through a smartphone camera (on Android and iOS) and can be applied to AR and VR environments. Use cases for this technology include gaming, IoT devices, consumer electronics, and robotics.

  • Device control: Intelligent robots can also be controlled by gestures. With the advancement of artificial intelligence, home robots or smart home equipment will progressively appear in millions of households, and consumers will feel more at ease using gesture control as opposed to traditional button or touch screen input. A company called uSens develops hardware and software that enables Smart TVs to recognize finger movements and gestures. Gestoo’s artificial intelligence platform uses gesture recognition technology to enable touchless control of lighting and audio systems. With Gestoo, gestures can be created and assigned from a smartphone or another device, and a single gesture can be used to activate multiple commands.

Robots are becoming more prevalent in our daily lives as robotics develops quickly. To be fully integrated into human society, robots must learn how to communicate with people. Researchers choose gesture recognition over other emerging human-robot interface technologies because of its straightforward and natural interaction qualities, rich expressive capabilities, and potential for a wide range of applications. Tables 5 and 6 present the applications of gesture recognition in nonmobile and moving robots, respectively.

Table 5 Gesture recognition applied to nonmobile robots
Table 6 Gesture recognition applied to moving robots

Problems, outlook, and conclusion

Problems

Gesture recognition technology has developed rapidly in recent years. However, due to interference from external environmental factors and the limitations of gestures themselves, various disturbances are easily introduced into the system; thus, gesture recognition still faces considerable difficulties. To facilitate the improvement of gesture recognition-driven human-computer interaction, this paper summarizes the following problems.

Data gathering

Most existing studies of hand detection assume a simple background for gestures during data acquisition; indeed, capturing a hand is challenging because it is a relatively small object with many complex articulations [29]. However, in practical robotics applications, workers typically operate in complex environments; thus, gesture recognition methods need to be improved and applied to real-world scenarios. For instance, because skin color segmentation cannot reliably distinguish the hand from other skin-colored regions, skeleton-based segmentation can be used to exclude regions other than the hand. One of the main challenges in identifying hand motion is the complex posture of the hand, which is frequently affected by the occlusion of the fingers [203, 204].

Training data environment

Scholars have compared gesture recognition methods by training on various datasets and have obtained noticeably different results, demonstrating the importance of datasets in gesture training. The accuracy of various methods on different datasets was reviewed by Rawat et al. [205]. Given the data-driven nature of deep learning methods such as convolutional neural networks, future research can concentrate on refining the training datasets, making them as rich and varied as possible, to enhance the ability of gesture recognition techniques to recognize gestures in any situation.

The lack of high-quality datasets captured from various angles makes it difficult to create a highly realistic model that takes into account the actual contours of the hand [206, 207]. Due to its complex structure and varied dimensions, the hand has many degrees of freedom, which increases occlusion even in noncomplex environments [208, 209].

It has also been noted that most databases used in gesture recognition research originate from different nations; in the case of sign language, for instance, different nations have different sign languages and gestures. Therefore, more focus on uncontrolled situations is required to improve vision-based gesture recognition systems for real-world applications. To assess transfer learning ability, future experiments may try to transfer knowledge from the gestural symbol system proposed by Zengeler et al. [210] to other gestural languages.

Identification speed

We also need to consider real-time performance: some networks have high recognition rates but poor real-time performance, while others have the opposite problem. In real-world use, for instance in surgical robots, this poses serious issues. Since gesture recognition systems require both high recognition accuracy and good real-time performance, we must improve real-time performance and reduce processing time without sacrificing recognition accuracy. To further reduce computational cost and boost recognition effectiveness for application-level, real-time gesture identification, Liu et al. [211] plan to incorporate image segmentation methods in future work.

Segmentation in complex background

Currently, most existing studies of gesture recognition assume that the background of gestures is simple; however, the background in real applications is complex. For example, during human–machine interaction between a machine and a worker, the gestures captured by a sensor device are subject to many complex influences, including changes in lighting and in the background environment, which increase the difficulty of gesture detection and reduce the accuracy of gesture recognition. Many researchers have therefore sought to enhance the robustness of gesture recognition in complex backgrounds and to improve the interaction capability of gesture recognition in complex scenes.

Sheenu et al. [212] proposed a new method for gesture recognition in images with complex backgrounds based on histograms of oriented gradients and sequential minimal optimization, which achieved an overall recognition rate of 93.12% on complex backgrounds. Chen et al. [213] suggested a gesture recognition method based on an improved YOLOv5 approach that reduced various types of interference in gesture images with complex backgrounds and improved the robustness of the network to complex backgrounds. Zhang et al. [214] proposed a two-stage gesture recognition method; in the first stage, a convolutional pose machine was used to localize a hand’s key points, which could effectively localize them even in cases of complex backgrounds. Vishwakarma [215] researched and developed a method for effective detection and classification of hand gestures in cases of complex backgrounds. Pabendon et al. [216] suggested a gesture recognition method based on spatiotemporal domain pattern analysis, which could significantly reduce the irregular noise affecting gesture recognition in cases of complex backgrounds. Elsayed et al. [217] described a robust gesture segmentation method based on adaptive background subtraction with skin color thresholding, which aimed to automatically segment gestures from a given video under different lighting conditions and complex backgrounds. Qi et al. [218] suggested an improved atrous spatial pyramid pooling module to improve the accuracy of gesture feature representation in images. Zhou et al. [219] proposed a two-stage gesture recognition system to solve the problem of recognizing gestures in cases of complex backgrounds.
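
As a hedged sketch in the spirit of the background-subtraction-plus-skin-threshold idea mentioned above (not the exact method of [217]), the following Python/OpenCV code combines an adaptive MOG2 background model with an HSV skin mask; the HSV bounds are rough, illustrative values that normally require tuning per camera and lighting.

```python
# Hedged sketch: adaptive background subtraction combined with skin-color
# thresholding for gesture segmentation in cluttered scenes.
import numpy as np
import cv2

bg_subtractor = cv2.createBackgroundSubtractorMOG2(history=200, detectShadows=False)

def segment_gesture(frame_bgr):
    motion_mask = bg_subtractor.apply(frame_bgr)        # adaptive foreground mask
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    lower = np.array([0, 40, 60], dtype=np.uint8)       # rough skin-tone lower bound
    upper = np.array([25, 180, 255], dtype=np.uint8)    # rough skin-tone upper bound
    skin_mask = cv2.inRange(hsv, lower, upper)
    combined = cv2.bitwise_and(motion_mask, skin_mask)
    kernel = np.ones((5, 5), np.uint8)
    return cv2.morphologyEx(combined, cv2.MORPH_OPEN, kernel)  # remove speckle noise

frame = np.random.randint(0, 256, size=(240, 320, 3), dtype=np.uint8)  # fake frame
mask = segment_gesture(frame)
```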

Distance and hand anatomy

The distance between the cameras and the person is an important factor in hand gesture segmentation. If the cameras are too far away, hand gestures may not be captured accurately, resulting in incorrect segmentation. However, if the cameras are too close, there may be occlusion as hands move in and out of the frame, leading to incomplete segmentation.

Additionally, the hand anatomy should be considered. Different types of hand gestures involve different parts of the hand, and some gestures may be more difficult to segment accurately than others. For example, gestures that involve the fingers being close together may be more challenging to distinguish from each other.

To improve hand gesture segmentation accuracy, researchers may use various techniques, such as depth sensing (e.g., RealSense and Kinect), machine learning algorithms, and hand-tracking algorithms. These methods can help identify the different parts of the hand and track their movement accurately even in complex gesture sequences.

Future outlook

With the rise of artificial intelligence, deep learning is undoubtedly a powerful accelerator for gesture recognition, and gesture recognition systems will become more accurate and stable. Future gesture recognition systems will also be more diversified and applicable to more fields, such as medical care, education, and entertainment, bringing more convenience and innovation to people. In the future, gesture recognition technology will continue to develop in the following directions:

  • More intelligent: Gesture recognition will become more intelligent with the continued development of deep learning and artificial intelligence technology. Training a model will allow it to understand more complex gestures while reducing the user requirements, making gesture recognition more natural and intelligent.

  • More accurate: As computer vision and sensor technology continue to improve, gesture recognition will become more accurate. For example, higher-resolution cameras and more sensitive sensors can capture more subtle hand movements, improving the accuracy of gesture recognition.

  • More capable of real-time performance: Future gesture recognition technology will operate closer to real time, and be capable of processing large numbers of gestures and translating them into commands or actions. This will enable gesture recognition’s wider use in virtual reality, gaming, medical, and other fields.

  • More reliable: As the applications of gesture recognition technology expand, its reliability becomes increasingly important. Future gesture recognition technologies will require more rigorous testing and validation to ensure their reliable operation in a variety of environments.

  • More personalized: Future gesture recognition technologies will be more personalized and able to adapt to different users’ gesture habits and preferences. For example, users may be able to customize specific gestures to accomplish a particular operation or function.

Conclusions

This paper focuses on the processing steps and techniques for gesture recognition. The gesture recognition process is divided into four steps: data acquisition, gesture detection and segmentation, feature extraction, and gesture classification. The focus of this paper is on RGB-D camera-based gesture recognition techniques, but it also covers several related studies in the field that use monocular and depth cameras. Contrasting these two approaches, we observe that the depth camera-based gesture recognition method is more practical and effective and can be used for both dynamic and static gesture recognition. The research on implementing gesture recognition in robotic scenarios is then reviewed and analyzed, and algorithms for gesture recognition in human-robot interaction are discussed. Finally, the problems faced by vision-based gesture recognition methods over the years, the progress that can be made, and possible future directions are reviewed.