Iran Journal of Computer Science, Volume 1, Issue 1, pp 47–64

Gesture-based human–machine interfaces: a novel approach for robust hand and face tracking

  • Farhad Dadgostar
  • Abdolhossein Sarrafzadeh
Original Article


Nonverbal communication forms a substantial portion of human–human interaction. In recent years, there has been increasing interest in developing gesture-based user interfaces for better human–machine interaction. Hand and face tracking is a central issue in the development of real-time gesture recognition systems. In this article, a new approach for boundary detection in blob tracking based on the mean-shift algorithm is proposed. Our approach is based on continuous sampling of the boundaries of the kernel and changing the size of the kernel using our novel fuzzy-based algorithm. We compare our approach to the kernel density-based approach, known as the CAM-shift algorithm, under a range of noise levels and conditions. The results show that the proposed approach is more stable against white noise and also provides correct boundary detection for arbitrary hand postures, which is not achievable with the CAM-shift algorithm. This algorithm provides the required framework for vision-based real-time gesture recognition and hand and face tracking. It can be applied in scientific and commercial extensions of either vision-based or hybrid gesture recognition systems.


Keywords: Boundary detection · Hand and face tracking · Human–human interaction · Gesture-based user interfaces

1 Introduction

Advancements in computer vision, partly due to the availability of more powerful computer hardware in recent years, have made it possible for the human–computer interaction research community to focus on the development of gesture-based user interfaces. Alkemade et al. [1] have reported an experiment on the use of a simple VR and hand-tracking interface prototype which shows performance similar to that of a traditional mouse and screen interface. Various other applications of gesture-based interfaces have been reported in recent literature [e.g., 2, 3]. These promising results show that such interfaces are potentially the interfaces of the future, although there are still shortcomings that need to be addressed [4]. Vision-based gesture recognition systems require identifying and tracking the boundaries of the skin segments, which in the case of this study were specifically the hand and the face. Color information provides an efficient feature for this purpose because of its robustness to partial occlusion, geometry invariance, and computational efficiency. The output of a skin detection algorithm, on the other hand, is a dispersed set of detected skin pixels. The quality of the detection depends on several parameters, including background noise and lighting conditions, which in some applications may not be controllable. Considering that providing ideal conditions for a real-world application is impractical, improving the quality of the output using enhancement techniques is a desired solution in skin detection systems. These techniques may themselves require considerable amounts of computation. Moreover, as we showed in [5], even simple morphological operations (e.g., erode–dilate) will not fully eliminate the sparse falsely detected pixels.
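As a concrete illustration of the enhancement step mentioned above, an erode-then-dilate pass (morphological opening) removes isolated false detections from a binary skin mask. The sketch below is a pure-Python illustration with an assumed 3×3 structuring element; a real system would use an optimized image-processing library.

```python
# Sketch of a 3x3 erode-then-dilate (morphological opening) pass over a
# binary skin mask, of the kind used to suppress isolated false detections.

def erode(mask):
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # A pixel survives only if its whole 3x3 neighbourhood is set.
            out[y][x] = int(all(mask[y + dy][x + dx]
                                for dy in (-1, 0, 1) for dx in (-1, 0, 1)))
    return out

def dilate(mask):
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                # A set pixel turns on its whole 3x3 neighbourhood.
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w:
                            out[ny][nx] = 1
    return out

def opening(mask):
    return dilate(erode(mask))
```

As the text notes, this removes only fully isolated pixels; small clusters of false detections survive the opening, which motivates a tracker that tolerates a dispersed input.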

Using a tracking algorithm that does not require a rigid segment and can handle ambiguous boundaries is another approach to tracking a dispersed set of points. In this article, we introduce the mean-shift algorithm and its application in hand and face tracking by tracking the dispersed set of skin pixels produced by the skin detection algorithm. Among the different tracking algorithms, the mean-shift tracking algorithm has recently become popular due to its simplicity and robustness. The mean-shift algorithm is a nonparametric statistical method for seeking the nearest mode of a point-sample distribution.
Fig. 1

Iterations to find the highest-density kernel using the mean-shift algorithm. The initial position of the kernel is 1, which has a small overlap with the face blob; 2, 3, 4, and 5 are the successive positions of the kernel that match the centre of gravity of the kernel to its geometrical centre

The mean-shift algorithm and its applications in pattern recognition were originally introduced by Fukunaga and Hostetler [7] in 1975 for data clustering. They referred to their algorithm as a “valley-seeking procedure”. The first reported application of the mean-shift algorithm in image processing is probably the one introduced by Cheng [8] in 1995 for mode seeking and clustering. Following some successful applications including Cheng’s work, the mean-shift algorithm attracted the interest of the image processing community in different application areas including feature tracking.

1.1 The mean-shift algorithm

The basic idea of object tracking using the mean-shift algorithm is finding the highest density of features within the image using a search window. The search window has been called the "kernel" in the literature, and we will use the same term throughout this paper. The modeling of the features and the feature space is an implementation issue and may vary with the target application. For object tracking in particular, the density can be the value of a measurement function evaluated on the features within the kernel; this value is also called the density of the kernel. The initial placement of the kernel is based on a placement strategy.

The mean-shift algorithm is iterative. In each iteration, the kernel moves to the position where its centre of gravity matches its geometrical centre. The iterations continue while the kernel value is increasing. Figure 1 presents the iterations that took place to find the kernel with the highest density. The detailed description of the mean-shift algorithm for blob tracking is given in Sect. 1.2.
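The iteration just described can be sketched in a few lines. The version below is an illustrative pure-Python sketch operating on a binary skin mask; the fixed kernel size, the rounding, and the convergence test are simplifying assumptions, not the paper's exact procedure.

```python
# Minimal mean-shift iteration on a binary mask: the kernel (a kw x kh window)
# repeatedly moves so that its geometric centre coincides with the centre of
# gravity of the detected pixels it currently contains.

def mean_shift(mask, cx, cy, kw, kh, max_iter=20):
    """Return the converged kernel centre (cx, cy)."""
    H, W = len(mask), len(mask[0])
    for _ in range(max_iter):
        sx = sy = m00 = 0
        for y in range(max(0, cy - kh // 2), min(H, cy + kh // 2 + 1)):
            for x in range(max(0, cx - kw // 2), min(W, cx + kw // 2 + 1)):
                if mask[y][x]:
                    m00 += 1
                    sx += x
                    sy += y
        if m00 == 0:
            break  # empty kernel: nothing to track
        nx, ny = round(sx / m00), round(sy / m00)
        if (nx, ny) == (cx, cy):
            break  # converged: centre of gravity == geometric centre
        cx, cy = nx, ny
    return cx, cy
```

Starting the kernel with only a small overlap with the blob, as in Fig. 1, the window climbs toward the blob's centre in a handful of iterations.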

1.2 Automatic resizing of the kernel

Choosing the proper size and the initial placement of the kernel are other problems which are not yet addressed in the literature and have remained as open research questions. Random placement and max-fit placement of the kernel are two possible strategies for placement of the kernel which may be superior to one another in terms of the required amount of computation and specifications of the application.

Improper initial placement of the kernel may result in finding a background object instead of the object of interest. The kernel size, on the other hand, is a crucial parameter for the performance of the mean-shift algorithm. If the kernel is too large, it will contain too many background pixels. A kernel that is too small not only determines an incorrect position for the object of interest but may also "roam" around on the object in a video sequence, leading to poor object localization (Fig. 2). Changing the size of the kernel may also be necessary to cope with changes in the shape of the tracked object (e.g., rotation, or moving toward or away from the camera).
Fig. 2

Choosing the size and initial placement of the kernel. a Incorrect placement of the kernel, b choosing a kernel that is too large, c choosing a kernel that is too small

The general approach to single-object tracking using the mean-shift algorithm is referred to as the "Continuously Adaptive Mean-shift" or CAM-shift algorithm. It was introduced by Bradski [9] in 1998 and is one of the earliest works using the mean-shift algorithm for object tracking. The CAM-shift tracking algorithm, which adapts the size of the search window to the object using knowledge of the aspect ratio of the desired object and the zeroth moment of the kernel, is as follows (Algorithm 1.1).

Let M(p) represent the degree of membership of pixel p in the target object. Then, for a kernel in quantized 2D space, the coordinates of the centre of gravity are calculated as follows:
$$\begin{aligned} M_{00}= & {} \sum _y {\sum _x {M(x,y)} } \end{aligned}$$
$$\begin{aligned} M_{01}= & {} \sum _y {\sum _x {x\cdot M(x,y)} } \end{aligned}$$
$$\begin{aligned} M_{10}= & {} \sum _y {\sum _x {y\cdot M(x,y)} } \end{aligned}$$
$$\begin{aligned} X_{c}= & {} M_{01}/M_{00} \end{aligned}$$
$$\begin{aligned} Y_{c}= & {} M_{10}/M_{00} \end{aligned}$$
\(M_{00}\), \(M_{01}\), and \(M_{10}\) are called the zeroth moment, first moment-x, and first moment-y, respectively.
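As a concrete illustration, the moments and the centre of gravity above translate directly into code. This sketch follows the same convention as the equations (M01 accumulates x, M10 accumulates y) on a membership map given as a list of rows.

```python
# Computing the kernel moments and centre of gravity from a membership map
# M(x, y); M01 accumulates x and M10 accumulates y, matching the text.

def kernel_moments(M):
    m00 = m01 = m10 = 0
    for y, row in enumerate(M):
        for x, v in enumerate(row):
            m00 += v        # zeroth moment
            m01 += x * v    # first moment-x
            m10 += y * v    # first moment-y
    return m00, m01, m10

def centre_of_gravity(M):
    m00, m01, m10 = kernel_moments(M)
    return m01 / m00, m10 / m00  # (Xc, Yc)
```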

2 Research background

Comaniciu et al. [10] in their survey on "Kernel-based Object Tracking" identified two major components of visual object tracking, "target representation" and "localization". Target representation is normally a bottom–up process which has to cope with changes in the appearance of the target. Filtering and data association is mostly a top–down process dealing with the dynamics of the tracked object, learning of scene priors, and evaluation of different hypotheses. The way the two components are combined and weighted is application dependent and plays a decisive role in the robustness and efficiency of the tracker. Some applications of the mean-shift algorithm as a general tool for analyzing the feature space are introduced in Comaniciu and Meer [11]. Another successful application of the mean-shift algorithm is object tracking, with the core idea of representing the object as a set of features which may vary in number, distance, and time [12]. This assumption is realistic in real-world applications of video analysis, because 100% accuracy in detecting features is not achievable.

Comaniciu et al. [13] used the mean-shift algorithm for object tracking with a moving camera. This makes feature extraction of the object more difficult because of the changing background, but the basic idea of the tracking is the same as in Bradski's [9] work. Allen [14] used the CAM-shift algorithm for tracking multiple color patches. Wang et al. [15] used features extracted from the wavelet transform to track the object.

2.1 The mean-shift algorithm and feature tracking

In general, the mean-shift algorithm can be applied as a feature tracking algorithm for video sequences. One of the features of this approach is continuous evaluation of the kernel which can be done using a distance evaluation function. Different distance functions have been introduced and applied in research. Depending on the features and the target application, one may be superior to the other. These functions may require a significant amount of computation.

For color-based object tracking, different similarity measures have also been introduced in the literature. The Euclidean distance, the Mahalanobis distance [9], the Bhattacharyya coefficient [16], and the Kullback–Leibler divergence are the most commonly used similarity measures.

Yang et al. [17] proposed a new similarity measure for object tracking using the mean-shift algorithm. They showed that the Bhattacharyya coefficient and the Kullback–Leibler divergence are inaccurate in higher dimensions on synthesized Gaussian data. As an alternative, they proposed a similarity measure for two Gaussian distributions which is more accurate and reliable in higher dimensions. Instead of evaluating information-theoretic measures from the estimated PDF (probability density function), they defined the similarity between two distributions as the expectation of the density estimates over the model or target image. Given two distributions with samples \(I_{x} = \{x_{i}, u_{i}\}, i = 1 \ldots N\), and \(I_{y} = \{y_{j}, v_{j}\}, j = 1 \ldots M\), where the centre of the sample points in the model is x and the current centre of the target points is y, the similarity between \(I_{x}\) and \(I_{y}\) in the joint feature-spatial space is as follows:
$$\begin{aligned} J(I_x, I_y) = \frac{1}{MN}\sum _{i=1}^{N} \sum _{j=1}^{M} w\left( \left| \frac{x_i - y_j}{\sigma } \right| ^2 \right) k\left( \left| \frac{u_i - v_j}{h} \right| ^2 \right) \end{aligned}$$
w and k are RBF (radial basis function) kernel profiles, and \(\sigma \) and h are bandwidths describing the size of the kernel in the spatial and feature domains, respectively. They showed that, for tracking objects of known size, their approach performs better than the Bhattacharyya coefficient and Kullback–Leibler divergence measures.
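For illustration, the similarity J above can be evaluated directly for scalar positions and features. In this sketch, Gaussian profiles for both w and k and unit bandwidths are assumptions; they are one common RBF choice, not necessarily the profiles used by Yang et al.

```python
import math

# Joint feature-spatial similarity J(Ix, Iy), with assumed Gaussian (RBF)
# profiles for both w and k; sigma and h are the spatial and feature
# bandwidths from the equation above.

def similarity(Ix, Iy, sigma=1.0, h=1.0):
    """Ix, Iy: lists of (position, feature) pairs of scalars."""
    N, M = len(Ix), len(Iy)
    total = 0.0
    for x_i, u_i in Ix:
        for y_j, v_j in Iy:
            total += (math.exp(-((x_i - y_j) / sigma) ** 2)
                      * math.exp(-((u_i - v_j) / h) ** 2))
    return total / (M * N)
```

Identical sample sets score 1.0, and the similarity decays smoothly as the two point sets drift apart in position or feature value.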

Xu et al. [18] proposed a color-based object-tracking method using the mean-shift algorithm based on a user-selected window in the initial frame which should contain the object of interest. They used the Bhattacharyya coefficient as the convergence measure of the mean-shift algorithm. Their color model is a bitwise method for the YUV 32-bit color space which contains 32 color clusters. To tolerate initially incorrect manual selections, they used the Epanechnikov kernel smoothing function to give higher weight to the pixels located in the centre of the selected area and lower weight to background pixels that may be located on the edges of the selected area. However, they did not show that their color model is better than other color models.

Finally, the approach of Yang et al. [17] introduces the size and shape of the kernel as one of the features. This is not suitable for some movement directions, e.g., moving toward or away from the camera, which results in a significant change in size, or for tracking an articulated object such as the hand, whose shape changes over time. These shortcomings make the approach unsuitable for hand tracking and gesture recognition.

Using a higher number of dimensions, which results from using different features at the same time, is one of the current requirements of tracking. Research shows that more feature dimensions result in more accurate classification [19]. For instance, considering motion features, wavelets, and color together may improve tracking accuracy. On the other hand, feature selection itself and the amount of computation are the disadvantages of a multiple-feature space.

2.2 The mean-shift algorithm and variable sized kernel

Choosing the size of the kernel is one of the challenges with the mean-shift algorithm. As mentioned in Sect. 1.2, kernel resizing in the CAM-shift algorithm is based on the density of the features within the kernel and the known aspect ratio of the object. This approach is not robust and therefore alternative methods are still being studied.

Setting the kernel size in the CAM-shift algorithm [9] is based on several assumptions: (i) the quality of the detection, together with the solid shape of the object of interest, provides a certain density of pixels within the kernel; (ii) the aspect ratio of the object of interest (e.g., the face) is always constant; and (iii) the face silhouette is the only silhouette in the image. Based on these assumptions, Bradski proposed the following equations for calculating the kernel size in each frame:
$$\begin{aligned} w= & {} 2\sqrt{\frac{M_{00} }{256}} \end{aligned}$$
$$\begin{aligned} h= & {} 1.2\times w \end{aligned}$$
where w and h are the width and the height of the new kernel and \(M_{00}\) is the zeroth moment of the kernel. We refer to this method as the "kernel density-based" method or, for the sake of simplicity, sqrt(m00), because the kernel size is based on the square root of the kernel density. Sherrah and Gong [20] proposed a similar method for tracking discontinuous motion of multiple occluding body parts of an individual from a single 2D view. They used the following width and height for estimating the kernel size for face tracking, which is essentially the same as Bradski's approach:
$$\begin{aligned} w= & {} \sqrt{\mathop {n}\nolimits _\mathrm{{skin}} } \end{aligned}$$
$$\begin{aligned} h= & {} 1.2 \times w \end{aligned}$$
where \(n_\mathrm{{skin}}\) is the number of non-zero pixels inside the kernel.
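Both resizing rules above reduce to a one-line computation. The sketch below follows Bradski's variant; the 256 divisor is taken from the equation above and is usually explained by the use of 8-bit membership images, which is an assumption of this reading.

```python
import math

# Kernel-density-based ("sqrt(m00)") window resizing: the new width follows
# the square root of the zeroth moment, and the height applies a fixed face
# aspect ratio of 1.2, as in the equations above.

def camshift_resize(m00):
    w = 2.0 * math.sqrt(m00 / 256.0)  # Bradski's rule
    h = 1.2 * w                       # assumed face aspect ratio
    return w, h
```

Because w and h track the square root of the inner density, any drop in detected-pixel density (e.g., from noise) immediately shrinks the kernel, which is the weakness examined in the experiments later in the paper.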

The kernel resizing method of Comaniciu et al. [13] is based on measuring the kernel density with three different kernels applied to the image simultaneously: one kernel with the current size and two others with window sizes of \(\pm 10\%\) of the current size. For each, the color distribution of the kernel is compared to the color distribution of the target using the Bhattacharyya coefficient, and the most similar distribution determines the new scale. One disadvantage of the Bhattacharyya coefficient is that, in a uniformly colored region, any location of a window that is too small will yield the same value as would be obtained by shrinking the window even more [21]. Therefore, deciding whether to shrink or enlarge the kernel is impossible.
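The three-scale test can be sketched as follows. Here `hist_at_scale` is a hypothetical callback returning the normalized colour histogram of the kernel at a given scale; the ±10% deltas follow the description above.

```python
import math

# Bhattacharyya coefficient between two normalized colour histograms;
# 1.0 means identical distributions.

def bhattacharyya(p, q):
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))

def best_scale(target_hist, hist_at_scale, scale, deltas=(0.9, 1.0, 1.1)):
    """Evaluate the kernel at -10%, current, and +10% of the present size and
    keep the scale whose colour distribution is most similar to the target.
    hist_at_scale is a hypothetical callback: scale -> normalized histogram."""
    return max((d * scale for d in deltas),
               key=lambda s: bhattacharyya(target_hist, hist_at_scale(s)))
```

On a uniformly coloured region all three candidate scales can produce the same histogram and hence the same coefficient, which is exactly the ambiguity Collins [21] points out.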

KaewTraKulPong and Bowden [22] proposed a method for tracking low-resolution moving objects in an image using color, shape, and motion information. The height and width of the kernel were modeled by a white noise velocity model based on the assumption that the bounding box does not change dramatically. It should be noted that this assumption cannot be made in some applications including face tracking in HCI.

Yang et al. [17] proposed using a Gaussian transform to model the kernel to decrease the number of iterations of the mean-shift algorithm when tracking a moving blob. This enhancement requires extra pre-processing computation, which degrades the overall performance of the algorithm. Collins [21] points out that setting the window size based on the CAM-shift algorithm with negative weights does not produce a reliable estimate of the object boundaries. Alternatively, he proposes a method for resizing the search window based on the weight of the samples and the scale of the mean-shift kernel. The scale space is generated by convolving a filter bank of spatial difference-of-Gaussian (DoG) filters with a sample weight image. The results are then convolved with an Epanechnikov kernel in the scale dimension. Although this method produces more reliable results for a wider range of applications, the computational expense of calculating the convolutions is very high, which makes the approach unfavorable for real-time applications.

Shan et al. [23] proposed a method for skin tracking using the mean-shift algorithm based on particle filtering, called the Mean-Shift Embedded Particle Filter (MSEPF). They introduced in-motion skin pixels as an extra weight for mean-shift tracking. Introducing the motion feature as another weight factor makes the tracking algorithm more sensitive to a moving hand or face. On the other hand, the motion tracking method may cause the detected centre of gravity to vary after applying the mean-shift algorithm. For instance, with frame subtraction, the in-motion pixels are mostly located on the edges of the hand and face, which biases the tracked rectangle toward the edges of the hand silhouette. They reported that this algorithm is used for giving simple hand commands to an intelligent wheelchair as visual input.

3 Fuzzy-based kernel resizing

Our observations show that selecting an initial kernel size equal to the input image, together with resizing it with a proper algorithm, can find the biggest blob in the image, which would normally be the face or the hand region in an HCI application.

3.1 Initialization and boundary detection of the kernel

Based on the output of the skin color segmentation algorithm, we observed the following limitations of the skin detection:
  1. The quality of detection is not deterministic and may vary over time.
  2. The quality of detection may be inhomogeneous on the same blob. For instance, the density of the detected pixels in the forehead area may differ from the density of the detected pixels in the chin area (Fig. 3).
  3. The density of the falsely detected pixels for some colors (e.g., wood color) may be considerable in comparison with the density of the pixels of the object of interest and, therefore, cause incorrect tracking.
  4. The shape of the silhouette of the object of interest (e.g., the hand) is non-deterministic, and therefore the size of the kernel cannot be inferred from the zeroth-moment information.
  5. The blob of the object of interest may be disconnected due to different lighting conditions.

To overcome the limitations discussed, we used a kernel initially equal to the size of the input image and resized it based on the density of the edges of the kernel. The results of applying this method in different scenarios using Algorithm 3.1 were superior to those of the CAM-shift algorithm.
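A minimal sketch of this edge-density-driven resizing (without the fuzzy controller) is given below. The density thresholds, the one-pixel step per side, and the stopping rule are illustrative assumptions, not the paper's exact values.

```python
# The kernel starts as large as the image and shrinks while its boundary
# rows/columns contain too few detected pixels; it grows again when the
# boundary cuts into the blob. Thresholds and the 1-pixel step are assumed.

def edge_density(mask, x0, y0, x1, y1):
    """Fraction of set pixels on the kernel's four boundary edges."""
    edge = [mask[y0][x] for x in range(x0, x1 + 1)]
    edge += [mask[y1][x] for x in range(x0, x1 + 1)]
    edge += [mask[y][x0] for y in range(y0 + 1, y1)]
    edge += [mask[y][x1] for y in range(y0 + 1, y1)]
    return sum(edge) / len(edge)

def fit_boundary(mask, low=0.05, high=0.5, max_iter=1000):
    x0, y0 = 0, 0
    y1, x1 = len(mask) - 1, len(mask[0]) - 1  # kernel = whole image
    for _ in range(max_iter):
        d = edge_density(mask, x0, y0, x1, y1)
        if d < low and x1 - x0 > 2 and y1 - y0 > 2:
            x0, y0, x1, y1 = x0 + 1, y0 + 1, x1 - 1, y1 - 1   # shrink
        elif d > high:
            x0, y0 = max(0, x0 - 1), max(0, y0 - 1)           # enlarge
            x1 = min(len(mask[0]) - 1, x1 + 1)
            y1 = min(len(mask) - 1, y1 + 1)
        else:
            break  # boundary rests just around the blob
    return x0, y0, x1, y1
```

Because each side moves one pixel per iteration, the kernel size changes linearly, which is the convergence-speed trade-off discussed next.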
Preliminary comparisons of the proposed algorithm with the CAM-shift algorithm show that in tracking a single object both algorithms have advantages and disadvantages. The convergence speed of the CAM-shift algorithm is higher, because its kernel size follows the inner density of the kernel directly, whereas in the proposed algorithm the kernel size changes linearly, by a fixed number of pixels per iteration. To improve the convergence speed of the proposed algorithm, we propose a fuzzy approach to resizing the kernel based on the edge density of the kernel, which is described in the following section.
Fig. 3

Inhomogeneous detection of the skin pixels in the image. The density of the detected pixels varies across the face area, which makes detection and tracking of the face blob more difficult

3.2 Fuzzy boundary detector

The fuzzy function for resizing the kernel includes three functions as described in the following paragraphs.

Fuzzifier The fuzzifier converts the input values to fuzzy values. We considered three fuzzy values for the input. The boundaries of these values are based on empirical results that we found by trial and error in face tracking using a Dragonfly digital camera in separate experiments. Empirical results show that the proposed number of levels is sufficient for face and hand tracking (Fig. 4).

Inference engine The inference procedure for the proposed fuzzy controller is presented in Table 1. We have considered three fuzzy values for changing the output, as indicated in Fig. 5. The values \(-5\) and 5 are arbitrary and determine the shrinking and enlarging speed of the algorithm.

Defuzzifier For converting the fuzzy outputs to numerical values, the centre of gravity method was used.
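Putting the three parts together, a controller of this kind might be sketched as follows. The triangular membership breakpoints below are assumptions (the paper's values were found empirically and are shown in Figs. 4 and 5), while the −5/0/+5 output values follow the text.

```python
# Hedged sketch of the three-stage fuzzy boundary controller: a fuzzifier
# with three triangular memberships over the edge density (breakpoints are
# assumed), a three-rule inference table mapping low/medium/high density to
# shrink/no-change/enlarge (-5/0/+5), and a centre-of-gravity defuzzifier.

def tri(x, a, b, c):
    """Triangular membership function peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def fuzzify(density):
    return {  # assumed breakpoints on the [0, 1] edge-density axis
        "low":    tri(density, -0.2, 0.0, 0.3),
        "medium": tri(density,  0.1, 0.3, 0.5),
        "high":   tri(density,  0.3, 1.0, 1.8),
    }

RULES = {"low": -5.0, "medium": 0.0, "high": 5.0}  # resize rate (pixels)

def resize_rate(density):
    mu = fuzzify(density)
    num = sum(mu[k] * RULES[k] for k in mu)  # centre-of-gravity defuzzifier
    den = sum(mu.values())
    return num / den if den else 0.0
```

Intermediate densities blend the neighbouring rules, so the kernel resizes smoothly instead of oscillating between full shrink and full enlarge steps.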

The algorithm for the mean-shift face tracker with fuzzy boundary detection is as follows:

Fig. 4

Fuzzy values for the inputs of the fuzzy boundary detector

Table 1

Fuzzy controller for the fuzzy boundary detector: each fuzzy edge-density input value maps to a resize-rate output value (shrink, no change, or enlarge)

Fig. 5

Fuzzy outputs for the fuzzy boundary detector

4 Experiments and results

In this section, we present the experiments and the results showing the behavior of our proposed algorithm in comparison with the CAM-shift algorithm in noiseless and noisy environments. The implementation of the CAM-shift algorithm is based on estimating the size of the kernel as a function of the zeroth moment, as described in Bradski [9] and Sherrah and Gong [20]. We call this method sqrt(m00) and compare it to our method, which we call the "fuzzy method". We measured (a) the top-left corner position, (b) the position of the centre of gravity, (c) the area of the kernel, and (d) the error in positioning the kernel, which collectively give an overview of the stability and correctness of the kernel.

We also applied preset levels of white noise in each experiment. The noise might change the value of a skin pixel, causing it not to appear as a skin pixel in the silhouette image. Seven levels of noise (0 to 30% in 5% steps) were tested on the input sequence, and all of each experiment's parameters were measured. The video sequences were recorded using a Dragonfly video camera equipped with a Sony CCD. Each frame was a \(640 \times 480\) RGB color image recorded at 15 fps.

The noise was applied before skin color detection, which means that the number of missing skin pixels was further increased by the erode–dilate morphological operators inside the skin detector. We believe that this model is more realistic and closer to real-world conditions.
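One way to read the noise model above is that each pixel is flipped independently with probability p before detection, so detected-skin pixels can disappear and background pixels can appear. The sketch below is an illustrative reading, not the paper's exact noise generator.

```python
import random

# Illustrative white-noise injection on a binary mask: each pixel's value is
# flipped with probability p (the preset noise level, 0-30% in 5% steps).

def add_white_noise(mask, p, rng=None):
    rng = rng or random.Random()
    return [[(1 - v) if rng.random() < p else v for v in row]
            for row in mask]
```

Passing a seeded `random.Random` makes an experiment repeatable across runs.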

To compare the algorithms, we used an ideal tracker as the ground truth for evaluation. The ideal tracker is a tracker that holds at least 10% filled pixels on each edge and tracks the object from an ideal (noise-free) skin detection algorithm. The ideal tracker re-establishes this property of the edges in each frame, which makes its behavior similar to that of the inner density-based algorithm, which has a faster convergence speed. In addition, this makes it flexible in resizing and able to fit properly around the shape of the desired object. Results from four experiments are presented in the following sections.

4.1 Experiment 1: detecting a blob with no movement

The platform for the first experiment was a manually segmented face image used as the input video sequence. The purpose of using a fixed frame instead of a real video sequence of an in-motion blob was to test the algorithms with ideal input. We also wanted to have the ability to control the noise level without having to deal with other variables such as camera noise and change in lighting condition. The results of tracking in a noiseless environment in both the algorithms were almost the same (Fig. 6).
Fig. 6

Detected boundary by the two algorithms surrounding a static image silhouette

Fig. 7

a Correct detection determined by edge density-fuzzy. The smaller rectangle is the result of kernel density-based—sqrt(m00)—method. b Behavior of the algorithms with a noise level of 15%

Fig. 8

Tracking the hand in a “grabbing” hand gesture with white noise of 20%, a original image sequence, b centre of gravity, c error of displacement in comparison with an ideal tracker (\({X}_{\mathrm{c}}-{X}_{\mathrm{c}\,\mathrm{ideal}})\)

4.2 Behavior of the kernel density-based algorithm in noisy environment

As the noise increased, the reliability of the sqrt(m00) method decreased until it finally could not properly delimit the boundaries of the face. There is a logical explanation for this behavior: the method depends on the inner density of the kernel to specify the boundaries. Adding noise to the kernel decreases its density, and therefore a smaller width and height are computed for the kernel. As the noise level increases, this method starts to lose track more frequently (Fig. 7b).
Fig. 9

Tracking the centre of gravity—zoom out with 25% white noise, a original image sequence, b X\(_\mathrm{c}\) centre of gravity, c area of the kernel, and d error of placement of the kernel in comparison with an ideal tracker \(({X}-{X}_{\mathrm{ideal}})\)

Fig. 10

Moving hand, in occluded situation, a the original image sequence, b X\(_\mathrm{c}\) of the centre of gravity in noise of 20%, d Error in X\(_\mathrm{c}\) of the kernel in comparison with an ideal tracker \(({X}-{X}_{\mathrm{ideal}})\)

4.3 Behavior of the fuzzy boundary detection algorithm in noisy environment

The fuzzy boundary detection algorithm proposed here shows more robustness against white noise. After locating the approximate position of the object, it examines the density of the edges of the kernel instead of the inner density, which makes it more robust against changes of density inside the kernel. In addition, the fuzzy behavior makes resizing smoother and less sensitive to noise. At the highest noise level, the fuzzy-based approach demonstrates significant stability in comparison with the inner density algorithm. Figure 7a shows the boundaries detected by the fuzzy-based approach and the density-based approach. The position of the kernel using the fuzzy-based approach was correct and stable over almost the whole tracking period, while the position of the kernel using the density-based approach was unstable. Figure 7b shows the occurrence of the "roaming effect" for sqrt(m00), while the fuzzy edge density method remains stable.

In the following experiments, we present the behavior of these algorithms in different scenarios including occluded blobs, changing shape, and zoom effect.

4.4 Experiment 2: tracking the blob of a moving hand

The second experiment was performed on the video sequence of a hand demonstrating a grabbing gesture. The characteristic of this gesture is a continuous change of the shape and fast movement of the position of the blob, as shown in Fig. 8.

Figure 8b, c presents the position of the centre of gravity of the trackers and the measured error, respectively. Although both trackers are able to follow the object of interest, the tracker based on inner density shows fluctuating error in boundary detection, while the fuzzy tracker is more stable and more accurate in detecting the boundaries.

4.4.1 Experiment 3: tracking an object moving away from the camera

The third experiment was performed on the video sequence of a face moving away from the camera. This experiment was designed to analyze the behavior of the tracking algorithm on determining the boundaries of an object continuously shrinking. Both the size and the centre of gravity of the kernel change during the sequence (Fig. 9). The roaming effect for the inner density tracker is observable in Fig. 9b, c, representing fluctuation of the tracker in tracking the object of interest.

4.5 Experiment 4: tracking a moving hand in occluded situation

The fourth experiment was a simple hand movement in the presence of another hand, as shown in Fig. 10. This video sequence has two main characteristics. First, the presence of the "other hand" causes occlusion for the tracking algorithm, especially while the hands overlap; the tracking algorithms cannot distinguish which one is the hand in front of the camera. Second, the movement is a rotation of the hand around the elbow, which causes the rectangular boundary of the hand to change over time.

4.6 Accuracy of tracking

Evaluating a general-purpose tracker is not a straightforward procedure, because the acceptable accuracy and error tolerance depend on the application. For instance, considering an acceptable error tolerance for the position of the kernel in experiment 3, the accuracy of the sqrt(m00) method dramatically decreases as the error tolerance is tightened, while the fuzzy algorithm shows robustness against white noise and high accuracy, as shown in Table 2.
Table 2

Comparison of accuracy of the algorithms in experiment 3: acceptable error (pixels) versus the accuracy of the sqrt(m00) method and of the fuzzy method, both at 20% noise

Fig. 11

Average mean distance between the trackers and the ideal tracker in noise level 0–30%. a “Grabbing hand gesture dataset” (experiment 2), b “Face zoom out” dataset (experiment 3), and c “Occluded hands” dataset (experiment 4)

Another distance measure between two segments, introduced by Gevers [24], is based on the sum of the distances of all of the points in one shape to the other shape, called the mean distance.

We used the concept of the mean distance to evaluate the distance between an arbitrary tracker and an ideal tracker, as discussed in the following paragraphs.

Let A denote an arbitrary shape in the 2D space in discrete Cartesian system containing n points. Then, the shape A can be defined as
$$\begin{aligned} A=\{u_i \}_{i=1 \ldots n}. \end{aligned}$$
The distance between an arbitrary vector \(\vec {x}\) and shape A is defined as
$$\begin{aligned} d(\vec {x},A)=\min _{\vec {u}\in A} \left[ {d(\vec {x},\vec {u})} \right] . \end{aligned}$$
Now, using the definition of the distance between two points in the 2D Cartesian system, the distance between two vectors \(\vec {v}\) and \(\vec {u}\) is defined as
$$\begin{aligned} d(\vec {v},\vec {u})=\sqrt{(v_x -u_x )^{2}+(v_y -u_y )^{2}}. \end{aligned}$$
Finally, the distance between two shapes A and B in 2D space is defined as
$$\begin{aligned} d(A,B)=\sum _{\vec {u}\in A} {d(\vec {u},B)} +\sum _{\vec {v}\in B} {d(\vec {v},A)}. \end{aligned}$$
Based on these definitions, the distance from an inner point of a shape to the shape itself is zero. Consequently, a single directed distance from a tracker to another tracker that surrounds it would also be zero. To penalize such cases, we use the sum of both directed distances, as shown in the above equation. The resulting mean distances are presented in Fig. 11.
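The symmetric distance above can be sketched directly. The following minimal Python illustration uses point sets standing in for tracker boundaries; the brute-force scan is for clarity, not speed:

```python
import math

def point_to_shape(p, shape):
    """d(p, A): distance from point p to the nearest point of shape A."""
    return min(math.dist(p, u) for u in shape)

def shape_distance(a, b):
    """d(A, B): sum of both directed point-to-shape distances, so that a
    tracker lying strictly inside another is still penalized."""
    return (sum(point_to_shape(u, b) for u in a) +
            sum(point_to_shape(v, a) for v in b))
```

For example, with A = {(0, 0), (2, 0)} and B = {(0, 0)}, the directed sum from B to A is zero, but the symmetric sum is 2, illustrating why both directions are needed.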

5 Implementation issues for hand and face tracking

Based on the proposed algorithm, we implemented an application for tracking multiple objects including face and hands. In the following sections, we describe the details of the implementation and demonstrate the results.

5.1 The multi-tracker implementation

The mean-shift tracker introduced in this research was designed to track a single object silhouette within the image (Fig. 12). Since HCI applications require tracking both the hands and the face of the user, the capability of tracking multiple silhouettes is an implementation challenge for this algorithm.
Fig. 12

Face tracking using the proposed algorithm for skin detection, and the mean-shift algorithm for blob tracking. Please refer to video#3 in the enclosed CD to view the full video

Since running multiple instances of the same algorithm on the same silhouette will produce the same tracking result, a strategy for distinguishing the different instances should be considered. For this purpose, we implemented an algorithm for running multiple instances of the tracker algorithm. This implementation has two main components: (a) tracker manager and (b) tracker scheduler.

The tracker manager scans the image and searches for blocks containing a certain density of skin pixels within the frame, and assigns a tracker to each block satisfying this constraint. In the next stage, the tracker scheduler runs the available instances of the tracker and erases the content of the image silhouette after each individual tracker has detected its boundaries. This strategy prevents the trackers from moving into each other's detected areas. If the tracker scheduler finds that a tracker no longer covers a minimum number of skin pixels, it marks it for deletion in the next scan.
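The manager/scheduler loop above can be sketched as follows. This is an illustrative Python sketch only: the block size, density threshold, and minimum pixel count are assumed parameters, not values from the implementation, and the silhouette is a binary grid of skin pixels.

```python
BLOCK = 4            # scan-block size in pixels (assumed)
MIN_DENSITY = 0.5    # skin-pixel fraction needed to spawn a tracker (assumed)
MIN_PIXELS = 2       # tracker is marked for deletion below this count (assumed)

def scan_for_trackers(silhouette):
    """Tracker manager: assign a tracker to each sufficiently dense block."""
    h, w = len(silhouette), len(silhouette[0])
    trackers = []
    for y in range(0, h, BLOCK):
        for x in range(0, w, BLOCK):
            block = [silhouette[j][i]
                     for j in range(y, min(y + BLOCK, h))
                     for i in range(x, min(x + BLOCK, w))]
            if sum(block) / len(block) >= MIN_DENSITY:
                trackers.append({"x": x, "y": y, "w": BLOCK, "h": BLOCK})
    return trackers

def run_scheduler(silhouette, trackers):
    """Tracker scheduler: run trackers in turn, erasing claimed pixels so
    that no tracker can move onto another tracker's detected area."""
    survivors = []
    for t in trackers:
        claimed = 0
        for j in range(t["y"], min(t["y"] + t["h"], len(silhouette))):
            for i in range(t["x"], min(t["x"] + t["w"], len(silhouette[0]))):
                if silhouette[j][i]:
                    claimed += 1
                    silhouette[j][i] = 0   # erase the claimed skin pixel
        if claimed >= MIN_PIXELS:
            survivors.append(t)            # otherwise: deleted on next scan
    return survivors
```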

The result of running this algorithm on an image sequence is presented in Fig. 13. In Fig. 13, the green rectangles represent trackers just initialized by the tracker manager and the yellow rectangles represent trackers which have lasted more than one frame. As shown in Fig. 13d, there is no longer a tracker on some of the non-skin areas where trackers were initially placed. This is the effect of the adaptive skin algorithm described in [5].
Fig. 13

Running multiple trackers on the image sequence of hand movement based on the algorithm described in Sect. 5.1

5.2 Tracker–tracker implementation

The result of the implementation in the previous section is a set of trackers tracking the skin areas. Although with a good-quality silhouette there will be one tracker per visible object in the image, in some scenarios there will be more than one tracker per object. The silhouette of a palm with open fingers near the camera is one such scenario. To overcome this limitation, we implemented an extension of the multi-tracker algorithm based on two sets of trackers. The first set consists of trackers that track blobs of skin pixels; we call these “the micro trackers”. The second set consists of trackers that track groups of trackers in the first set; we call these “the macro trackers”. More precisely:

Let \({M} = \{{M}_{1},\ldots , {M}_{n}\}\) represent the set of macro trackers and \({m} = \{{m}_{1}, \ldots , {m}_{k}\}\) the set of micro trackers. Each macro tracker carries a collection of micro trackers, \({M}_{i}= \{{m}_{p}\; {\vert }\; {m}_{p} \in m\}\), such that for each pair of macro trackers \(i \ne j\): \({M}_{i} \cap {M}_{j} = \emptyset \).
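One way to realize this grouping is sketched below. The grouping rule (merging micro trackers whose centres lie within a fixed radius) and the radius itself are assumptions for illustration; the paper only requires that the macro sets be disjoint, which the sketch guarantees by letting each micro tracker join exactly one macro tracker.

```python
GROUP_RADIUS = 50.0   # hypothetical merge distance in pixels (assumed)

def centre(t):
    """Centre point of a tracker's bounding rectangle."""
    return (t["x"] + t["w"] / 2.0, t["y"] + t["h"] / 2.0)

def group_micro_trackers(micros):
    """Partition micro trackers into disjoint macro trackers by
    centre-to-centre distance to each macro's first member."""
    macros = []   # each macro tracker is a list of micro trackers
    for m in micros:
        cx, cy = centre(m)
        for macro in macros:
            gx, gy = centre(macro[0])
            if ((cx - gx) ** 2 + (cy - gy) ** 2) ** 0.5 <= GROUP_RADIUS:
                macro.append(m)   # joins exactly one macro tracker,
                break             # keeping the macro sets disjoint
        else:
            macros.append([m])    # no nearby macro: start a new one
    return macros
```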

The result of running this algorithm on an image sequence containing different hand movements is presented in Fig. 14.
Fig. 14

Robust hand and face tracking using the tracker–tracker algorithm on an image sequence of different bi-hand movements

5.3 Using depth information for hand and face tracking

One of the limitations of the mean-shift tracking algorithm together with the skin detection is occlusion handling. Because there is no significant information about the boundaries of the hand and face in the skin blobs, using an extra cue to distinguish the boundaries of the hand and face may be useful. In this section, we describe the technique of using the depth information together with the mean-shift algorithm. The depth information is extractable using a stereo-vision system.

A stereo-vision system can provide the depth information of the scene in terms of distance to the camera. This information can be used for enhancing the algorithms which we introduced in the previous section.

The simplest configuration of a stereo-vision system is bi-camera, although three or more cameras may be used instead. Stereo-vision can produce a dense disparity map which can be translated into a depth information map. Ideally, the resultant disparity map should be smooth, detailed, and continuous; surfaces should produce regions of smooth disparity values with their boundaries precisely delineated, while small surface elements should be detected as separately distinguishable regions. Unfortunately, satisfying all of these requirements simultaneously is not achievable: algorithms that produce a smooth disparity map tend to miss details, and those that produce a detailed map tend to be noisy. The depth maps obtained by bi-camera stereo systems are not very accurate or reliable; a higher number of cameras may yield better-quality depth information [25].

While stereo machines with more than two cameras are not yet commonly available, ordinary bi-camera stereo-vision systems are the most accessible choice, although their depth-estimation ability is somewhat limited. We used a Bumblebee2 bi-camera stereo-vision system for applying the techniques proposed in this section. Connected to the FireWire port, this camera records two \(1024 \times 768\) color or gray-scale images at 25 fps. The test platform was Windows XP and the programming platform was C++ in Visual Studio .NET 2003. Although the depth map obtained using this camera is not highly accurate (Fig. 15), it provides an estimate of the distance of different objects to the camera, which was sufficient for evaluating our ideas.
Fig. 15

a Sample image, b depth information of the sample image. The light area is the object closer to the camera, and the black patches are of unknown depth

5.3.1 The depth information and the adaptive skin detection algorithm

Let us take a step back and review the concept of motion detection in our adaptive skin detection algorithm. One purpose of motion detection is to separate the user from the background. Considering that in an HCI environment the user is closer to the camera than the background, the depth information can significantly improve the accuracy of background elimination from the image. This idea can be implemented using depth thresholding. Figure 16b shows the eliminated background and the remaining foreground after applying the depth thresholding technique.
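Depth thresholding itself is a single comparison per pixel. A minimal sketch, in which the near-limit threshold and the unknown-depth marker are assumed values rather than parameters from the paper:

```python
NEAR_LIMIT = 1500   # keep pixels closer than this depth value (assumed units)

def threshold_depth(image, depth, near_limit=NEAR_LIMIT, unknown=0):
    """Zero out pixels that are farther than near_limit or whose
    depth is unknown, keeping only the near foreground (the user)."""
    out = []
    for img_row, d_row in zip(image, depth):
        out.append([p if (d != unknown and d < near_limit) else 0
                    for p, d in zip(img_row, d_row)])
    return out
```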
Fig. 16

Background elimination using the depth thresholding technique

Fig. 17

Some of the scenarios in which depth information can be useful for occlusion resolution

Fig. 18

Occlusion resolution using the depth information and the mean-shift algorithm. The tracked blobs are displayed on the disparity image (right)

The detected skin pixels within the separated foreground are more likely to be the actual skin color, and therefore, the adaptive skin detection algorithm will provide more accurate detection. Some of the unwanted areas like the surface of the table behind the user will also automatically be eliminated.

5.3.2 The depth information and the fuzzy mean-shift blob tracker

The depth information can be used for occlusion resolution in some scenarios. Figure 17 presents two cases in which the distances of the hand and the face to the camera differ. Without the depth information, the hand and face blobs in these images are considered as a single blob. In this section, we describe how the depth information can be applied to enhance the fuzzy mean-shift blob tracker for occlusion prevention.

Based on the idea of applying the depth information as an extra cue for blob tracking within the mean-shift algorithm, we implemented another variant of the fuzzy mean-shift algorithm. The kernel-shifting core of this variant is the same as its predecessor's, but the kernel management differs slightly.

In this version of the algorithm, the kernel carries, in addition to the vertical and horizontal boundaries of the tracked blob, the minimum and maximum depths of the points within it. The shrinking and enlarging of the kernel boundaries are based only on the pixels whose depths lie within \(\pm 10\%\) of the kernel's min–max depth; pixels that do not satisfy this constraint, or whose depth is unknown, are ignored. Therefore, blob trackers tracking blobs at different depth levels do not interfere with each other. Figure 18 shows the result of this algorithm on an image. Note that without the depth information, the hand and face blobs are considered as one connected blob; using the depth information, the mean-shift algorithm is able to stay on the blob that has a homogeneous depth.
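The ±10% depth constraint can be sketched as a simple pixel filter. The helper names and the unknown-depth marker are illustrative assumptions; the paper states only the rule itself.

```python
def depth_in_range(d, kmin, kmax, unknown=0):
    """Accept a pixel only if its depth is known and lies within
    +/-10% of the kernel's current min-max depth band."""
    if d == unknown:
        return False          # unknown depth: the pixel is ignored
    lo = kmin * 0.9           # band extended 10% towards the camera
    hi = kmax * 1.1           # and 10% away from it
    return lo <= d <= hi

def filter_kernel_pixels(pixels, kmin, kmax):
    """Keep only the (x, y, depth) samples usable for shrinking or
    enlarging the kernel boundaries."""
    return [(x, y, d) for (x, y, d) in pixels
            if depth_in_range(d, kmin, kmax)]
```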

6 Summary and conclusion

In this paper, we presented a new approach for boundary detection in blob tracking based on the mean-shift algorithm that is robust enough to be used in gesture recognition systems. Our approach was based on continuous sampling of the boundaries of the kernel and changing the size of the kernel using our novel algorithm. We also showed that the proposed method is superior in terms of robustness and stability compared to the density-based tracking method known as the CAM-shift algorithm.

The robustness of our method against noise makes it a good candidate for use with cheap cameras and in real-world vision-based human–computer interaction applications. This approach has been used for recognizing the nonverbal interaction of learners in an affective tutoring system [26, 27, 28, 29]. The method is to be applied in conjunction with a fast pixel-based skin color segmentation algorithm, as the level of noise and the quality of the skin detection are not deterministic.


  1. The video sequence is available online at

  2. The Bumblebee stereo-vision system is manufactured by PointGrey research labs.


  1. Alkemade, R., Verbeek, F., Lukosch, S.: On the efficiency of a VR hand gesture-based interface for 3D object manipulations in conceptual design. Int. J. Hum. Comput. Interact. 33(11), 882–901 (2017)
  2. Montanaro, L.: A touchless human–machine interface for the control of an elevator. In: Proceedings of the 2nd International Conference on Recent Trends and Applications in Computer Science and Information Technology (RTA-CSIT 2016), Tirana, Albania, November 18–19 (2016)
  3. Jackowski, A., Gebhard, M.: Evaluation of hands-free human–robot interaction using a head gesture based interface. In: Proceedings of the Companion of the ACM/IEEE International Conference on Human–Robot Interaction, Vienna, Austria, March 06–09 (2017)
  4. Attwenger, A.: Advantages and Drawbacks of Gesture-Based Interaction. GRIN, Munich (2014)
  5. Dadgostar, F., Sarrafzadeh, A.: An adaptive real-time skin detector based on Hue thresholding: a comparison on two motion tracking methods. Pattern Recognit. Lett. 27, 1342–1352 (2006)
  6. Zhu, Y., Ren, H., Xu, G., Lin, X.: Toward real-time human–computer interaction with continuous dynamic hand gestures. In: Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition (2000)
  7. Fukunaga, K., Hostetler, L.: The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Trans. Inf. Theory 21, 32–40 (1975)
  8. Cheng, Y.: Mean shift, mode seeking, and clustering. IEEE Trans. Pattern Anal. Mach. Intell. 17, 790–799 (1995)
  9. Bradski, G.R.: Computer vision face tracking for use in a perceptual user interface. Intel Technol. J. 2, 1–15 (1998)
  10. Comaniciu, D., Ramesh, V., Meer, P.: Kernel-based object tracking. IEEE Trans. Pattern Anal. Mach. Intell. 25, 564–577 (2003)
  11. Comaniciu, D., Meer, P.: Mean shift analysis and applications. In: Proceedings of the Seventh IEEE International Conference on Computer Vision, Kerkyra, Greece (1999)
  12. Comaniciu, D., Meer, P.: Mean shift: a robust approach toward feature space analysis. IEEE Trans. Pattern Anal. Mach. Intell. 24, 603–619 (2002)
  13. Comaniciu, D., Ramesh, V., Meer, P.: Real-time tracking of non-rigid objects using mean shift. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2000)
  14. Allen, J.G., Xu, R.Y.D., Jin, J.S.: Object tracking using CamShift algorithm and multiple quantized feature spaces. In: Workshop on Visual Information Processing, Conferences in Research and Practice in Information Technology, Sydney, Australia (2003)
  15. Wang, R., Chen, Y., Huang, T.S.: Basis pursuit for tracking. In: Proceedings of the IEEE International Conference on Image Processing (ICIP'01), Thessaloniki, Greece (2001)
  16. Djouadi, A., Snorrason, O., Garber, F.D.: The quality of training-sample estimates of the Bhattacharyya coefficient. IEEE Trans. Pattern Anal. Mach. Intell. 12, 92–97 (1990)
  17. Yang, C., Duraiswami, R., Davis, L.: Efficient mean-shift tracking via a new similarity measure. In: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) (2005)
  18. Xu, R.Y.D., Allen, J.G., Jin, J.S.: Robust mean-shift tracking with extended fast colour thresholding. In: Proceedings of the International Symposium on Intelligent Multimedia, Video & Speech Processing, Hong Kong (2004)
  19. Stenger, B.D.R.: Model-based hand tracking using a hierarchical Bayesian filter. PhD Thesis, St. John's College, University of Cambridge, Cambridge (2004)
  20. Sherrah, J., Gong, S.: Tracking discontinuous motion using Bayesian inference. In: Proceedings of ECCV (2000)
  21. Collins, R.T.: Mean-shift blob tracking through scale space. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2003)
  22. KaewTraKulPong, P., Bowden, R.: An adaptive visual system for tracking low resolution colour targets. In: Proceedings of the British Machine Vision Conference (BMVC'01), Manchester, UK (2001)
  23. Shan, C., Wei, Y., Tan, T., Ojardias, O.: Real time hand tracking by combining particle filtering and mean shift. In: Proceedings of the Sixth IEEE International Conference on Automatic Face and Gesture Recognition (FGR 2004), Seoul, Korea (2004)
  24. Gevers, T.: Robust segmentation and tracking of colored objects in video. IEEE Trans. Circuits Syst. Video Technol. 14, 1–6 (2004)
  25. Kanade, T.: Development of a video-rate stereo machine. In: Proceedings of the 1994 ARPA Image Understanding Workshop (IUW'94) (1994)
  26. Sarrafzadeh, A., Fan, C., Dadgostar, F., Alexander, S., Messom, C.: Frown gives game away: affect sensitive tutoring systems for elementary mathematics. In: Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, October 10–13, The Hague, The Netherlands (2004)
  27. Alexander, S., Hill, S., Sarrafzadeh, A.: How do human tutors adapt to affective state? In: Proceedings of the 10th International Conference on User Modeling, Edinburgh, 24–30 July (2005)
  28. Alexander, S., Sarrafzadeh, A.: Interfaces that adapt like humans. In: Masoudian, M., Jones, S., Rogers, B. (eds.) Computer Human Interaction, APCHI'04, Lecture Notes in Computer Science, pp. 641–645. Springer, Berlin (2004)
  29. Fan, C., Johnson, M., Messom, C., Sarrafzadeh, A.: Machine vision for an intelligent tutor. In: Proceedings of the International Conference on Computational Intelligence, Robotics and Autonomous Systems, Singapore, December 15–18 (2003)

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. AerVision Technologies, Sydney, Australia
  2. Unitec Institute of Technology, Auckland, New Zealand
