International Journal of Computer Vision

Volume 101, Issue 3, pp 403–419

Attention Based Detection and Recognition of Hand Postures Against Complex Backgrounds

Authors

  • Pramod Kumar Pisharady, Department of Electrical and Computer Engineering, National University of Singapore
  • Prahlad Vadakkepat, Department of Electrical and Computer Engineering, National University of Singapore
  • Ai Poh Loh, Department of Electrical and Computer Engineering, National University of Singapore

DOI: 10.1007/s11263-012-0560-5

Cite this article as:
Pisharady, P.K., Vadakkepat, P. & Loh, A.P. Int J Comput Vis (2013) 101: 403. doi:10.1007/s11263-012-0560-5

Abstract

A system for the detection, segmentation and recognition of multi-class hand postures against complex natural backgrounds is presented. Visual attention, which is the cognitive process of selectively concentrating on a region of interest in the visual field, helps humans recognize objects in cluttered natural scenes. The proposed system utilizes a Bayesian model of visual attention to generate a saliency map, and to detect and identify the hand region. Feature based visual attention is implemented using a combination of high level (shape, texture) and low level (color) image features. The shape and texture features are extracted from a skin similarity map, using a computational model of the ventral stream of visual cortex. The skin similarity map, which represents the similarity of each pixel to the human skin color in HSI color space, enhances the edges and shapes within the skin colored regions. The color features used are the discretized chrominance components in the HSI and YCbCr color spaces, and the similarity to skin map. The hand postures are classified using the shape and texture features, with a support vector machine classifier. A new 10 class complex background hand posture dataset, namely NUS hand posture dataset-II, is developed for testing the proposed algorithm (40 subjects, different ethnicities, various hand sizes, 2750 hand postures and 2000 background images). The algorithm is tested for hand detection and hand posture recognition using 10 fold cross-validation. The experimental results show that the algorithm has a person independent performance, and is reliable against variations in hand sizes and complex backgrounds. The algorithm provided a recognition rate of 94.36 %. A comparison of the proposed algorithm with other existing methods evidences its better performance.

Keywords

Computer vision · Pattern recognition · Hand gesture recognition · Complex backgrounds · Visual attention · Biologically inspired features

1 Introduction

Visual interaction is a natural, easy, and effective way of interaction, which does not require any physical contact and is not affected by acoustic noise. Hand gesture recognition, an important area of research in visual pattern analysis, has wide applications in sign language recognition, human-computer interaction (HCI), human-robot interaction (HRI), and virtual reality (VR). The presence of complex and cluttered backgrounds makes the recognition of hand gestures difficult.

Mainstream computer vision research has long been challenged by human vision, and the mechanisms of the human visual system are yet to be well understood. The human visual system rapidly and effortlessly recognizes a large number of diverse objects in cluttered, natural scenes and identifies specific patterns. This capability of the human visual system inspired the development of computational models of biological vision systems. Intermediate and higher visual processes in primates select a subset of the available sensory information before further processing (Tsotsos et al. 1995), in order to reduce the complexity of scene analysis. This selection is implemented in the form of a focus of attention (Niebur and Koch 1998). Recent developments in the use of neurobiological models in computer vision try to bridge the gap between neuroscience, computer vision and pattern recognition (Poggio and Bizzi 2004; Poggio and Riesenhuber 1999; Serre et al. 2007; Itti et al. 1998; Itti and Koch 2001; Siagian and Itti 2007; Rao 2005; Chikkerur et al. 2010).

1.1 Hand Gesture Recognition

Gestures are expressive, meaningful body motions involving physical movements of the fingers, hands, arms, head, face, or body (Mitra and Acharya 2007). Gestures can be classified based on the moving body part (Fig. 1(a)). There are two types of hand gestures: static and dynamic. Static hand gestures (hand postures/poses) are those in which the hand position does not change during the gesturing period. Static gestures mainly rely on the shape and the flexure angles of the fingers. In dynamic hand gestures (hand gestures), the hand position changes continuously over time. Dynamic gestures rely on hand trajectories, scales and orientations, in addition to the shape and finger flex angles. Dynamic gestures, which are actions composed of a sequence of static gestures, can be expressed as a temporal combination of static gestures.
Fig. 1

Classification of (a) gestures and (b) hand gesture recognition tools. The proposed algorithm recognizes static hand gestures, using a learning based approach

There exist several reviews on hand modeling, pose estimation, and hand gesture recognition (Erol et al. 2007; Ong and Ranganath 2005; Pavlovic et al. 1997; Wu and Huang 1999). The tools used for vision based hand gesture recognition can be classified into three categories (Fig. 1(b)). They are (1) Hidden Markov Model (HMM) based methods (Lee and Kim 1999; Yoon et al. 2001; Ramamoorthy et al. 2003; Just and Marcel 2009; Chen et al. 2003; Yang et al. 2007), (2) Neural network (NN) and learning based methods (Pramod Kumar et al. 2010a, 2010c, 2011; Alon et al. 2009; Su 2000; Licsar and Sziranyi 2005; Ge et al. 2008; Yang et al. 2002; Yang and Ahuja 1998; Zhao et al. 1998; Teng et al. 2005; Hasanuzzamana et al. 2007; Eng-Jon and Bowden 2004), and (3) Other methods (Graph algorithm based methods (Pramod Kumar et al. 2010b; Triesch and Malsburg 1996a, 1998, 2001), 3D model based methods (Athitsos and Sclaroff 2003; Ueda et al. 2003; Yin and Xie 2003; Lee and Kunii 1995), Statistical and syntactic methods (Chen et al. 2008; Wang and Tung 2008), and Eigen space based methods (Patwardhan and Roy 2007; Daniel et al. 2010)).

Eng-Jon et al. proposed an unsupervised algorithm for hand detection and static hand shape recognition (Eng-Jon and Bowden 2004). A tree of boosted (hand) detectors is reported with two layers; the top layer for hand detection and the branches in the bottom layer for hand shape classification. A shape context based distance matrix is utilized for clustering similar looking hand shapes in order to construct the tree structure. The algorithm provided good detection and recognition accuracy.

A systematic approach to building a hand appearance detector is presented in Kolsch and Turk (2004). The paper proposes a view specific hand posture detection algorithm based on the object recognition method proposed by Viola and Jones. A frequency analysis based method is utilized for instantaneous estimation of class separability, without the need for training. The algorithm is applied for the detection of six hand postures. Wu et al. proposed an algorithm for view independent hand posture recognition (Wu and Huang 2000). The suitability of a number of classification methods is investigated to make the algorithm view independent. The work combined supervised and unsupervised learning paradigms to propose a learning approach called Discriminant-EM (D-EM). The D-EM uses an unlabeled dataset to help supervised learning to reduce the number of labeled training samples. The image datasets utilized to test the above algorithms have simple (uniform) and relatively similar backgrounds, and these works did not address the issues with complex backgrounds (which contain clutter and other distracting objects).

An algorithm for the recognition of hand postures in complex natural environments is useful for the real-world applications of interactive systems. Triesch and Malsburg (1996a, 2001) addressed the complex background problem in hand posture recognition using elastic graph matching (EGM). The bunch graph method (Triesch and Malsburg 1996a) is utilized to improve the performance in complex environments. In graph algorithms, the entire image is scanned to detect the object, which increases the computational burden1. In addition, in a bunch graph each node is represented using a bunch of node features, which further decreases the processing speed. Athitsos et al. proposed another algorithm to estimate hand pose from cluttered images (Athitsos and Sclaroff 2003). The algorithm segments the image using skin color, and it needs fairly accurate estimates of the center and the size of the hand. The above algorithms cannot deal with complex backgrounds which contain skin colored regions, or with large variations in hand size.

1.2 The Proposed Approach

This paper focuses on the detection and recognition of hand postures in cluttered natural environments. The proposed algorithm utilizes a biologically inspired approach, which is based on the computational model of visual cortex (Serre et al. 2007) and the Bayesian model of visual attention (Chikkerur et al. 2010). The Bayesian theory of attention is utilized to detect and identify the hand region in complex background images. The where information is extracted using feature based visual attention. The features utilized are based on shape, texture and color.

The shape and texture based features are extracted from a map which represents the similarity of pixels to human skin color, using the computational model of the ventral stream of visual cortex (Serre et al. 2007). The color features are extracted by the discretization of chrominance color components in HSI and YCbCr color spaces, and the similarity to skin map. A saliency map is created by calculating the posterior probabilities of pixel locations to be part of a hand region, using the Bayesian model of attention. The presence of hand is detected by thresholding the saliency map, and the hand region is extracted by the segmentation of input image using the thresholded saliency map. The hand postures are recognized using shape and texture based features (of the hand region), with a Support Vector Machines (SVM) classifier.

The proposed algorithm is reliable against complex and skin colored backgrounds as the segmentation of hand region is done using the attention mechanism, which utilizes a combination of color, shape, and texture features. The experimental results show that the algorithm has a person independent performance. The proposed algorithm has robustness against variations in hand size and its position in the image.

The number of hand posture databases available to the research community is limited (Triesch and Malsburg 1996b). This paper contributes a new 10 class complex background hand posture dataset, namely NUS hand posture dataset-II.2 The postures are obtained from 40 subjects, with various hand sizes, and with different ethnicities. The images have a variety of indoor as well as outdoor complex backgrounds. The hand postures have wide intra class variations in hand sizes and appearances. The database also contains a set of background images which is used to test the hand detection capability of the proposed algorithm. The recognition algorithm is tested with the new dataset using a 10 fold cross validation strategy. It provided an accuracy of 94.36 %.

Section 2 of this paper explains the shape and texture based feature extraction system, and the Bayesian model of attention. The proposed system is presented in Sect. 3. Section 4 discusses the experimental results. The paper is concluded with the remarks provided in Sect. 5.

2 The Feature Extraction System and the Model of Attention

The biologically inspired, shape and texture based feature extraction system, and the Bayesian model of visual attention (Chikkerur et al. 2010) are briefed in this section. The shape and texture based features are extracted using a cortex like mechanism (Serre et al. 2007), and the visual attention is implemented using these features and a set of color features.

2.1 Biologically Inspired Features for Visual Pattern Recognition

The choice of image features for pattern recognition is an ongoing research topic in computer vision. Hubel and Wiesel discovered the organization of receptive fields, and the properties of simple and complex cells, in the cat's primary visual cortex (Wiesel and Hubel 1962). The cortical simple cell receptive fields are modeled (Jones and Palmer 1987) using a Gabor filter (Gabor wavelet) (1), (2).
$$G_{\theta}(x, y) = \exp\!\left(-\frac{x_0^2 + \gamma^2 y_0^2}{2\sigma^2}\right)\cos\!\left(\frac{2\pi}{\lambda}x_0\right)$$
(1)
$$x_0 = x\cos\theta + y\sin\theta, \qquad y_0 = -x\sin\theta + y\cos\theta$$
(2)
where,
γ: the spatial aspect ratio of the Gaussian function,
σ: the standard deviation of the Gaussian function,
λ: the wavelength of the sinusoidal term,
θ: the orientation of the Gaussian from the x-axis.
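
For illustration, a minimal sketch of an S1-style Gabor filter bank built directly from (1) and (2) is given below. The aspect ratio, the wavelength-to-size and sigma-to-size ratios, and the chosen filter sizes are illustrative assumptions of this sketch, not the exact parameter values used in the paper.

```python
import numpy as np

def gabor_filter(size, wavelength, orientation, sigma, gamma=0.3):
    """Build one S1 Gabor filter following Eqs. (1)-(2); parameter values are illustrative."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # Rotate the coordinate system by the filter orientation theta
    x0 = x * np.cos(orientation) + y * np.sin(orientation)
    y0 = -x * np.sin(orientation) + y * np.cos(orientation)
    g = np.exp(-(x0**2 + (gamma * y0)**2) / (2.0 * sigma**2)) * np.cos(2.0 * np.pi * x0 / wavelength)
    g[x**2 + y**2 > half**2] = 0.0          # restrict the filter to a circular support
    return g / (np.linalg.norm(g) + 1e-12)  # unit-norm filters are a common choice

# A small bank: 4 orientations (0, 45, 90, 135 degrees) at a few illustrative filter sizes
orientations = np.deg2rad([0, 45, 90, 135])
bank = [gabor_filter(s, wavelength=0.8 * s, orientation=th, sigma=0.36 * s)
        for s in (7, 9, 11) for th in orientations]
```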

Gabor wavelet based features have good discriminative power among various image textures and shapes. Riesenhuber and Poggio extended the use of 2D Gabor wavelets to propose a hierarchical model of the ventral visual object-processing stream in the visual cortex (Poggio and Riesenhuber 1999). Serre et al. implemented a computational model of the system, and utilized it for robust object recognition (Serre et al. 2005, 2007). The features extracted by this model are known as the C1 and C2 standard model features (SMFs). These features are scale and position tolerant, and the number of extracted features is independent of the input image size. The C2 SMFs were later used for handwriting recognition (Van der Zant et al. 2008) and face recognition (Lai and Wang 2008; Pramod Kumar et al. 2010a). The proposed algorithm utilizes the C2 features for multi-class hand posture recognition.

2.1.1 Extraction of Shape and Texture Features

The feature extraction system consists of four layers (Table 1). Layer 1 (S1) consists of a battery of Gabor filters with 4 orientations (0, 45, 90, 135) and 16 sizes (divided into 8 bands). The S1 layer imitates the simple cells in the primary visual cortex (V1),3 detecting edges and bars. Layer 2 (C1) models the complex cells in V1, by applying a MAX operator locally (over different scales and positions) to the first layer’s outputs.4 This operation provides tolerance to different object projection sizes, positions, and rotations in the 2-D plane of the visual field. In layer 3 (S2), radial basis functions (RBFs) are utilized to imitate the visual area V4 and posterior inferotemporal (PIT) cortex. Layer 3 aids shape and texture recognition by comparing the C1 images with prototypical C1 image patches. The prototypical C1 image patches (the prototype patches) are learned and stored during the training (in humans, these patches correspond to learned patterns of previously seen visual images and are stored in the synaptic weights of the neural cells). Finally, the fourth layer (C2) applies a MAX operator (over all scales, but not over positions) to the outputs of layer S2, resulting in a representation that expresses the similarities with the prototype patches. The outputs of layer 4 are C2 response matrices, which are the shape and texture based features utilized in the attention model.
Table 1
Different layers in the shape and texture feature extraction system

Layer | Process                 | Represents
S1    | Gabor filtering         | simple cells in the primary visual cortex (V1)
C1    | Local pooling           | complex cells in the primary visual cortex (V1)
S2    | Radial basis functions  | visual area V4 & posterior inferotemporal cortex
C2    | Global pooling          | inferotemporal cortex

Note: S stands for simple cells and C stands for complex cells. The simple and complex cells are the two types of cells in the visual cortex. The simple cells primarily respond to oriented edges and bars. The complex cells provide spatial invariance

Figure 2 shows an overview of the shape and texture based feature extraction system. Simple cells in the RBF stage (third layer, S2) combine bars and edges in the image into more complex shapes. RBFs are real valued functions that compare the distance between an input signal and a prototype signal (Bishop 1995). Each S2 unit response depends in a Gaussian-like manner on the Euclidean distance between crops of the C1 image (Xi) and the stored prototype patch (Pj). The prototype patches (centers of the RBF units) of different sizes are extracted from the C1 responses of the training images. The centers of the patches are positioned at the geometrically significant and textured positions of the hand postures (Fig. 2). Each patch contains the four orientations. The third layer compares these patches by calculating the summed Euclidean distance between the patch (Pj) and every possible crop (Xi) of the C1 image (combining all orientations). This comparison is done with all the C1 responses in the second layer (the C1 responses at different scales).
Fig. 2

Extraction of the shape and texture based features (C2 response matrices). The S1 and C1 responses are generated from a skin similarity map (Sect. 3.1.2) of the input image. The prototype patches of different sizes are extracted from the C1 responses of the training images. 15 patches, each with four patch sizes, are extracted from each of the 10 classes leading to a total of 600 prototype patches. The centers of the patches are placed at the geometrically significant and textured positions of the hand postures (as shown in the sample hand posture). There are 600 C2 response matrices, one corresponding to each prototype patch. Each C2 response depends in a Gaussian-like manner on the Euclidean distance between crops of the C1 response of the input image and the corresponding prototype patch

The scale invariant C2 responses are computed by taking a global maximum over all the scales for each S2 type. Each C2 response matrix corresponds to a particular prototype patch with a specific patch size. Classification accuracy generally improves with the number of extracted features; however, the computational burden (for feature extraction as well as classification) also increases with the number of features. In the present work 15 prototype patches with 4 patch sizes are extracted from each of the 10 classes. The total number of patches5 is 600, leading to 600 shape and texture based features.
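
A minimal sketch of how one S2 response map and the corresponding C2 feature could be computed for a single prototype patch is given below. The inputs c1_maps (a list of C1 maps over scales, each of shape H×W×4 orientations) and patch, as well as the tuning parameter beta, are assumptions of this sketch rather than quantities specified in the paper.

```python
import numpy as np

def s2_response(c1_map, patch, beta=1.0):
    """Gaussian RBF response of every patch-sized crop of one C1 map (H x W x 4)
    to a stored prototype patch (p x p x 4). Returns a 2-D response map."""
    p = patch.shape[0]
    H, W, _ = c1_map.shape
    out = np.zeros((H - p + 1, W - p + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            crop = c1_map[i:i + p, j:j + p, :]
            d2 = np.sum((crop - patch) ** 2)   # squared Euclidean distance over all orientations
            out[i, j] = np.exp(-beta * d2)     # Gaussian-like tuning around the prototype
    return out

def c2_feature(c1_maps, patch):
    """C2 response matrix: max over scales with position preserved (Sect. 2.1.2),
    plus the scalar C2 SMF used for classification: max over positions as well."""
    responses = [s2_response(m, patch) for m in c1_maps]
    h = min(r.shape[0] for r in responses)
    w = min(r.shape[1] for r in responses)
    c2_matrix = np.max(np.stack([r[:h, :w] for r in responses]), axis=0)
    return c2_matrix, c2_matrix.max()
```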

2.1.2 Modifications in the Shape and Texture Feature Extraction System

The major differences between the proposed shape and texture features (utilized in the attention system) and that presented in Serre et al. (2007) are summarized as follows.
  1. The shape and texture features are extracted from a similarity to skin map (Sect. 3.1.2), not from the grey scale image.
  2. The prototype patches are extracted from different class images. The centers of the patches are placed at geometrically significant and textured positions of the hand postures.
  3. The output feature component is a C2 response matrix (instead of a real number) which retains the hand location information.
The parameters of the shape-texture feature extraction system are those reported in Serre et al. (2007) except for the number (600) and position of the prototype patches.

2.2 Feature Based Visual Attention

Visual attention is the part of the inference process that addresses the visual recognition problem of what is where (Chikkerur et al. 2010). Visual attention helps to infer the identity and position of objects in a visual scene. Attention reduces the size of the search space and the computational complexity of recognition.

The visual attention is directed selectively to objects in a scene using both bottom-up, image-based saliency cues and top-down, task-dependent cues. The top-down task based attention is more deliberate and powerful (Itti and Koch 2001), and depends on the features of the object. The proposed pattern recognition system is task based and it utilizes a top-down approach. The attention is focussed on the region of interest using the object features.

Visual perception is interpreted as a Bayesian inference process whereby priors (top-down) help to disambiguate noisy sensory input signals (bottom-up) (Dayan et al. 1995). Visual recognition corresponds to estimating posterior probabilities of visual features for specific object categories, and their locations in an image. The posterior probabilities of location variables serve as a saliency map. A Bayesian model of spatial attention is proposed in Rao (2005). Chikkerur et al. (2010) modified the model to include feature based attention, in addition to the spatial attention. The model imitates the interactions between the parietal and ventral streams of visual cortex, using a Bayesian network (Bayes net).

Figure 3 shows the two types of visual attention, spatial attention and feature attention. The present work utilizes feature attention (with different feature priors) to create the saliency map. The location priors are set to be uniform, as the hand can be randomly positioned in the image. The visual attention model is developed utilizing the shape and texture based features, and the color features. The saliency map (the posterior probabilities of pixel locations) is generated using the learned feature priors and evidences from the images. The hand region is segmented by thresholding the saliency map. The complex backgrounds of the images considered contain skin colored pixels; due to this, the utilization of color based features alone is not effective. However, when the color features are combined with shape and texture features, the hand region is identified better than with shape and texture features alone.
Fig. 3

Two types of visual attention as per the Bayesian model (Rao 2005; Chikkerur et al. 2010). Spatial attention utilizes different priors for locations and helps to focus attention on the location of interest. Spatial attention reduces uncertainty in shape. Feature attention utilizes different priors for features and helps to focus attention on the features of interest. Feature attention reduces uncertainty in location. The output of the feature detector (with location information) serve as the bottom-up evidence in both spatial and feature attention. Feature attention with uniform location priors is utilized in the proposed hand posture recognition system, as the hand can be randomly positioned in the image

3 Hand Posture Detection, Segmentation and Recognition

The proposed algorithm addresses the complex image background issue by utilizing a combination of different features. The hand region in image is identified by calculating the joint posterior probability of the feature combination. Bayesian inference is utilized to create a saliency map, which helps in the segmentation of the hand region.

In Bayesian inference, the likelihood of a particular state of the world being true is calculated based on the present input and the prior knowledge about the world. The significance of an input is decided based on the prior experience. In images with complex backgrounds, the shape patterns emerging from the background affect the pattern recognition task negatively. To recognize a pattern from a complex background image, the features corresponding to the foreground object are given higher weightage compared to that corresponding to the background.

In this work, the shape and texture features of the hand postures, and the color features of the human skin, are utilized to focus attention on the hand region. The posterior probability of a pixel location being part of a hand region is calculated by assigning higher priors (which are learned from the training images) to the features corresponding to the hand area. The hand postures are detected and segmented by thresholding the posterior probabilities. Classification of hand postures is done using the shape and texture features of the hand region, with an SVM classifier.

Figure 4 shows the block diagram of the proposed system. The functions of different blocks in the system are elaborated in the following subsections.
Fig. 4

The proposed attention based hand posture recognition system

3.1 Image Pre-processing

The image pre-processing includes color space conversions and the generation of similarity to skin map.

3.1.1 Color Space Conversions—RGB to HSI and YCbCr

The input image in RGB space is converted to the HSI and YCbCr color spaces. The conversion between RGB and HSI is nonlinear whereas that between RGB and YCbCr is linear. The chrominance components in these color spaces (H, S, Cb, and Cr) are utilized to detect the hand region in images.6 The hue value H refers to the color type (such as red, blue, or yellow), and the saturation value S refers to the vibrancy or purity of the color. The values of Cb and Cr represent the blue difference (B − Y) and the red difference (R − Y) components (Chaves-González et al. 2010) respectively (Y stands for the luminance value). The values of H and S are in the range [0, 1], and those of Cb and Cr are within [16, 240].
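
A sketch of the two conversions is given below, assuming RGB inputs scaled to [0, 1]. The HSI hue/saturation formulas and the BT.601 YCbCr matrix are standard textbook definitions and may differ in detail from the authors' implementation.

```python
import numpy as np

def rgb_to_hs(rgb):
    """Hue and saturation (HSI definition) of an RGB image with values in [0, 1]."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    num = 0.5 * ((r - g) + (r - b))
    den = np.sqrt((r - g) ** 2 + (r - b) * (g - b)) + 1e-12
    theta = np.arccos(np.clip(num / den, -1.0, 1.0))
    h = np.where(b <= g, theta, 2.0 * np.pi - theta) / (2.0 * np.pi)  # normalized to [0, 1]
    i = rgb.mean(axis=-1) + 1e-12                                     # intensity I = (R+G+B)/3
    s = 1.0 - rgb.min(axis=-1) / i
    return h, s

def rgb_to_cbcr(rgb):
    """Cb and Cr (BT.601, range roughly [16, 240]) from RGB values in [0, 1]."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    cb = 128.0 - 37.797 * r - 74.203 * g + 112.0 * b
    cr = 128.0 + 112.0 * r - 93.786 * g - 18.214 * b
    return cb, cr
```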

3.1.2 Similarity to Skin Map

A skin similarity map (3) is created using the similarity of each pixel in HSI space to the average pixel values of the skin color.
$$S_{skin} = 1 - \sqrt{(H - H_{s0})^2 + \left(\frac{H_{s\max} - H_{s\min}}{S_{s\max} - S_{s\min}}\right)^2 (S - S_{s0})^2}$$
(3)
where,
S_skin: the similarity of the pixel to skin color,
H & S: the hue and saturation values of the pixel,
H_s0 & S_s0: the average hue and saturation values of the skin colors,
H_smax & H_smin: the maximum and minimum of the hue values of the skin colors, and,
S_smax & S_smin: the maximum and minimum of the saturation values of the skin colors.

The average hue and saturation values are calculated by considering 10 skin colored pixel values of all the subjects.7 The values of different parameters in (3), obtained from the present study, are provided in Table 2. The hue value span (0.1770) is smaller than that of saturation (0.5692). The coefficient of the saturation term ((S − Ss0)²) in (3) is a scaling factor to compensate for this difference in span.
Table 2
Skin color parameters

Hs0    | Ss0    | Hsmax  | Hsmin  | Ssmax  | Ssmin  | Hue span (Hsmax − Hsmin) | Saturation span (Ssmax − Ssmin)
0.1073 | 0.3515 | 0.1892 | 0.0122 | 0.6250 | 0.0558 | 0.1770                   | 0.5692
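
The following sketch computes a similarity to skin map from the hue and saturation channels using the Table 2 parameters. The exact form and normalization of (3) used here are assumptions of this sketch: a span-scaled distance from the average skin color, inverted so that skin-like pixels take high values.

```python
import numpy as np

# Skin color parameters from Table 2
H_S0, S_S0 = 0.1073, 0.3515
HUE_SPAN, SAT_SPAN = 0.1770, 0.5692   # Hsmax - Hsmin, Ssmax - Ssmin

def skin_similarity(h, s):
    """Similarity-to-skin map: distance of each pixel's (H, S) from the average skin
    color, with the saturation term rescaled by the span ratio as described for Eq. (3).
    The exact normalization is an assumption of this sketch."""
    scale = HUE_SPAN / SAT_SPAN
    d = np.sqrt((h - H_S0) ** 2 + (scale * (s - S_S0)) ** 2)
    return np.clip(1.0 - d / (d.max() + 1e-12), 0.0, 1.0)   # higher value = more skin-like
```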

The similarity to skin map enhances the edges and shapes within the skin colored regions in the images, while preserving the textures (Fig. 5). The proposed system extracts the shape and texture based features of hand postures from the skin similarity map. The feature extraction system detects and learns the edges and bars (at different orientations), and the textures in images. The utilization of the skin similarity map enhances the capability of the system to detect the hand region in complex background images.
Fig. 5

Sample hand posture images (column 1: RGB, column 2: grayscale) with the corresponding skin similarity map (column 3). The skin similarity map enhances the edges and shapes of the hand postures. The marked regions in column 3 have better-defined hand edges than the corresponding regions in columns 1 and 2. The edges and bars of the non-skin colored areas are diminished in the skin similarity map (column 3). However the edges corresponding to the skin colored non-hand region are also enhanced (row 2, column 3). The proposed algorithm utilizes the shape and texture patterns of the hand region (in addition to the color features) to address this issue (Color figure online)

3.2 Extraction of Color, Shape and Texture Features

The proposed algorithm utilizes a combination of low and high level image features to develop the visual attention model. A saliency map is generated and the attention is focussed on the hand region using color features (low level) and shape-texture features (high level). The postures are classified using the high level features.

3.2.1 Color based Features

Jones and Rehg proposed a statistical color model for the detection of skin regions in images (Jones and Rehg 1999). They suggested that skin color can be a powerful cue for detecting people in unconstrained imagery. However, their skin color detection used only the RGB color space. Analysis and comparison of the usage of different color spaces for skin segmentation are provided in Phung et al. (2005) and Chaves-González et al. (2010). Chaves-González et al. (2010) rate the HSI space as the best choice for skin segmentation. The Cb and Cr components in the YCbCr space provided better results (compared to the HS and RG components) in the experiments conducted in Phung et al. (2005). A combination of the HSI and YCbCr color spaces can improve the segmentation of skin colored regions in images. The proposed algorithm utilizes a combination of the chrominance color components (H, S, Cb, and Cr) of these color spaces as the color features for hand region segmentation.

The proposed algorithm generates the skin similarity map using the average skin color components (Hs0 and Ss0 in Table 2). The skin colors and the corresponding component values vary about these mean values. Figure 6 shows six skin samples which have inter and intra ethnic variations in skin color. The average chrominance component values of these samples are provided in Table 3. The H, Cb, and Cr components vary by approximately 10 %, whereas the S component varies by about 50 %. In order to detect different skin colors in spite of these variations, the proposed algorithm considers different ranges of color components.
Fig. 6

Skin samples showing the inter and intra ethnic variations in skin color. Table 3 provides the average H, S, Cb, and Cr values of the six skin samples (Color figure online)

Table 3
Average H, S, Cb, and Cr values of the six skin samples in Fig. 6

Sample          | H      | S      | Cb    | Cr
Turkish         | 0.1859 | 0.4118 | 111.6 | 150.0
German          | 0.1207 | 0.5733 | 118.1 | 156.4
Burmese Chinese | 0.1106 | 0.4529 | 101.9 | 141.6
Chinese         | 0.1830 | 0.0954 | 127.8 | 129.8
Indian 1        | 0.0493 | 0.5894 | 112.8 | 148.0
Indian 2        | 0.0572 | 0.3816 | 107.7 | 152.4

Note: The bolded figures represent the maximum and minimum values in each column

Color features utilized in the proposed algorithm are the discretized values of the chrominance color components, and the similarity to skin map (Table 4). The color features (with shape and texture features) are utilized to calculate the joint posterior probability of pixel locations being part of a hand region. The values of H, S, Cb, Cr, and Sskin fall within the range [0, 1] (values of Cb and Cr are normalized). Each of these components is quantized into 10 intervals. The range of the ith interval is given by [(i−1)/10, i/10], where i=1,2,…,10. The features extracted are named as in Table 4. For example, hue values between 0 and 0.1 form a feature named H1. In total there are 50 color features (Table 4). The prior probabilities for the presence of these features are calculated by counting the frequency of occurrence of the features in the skin colored hand area in images.8 The features with maximum priors are H1, H2, S4, Cb5, Cr6, and Sskin10. The color features are common to all the hand postures (the position and frequency of the features may vary, however). Due to this, the color features are utilized only for focussing the attention on the hand, and not for the interclass discrimination of hand postures.
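
A sketch of the color feature computation described above is given below: each chrominance channel and the skin similarity map are discretized into 10 bins, and the feature priors are estimated by counting occurrences inside the hand region of the training images. The function and variable names are hypothetical.

```python
import numpy as np

def color_feature_maps(h, s, cb, cr, s_skin, n_bins=10):
    """Binary presence maps for the 50 discretized color features of Table 4.
    All inputs are 2-D arrays scaled to [0, 1] (Cb and Cr normalized beforehand)."""
    maps = {}
    for name, chan in (('H', h), ('S', s), ('Cb', cb), ('Cr', cr), ('Sskin', s_skin)):
        idx = np.clip((chan * n_bins).astype(int), 0, n_bins - 1)
        for k in range(n_bins):
            maps[f'{name}{k + 1}'] = (idx == k)
    return maps

def color_priors(training_maps, hand_masks):
    """Estimate P(Fc_j | O) as the fraction of training images in which each feature
    occurs inside the (manually selected) hand region."""
    counts = {k: 0 for k in training_maps[0]}
    for maps, mask in zip(training_maps, hand_masks):
        for k, m in maps.items():
            counts[k] += int(np.any(m & mask))   # at most one count per image
    n = len(training_maps)
    return {k: c / n for k, c in counts.items()}
```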
Table 4
Color based features. (a) Chrominance component features: discretized chrominance color components (H, S, Cb, Cr) in HSI and YCbCr color spaces. Each component is discretized into 10 subcomponents as shown. (b) Skin similarity features: discretized skin color similarity (3) values

(a)
H   | S   | Cb   | Cr   | Ranges
H1  | S1  | Cb1  | Cr1  | 0.0–0.1
H2  | S2  | Cb2  | Cr2  | 0.1–0.2
H3  | S3  | Cb3  | Cr3  | 0.2–0.3
H4  | S4  | Cb4  | Cr4  | 0.3–0.4
H5  | S5  | Cb5  | Cr5  | 0.4–0.5
H6  | S6  | Cb6  | Cr6  | 0.5–0.6
H7  | S7  | Cb7  | Cr7  | 0.6–0.7
H8  | S8  | Cb8  | Cr8  | 0.7–0.8
H9  | S9  | Cb9  | Cr9  | 0.8–0.9
H10 | S10 | Cb10 | Cr10 | 0.9–1.0

(b)
Sskin   | Ranges
Sskin1  | 0.0–0.1
Sskin2  | 0.1–0.2
Sskin3  | 0.2–0.3
Sskin4  | 0.3–0.4
Sskin5  | 0.4–0.5
Sskin6  | 0.5–0.6
Sskin7  | 0.6–0.7
Sskin8  | 0.7–0.8
Sskin9  | 0.8–0.9
Sskin10 | 0.9–1.0

3.2.2 Shape and Texture based Features

The shape and texture descriptors (Sect. 2.1.1) are extracted from the similarity to skin map. In the proposed algorithm, 15 prototype patches with 4 patch sizes are extracted from each of the 10 classes. The 15 patches are from 15 different images of the same class (experiments were also run using patches extracted from a single image per class, but better accuracy is achieved in the former case as it provides better invariance). Each extracted patch contains the four orientations (0, 45, 90, 135). The total number of patches is 600, leading to 600 shape and texture based features.
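
As a sketch, prototype patch extraction could look like the following. The keypoint positions (the geometrically significant, textured positions of the hand) and the concrete patch sizes are assumptions of this sketch, not values taken from the paper.

```python
import numpy as np

PATCH_SIZES = (4, 8, 12, 16)   # four patch sizes; the exact values are an assumption

def extract_prototype_patches(c1_map, keypoints, sizes=PATCH_SIZES):
    """Crop prototype patches (all four orientations) from a C1 response map,
    centered at chosen positions of the hand posture."""
    patches = []
    for (r, c) in keypoints:
        for p in sizes:
            half = p // 2
            crop = c1_map[r - half:r - half + p, c - half:c - half + p, :]
            if crop.shape[:2] == (p, p):        # skip keypoints too close to the border
                patches.append(crop.copy())
    return patches
```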

3.3 Feature Based Visual Attention and Saliency Map Generation

The feature based visual attention is implemented utilizing a combination of low level (color) and high level (shape and texture) features. Figure 7 shows the Bayes net utilized in the proposed system, which is developed based on the model proposed in Chikkerur et al. (2010). The Bayes Net Toolbox (BNT) (Murphy 2003) is utilized to implement the Bayes net. The proposed probabilistic model is given by (4) (the LHS represents the joint probability of the Bayes net shown in Fig. 7 and the RHS represents the probability of hand P(O), the probability of location P(L), and the conditional probabilities corresponding to the nodes in the Bayes net).
$$P(O, L, F^s_{1:N_1}, F^c_{1:N_2}, X^s_{1:N_1}, X^c_{1:N_2}, I) = P(O)\,P(L)\prod_{i=1}^{N_1} P(F^s_i \mid O)\,P(X^s_i \mid F^s_i, L)\prod_{j=1}^{N_2} P(F^c_j \mid O)\,P(X^c_j \mid F^c_j, L)\,P(I \mid X^s_{1:N_1}, X^c_{1:N_2})$$
(4)
$$P(L \mid I) \propto P(L)\sum_{F^s,\,X^s,\,F^c,\,X^c}\prod_{i=1}^{N_1} P(F^s_i \mid O)\,P(X^s_i \mid F^s_i, L)\prod_{j=1}^{N_2} P(F^c_j \mid O)\,P(X^c_j \mid F^c_j, L)\,P(I \mid X^s_{1:N_1}, X^c_{1:N_2})$$
(5)
Fig. 7

Bayes net used in the proposed system. O: the object (hand), L: the location of the hand, I: the image, Fs1 to FsN1: N1 binary random variables that represent the presence or absence of shape and texture features, Fc1 to FcN2: N2 binary random variables that represent the presence or absence of color features, Xs1 to XsN1: the positions of the N1 shape and texture based features, Xc1 to XcN2: the positions of the N2 color based features

The feature-based attention depends on the task-based priors and evidences. The posterior probabilities of locations (5), which serve as a saliency map, are calculated using the top-down priors and bottom-up evidences. The priors and evidences are calculated from the training and testing images respectively (Table 5). A belief propagation algorithm (Pearl 1988) is utilized for the calculation of the posterior probabilities. The evidences from the test images for shape-texture features (P(I/Xsi)) and color features (P(I/Xcj)) are modulated by the preferences for the features (the learned priors), P(Fsi/O) and P(Fcj/O) respectively. The locations of the preferred features can be identified from the posterior probability P(L/I), which represents the saliency map.
Table 5
Description of the conditional probabilities (priors, evidences, and the posterior probability)

Conditional probability | Represents | Calculation
P(Fsi/O) | The top-down shape and texture feature priors; the probability of shape and texture features being present, given the presence of hand | By counting the frequency of occurrence of features* within the training images (maximum one count per image)
P(Fcj/O) | The top-down color feature priors; the probability of color features being present, given the presence of hand | By counting the frequency of occurrence of features (Table 4) within the hand region in the training images (400 images, 1 image per class per subject, are considered)
P(I/Xsi) | Bottom-up evidence for the shape and texture features; provides the likelihood that a particular location is active for the shape and texture features | By the shape and texture feature extraction of test images (Sect. 2.1.1)
P(I/Xcj) | Bottom-up evidence for the color features; provides the likelihood that a particular location is active for the color features | By the color feature extraction of test images (Sect. 3.2.1)
P(L/I) | Posterior probabilities of location, which act as the saliency map | By the belief propagation algorithm (Pearl 1988)

*A feature is present if it is above a threshold value. Otherwise it is absent. In the proposed algorithm, the threshold is set at 75 % of the maximum value of the corresponding feature in the training data
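
The full model computes P(L/I) by belief propagation over the Bayes net of Fig. 7. The sketch below only illustrates the core idea of Table 5, namely that bottom-up evidence maps are modulated by the learned top-down feature priors and combined into a location posterior, under simplifying independence assumptions that are not part of the original formulation.

```python
import numpy as np

def saliency_map(evidence_maps, feature_priors, location_prior=None):
    """Illustrative location posterior: bottom-up evidence maps P(I/X_i) (one 2-D map
    per feature, all the same shape) are weighted by the learned top-down feature
    priors P(F_i/O) and combined multiplicatively in the log domain. This is a
    simplification of the belief propagation used in the paper."""
    shape = next(iter(evidence_maps.values())).shape
    log_post = np.zeros(shape) if location_prior is None else np.log(location_prior + 1e-12)
    for name, ev in evidence_maps.items():
        prior = feature_priors.get(name, 0.0)
        ev = ev / (ev.sum() + 1e-12)                # normalize each evidence map
        log_post += prior * np.log(ev + 1e-12)      # prior-modulated evidence
    post = np.exp(log_post - log_post.max())
    return post / post.sum()
```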

3.4 Hand Segmentation and Classification

The hand region is segmented using the saliency map generated. For the segmentation, a bounding box is created around the most salient (top 30 %) locations in the image. The shape and texture based features of the hand region are extracted next. The same prototype patches selected earlier (Sect. 2.1.1) are utilized for the feature extraction. The C2 SMFs are extracted by taking maximum over positions of the C2 response matrices, similar to that done in Serre et al. (2007). That is, the value of the best match between a stored prototype and the input image is kept and the rest are discarded. An SVM classifier with linear kernel is utilized for the classification.
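
A sketch of the segmentation and classification stage is given below: a bounding box around the top 30 % most salient locations, the position-maximum that turns C2 response matrices into scalar C2 SMFs, and a linear-kernel SVM. The array shapes and helper names are assumptions of this sketch, and scikit-learn is used purely for illustration (the paper's implementation is in MATLAB).

```python
import numpy as np
from sklearn.svm import SVC

def segment_hand(image, saliency, keep=0.30):
    """Bounding box around the most salient (top 30 %) locations of the saliency map."""
    thresh = np.quantile(saliency, 1.0 - keep)
    rows, cols = np.where(saliency >= thresh)
    r0, r1, c0, c1 = rows.min(), rows.max(), cols.min(), cols.max()
    return image[r0:r1 + 1, c0:c1 + 1]

def c2_smf(c2_matrices):
    """C2 SMFs: maximum over positions of each C2 response matrix of the segmented hand
    (c2_matrices is assumed to be the list of 600 response matrices of one image)."""
    return np.array([m.max() for m in c2_matrices])

# Linear-kernel SVM on the 600-dimensional C2 SMF vectors
# X_train: (n_samples, 600), y_train: posture class labels 1..10
clf = SVC(kernel='linear')
# clf.fit(X_train, y_train); predictions = clf.predict(X_test)
```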

Figure 8 provides a pictorial summary of the proposed system, showing the image pre-processing, feature extraction, attention, and classification stages.
Fig. 8

An overview of the attention based hand posture detection and recognition system

4 Experimental Results and Discussion

The proposed algorithm is tested using a 10 class complex background hand posture dataset.

4.1 The Dataset and the Experimental Setup

As the number of available hand posture datasets is limited, a new 10 class dataset namely NUS hand posture dataset-II (Figs. 9, 10) is developed.9 The hand postures were shot in and around the National University of Singapore (NUS), against complex natural backgrounds, with various hand shapes and sizes. The postures were obtained from 40 subjects of various ethnicities. The subjects include both males and females in the age range of 22 to 56 years. The subjects were asked to show the 10 hand postures, 5 times each. They were asked to loosen the hand muscles after each shot, in order to incorporate natural variations in the postures. The dataset consists of 3 subsets (Table 6).
Fig. 9

Sample images from NUS hand posture dataset-II (data subset A), showing posture classes 1 to 10

Fig. 10

Sample images (class 9) from NUS hand posture dataset-II (data subset A), showing the variations in hand posture sizes and appearances

Table 6
Different subsets in NUS hand posture dataset-II

Subset | Contents
A | 2000 hand posture color images (40 subjects, 10 classes, 5 images per class per subject, image size: 160×120) with complex backgrounds
B | 750 hand posture color images (15 subjects, 10 classes, 5 images per class per subject, image size: 320×240) with noises such as the body/face of the posturer or a group of humans in the background
C | 2000 background images without hand postures (used for testing the hand detection capability)

The proposed algorithm is tested in two aspects: hand posture detection and recognition. The hand posture detection capability is tested using data subsets A and C. The hand posture recognition capability is tested using data subsets A and B.

4.2 Hand Posture Detection

The hand postures are detected by thresholding the saliency map. To calculate the detection accuracy, a saliency map is created using the posterior probabilities of locations, for the sets of hand posture and background images. If the posterior probability is above a threshold value, the presence of a hand is detected. Figure 11 shows the Receiver Operating Characteristics (ROC) of the hand detection task (the curve is plotted by decreasing the threshold) for three systems: (a) the system with shape, texture, and color attention, (b) the system with shape and texture attention alone, and (c) the system with color attention alone. On comparison, the system with shape, texture, and color attention provided the best performance.
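
A sketch of how such an ROC curve could be computed is given below, assuming one detection score per image (the maximum of its saliency map) and binary labels for the posture (subset A) and background (subset C) images. The use of scikit-learn's roc_curve is an illustrative choice.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def detection_scores(saliency_maps):
    """One detection score per image: the maximum posterior probability of location."""
    return np.array([s.max() for s in saliency_maps])

def detection_roc(scores, labels):
    """labels: 1 for hand posture images (subset A), 0 for background images (subset C).
    roc_curve sweeps the detection threshold downward, as in Fig. 11."""
    fpr, tpr, thresholds = roc_curve(labels, scores)
    return fpr, tpr, auc(fpr, tpr)
```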
Fig. 11

Receiver Operating Characteristics of the hand detection task. The graph is plotted by decreasing the threshold on the posterior probabilities of locations to be a hand region. Utilization of only shape-texture features provided reasonable detection performance (green) whereas utilization of only color features leads to poor performance (red), due to the presence of skin colored backgrounds. However the algorithm provided the best performance (blue) when the color features are combined with shape-texture features (Color figure online)

4.3 Hand Region Segmentation

Figure 12 shows the segmentation of the hand region using the skin color similarity and the saliency map. The segmentation using skin color similarity performs well when the background does not contain skin colored regions (Fig. 12, column 1). However natural scenes may contain many skin colored objects (more than 70 % of the images in the dataset under consideration have skin colored regions in the background). The segmentation using skin color similarity fails in such cases (Fig. 12, columns 2 and 3). The proposed attention based system succeeded in segmenting images with complex backgrounds, irrespective of whether they contain skin colored regions or not.
Fig. 12

Segmentation of the hand region using the similarity to skin map and the saliency map. Each column shows the segmentation of an image. Row 1 shows the original image, row 2 shows the corresponding similarity to skin map (darker regions represent better similarity) with segmentation by thresholding, row 3 shows the saliency map (only the top 30 % is shown), and row 4 shows the segmentation using the saliency map. The background in image 1 (column 1) does not contain any skin colored area; the segmentation using the skin similarity map succeeds for this image. The backgrounds of images 2 and 3 (columns 2 and 3 respectively) contain skin colored areas. The skin color based segmentation partially succeeds for image 2, and it fails for image 3 (which contains more skin colored background regions than image 2). The segmentation using the saliency map (row 4) succeeds in all 3 cases (Color figure online)

Figure 13 shows 50 sample images (5 from each class) from the dataset and the corresponding saliency maps. The hand regions are segmented using these saliency maps, in a way similar to that shown in Fig. 12.
Fig. 13

Different sample images from the dataset and the corresponding saliency maps. Five sample images from each class are shown. The hand region in an image is segmented using the corresponding saliency map

4.4 Hand Posture Recognition

The proposed hand posture recognition algorithm is tested with 10 fold cross validation10 on data subset A. The recognition accuracies for four cases; (a) with shape, texture, and color based attention, (b) with shape and texture based attention, (c) with color based attention, and (d) without attention, are reported in Table 7. On comparison, the best recognition rate (94.36 %) is achieved when the shape, texture, and color based feature attention is utilized.
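
A sketch of a subject-disjoint 10 fold cross-validation matching the footnoted protocol (each fold holds the 200 images of 4 subjects) is given below. The feature matrix X, labels y, and subject_ids are assumed to be precomputed, and scikit-learn is used purely for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.svm import SVC

def subject_wise_cv(X, y, subject_ids, n_folds=10):
    """10-fold cross-validation with subject-disjoint folds (each fold holds the
    images of 4 of the 40 subjects), so the reported accuracy is person independent."""
    accs = []
    for train_idx, test_idx in GroupKFold(n_splits=n_folds).split(X, y, groups=subject_ids):
        clf = SVC(kernel='linear').fit(X[train_idx], y[train_idx])
        accs.append(clf.score(X[test_idx], y[test_idx]))
    return np.mean(accs)
```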
Table 7
Hand posture recognition accuracies: data subset A

Method | Accuracy (%)
Proposed system: attention using shape, texture, and color features | 94.36
Proposed system: attention using shape and texture features alone | 87.72
Proposed system: attention using color features alone | 81.75
C2 features without attention (Serre et al. 2007) | 75.71
Elastic graph matching (EGM)* (Triesch and Malsburg 2001) | 69.80

*The EGM algorithm in Triesch and Malsburg (2001) is implemented as it is, for the comparative study. Sample divisions utilized to test the proposed approach and the EGM are the same

When the attention is implemented using shape and texture features, the algorithm provides a good improvement in accuracy (87.72 %) compared to that achieved (75.71 %) by the algorithm proposed in Serre et al. (2007). The color feature attention alone provided lower accuracy (81.75 %) than the shape and texture attention. The lower accuracy with color feature attention is due to the skin colored pixels in the background; in addition, the color features are extracted using point processing, whereas the shape-texture features are extracted using neighborhood processing. However, when the color features are combined with the shape and texture features, the best accuracy (94.36 %) is obtained.

Table 7 also shows a comparison of the accuracy provided by the proposed algorithm with that provided by the EGM algorithm (Triesch and Malsburg 2001). The EGM algorithm provided only 69.80 % recognition accuracy in spite of the high computational complexity of graph matching. The EGM algorithm performs poorly when the complex background of the image contains skin colored objects; a majority of the samples misclassified by the EGM algorithm are images with skin colored complex backgrounds. The proposed algorithm is robust to skin colored backgrounds as it utilizes shape and texture patterns together with color features. The shape-texture selectivity of the feature extraction system is improved as the prototype patches are extracted from the geometrically significant and textured positions of the hand postures.

4.5 Performance with Human Skin and Body Parts as Noises

The NUS hand posture dataset-II, subset B is developed to test the recognition capability of the proposed algorithm against backgrounds containing human as a noise. The data subset B contains images with noises like body/face of the posturer or a group of other humans in the background (Fig. 14).
Fig. 14

Sample images from NUS hand posture dataset-II, data subset B. The subset contains images with human skin and body parts as noises

Training of the algorithm is carried out using 200 images (4 subjects) from data subset A and the testing is done using data subset B (Table 6). As the proposed algorithm combines shape-texture features with color features, it could detect the hand region in these images, in spite of the noise due to other skin colored human body parts (arm/face of the posturer or other human in the background). Table 8 shows the recognition accuracy and its comparison with that provided by original C2 features and EGM. The proposed algorithm provided a recognition rate of 93.07 %, better than that provided by the compared methods.
Table 8
Hand posture recognition accuracies: data subset B

Method | Accuracy (%)
Proposed system* | 93.07
C2 features without attention (Serre et al. 2007) | 68.40
Elastic graph matching (EGM) (Triesch and Malsburg 2001) | 62.13

*Attention using shape, texture, and color features. The training is carried out using 200 images from data subset A and testing is done using data subset B

4.6 Comparison of the Recognition Time

Table 9 provides a comparison of the average recognition time of the proposed algorithm with that of the EGM algorithm (image size: 160×120 pixels, implemented on the MATLAB computing platform). The proposed algorithm has a lower recognition time than the EGM algorithm. However the response time of the proposed algorithm (which is limited by the shape and texture feature extraction system) is to be improved further for real-time applications.
Table 9
Comparison of the recognition time

Method | Time
Proposed system | 2.65 s
Elastic graph matching (EGM) (Triesch and Malsburg 2001) | 6.19 s

5 Conclusion

An attention based system is proposed for the recognition of hand postures against complex backgrounds. A combination of high and low level image features is utilized to detect the hand, and to focus the attention on the hand region. A saliency map is generated using Bayesian inference. The postures are classified using the shape and texture based features of the hand region with an SVM classifier. The proposed algorithm is tested with a 10 class complex background dataset, the NUS hand posture dataset-II.

The proposed algorithm has a person independent performance. It provided good hand posture detection and recognition accuracy in spite of variations in hand sizes. The algorithm provided reliable performance against cluttered natural environments including skin colored complex backgrounds. The proposed algorithm is tested with color based attention alone, with shape and texture based attention alone, and with the combination of color, shape, and texture attention. On comparison, the algorithm provided the best recognition accuracy when the combination of color, shape, and texture attention is utilized.

The proposed feature attention based algorithm can be extended for the recognition of dynamic gestures and human body postures in cluttered natural environments. The utilization of color features may not be effective in the case of human body postures due to clothing on the body. However a body posture provides more reliable texture features compared to a hand posture. Another possible future work is the modification of the algorithm to improve its processing speed and reduce its computational complexity.

Footnotes

1. Graph matching is considered to be one of the most complex algorithms in vision based object recognition (Bienenstock and Malsburg 1987). The complexity is due to its combinatorial nature.

2. The dataset is available for free download: http://www.ece.nus.edu.sg/stfpage/elepv/NUS-HandSet/.

3. V1, V2, V3, V4, and V5 are the visual areas in the visual cortex. V1 is the primary visual cortex. V2 to V5 are the secondary visual areas, and are collectively termed as the extrastriate visual cortex.

4. Refer Serre et al. (2007) for further explanation of the S1 and C1 stages (layers 1 and 2).

5. The number of prototype patches and orientations are tunable parameters in the system. Computational complexity increases with these parameters. The reported values provided optimal results (considering the accuracy and computational complexity).

6. The luminance color components are not utilized as these components are sensitive to skin color as well as lighting.

7. The dataset consists of hand postures by 40 subjects, with different ethnic origins.

8. 400 images (1 image per class per subject) are considered. During the training phase the hand area is selected manually.

9. The dataset is available for academic research purposes: http://www.ece.nus.edu.sg/stfpage/elepv/NUS-HandSet/.

10. For cross validation the dataset is divided into 10 subsets, each containing 200 images (the data from 4 subjects).

Acknowledgements

The authors would like to thank Ms. Ma Zin Thu Shein for taking part in the shooting of NUS hand posture dataset-II. The authors also express their appreciation to all 40 subjects who volunteered for the development of the dataset.

Copyright information

© Springer Science+Business Media, LLC 2012